Monday, October 25, 2010

In-Database Analytics

Teradata just announced a new version of its database that allows organizations to compare current and historical data. It hopes to use this functionality to attract small and medium enterprises, which otherwise do not have access to much analytics.

Two things that interest me here:


(a) Another instance of somebody using in-database analytics. I am personally not very sure about this. It is definitely useful, but very soon users will be asking for more advanced, model-driven features.

(b) Another player positioning itself for the SME space. A really useful (and probably winning) solution in this space will, in my opinion, have black-box, semi-customizable advanced modeling modules for different business decisions like targeting, cross-sell, churn, forecasting, etc. I am not sure if in-database is the answer here.


-- datamining_guy

Saturday, October 23, 2010

Participant composition at Hearst Challenge

Here is the latest on the Hearst Challenge. We are up to 400+ participants. An interesting question: will a team from industry take the title, or will academia prevail?

In terms of participation, the split is approximately 85% non-academic and 15% academic. The academic number might be somewhat underestimated, as those who registered with Gmail/Yahoo/Hotmail-type accounts have been tagged as non-academic.

In an interesting development, NUIM is offering extra credit to its students for serious participation in the competition.

Who will win it all?

Watch this space for more.

datamining_guy

Friday, October 22, 2010

IBM's Shopping Spree

SPSS, UNICA, NETEZZA, CLARITY, OPENPAGES... IBM has really built up its analytics arsenal over the past year. It is clearly positioning itself as the one-stop analytics shop and a credible rival to SAS. In fact, with a lot of its recent acquisitions, it probably has a greater spread of offerings than SAS.

I am looking at this in the context of the next big frontier where the analytics marketing war is likely to be fought: the small and medium enterprise space. This is a space where, in my opinion, SAS has a natural disadvantage due to its front-heavy investment requirements, and IBM is clearly positioning itself as the front-runner. SAS should probably lead with JMP in this space rather than its main product.

Another player to watch in this space is Google. With Google Analytics, they have really introduced analytics to a lot of small and medium enterprises and whetted their appetite... are they going to follow up with a comprehensive solution? We will be watching.


-datamining_guy

Tuesday, October 19, 2010

Hearst Challenge Update

The Hearst Challenge is well underway. 330+ teams have signed up, and we expect the number to go up significantly. There is a lot of activity on the discussion forum, and many of the questions are about the final evaluation data set. Something that truly excites me is the global nature of the participation. Here is an approximate breakdown of the participants by country:

Participants by Country

Country                      Participants
United States                165
India                        38
Information Not Available    27
Australia                    16
Canada                       16
United Kingdom               12
Taiwan                       8
New Zealand                  6
Spain                        5
France                       3
Hungary                      3
Netherlands                  3
South Africa                 3
Austria                      2
Germany                      2
Indonesia                    2
Poland                       2
Russian Federation           2
Slovenia                     2
South Korea                  2
American Samoa               1
Bulgaria                     1
Chile                        1
China                        1
Denmark                      1
France, Metropolitan         1
Guatemala                    1
Iran                         1
Israel                       1
Japan                        1
Kuwait                       1
Portugal                     1
Sudan                        1
Sweden                       1
Switzerland                  1
Turkey                       1
Ukraine                      1
Total                        336
-- datamining_guy

Friday, October 15, 2010

Hearst Challenge

The Hearst Challenge finally launched yesterday. It has been fun planning for the event over the last few months.
The entire gang (Pritish, Sunayna, Divya, Rakhi) worked very hard on this and is really excited. Then, of course, there is Anthony from Kaggle, who has been such a great partner. There was a lot of last-minute scrambling as we decided to move to a more powerful server to accommodate the large data set and the likely number of participants. There are already a lot of interesting questions on the bulletin board. Keeping fingers crossed that things continue to go smoothly over the rest of the competition!

--datamining_guy

Wednesday, October 6, 2010

Why does Ensembling Work?

A new and popular practice in the data mining world is ensembling, by which you combine the output of several models to arrive at the final result. To be honest, it is not that new anymore; it has been around for several years, but is only now starting to be deployed on a broader scale.

For those not very familiar with the topic, applications of ensembling can be very straightforward, where you combine scores with some simple measure (like the max, min, or average), or more complex, where the combination is developed using regressions or neural networks. I have also seen applications where decision trees are used to develop a segment-level strategy for ensembling.
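To make those combination options concrete, here is a minimal sketch, assuming scikit-learn and a toy classification problem; the base models and settings are purely illustrative, not a recommendation:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Toy data standing in for a real scoring problem.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, random_state=0)

# Two base models; their scores on the holdout are what we combine.
m1 = LogisticRegression(max_iter=1000).fit(X_train, y_train)
m2 = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_train, y_train)
p1 = m1.predict_proba(X_hold)[:, 1]
p2 = m2.predict_proba(X_hold)[:, 1]

# Simple combinations: average / max / min of the model scores.
p_avg = (p1 + p2) / 2
p_max = np.maximum(p1, p2)
p_min = np.minimum(p1, p2)

# Regression-based combination: fit a model on the base scores to
# learn a weighting. (In practice, fit this on data you will not
# also use for evaluation, to avoid an optimistic read.)
scores = np.column_stack([p1, p2])
combiner = LogisticRegression().fit(scores, y_hold)
p_combined = combiner.predict_proba(scores)[:, 1]
```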

A question that often gets raised in this context is: why does ensembling work? You will very likely be asked this when proposing an ensembling solution to a business partner, since an ensemble is so much harder to interpret. I am not sure I have a complete answer, but intuitively it makes sense: if more models are telling you that somebody is a good prospect, then it is more likely to be true. I often give the movie analogy: if one person tells you that a movie is great, you might or might not like it, depending on your tastes; however, when a few hundred people tell you the same thing, your chance of not liking the movie is much lower. Again, I am not sure there is a full one-to-one correspondence between the workings of ensembling and this analogy, but I find it appeals to people's intuition.
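The crowd intuition also has a simple statistical backbone: if each model's error is roughly unbiased and not perfectly correlated with the others, averaging cancels much of the noise. A quick simulation sketch, with numbers made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
truth = 0.6                    # the quantity every "model" tries to estimate
n_models, n_trials = 100, 10_000

# Each model's prediction = truth + its own independent noise.
preds = truth + rng.normal(0.0, 0.2, size=(n_trials, n_models))

rmse_single = np.sqrt(np.mean((preds[:, 0] - truth) ** 2))
rmse_ensemble = np.sqrt(np.mean((preds.mean(axis=1) - truth) ** 2))
print(rmse_single, rmse_ensemble)  # ~0.20 vs ~0.02: averaging cancels noise
```

Real models share data and assumptions, so their errors are correlated and the gain is far smaller than this idealized case, but the direction of the effect is the same.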

One aspect of ensembling that is not captured in the above analogy is that of complementarity. Do the models complement each other and reinforce each other's strengths, very much like an ensemble of musical instruments in an orchestra, which complement each other's sounds to create a rich experience? Some of this definitely happens, because it is very common for an ensemble of different modeling techniques to work very well together. I have personally seen, and have heard others recount, that an ensemble of a neural net and a logistic regression will often give better results than two different logistic regressions or two different neural networks.
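A quick way to test that claim on your own data is to compare a mixed-technique ensemble against a same-technique pair. Here is a minimal sketch assuming scikit-learn; the model settings are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=3000, n_features=20, random_state=1)

# Mixed-technique ensemble: logistic regression + small neural net,
# with their predicted probabilities averaged ("soft" voting).
mixed = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("nn", MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                             random_state=1)),
    ],
    voting="soft",
)

# Same-technique pair for comparison: two logistic regressions that
# differ only in regularization strength.
same = VotingClassifier(
    estimators=[
        ("lr1", LogisticRegression(C=1.0, max_iter=1000)),
        ("lr2", LogisticRegression(C=0.01, max_iter=1000)),
    ],
    voting="soft",
)

for name, clf in [("mixed LR+NN", mixed), ("two LRs", same)]:
    print(name, cross_val_score(clf, X, y, cv=5).mean())
```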

I would appreciate your perspectives.

-- datamining_guy

Monday, October 4, 2010

Developing a Segmented Model

As a modeler, I have always had to deal with the need to develop and justify a segmented model. There is always a tension between developing a richer single model that leverages all the data, and a suite of models that explains the niches well but uses less data per model and so has stability implications. Over the years I have developed a number of best practices around this.

Then last year Varun and I decided to formalize our findings in the form of an approach. Essentially, segmentation is desirable if one of the following holds:

(a)   When there is a clear business or external-knowledge reason for segmentation
(b)   When data availability/coverage varies across sub-segments
(c)   When the model is over-dependent on certain predictors
(d)   When the relationship of certain key predictors with the target variable is not stable across sub-pockets of the population (see the sketch after this list)
(e)   When it is possible to identify patterns in the error terms of the base model
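Criterion (d) is the one I get asked about most, so here is a minimal sketch of one way to test it, assuming scikit-learn; the splitting variable and data are stand-ins, not our actual setup. Fit the same model on each side of a candidate split and compare coefficient signs:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=4000, n_features=6, random_state=0)

# Candidate splitting variable: here, the sign of the first predictor.
# In practice this would be a business-meaningful attribute.
split = X[:, 0] > 0

# Fit the same model on each sub-pocket and collect coefficients.
coefs = {}
for name, mask in [("pocket A", split), ("pocket B", ~split)]:
    m = LogisticRegression(max_iter=1000).fit(X[mask], y[mask])
    coefs[name] = m.coef_[0]

# Predictors whose direction of impact flips across pockets are
# evidence for segmentation under criterion (d).
flipped = np.sign(coefs["pocket A"]) != np.sign(coefs["pocket B"])
print("flipped predictors:", np.where(flipped)[0])
```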

Based on this, we recommend the following approach:

Step 1:   Any business knowledge or data availability reasons to work on pre-defined segments?
    If YES, go to Step 2
    If NO, go to Step 3
Step 2:   Develop segment-level models [refer scenario (b) in Sec. 2] and go to Step 1
Step 3:   Build an aggregate model and go to Step 4
Step 4:   Any binary predictor whose contribution is very high?
    If YES, go to Step 5
    If NO, go to Step 6
Step 5:   Develop segment-level models [refer scenario (a) in Sec. 2] and go to Step 4
Step 6:   Any predictor across whose classes/cut-off values the direction of impact of the remaining predictors on the target variable flips or changes significantly?
    If YES, go to Step 7
    If NO, go to Step 8
Step 7:   Develop segment-level models [refer scenario (c) in Sec. 2] and go to Step 4

Step 8:   Any patterns (based on a classification tree) in the residuals of the aggregate model? (See the sketch after these steps.)
    If YES, go to Step 9
    If NO, go to STOP
Step 9:   Develop segment-level models [refer scenario (d) in Sec. 2] and go to Step 4

STOP:    No further segmentation needed
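To illustrate the Step 8 check, here is a minimal sketch, assuming scikit-learn and a generic regression problem; the data and settings are stand-ins. Fit the aggregate model, then fit a shallow tree on its residuals; leaves whose mean residual sits far from zero point to candidate segments:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor, export_text

# Toy data standing in for the modeling population.
X, y = make_regression(n_samples=5000, n_features=8, noise=10.0, random_state=0)

# Step 3: build the aggregate model.
base = LinearRegression().fit(X, y)
residuals = y - base.predict(X)

# Step 8: look for patterns in the residuals with a shallow tree.
# Leaves that concentrate large residuals define candidate segments
# for Step 9.
tree = DecisionTreeRegressor(max_depth=2, min_samples_leaf=200, random_state=0)
tree.fit(X, residuals)
print(export_text(tree, feature_names=[f"x{i}" for i in range(X.shape[1])]))

# A crude flag: is any leaf's mean residual far from zero relative
# to the overall residual spread?
leaves = tree.apply(X)
for leaf in np.unique(leaves):
    mean_resid = residuals[leaves == leaf].mean()
    print(leaf, round(mean_resid / residuals.std(), 2))
```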


I look forward to your comments.

--datamining_guy

Friday, October 1, 2010

My 5 favorite books on analytics

Here is a list of my 5 favorite books on analytics/econometrics:

1. Applied Statistics and the SAS Programming Language (5th Edition), Cody and Smith (really helped me keep my first job)

2. Econometric Analysis, William Greene (in my mind, a must for every student of econometrics)

3. Logistic Regression Using the SAS System: Theory and Application, Paul Allison (when it comes to teaching applied work, Paul is the best; his survival analysis book is also great)

4. Web Analytics: An Hour a Day, Avinash Kaushik (web analytics is probably the fastest-growing sub-discipline in terms of job specialization, and Kaushik has a great perspective to share)

5. Data Mining: Practical Machine Learning Tools and Techniques, Second Edition (Morgan Kaufmann Series in Data Management Systems), Witten and Frank (this was a tough one, but I chose it because it is a good first book for understanding the area)



I would love to know your favorites. I will put out a compiled list once I get enough responses.

-- datamining_guy