Monday, October 25, 2010

In-Database Analytics

Teradata just announced a new version of its database that allows organizations to compare current and historical data. It hopes to use this functionality to attract small and medium enterprises, which otherwise do not have access to much analytics.

Two things that interest me here:


(a) Another instance of somebody using in-database analytics. I am personally not very sure about this. It is definitely useful, but very soon users will be asking for more advanced, model-driven features.

(b) Another player positioning itself for the SME space. A really useful (and probably winning) solution in this space will, in my opinion, have black-box, semi-customizable advanced modeling modules for different business decisions like targeting, cross-sell, churn, forecasting, etc. I am not sure if in-database is the answer here.


-- datamining_guy

Saturday, October 23, 2010

Participant composition at Hearst Challenge

Here is the latest on the Hearst Challenge. We are up to 400+ participants. An interesting question: will a team from industry take the title, or will academia prevail?

In terms of participation, the split is approximately 85% non-academic and 15% academic. The academic number might be somewhat underestimated, as those who registered with Gmail/Yahoo/Hotmail-type accounts have been tagged as non-academic.

In an interesting development, NUIM is offering extra credit to its students for serious participation in the competition.

Who will win it all?

Watch this space for more.

datamining_guy

Friday, October 22, 2010

IBM's Shopping Spree

SPSS, UNICA, NETEZZA, CLARITY, OPENPAGES... IBM has really built up its analytics arsenal over the past year. It is clearly positioning itself as the one-stop analytics shop and a credible rival to SAS. In fact, with a lot of its recent acquisitions, it probably has a greater spread of offerings than SAS.

I am looking at this in the context of the next big frontier where the analytics marketing war is likely to be fought: the small and medium enterprise space. This is a space where, in my opinion, SAS has a natural disadvantage due to its front-heavy investment requirements, and IBM is clearly positioning itself as the front-runner. SAS should probably lead with JMP in this space rather than its main product.

Another player to watch in this space is Google. With Google Analytics, they have really introduced analytics to a lot of small and medium enterprises and whetted their appetite... are they going to follow up with a comprehensive solution? We will be watching.


-datamining_guy

Tuesday, October 19, 2010

Hearst Challenge Update

The Hearst Challenge is well underway. 330+ teams have signed up, and we expect the number to go up significantly. There is a lot of activity on the discussion forum, and many of the questions are about the final evaluation data set. Something that truly excites me is the global nature of the participation. Here is an approximate breakdown of the participants by country:

Participants by Country

Country                      Participants
United States                165
India                        38
Information Not Available    27
Australia                    16
Canada                       16
United Kingdom               12
Taiwan                       8
New Zealand                  6
Spain                        5
France                       3
Hungary                      3
Netherlands                  3
South Africa                 3
Austria                      2
Germany                      2
Indonesia                    2
Poland                       2
Russian Federation           2
Slovenia                     2
South Korea                  2
American Samoa               1
Bulgaria                     1
Chile                        1
China                        1
Denmark                      1
France, Metropolitan         1
Guatemala                    1
Iran                         1
Israel                       1
Japan                        1
Kuwait                       1
Portugal                     1
Sudan                        1
Sweden                       1
Switzerland                  1
Turkey                       1
Ukraine                      1
Total                        336
-- datamining_guy

Friday, October 15, 2010

Hearst Challenge

The Hearst Challenge finally launched yesterday. It has been fun planning for the event over the last few months.
The entire gang (Pritish, Sunayna, Divya, Rakhi) worked very hard on this and is really excited. Then, of course, there is Anthony from Kaggle, who has been such a great partner. There was a lot of last-minute scrambling as we decided to move to a more powerful server to accommodate the large data set and the likely number of participants. There are already a lot of interesting questions on the bulletin board. Keeping fingers crossed that things continue to go smoothly over the rest of the competition!

--datamining_guy

Wednesday, October 6, 2010

Why does Ensembling Work?

A new and popular practice in the data mining world is ensembling, by which you combine the output of several models to arrive at the final result. To be honest, it is not that new anymore; it has been around for several years, but is only now starting to be deployed on a broader scale.

For those not very familiar with the topic, applications of ensembling can be very straightforward, where you combine scores with some simple measure (like the max, min, or average), or more complex, where the combination is developed using regressions or neural networks. I have also seen applications where decision trees are used to develop a segment-level strategy for ensembling.
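To make those combination options concrete, here is a minimal sketch, assuming scikit-learn and a toy classification problem; the base models and settings are purely illustrative, not a recommendation:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Toy data standing in for a real scoring problem.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, random_state=0)

# Two base models; their scores on the holdout are what we combine.
m1 = LogisticRegression(max_iter=1000).fit(X_train, y_train)
m2 = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_train, y_train)
p1 = m1.predict_proba(X_hold)[:, 1]
p2 = m2.predict_proba(X_hold)[:, 1]

# Simple combinations: average / max / min of the model scores.
p_avg = (p1 + p2) / 2
p_max = np.maximum(p1, p2)
p_min = np.minimum(p1, p2)

# Regression-based combination: fit a model on the base scores to
# learn a weighting. (In practice, fit this on data you will not
# also use for evaluation, to avoid an optimistic read.)
scores = np.column_stack([p1, p2])
combiner = LogisticRegression().fit(scores, y_hold)
p_combined = combiner.predict_proba(scores)[:, 1]
```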

A question that often gets raised in this context is: why does ensembling work? You will very likely be asked this when proposing an ensembling solution to a business partner, since an ensemble is so much harder to interpret. I am not sure I have a complete answer, but intuitively it makes sense: if more models are telling you that somebody is a good prospect, then it is more likely to be true. I often give the movie analogy: if one person tells you that a movie is great, you might or might not like it, depending on your tastes; however, when a few hundred people tell you the same thing, your chance of not liking the movie is much lower. Again, I am not sure there is a full one-to-one correspondence between the workings of ensembling and this analogy, but I find it appeals to people's intuition.
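The crowd intuition also has a simple statistical backbone: if each model's error is roughly unbiased and not perfectly correlated with the others, averaging cancels much of the noise. A quick simulation sketch, with numbers made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
truth = 0.6                    # the quantity every "model" tries to estimate
n_models, n_trials = 100, 10_000

# Each model's prediction = truth + its own independent noise.
preds = truth + rng.normal(0.0, 0.2, size=(n_trials, n_models))

rmse_single = np.sqrt(np.mean((preds[:, 0] - truth) ** 2))
rmse_ensemble = np.sqrt(np.mean((preds.mean(axis=1) - truth) ** 2))
print(rmse_single, rmse_ensemble)  # ~0.20 vs ~0.02: averaging cancels noise
```

Real models share data and assumptions, so their errors are correlated and the gain is far smaller than this idealized case, but the direction of the effect is the same.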

One aspect of ensembling that is not captured in the above analogy is that of complementarity. Do the models complement each other and reinforce each other's strengths, very much like an ensemble of musical instruments in an orchestra, which complement each other's sounds to create a rich experience? Some of this definitely happens, because it is very common for an ensemble of different modeling techniques to work very well together. I have personally seen, and have heard others recount, that an ensemble of a neural net and a logistic regression will often give better results than two different logistic regressions or two different neural networks.
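A quick way to test that claim on your own data is to compare a mixed-technique ensemble against a same-technique pair. Here is a minimal sketch assuming scikit-learn; the model settings are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=3000, n_features=20, random_state=1)

# Mixed-technique ensemble: logistic regression + small neural net,
# with their predicted probabilities averaged ("soft" voting).
mixed = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("nn", MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                             random_state=1)),
    ],
    voting="soft",
)

# Same-technique pair for comparison: two logistic regressions that
# differ only in regularization strength.
same = VotingClassifier(
    estimators=[
        ("lr1", LogisticRegression(C=1.0, max_iter=1000)),
        ("lr2", LogisticRegression(C=0.01, max_iter=1000)),
    ],
    voting="soft",
)

for name, clf in [("mixed LR+NN", mixed), ("two LRs", same)]:
    print(name, cross_val_score(clf, X, y, cv=5).mean())
```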

I would appreciate your perspectives.

-- datamining_guy

Monday, October 4, 2010

Developing a Segmented Model

As a modeler, I have always had to deal with the need to develop and justify a segmented model. There is always a tension between developing a richer single model that leverages all the data, and a suite of models that explains the niches well but uses less data per model and so has stability implications. Over the years I have developed a number of best practices around this.

Then last year Varun and I decided to formalize our findings in the form of an approach. Essentially, segmentation is desirable if one of the following holds:

(a)   When there is a clear business or external-knowledge reason for segmentation
(b)   When data availability/coverage varies across sub-segments
(c)   When the model is over-dependent on certain predictors
(d)   When the relationship of certain key predictors with the target variable is not stable across sub-pockets of the population (see the sketch after this list)
(e)   When it is possible to identify patterns in the error terms of the base model
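Criterion (d) is the one I get asked about most, so here is a minimal sketch of one way to test it, assuming scikit-learn; the splitting variable and data are stand-ins, not our actual setup. Fit the same model on each side of a candidate split and compare coefficient signs:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=4000, n_features=6, random_state=0)

# Candidate splitting variable: here, the sign of the first predictor.
# In practice this would be a business-meaningful attribute.
split = X[:, 0] > 0

# Fit the same model on each sub-pocket and collect coefficients.
coefs = {}
for name, mask in [("pocket A", split), ("pocket B", ~split)]:
    m = LogisticRegression(max_iter=1000).fit(X[mask], y[mask])
    coefs[name] = m.coef_[0]

# Predictors whose direction of impact flips across pockets are
# evidence for segmentation under criterion (d).
flipped = np.sign(coefs["pocket A"]) != np.sign(coefs["pocket B"])
print("flipped predictors:", np.where(flipped)[0])
```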

Based on this, we recommend the following approach:

Step 1:   Any business knowledge or data availability reasons to work on pre-defined segments?
    If YES, go to Step 2
    If NO, go to Step 3
Step 2:   Develop segment-level models [refer scenario (b) in Sec. 2] and go to Step 1
Step 3:   Build an aggregate model and go to Step 4
Step 4:   Any binary predictor whose contribution is very high?
    If YES, go to Step 5
    If NO, go to Step 6
Step 5:   Develop segment-level models [refer scenario (a) in Sec. 2] and go to Step 4
Step 6:   Any predictor across whose classes/cut-off values the direction of impact of the remaining predictors on the target variable flips or changes significantly?
    If YES, go to Step 7
    If NO, go to Step 8
Step 7:   Develop segment-level models [refer scenario (c) in Sec. 2] and go to Step 4

Step 8:   Any patterns (based on a classification tree) in the residuals of the aggregate model? (See the sketch after these steps.)
    If YES, go to Step 9
    If NO, go to STOP
Step 9:   Develop segment-level models [refer scenario (d) in Sec. 2] and go to Step 4

STOP:    No further segmentation needed
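To illustrate the Step 8 check, here is a minimal sketch, assuming scikit-learn and a generic regression problem; the data and settings are stand-ins. Fit the aggregate model, then fit a shallow tree on its residuals; leaves whose mean residual sits far from zero point to candidate segments:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor, export_text

# Toy data standing in for the modeling population.
X, y = make_regression(n_samples=5000, n_features=8, noise=10.0, random_state=0)

# Step 3: build the aggregate model.
base = LinearRegression().fit(X, y)
residuals = y - base.predict(X)

# Step 8: look for patterns in the residuals with a shallow tree.
# Leaves that concentrate large residuals define candidate segments
# for Step 9.
tree = DecisionTreeRegressor(max_depth=2, min_samples_leaf=200, random_state=0)
tree.fit(X, residuals)
print(export_text(tree, feature_names=[f"x{i}" for i in range(X.shape[1])]))

# A crude flag: is any leaf's mean residual far from zero relative
# to the overall residual spread?
leaves = tree.apply(X)
for leaf in np.unique(leaves):
    mean_resid = residuals[leaves == leaf].mean()
    print(leaf, round(mean_resid / residuals.std(), 2))
```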


I look forward to your comments.

--datamining_guy

Friday, October 1, 2010

My 5 favorite books on analytics

Here is a list of my 5 favorite books on analytics/econometrics:

1. Applied Statistics and the SAS Programming Language (5th Edition), Cody and Smith (really helped me keep my first job)

2. Econometric Analysis, William Greene (in my mind, a must for every student of econometrics)

3. Logistic Regression Using the SAS System: Theory and Application, Paul Allison (when it comes to teaching applied work, Paul is the best; his survival analysis book is also great)

4. Web Analytics: An Hour a Day, Avinash Kaushik (web analytics is probably the fastest-growing sub-discipline in terms of job specialization, and Kaushik has a great perspective to share)

5. Data Mining: Practical Machine Learning Tools and Techniques, Second Edition (Morgan Kaufmann Series in Data Management Systems), Witten and Frank (this was a tough one, but I chose it because it is a good first book for understanding the area)



I would love to know your favorites. I will put out a compiled list once I get enough responses.

-- datamining_guy