Wednesday, December 29, 2010

Dating Sites Try Adaptive Matchmaking - Technology Review


Tuesday, December 21, 2010

Using a Computer to Fight Medicare Fraud - WSJ.com

A very interesting article in the WSJ on how data mining techniques are being used to fight Medicare fraud.

The California example in the article is particularly interesting, and very similar to work done in the credit card space.



AOL Buys about.me, the Personal Analytics and Social Content Site


Thursday, December 16, 2010

Hearst Challenge Update

The Hearst Challenge session at the NCDM2010 conference went off very well yesterday! The three finalist teams, MIRACLE (Xiaoshi Lu), One Million Monkeys (Eric Jackson) and A^3 (Aleksey Fadeev, Aleksey Ashkimin and Arthur Abdullin), did an excellent job with their presentations! All finalists were presented with beautifully crafted crystal trophies from Tiffany.

Congratulations to A^3 for winning the grand prize of $25,000!


Here are some details  on the tools and techniques used by the participants:


Looking forward to next year's competition!

--Datamining_guy

Foursquare looks to bolster digital marketing capabilities with data mining | RICG

Looks like Foursquare is getting into the Amazon/Netflix type collaborative filtering based recommendations business:



Foursquare looks to bolster digital marketing capabilities with data mining RICG
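
For readers who have not seen collaborative filtering up close, here is a minimal sketch of the user-based flavor, assuming the only input is a table of check-in counts. The users, venues and counts are invented for illustration, and nothing here is meant to describe what Foursquare (or Amazon or Netflix) actually runs.

```python
from math import sqrt

# Hypothetical check-in counts: user -> {venue: visits}. Invented for illustration.
checkins = {
    "ann": {"cafe": 5, "gym": 1, "bookshop": 4},
    "bob": {"cafe": 4, "bookshop": 5, "cinema": 3},
    "cat": {"gym": 5, "pool": 4},
}

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    shared = set(a) & set(b)
    dot = sum(a[k] * b[k] for k in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def recommend(user, data, top_n=3):
    """Rank venues the user has not visited by similarity-weighted visits of other users."""
    scores = {}
    for other, visits in data.items():
        if other == user:
            continue
        sim = cosine(data[user], visits)
        for venue, count in visits.items():
            if venue not in data[user]:
                scores[venue] = scores.get(venue, 0.0) + sim * count
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(recommend("ann", checkins))  # ['cinema', 'pool']
```

Ann's check-ins look most like Bob's, so Bob's cinema outranks the pool that only Cat visits.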

Friday, December 10, 2010

Civil Liberties and Datamining?

As the use of data mining to make better (or more profitable) policy and business decisions gains ground, an undercurrent of concern about improper use of data is also developing.

I recently posted an article about this on the Analytics Happenings LinkedIn group, which deals with the controversy related to using data mining for medical marketing.

Here is another interesting article from The Constitution Project that echoes some of these concerns.

Basically, there is a call to have civil liberties and privacy law concerns baked into policy-related data mining.


Opposites Agree on Data Mining's Importance and the Need for Controls Security Management

For those working in the lending or insurance industries, the call for these restrictions might not be anything new, as some of them are already in place there.

However, I hope this undercurrent of concern does not slow down or kill the adoption of analytics in newer areas.

Here is the link to the original report:

http://www.constitutionproject.org/pdf/DataMiningPublication.pdf

Thursday, December 2, 2010

Can Crowdsourcing be an alternative to traditional Consulting?

During the course of the Hearst Challenge, several acquaintances have commented on the beauty of the analytics competition business model. By putting up amounts that are significantly less than normal consulting fees, companies can get a large number of people to work on a problem that is of interest to them.

While a lot of this is true, I do see some drawbacks of the competition or crowdsourcing approach:
  • To the extent organizations invest in analytics to gain a competitive advantage, the crowdsourcing approach has the disadvantage that it is harder to keep a secret in a crowd. You might swear the winner to secrecy, but what about the guy who almost won?
  • Data confidentiality is another issue. As a consultant, I have always seen clients being very sensitive about giving others access to their own data. Therefore, they will be very reluctant to post really sensitive or important data in a public forum.
  • Another problem with the crowdsourcing model is that it is a winner(s)-take-all system, putting a lot of risk on the participant. For example, 750 teams participated in the Hearst Challenge and put in 6 weeks of effort, but only 1 will get the $25k prize. Therefore, for someone to be willing to put in that kind of effort, they must be either doing it part time in their spare time, or just starting out. So the majority of full-time participants in these competitions are likely to be students or organizations looking to make a name for themselves. This might limit scalability.
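
To put a rough number on the risk in the last point, here is a back-of-the-envelope calculation using the figures quoted above. It assumes, unrealistically, that every team has the same chance of winning, so treat it as an illustration rather than an estimate.

```python
prize = 25_000  # grand prize in dollars (from the post)
teams = 750     # participating teams (from the post)
weeks = 6       # weeks of effort per team (from the post)

# Assumption: every team is equally likely to win the single prize.
expected_payoff = prize / teams
expected_per_week = expected_payoff / weeks

print(f"Expected payoff per team: ${expected_payoff:.2f}")           # $33.33
print(f"Expected payoff per team-week: ${expected_per_week:.2f}")    # $5.56
```

At roughly $33 per team, the prize money alone cannot be what motivates most participants.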

These drawbacks do not mean that this is not a viable way of solving business problems, just that some more thinking and improvisation might be needed to make it scalable. Otherwise it might remain a niche strategy, but definitely a very enticing one.

--Datamining_guy

Monday, November 29, 2010

Hospital Data Mining Hits Paydirt

Hospital Data Mining Hits Paydirt -- an interesting article on how hospitals are using data mining to identify revenue opportunities. I was very excited to see this, as it is something I have done for other verticals like transportation and logistics. Good to see good analytics practice moving to newer verticals. A truly exciting time to be in analytics!

-- Datamining_guy

Wednesday, November 24, 2010

Results from another datamining competition

 FICO, UCSD Announce Winners of International Predictive Analytics Competition - MarketWatch http://goo.gl/yGIQT


The competition  asked participants to predict future purchases for consumers.


The competition was divided into two categories -- one category utilized raw data, and one category utilized transformed data -- and each category had a Graduate and Undergraduate division. The top three finishers in each category and each division shared $10,000 in cash prizes. The winners were:
Undergraduate Division - Raw Data Category
1st place         Shivam Juneja, Institute of Engineering and Technology, Bhaddal
                  (India)
2nd place         Benjamin Hamner, Duke University (USA)
3rd place         Rohan Anil, Birla Institute of Technology and Science, Pilani
                  (India)
Undergraduate Division - Transformed Data Category
1st place         Benjamin Hamner, Duke University (USA)
2nd place         Harsh Pareek, Chiraag Juvekar and Santosh Ananthakrishnan; Indian
                  Institute of Technology, Bombay (India)
3rd place         Rohan Anil, Birla Institute of Technology and Science, Pilani
                  (India)
Graduate Division - Raw Data Category
1st place         Quan Sun, University of Waikato (New Zealand)
2nd place         Alexey Gorodilov, Moscow Institute of Physics and Technology
                  (Russia)
3rd place         Jianfei Wu, North Dakota State University (USA)
Graduate Division - Transformed Data Category
1st place         Santi Villalba, University College Dublin (Ireland)
2nd place         Jaeyong Lee, Pohang University of Science and Technology (Korea)
3rd place         Ilhwan Ko, Pohang University of Science and Technology (Korea)

--datamining_guy

Tuesday, November 23, 2010

Hearst Challenge Update

We are entering the home stretch of the competition. The last time I checked we had 700 teams registered. Also, after a long time the leaderboard saw some movement at the top, with "alegro" from Ukraine taking the top spot. One of the most exciting things for me has been the rich diversity of the participants. Here is a recent breakdown by country:

Country                     Number of Participants
United States               309
India                       66
Information not available   41
Canada                      33
Australia                   29
United Kingdom              27
Taiwan                      17
China                       11
Hungary                     9
Spain                       8
New Zealand                 7
Brazil                      6
France                      5
Germany                     5
Netherlands                 5
Poland                      5
Russian Federation          5
South Africa                5
Denmark                     4
Israel                      3
Mexico                      3
Slovenia                    3
Sweden                      3
Austria                     2
Chile                       2
France, Metropolitan        2
Indonesia                   2
Iran                        2
South Korea                 2
Turkey                      2
United Arab Emirates        2
Afghanistan                 1
American Samoa              1
Argentina                   1
Bangladesh                  1
Bosnia and Herzegovina      1
Bulgaria                    1
Colombia                    1
Ecuador                     1
Egypt                       1
Finland                     1
Guatemala                   1
Hong Kong                   1
Japan                       1
Kuwait                      1
Luxembourg                  1
Malaysia                    1
Pakistan                    1
Portugal                    1
Romania                     1
Singapore                   1
Sri Lanka                   1
Sudan                       1
Switzerland                 1
Uganda                      1
Ukraine                     1
Viet Nam                    1
Grand Total                 651


A total of 58 countries! Watch this space for more updates.

--Datamining_guy

Wednesday, November 17, 2010

Dynamic Pricing -- The Customer Antidote!

Having worked in the transportation industry before getting into consulting, I have always found the idea of dynamic pricing very appealing. The space on a flight, truck or train is a perishable commodity, and can be priced based on capacity, customers' willingness to pay and market conditions.

Back in 2004 my employer was getting into dynamic pricing, so I spent a fair amount of time understanding PROS and even attended their annual event in Houston. Between them, PROS and SABRE served almost the entire airline industry, supplying its dynamic pricing capability.

Timing of the transaction has a big role to play in any dynamic pricing scheme, and for a long time the buyers of transportation services (especially consumers, but also businesses) have tried to understand how the dynamic pricing scheme works and to outsmart it.

While I have heard of one-off cases of success at this, I had never seen a systematic effort to understand it until now. Check this out: a datamining approach to outsmarting dynamic pricing. It is now integrated with Bing's travel functionality.

Now I am waiting for the reaction from the dynamic pricing engines, and the counter-reaction. Let the games begin!

--Datamining_guy

Friday, November 12, 2010

Top 25 Articles in Economics -- By downloads

A nostalgic list for the Economist in me:


Rank  File Downloads  Journal Article
      (Total)         Author(s)
1     20,316          The Market for 'Lemons': Quality Uncertainty and the Market Mechanism
                      George A. Akerlof
2     18,940          The Pricing of Options and Corporate Liabilities
                      Fischer Black and Myron S. Scholes
3     13,839          Prospect Theory: An Analysis of Decision under Risk
                      Daniel Kahneman and Amos Tversky
4     11,435          Credit Rationing in Markets with Imperfect Information
                      Joseph Stiglitz and Andrew Weiss
5     9,187           Increasing Returns and Long-run Growth
                      Paul Michael Romer
6     9,145           Co-integration and Error Correction: Representation, Estimation, and Testing
                      Robert F. Engle and Clive W. J. Granger
7     8,801           A Contribution to the Empirics of Economic Growth
                      N. Gregory Mankiw, David Romer and David Weil
8     8,569           Theory of Rational Option Pricing
                      Robert C. Merton
9     8,385           Common risk factors in the returns on stocks and bonds
                      Eugene F. Fama and Kenneth French
10    8,053           Agency Problems and the Theory of the Firm
                      Eugene F. Fama
11    7,751           Corruption and Growth
                      Paolo Mauro
12    7,228           The pyramid of corporate social responsibility: Toward the moral management of organizational stakeholders
                      Archie B. Carroll
12    7,228           Efficient Capital Markets: A Review of Theory and Empirical Work
                      Eugene F. Fama
14    7,197           A Theory of the Term Structure of Interest Rates
                      John C. Cox, Jonathan E. Ingersoll and Stephen A. Ross
15    7,176           Endogenous Technological Change
                      Paul Michael Romer
16    6,968           Finance and Growth: Schumpeter Might Be Right
                      Robert King and Ross Levine
17    6,823           Event Studies in Economics and Finance
                      A. Craig MacKinlay
18    6,460           Time to Build and Aggregate Fluctuations
                      Finn E. Kydland and Edward C. Prescott
19    6,387           A Model of Balance-of-Payments Crises
                      Paul Krugman
20    6,380           Expectations and Exchange Rate Dynamics
                      Rudiger Dornbusch
21    6,262           Rules Rather Than Discretion: The Inconsistency of Optimal Plans
                      Finn E. Kydland and Edward C. Prescott
22    6,171           The Cross-Section of Expected Stock Returns
                      Eugene F. Fama and Kenneth French
23    6,148           The Costs and Benefits of Ownership: A Theory of Vertical and Lateral Integration
                      Sanford Jay Grossman and Oliver D. Hart
24    6,115           Production, Information Costs, and Economic Organization
                      Armen A. Alchian and Harold Demsetz
25    6,012           Economic Growth in a Cross Section of Countries
                      Robert J. Barro

JMP 9 Tree Functionality

I have been a heavy user of CART (Salford Systems) for the past 6-7 years and really love it. Today I had the opportunity to see the Tree functionality that comes with JMP 9, and I must say I am very impressed!

While the visuals are not very appealing (why? It contradicts the otherwise visually rich appeal of JMP), the functionality is very good. I particularly loved the ability to prune and shape any node the way I want to and really control the overall tree. I have always maintained that this can be a very dangerous functionality and would not recommend it for novices, but in the hands of an experienced analyst the inferences are going to be so much richer! Love it.

Now waiting for JMP to incorporate the multi-way splits available with Knowledge Seeker (Angoss)!



-- Datamining_guy

Thursday, November 11, 2010

Hearst Challenge Update

Here is the latest on Hearst Challenge:

600+ teams now registered

We have moved the date when the final evaluation dataset will be available to December 1.

There was some controversy last night, when the current leader decided to hang up his boots and posted his code for the world to see. On the one hand, it is true that sharing knowledge leads to the development of superior models, but if you take the thought to the other extreme, it would not be much of a competition if every team were required to share its code and methodology with all the participants. On balance, given the short duration of the competition, I wish the code had not been posted. Anyway, it is water under the bridge and the show goes on :)

-- Datamining_guy

Tuesday, November 9, 2010

The future of Datamining in Medical Diagnosis?

I have long wondered about the relevance of datamining in medical diagnosis. Think about what a doctor does: collects data by examining the patient and through pathological and radiographic tests, and on the basis of these data points makes an inference, or diagnosis, about the ailment afflicting the patient. So much in common with the basic tenets of datamining!

I got a first glimpse of this about 4 years back during the 2006 KDD Cup, sponsored by Siemens, which dealt with pulmonary embolism. The focus of the competition was to assist in automated detection of the disease, with separate problems related to false positives and false negatives. The Holy Grail of the competition was an algorithm to predict with 100% certainty if a patient was healthy!
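
One way to read the "100% certainty" goal is in terms of the usual screening metrics: it amounts to a negative predictive value of 1, meaning nobody the algorithm calls healthy is actually sick. Here is a small sketch of those metrics. The function name and the counts are made up for illustration and are not taken from the competition data.

```python
def screening_metrics(tp, fp, tn, fn):
    """Return basic screening metrics from confusion-matrix counts.

    Assumes every denominator is non-zero (at least one sick patient,
    one healthy patient and one negative prediction).
    """
    return {
        "sensitivity": tp / (tp + fn),  # share of sick patients that were flagged
        "specificity": tn / (tn + fp),  # share of healthy patients that were cleared
        "npv": tn / (tn + fn),          # share of cleared patients that are truly healthy
    }

# Hypothetical counts: 40 sick patients all flagged, 30 healthy patients flagged by mistake.
metrics = screening_metrics(tp=40, fp=30, tn=130, fn=0)
print(metrics)  # {'sensitivity': 1.0, 'specificity': 0.8125, 'npv': 1.0}
```

With zero false negatives the NPV is 1.0 even though 30 healthy patients were flagged, which is why false positives and false negatives are naturally treated as separate problems.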

The 2008 KDD Cup competition was again related to this area, but dealt with breast cancer.


It seems that, as with everything else in analytics, IBM is dabbling in this space as well. Check out this very interesting article in The Atlantic. As the article indicates, a big hurdle here is to get the practitioners to embrace the technology.

-- datamining_guy

Tuesday, November 2, 2010

Hearst Challenge Update

The competition at the Hearst Challenge is heating up, and 530+ teams are now registered. A pack of about 5 teams, representing Australia, the US, the UK and China, is now slightly separated from the rest of the field.

Here are the top 5 as of this afternoon:

In other news, we have started planning for next year's competition!

--datamining_guy

Monday, November 1, 2010

Top 10 things in IT/Analytics related world for 2011!

Here is a very interesting article from Gartner's Symposium earlier this month. Look at numbers 5 and 6 on the list.

Next generation Analytics: basically real-time analytics, where every business function is supported by automated analysis and predictions about the future!

Social Analytics: a broad lumping together of social media and social network related analytics!

I agree with both, and I think a big application would be the union of the two (see, for example, my posting on Foursquare).

One thing not on the list,  but which has a potential for being very big is Speech Analytics.

-- Datamining_guy

Monday, October 25, 2010

In-Database Analytics

Teradata just announced a new version of their database that allows  organizations to compare current and historical data.  They hope to use this functionality to attract Small and Medium Enterprises which otherwise do not  have  access to a lot of analytics.

Two things that interest me here:


(a) Another instance of somebody using in-database analytics. I am personally not very sure about this. It definitely is useful, but very soon users will be asking for more advanced model-driven features.

(b) Another player positioning itself for the SME space. A really useful (and probably winning) solution in this space will, in my opinion, have black-box, semi-customizable advanced modeling modules for different business decisions like targeting, cross-sell, churn, forecasting etc. Not sure if in-database is the answer here.


-- datamining_guy

Saturday, October 23, 2010

Participant composition at Hearst Challenge

Here is the latest on Hearst Challenge. We are up to 400+ participants. An interesting question is whether a team from industry will take the title, or will academia prevail?

In terms of participation, the split is approximately 85% non-academic and 15% academic. The academic number might be somewhat underestimated, as those who have registered with gmail/yahoo/hotmail type accounts have been tagged as non-academic.

In an interesting  development, NUIM is offering extra credit  to its students for  serious participation in the  competition.

Who will win it all?

Watch this space for more.

datamining_guy