Monday 27 May 2013

Big Fuss About Big Data


One of those LinkedIn articles that make a big fuss about small things is again making exaggerated claims, this time about so-called "big data". Try this excerpt:

The advances in analyzing big data allow us to e.g. decode human DNA in minutes, find cures for cancer, accurately predict human behavior, foil terrorist attacks, pinpoint marketing efforts and prevent disease.

...and enable time travel by recording all possible spin states of all the electrons in the world and restoring them as necessary to restore the world to an earlier state, and create slave robots to serve all our needs and wants, and keep us alive forever and make people fall in love with us and enable total and complete world domination for me and my friends...

Big data has become big business (yes, I occasionally do state the obvious), and this business seems to be driven by promises of magical feats to be achieved by sorting and analyzing electronic information about pretty much everything and anything else. This kind of ingenuous article is just a marketing gimmick to drive the perception that you need specialists in data manipulation: to enable you to understand your world better, and them to enjoy their lives better, through enhanced consumption and investment possibilities.

Data is best handled by the people closest to its source. Without a more organic understanding of what the data represents and of the dynamics of the processes that generate it, any mechanistic analysis according to preset, generalized techniques runs the risk of grossly misinterpreting whatever it tries to interpret (human behavior, seriously?!).

Here's a real-life example. My foes and I were assigned a project to check the validity of a modified CAPM in the Indian equity markets. So we went and downloaded some not-so-small data from a stock exchange website, which claims to provide historical price data for various traded stocks going back to the mid-nineties. Horror of horrors, the daily price data we downloaded had omissions in it: every so often, prices for a few days would be missing from the series. Analyzing this price data would then yield spurious results, which showed up in our analysis as abnormally large values of the test statistic (the t-statistic in our case), making it easy for us to reject the validity of a valuation model for the stocks considered.
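For anyone curious what catching this earlier might have looked like, here is a minimal sketch in Python of the kind of gap check that would have flagged the problem before any analysis. The file name and column names are hypothetical, and exchange holidays will show up as false positives; the point is just to get a quick list of suspicious dates to eyeball.

# Minimal sanity check: flag weekdays with no price record in a downloaded
# daily price series. "stock_prices.csv", "Date" and "Close" are hypothetical
# names; adjust them to whatever the exchange's download actually uses.
import pandas as pd

prices = pd.read_csv("stock_prices.csv", parse_dates=["Date"])
prices = prices.sort_values("Date").set_index("Date")

# Expected calendar of weekdays between the first and last observation.
# Exchange holidays will appear as false positives, which is fine for a
# quick eyeball check.
expected = pd.bdate_range(prices.index.min(), prices.index.max())
missing = expected.difference(prices.index)

print(f"{len(missing)} weekday(s) with no price record:")
print(missing)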

(Feel free to skip some ugly details of the analysis: once we had the data, it was a straightforward task to check the correlations between predicted and actual stock prices. High correlations are said to indicate validity of the model, and further statistical tests, based on regressing predicted against actual values, yield a t-statistic that can be compared with a threshold value derived under the assumption that stock prices follow a lognormal distribution. t-statistics larger than the threshold for a given confidence level allow us to reject the hypothesis that the model predicts the stock prices. Needless to say, we got abnormally large t-statistic values.)
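For the similarly curious, here is a rough sketch of the mechanics just described: correlate predicted and actual prices, regress one on the other, and compare the slope's t-statistic against a critical value. The numbers are randomly generated placeholders standing in for the model's output and the exchange data; this is the general shape of the exercise, not the exact test we ran.

# Rough sketch of the correlation / regression / t-statistic mechanics.
# The "predicted" and "actual" arrays below are placeholder data only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
predicted = rng.normal(100, 10, size=250)           # placeholder model prices
actual = predicted + rng.normal(0, 5, size=250)     # placeholder observed prices

# Correlation between predicted and actual prices.
corr = np.corrcoef(predicted, actual)[0, 1]

# OLS regression of actual on predicted; the slope's t-statistic is the
# slope divided by its standard error.
result = stats.linregress(predicted, actual)
t_stat = result.slope / result.stderr

# Two-sided critical value at 95% confidence with n - 2 degrees of freedom.
n = len(predicted)
t_crit = stats.t.ppf(0.975, df=n - 2)

print(f"correlation = {corr:.3f}, t = {t_stat:.2f}, critical value = {t_crit:.2f}")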

This was a huge eye-opener. If the biggest stock exchange in the country hosts incomplete data on stock prices, which should ideally be bread-and-butter stuff for it, then who can vouch for the accuracy of any data collection and distribution system in the country? We could, of course, have accessed the Bloomberg terminals at our institute and hopefully found complete data there, but that database is not free to the public at large. In any case, one set of analyses is all I was going to do for a fraction of a course credit. I hadn't realized, until I got the ridiculous results, that something had to be up with the data. Thankfully for me, it was not so big that I couldn't scroll through it for a while and figure out the exact nature of the errors.

An effort to generate implicit trust in the size, or more accurately the volume, of data glosses over several pitfalls that await those who delegate the handling of their data to third parties. Firstly, if the underlying process has stochastic elements, you can never predict it accurately. Secondly, the data collection and storage processes of even much-celebrated databases leave a lot to be desired. Thirdly, given that people intend to make big (huge?) bucks out of it, they have an incentive to under-report possible sources of error. Finally, given that the data to be analyzed is rather voluminous, checking it for consistency and accuracy is a difficult task. I don't even want to start on the strategic issues involved in having the knowledge from your information go to folks who may work for competing entities in the future, or the erroneous strategic insights that arise from applying industry-standard practices to data that may be unique to a firm.

To sum up, I will try to deal with each of the achievements attributed to big data in the excerpt quoted from the linked article. You may be able to decode human DNA in minutes. Possibly; good for you. A cure for cancer? Sounds like a Nobel in waiting. Foil terrorist attacks? How many have you prevented to date? Pinpoint marketing efforts? Yeah, sure, I've had a lot of marketing efforts pinpointed at me. Did a lot of window shopping till I got bored and installed Adblock Plus. Prevent disease? Oh, please!