The Size of Your … Data

It’s not the size of your data, it’s what you do with it.
Or so claim men that are insecure about theirs. [Disclaimer: I have big hands. ‘nuff said.]

There appears to be confusion about what Big Data may achieve. There are the Marketing anecdotes (sic) about spurious relations, the long-standing anecdotes of credit card companies’ fraud detection, and there’s talk of ‘smart data’ use in e.g. organizations’ process mining to establish internal control quality. Each party bashes the others over ‘abuse’ of terms.

What happen? [for those that know their History of Memes]

1. We have more data available than ever before. We had various generations of data analysis tools already; every new generation introducing a. a hugely increased capability to deal with the data set sizes of their age, b. ease of use through prettier interfaces, c. fewer requirements on ex ante correlation hypothesis definitions. Now, we seem to come to a phase where we have all the data we might possibly want (not) with tools that allow any fantastic result to appear out of thin air (not).

2. Throwing humongous amounts of (usually, marketing- or generic socmed) data at an analysis tool may automatically deliver correlations, but these come in various sorts and sizes:

  • The usual suspects; loyal brand/product followers will probably be hard to get into lift-shift-retention mode. If (not when) valid, these results still are not really interesting because they would (should!) have been known already from ‘traditional’ research;
  • The false positives; spurious correlations, co-variance, etc., induced by data noise. Without human analysis, many wrongs can be assumed. All too often, correlation is taken (emotionally at least) to be close to or somewhat the same as causation; over and over again. How dumb can an analyst be, to not (sufficiently) be aware of their own psychological biases! Tons of them are around and impact the work, and the more one is convinced not to be (psychologically) biased, the more one will be and the worse the impact will be. Let alone that systemic biases can be found all too often;
  • The true positives. The hidden gems.

We don’t have any data on the amount of spurious results vis-a-vis useful results (the last bullet above) to know how well (effectively, efficiently) we do with both automated correlation discovery and human analysis, which would be tell-tale in the world of analyse-everything. But what would you expect from this overwhelmingly inductive approach?
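To get a feel for the false positives bullet, here is a minimal sketch in Python. It is an illustration of mine under stated assumptions, not a measurement of any real data set: the columns are pure Gaussian noise, and the names and settings (`count_spurious`, 200 variables, 30 observations each, a cut-off of |r| > 0.5) are arbitrary choices. It simply counts how many pairs of entirely unrelated columns still look 'correlated'.

```python
import itertools
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def count_spurious(n_vars=200, n_obs=30, threshold=0.5, seed=42):
    """Count pairs of pure-noise columns whose |r| exceeds the threshold.

    All parameters are illustrative assumptions, not values from the text.
    """
    rng = random.Random(seed)
    # Every column is independent noise: there is nothing real to find.
    columns = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]
    pairs = list(itertools.combinations(range(n_vars), 2))
    hits = [
        (i, j)
        for i, j in pairs
        if abs(pearson(columns[i], columns[j])) > threshold
    ]
    return len(hits), len(pairs)


hits, total = count_spurious()
print(f"{hits} of {total} pairs of unrelated noise columns have |r| > 0.5")
```

Every hit here is a false positive by construction. The share per pair is small, but the number of pairs grows roughly with the square of the number of variables, so adding more columns adds more 'discoveries' that somebody then has to weed out by hand.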

3. Yes, until now, in history we seem to have done quite well with deductive approaches, from the pre-Socratics until the recent discovery of the Higgs boson… in all sciences, including the classic social/sociological scientists like the Greek and Roman authors (yes, the humanities are deductive sociology) and the deep-thinking-by-definition philosophers.
The ‘scientists’ who relied on inductive approaches … we don’t even know their names (anymore) because their ‘theories’ were all refuted so completely. Yet, the above data bucket approach is no more than just pure and blind induction.

4. Ah, but then you say, ‘We aren’t that stupid, we do take care to select the right data, filter and massage it until we know it may deliver something useful.’ Well, thank you for numbing down your data; out go the false but also the true positives..! And the other results, you should have had a long time ago already via ‘traditional’ approaches. No need to call what you do now Big Data Analysis. Either take it wholesale, or leave it!

5. Taking it wholesale will take tons of human analysis and control (over the results!); the Big Data element will dwindle to negligible proportions if you do this right. Big Data will be just a small start to a number of process steps that, if done right, will lean towards refinement through deduction much more than towards the induction-only approach that Big Data is trumpeted to be. This can be seen in e.g. some TLA having collected the world’s communications and Internet data; there are so many dots, and so many dots connected, that the significant connected dots are missed time and time again. It appears infeasible to separate the false positives from the true positives, or we have a Sacrifice Coventry situation. So, to repeat: no need to call all this Big Data.

6. And then there’s the ‘smart data’ approach of not even using too much data, but using what’s available because there aren’t yottabytes out there. I mean, even the databases of business transactions in the biggest global companies don’t hold what we’d call Big Data. But there’s enough (internal) transaction data to be able to establish, through automated analysis, how the data flows through the organization, which is then turned into ‘process flow apparent’ schemes. Handy, but what then …? And there’s no need at all to call this stuff Big Data, either.

So, we conclude that ‘Big Data’ is just a tool, and the world is still packed with fools. Can we flip the tools back again to easily test hypotheses? Then we may even allow some inductive automated correlation searches, to hint at possible hidden causations that may be refined and tested before they can be useful.
Or we’ll remain stuck in the ‘My Method is More What Big Data Is Than Yours’.

So, I can confidently say: Size does matter, and so does what you do with it.

Next blog will be about how ‘predictive’ ‘analysis’ isn’t on both counts.

Maverisk / Étoiles du Nord