The Size of Your … Data

It’s not the size of your data, it’s what you do with it.
Or so claim men that are insecure about theirs. [Disclaimer: I have big hands. ‘nuff said.]

There appears to be confusion about what Big Data may achieve. There are the Marketing anecdotes (sic) about spurious relations, the long-standing anecdotes of credit card companies’ fraud detection, and there’s talk of ‘smart data’ use in e.g. organizations’ process mining to establish internal control quality. Each party bashes the others over ‘abuse’ of terms.

What happen? [for those that know their History of Memes]

1. We have more data available than ever before. We had various generations of data analysis tools already; every new generation introducing a. a hugely increased capability to deal with the data set sizes of their age, b. ease of use through prettier interfaces, c. fewer requirements on ex ante correlation hypothesis definitions. Now, we seem to come to a phase where we have all the data we might possibly want (not) with tools that allow any fantastic result to appear out of thin air (not).

2. Throwing humongous amounts of (usually, marketing- or generic socmed) data at an analysis tool may automatically deliver correlations, but these come in various sorts and sizes:

  • The usual suspects; loyal brand/product followers will probably be hard to get into lift-shift-retention mode. If (not when) valid, these correlations still are not really interesting because they would (should!) have been known already from ‘traditional’ research;
  • The false positives; spurious correlations, co-variance, etc., induced by data noise. Without human analysis, many wrongs can be assumed. All too often, correlation is taken (emotionally at least) to be close to or somewhat the same as causation; over and over again. How dumb can an analyst be, to not (sufficiently) be aware of their own psychological biases! Tons of them are around and impact the work, and the more one is convinced not to be (psychologically) biased, the more one will be and the worse the impact will be. Let alone that systemic biases can be found all too often;
  • The true positives. The hidden gems.

We don’t have any data on the amount of spurious results vis-a-vis useful results (the true positives) to know how well (effective, efficient) we do with both automated correlation discovery and human analysis, which would be telling in the world of analyse-everything. But what would you expect from this overwhelmingly inductive approach?
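To get a feel for how easily the false positives of point 2 appear, here is a minimal sketch in Python (standard library only). It is an illustration under stated assumptions, not a measurement of any real data set: the columns are pure Gaussian noise, and the function names (`pearson`, `count_spurious`), the 0.5 threshold, and the 100-by-20 sizes are all hypothetical choices made up for this example.

```python
import itertools
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def count_spurious(n_variables, n_observations, threshold, seed=0):
    """Count pairs of independent noise columns whose |r| reaches the threshold.

    Every column is random noise by construction (an assumption of this
    sketch), so any 'correlation' found here is spurious.
    """
    rng = random.Random(seed)
    columns = [
        [rng.gauss(0.0, 1.0) for _ in range(n_observations)]
        for _ in range(n_variables)
    ]
    pairs = list(itertools.combinations(range(n_variables), 2))
    hits = sum(
        1 for i, j in pairs
        if abs(pearson(columns[i], columns[j])) >= threshold
    )
    return hits, len(pairs)


# Hypothetical sizes: 100 unrelated 'metrics', 20 observations each.
hits, total = count_spurious(n_variables=100, n_observations=20, threshold=0.5)
print(f"{hits} of {total} pairs of pure-noise columns have |r| >= {0.5}")
```

The more columns you throw in relative to the number of observations, the more such ‘findings’ a blind search will hand you; separating them from the hidden gems still takes a hypothesis and a human.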

3. Yes, until now, in history we seem to have done quite well with deductive approaches, from the pre-Socratics until the recent discovery of the Higgs boson… in all sciences including the classic social/sociological scientists like the Greek and Roman authors (yes, the humanities are deductive sociology) and the philosophers, deep thinkers by definition.
The ‘scientists’ who relied on inductive approaches … we don’t even know their names (anymore) because their ‘theories’ were all refuted so completely. Yet, the above data bucket approach is no more than just pure and blind induction.

4. Ah, but then you say, ‘We aren’t that stupid, we do take care to select the right data, filter and massage it until we know it may deliver something useful.’ Well, thank you for numbing down your data; out go the false but also the true positives..! And the other results, you should have had a long time ago already via ‘traditional’ approaches. No need to call Big Data Analysis what you do now. Either take it wholesale, or leave it!

5. Taking it wholesale will take tons of human analysis and control (over the results!); the Big Data element will dwindle to negligible proportions if you do this right. Big Data will be just a small start to a number of process steps that, if done right, will lean towards refinement through deduction much more than being the induction-only approach that Big Data is trumpeted to be. This can be seen in e.g. some TLA having collected the world’s communications and Internet data; there are so many dots and so many dots connected, that the significant connected dots are missed time and time again; it appears infeasible to separate the false positives from the true positives, or we have a Sacrifice Coventry situation. So, to repeat: no need to call all this Big Data.

6. And then there’s the ‘smart data’ approach of not even using too much data, but using what’s available because there aren’t yottabytes out there. I mean, even the databases of business transactions in the biggest global companies don’t hold what we’d call Big Data. But there’s enough (internal) transaction data to be able to establish, through automated analysis, how the data flows through the organization, which is then turned into ‘process flow apparent’ schemes. Handy, but what then …? And there’s no need at all to call this stuff Big Data, either.

So, we conclude that ‘Big Data’ is just a tool, and the world is still packed with fools. Can we flip the tools back again to easily test hypotheses? Then we may even allow some inductive automated correlation searches, to hint at possible hidden causations that may be refined and tested before they can be useful.
Or we’ll remain stuck in the ‘My Method is More What Big Data Is Than Yours’.

So, I can confidently say: Size does matter, and what you do with it.

Next blog will be about how ‘predictive’ ‘analysis’ isn’t on both counts.

Double blind

Just a question: Would anyone know some definitive source, or pointers, to discussions either formal or informal, on the logic behind double secrets i.e. situations where it is a secret that some secret exists ..?

Yeah, it’s relevant in particular now that some countries’ governments seem to have failed to keep that double secret completely, but it should be dealt with more systematically, I think, also re regular business-to-business (and -to-consumer) interactions.

So, if you have some neat write-ups of formal logic systems approaches, I’d be grateful. TIA!

No Ethical Hacking, Please!

We still see quite a market for ‘ethical’ hacking out in the information security consulting world. However, if this type of activity should have a name, it would be wise for the name to be descriptive, right? Rather than deceiving, swindling… We certainly won’t do that, sir, no way.

We’d call it ‘ethical’ if the purpose of it all would be to further the ethical goals of the ones doing it. Now take a look at who’s doing it. ‘Ethical’ hacking. And for what: Moneyyy! Hey indeed, it is the consultants and Big4 accountants that will only and exclusively do it for the money. You say No? Have you tried to talk just an hour off their bills because the hacking that they do (more on that, below) serves some ethical purpose that they are happy to work on for free ..? A great many would consider doing just anything that pays, and not doing any of it otherwise, the direct opposite, the utmost perversion of ‘ethical’ behaviour. Yet, that’s where we are with ‘ethical’ hacking.

Now for the ‘hacking’ part. Most of that is non-existent again. It’s primarily penetration testing using off-the-shelf freeware tools. Can be done from any phablet while driving, or it’s so outdated that it should serve no purpose. OK, you got me there. Even antiquated tools will find big holes in clients’ defenses that could and should have been fixed aeons ago, you know, decades of internet time (a couple of years in our time). And about that entering through a small hole: it’s still rather common to not go there, stay virgin and only do some port scanning.
So, [except for the few good men that do understand what they’re concocting] no hacking together one’s own new baby tools takes place. Yes, hacking, as in state-of-the-art coding (programming for those of you who have been hibernating the last decade) without the need for any bureaucrat’s architecture principles but with a deep understanding of languages’ strengths and pitfalls.

So there we have it. Let loose some basic scanning tools, write up a fat report with some fancy letterhead and the usual suspects in findings; long live copy-paste, and bill ‘em for some ridiculous amount that goes straight into the coffers of some elderly gentlemen partners that don’t know how to use the Internet … except for, well, you know, searching for pictures.

Therefore, in search of a truthful, descriptive name, let’s either revert to ‘penetration testing’, which most men wouldn’t feel comfortable with, or even just ‘port scanning’, or find some new designation. Mammon scanning, or so. But let’s not call it ‘ethical’ ‘hacking’ – two humongous wrongs don’t make a right.

Next up, maybe, a rephrased repost of @meneer’s #ditchcyber argument.

Maverisk / Étoiles du Nord