Big Data – Maverisk / Étoiles du Nord

Data Classinocation

I was studying this ‘old’ idea of mine of drafting some form of impact-based criteria for data sensitivity when, along with a couple of fundamental logical errors in some of the most formally adopted (incl legal) standards and laws, I suddenly realised:

In these times of easily provable de-anonymisation of even the most protective homomorphic encryption, multiplied by the ease of de-anonymisation through data correlation of even the most innocent data points, all data points/elements, even the most innocent, must (not should) be classified at the highest sensitivity levels, so why classify data ..!?

This may not be a popular point, but that doesn’t make it less true.
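
To make the correlation point a bit more tangible, here is a minimal sketch of a linkage attack. All records, names and the choice of quasi-identifiers are made up for illustration; the only assumption is that an 'anonymised' set and some public set share a few innocent-looking fields.

```python
# Hypothetical 'anonymised' records: names stripped, innocent fields kept.
anonymised = [
    {"postcode": "1011", "birth_year": 1975, "sex": "F", "diagnosis": "asthma"},
    {"postcode": "1011", "birth_year": 1982, "sex": "M", "diagnosis": "diabetes"},
    {"postcode": "3512", "birth_year": 1975, "sex": "F", "diagnosis": "migraine"},
]

# Hypothetical public register with the same innocent fields, plus names.
public = [
    {"name": "A. Jansen", "postcode": "1011", "birth_year": 1975, "sex": "F"},
    {"name": "B. de Vries", "postcode": "1011", "birth_year": 1982, "sex": "M"},
    {"name": "C. Bakker", "postcode": "3512", "birth_year": 1975, "sex": "F"},
    {"name": "D. Smit", "postcode": "3512", "birth_year": 1975, "sex": "F"},
]

QUASI_IDENTIFIERS = ("postcode", "birth_year", "sex")


def link(anonymised, public, keys=QUASI_IDENTIFIERS):
    """Re-identify records whose quasi-identifiers match exactly one person."""
    index = {}
    for person in public:
        index.setdefault(tuple(person[k] for k in keys), []).append(person["name"])
    reidentified = {}
    for record in anonymised:
        candidates = index.get(tuple(record[k] for k in keys), [])
        if len(candidates) == 1:  # a unique match means the name is recovered
            reidentified[candidates[0]] = record["diagnosis"]
    return reidentified


result = link(anonymised, public)
```

In this toy set, two of the three 'anonymous' records link back to a single name; the third is protected only because two people happen to share its field values.
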
In similar vein, in European context where one is only to process data in the first place if (big if) there is no alternative and one can process for the Original intent and purpose only,

To prevent data from unauthorised disclosure internally or externally, without tight need-to-know/need-to-use IAM implementation, one already does too little; with, enough.
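
What 'tight need-to-know/need-to-use' could look like in its barest form is sketched below. The users, data elements, purposes and the `may_access` helper are all hypothetical; a real IAM implementation would of course be far richer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    user: str
    data_element: str
    purpose: str


# Hypothetical register of the original purpose each data element was collected for.
ORIGINAL_PURPOSE = {
    "customer_address": "delivery",
    "iban": "billing",
}

# Hypothetical need-to-know grants: per user, per element, per purpose.
GRANTS = {
    Grant("alice", "customer_address", "delivery"),
    Grant("bob", "iban", "billing"),
}


def may_access(user, data_element, purpose):
    """Deny by default; allow only the original purpose and an explicit grant."""
    if ORIGINAL_PURPOSE.get(data_element) != purpose:
        return False  # processing for anything but the original purpose
    return Grant(user, data_element, purpose) in GRANTS
```

Note that a blanket 'internal use only' label has no place in such a scheme: being an insider grants nothing by itself.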

That’s right; ‘internal use only’ is waaay too sloppy, hence illegal: it breaks the legal requirement for due (sic) protection, and if unauthorised use of the data is left possible (‘by negligence’ not changing a thing here), the European privacy directive (and its currently active precursors) does not allow you to even have the data. This may be a stretch but is still understandable and valid once you take the effort to think it through a bit.
Maybe also not too popular.

Needless to say that both points will not be understood in the least by all the ‘privacy officer’ types that have rote learned the laws and regulations, but have no experience/clue how to actually use those in practice and just wave legal ‘arguments’ (quod non) around as if their (song and) dance is the end purpose of the organisation, but cannot answer even the most simple questions re allowability of some data/processing with anything that logically or linguistically approaches clarity. [Note the ‘or’ is a logical one, not the sometimes interpreted xor that the too-simpletons (incl ‘privacy officers’) interpret but don’t know exists.]

OK. So far, no good. Plus:

[Not a fortress, nor a real maze once you see the structure; Valencia]

Big Data as a sin

Not just any sin, the Original one. Eating from the ultimate source of Knowledge that Big, Totalitarian, All-Thinkable Data is, in the ideal (quod non).
We WEIRDS (White, Educated, Industrialised, Rich, Democratic people), a.k.a. Westerners, know what that leads to. Forever we will toil on spurious correlations…

Quick Note: Big Data or ..?

“I do not know what I may appear to the world, but to myself I seem to have been only like a boy playing with pebbles on the seashore, and diverting myself in now and then finding a smoother pebble or a prettier shell than ordinary, whilst the great ocean of truth lay undiscovered before me” [I. Newton]

Whence my feeling when reading this, that I was looking into Big Data ..? Maybe Big Data could be made to work when set loose on the world’s major problems. So, no petty process analysis or what have we; onto serious fruition!

But then, it turns out that such problem solving, in particular such problem solving, needs no more data: as shown throughout history, these problems are solvable without it, and where data was available (yes, far more commonly available in the heads of otherwise decent, much less looking-away kind of people) it wasn’t used properly, or was even used in opposition.
Apart from the applications where it is used fully wherever more comes available and still not bringing us much closer to eternal humanitarian bliss.

For the humanity’s departure:
[DSCN0124]

Deinduction

OK. To be, think, human, two things seem to be required:
No, not the dichotomy of deduction versus induction. Not so literally (literally, I mean like owemygawd). But the top-to-ground-then-back-up-again ‘logical’ goal-directed problem-solving reasoning, versus the speculative wandering of the mind. Perspiration, and Inspiration. Taking correlation for causation, versus fuzzy-logic supported hypothesizing. OK, I admit I threw in the fuzzy logic part to confuse, and to discombobulate your comprehension.
But still, therein lies the foundation of Theories, the brickwork of thinking: Is there a priori knowledge, or is everything we know only valid within its own framework of reference..? Is the definition of definition circular or not, or in some circle..? Should, must be, the basis for theory-building.
Expanded upward by Kuhn and Lakatos, drilled down by a great many, philosophers mostly — that haven’t delivered workable answers yet. Not workable at least, to span the gap in between neurobiology and psychology. Which is where AI-as-we-know-it will have its place, after which it will be vastly expanded to cover it all. Maybe not individually embodied, but will.
And, there’s no either/or. There’s the spectrum ..!!

And all this, relevant for the grounding (both ways, please) of ‘Big Data’. Think that one through!

Also,

[Close, but no torte in the Sacher Stube…]

One IoTA FYI

To close off [almost, since @KPN fraud themselves away from bankruptcy by a series of outright lies to customers and tort] the year with a wild shot, ahead:
There is value in the information analysis in IoT, as described in Gelernter and many since, of the two-way flow of information. One flow, going up, is information in the form of answers, as aggregations or pattern-matched tuples(ets); the other, going down, carries both commands and inquiries/questions.

This fits the IoT world snugly, and should be taken into account when developing IoTAuditing frameworks:
What we’re after of course in all of auditing (and this we consider self-evident, or else go back to study auditing fundamentals, from agency theory!) are the controls that keep the quality of the back/forth, i.e. down/up, information flows within (client-!)required margins. No more! But be aware of who the client really is, not the one doing the actual paying. So, we may focus on the integrity of the information flows first and foremost, then the continuity (availability), and then confidentiality as an afterthought.
With neat break-downs to isolation, appropriate input/output buffering (anyone still aware of the difference between an interrupt and a trap? If not, take a hike and learn, and weep), integrity controls above all. And something on (establishing) the quality of aggregation and of the questions being pushed down: when the wrong questions get asked, e.g. by lack of understanding of the subject matter (sic), as is so very commonplace in the vast majority of organisations today, the wrong results will turn up from within the data pool (reporting ‘up’wards).
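
As a rough illustration only, the down/up flows with an integrity control on both could be sketched as below. The shared key, the `Sensor` class, the message layout and the 'latest' question are all assumptions made up for this sketch, not anything from Gelernter or from an existing IoT auditing framework.

```python
import hashlib
import hmac
import json
from statistics import mean

SHARED_KEY = b"demo-key-not-for-production"  # assumption: pre-shared key


def sign(payload):
    """Wrap a payload with an HMAC tag so the receiver can check integrity."""
    body = json.dumps(payload, sort_keys=True).encode()
    tag = hmac.new(SHARED_KEY, body, hashlib.sha256).hexdigest()
    return {"payload": payload, "tag": tag}


def verify(message):
    """Recompute the tag over the payload and compare in constant time."""
    body = json.dumps(message["payload"], sort_keys=True).encode()
    expected = hmac.new(SHARED_KEY, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, message["tag"])


class Sensor:
    def __init__(self, name, readings):
        self.name = name
        self.readings = readings

    def handle(self, message):
        """Downward flow in, upward flow out: answer only verified questions."""
        if not verify(message):
            return None
        if message["payload"].get("question") == "latest":
            return sign({"sensor": self.name, "value": self.readings[-1]})
        return None


def aggregate(answers):
    """Upward flow: only integrity-checked answers enter the aggregate."""
    values = [a["payload"]["value"] for a in answers if a is not None and verify(a)]
    return mean(values) if values else None


sensors = [Sensor("s1", [20.0, 21.0]), Sensor("s2", [22.0, 23.0])]
question = sign({"question": "latest"})            # going down
answers = [s.handle(question) for s in sensors]    # coming up
answers[1]["payload"]["value"] = 99.0              # simulated tampering in transit
overall = aggregate(answers)                       # tampered answer is dropped
```

The sketch says nothing about whether 'latest' was the right question to push down in the first place; that quality control stays a human affair.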

And of course there’s the divide between
the operational world where actual business is done (either administratively in offices, though one could argue (i.e. prove beyond recovery) that this isn’t actually doing anything worthwhile, or producing stuff), and
the busybodies world ‘above’ (quod non) that, which thinks (wrongly) to be able to ‘control’ and ‘steer’ the productive body, sometimes rising itself into the thin air levels of absolute ridicule (by) branding itself ‘governance’.
But do re-read all of last year’s posts and weep. But do also see the implications for variance in the integrity, availability, and confidentiality needs at various (sub)levels.

And:

[The 2016 way is up; Cala at Barça]

Big Data, Little Decision-making

Are you ready for the coming revolution? That is in the wings by way of the data deluge that will cripple your ability to accomplish anything because you’re overwhelmed with data (“information” quod non!) to act upon in masses so vast you can’t even begin to use actionable results from analysis of it in a way that actual decisions are reached, communicated, and put into actual action.
Yes, yes, some of you will say that AI will arrive just-in-time to save the day. But that is much more wishful thinking out of fear than realistic futuring. And no, the exponential growth of data cannot be caught up with by exponential growth of AI capabilities and -spread before you’ve drowned.

Anyone see a way out, other than just ignoring or stifling data growth until by the skin of our teeth we can continue..?

Oh well, this:

[Reckon you’ll win ..!? in Berlin]

Half an argument for mainframes

This here article is somewhat interesting… Explanatory, but also lacking some. E.g., some strengths are given, but not how they would be competitive advantages over a mega-dc of blades or so, as the really big players do.
Oh well. Who cares ..?

For now:

[Plain ~~vanilla~~ Vienna damn auto’rekt]

Diversified Reporting Assurance

Yes, let’s call it DRA. The new wave of “accountants’ statements” in the wings.
[Warning: for those not interested in accountancy, the rest will be boring. Or, let me restate that: very boring. Or even deadly boring.]

Mo’Data, Mo’Problems

Some time ago, I was triggered by this tweet (by @meneer; no surprise in that):

Weer bizar weerbeeld… prognose #weerplaza van zuidoost naar noordwest, #buienradar van noordoost naar zuidwest— André Koot RCX (@meneer) July 26, 2014

that somewhat-translates (i.e., manually, however clunky still better than machine translation as that doesn’t get Dutch unstructuredness…) to: “Bizarro weather picture again: forecast #somechannel/app from the South-East to the North-West, #someotherchannel/app from the North-East to the South-West” referring to some predictions about clouds and (turned out quite torrential) rain passing over the minute geography of the Netherlands.

And another about this article – that explains, in a more scientifically styled prose, that having ever more data makes it ever more difficult to connect the dots you’d want to connect…

Both of which are poignant reminders that:

Big Data is not a goal but a mere tool, to be used very carefully even (or in particular?) by the few that have really big data sets. If you collect focusedly, it can hardly be called Big, rather ‘Smart-‘ or just plain ‘data analysis’, no more; if you collect as much as you can, you are destroying objectives achievement – the required method destroys the results;
If, very big if, Big Data would result in anything, why haven’t weather predictions improved ..? The enormity of data that had already been around in that arena for decades, will have exploded over the past one, and should have resulted in far better predictions instead of the worse that the predictions seem to have gotten. And we’re talking patterns, not even the zoom-in to tinier details that one commonly associates with BD (the major patterns are usually skipped for being too well known already). Hence, what hope would we have for other areas..?
Reliance on apps for info is getting more and more dangerous; not quite literally so far, but in an indirect sense, already widely so. What if… when, as is already well-known, some search giant might have monopolized Search and skews the results you get…? That would theoretically be a disaster. Oh.

So, think again, be ever more critical of Shallows app usage and reliance… I’ll leave you with:
[Lucca: ‘modern’ Italian parade]

The Size of Your … Data

It’s not the size of your data, it’s what you do with it.
Or so claim men that are insecure about theirs. [Disclaimer: I have big hands. ‘nuff said.]

There appears to be confusion about what Big Data may achieve. There’s the Marketing anecdotes (sic) about spurious relations, the long-standing anecdotes of credit card companies’ fraud detection, and there’s talk of ‘smart data’ use in e.g. organizations’ process mining to establish internal control quality. One party bashing the others over ‘abuse’ of terms.

What happen? [for those that know their History of Memes]

1. We have more data available than ever before. We had various generations of data analysis tools already; every new generation introducing a. a hugely increased capability to deal with the data set sizes of their age, b. ease of use through prettier interfaces, c. fewer requirements on ex ante correlation hypothesis definitions. Now, we seem to come to a phase where we have all the data we might possibly want (not) with tools that allow any fantastic result to appear out of thin air (not).

2. Throwing humongous amounts of (usually, marketing- or generic socmed) data at an analysis tool may automatically deliver correlations, but these come in various sorts and sizes:

The usual suspects; loyal brand/product followers will probably be hard to get into lift-shift-retention mode. If (not when) valid, still are not really interesting because they would (should!) have been known already from ‘traditional’ research;
The false positives; spurious correlations, co-variance, etc., induced by data noise. Without human analysis, many wrongs can be assumed. All too often, correlation is taken (emotionally at least) to be close to or somewhat the same as causation; over and over again. How dumb can an analyst be, to not (sufficiently) be aware of their own psychological biases! Tons of them are around and impact the work, and the more one is convinced not to be (psychologically) biased, the more one will be and the worse the impact will be. Let alone that systemic biases can be found all too often;
The true positives. The hidden gems.

We don’t have any data on the amount of the spurious results vis-a-vis useful results (next bullet) to know how well (effective, efficient) we do with both automated correlation discovery and human analysis, which would be tell-tale in the world of analyse-everything. But what would you expect from this overwhelmingly inductive approach?
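
How easily the false positives show up can be tried at home. Below is a minimal sketch, assuming nothing but pure noise: a couple of hundred completely independent random 'variables' with few observations each, and a count of how many pairs still look strongly correlated. The numbers (200 variables, 30 observations, a 0.5 threshold) are arbitrary picks for illustration.

```python
import random
from itertools import combinations


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def count_spurious(n_vars, n_obs, threshold, seed=0):
    """Count pairs of independent noise columns with |r| at or above threshold."""
    rng = random.Random(seed)
    columns = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]
    pairs = list(combinations(range(n_vars), 2))
    hits = sum(
        1 for i, j in pairs
        if abs(pearson(columns[i], columns[j])) >= threshold
    )
    return hits, len(pairs)


hits, pairs = count_spurious(n_vars=200, n_obs=30, threshold=0.5)
print(f"{hits} 'strong' correlations among {pairs} pairs of pure noise")
```

Every single hit here is spurious by construction; with real data one doesn't have the luxury of knowing that up front.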

3. Yes, until now, in history we seem to have done quite well with deductive approaches, from the pre-Socratics until the recent discovery of the Higgs boson… in all sciences including the classic social/sociological scientists like the Greek and Roman authors (yes, the humanities are deductive sociology) and the deep thinking by definition philosophers.
The ‘scientists’ who relied on inductive approaches … we don’t even know their names (anymore) because their ‘theories’ were all refuted so completely. Yet, the above data bucket approach is no more than just pure and blind induction.

4. Ah, but then you say, ‘We aren’t that stupid, we do take care to select the right data, filter and massage it until we know it may deliver something useful.’ Well, thank you for numbing down your data; out go the false but also the true positives..! And the other results, you should have had already a long time ago via ‘traditional’ approaches. No need to call Big Data Analysis what you do now. Either take it wholesale, or leave it!

5. Taking it wholesale will take tons of human analysis and control (over the results!); the Big Data element will dwindle to negligible proportions if you do this right. Big Data will be just a small start to a number of process steps that, if done right, will lean towards refinement through deduction much more than being the induction-only that Big Data is trumpeted to be. This can be seen in e.g. some TLA having collected the world’s communications and Internet data; there’s so many dots and so many dots connected, that the significant connected dots are missed time and time again; it appears infeasible to separate the false positives from the true positives, or we have a Sacrifice Coventry situation. So, repeated, no need to call this all, Big Data.

6. And then there’s the ‘smart data’ approach of not even using too much data, but using what’s available because there’s not yottabytes out there. I mean, even the databases of business transactions in the biggest global companies don’t hold what we’d call Big Data. But there’s enough (internal) transaction data to be able to establish through automated analysis, how the data flows through the organization, which is then turned into ‘process flow apparent’ schemes. Handy, but what then …? And there’s no need at all to call this stuff Big Data, either.

So, we conclude that ‘Big Data’ is just a tool, and the world is still packed with fools. Can we flip the tools back again to easily test hypotheses? Then we may even allow some inductive automated correlation searches, to hint at possible hidden causations that may be refined and tested before they can be useful.
Or we’ll remain stuck in the ‘My Method is More What Big Data Is Than Yours’.
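
Flipping the tools back to hypothesis testing could, in its simplest form, look like the sketch below: take one correlation that was hypothesised up front and check, via a permutation test, how often shuffled (i.e. meaningless) data would do just as well. The permutation test is my pick for illustration, and the data and the number of permutations are made up.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def permutation_p_value(xs, ys, n_perm=2000, seed=0):
    """Share of shuffles whose |r| is at least as large as the observed |r|."""
    rng = random.Random(seed)
    observed = abs(pearson(xs, ys))
    shuffled = list(ys)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks any real link between xs and ys
        if abs(pearson(xs, shuffled)) >= observed:
            extreme += 1
    return (extreme + 1) / (n_perm + 1)  # add-one so the estimate is never zero


# Made-up data with a built-in relation, standing in for an ex ante hypothesis.
xs = list(range(20))
ys = [2 * x + (x % 3) for x in xs]
p = permutation_p_value(xs, ys)
```

A small value of `p` only says the relation is unlikely to be noise; whether it is causation still takes the deductive legwork.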

So, I can confidently say: Size does matter, and what you do with it.

Next blog will be about how ‘predictive’ ‘analysis’ isn’t on both counts.