Wow that seems like two handfuls of something…
This is about a little part of the AI problem complex of the moment: ETL. That is how Extract, Transform, Load is usually abbreviated, though few people care to remember what it stands for. It is Data Analytics talk for getting the data out of some basic systems like SAP (E), pruning it and applying all sorts of other manual corrections (yikes! the T), and then slurping it into some analytics (or visualisation) tool (L).
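For those who like to see the three letters side by side, a minimal sketch in Python. Everything in it is an assumption for illustration: the little CSV export standing in for SAP, the field names, and above all the two correction rules in `transform`, which are exactly the kind of human judgment calls this post is about.

```python
import csv
import io

# Hypothetical export standing in for a source system such as SAP.
RAW_EXPORT = """invoice_id,amount,region
1001,250.00,north
1002,,south
1003,-40.00,north
1004,125.50,
"""

def extract(text):
    """E: read the raw rows as dictionaries, untouched."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """T: the manual corrections; each rule below is a human judgment call."""
    kept, dropped = [], []
    for row in rows:
        if not row["amount"] or not row["region"]:
            dropped.append((row["invoice_id"], "missing field"))
        elif float(row["amount"]) < 0:
            dropped.append((row["invoice_id"], "negative amount"))
        else:
            kept.append({
                "invoice_id": row["invoice_id"],
                "amount": float(row["amount"]),
                "region": row["region"],
            })
    return kept, dropped

def load(rows):
    """L: hand the cleaned rows to the analytics tool (here: a dict keyed by id)."""
    return {row["invoice_id"]: row for row in rows}

rows = extract(RAW_EXPORT)
kept, dropped = transform(rows)
warehouse = load(kept)

print(len(rows), "extracted;", len(kept), "loaded;", len(dropped), "dropped")
for invoice_id, reason in dropped:
    print("dropped", invoice_id, "because", reason)
```

Note that the sketch keeps a list of what was dropped and why. Three out of four rows vanish in the T, and without that list nobody downstream would know.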
The L is easiest, once one has gone through the E part, plus some T sauce. That T sauce, however, is too hot to handle.
Since it is so error-prone. Error, as in accidental human failing at the medium, data, information and ethical levels. Error, as in failing in bad faith, at the same levels. Bias, anyone?
The problem: when there is, e.g., bias in the source data, does one toss out the respective data points? Or might they contain valid information that one misses when dismissing the misfits outright, even before one has had a chance to find out what role they played in the original data, ethically unwanted bias or not? [Note: I take it you understand the most kindergarten-basic concept of discrimination: distinguishing people on irrelevant criteria. What to do when the criteria are relevant!?]
And how good would a human be at detecting biases in source data ..? Not very. The very value of many a latter-day ML tool is in finding those hidden patterns that we miss. The experts miss.
Plus, how would you correct? If you were to leave out all cases where it turned out that bias played a role, you’ll end up with ideal cases only. But then your trained ML system will give results that are incomparable with (past) practice, and for effective (… → later on) and efficient development of ML-trained rule-based systems, one needs to optimise the fit with the past. That’s where your F1 score comes from: it measures how well the predictions match the labelled past. Mess with the source data, destroy the learning results.
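A toy calculation of that last point, as a sketch only. The eight past decisions, the two groups and both decision rules are made up; the rules are hand-written stand-ins for what a model might learn from the full data versus from data with the biased cases pruned out. Only the F1 formula itself (harmonic mean of precision and recall) is standard.

```python
def f1_score(actual, predicted):
    """F1 = harmonic mean of precision and recall, computed from booleans."""
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)
    fp = sum(1 for a, p in zip(actual, predicted) if not a and p)
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical past practice: (score, group, approved_in_the_past).
# Group "b" was historically held to a higher bar (70 instead of 50).
past = [
    (80, "a", True), (60, "a", True), (55, "a", True), (40, "a", False),
    (80, "b", True), (65, "b", False), (55, "b", False), (40, "b", False),
]

def rule_from_full_data(score, group):
    """Stand-in for a model fitted on everything: mimics the past, bias included."""
    return score >= (70 if group == "b" else 50)

def rule_from_pruned_data(score, group):
    """Stand-in for a model fitted after removing the biased cases: one bar for all."""
    return score >= 50

actual = [approved for _, _, approved in past]
f1_full = f1_score(actual, [rule_from_full_data(s, g) for s, g, _ in past])
f1_pruned = f1_score(actual, [rule_from_pruned_data(s, g) for s, g, _ in past])

print(round(f1_full, 2), round(f1_pruned, 2))  # prints: 1.0 0.8
```

The "fairer" rule scores worse, simply because the yardstick is the past itself. The number says nothing about which rule is the better one.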
Beyond that, what to do with later, continued learning, where unsupervised self-learning is all the rage?
In between already: when one prunes to get out the rules one wants and dismisses the others, why not turn all found patterns and rules into a classical expert system, without or, preferably, with fuzzy logic? Most explainable, transparent…
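What such a fuzzy rule base could look like, in a minimal sketch. The two rules, the income and debt thresholds and the names are all hypothetical; the only borrowed part is the classic choice of `min` for AND and `1 - x` for NOT on membership degrees.

```python
def ramp(x, low, high):
    """Membership degree rising linearly from 0 at `low` to 1 at `high`."""
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    return (x - low) / (high - low)

def evaluate(case):
    """Return the firing strength of each (hypothetical) rule for one case."""
    high_income = ramp(case["income"], 20_000, 60_000)
    high_debt = ramp(case["debt"], 5_000, 25_000)
    return {
        # Rule 1: IF income is high AND debt is NOT high THEN approve
        "approve": min(high_income, 1.0 - high_debt),
        # Rule 2: IF debt is high THEN refer to a human
        "refer": high_debt,
    }

strengths = evaluate({"income": 50_000, "debt": 10_000})
for rule, strength in strengths.items():
    print(rule, strength)
```

Every rule is readable, and every outcome can be traced back to the rule and the membership degrees that produced it. That is the transparency argument in a nutshell.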
But above all, what is ethically unwanted ..? Apparently, the inputs lead to relevant outcomes as they turn out to exist in the source bias. The ML is there to detect such patterns; if there were no relevance, no pattern would be calculated (sic; and leaving aside small-sample errors that aren’t biases but just errors).
Rather, who determines the vague ideas of what ‘society’ happens to consider just, for some time ..!? E.g., many Western societies have a core of values that are proclaimed to be based on Christianity; either some interpretation of how Jesus Christ’s words would apply in those later centuries, i.e., big fat interpretations on very often shady hidden intents, or hearkening back to the original intent as much as possible, where JC (as a full-on Jew) and those of any intellectual propensity above ignorant peasant level would have found the idea that salvation or support for one’s neighbours would be available for non-Jews quite despicable, bordering on the unthinkable. The Golden Rule wouldn’t apply to anyone outside the close circle… All ‘ethical’ discussions since are very time- and circumstance-bound, to put it mildly. As e.g., ‘democracy’ is so much on the decline around the world [fact]. And people don’t care about millions starving but do care about stray dogs in the same countries. Those against discrimination don’t bat an eye over discrimination of e.g., white male elderly (over 30) on the job market. When one wants to ‘correct’ a bias through some measure of equally low or worse moral value, one has no right to enforce [what one loathes oneself; or you’re in breach of the Golden Rule again].
OK, so much for difficulties with manual T. Now, …:
[Non-random colour scheme; Dublin]