Machine learning is really about mirroring data, and mirroring it in a way that sometimes approximates the thing that generates the data: the real-world process and its statistical footprints. If we over-mirror the data, we chase its noise and are subject to variance; if we under-mirror the data, we miss its structure and are subject to bias. Over-mirroring the data implies that we don’t expect it to change much, i.e., we are (arrogantly) confident that there is nothing new under the sun. Under-mirroring suggests that we haven’t fully assimilated what we’ve seen, nor do we believe everything we see. Reality lies somewhere in the middle, or to the left or right of that middle. The goal of machine learning is to identify that balancing point, such that we are able to confidently mirror not the data, but the underlying process which generates that data. So, our machine (or algorithm) is trying to mimic the thing spitting out data, as if it were that thing. This is called learning. Imagine an Elvis impersonator. Sometimes they overdo it. Sometimes they don’t do enough. Sometimes, we can’t tell the difference. Even so, whether we are talking about small data or big data, we are ultimately not interested in the data so much as in what is generating the data. Indeed, I’m sure the real Elvis may have had many sides to his personality that most of us will never be privy to. An impersonation that captures those other unobserved aspects of his persona will have little influence on our judgment as to whether a particular impersonator is believable, so we probably shouldn’t downplay the importance of observations too much.
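For the curious, here is a minimal sketch of mirroring too much and too little. Everything in it is an assumption made for illustration: the "real" process is a sine wave, the observations are that wave plus noise, and the learner is a plain nearest-neighbour average where k controls how closely we copy the data. With k=1 the learner repeats every noisy observation; with k equal to the whole sample it ignores the shape entirely; something in between tends to land closer to the process itself.

```python
import math
import random

def true_process(x):
    # Hypothetical "thing that generates the data".
    return math.sin(x)

def knn_predict(train, x, k):
    # Average the targets of the k training points nearest to x.
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def process_error(train, k, grid):
    # Mean squared gap between the learner and the noise-free process.
    gaps = [(knn_predict(train, x, k) - true_process(x)) ** 2 for x in grid]
    return sum(gaps) / len(gaps)

rng = random.Random(0)
xs = [rng.uniform(0, 2 * math.pi) for _ in range(60)]
train = [(x, true_process(x) + rng.gauss(0, 0.3)) for x in xs]
grid = [2 * math.pi * i / 199 for i in range(200)]

# k=1 copies the data, k=60 averages everything, k=7 sits in between.
errors = {k: process_error(train, k, grid) for k in (1, 7, 60)}
for k, err in errors.items():
    print(f"k={k:>2}: error against the process = {err:.3f}")
```

Note that the error is measured against the process, not against the noisy observations, which is the whole point of the Elvis analogy.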
We start with data. These days, lots of it. I mean lots and lots of it. Not megabytes. Not gigabytes. Not terabytes. Petabytes. That is roughly 2^50 bytes (about 10^15). This makes me chuckle as I remember drooling over the words “…support for up to 2 megabytes of RAM” as a n00b to the world of computing and Macintosh. But I digress. That data has a distribution; rather, it comes from a distribution. We know nothing about it. We only know what we observe (or what we can observe). Observations are samples. Samples are our best approximation of truth. The question is: does truth = Truth? We don’t know Truth. We want to know Truth. What does Truth look like? Well, we hope it looks like truth. At this point, if you are asking questions like “Does the distribution of data that we observe look anything like the distribution that it comes from?” then you are on the right track.
Thanks to sampling and our implicit belief in empirical knowledge, we tend to believe strongly that with more samples, we can hope to approach that Truth.
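A small sketch of that belief, under stated assumptions: here we cheat and pick the Truth ourselves (an exponential distribution with rate 1, chosen only for illustration), draw samples from it, and measure the largest gap between the observed cumulative distribution and the true one. That gap is the Kolmogorov-Smirnov distance, and it tends to shrink as the sample grows.

```python
import math
import random

def true_cdf(x):
    # Assumed "Truth": an exponential distribution with rate 1.
    return 1.0 - math.exp(-x)

def max_cdf_gap(samples):
    # Largest vertical gap between the empirical CDF and the true CDF
    # (the Kolmogorov-Smirnov distance).
    xs = sorted(samples)
    n = len(xs)
    gap = 0.0
    for i, x in enumerate(xs):
        f = true_cdf(x)
        gap = max(gap, (i + 1) / n - f, f - i / n)
    return gap

rng = random.Random(42)
gaps = {
    n: max_cdf_gap([rng.expovariate(1.0) for _ in range(n)])
    for n in (10, 100, 1_000, 10_000)
}
for n, gap in gaps.items():
    print(f"n={n:>6}: max gap between truth and Truth = {gap:.4f}")
```

In real life we never get to write `true_cdf`, of course; the sketch only shows why more samples give us reason for hope.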
In the old paradigm (Oh yes he did!), Truth looked like a Normal distribution. It was always Normal. Even if it wasn’t Normal, it was Normal. No matter what the data was, Truth was always Normal. We somehow knew this wasn’t really the Truth. But, we aren’t well equipped to deal with non-Normality, and frequently we take lazy comfort in not being too wrong.
This is a bit like losing a ring in the middle of a dark road, but looking for it under the nearest well-lit street lamp. It is a comforting effort (since we’re “doing something” and “taking action”) but a futile one, as the chances of finding the ring in a place where we didn’t lose it are low.
Well, not exactly. Of course, we can always approximate Truth as being something like a Normal distribution, and we won’t be too far off, mostly; except when we are. What if there was a better way? A more accurate way? Would we want to take that way? Well, it would depend on the required trade-off. What is the trade-off? Cost, Complexity, and Return. Is the cost and/or complexity of the better way worth the effort? Sometimes, No. Sometimes, yes. And sometimes, Yes!
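To make the "except when we are" part concrete, here is a rough sketch. The data-generating process is an assumption (a skewed, heavy-tailed lognormal, picked only because it is easy to simulate). We fit a Normal distribution by matching the sample mean and standard deviation, then ask both the model and the data how often a value lands more than three standard deviations above the mean.

```python
import random
from statistics import NormalDist, fmean, stdev

rng = random.Random(7)
# Assumed "Truth": a skewed, heavy-tailed lognormal process.
data = [rng.lognormvariate(0.0, 1.0) for _ in range(100_000)]

# The "always Normal" model: match the sample mean and standard deviation.
mu, sigma = fmean(data), stdev(data)
normal_model = NormalDist(mu, sigma)

# Compare how much probability each side puts beyond mean + 3 sigma.
threshold = mu + 3 * sigma
predicted_tail = 1.0 - normal_model.cdf(threshold)
observed_tail = sum(x > threshold for x in data) / len(data)

print(f"Normal model: P(x > {threshold:.2f}) = {predicted_tail:.5f}")
print(f"Observed:     P(x > {threshold:.2f}) = {observed_tail:.5f}")
```

For the bulk of the data the Normal stand-in may be tolerable; it is out in the tail, where the surprises live, that it tends to understate how often extreme values show up.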
This is where machine learning comes in. Real world processes leave a trail of bread crumbs (a series of observations) that give evidence as to their statistical characteristics. They leave footprints in the sand, so to speak, for us to follow. We like following footprints that suggest we might end up discovering a fairy of the woodland realm, or better yet a wily leprechaun with a pot of gold; we would be a little more hesitant/reluctant if we thought the footprints were those of an abominable snowman, or worse, the insufferably ubiquitous Elmo. You get the point though — we can infer something about the culprit, given their footprints. That can be a very useful thing. Sometimes those statistical properties suggest a high degree of complexity behind the curtain, and potentially a high degree of surprise that we may not be ready for. The complexity of what goes on behind the curtain sometimes matters. Sometimes that complexity hides useful behaviors that we can exploit. Sometimes it just hides danger.
In a world where data is cheap, algorithms are cheap, and computation is cheap, we are increasingly able to say yes to each incremental effort to extract additional value. Arguably, good data (as opposed to just voluminous data) is not so cheap. Certainly, today’s data management is not cheap. But, on longer and longer time scales, everything is eventually cheap. The value of a thing is constantly depreciating, as it becomes a commodity, unless it has some redeeming feature which retains its uniqueness over time. That said, in a world where uncovering that complexity means opportunity, risk, and reward, it becomes increasingly in our interest to do so skillfully, efficiently, and quickly, or else risk our own obsolescence in a competitive market that is hungrily capturing every opportunity available.
To tackle the non-Normality of these complex distributions arising from intertwined processes, comprising multiple interacting parts, feedback mechanisms, and external influences, we need methods that are adaptive, that correctly weight new data relative to old data, and which are neither too sensitive, nor too rigid. In other words, we need a tuned process that mimics the real underlying process with some degree of accuracy and/or precision.
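One simple way to picture "weighting new data relative to old data" is an exponentially weighted average, sketched below. This is only an illustration, not a recommendation of a particular method: the process (a level that jumps from 0 to 5 halfway through) and the three values of `alpha` are made up. A tiny `alpha` barely listens to new data (too rigid), an `alpha` of 1 believes only the latest point (too sensitive), and a value in between tracks the change while smoothing the noise.

```python
import random

def tracking_error(series, levels, alpha):
    # Exponentially weighted average: each new point gets weight alpha,
    # and everything seen before shares the remaining (1 - alpha).
    estimate = series[0]
    total = 0.0
    for x, level in zip(series, levels):
        estimate = alpha * x + (1 - alpha) * estimate
        total += (estimate - level) ** 2
    return total / len(series)

rng = random.Random(1)
# Hypothetical non-stationary process: the level jumps from 0 to 5 halfway.
levels = [0.0] * 100 + [5.0] * 100
series = [level + rng.gauss(0, 1) for level in levels]

errors = {a: tracking_error(series, levels, a) for a in (0.01, 0.2, 1.0)}
for alpha, err in errors.items():
    print(f"alpha={alpha}: mean squared tracking error = {err:.2f}")
```

The "tuning" in a tuned process is, in this toy case, nothing more than choosing `alpha`.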
A machine learning algorithm, irrespective of its underlying structure, does precisely this. It learns from real data, to identify features and parameters (sometimes hidden) which drive outcomes in ways that are not easy to grasp. Of course, most of us are familiar with the simplest form of statistical learning: linear regression. Machine learning, in a sense, stretches this basic concept and generalizes it to non-Normal and non-stationary processes. We are really good at grasping linear phenomena. Most of our methodologies are designed to measure, capture, and fine-tune linear things. We are generally terrible at dealing with non-linear things, because we treat them like they are linear things. That is the equivalent of treating a monkey like a squirrel. I would venture that squirrels, despite their ferocious reputations, are generally limited in the amount of damage they can do to you; monkeys, on the other hand, are not so easily contained. We are terrible at dealing with complexity. We are also not as good as we think we are at dealing with change. In statistics, change is observed in the form of non-stationarity. We expect things to stay the same. Change requires work. We are too lazy for change. We recoil at both of these problems: the problem of complexity (arising from non-linearity) and the problem of dealing with change (arising from non-stationarity). We are not good at predicting them. We don’t often understand them or their implications. Quite frankly, we are not honest enough with ourselves to react to them efficiently, in a timely manner. With many non-linearities, reacting in a timely manner is more important than you might think. Consider the consequences of not being prepared: the Macondo disaster, the global financial meltdown, etc. These illustrate the downstream effects of not paying attention to the non-linearities and non-stationarities within the system or process, i.e., assuming things will always be what they always were.
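Here is a toy version of treating the monkey like a squirrel. The process is assumed to be y = x squared, with no noise at all, and `fit_line` is a hand-rolled ordinary least squares helper written for this sketch. A straight line in x explains none of the variation; the very same method, handed x squared as its input, explains all of it.

```python
def fit_line(xs, ys):
    # Ordinary least squares for y = intercept + slope * x, plus R^2.
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_mean) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot

xs = [float(x) for x in range(-3, 4)]
ys = [x * x for x in xs]  # hypothetical non-linear process

# Treat the process as linear in x, then as linear in x squared.
slope, intercept, r2_line = fit_line(xs, ys)
_, _, r2_curve = fit_line([x * x for x in xs], ys)

print(f"straight line in x: slope={slope:.2f}, R^2={r2_line:.2f}")
print(f"same method on x^2: R^2={r2_curve:.2f}")
```

The method did not change between the two fits, only the features it was given. Real non-linearities are rarely this polite about telling us which feature to use, which is where the learning part comes in.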
One should pay attention to the little things. Those little things talk. Sometimes, that talk leads to mischief. We should know how to manage mischief. In less dramatic terms, consider the disruption caused by technology, and how it dramatically impacts the way we behave in a fairly short period of time (e.g., MySpace to Facebook, Uber, etc.). These shifts were not easy to foresee, but they were there, and they were due to fundamental structural changes in the system that occurred as a result of non-linear relationships between the consumer and the thing being consumed. The assumptions we make and their consequences frequently take us quite by surprise. That is, you can do everything right and still be totally exposed to risk, the kind of risk you cannot easily see coming. Yep, we hate that too. Savvy business leaders, in particular, are not very fond of unmitigated risks.
So, what do we do about it? Well, this is where we turn to the dark arts of those mythical unicorns we call Data Scientists. What do they do? Well, ideally, they are trained to deal with those things we really hate, and they can contextualize those things in terms of the business priorities and economic realities. We therefore pay them large sums of money to find ways of making those things we hate more manageable. No, but I mean, what do data scientists actually do? They take time to understand things that are messy. They put structure to things that don’t seem to have any. They attempt to create a more systematic awareness of the world that empowers organizations to react well to it, to hack it, and to come out slightly ahead in the long run. That is, they understand the behavior of systems, try to put bounds on the uncertainties of those systems, and quantify the value/cost of those uncertainties to the system, all while potentially looking for exploits to drive value. Then, they apply mathematical/statistical methods to model those things and to relate how those models reflect reality, all while hopefully containing/exploiting complexity and improving our systemic response to change. Next, they take these concepts from the blackboard to the computer by writing algorithms to simulate/model those messy things, in order to find better ways of dealing with complexity and change. This typically goes under the code name of machine learning: dealing with complex processes that give rise to complex statistical signatures, which in turn cause complex problems for the rest of us and our organizations. The reason we pay them large sums of money to do all this is that (i) it is hard work and those who can do it well are rare, and (ii) the cost of not doing all this is increasingly recognized to be multiples of those large sums of money we pay them to solve these problems. In other words, the ROI is often ridiculously high.
By not doing these things, we are leaving money on the table or we are potentially exposing ourselves to unmitigated risk. Will even the best data scientists be able to eliminate all the risks and capture all the rewards in a system? No! But, not even trying is asking to be replaced.