March 26, 2017
A huge fraction of analytics is about monitoring. People rarely want to frame things in those terms; evidently they think “monitoring” sounds boring or uncool. One cost of that silence is that it’s hard to get good discussions going about how monitoring should be done. But I’m going to try anyway, yet again.
Business intelligence is largely about monitoring, and the same was true of predecessor technologies such as green paper reports or even pre-computer techniques. Two of the top uses of reporting technology can be squarely described as monitoring, namely:
- Watching whether trends are continuing or not.
- Seeing if there are any events — actual or impending as the case may be — that call for response, in areas such as:
- Machine breakages (computer or general metal alike).
- Resource shortfalls (e.g. various senses of “inventory”).
Yes, monitoring-oriented BI needs investigative drilldown, or else it can be rather lame. Yes, purely investigative BI is very important too. But monitoring is still the heart of most BI desktop installations.
Predictive modeling is often about monitoring too. It is common to use statistics or machine learning to help you detect and diagnose problems, and many such applications have a strong monitoring element.
I.e., you’re predicting trouble before it happens, when there’s still time to head it off.
As for incident response, in areas such as security — any incident you respond to has to be noticed first Often, it’s noticed through analytic monitoring.
Hopefully, that’s enough of a reminder to establish the great importance of analytics-based monitoring. So how can the practice be improved? At least three ways come to mind, and only one of those three is getting enough current attention.
The one that’s trendy, of course, is the bringing of analytics into “real-time”. There are many use cases that genuinely need low-latency dashboards, in areas such as remote/phone-home IoT (Internet of Things), monitoring of an enterprise’s own networks, online marketing, financial trading and so on. “One minute” is a common figure for latency, but sometimes a couple of seconds are all that can be tolerated.
I’ve posted a lot about all this, for example in posts titled:
One particular feature that could help with high-speed monitoring is to meet latency constraints via approximate query results. This can be done entirely via your BI tool (e.g. Zoomdata’s “query sharpening”) or more by your DBMS/platform software (the Snappy Data folks pitched me on that approach this week).
Perennially neglected, on the other hand, are opportunities for flexible, personalized analytics. (Note: There’s a lot of discussion in that link.) The best-acknowledged example may be better filters for alerting. False negatives are obviously bad, but false positives are dangerous too. At best, false positives are annoyances; but too often, alert fatigue causes you employees to disregard crucial warning signals altogether. The Gulf of Mexico oil spill disaster has been blamed on that problem. So was a fire in my own house. But acknowledgment != action; improvement in alerting is way too slow. And some other opportunities described in the link above aren’t even well-acknowledged, especially in the area of metrics customization.
Finally, there’s what could be called data anomaly monitoring. The idea is to check data for surprises as soon as it streams in, using your favorite techniques in anomaly management. Perhaps an anomaly will herald a problem in the data pipeline. Perhaps it will highlight genuinely new business information. Either way, you probably want to know about it.
David Gruzman of Nestlogic suggests numerous categories of anomaly to monitor for. (Not coincidentally, he believes that Nestlogic’s technology is a great choice for finding each of them.) Some of his examples — and I’m summarizing here — are:
- Changes in data format, schema, or availability. For example:
- Data can completely stop coming in from a particular source, and the receiving system might not immediately realize that. (My favorite example is the ad tech firm that accidentally stopped doing business in the whole country of Australia.)
- A data format change might make data so unreadable it might as well not arrive.
- A decrease in the number of approval fields might highlight a questionable change in workflow.
- Data quality NULLs or malformed values might increase suddenly, in particular fields and data segments.
- Data value distribution This category covers a lot of cases. A few of them are:
- A particular value is repeated implausibly often. A bug is the likely explanation.
- E-commerce results suddenly decrease, but only from certain client technology configuration. Probably there is a bug affecting only those particular clients.
- Clicks suddenly increase from certain client technologies. A botnet might be at work.
- Sales suddenly increase from a particular city. Again this might be fraud — or more benignly, perhaps some local influencers have praised your offering.
- A particular medical diagnosis becomes much more common in a particular city. Reasons can range from fraud, to a new facility for certain kinds of tests, to a genuine outbreak of disease.
David offered yet more examples of significant anomalies, including ones that could probably only be detected via Nestlogic’s tools. But the ones I cited above can probably be found via any number of techniques — and should be, more promptly and accurately than they currently are.