Anomaly Detection
Posted on Tue 01 December 2015 in TDDA
The Broader Process: Anomaly Detection and Alerting
The fourth major area we will focus on as we develop the ideas of test-driven data analysis is the correctness of the broader process. This relates partly to some of the ideas about consistency checking discussed earlier, but goes further.
A common situation with analysis processes is that they are used repeatedly on some kind of feed of data. When this is the case, in addition to checking the internal consistency of the data, we have the opportunity to compare the current dataset with the datasets seen previously, with a view to detecting and reporting sudden and potentially unexpected changes. Simple examples might include changes in data volumes, shifts in distributions, increases in missing data rates, and new or disappearing categories. More complex examples are multivariate, involving changes in the relationships between variables over time.
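As a rough illustration of the simple cases, the sketch below profiles one field of a dataset and compares it with the profile of the previous dataset. It assumes, purely for illustration, that each dataset is a list of dicts and that `None` marks a missing value; the function names `profile` and `compare_profiles` and the `channel` field are hypothetical.

```python
def profile(rows, field):
    """Summarise one field of a dataset held as a list of dicts."""
    values = [row.get(field) for row in rows]
    n_missing = sum(1 for v in values if v is None)
    return {
        "count": len(rows),
        "missing_rate": n_missing / len(rows) if rows else 0.0,
        "categories": {v for v in values if v is not None},
    }


def compare_profiles(previous, current):
    """Describe how the current profile differs from the previous one."""
    prev_count = previous["count"]
    return {
        # Ratio of current to previous volume (None if there was no previous data).
        "count_ratio": current["count"] / prev_count if prev_count else None,
        "missing_rate_change": current["missing_rate"] - previous["missing_rate"],
        "new_categories": current["categories"] - previous["categories"],
        "lost_categories": previous["categories"] - current["categories"],
    }


# Illustrative data only: yesterday's feed and today's feed.
yesterday = [{"channel": c} for c in ["web", "store", "web", "phone"]]
today = [{"channel": c}
         for c in ["web", None, "app", "web", "store", None, "web", "app"]]

changes = compare_profiles(profile(yesterday, "channel"),
                           profile(today, "channel"))
```

Here the volume has doubled, a quarter of the values have gone missing, `app` has appeared and `phone` has disappeared. Whether any of these differences matters is a judgement for the analyst; the sketch only surfaces them.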
While this can be a complex topic, simply tracking a time series of summary statistics about the data (especially inputs), and setting thresholds for deviations between the current data and what has come before, can catch a good many problems. A more sophisticated and ambitious approach might involve attempting general automatic anomaly detection on the incoming data, again using information about previously seen data as a reference point.
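One minimal way to sketch the threshold idea is to keep the history of a single summary statistic (daily row count, say) and flag the current value when it lies more than a chosen number of standard deviations from the historical mean. The function name `is_anomalous`, the default of three standard deviations and the sample figures are all assumptions made for illustration; other deviation measures would serve equally well.

```python
from statistics import mean, stdev


def is_anomalous(history, current, max_sigmas=3.0):
    """Return True if `current` deviates from `history` by more than
    `max_sigmas` sample standard deviations."""
    if len(history) < 2:
        # Too little history to estimate a spread, so do not flag anything.
        return False
    centre = mean(history)
    spread = stdev(history)
    if spread == 0:
        # A perfectly constant history: any change at all is a deviation.
        return current != centre
    return abs(current - centre) / spread > max_sigmas


# Illustrative daily row counts from earlier runs of the process.
row_counts = [1000, 1020, 980, 1010, 990]

ordinary_day = is_anomalous(row_counts, 1005)
collapsed_feed = is_anomalous(row_counts, 400)
```

With this history, a count of 1005 is well within the usual variation, while a count of 400 is far outside it and would be flagged. The same check can be applied to any tracked statistic, such as a missing-value rate or a column mean.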
Depending on how automated the process is, it might be appropriate for the result of such anomaly detection to be a simple section or note in the output, or the creation of some kind of alert (such as a triggered email).
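The two kinds of result might be sketched as follows, assuming the anomaly checks produce a list of human-readable findings. The function names, the wording of the messages and the email addresses are hypothetical; the alert is built with the standard library's `email.message.EmailMessage` but deliberately not sent.

```python
from email.message import EmailMessage


def format_note(findings):
    """Render findings as a plain-text section for the analysis output."""
    if not findings:
        return "Anomaly checks: no anomalies detected."
    lines = ["Anomaly checks: %d issue(s) detected." % len(findings)]
    lines.extend("  - " + finding for finding in findings)
    return "\n".join(lines)


def build_alert(findings, sender, recipient):
    """Build (but do not send) an email alert; return None if all is well."""
    if not findings:
        return None
    message = EmailMessage()
    message["Subject"] = "Data anomaly alert: %d issue(s)" % len(findings)
    message["From"] = sender
    message["To"] = recipient
    message.set_content(format_note(findings))
    return message


# Illustrative finding and placeholder addresses.
findings = ["row count fell to 40% of the recent average"]
note = format_note(findings)
alert = build_alert(findings, "pipeline@example.com", "analyst@example.com")
```

A process that someone watches might simply append `note` to its output, whereas a fully automated one might pass `alert` to `smtplib.SMTP(...).send_message()` or to whatever alerting mechanism is already in place.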