Generalized Overfitting: Errors of Applicability

Posted on Mon 14 December 2015 in TDDA • Tagged with tdda, errors, applicability

Everyone building predictive models or performing statistical fitting knows about overfitting. This arises when the function represented by the model includes components or aspects that are overly specific to the particularities of the sample data used for training the model, and that are not general features of datasets to which the model might reasonably be applied. The failure mode associated with overfitting is that the performance of the model on the data we used to train it is significantly better than the performance when we apply the model to other data.

Figure: Overfitting

Figure: Overfitting. Points drawn from sin(x) + Gaussian noise. Left: Polynomial fit, degree 3 (cubic; good fit). Right: Polynomial fit, degree 10 (overfit).

Statisticians use the term cross-validation to describe the process of splitting the training data into two (or more) parts, and using one part to fit the model, and the other to assess whether or not it exhibits overfitting. In machine learning, this is more often referred to as a "test-training" approach.

A special form of this approach is longitudinal validation, in which we build the model on data from one time period and then check its performance against data from a later time period, either by partitioning the data available at build time into older and newer data, or by using outcomes collected after the model was built for validation. With longitudinal validation, we seek to verify not only that we did not overfit the characteristics of a particular data sample, but also that the patterns we model are stable over time.

Validating against data for which the outcomes were not known when the model was developed has the additional benefit of eliminating a common class of errors that arises when secondary information about validation outcomes "leaks" during the model building process. Some degree of such leakage—sometimes known as contaminating the validation data—is quite common.

Generalized Overfitting

As its name suggests, overfitting as normally conceived is a failure mode specific to model building, arising when we fit the training data "too well". Here, we are are going to argue that overfitting is an example of a more general failure mode that can be present in any analytical process, especially if we use the process with data other than that used to build it. Our suggested name for this broader class of failures is errors of applicability.

Here are some of the failure modes we are thinking about:

Changes in Distributions of Inputs (and Outputs)

  1. New categories. When we develop the analytical process, we see only categories A, B and C in some (categorical) input or output. In operation, we also see category D. At this point our process may fail completely ("crash"), produce meaningless outputs or merely produce less good results.

  2. Missing categories. The converse can be a problem too: what if a category disappears? Most prosaically, this might lead to a divide-by-zero error if we've explicitly used each category frequency in a denominator. Subtler errors can also creep in.

  3. Extended ranges. For numeric and other ordered data, the equivalent of new categories is values outside the range we saw in the development data. Even if the analysis code runs without incident, the process will be being used in a way that may be quite outside that considered and tested during development, so this can be dangerous.

  4. Distributions. More generally, even if the range of the input data doesn't change, its distribution may, either slowly or abruptly. At the very least, this indicates the process is being used in unfamiliar territory.

  5. Nulls. Did nulls appear in any fields where there were none when we developed the process? Does the process cater for this appropriately? And are such nulls valid?

  6. Higher Dimensional Shifts. Even if the the data ranges and distribution for individual fields don't change, their higher dimensional distributions (correlations) can change significantly. The pair of 2-dimensional distributions below illustrates this point in an extreme way. The distributions of both x and y values on the left and right are identical. But clearly, in 2 dimensions, we see that the space occupied by the two datasets is actually non-overlapping, and on the left x and y are negatively correlated, while on the right they are positively correlated.

    Figure: A shift in distribution (2D)

    Figure: The same x and y values are shared between these two plots (i.e. the disibution of x and y is identical in each case). However, the pairing of x and y coordinates is different. A model or other analytical process built with with negatively correlated data like that on the left might not work well for positively correlated data like that on the right. Even if it does work well, you may want to detect and report a fundamental change like this.

  7. Time (always marching on). Times and dates are notoriously problematical. There are many issues around date and time formats, many specifically around timezones (and the difference between a local times and times in a fixed time zone, such as GMT or UTC).

    For now, let's assume that we have an input that is a well-defined time, correctly read and analysed in a known timezone—say UTC.1 Obviously, new data will tend to have later times—sometimes non-overlapping later times. Most often, we need to change these to intervals measured with respect to a moving date (possibly today, or some variable event date, e.g. days since contact). But in other cases, absolute times, or times in a cycle matter. For example, season, time of month or time of day may matter—the last two, probably in local time rather than UTC.

    In handling time, we have to be careful about binnings, about absolute vs. relative measurement (2015-12-11T11:00:00 vs. 299 hours after the start of the current month), universal vs. local time, and appropriate bin boundaries that move or expand with the analytic time window being considered.

    Time is not unique in the way that its range and maximum naturally increase with each new data sample. Most obviously, other counters (such as customer number) and sum-like aggregates may have this same monotonically increasing character, meaning that it should be expected that new, higher (but perhaps not new lower) values will be present in newer data.

Concrete and Abstract Definitions

There's a general issue with choosing values based on data used during development. This concerns the difference between what we will term concrete and abstract values, and what it means to perform "the same" operation on different datasets.

Suppose we decide to handle outliers differently from the rest of the data in a dataset, at least for some part of the analysis. For example, suppose we're looking at flight prices in Sterling and we see the following distribution.

Figure: Ticket Prices

Figure: Ticket prices, in £100 bins to £1,000, then doubling widths to £256,000, with one final bin for prices above £256,000. (On the graph, the £100-width bins are red; the rest are blue.)

On the basis of this, we see that well over 99% of the data has prices under £4,000, and also that while there are a few thousand ticket prices in the £4,000–£32,000 range (most of which are probably real) the final few thousand probably contain bad data, perhaps as a result of currency conversion errors.

We may well want to choose one or more threshold values from the data—say £4,000 in this case—to specify some aspect of our analytical process. We might, for example, use this threshold in the analysis for filtering, outlier reporting, setting a final bin boundary or setting the range for the axes of a graph.

The crucial question here is: How do we specify and represent our threshold value?

  • Concrete Value: Our concrete value is £4,000. In the current dataset there are 60,995 ticket prices (0.55%) above this value and 10,807,905 (99.45%) below. (There are no prices of exactly £4,000.) Obviously, if we specify our threshold using this concrete value—£4,000—it will be the same for any dataset we use with the process.

  • Abstract Value: Alternatively, we might specify the value indirectly, as a function of the input data. One such abstract specification is the price P below which which 99.45% of ticket prices the dataset lie. If we specify a threshold using this abstract definition, it will vary according to the content of the dataset.

    • In passing, 99.45% is not precise: if we select the bottom 99.45% of this dataset by price we get 10,808,225 records with a maximum price of £4,007.65. The more precise specification is that 99.447046% of the dataset has prices under £4,000.

    • Of course, being human, if we were specifying the value in this way, we would probably round the percentage to 99.5%, and if we did that we would find that we shifted the threshold so that the maximum price below it was £4,186.15, and the minimum price above was £4,186.22.

  • Alternative Abstract Specifications: Of course, if we want to specify this threshold abstractly, there are countless other ways we might do it, some fraught with danger.

    Two things we should definitely avoid when working with data like this are means and variances across the whole column, because they will be rendered largely meaningless by outliers. If we blindly calculate the mean, μ, and standard deviation, σ, in this dataset, we get μ=£2,009.85 and σ=£983,956.28. That's because, as we noted previously, there are a few highly questionable ticket prices in the data, including a maximum of £1,390,276,267.42.2 Within the main body of the data—the ~99.45% with prices below £4,000.00—the corresponding values are μ=£462.09 and σ=£504.82. This emphasizes how dangerous it would be to base a definition on full-field moments3 such as mean or variance.

    In contrast, the median is much less affected by outliers. In the full dataset, for example the median ticket price is £303.77, while the median of those under £4,000.00 is £301.23. So another reasonably stable abstract definition of a threshold around £4,000.00 would be something like 13 times the median.

The reason for labouring this point around abstract vs. concrete definitions is that it arises very commonly and it is not always obvious which is preferable. Concrete definitions have the advantage of (numeric) consistency between analyses, but may result in analyses that are not well suited to a later dataset, because different choices would have been made if that later data had been considered by the developer of the process. Conversely, abstract definitions often make it easier to ensure that analyses are suitable for a broader range of input datasets, but can make comparability more difficult; they also tend to make it harder to get "nice" human-centric scales, bin boundaries and thresholds (because you end up, as we saw above, with values like £4,186.22, rather than £4,000).

Making a poor choice between abstract and concrete specifications of any data-derived values can lead to large sections of the data being omitted (if filtering is used), or made invisible (if used for axis boundaries), or conversely can lead to non-comparability between results or miscomputations if values are associated with bins having different boundaries in different datasets.

NOTE: A common source of the leakage of information from validation data into training data, as discussed above, is the use of the full dataset to make decisions about thresholds such as those discussed here. To get the full benefit of cross-validation, all modelling decisions need to be made solely on the basis of the training data; even feeding back performance information from the validation data begins to contaminate that data.

Data-derived thresholds and other values can occur almost anywhere in an analytical process, but specific dangers include:

  1. Selections (Filters). In designing analytical processes, we may choose to filter values, perhaps to removing outliers or nonsensical values. Over time, the distribution may shift, and these filters may become less appropriate and remove ever-increasing proportions of the data.

    A good example of this we have seen recently involves negative charges. In early versions of ticket price information, almost all charges were positive, and those that were negative were clearly erroneous, so we added a filter to remove all negative charges from the dataset. Later, we started seeing data in which there were many more, and less obviously erroneous negative charges. It turned out that a new data source generated valid negative charges, but we were misled in our initial analysis and the process we built was unsuitable for the new context.

  2. Binnings (Bandings, Buckets). Binning data is very common, and it is important to think carefully about when you want bin boundaries to be concrete (common across datasets) and when they should be abstract (computed, and therefore different for different datasets). Quantile binnings (such as deciles), of course, are intrinsically adaptive, though if those are used you have to be aware that any given bin in one dataset may have different boundaries from the "same" bin in another dataset.

  3. Statistics. As noted above, some care has to be taken when any statistic is used in the dataset to determine whether it should be recorded algorithmically (as an abstract value) in analysis or numerically (as a concrete value), and particular care should be taken with statistics that are sensitive to outliers.

Other Challenges to Applicability

In addition to the common sources of errors of applicability we have outlined above, we will briefly mention a few more.

  1. Non-uniqueness. Is a value that was different for each record in the input data now non-unique?

  2. Crazy outliers. Are there (crazy) outliers in fields where there were none before?

  3. Actually wrong. Are there detectable data errors in the operational data that were not seen during development?

  4. New data formats. Have formats changed, leading to misinterpretation of values?

  5. New outcomes. Even more problematical than new input categories or ranges are new outcome categories or a larger range of output values. When we see this, we should almost always re-evaluate our analytical processes.

Four Kinds of Analytical Errors

In the overview of TDDA we published in Predictive Analytic Times (available here), we made an attempt to summarize how the four main classes of errors arise with the following diagram:

Figure: Four Kinds of Analytical Error

While this was always intended to be a simplification, a particular problem is that it suggests there's no room for errors of interpretation in the operationalization phase, which is far from the case.4 Probably a better representation is as follows:

Figure: Four Kinds of Analytical Error (revisited)


  1. UTC is the curious abbreviation (malacronym?) used for coordinated universal time, which is the standardized version of Greenwich Mean Time now defined by the scientific community. It is the time at 0º longitude, with no "daylight saving" (British Summer Time) adjustment. 

  2. This is probably the result of a currency conversion error. 

  3. Statistical moments are the characterizations of distributions starting with mean and variance, and continuing with skewness and kurtosis. 

  4. It's never too late to misinterpret data or results. 


Overview of TDDA in Predictive Analytics Times

Posted on Fri 11 December 2015 in TDDA • Tagged with tdda

We have an overview piece in Predictive Analytics Times this week.

You can find it here.


Anomaly Detection

Posted on Tue 01 December 2015 in TDDA • Tagged with tdda, components

The Broader Process: Anomaly detection and Alerting.

The fourth major area we will focus on as we develop the ideas of test-driven data analysis is the correctness of the broader process. This relates partly to some of the ideas about consistency checking discussed earlier, but goes further.

A common situation with analysis processes is that they are used repeatedly on some kind of feed of data. When this is the case, in addition to checking the internal consistency of data, we have the opportunity to compare the current dataset with the previous datasets that have been seen with a view to detecting and reporting sudden and potentially unexpected changes. Simple examples might include changes in data volumes, shifts in distributions, increases in missing data rates and new or disappearing categories. More complex examples are multivariate, involving changes in the relationships between variables over time.

While this can be a complex topic, simply tracking a time series of summary stats about various data (especially inputs) and setting thresholds for deviations between the current data and what's come before can catch a good many problems. A more sophisticated and ambitious approach might involve trying to do general automatic anomaly detection on the incoming data, again using information about data previously seen as a reference point.

Depending on how automated the process is, it might be appropriate for the result of such anomaly detection to be a simple section or note in the output, or the creation of some kind of alert (such as a triggered email).


Unit Testing

Posted on Sat 28 November 2015 in TDDA • Tagged with tdda, components

Systematic Unit Tests, System Tests and Reference Tests

The third major idea in test-driven data analysis is the one most directly taken from test-driven development, namely systematically developing both unit tests for small components of the analytical process and carefully constructed, specific tests for the whole system or larger components.

With regression testing, we emphasized the idea of taking whole-system analyses that have already been performed and turning those into overall system tests. In contrast, here we are talking about creating (or selecting) specific patterns in the input data for particular functions that exercise both core functionality (the main code paths) and so-called edge cases (the less-trodden paths). This is definitely significant extra work, and may (particularly if retrofitted) require either restructuring of code or use of some kind of mocking.

When we talk about "edge cases", we are referring to patterns, code-paths or functionality that is used only a minority of the time, or perhaps only on rare occasions. Examples might include handling missing values, extreme values, small data volumes and so forth. Some questions that might help to illuminate common edge cases include:

  1. What happens if we use a standard deviation and there is only one value (so that the standard deviation is not defined)?

  2. What happens if there are no records?

  3. What happens if some or all of the data is null?

  4. What happens if a few extreme (and possibly erroneous) values occur: will this, for example, cause a mean to have an non-useful value or bin boundaries to be set to useless values?

  5. What happens if two fields are perfectly correlated: will this cause instability or errors when performing, for example, a statistical regression?

  6. What happens if there are invalid characters in string data, especially field names or file names?

  7. What happens if input data formats (e.g. string encodings, date formats, separators) are not as expected.

  8. Are the various styles of line-endings (e.g. Unix vs. pc.) handled correctly?

  9. What happens if an external system (such as a database) is running in some unexpected mode, such as with a different locale?


  1. This idea of performing checks throughout the software does not really have an analogue in mainstream TDD, but is certainly good software engineering practice—see, e.g. the timeless Little Bobby Tables XKCD https://xkcd.com/327/—and relates fairly closely to the software engineering practice of defensive programming https://en.wikipedia.org/wiki/Defensive_programming

  2. Obviously, in many situations, it's fine for identifiers or keys to be repeated, but it is also often the case that in a particular table a field value must be unique, typically when the records act as master records, defining the entities that exist in some category. Such tables are often referred to as master tables in database contexts https://encyclopedia2.thefreedictionary.com/master+file


Constraints and Assertions

Posted on Thu 26 November 2015 in TDDA • Tagged with tdda, components

Consistency Checking of Inputs, Outputs and Intermediates

While the idea of regression testing comes straight from test-driven development, the next idea we want to discuss is associated more with general defensive progamming than TDD. The idea is consistency checking, i.e. verifying that what might otherwise be implicit assumptions are in fact met by adding checks at various points in the process.1

Initially, we will assume that we are working with tabular data, but the ideas can be extended to other kinds of data.

Inputs. It is useful to perform some basic checks on inputs. Typical things to consider include:

  1. Are the names and types fields in the input data as we expect? In most cases, we also expect field names to be distinct, and perhaps to conform to some rules.

  2. Is the distribution of values in the fields reasonable? For example, are the minimum and maximum values reasonable?

  3. Are there nulls (missing values) in the data, and if so, are they permitted where they occur? If so, are there any restrictions (e.g. may all the value for a field or record be null?)

  4. Is the volume of data reasonable (exactly as expected, if there is a specific size expectation, or plausible, if the volume is variable)?

  5. Is any required metadata2 included?

In addition to basic sense checks like these, we can also often formulate self-consistency checks on the data. For example:

  1. Are any of the fields identifiers or keys for which every value should occur only once?3

  2. Are there row-level identities that should be true? For example, we might have a set of category counts and an overall total, and expect the category totals to sum to the overall total:

    nCategoryA + nCategoryB + nCategoryC = nOverall
    
  3. For categorical data, are all the values found in the data allowed, and are any required values missing?

  4. If the data has a time structure, are the times and dates self-consistent? For example, do any end dates precede start dates? Are there impossible future dates?

  5. Are there any ordering constraints on the data, and if so are they respected?

Our goal in formulating TDDA is pragmatic: we are not suggesting it is necessary to check for every possible inconsistency in the input data. Rather, we propose that even one or two simple, well-chosen checks can catch a surprising number of problems. As with regression testing, an excellent time to add new checks is when you discover problems. If you add a consistency check every time you discover bad inputs that such a test would have caught, you might quickly build up a powerful, well-targeted set of diagnostics and verification procedures. As we will see below, there is also a definite possibility of tool support in this area.

Intermediate Results and Outputs. Checking intermediates and outputs is very similar to checking inputs, and all same kinds of tests can be applied. Some further questions to consider in these contexts include:

  1. If we look up reference data, do (and should) all the lookups succeed? And are failed lookups handled appropriately?

  2. If we calculate a set of results that should exhibit some identity properties, do those hold? Just as physics has conservation laws for quantities such as energy and momentum, there are similar conservation principles in some analytical calculations. As a simple example, if we categorize spending into different, non-overlapping categories, the sum of the category totals should usually equal the sum of all the transactions, as long as we are careful about things like non-categorized values.

  3. If we build predictive models, do they cross-validate correctly (if we split the data into a training subset and a validation subset)? And, ideally, do they also validate longitudinally (i.e., on later data, if this is available)?

Transfer checks. With data analysis, our inputs are frequently generated by some other system or systems. Often those systems already perform some checking or reporting of the data they produce. If any information about checks or statistics from source systems is available, it is useful to verify that equivalent statistics calculated over the input data produce the same results. If our input data is transactional, maybe the source system reports (or can report) the number or value of transactions over some time period. Perhaps it breaks things down by category. Maybe we know other summary statistics or there are checksums available that can be verified.

The value of checking that the data received is the same as the data the source system was supposed to send is self-evident, and can help us to detect a variety of problems including data loss, data duplication, data corruption, encoding issues, truncation errors and conversion errors, to name but a few.

Tool Support: Automatic constraint suggestion

A perennial obstacle to better testing is the perception that it is a "nice to have", rather than a sine qua non, and that implementing it will require much tedious work. Because of this, any automated support that tools could provide would seem to be especially valuable.

Fortunately, there is low-hanging fruit in many areas, and one of of our goals with this blog is to explore various tool enhancements. We will do this first in our own Miró software and then, as we find things that work, will try to produce some examples, libraries and tools for broader use, probably focused around the Python data analysis stack.

In the spirit of starting simply, we're first going to look at what might be possible by way of automatic input checking.

One characteristic of data analysis is that we often start by trying to get a result from some particular dataset, rather than setting out to implement an analytical process to be used repeatedly with different inputs. In fact, when we start, we may not even have a very specific analytical goal in mind: we may simply have some data (perhaps poorly documented) and perform exploratory analysis with some broad analytical goal in mind. Perhaps we will stop when we have a result that seems useful, and which we have convinced ourselves is plausible. At some later point, we may get a similar dataset (possibly pertaining to a later period, or a different entity) and need to perform a similar analysis. It's at this point we may go back to see whether we can re-use our previous, embryonic analytical process, in whatever form it was recorded.

Let's assume, for simplicity, that the process at least exists as some kind of executable script, but that it's "hard-wired" to the previous data. We then have three main choices.

  • Edit the script (in place) to make it work with the new input data.

  • Take a copy of the script and make that work with the new input data.

  • Modify the script to allow it to work with either the new or the old data, by parameterizing and generalizing it.

Presented like this, the last sounds like the only sensible approach, and in general it is the better way forward. However, we've all taken the other paths from time to time, often because under pressure just changing a few old hard-wired values to new hard-wired values seems as if it will get us to our immediate result faster.4

The problem is that even if were very diligent when preparing the first script, in the context of the original analysis, it is easy for there to be subtle differences in a later dataset that might compromise or invalidate the analysis, and it's hard to force ourselves to be as vigilent the second (third, fourth, ...) time around.

A simple thing that can help is to generate statements about the original dataset and record these as constraints. If a later dataset violates these constraints, it doesn't necessarily mean that anything is wrong, but being alerted to the difference at least offers us an opportunity to consider whether this difference might be significant or problematical, and indeed, whether it might indicate a problem with the data.

Concretely: let's think about what we probably know about a dataset whenever we work with it. We'll use the Periodic Table as an example dataset, based on a snapshot of data I extracted from Wikipedia a few years ago. This is how Miró summarizes the dataset if we ask for a "long" listing of the fields with its ls -l command:

             Field      Type                        Min                         Max    Nulls
                 Z       int                          1                          92       0
              Name    string                   Actinium                   Zirconium       0
            Symbol    string                         Ac                          Zr       0
            Period       int                          1                           7       0
             Group       int                          1                          18      18
    ChemicalSeries    string                   Actinoid            Transition metal       0
      AtomicWeight      real                       1.01                      238.03       0
         Etymology    string                      Ceres                      zircon       0
RelativeAtomicMass      real                       1.01                      238.03       0
     MeltingPointC      real                    -258.98                    3,675.00       1
MeltingPointKelvin      real                      14.20                    3,948.00       1
     BoilingPointC      real                    -268.93                    5,596.00       0
     BoilingPointF      real                    -452.07                   10,105.00       0
           Density      real                       0.00                       22.61       0
       Description    string                        0.2            transition metal      40
            Colour    string    a soft silver-white ...    yellowish green or g ...      59

For each field, we have a name, a type, minimum and maximum values and a count of the number of missing values. [Scroll sideways if your window is too narrow to see the Nulls column on the right.] We also have, implicitly, the field order.

This immediately suggests a set of constraints we might want to construct. We've added an experimental command to Miró for generating constraints based on the field metadata shown earlier and a few other statistics. First, here's the "human-friendly" view that Miró produces if we use its autoconstraints -l command.

Figure 2: Auto-constraints from 92-element Periodic Table

In this table, the green cells represent constraints the system suggests for fields, and the orange cells show areas in which potential constraints were not constructed, though they would have been had the data been different. Darker shades of orange indicate constraints that were closer to be met within the data.

In addition to this human-friendly view, Miró generates out a set of declarations, which can be thought of as candidate assertions. Specifically, they are statements that are true in the current dataset, and therefore constitute potential checks we might want to carry out on any future input datasets we are using for the same analytical process.

Here they are:

declare (>= (min Z) 1)
declare (<= (max Z) 92)
declare (= (countnull Z) 0)
declare (non-nulls-unique Z)

declare (>= (min (length Name)) 3)
declare (<= (min (length Name)) 12)
declare (= (countnull Name) 0)
declare (non-nulls-unique Name)

declare (>= (min (length Symbol)) 1)
declare (<= (min (length Symbol)) 2)
declare (= (countnull Symbol) 0)
declare (non-nulls-unique Symbol)

declare (>= (min Period) 1)
declare (<= (max Period) 7)
declare (= (countnull Period) 0)

declare (>= (min Group) 1)
declare (<= (max Group) 18)

declare (>= (min (length ChemicalSeries)) 7)
declare (<= (min (length ChemicalSeries)) 20)
declare (= (countnull ChemicalSeries) 0)
declare (= (countzero
            (or (isnull ChemicalSeries)
                (in ChemicalSeries (list "Actinoid" "Alkali metal"
                                         "Alkaline earth metal"
                                         "Halogen" "Lanthanoid"
                                         "Metalloid" "Noble gas"
                                         "Nonmetal" "Poor metal"
                                         "Transition metal"))))
           0)

declare (>= (min AtomicWeight) 1.007946)
declare (<= (max AtomicWeight) 238.028914)
declare (= (countnull AtomicWeight) 0)
declare (> (min AtomicWeight) 0)

declare (>= (min (length Etymology)) 4)
declare (<= (min (length Etymology)) 39)
declare (= (countnull Etymology) 0)

declare (>= (min RelativeAtomicMass) 1.007946)
declare (<= (max RelativeAtomicMass) 238.028914)
declare (= (countnull RelativeAtomicMass) 0)
declare (> (min RelativeAtomicMass) 0)

declare (>= (min MeltingPointC) -258.975000)
declare (<= (max MeltingPointC) 3675.0)

declare (>= (min MeltingPointKelvin) 14.200000)
declare (<= (max MeltingPointKelvin) 3948.0)
declare (> (min MeltingPointKelvin) 0)

declare (>= (min BoilingPointC) -268.930000)
declare (<= (max BoilingPointC) 5596.0)
declare (= (countnull BoilingPointC) 0)

declare (>= (min BoilingPointF) -452.070000)
declare (<= (max BoilingPointF) 10105.0)
declare (= (countnull BoilingPointF) 0)

declare (>= (min Density) 0.000089)
declare (<= (max Density) 22.610001)
declare (= (countnull Density) 0)
declare (> (min Density) 0)

declare (>= (min (length Description)) 1)
declare (<= (min (length Description)) 83)

declare (>= (min (length Colour)) 4)
declare (<= (min (length Colour)) 80)

Each green entry in the table maps to a declaration in this list. Let's look at a few:

  1. Min and Max. Z is the atomic number. Each element has an atomic number, which is the number of protons in the nucleus, and each is unique. Hydrogen has the smallest number of protons, 1, and in this dataset, Uranium has the largest number—92. So the first suggested constraints are that these values should be in the observed range. These show up as the first two declarations:

    declare (>= (min Z) 1)
    declare (<= (max Z) 92)
    

    We should say a word about how these constraints are expressed. Miró includes expression language called (lisp-like), (because it's essentially a dialect of Lisp). Lisp is slightly unusual in that instead of writing f(x, y) you write (f x y). So the first expression would be more commonly expressed as

    min(Z) >= 1
    

    in regular ("infix") languages.

    Lisp weirdness aside, are these sensible constraints? Well, the first certainly is. Even if we find some elements beyond Uranium (which we will, below), we certainly don't expect them to have zero or negative numbers of protons, so the first constraint seems like a keeper.

    The second constraint is much less sensible. In fact, given that we know the dataset includes every value of Z from 1 to 92, we confidently expect that any future revisions of the periodic table will include values higher than 92. So we would probably discard that constraint.

    The crucial point is that no one wants to sit down and write out a bunch of constraints by hand (and anyway, "why have a dog an bark yourself?"). People are generally much more willing to review a list of suggested constraints and delete the ones that don't make sense, or modify them so that they do.

  2. Nulls. The next observation about Z is that it contains no nulls. This turns into the (lisp-like) constraint:

    declare (= (countnull Z) 0)
    

    This is also almost certainly a keeper: we'd probably be pretty unhappy if we received a Periodic Table with missing Z values for any elements.

    (Here, (countnull Z) just counts the number of nulls in field Z, and = tests for equality, so the expression reads "the number of nulls in Z is equal to zero".)

  3. Sign. The sign column is more interesting. Here, we have recorded the fact that all the values in Z are positive. Clearly, this is a logically implied by the fact that the minimum value for Z is 1, but we think it's useful to record two separate observations about the field—first, that its minimum value is 1, and secondly that it is always strictly positive. In cases where the minimum is 1, for an integer field, these statements are entirely equivalent, but if the minimum had been (say) 3, they would be different. The value of recording these observations separately arises if at some later stage the minimum changes, while remaining positive. In that case, we might want to discard the specific minimum constraint, but leave in place the constraint on the sign.

    Although we record the sign as a separate constraint in the table, in this case it does not generate a separate declaration, as it would be identical to the constraint on the minimum that we already have.

    In contrast, AtomicWeight, has a minimum value around 1.008, so it does get a separate sign constraint:

    declare (> (min AtomicWeight) 0)
    
  4. Uniqueness of Values. The next thing our autoconstraints framework has noticed about Z is that none of its values is repeated in the data—that all are unique (a.k.a. distinct). The table reports this as yes (the values are unique) and 92/92 (100%), meaning that there are 92 distinct values and 92 non-null values in total, so that 100% of values are unique. Other fields, such as Etymology, have potential constraints that are not quite met: Etymology has 89 different values in the field, so the ratio of distinct values to values is about 97%.

    NOTE: in considering this, we ignore nulls if there are any. You can see this if you look at the Unique entry for the field Group: here there are 18 different (non-null) values for Group, and 74 records have non-null values for Group.

    There is a dedicated function in (lisp-like) for checking whether the non-null values in a field are all distinct, so the expression in the declaration is just:

    (non-nulls-unique Z)
    

    which evaluates to true5 or false.

  5. Min and Max for String Fields. For string fields, the actual minimum and maximum values are usually less interesting. (Indeed, there are lots of reasonable alternative sort orders for strings, given choices such as case sensitivity, whether embedded numbers should be sorted numerically or alphanumerically, how spaces and punctuation should be handled, what to do with accents etc.) In the initial implementation, instead of using any min and max string values as the basis of constraints, we suggest constraints based on string length.

    For the string fields here, none of the constraints is particularly compelling, though a minimum length of 1 might be interesting and you might even think that a maximum length of 2 is sensible the symbol is useful. But in many cases they will be. One common case is fixed-length strings, such as the increasingly ubiquitous UUIDs,6 where the minimum and maximum values would both 36 if they are canonically formatted. (Of course, we can add much stronger constraints if we know all the strings in a field are UUIDs.)

  6. Categorical Values. The last kind of automatically generated constraint we will discuss today is a restriction of the values in a field to be chosen from some fixed set. In this case, Miró has noticed that there are only 10 different non-null values for ChemicalSeries, so has suggested a constraint to capture that reality. The slightly verbose way this currently gets expressed as a constraint is:

    declare (= (countzero
                (or (isnull ChemicalSeries)
                    (in ChemicalSeries
                        (list "Actinoid" "Alkali metal"
                              "Alkaline earth metal"
                              "Halogen" "Lanthanoid" "Metalloid" "Noble gas"
                              "Nonmetal" "Poor metal" "Transition metal"))))
               0)
    

    (The or statement starting on the second line is true for field values that are either in the list or null. The countzero function, when applied to booleans, counts false values, so this is saying that none of the results of the or statement should be false, i.e. all values should be null or in the list. This would be more elegantly expressed with an (all ...) statement; we will probably change it to that formulation soon, though the current version is more useful for reporting failures.)

    The current implementation generates these constraints only when the number of distinct values it sees is 20 or fewer, only for string fields, and only when not all the values in the field are distinct, but all of these aspect can probably be improved, and the user can override the number of categories to allow.

In addition to these constraints, we should also probably generate constraints on the field types and, as we will discuss in future articles, dataset-level constraints.

Tool Support: Using the Declarations

Obviously, if we run test the constraints against the same dataset we used to generate them, all the constraints should be (and are!) satisfied. Things are slightly more interesting if we run them against a different dataset. In this case, we excluded transuranic elements from the dataset we used to generate the constraints. But we can add them in. If we do so, and then execute a script (e92.miros) containing the autogenerated constraints, we get the following output:

$ miro
This is Miro, version 2.1.90.
Copyright © Stochastic Solutions 2008-2015.
Seed: 1463187505
Logs started at 2015/11/25 17:08:23 host tdda.local.
Logging to /Users/njr/miro/log/2015/11/25/session259.

[1]> load elements
elements.miro: 118 records; 118 (100%) selected; 16 fields.
[2]> . e92
[3]> # Autoconstraints for dataset elements92.miro.
[4]> # Generated from session /Users/njr/miro/log/2015/11/25/session256.miros
[5]> declare (>= (min Z) 1)
[6]> declare (<= (max Z) 92)

Miro Warning: Declaration failed: (<= (max Z) 92)

[7]> declare (= (countnull Z) 0)
[8]> declare (non-nulls-unique Z)
[9]> declare (>= (min (length Name)) 3)
[10]> declare (<= (min (length Name)) 12)
[11]> declare (= (countnull Name) 0)
[12]> declare (non-nulls-unique Name)
[13]> declare (>= (min (length Symbol)) 1)
[14]> declare (<= (min (length Symbol)) 2)
[15]> declare (= (countnull Symbol) 0)
[16]> declare (non-nulls-unique Symbol)
[17]> declare (>= (min Period) 1)
[18]> declare (<= (max Period) 7)
[19]> declare (= (countnull Period) 0)
[20]> declare (>= (min Group) 1)
[21]> declare (<= (max Group) 18)
[22]> declare (>= (min (length ChemicalSeries)) 7)
[23]> declare (<= (min (length ChemicalSeries)) 20)
[24]> declare (= (countnull ChemicalSeries) 0)
[25]> declare (= (countzero
                  (or (isnull ChemicalSeries)
                      (in ChemicalSeries (list "Actinoid" "Alkali metal"
                                               "Alkaline earth metal"
                                               "Halogen" "Lanthanoid"
                                               "Metalloid" "Noble gas"
                                               "Nonmetal" "Poor metal"
                                               "Transition metal"))))
                 0)
[26]> declare (>= (min AtomicWeight) 1.007946)
[27]> declare (<= (max AtomicWeight) 238.028914)

Miro Warning: Declaration failed: (<= (max AtomicWeight) 238.028914)

[28]> declare (= (countnull AtomicWeight) 0)

Miro Warning: Declaration failed: (= (countnull AtomicWeight) 0)

[29]> declare (> (min AtomicWeight) 0)
[30]> declare (>= (min (length Etymology)) 4)
[31]> declare (<= (min (length Etymology)) 39)
[32]> declare (= (countnull Etymology) 0)

Miro Warning: Declaration failed: (= (countnull Etymology) 0)

[33]> declare (>= (min RelativeAtomicMass) 1.007946)
[34]> declare (<= (max RelativeAtomicMass) 238.028914)

Miro Warning: Declaration failed: (<= (max RelativeAtomicMass) 238.028914)

[35]> declare (= (countnull RelativeAtomicMass) 0)

Miro Warning: Declaration failed: (= (countnull RelativeAtomicMass) 0)

[36]> declare (> (min RelativeAtomicMass) 0)
[37]> declare (>= (min MeltingPointC) -258.975000)
[38]> declare (<= (max MeltingPointC) 3675.0)
[39]> declare (>= (min MeltingPointKelvin) 14.200000)
[40]> declare (<= (max MeltingPointKelvin) 3948.0)
[41]> declare (> (min MeltingPointKelvin) 0)
[42]> declare (>= (min BoilingPointC) -268.930000)
[43]> declare (<= (max BoilingPointC) 5596.0)
[44]> declare (= (countnull BoilingPointC) 0)

Miro Warning: Declaration failed: (= (countnull BoilingPointC) 0)

[45]> declare (>= (min BoilingPointF) -452.070000)
[46]> declare (<= (max BoilingPointF) 10105.0)
[47]> declare (= (countnull BoilingPointF) 0)

Miro Warning: Declaration failed: (= (countnull BoilingPointF) 0)

[48]> declare (>= (min Density) 0.000089)
[49]> declare (<= (max Density) 22.610001)

Miro Warning: Declaration failed: (<= (max Density) 22.610001)

[50]> declare (= (countnull Density) 0)

Miro Warning: Declaration failed: (= (countnull Density) 0)

[51]> declare (> (min Density) 0)
[52]> declare (>= (min (length Description)) 1)
[53]> declare (<= (min (length Description)) 83)
[54]> declare (>= (min (length Colour)) 4)
[55]> declare (<= (min (length Colour)) 80)

10 warnings and 0 errors generated.

Job completed after a total of 10.2801 seconds.
Logs closed at 2015/11/25 17:08:23 host tdda.local.
Logs written to /Users/njr/miro/log/2015/11/25/session259.

By default, Miró generates warnings when declared constraints are violated. In this case, ten of the declared constraints were not met, so there were ten warnings. We can also set the declarations to generate errors rather than warnings, allowing us to stop execution of a script if the data fails to meet our declared expectations.

In this case, the failed declarations are mostly unsurprising and untroubling. The maximum values for Z, AtomicWeight, RelativeAtomicMass, and Density all increase in this version of the data, which is expected given that all the new elements are heavier than those in the initial analysis set. Equally, while the fields AtomicWeight, RelativeAtomicMass, Etymology, BoilingPointC, BoilingPointF and Density were all populated in the original dataset, each now contains nulls. Again, this is unsurprising in this case, but in other contexts, detecting these sorts of changes in a feed of data might be important. Specifically, we should always be interested in unexpected differences between the datasets used to develop an analytical process, and ones for which that process is used at a later time: it is very possible that they will not be handled correctly if they were not seen or considered when the process was developed.

There are many further improvements we could make to the current state of the autoconstraint generation, and there are other kinds of constraints it can generate that we will discuss in later posts. But as simple as it is, this level of checking has already identified a number of problems in the work we have been carrying out with Skyscanner and other clients.

We will return to this topic, including discussing how we might add tool support for revising constraint sets in the light of failures, merging different sets of constraints and adding constraints that are true only of subsets of the data.

Parting thoughts

Outputs and Intermediates. While developing the ideas about automatically generating constraints, our focus was mostly on input datasets. But in fact, most of the ideas are almost as applicable to intermediate results and outputs (which, after all, often form the inputs to the next stage of an analysis pipeline). We haven't performed any analysis in this post, but if we had, there might be similar value in generating constraints for the outputs as well.

Living Constraints and Type Systems. In this article, we've also focused on checking constraints at particular points in the process—after loading data, or after generating results. But it's not too much of a stretch to think of constraints as statements that should always be true of data, even as we append records, redefine fields etc. We might call these living or perpetual constraints. If we do this, individual field constraints become more like types. This idea, together with dimensional analysis, will be discussed in future posts.


  1. See e.g. the timeless Little Bobby Tables XKCD https://xkcd.com/327/ and the Wikipedia entry on Defensive Programming

  2. Metadata is data about data. In the context of tabular data, the simplest kinds of metadata are the field names and types. Any statistics we can compute are another form of metadata, e.g. minimum and maximum values, averages, null counts, values present etc. There is literally no limit to what metadata can be associated with an underlying dataset. 

  3. Obviously, in many situations, it's fine for identifiers or keys to be repeated, but it is also often the case that in a particular table a field value must be unique, typically when the records act as master records, defining the entities that exist in some category. Such tables are often referred to as master tables in database contexts https://encyclopedia2.thefreedictionary.com/master+file

  4. We're not saying this conviction is wrong: it is typically quicker just to whack in the new values each time. Our contention is that this is a more error-prone, less systematic approach. 

  5. (lisp-like) actually follows an amalgam of Lisp conventions, using t to represent True, like Common Lisp, and f for False, which is more like Scheme or Clojure. But it doesn't really matter here. 

  6. A so-called "universally unique identifier" (UUID) is a 128-bit number, usually formatted as a string of 32 hex digits separated into blocks of 8, 4, 4, 4, and 12 digits by hyphens—for example 12345678-1234-1234-1234-123456789abc. They are also known as globally unique identifiers (GUIDs) and are usually generated randomly, sometimes basing some bits on device and time to reduce the probability of collisions. Although fundamentally numeric in nature, it is fairly common for them to be stored and manipulated as strings. Wikipedia entry