Constraints and Assertions

Posted on Thu 26 November 2015 in TDDA • Tagged with tdda, components

Consistency Checking of Inputs, Outputs and Intermediates

While the idea of regression testing comes straight from test-driven development, the next idea we want to discuss is associated more with general defensive progamming than TDD. The idea is consistency checking, i.e. verifying that what might otherwise be implicit assumptions are in fact met by adding checks at various points in the process.1

Initially, we will assume that we are working with tabular data, but the ideas can be extended to other kinds of data.

Inputs. It is useful to perform some basic checks on inputs. Typical things to consider include:

  1. Are the names and types fields in the input data as we expect? In most cases, we also expect field names to be distinct, and perhaps to conform to some rules.

  2. Is the distribution of values in the fields reasonable? For example, are the minimum and maximum values reasonable?

  3. Are there nulls (missing values) in the data, and if so, are they permitted where they occur? If so, are there any restrictions (e.g. may all the value for a field or record be null?)

  4. Is the volume of data reasonable (exactly as expected, if there is a specific size expectation, or plausible, if the volume is variable)?

  5. Is any required metadata2 included?

In addition to basic sense checks like these, we can also often formulate self-consistency checks on the data. For example:

  1. Are any of the fields identifiers or keys for which every value should occur only once?3

  2. Are there row-level identities that should be true? For example, we might have a set of category counts and an overall total, and expect the category totals to sum to the overall total:

    nCategoryA + nCategoryB + nCategoryC = nOverall
    
  3. For categorical data, are all the values found in the data allowed, and are any required values missing?

  4. If the data has a time structure, are the times and dates self-consistent? For example, do any end dates precede start dates? Are there impossible future dates?

  5. Are there any ordering constraints on the data, and if so are they respected?

Our goal in formulating TDDA is pragmatic: we are not suggesting it is necessary to check for every possible inconsistency in the input data. Rather, we propose that even one or two simple, well-chosen checks can catch a surprising number of problems. As with regression testing, an excellent time to add new checks is when you discover problems. If you add a consistency check every time you discover bad inputs that such a test would have caught, you might quickly build up a powerful, well-targeted set of diagnostics and verification procedures. As we will see below, there is also a definite possibility of tool support in this area.

Intermediate Results and Outputs. Checking intermediates and outputs is very similar to checking inputs, and all same kinds of tests can be applied. Some further questions to consider in these contexts include:

  1. If we look up reference data, do (and should) all the lookups succeed? And are failed lookups handled appropriately?

  2. If we calculate a set of results that should exhibit some identity properties, do those hold? Just as physics has conservation laws for quantities such as energy and momentum, there are similar conservation principles in some analytical calculations. As a simple example, if we categorize spending into different, non-overlapping categories, the sum of the category totals should usually equal the sum of all the transactions, as long as we are careful about things like non-categorized values.

  3. If we build predictive models, do they cross-validate correctly (if we split the data into a training subset and a validation subset)? And, ideally, do they also validate longitudinally (i.e., on later data, if this is available)?

Transfer checks. With data analysis, our inputs are frequently generated by some other system or systems. Often those systems already perform some checking or reporting of the data they produce. If any information about checks or statistics from source systems is available, it is useful to verify that equivalent statistics calculated over the input data produce the same results. If our input data is transactional, maybe the source system reports (or can report) the number or value of transactions over some time period. Perhaps it breaks things down by category. Maybe we know other summary statistics or there are checksums available that can be verified.

The value of checking that the data received is the same as the data the source system was supposed to send is self-evident, and can help us to detect a variety of problems including data loss, data duplication, data corruption, encoding issues, truncation errors and conversion errors, to name but a few.

Tool Support: Automatic constraint suggestion

A perennial obstacle to better testing is the perception that it is a "nice to have", rather than a sine qua non, and that implementing it will require much tedious work. Because of this, any automated support that tools could provide would seem to be especially valuable.

Fortunately, there is low-hanging fruit in many areas, and one of of our goals with this blog is to explore various tool enhancements. We will do this first in our own Miró software and then, as we find things that work, will try to produce some examples, libraries and tools for broader use, probably focused around the Python data analysis stack.

In the spirit of starting simply, we're first going to look at what might be possible by way of automatic input checking.

One characteristic of data analysis is that we often start by trying to get a result from some particular dataset, rather than setting out to implement an analytical process to be used repeatedly with different inputs. In fact, when we start, we may not even have a very specific analytical goal in mind: we may simply have some data (perhaps poorly documented) and perform exploratory analysis with some broad analytical goal in mind. Perhaps we will stop when we have a result that seems useful, and which we have convinced ourselves is plausible. At some later point, we may get a similar dataset (possibly pertaining to a later period, or a different entity) and need to perform a similar analysis. It's at this point we may go back to see whether we can re-use our previous, embryonic analytical process, in whatever form it was recorded.

Let's assume, for simplicity, that the process at least exists as some kind of executable script, but that it's "hard-wired" to the previous data. We then have three main choices.

  • Edit the script (in place) to make it work with the new input data.

  • Take a copy of the script and make that work with the new input data.

  • Modify the script to allow it to work with either the new or the old data, by parameterizing and generalizing it.

Presented like this, the last sounds like the only sensible approach, and in general it is the better way forward. However, we've all taken the other paths from time to time, often because under pressure just changing a few old hard-wired values to new hard-wired values seems as if it will get us to our immediate result faster.4

The problem is that even if were very diligent when preparing the first script, in the context of the original analysis, it is easy for there to be subtle differences in a later dataset that might compromise or invalidate the analysis, and it's hard to force ourselves to be as vigilent the second (third, fourth, ...) time around.

A simple thing that can help is to generate statements about the original dataset and record these as constraints. If a later dataset violates these constraints, it doesn't necessarily mean that anything is wrong, but being alerted to the difference at least offers us an opportunity to consider whether this difference might be significant or problematical, and indeed, whether it might indicate a problem with the data.

Concretely: let's think about what we probably know about a dataset whenever we work with it. We'll use the Periodic Table as an example dataset, based on a snapshot of data I extracted from Wikipedia a few years ago. This is how Miró summarizes the dataset if we ask for a "long" listing of the fields with its ls -l command:

             Field      Type                        Min                         Max    Nulls
                 Z       int                          1                          92       0
              Name    string                   Actinium                   Zirconium       0
            Symbol    string                         Ac                          Zr       0
            Period       int                          1                           7       0
             Group       int                          1                          18      18
    ChemicalSeries    string                   Actinoid            Transition metal       0
      AtomicWeight      real                       1.01                      238.03       0
         Etymology    string                      Ceres                      zircon       0
RelativeAtomicMass      real                       1.01                      238.03       0
     MeltingPointC      real                    -258.98                    3,675.00       1
MeltingPointKelvin      real                      14.20                    3,948.00       1
     BoilingPointC      real                    -268.93                    5,596.00       0
     BoilingPointF      real                    -452.07                   10,105.00       0
           Density      real                       0.00                       22.61       0
       Description    string                        0.2            transition metal      40
            Colour    string    a soft silver-white ...    yellowish green or g ...      59

For each field, we have a name, a type, minimum and maximum values and a count of the number of missing values. [Scroll sideways if your window is too narrow to see the Nulls column on the right.] We also have, implicitly, the field order.

This immediately suggests a set of constraints we might want to construct. We've added an experimental command to Miró for generating constraints based on the field metadata shown earlier and a few other statistics. First, here's the "human-friendly" view that Miró produces if we use its autoconstraints -l command.

Figure 2: Auto-constraints from 92-element Periodic Table

In this table, the green cells represent constraints the system suggests for fields, and the orange cells show areas in which potential constraints were not constructed, though they would have been had the data been different. Darker shades of orange indicate constraints that were closer to be met within the data.

In addition to this human-friendly view, Miró generates out a set of declarations, which can be thought of as candidate assertions. Specifically, they are statements that are true in the current dataset, and therefore constitute potential checks we might want to carry out on any future input datasets we are using for the same analytical process.

Here they are:

declare (>= (min Z) 1)
declare (<= (max Z) 92)
declare (= (countnull Z) 0)
declare (non-nulls-unique Z)

declare (>= (min (length Name)) 3)
declare (<= (min (length Name)) 12)
declare (= (countnull Name) 0)
declare (non-nulls-unique Name)

declare (>= (min (length Symbol)) 1)
declare (<= (min (length Symbol)) 2)
declare (= (countnull Symbol) 0)
declare (non-nulls-unique Symbol)

declare (>= (min Period) 1)
declare (<= (max Period) 7)
declare (= (countnull Period) 0)

declare (>= (min Group) 1)
declare (<= (max Group) 18)

declare (>= (min (length ChemicalSeries)) 7)
declare (<= (min (length ChemicalSeries)) 20)
declare (= (countnull ChemicalSeries) 0)
declare (= (countzero
            (or (isnull ChemicalSeries)
                (in ChemicalSeries (list "Actinoid" "Alkali metal"
                                         "Alkaline earth metal"
                                         "Halogen" "Lanthanoid"
                                         "Metalloid" "Noble gas"
                                         "Nonmetal" "Poor metal"
                                         "Transition metal"))))
           0)

declare (>= (min AtomicWeight) 1.007946)
declare (<= (max AtomicWeight) 238.028914)
declare (= (countnull AtomicWeight) 0)
declare (> (min AtomicWeight) 0)

declare (>= (min (length Etymology)) 4)
declare (<= (min (length Etymology)) 39)
declare (= (countnull Etymology) 0)

declare (>= (min RelativeAtomicMass) 1.007946)
declare (<= (max RelativeAtomicMass) 238.028914)
declare (= (countnull RelativeAtomicMass) 0)
declare (> (min RelativeAtomicMass) 0)

declare (>= (min MeltingPointC) -258.975000)
declare (<= (max MeltingPointC) 3675.0)

declare (>= (min MeltingPointKelvin) 14.200000)
declare (<= (max MeltingPointKelvin) 3948.0)
declare (> (min MeltingPointKelvin) 0)

declare (>= (min BoilingPointC) -268.930000)
declare (<= (max BoilingPointC) 5596.0)
declare (= (countnull BoilingPointC) 0)

declare (>= (min BoilingPointF) -452.070000)
declare (<= (max BoilingPointF) 10105.0)
declare (= (countnull BoilingPointF) 0)

declare (>= (min Density) 0.000089)
declare (<= (max Density) 22.610001)
declare (= (countnull Density) 0)
declare (> (min Density) 0)

declare (>= (min (length Description)) 1)
declare (<= (min (length Description)) 83)

declare (>= (min (length Colour)) 4)
declare (<= (min (length Colour)) 80)

Each green entry in the table maps to a declaration in this list. Let's look at a few:

  1. Min and Max. Z is the atomic number. Each element has an atomic number, which is the number of protons in the nucleus, and each is unique. Hydrogen has the smallest number of protons, 1, and in this dataset, Uranium has the largest number—92. So the first suggested constraints are that these values should be in the observed range. These show up as the first two declarations:

    declare (>= (min Z) 1)
    declare (<= (max Z) 92)
    

    We should say a word about how these constraints are expressed. Miró includes expression language called (lisp-like), (because it's essentially a dialect of Lisp). Lisp is slightly unusual in that instead of writing f(x, y) you write (f x y). So the first expression would be more commonly expressed as

    min(Z) >= 1
    

    in regular ("infix") languages.

    Lisp weirdness aside, are these sensible constraints? Well, the first certainly is. Even if we find some elements beyond Uranium (which we will, below), we certainly don't expect them to have zero or negative numbers of protons, so the first constraint seems like a keeper.

    The second constraint is much less sensible. In fact, given that we know the dataset includes every value of Z from 1 to 92, we confidently expect that any future revisions of the periodic table will include values higher than 92. So we would probably discard that constraint.

    The crucial point is that no one wants to sit down and write out a bunch of constraints by hand (and anyway, "why have a dog an bark yourself?"). People are generally much more willing to review a list of suggested constraints and delete the ones that don't make sense, or modify them so that they do.

  2. Nulls. The next observation about Z is that it contains no nulls. This turns into the (lisp-like) constraint:

    declare (= (countnull Z) 0)
    

    This is also almost certainly a keeper: we'd probably be pretty unhappy if we received a Periodic Table with missing Z values for any elements.

    (Here, (countnull Z) just counts the number of nulls in field Z, and = tests for equality, so the expression reads "the number of nulls in Z is equal to zero".)

  3. Sign. The sign column is more interesting. Here, we have recorded the fact that all the values in Z are positive. Clearly, this is a logically implied by the fact that the minimum value for Z is 1, but we think it's useful to record two separate observations about the field—first, that its minimum value is 1, and secondly that it is always strictly positive. In cases where the minimum is 1, for an integer field, these statements are entirely equivalent, but if the minimum had been (say) 3, they would be different. The value of recording these observations separately arises if at some later stage the minimum changes, while remaining positive. In that case, we might want to discard the specific minimum constraint, but leave in place the constraint on the sign.

    Although we record the sign as a separate constraint in the table, in this case it does not generate a separate declaration, as it would be identical to the constraint on the minimum that we already have.

    In contrast, AtomicWeight, has a minimum value around 1.008, so it does get a separate sign constraint:

    declare (> (min AtomicWeight) 0)
    
  4. Uniqueness of Values. The next thing our autoconstraints framework has noticed about Z is that none of its values is repeated in the data—that all are unique (a.k.a. distinct). The table reports this as yes (the values are unique) and 92/92 (100%), meaning that there are 92 distinct values and 92 non-null values in total, so that 100% of values are unique. Other fields, such as Etymology, have potential constraints that are not quite met: Etymology has 89 different values in the field, so the ratio of distinct values to values is about 97%.

    NOTE: in considering this, we ignore nulls if there are any. You can see this if you look at the Unique entry for the field Group: here there are 18 different (non-null) values for Group, and 74 records have non-null values for Group.

    There is a dedicated function in (lisp-like) for checking whether the non-null values in a field are all distinct, so the expression in the declaration is just:

    (non-nulls-unique Z)
    

    which evaluates to true5 or false.

  5. Min and Max for String Fields. For string fields, the actual minimum and maximum values are usually less interesting. (Indeed, there are lots of reasonable alternative sort orders for strings, given choices such as case sensitivity, whether embedded numbers should be sorted numerically or alphanumerically, how spaces and punctuation should be handled, what to do with accents etc.) In the initial implementation, instead of using any min and max string values as the basis of constraints, we suggest constraints based on string length.

    For the string fields here, none of the constraints is particularly compelling, though a minimum length of 1 might be interesting and you might even think that a maximum length of 2 is sensible the symbol is useful. But in many cases they will be. One common case is fixed-length strings, such as the increasingly ubiquitous UUIDs,6 where the minimum and maximum values would both 36 if they are canonically formatted. (Of course, we can add much stronger constraints if we know all the strings in a field are UUIDs.)

  6. Categorical Values. The last kind of automatically generated constraint we will discuss today is a restriction of the values in a field to be chosen from some fixed set. In this case, Miró has noticed that there are only 10 different non-null values for ChemicalSeries, so has suggested a constraint to capture that reality. The slightly verbose way this currently gets expressed as a constraint is:

    declare (= (countzero
                (or (isnull ChemicalSeries)
                    (in ChemicalSeries
                        (list "Actinoid" "Alkali metal"
                              "Alkaline earth metal"
                              "Halogen" "Lanthanoid" "Metalloid" "Noble gas"
                              "Nonmetal" "Poor metal" "Transition metal"))))
               0)
    

    (The or statement starting on the second line is true for field values that are either in the list or null. The countzero function, when applied to booleans, counts false values, so this is saying that none of the results of the or statement should be false, i.e. all values should be null or in the list. This would be more elegantly expressed with an (all ...) statement; we will probably change it to that formulation soon, though the current version is more useful for reporting failures.)

    The current implementation generates these constraints only when the number of distinct values it sees is 20 or fewer, only for string fields, and only when not all the values in the field are distinct, but all of these aspect can probably be improved, and the user can override the number of categories to allow.

In addition to these constraints, we should also probably generate constraints on the field types and, as we will discuss in future articles, dataset-level constraints.

Tool Support: Using the Declarations

Obviously, if we run test the constraints against the same dataset we used to generate them, all the constraints should be (and are!) satisfied. Things are slightly more interesting if we run them against a different dataset. In this case, we excluded transuranic elements from the dataset we used to generate the constraints. But we can add them in. If we do so, and then execute a script (e92.miros) containing the autogenerated constraints, we get the following output:

$ miro
This is Miro, version 2.1.90.
Copyright © Stochastic Solutions 2008-2015.
Seed: 1463187505
Logs started at 2015/11/25 17:08:23 host tdda.local.
Logging to /Users/njr/miro/log/2015/11/25/session259.

[1]> load elements
elements.miro: 118 records; 118 (100%) selected; 16 fields.
[2]> . e92
[3]> # Autoconstraints for dataset elements92.miro.
[4]> # Generated from session /Users/njr/miro/log/2015/11/25/session256.miros
[5]> declare (>= (min Z) 1)
[6]> declare (<= (max Z) 92)

Miro Warning: Declaration failed: (<= (max Z) 92)

[7]> declare (= (countnull Z) 0)
[8]> declare (non-nulls-unique Z)
[9]> declare (>= (min (length Name)) 3)
[10]> declare (<= (min (length Name)) 12)
[11]> declare (= (countnull Name) 0)
[12]> declare (non-nulls-unique Name)
[13]> declare (>= (min (length Symbol)) 1)
[14]> declare (<= (min (length Symbol)) 2)
[15]> declare (= (countnull Symbol) 0)
[16]> declare (non-nulls-unique Symbol)
[17]> declare (>= (min Period) 1)
[18]> declare (<= (max Period) 7)
[19]> declare (= (countnull Period) 0)
[20]> declare (>= (min Group) 1)
[21]> declare (<= (max Group) 18)
[22]> declare (>= (min (length ChemicalSeries)) 7)
[23]> declare (<= (min (length ChemicalSeries)) 20)
[24]> declare (= (countnull ChemicalSeries) 0)
[25]> declare (= (countzero
                  (or (isnull ChemicalSeries)
                      (in ChemicalSeries (list "Actinoid" "Alkali metal"
                                               "Alkaline earth metal"
                                               "Halogen" "Lanthanoid"
                                               "Metalloid" "Noble gas"
                                               "Nonmetal" "Poor metal"
                                               "Transition metal"))))
                 0)
[26]> declare (>= (min AtomicWeight) 1.007946)
[27]> declare (<= (max AtomicWeight) 238.028914)

Miro Warning: Declaration failed: (<= (max AtomicWeight) 238.028914)

[28]> declare (= (countnull AtomicWeight) 0)

Miro Warning: Declaration failed: (= (countnull AtomicWeight) 0)

[29]> declare (> (min AtomicWeight) 0)
[30]> declare (>= (min (length Etymology)) 4)
[31]> declare (<= (min (length Etymology)) 39)
[32]> declare (= (countnull Etymology) 0)

Miro Warning: Declaration failed: (= (countnull Etymology) 0)

[33]> declare (>= (min RelativeAtomicMass) 1.007946)
[34]> declare (<= (max RelativeAtomicMass) 238.028914)

Miro Warning: Declaration failed: (<= (max RelativeAtomicMass) 238.028914)

[35]> declare (= (countnull RelativeAtomicMass) 0)

Miro Warning: Declaration failed: (= (countnull RelativeAtomicMass) 0)

[36]> declare (> (min RelativeAtomicMass) 0)
[37]> declare (>= (min MeltingPointC) -258.975000)
[38]> declare (<= (max MeltingPointC) 3675.0)
[39]> declare (>= (min MeltingPointKelvin) 14.200000)
[40]> declare (<= (max MeltingPointKelvin) 3948.0)
[41]> declare (> (min MeltingPointKelvin) 0)
[42]> declare (>= (min BoilingPointC) -268.930000)
[43]> declare (<= (max BoilingPointC) 5596.0)
[44]> declare (= (countnull BoilingPointC) 0)

Miro Warning: Declaration failed: (= (countnull BoilingPointC) 0)

[45]> declare (>= (min BoilingPointF) -452.070000)
[46]> declare (<= (max BoilingPointF) 10105.0)
[47]> declare (= (countnull BoilingPointF) 0)

Miro Warning: Declaration failed: (= (countnull BoilingPointF) 0)

[48]> declare (>= (min Density) 0.000089)
[49]> declare (<= (max Density) 22.610001)

Miro Warning: Declaration failed: (<= (max Density) 22.610001)

[50]> declare (= (countnull Density) 0)

Miro Warning: Declaration failed: (= (countnull Density) 0)

[51]> declare (> (min Density) 0)
[52]> declare (>= (min (length Description)) 1)
[53]> declare (<= (min (length Description)) 83)
[54]> declare (>= (min (length Colour)) 4)
[55]> declare (<= (min (length Colour)) 80)

10 warnings and 0 errors generated.

Job completed after a total of 10.2801 seconds.
Logs closed at 2015/11/25 17:08:23 host tdda.local.
Logs written to /Users/njr/miro/log/2015/11/25/session259.

By default, Miró generates warnings when declared constraints are violated. In this case, ten of the declared constraints were not met, so there were ten warnings. We can also set the declarations to generate errors rather than warnings, allowing us to stop execution of a script if the data fails to meet our declared expectations.

In this case, the failed declarations are mostly unsurprising and untroubling. The maximum values for Z, AtomicWeight, RelativeAtomicMass, and Density all increase in this version of the data, which is expected given that all the new elements are heavier than those in the initial analysis set. Equally, while the fields AtomicWeight, RelativeAtomicMass, Etymology, BoilingPointC, BoilingPointF and Density were all populated in the original dataset, each now contains nulls. Again, this is unsurprising in this case, but in other contexts, detecting these sorts of changes in a feed of data might be important. Specifically, we should always be interested in unexpected differences between the datasets used to develop an analytical process, and ones for which that process is used at a later time: it is very possible that they will not be handled correctly if they were not seen or considered when the process was developed.

There are many further improvements we could make to the current state of the autoconstraint generation, and there are other kinds of constraints it can generate that we will discuss in later posts. But as simple as it is, this level of checking has already identified a number of problems in the work we have been carrying out with Skyscanner and other clients.

We will return to this topic, including discussing how we might add tool support for revising constraint sets in the light of failures, merging different sets of constraints and adding constraints that are true only of subsets of the data.

Parting thoughts

Outputs and Intermediates. While developing the ideas about automatically generating constraints, our focus was mostly on input datasets. But in fact, most of the ideas are almost as applicable to intermediate results and outputs (which, after all, often form the inputs to the next stage of an analysis pipeline). We haven't performed any analysis in this post, but if we had, there might be similar value in generating constraints for the outputs as well.

Living Constraints and Type Systems. In this article, we've also focused on checking constraints at particular points in the process—after loading data, or after generating results. But it's not too much of a stretch to think of constraints as statements that should always be true of data, even as we append records, redefine fields etc. We might call these living or perpetual constraints. If we do this, individual field constraints become more like types. This idea, together with dimensional analysis, will be discussed in future posts.


  1. See e.g. the timeless Little Bobby Tables XKCD https://xkcd.com/327/ and the Wikipedia entry on Defensive Programming

  2. Metadata is data about data. In the context of tabular data, the simplest kinds of metadata are the field names and types. Any statistics we can compute are another form of metadata, e.g. minimum and maximum values, averages, null counts, values present etc. There is literally no limit to what metadata can be associated with an underlying dataset. 

  3. Obviously, in many situations, it's fine for identifiers or keys to be repeated, but it is also often the case that in a particular table a field value must be unique, typically when the records act as master records, defining the entities that exist in some category. Such tables are often referred to as master tables in database contexts https://encyclopedia2.thefreedictionary.com/master+file

  4. We're not saying this conviction is wrong: it is typically quicker just to whack in the new values each time. Our contention is that this is a more error-prone, less systematic approach. 

  5. (lisp-like) actually follows an amalgam of Lisp conventions, using t to represent True, like Common Lisp, and f for False, which is more like Scheme or Clojure. But it doesn't really matter here. 

  6. A so-called "universally unique identifier" (UUID) is a 128-bit number, usually formatted as a string of 32 hex digits separated into blocks of 8, 4, 4, 4, and 12 digits by hyphens—for example 12345678-1234-1234-1234-123456789abc. They are also known as globally unique identifiers (GUIDs) and are usually generated randomly, sometimes basing some bits on device and time to reduce the probability of collisions. Although fundamentally numeric in nature, it is fairly common for them to be stored and manipulated as strings. Wikipedia entry


Site News: Glossary; Table of Contents; Feeds

Posted on Mon 23 November 2015 in blog • Tagged with site news, glossary

The site now has a glossary, and also a table of contents, both linked from the side panel (which is at the top on mobile). The plan, obviously, is to keep these up-to-date as we discuss more topics. The table of contents is similar to the archives link at the top, but is chronological, rather than reverse-chronological, and has a short description of each article.

While writing the glossary, we decided that, in addition to the two classes of errors we discussed in Why Test-Driven Data Analysis—errors of implementation and errors of interpretation—we should probably break out a third category, namely errors of process. The first of the "interpretation" questions we listed was "Is the input data correct?". Presenting incorrect data to an analytical process certainly seems more like an error of process than an error of interpretation (though as we will discuss in one of the next posts, arguably the process should detect at least some kinds of input errors). We will certainly discuss other examples of process errors in future posts. We'll probably update the Why... post with at least with a footnote describing new interpretation.

We were also informed that some of the links to the RSS and Atom feeds were broken, even though the feeds themselves were OK. Apologies for this. As far as we can tell, they're all OK now. Please let us know if you try them and find they're not OK, or indeed if you find any other problems or errors.


Infinite Gain: The First Test

Posted on Mon 16 November 2015 in TDDA • Tagged with regression tests, reference tests

The first idea we want to appropriate from test-driven development is that of regression testing, and our specific analytical variant of this, the idea of a reference test.

We propose a "zeroth level" of test-driven data analysis as recording one or more specific sets of inputs to an analytical process, together with the corresponding outputs generated, and ensuring that the process can be re-run using those recorded inputs. The first test can then simply be checking that the results remain the same if the analysis is re-run.

In the language of test-driven development, this is a regression test, because it tests that no regressions have occurred, i.e. the results are the same now as previously. It is also a system test, in the sense that it checks the functioning of the whole system (the analytical process), rather than one or more specific subunits, as is the case with unit tests.

In our work with Skyscanner, Stochastic Solutions maintains a number of tests of this type for each of our major analytical processes. They help to ensure that as we make changes to the analysis scripts, and any of the software they depend on, we don't break anything without noticing. We also run them whenever we install new versions on Skyscanner servers, to check that we get identical results on their platforms as on our own development systems. We call these whole-system regression tests reference tests, and run them as part of the special commit process we use each time we update the version number of the software. In fact, our process only allows the version number to be updated if the relevant tests—including the relevant reference tests—pass.

Some practical considerations

  1. Stochastic (Randomized) Analyses

    We assume that our analytical process is deterministic. If it involves a random component, we can make it deterministic by fixing the seed (or seeds) used by the random number generators. Any seeds should be treated as input parameters; if the process seeds itself (e.g. from the clock), it is important it writes out the seeds to allow the analysis to be re-run.

  2. Correctness

    We also assume that the analyst has performed some level of checking of the results to convince herself that they are correct. In the worst case, this may consist of nothing more than verifying that the program runs to completion and produces output of the expected form that is not glaringly obviously incorrect.

    Needless to say, it is vastly preferable if more diligent checking than this has been carried out, but even if the level of initial checking of results is superficial, regression tests deliver value by allowing us to verify the impact of changes to the system. Specifically, they allow us to detect situations in which a result is unexpectedly altered by some modification of the process—direct or indirect—that was thought to be innocuous (see below).

  3. Size / Time

    Real analysis input datasets can be large, as can outputs, and complex analyses can take a long time. If the data is "too large" or the run-time excessive, it is quite acceptable (and in various ways advantageous) to cut it down. This should obviously be done with a view to maintaining the richness and variability of the inputs. Indeed, the data can also be changed to include more "corner cases", or, for example, to anonymize it, if it is sensitive.

    The main reason we are not specifically advocating cutting down the data is that we want to make the overhead of implementing a reference test as low as possible.

  4. Feeds

    If the analytical process directly connects to some dynamic data feed, it will be desirable (and possibly necessary) to replace that feed with a static input source, usually consisting of a snapshot of the input data. Obviously, in some circumstances, this might be onerous, though in our experience it is usually not very hard.

  5. Time-dependent analysis.

    Another factor that can cause analysis of fixed input data, with a fixed analytical process, to produce different results is explicit or implicit time-dependence in the analysis. For example, the analysis might convert an input that is a date stamp to something like "number of whole days before today", or the start of the current month. Obviously, such transformations produce different results when run on different days. As with seeds, if there are such transformations in the analysis code, they need to handled. To cope with this sort of situation, we typically look up any reference values such as today early in the analytical process, and allow optional override parameters to be provided. Thus, in ordinary use we might run an analysis script by saying:

      python analysis_AAA.py
    

    but in testing replace this by something like

      AAA_TODAY="2015/11/01" python analysis_AAA.py
    

    to set the environment variable AAA_TODAY to an override value, or with a command such as

     python analysis_AAA.py -d 2015/11/01
    

    to pass in the date as a command-line option to our script.

  6. Numerical Precision.

    Computers are basically deterministic, and, regardless of what numerical accuracy they achieve, if they are asked to perform the same operations, on the same inputs, in the same order, they will normally produce identical results every time. Thus even if our outputs are floating-point values, there is no intrinsic problem with testing them for exact equality. The only thing we really need to be careful about is that we don't perform an equality test between a rounded output value and an floating-point value held internally without rounding (or, more accurately, held as an IEEE floating point value, rather than a decimal value of given precision). In practice, when comparing floating-point values, we either need to compare formatted string output, rounded in some fixed manner, or compare to values to some fixed level of precision. In most cases, the level of precision will not matter very much, though in particular domains we may want to exercise more care in choosing this.

    To make this distinction clear, look at the following Python code:

      $ python
      Python 2.7.10 (default, Jul 14 2015, 19:46:27)
      [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.39)] on darwin
      Type "help", "copyright", "credits" or "license" for more information.
      >>> from __future__ import division
      >>> a = 1/3
      >>> b = 1/3
      >>> print a
      0.333333333333
      >>> a == 0.333333333333
      False
      >>> a == b
      True
      >>> round(a, 12) == round(0.333333333333, 12)
      True
      >>> str(a) == '0.333333333333'
      True
      >>> '%.12f' % a == '0.333333333333'
      True
    

    In this code fragment,

    • The first line tells Python to return floating-point values from integer division (always a good idea).

    • The next two lines just assign a and b each to be a third.

    • The following line confirms the result of this is, as we'd expect 0.3333... But, crucially, this value is not exact. If we print it to 60 decimal places, we see:

      >>> print "%.60f" % a
      0.333333333333333314829616256247390992939472198486328125000000
      
    • Unsurprisingly, therefore, when in the next statement we ask Python whether a is equal to 0.333333333333, the result is False.

    • After this, as expected, we confirm that a == b is True.

    • We then confirm that if we round a to 12 decimal places, the result is exactly round(0.333333333333, 12). Do we need the round on the right-hand side? Probably not, but be aware that 0.333333333333 is not a value that can be stored exactly in binary, so:

      >>> print '%.60f' % 0.333333333333
      0.333333333333000025877623784253955818712711334228515625000000
      

      It's probably, therefore, both clearer to round both sides, or to use string comparisons.

    • Finally, we perform two string comparisons. The first relies on Python's default string formatting rules, and the second is more explicit.

    NOTE: When it comes to actually writing tests, Python's unittest module includes an assertAlmostEqual method, that takes a number of decimal places, so if a function f(x) is expected to return the result 1/3 when x = 1, the usual way to test this to 12dp is with the following code fragment:

      def testOneThird(self):
          self.assertAlmostEqual(f(1), 0.333333333333, 12)
    
  7. Parallel Processing.

    Another factor that can cause differences in results is parallel execution, which can often result in subtle changes of detailed sequence of operations carried out. A simple example would be a task farm in which each of a number of workers calculates a result. If those results are then summed by the controller process in the order they are returned, rather than in a predefined sequence, numerical rounding errors may result in different answers. Thus, more care has to be taken in these sorts of cases.

  8. Variable output.

    A final implementation detail is that we sometimes have to be careful about simply comparing output logs, graph files etc. It is very common for output to include things that may vary from run-to-run, such as timestamps, version information or sequence numbers (run 1, run 2...) In these cases, the comparison process needs to make suitable affordances. We will discuss some methods for handling this in a future article.

Reasons a Regression Test Might Fail

Changes to the system not intended to change the result, but sometimes doing so, can take many forms. For example:

  • We might extend our analysis code to accommodate some variation in the input data handled.

  • We might add an extra parameter or code path to allow some variation in the analysis performed.

  • We might upgrade some software, e.g. the operating system, libraries, the analysis software or the environment in which the software runs.

  • We might upgrade the hardware (e.g. adding memory, processing capacity or GPUs), potentially causing different code paths to be followed.

  • We might run the analysis on a different machine.

  • We might change the way in which the input data is stored, retrieved or presented to the software.

  • Hardware and software can develop faults, and data corruption can and does occur.

The Law of Software Regressions

Experience shows that regression tests are a very powerful tool for identifying unexpected changes, and that such changes occur more often than anyone expects. In fact writing this reminds me of the self-referential law1 proposed by Doug Hofstadter:

Hofstadter's Law:

It always takes longer than you expect, even when you take into account Hofstadter's Law.

Gödel, Esher Bach: An Eternal Golden Braid, Douglas R. Hofstadter.

In a similar vein, we might coin a Law of Software Regressions:

The Law of Software Regressions:

Software regressions happen more often than expected, even when you take into account the Law of Software Regressions.


  1. Douglas R. Hofstadter, Gödel, Esher Bach: An Eternal Golden Braid, p. 152. Penguin Books (Harmondsworth) 1980. 


How is this Misleading Data Misleading Me?

Posted on Fri 13 November 2015 in TDDA • Tagged with tdda, implementation, interpretation, correctness

"Why is this lying bastard lying to me?"

Louis Heren,1 often attributed to Jeremy Paxman.

In a previous post, we made a distinction between two kinds of errors—implementation errors and errors of interpretation. I want to amplify that today, focusing specifically on interpretation.

The most important question to keep in mind at all times is not whether the analysis is computing the thing we wanted it to compute, but rather whether the result we have produced means what we think it means. The distinction is crucial.

As a simple example, let's suppose we specify the goal of our analysis as calculating the mean of a set of numbers. We can test that by adding them up and dividing by the number of items. But if we think the goal is to characterize a typical transaction size, we have to ask whether the arithmetic mean is the right metric for understanding that. As we move more towards a business or conceptual goal, rather than a mathematical or algorithmic formulation of a calculation, we have more complex and nuanced considerations, such as:

  • Do we believe the inputs are correct?

  • Is our chosen metric capable of addressing our underlying need (in this case, determining a typical transaction size)?

  • How do we handle nulls (missing values)?

  • Will outliers (perhaps extremely large values) or invalid inputs (perhaps negative values) invalidate the calculation?

  • If the values have dimensionality,2 do all of the values have the same dimensionality, and in the same units (e.g. all money and all in pounds sterling, or all distances and all measured in miles).

  • For that matter, are the inputs even commensurate, i.e. do they quantify sufficiently similar things that calculating their mean is even meaningful?

Paxman/Heren's constant question quoted above—Why is this lying bastard lying to me?—will serve as an excellent question to keep in mind every time we view an analytical result, perhaps recast as how is this misleading data misleading me? There is a great temptation to believe beautifully formatted, painstakingly calculated results produced by the almost unfathomable power of modern computers. In fact, there is much to be said for thinking of the combination of data and processing as an adversary constantly trying to fool you into drawing false conclusions.

The questions of implementation are concerned with checking that the data received as input to the analytical process has been faithfully transmitted from the source systems, and that the calculations and manipulations performed in the analysis correctly implement the algorithms we intended to use. In contrast, as we outlined previously, the questions of interpretation emphasize that we need to be ever vigilent, asking ourselves:

  • Is the input data correct?

  • Is our interpretation of the input data correct?

  • Are the algorithms we are applying to the data meaningful and appropriate?

  • Is our interpretation of the results we produce correct?

  • Are the results plausible?

  • What am I missing?

  • How is this misleading data misleading me?


  1. This quote is usually attributed to Jeremy Paxman, as noted in The Guardian article Paxman answers the questions https://www.theguardian.com/media/2005/jan/31/mondaymediasection.politicsandthemedia of 31st January 2005. According to the article, however, the true origin is a former deputy editor of the Times, Louis Heren, in his memoirs, with the full quote being "When a politician tells you something in confidence, always ask yourself: 'Why is this lying bastard lying to me?'" Still other reports, however, say Heren himself, was merely quoting advice he was given. Melvin J. Lasky writes in Profanity, Obscenity and the Media, Transaction Publishers (New Brunswick) 2005:

    "Find out why the lying bastards are lying!" This is the famous phrase of an editor of the Times, Louis Heren, who received it as "advice given him early in his career by ... a correspondent of the Daily Worker [the Communist daily in London]: 'Always ask yourself why these lying bastards are lying to you.'"

  2. Here, we use dimensionality in the sense of Dimensional Analysis, which allows us to make inferences about the results of calculations based on classifying the inputs by category. For example, we would distinguish lengths, from times from quantities of money and so forth. We would also treat separately dimensionless quantities, such as counts or ratios of quantitities of the same dimension (e.g. a ratio of two lengths lengths). 


Test-Driven Development: A Review

Posted on Mon 09 November 2015 in TDDA • Tagged with tdd

Since a key motivation for developing test-driven data analysis (TDDA) has been test-driven development (TDD), we need to conduct a lightning tour of TDD before outlining how we see TDDA developing. If you are already familiar with test-driven development, this may not contain too much that is new for you, though we will present it with half an eye to the repurposing of it that we plan as we move towards test-driven data analysis.

Test-driven development (TDD) has gained notable popularity as an approach to software engineering, both in its own right and as a key component of the Agile development methodology. Its benefits, as articulated by its adherents, include higher software quality, greater development speed, improved flexibility during development (i.e., more ability to adjust course during development), earlier detection of bugs and regressions1 and an increased ability to restructure ("refactor") code.

The Core Idea of Test-Driven Development

Automation + specification + verification + refactoring

The central idea in test-driven development is that of using a comprehensive suite of automated tests to specify the desired behaviour of a program and to verify that it is working correctly. The goal is to have enough, sufficiently detailed tests to ensure that when they all pass we feel genuine confidence that the system is functioning correctly.

The canonical test-driven approach to software development consists of the following stages:

  • First, a suite of tests is written specifying the correct behaviour of a software system. As a trivial example, if we are implementing a function, f, to compute the sum of two inputs, a and b, we might specify a set of correct input-output pairs. In TDD, we structure our tests as a series of assertions, each of which is a statement that must be satisfied in order for the test to pass. In this case, some possible assertions, expressed in pseudo-code, would be:

    assert f( 0,  0)  =  0
    assert f( 1,  7)  =  8
    assert f(-2, 17)  = 15
    assert f(-3, +3)  =  0
    

    Importantly, the tests should also, in general, check and specify the generation of errors and the handling of so-called edge cases. Edge cases are atypical but valid cases, which might include extreme input values, handling of null values and handling of empty datasets. For example:

    assert f("a", 7) --> TypeError
    assert f(MAX_FLOAT, MAX_FLOAT) = Infinity
    

NOTE This is not a comprehensive set of tests for f. We'll talk more about what might be considered adequate for this function in later posts. The purpose of this example is simply to show the general structure of typical tests.

  • An important aspect of testing frameworks is that they allow tests to take the form of executable code that can be run even before the functionality under test has been written. At this stage, since we have not even defined f, we expect the tests not to pass, but to produce errors such as "No such function: f". Once a minimal definition for f has been provided, such as one that always returns 0, or that returns no result, the errors should turn into failures, i.e. assertions that are not true.

  • When we have a suite of failing tests, software is written with the goal of making all the tests pass.

  • Once all the tests pass, TDD methodology dictates that coding should stop because if the test suite is adequate (and free of errors) we have now demonstrated that the software is complete and correct. Part of the TDD philosophy is that if more functionality is required, one or more further tests should be written to specify and demonstrate the need for more (or different) code.

  • There is one more important stage in test-driven development, namely refactoring. This is the process of restructuring, simplifying or otherwise improving code while maintaining its functionality (i.e., keeping the tests passing). It is widely accepted that complexity is one of the biggest problems in software, and simplifying code as soon as the tests pass allows us to attempt to reduce complexity as early as possible. It is a recognition of the fact that the first successful implementation of some feature will typically not be the most direct and straightforward.

The philosophy of writing tests before the code they are designed to validate leads some to suggest that the second "D" in TDD (development) should really stand for design (e.g. Allen Houlob3). This idea grows out of the observation that with TDD, testing is moved from its traditional place towards the end of the development cycle to a much earlier and more prominent position where specification and design would traditionally occur.

TDD advocates tend to argue for making tests very quick to run (preferably mere seconds for the entire suite) so that there is no impediment to running them frequently during development, not just between each code commit,4 but multiple times during the development of each function.

Another important idea is that of regression testing. As noted previously a regression is a defect that is introduced by a modification to the software. A natural consequence of maintaining and using a comprehensive suite of tests is that when such a regressions occur, they should be detected almost immediately. When a bug does slip through without triggering a test failure, the TDD philosophy dictates that before it is fixed, one or more failing tests should be added to demonstrate the incorrect behaviour. By definition, when the bug is fixed, these new tests will pass unless they themselves contain errors.

Common Variations, Flavours and Implementations

A distinction is often made between unit tests and system tests (also known as integration tests). Unit tests are supposed to test low-level software units (such individual functions, methods or classes). There is often a particular focus on these low-level unit tests, partly because these can often be made to run very quickly, and partly (I think) because there is an implicit belief or assumption that if each individual component is well tested, the whole system built out of those components is likely to be reliable. (Personally, I think this is a poor assumption.)

In contrast, system tests and integration tests exercise many parts of the system, often completing larger, more realistic tasks, and more often interfacing with external systems. Such tests are often slower and it can be hard to avoid their having side effects (such as updating entries in databases).

The distinction, however, between the different levels is somewhat subjective, and some organizations give more equal or greater weight to higher level tests. This will be an interesting issue as we consider how to move towards test-driven data analysis.

Another practice popular within some TDD schools is that of mocking. The general idea of mocking is to replace some functionality (such as a database lookup, a URL fetch, a disk write, a trigger event or a function call) with a simpler function call or a static value. This is done for two main reasons. First, if the mocked functionality is expensive, or has side effects, test code can often be made much faster and side-effect free if its execution is bypassed. Secondly, mocking allows a test to focus on the correctness of a particular aspect of functionality, without any dependence on the external part of the system being mocked out.

Other TDD practitioners are less keen on mocking, feeling that it leads to less complete and less realistic testing, and raises the risk of missing some kinds of defects. (Those who favour mocking also tend to place a strong emphasis on unit testing, and to argue that more expensive, non-mocked tests should form part of integration testing, rather than part of the more frequently run core unit test suite.)

While no special software is strictly required in order to follow a broadly test-driven approach to development, good tools are extremely helpful. There are standard libraries that support of this for most mainstream programming languages. The xUnit family of test software (e.g. CUnit for C, jUnit for Java, unittest for Python), uses a common architecture designed by Kent Beck.2 It is worth noting that the rUnit package is such a system for use with the popular data analysis package R.

Example

As an example, the following Python code tests a function f, as described above, using Python's unittest module. Even if you are completely unfamilar with Python, you will be able to see the six crucial lines that implement exactly the six tests described in pseudo-code above, in this case through four separate test methods.

import sys
import unittest


def f(a, b):
    return a + b


class TestAddFunction(unittest.TestCase):
    def testNonNegatives(self):
        self.assertEqual(f(0, 0), 0)
        self.assertEqual(f(1, 7), 8)

    def testNegatives(self):
        self.assertEqual(f(-2, 17), 15)
        self.assertEqual(f(-3, +3), 0)

    def testStringInput(self):
        self.assertRaises(TypeError, f, "a", 7)

    def testOverflow(self):
        self.assertEqual(f(sys.float_info.max, sys.float_info.max),
                         float('inf'))

if __name__ == '__main__':
    unittest.main()

If this code is run, including the function definition for f, the output is as follows:

$ python add_function.py
....
----------------------------------------------------------------------
Ran 4 tests in 0.000s

OK

Here, each dot signifies a passing test.

However, if this is run without defining f, the result is the following output:

$ python add_function.py
EEEE
======================================================================
ERROR: testNegatives (__main__.TestAddFunction)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "add_function.py", line 13, in testNegatives
    self.assertEqual(f(-2, 17), 15)
NameError: global name 'f' is not defined

======================================================================
ERROR: testNonNegatives (__main__.TestAddFunction)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "add_function.py", line 9, in testNonNegatives
    self.assertEqual(f(0, 0), 0)
NameError: global name 'f' is not defined

======================================================================
ERROR: testOverflow (__main__.TestAddFunction)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "add_function.py", line 20, in testOverflow
    self.assertEqual(f(sys.float_info.max, sys.float_info.max),
NameError: global name 'f' is not defined

======================================================================
ERROR: testStringInput (__main__.TestAddFunction)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "add_function.py", line 17, in testStringInput
    self.assertRaises(TypeError, f, "a", 7)
NameError: global name 'f' is not defined

----------------------------------------------------------------------
Ran 4 tests in 0.000s

FAILED (errors=4)

Here the four E's at the top of the output represent errors when running the tests. If a dummy definition of f is provided, such as:

def f(a, b):
    return 0

the tests will fail, producing F, rather than raising the errors that result in E's.

Benefits of Test-Driven Development

Correctness. The most obvious reason to adopt test-driven development is the pursuit of higher software quality. TDD proponents certainly feel that there is considerable benefit to maintaining a broad and rich set of tests that can be run automatically. There is rather more debate about how important it is to write the tests strictly before the code it is designed to test. I would say that to qualify as test-driven development, the tests should be produced no later than immediately after each piece of functionality is implemented, but purists would take a stricter view.

Regression detection. The second benefit of TDD is in the detection of regressions, i.e. failures of code in areas that previously ran successfully. In practice, regression testing is even more powerful than it sounds because not only can many different failure modes be detected by a single test, but experience shows that there are often areas of code that are susceptible to similar breakages from many different causes and disturbances. (This can be seen as a rare case of combinatorial explosion working to our advantage: there are many ways to get code wrong, and far fewer to get it right, so a single test can catch many different potential failures.)

Specification, Design and Documentation. One of the stronger reasons for writing tests before the functions they are designed to verify is that the test code then forms a concrete specification. In order even to write the test, a certain degree of clarity has to be brought to the question of precisely what the function that is being written is supposed to do. This is the key insight that leads towards the idea of TDD as test-driven design over test-driven development. A useful side effect of the test suite is that it also forms a precise and practical form of documentation as to exactly how the code can be used successfully, and one that, by definition, has to be kept up to date—a perenial problem for documentation.

Refactoring. The benefits listed so far are relatively unsurprising. The fourth is more profound. In many software projects, particularly large and complex ones, once the software is deemed to be working acceptably well, some areas of the code come to be regarded as too dangerous to modify, even when problems are discovered. Developers (and managers) who know how much pain and effort was required to make something work (or more-or-less work) become fearful that the risks associated with fixing or upgrading code are simply too high. In this way, code becomes brittle and neglected and thus essentially unmaintainable.

In my view, the single biggest benefit of test-driven development is that it goes a long way to eliminating this syndrome, allowing us to re-write, simplify and extend code safely, confident in the knowledge that if the tests continue to function, it is unlikely that anything very bad has happened to the code. The recommended practice of refactoring code as soon as the tests pass is one aspect of this, but the larger benefit of maintaining comprehensive set of tests is that such refactoring can be performed at any time.

These are just the most important and widely recognized benefits of TDD. Additional benefits include the ability to check that code is working correctly on new machines or systems, or in any other new context, providing a useful baseline of performance (if timed and recorded) and providing an extremely powerful resource if code needs to be ported or reimplemented.


  1. A software regression is a bug in a later version of software that was not present in a previous version of the software. It contrasts with bugs that may always have been present but were not detected. 

  2. Kent Beck, Test-Driven Development, Addison Wesley (Vaseem) 2003. 

  3. Allen Houlob, Test-Driven Design, Dr. Dobbs Journal, May 5th 2014. https://www.drdobbs.com/architecture-and-design/test-driven-design/240168102

  4. Most non-trivial software development uses a so-called revision control system to provide a comprehensive history of versions of the code. Developers normally run code frequently, and typically commit changes to the revision-controlled repository somewhat less frequently (though still, perhaps, many times a day). With TDD, the tests form an integral part of the code base, and it is common good practice to require that code is only committed when the tests pass. Sometimes this requirement is merely a rule or convention, while in other cases systems are set up in such a way as to enable code to be committed only when all of its associated tests pass.