Our Approach to Data Provenance

Posted on Tue 12 December 2017 in TDDA • Tagged with data lineage, data provenance, data governance, tdda, constraints, miro

NEW DATA GOVERNANCE RULES: — We need to track data provenance. — No problem! We do that already! — We do? — We do! — (thinks) Results2017_final_FINAL3-revised.xlsx

Our previous post introduced the idea of data provenance (a.k.a. data lineage), which has been discussed on a couple of podcasts recently. This is an issue that is close to our hearts at Stochastic Solutions. Here, we'll talk about how we handle this issue, both methodologically and in our Miró software.

We'll focus on seven key ideas from our approach to data provenance:

  1. Automatic Logging
  2. Audit trail stored in datasets
  3. Recording of field definitions and contexts
  4. Constraint generation and verification
  5. Data Dictionary Generation & Import
  6. Data signatures (hashing)
  7. Comparing datasets (diff commands)

Automatic Logging

Automatic logging provides a concrete record of all analysis performed in the software.

Our analysis software, Miró, is normally used through a scripting interface that automatically writes several detailed logs. Of these, the most important are:

  • A log of all commands executed (in editable, re-executable form)

  • A full interleaved log of commands issued and the resulting output, in several forms (including plain text and rich HTML).

Logs build up in a year/month/day/session hierarchy indefinitely, providing a comprehensive (and searchable) record of the analysis that has been performed.

Even when the software is used through the API, the sequence of operations is recorded in the log, though in that case the ability to re-execute the operations is normally lost.
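
To give a flavour of what that hierarchy looks like, here is a minimal sketch (in Python, and emphatically not Miró's actual code) of creating the next year/month/day/sessionNNN log directory; the root path and numbering scheme are assumptions for illustration:

import datetime
import os

def new_session_logdir(root='~/miro/log'):
    """Create the next year/month/day/sessionNNN log directory, mirroring the
    hierarchy described above (an illustrative sketch, not Miró's code)."""
    day_dir = os.path.expanduser(
        os.path.join(root, datetime.date.today().strftime('%Y/%m/%d')))
    os.makedirs(day_dir, exist_ok=True)
    n = len([d for d in os.listdir(day_dir) if d.startswith('session')])
    session_dir = os.path.join(day_dir, 'session%03d' % n)
    os.makedirs(session_dir, exist_ok=True)
    return session_dir

print(new_session_logdir())   # e.g. ~/miro/log/2017/12/11/session013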

Audit trail

The audit trail in a dataset tracks changes to the data and metadata across sessions, users, and machines, making it possible to see the sequence of operations that led to the current state of the data.

Like a table in a database, or a DataFrame in Pandas or R, Miró's most important data structure is a dataset—a tabular structure with named fields as columns and different observations (records) stored in rows. These form a column-based data store, and datasets can be saved to disk with a .miro extension—a folder that contains the typed data together with rich metadata.

Every time a change is made to a dataset, the operation that caused the change is recorded in the Audit Trail section of the dataset. This is true both for changes to data and to metadata:

  • Examples of changes to data in a dataset include creating new fields, deleting fields, filtering out records, appending new records and (more rarely) changing the original values in the data.1

  • Miró maintains many types of metadata, including field descriptions, field and dataset tags, binnings on fields, long and short names, custom colours and labels, a current record selection (filter), and various formatting information.

In this post, we'll illustrate most of these concepts using the following small dataset of transactions for four different customers, identified by id:

id  date                 categ    amount  days-since-prev
 1  2009-01-31 00:00:00  A      1,000.00
 2  2009-02-02 00:00:00  A      2,000.00
 2  2009-02-02 22:22:22  B      2,222.22             0.93
 3  2009-03-03 00:00:00  A      1,000.00
 3  2009-03-03 13:33:33  B      3,000.00             0.56
 3  2009-03-03 23:33:33  B      3,333.33             0.42
 4  2009-04-04 00:00:00  A      1,000.00
 4  2009-04-04 04:44:44  B      1,111.11             0.20
 4  2009-04-04 14:44:44  B                           0.42
 4  2009-04-04 20:44:44  B      4,444.44             0.25

Here is the audit trail recorded in that dataset:

Date Description Command Host Session Line
2017/12/07 14:05:30 Load from flat file /Users/njr/python/artists/miro/testdata/trans.csv load testdata/trans.csv godel.local /Users/njr/miro/log/2017/12/02/session142 1
2017/12/07 14:05:30 Save as /miro/datasets/trans.miro save trans godel.local /Users/njr/miro/log/2017/12/02/session142 3
2017/12/11 12:48:39 Set dataset description to "Some demo transactions". description -d "Some demo transactions" bartok.local /Users/njr/miro/log/2017/12/11/session013 2
2017/12/11 12:48:39 Set description for field amount to "transaction value (GBP)". description amount "transaction value (GBP)" bartok.local /Users/njr/miro/log/2017/12/11/session013 3
2017/12/11 12:48:39 Defined field days-since-prev as (/ (- date (prev-by date id)) 24 60 60) def days-since-prev (/ (- date (prev-by date id)) 24 60 60) bartok.local /Users/njr/miro/log/2017/12/11/session013 4
2017/12/11 12:48:39 Tag field days-since-prev with L="Time Since Previous Transaction (days)". tag L="Time Since Previous Transaction (days)" days-since-prev bartok.local /Users/njr/miro/log/2017/12/11/session013 5
2017/12/11 12:48:39 Save as /miro/datasets/trans.miro save bartok.local /Users/njr/miro/log/2017/12/11/session013 6

In this case, the history is quite short, but includes information about

  • where the data originally came from (first line)
  • when the data has been saved (second and seventh lines)
  • metadata changes (third, fourth and sixth lines)
  • changes to the data content (in this case, creation of a new field, days-since-prev).
  • detail about when (column 1) and where (column 4) changes were made, in what session these occurred (column 5, linking to the logs), what commands were used (column 3) and where in the log files to find those commands and context (column 6).
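
To make the structure concrete, here is a minimal sketch (an illustrative structure only, not Miró's internal representation) of recording one audit-trail entry with the same columns as the table above:

import datetime
import socket

audit_trail = []

def record_change(description, command, session, line):
    """Append one audit-trail entry with the same columns as the table above."""
    audit_trail.append({
        'date': datetime.datetime.now().isoformat(timespec='seconds'),
        'description': description,
        'command': command,
        'host': socket.gethostname(),
        'session': session,
        'line': line,
    })

record_change('Set dataset description to "Some demo transactions".',
              'description -d "Some demo transactions"',
              '/Users/njr/miro/log/2017/12/11/session013', 2)
print(audit_trail[-1])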

Field Definitions and Contexts

Fields remember their definitions, including—where relevant—the context in which they were created. This allows us to understand where any value in a dataset came from, as a sequence of transformations of input values.

In the previous section, we saw that the audit trail contained information about a derived field in the data, including the expression used to derive it. That information is also available as a basic property of the field:

[2]> ls -D days-since-prev
          Field                                 Definition
days-since-prev    (/ (- date (prev-by date id)) 24 60 60)

In other cases, entire datasets are created by taking "measurements" from a base dataset, such as the transactional data shown above. For example, we might want to create a dataset that measures how many transactions each customer has, and their total value.

One way of doing this in Miró is with the following command:

[4]> xtab -d -R MEASURES count sum amount by id
MEASURES: 4 records; 4 (100%) selected; 3 fields.

which creates a new dataset. As you might expect, this is the result:

id  count  sumamount
 1      1   1,000.00
 2      2   4,222.22
 3      3   7,333.33
 4      4   6,555.55

The field definitions in the newly created dataset are (by default) attached to its fields, as we can see:

[6]> MEASURES.ls -D
    Field                                  Definition
       id             Rollup variable [in trans.miro]
    count         count by id [in dataset trans.miro]
sumamount    sum amount by id [in dataset trans.miro]

Of course, the audit trail for MEASURES also contains this information, together with more detailed information about the session or sessions in which the relevant commands were issued.

Constraint Generation and Verification

Automatically generated constraints can be used to identify anomalous and possibly incorrect data within a source dataset. They can also be used to check that new data with the same structure has similar or expected properties.

We've talked in previous posts (here, here, here, and here) and in a white paper about automatically generating constraints that characterize either all of, or an inferred "good" subset of, the data in a dataset. Such constraints are useful for finding bad and anomalous data in the original dataset, and also for checking ("verifying") new data as it comes in.

We won't go over all the details of constraint generation and verification in this post, but do note that this relates to Roger Peng's idea, discussed in the last post, of tracking changes to data tests as a surrogate for tracking changes to the actual data. Obviously, having generated constraints, it's a good idea to store the constraints under version control, to facilitate such tracking. More directly, the results of verification allow you to see some changes to data directly.
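
For readers using the open-source Python tdda library rather than Miró, the same generate-then-verify workflow looks roughly like this; the snippet assumes the discover_df and verify_df functions documented for recent versions of tdda.constraints, and the file name is arbitrary:

import pandas as pd
from tdda.constraints import discover_df, verify_df

# A tiny stand-in for a "known good" dataset.
df = pd.DataFrame({'id': [1, 2, 2, 3],
                   'amount': [1000.0, 2000.0, 2222.22, 1000.0]})

# Generate constraints from the good data and save them as JSON, which can
# then be kept under version control alongside the analysis code.
constraints = discover_df(df)
with open('trans_constraints.tdda', 'w') as f:
    f.write(constraints.to_json())

# Later, verify incoming data against the stored constraints.
new_df = pd.DataFrame({'id': [4, 5], 'amount': [-1.0, 3000.0]})
print(verify_df(new_df, 'trans_constraints.tdda'))   # summary of passes and failures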

Data Dictionary

The data dictionary provides a useful reference for any user of the data, particularly when it includes helpful (and up-to-date) annotations. By making it easily editable, we encourage users to record useful information alongside the data.

Miró can generate a data dictionary from the data. This contains:

  • Summary information about the overall dataset
  • Per-field information, including

    • basic metadata about each field, including name and type
    • some information about the range of each field, including minimum and maximum values, null count etc.
    • further characterization information, including whether there are duplicate values,
    • any annotations that have been added to the dataset (such as descriptions, alternate names, tags etc.)

Here's an example of the first part of the data dictionary for our transaction dataset:

Name trans.miro
Path /miro/datasets/trans.miro
Host bartok.local
Hash 0971dde52de7bc2fb2ad2282a572f6ca295c33fae105d8b0fab7a618f4c70b71
Description Some demo transactions
Tags
Creation Date 2017-12-11 13:33:36
Number of Records 10
Number of Fields 5

And here's the second part, with an entry for each field:

Name Type Min Max Min Length Max Length # nulls # empty/zero # positive # negative Any duplicates Values Description Tags Long Name Short Name Definition
id int 1 4 0 0 10 0 yes
date date 2009-01-31T00:00:00 2009-04-04T20:44:44 0 0 10 0 no
categ string A B 1 1 0 0 yes "A" "B"
amount real 1,000.00 4,444.44 1 0 9 0 yes transaction value (GBP)
days-since-prev real 0.20 0.93 4 0 6 0 yes Time Since Previous Transaction (days) (/ (- date (prev-by date id)) 24 60 60)

While some of the information in the data dictionary is derived directly from the data, other parts (descriptions, alternate names, and tags) are created by annotation actions, whether performed by humans, bots, or scripts. Although there are Miró commands for setting all of the editable metadata properties, to encourage maximum use Miró can also export the metadata to a spreadsheet, where users can update it; the appropriate parts can then be re-imported using Miró's importmetadata command.
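
To illustrate the derived (as opposed to annotated) part of such a dictionary, a much-simplified version of the per-field summary can be computed from a pandas DataFrame along these lines (a sketch, not Miró's implementation):

import pandas as pd

def data_dictionary(df):
    """Build a per-field summary like the derived columns shown above."""
    rows = []
    for name in df.columns:
        col = df[name]
        non_null = col.dropna()
        rows.append({
            'Name': name,
            'Type': str(col.dtype),
            'Min': non_null.min() if len(non_null) else None,
            'Max': non_null.max() if len(non_null) else None,
            '# nulls': int(col.isnull().sum()),
            'Any duplicates': bool(non_null.duplicated().any()),
        })
    return pd.DataFrame(rows)

df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'amount': [1000.0, None, 2222.22, 1000.0]})
print(data_dictionary(df))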

Data signatures

A data signature is a very compact way to summarize a dataset. This allows quick and efficient checking that analysis is being performed using the correct data.

One of the properties reported in the data dictionary (the first table above) is a hash. If you're not familiar with hashing, a hash function is one that takes a (typically large) input and converts it to a much smaller output. Hash functions are designed so that different inputs tend (but are not guaranteed) to map to different outputs. So if you store the hashes of two large objects and they are different, you can be certain that the objects are different. If the hashes are the same, this does not absolutely guarantee that the objects are the same, but hash functions are designed to make so-called "hash collisions" extremely rare, especially between similar inputs. For most practical purposes, therefore, we can safely assume that if two hashes are the same, the inputs were the same.2

The point of storing the hash is that it acts as a much smaller, very efficient proxy for our original data, and if we want to know whether some dataset we have lying around contains the same data as the one used to generate the data dictionary, all we have to do is compare the hashes.

The hashes that Miró constructs for fields use only the data values, not any metadata, as inputs.3 This is a choice, of course, and we could also hash some or all of the metadata, but our primary concern here is whether we have the same underlying data or not, so we view it as an advantage that the hash is based solely on the data values.

There is an option to store hashes for individual fields, as well, which we have not used in generating the data dictionary shown.
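
To make the idea concrete, here is a minimal sketch (not Miró's implementation) of per-field and combined hashes for a pandas DataFrame, hashing only the values (including nulls) and, for the combined hash, the field names and order, as described in the footnotes:

import hashlib
import pandas as pd

def field_hash(series):
    """Hash a single column's values (including nulls), ignoring its name."""
    h = hashlib.sha256()
    for value in series:
        token = '<null>' if pd.isnull(value) else repr(value)
        h.update(token.encode('utf-8'))
        h.update(b'\x00')   # separator, so that ('ab', 'c') differs from ('a', 'bc')
    return h.hexdigest()

def dataset_hash(df):
    """Combined hash: depends on field names and order as well as on values."""
    h = hashlib.sha256()
    for name in df.columns:
        h.update(name.encode('utf-8'))
        h.update(field_hash(df[name]).encode('utf-8'))
    return h.hexdigest()

df = pd.DataFrame({'id': [1, 2, 2, 3],
                   'amount': [1000.0, None, 2222.22, 1000.0]})
for name in df.columns:
    print(field_hash(df[name]), name)
print(dataset_hash(df), 'combined')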

Comparing Datasets (diff commands)

The ability to compare two datasets, and when they are different, to see clearly what the differences are, is as fundamental as the ability to compare two files in Unix or git, or to perform a track changes operation on a Word document.

If we want to compare two datasets to see if they are the same, comparing hashes is a very efficient way to do so.

If, however, we actually want to understand the differences between datasets, we need something more like Unix's diff command, which was discussed in the previous post, or Microsoft Word's Compare Documents functionality.

Miró includes a ddiff command for comparing two datasets. Let's look at an example.

Here's our transaction data again, without the derived days-since-prev field:

id  date                 categ    amount
 1  2009-01-31 00:00:00  A      1,000.00
 2  2009-02-02 00:00:00  A      2,000.00
 2  2009-02-02 22:22:22  B      2,222.22
 3  2009-03-03 00:00:00  A      1,000.00
 3  2009-03-03 13:33:33  B      3,000.00
 3  2009-03-03 23:33:33  B      3,333.33
 4  2009-04-04 00:00:00  A      1,000.00
 4  2009-04-04 04:44:44  B      1,111.11
 4  2009-04-04 14:44:44  B
 4  2009-04-04 20:44:44  B      4,444.44

and here's a variant of it:

id  date                 categ    amount
 1  2009-01-31 00:00:00  A      1,000.00
 2  2009-02-02 00:00:00  A      2,000.00
 2  2009-02-02 22:22:22  B      2,222.22
 3  2009-03-03 00:00:00  A      1,000.00
 3  2009-03-03 13:33:33  B      3,000.00
 3  2009-03-03 23:33:33  B      3,333.33
 4  2009-04-04 00:00:00  A      1,000.00
 4  2009-04-04 04:44:44  B      1,111.11
 4  2009-04-04 14:44:44  B      3,874.18
 4  2009-04-04 20:44:44  A      4,444.44

These datasets are small enough that you can probably see the differences by inspection (though it might take a little while to be confident that you've spotted them all), but when there are millions of rows and hundreds of columns, that becomes less easy.

Using the hash trick we talked about previously, we can see whether there are any differences, and slightly more besides. Assuming the current working dataset in Miró is the first, and the second is in TRANS2, we can hash them both:

[46]>  hash
19451c13321284f2b0dd7736b75b443945fac1eae08a8118d600ec0d49b6bf87 id
f3c08f1d2d23abaa06da0529237f63bf8099053b9088328dfd5642d9b06e8f6a date
3a0070ae42b9f341e7e266a18ea1c78c7d8be093cb628c7be06b6175c8b09f23 categ
f5b3f6284f7d510f22df32daac2784597122d5c14b89b5355464ce05f84ce120 amount

b89ae4b74f95187ecc5d49ddd7f45a64849a603539044ae318a06c2dc7292cf9 combined

[47]>  TRANS2.hash
19451c13321284f2b0dd7736b75b443945fac1eae08a8118d600ec0d49b6bf87 id
f3c08f1d2d23abaa06da0529237f63bf8099053b9088328dfd5642d9b06e8f6a date
fcd11dbd69eee0bf6d2a405a6e4ef9227bb3f0279d9cc7866e2efe5b4c97112c categ
64c5c97e9e9676ec085b522303d75ff11b0ebe01a1ceebaf003719b3718f12bb amount

2e171d2a24183e5e25bbcc50d9cd99ad8b4ca48ee7e1abfa6027edd291a22584 combined

We can see immediately that these are different, but that the individual hashes for the id and date fields are the same, indicating that their content is (almost certainly) the same. It's the categ and amount fields that differ between the two datasets.

We can use the ddiff command to get a more detailed diagnosis:

[49]>  ddiff -P TRANS2

Number of differences       Field Pair
           0:                   id : id-2
           0:                 date : date-2
           1:                categ : categ-2
           1:               amount : amount-2

Diff fields:
           1: diff-amount
           1: diff-categ

Total number of differences found: 2

The output here confirms that there are no differences between the id and date fields in the two datasets, but that one value differs for each of the categ and amount fields. The -P flag that we passed to the ddiff command told it to preserve information about the differences, and if we now look at the data, we see five extra fields on the first dataset: copies of the differing fields as they were in the other dataset, TRANS2 (categ-2 and amount-2), per-field difference indicators (diff-categ and diff-amount), and an overall diff field showing whether each record has any differences across the two datasets.

[50]> show
id  date                 categ  categ-2  diff-categ    amount  amount-2  diff-amount  diff
 1  2009-01-31 00:00:00  A      A                 0  1,000.00  1,000.00            0     0
 2  2009-02-02 00:00:00  A      A                 0  2,000.00  2,000.00            0     0
 2  2009-02-02 22:22:22  B      B                 0  2,222.22  2,222.22            0     0
 3  2009-03-03 00:00:00  A      A                 0  1,000.00  1,000.00            0     0
 3  2009-03-03 13:33:33  B      B                 0  3,000.00  3,000.00            0     0
 3  2009-03-03 23:33:33  B      B                 0  3,333.33  3,333.33            0     0
 4  2009-04-04 00:00:00  A      A                 0  1,000.00  1,000.00            0     0
 4  2009-04-04 04:44:44  B      B                 0  1,111.11  1,111.11            0     0
 4  2009-04-04 14:44:44  B      B                 0            3,874.18            1     1
 4  2009-04-04 20:44:44  B      A                 1  4,444.44  4,444.44            0     1

This makes it easy to identify and select only those fields or records with differences, which is one of the key tasks when trying to track data lineage.
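
Outside Miró, a much-simplified analogue of ddiff -P can be sketched in pandas, assuming the two DataFrames are already aligned on record number (the function and column names here are ours, for illustration):

import pandas as pd

def ddiff_like(df1, df2):
    """Add *-2, diff-* and overall diff columns, roughly as ddiff -P does."""
    result = df1.copy()
    diff_cols = []
    for name in df1.columns:
        # Treat two nulls as equal; any other mismatch counts as a difference.
        both_null = df1[name].isnull() & df2[name].isnull()
        same = (df1[name] == df2[name]) | both_null
        if not same.all():                      # only keep fields that differ
            result[name + '-2'] = df2[name]
            result['diff-' + name] = (~same).astype(int)
            diff_cols.append('diff-' + name)
    result['diff'] = result[diff_cols].max(axis=1) if diff_cols else 0
    return result

df1 = pd.DataFrame({'categ': ['A', 'B', 'B'], 'amount': [1000.0, None, 4444.44]})
df2 = pd.DataFrame({'categ': ['A', 'B', 'A'], 'amount': [1000.0, 3874.18, 4444.44]})
diffed = ddiff_like(df1, df2)
print(diffed[diffed['diff'] == 1])              # records with any difference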

As powerful as Miró's ddiff and related commands are, there is also much more that we would like (and plan) to support. Our comparison is fundamentally based on joining the two datasets (either on one or more nominated key fields, or, as in this case, implicitly on record number). When we are using a join key, it's quite easy to deal with row addition and deletion, but that is harder when we are just joining on record number. It would be useful to have a Unix diff-like ability to spot single rows or groups of rows that have been inserted, deleted, or re-ordered, but we don't have that today. In certain cases, spotting other kinds of systematic edits would be interesting—for example, thinking of the table in spreadsheet-like terms, it would be useful to spot cases in which blocks of cells shift up, down, left or right. This situation is not very common in the data we typically work with, but there are domains in which those sorts of changes might be frequent.

What Next?

We surveyed a few of the ways we think about and implement features in our software (and workflows) to help track data provenance and data lineage. There's a great deal more we could do, and over time we will almost certainly add more. Hopefully these ideas will prove useful and interesting, and obviously any of you fortunate enough to use Miró can try them out.

We'll keep you posted as we extend our thinking.

Credits

Thanks to our Social Media Manager, for actively policing our content even as it was being created.

Alfie, our Social Media Manager, inspects progress on the blog post.


  1. Changing (original) values in data is actually so rare that Miró provides very few facilities for doing so directly. Data can effectively be changed by deleting one field or record and adding another, and there are a couple of operations that can change values in place—primarily anonymization operations and functions that update special, automatically created fields, albeit usually by deletion and regeneration. In fact, one of the rules on our "unwritten list" of good practices is never to replace an input field with an edited copy, but instead always to derive a new field with a variant name, so that when we see a field from source data we know its values have not been altered. 

  2. A familiar example comes from git, which uses hashes to compare the contents of files efficiently, and also uses a SHA-1 hash to identify a commit, by hashing all of the important information about that commit. 

  3. The overall hash depends on the names of the fields and their order, as well as the data in the fields, but the individual field hashes do not. As a result, two fields containing the same values (in the same order) will receive the same hash, but datasets in which fields have been renamed or reordered will have different overall hashes. We should also note that missing values (NULLs) contribute to the field hashes. 


Data Provenance and Data Lineage: the View from the Podcasts

Posted on Thu 30 November 2017 in TDDA • Tagged with data lineage, data provenance, data governance, tdda, constraints

In Episode 49 of the Not So Standard Deviations podcast, the final segment (starting at 59:32) discusses data lineage, after Roger Peng listened to the September 3rd (2017) episode of another podcast, Linear Digressions, which discussed that subject.

This is a topic very close to our hearts, and I thought it would be useful to summarize the discussions on both podcasts, as a precursor to writing up how we approach some of these issues at Stochastic Solutions—in our work, in our Miró software and through the TDDA approaches discussed on this blog.

It probably makes sense to begin by summarizing the Linear Digressions episode, in which Katie Malone explains the idea of data lineage (also known as data provenance) as the tracking of changes to a dataset.

Any dataset starts from one or more "original sources"—the sensors that first measured the quantities recorded, or the system that generated the original data. In almost all cases, a series of transformations is then applied to the data before it is finally used in some given application. For example, in machine learning, Katie describes typical transformation stages as:

  1. Cleaning the data
  2. Making additions, subtractions and merges to the dataset
  3. Aggregating the data in some way
  4. Imputing missing values

She describes this as the process view of data lineage.

An alternative perspective focuses less on the processes than on the resulting sequence of datasets, viewed as snapshots. As her co-host, Ben Jaffe, points out, this is more akin to the way version control systems view files or collections of files, and diff tools (see below) effectively reconstruct the process view1 from the data.

In terms of tooling, Katie reckons that the tools for tracking data lineage are relatively well developed (but specialist) in large scientific collaborations such as particle physics (she used to work on the LHC at CERN) and genomics, but her sense is that tools are less well developed in many business contexts.

She then describes five reasons to care about data provenance:

  1. (To improve/track/ensure) data quality
  2. (To provide) an audit trail
  3. (To aid) replication (her example was enabling the rebuilding of a predictive model that had been lost, if you knew how it had been produced; subject, obviously, not only to having the data but also to full details of the parameters, training regime and any random number seeds used)
  4. (To support) attribution (e.g. providing a credit to the original data collector when publishing a paper)
  5. Informational (i.e. to keep track of and aid navigation within large collections of datasets).

I recommend listening to the episode.

In Not-so-Standard Deviations 49, Roger Peng introduces the idea of tracking and versioning data, as a result of listening to Katie and Ben discuss the issue on their podcast. Roger argues that while you can stick a dataset into Git or other version control software, doing so is not terribly helpful because, most of the time, the dataset acts essentially as a blob,2 rather than as a structured entity that diff tools help you to understand.

In this, he is exactly right. In case you're not familiar with version control and diff tools, let me illustrate the point. In a previous post on Rexpy, I added a link to a page about our Miró software between two edits. If I run the relevant git diff command, this is its output:

git diff of two versions of the markdown for a blog post

As you can see, this shows pretty clearly what's changed. Using a visual diff tool, we get an even clearer picture of the changes (especially when the changes are more numerous and complex):

visual diff (opendiff) of two versions of the markdown for a blog post

In contrast, if I do a diff on two Excel files stored in Git, I get the following:

git diff b1b85ddc448a723845c36688480cfe5072f28c1a -- test-excel-sheet1.xlsx 
diff --git a/testdata/test-excel-sheet1.xlsx b/testdata/test-excel-sheet1.xlsx
index 0bd63cb..91e5b0e 100644
Binary files a/testdata/test-excel-sheet1.xlsx and b/testdata/test-excel-sheet1.xlsx differ

This is better than nothing, but gives no insight into what has changed. (In fact, it's worse than it looks, because even changes to the metadata inside an Excel file--such as the location of the selected cell—will cause the files to be shown as different. As a result, there are many hard-to-detect false positives when using diff commands with binary files.)

Going back to the podcast, Hilary Parker then talked about the difficulty of the idea of data provenance in the context of streaming data, but argued there was a bit more hope with batch processes, since at least the datasets used then are "frozen".

Roger then argued that there are good custom tools used in particular places like CERN, but those are not tools he can just pick up and use. He then wondered aloud whether such tools can really exist in general, because they require too much understanding of analytical goals. (I don't agree with this, as the next post will show.) He then rowed back slightly, saying that maybe it's too hard for general data, but more reasonable in a narrower context such as "tidy data".

If you aren't familiar with the term "tidy data", it really just refers to data stored in the same way that relational databases store data in tables, with columns corresponding to variables, rows corresponding to whatever the items being measured are (the observations), consistent types for all the values in a column and a resulting regular, grid-like structure. (This contrasts with, say, JSON data, which is hierarchical, and with many spreadsheets, in which different parts of the sheet are used for different things, and with data in which different observations are in columns and the different variables are in rows.) So "tidy data" is an extremely large subset of the data we use in structured data analysis, at least after initial regularization.
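
For illustration (a made-up example), here is the same small set of facts in tidy form and in a hierarchical, JSON-like form, together with the pandas call that flattens the latter back into a tidy table:

import pandas as pd

# Tidy: one row per observation, one column per variable.
tidy = pd.DataFrame({'customer': [1, 1, 2],
                     'date': ['2009-01-31', '2009-02-02', '2009-02-02'],
                     'amount': [1000.0, 2000.0, 2222.22]})

# The same information in a hierarchical (JSON-like) form, which is not tidy.
nested = {'customers': [
    {'id': 1, 'transactions': [{'date': '2009-01-31', 'amount': 1000.0},
                               {'date': '2009-02-02', 'amount': 2000.0}]},
    {'id': 2, 'transactions': [{'date': '2009-02-02', 'amount': 2222.22}]},
]}

# Flattening the nested form recovers a tidy table.
print(pd.json_normalize(nested['customers'], 'transactions', ['id']))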

Hilary then mentioned an R package called testdat, which she had worked on at an "unconference". This aimed to check things like missing values and contiguity of dates in datasets. These ideas are very similar to those of constraint generation and verification, which we frequently discuss on this blog, as supported in the TDDA package. But Hilary (and others?) concluded that the package was not really useful, and that what was more important was tools to make writing tests for data easier. (I guess we disagree that general tools like this aren't useful, but very much support the latter point.)

Roger then raised the idea of tracking changes to such tests as a kind of proxy for tracking changes to the data, though clearly this is a very partial solution.

Both podcasts are interesting and worth a listen, but both of them seemed to feel that there is very little standardization in this area, and that it's really hard.

We have a lot of processes, software and ideas addressing many aspects of these issues, and I'll discuss them and try to relate them to the various points raised here in a subsequent post, fairly soon, I hope.


  1. Strictly speaking, diff tools cannot know what the actual processes used to transform the data were, but construct a set of atomic changes (known as patches) that are capable of transforming the old dataset into the new one. 

  2. a blob, in version control systems like Git, is a Binary Large OBject. Git allows you to store blobs, and will track different versions of them, but all you can really see is whether two versions are the same, whereas for "normal" files (text files), visual diff tools normally allow you to see easily exactly what changes have been made, much like the track changes feature in Word documents. 


Automatic Constraint Generation and Verification White Paper

Posted on Fri 06 October 2017 in TDDA • Tagged with tdda, constraints, verification, bad data

We have a new White Paper available:

Automatic Constraint Generation and Verification

Abstract

Correctness is a key problem at every stage of data science projects: completing an entire analysis without a serious error at some stage is surprisingly hard. Even errors that reverse or completely invalidate the analysis can be hard to detect. Test-Driven Data Analysis (TDDA) attempts to identify, reduce, and aid correction of such errors. A core tool that we use in TDDA is Automatic Constraint Discovery and Verification, the focus of this paper.

Download White Paper here


Constraint Generation in the Presence of Bad Data

Posted on Thu 21 September 2017 in TDDA • Tagged with tdda, constraints, discovery, verification, suggestion, cartoon, bad data

Bad data is widespread and pervasive.1

Only datasets and analytical processes that have been subject to rigorous and sustained quality assurance processes are typically capable of achieving low or zero error rates. "Badness" can take many forms and have various aspects, including incorrect values, missing values, duplicated entries, misencoded values, values that are inconsistent with other entries in the same dataset, and values that are inconsistent with those in other datasets, to name but a few.

Person A: 'After our data quality drive, I am happy to say we know of no remaining bad data.' Person B: 'But every day I find problems, and mail the bad data account. They never get fixed.' A: 'Sure, but that's an unmonitored account.' (Covers ears, closes eyes.) A: 'Like I said, we know of no remaining bad data.' B: (Thinks): 'What is this hell?'

We have previously discussed automatic constraint discovery and verification as a mechanism for finding, specifying, and executing checks on data, and have advocated using these to verify input, output, and intermediate datasets. Until now, however, such constraint discovery has been based on the assumption that the datasets provided to the discovery algorithm contain only good data—a significant limitation.

We have recently been thinking about ways to extend our approach to constraint generation to handle cases in which this assumption is relaxed, so that the "discovery" process can be applied to datasets that include bad data. This article discusses and illustrates one such approach, which we have prototyped, as usual, in our own Miró software. We plan to bring the same approach to the open-source Python tdda library as soon as we have gained further experience using it and convinced ourselves we are going in a good direction.

The most obvious benefit of extending constraint generation to function usefully even when the dataset used contains some bad data is increased applicability. Perhaps a more important benefit is that it means that, in general, more and tighter constraints will be generated, potentially increasing their utility and more clearly highlighting areas that should be of concern.

The Goal and Pitfalls: "Seeing Through" Bad Data

Our aim is to add to constraint generation a second mode of operation whose goal is not to discover constraints that are true over our example data, but rather to generate constraints for which there is reasonable evidence, even if some of them are not actually satisfied by all of the example data. This is not so much constraint discovery as constraint suggestion. We want the software to attempt to "see through" the bad data to find constraints similar to those that would have been discovered had there been only good data. This is hard (and ill specified) because the bad data values are not identified, but we shall not be deterred.

For constraints generated in this way, the role of a human supervisor is particularly important: ideally, the user would look at the constraints produced—both the ordinary ones actually satisfied by the data and the more numerous, tighter suggestions typically produced by the "assuming bad data" version—and decide which to accept on a case-by-case basis.

Extending constraint generation with the ability to "see through" bad data is very powerful, but carries obvious risks: as an example, constraint suggestion might look at lending data and conclude that defaults (non-payment of loans—hopefully a small proportion of customers) are probably outliers that should be excluded from the data. Clearly, that would be the wrong conclusion.

Notwithstanding the risks, our initial experiences with generation in the presence of bad data have been rather positive: the process has generated constraints that have identified problems of which we had been blissfully unaware. The very first time we used it highlighted:

  • a nearly but not-quite unique identifier (that should, of course, have been unique)
  • a number of low-frequency, bad categorical values in a field
  • several numeric fields with much smaller true good ranges than we had realized.

Additionally, even when the constraints discovered go beyond identifying incorrect data, they often highlight data that we are happy to have flagged because they do represent outliers deserving of scrutiny.

Our Approach

Our initial attempt at extending constraint discovery to the case in which there is (or might be) bad data is straightforward: we simply allow that a proportion of the data might be bad, and look for the best constraints we can find on that basis.

With our current implementation, in Miró, the user has to supply an upper limit on the proportion, p, of values that are thought likely to be bad. Eventually, we expect to be able to determine this proportion heuristically, at least by default. In practice, for datasets of any reasonable size, we are mostly just using 1%, which seems to be working fairly well.

We should emphasize that the "bad" proportion p provided is not an estimate of how much of the data is bad, but an upper limit on how much we believe is reasonably likely to be bad.2 As a result, when we use 1%, our goal is not to find the tightest possible constraints that are consistent with 99% of the data, but rather to work with the assumption that at least 99% of the data values are good, and then to find constraints that separate out values that look very different from the 99%. If this turns out to be only 0.1%, or 0.001%, or none at all, so much the better.

The way this plays out is different for different kinds of constraint. Let's work through some examples.

Minimum and Maximum Constraints

Looking for minimum and maximum constraints in the presence of possible bad values is closely connected to univariate outlier detection. We cannot, however, sensibly make any assumptions about the shape of the data in the general case, so parametric statistics (means, variances etc.) are not really appropriate. We also need to be clear that what we are looking for is not the tails of some smooth distribution, but rather values that look as if they might have been generated by some completely different process, or drawn from some other distribution.

An example of the kind of thing we are looking for is something qualitatively like the situation in the top diagram (though perhaps considerably more extreme), where the red and blue sections of the distribution look completely different from the main body, and we want to find cutoffs somewhere around the values shown. This is the sort of thing we might see, for example, if the main body of values are prices in one currency and the outliers, in contrast, are similar prices, but measured in two other currencies. In contrast, we are not aiming to cut off the tails of a smooth, regular distribution, such as the normal distribution shown in the lower diagram.

Top: a distribution with a double-peaked centre, and two small (coloured) peaks well to the left and right of the centre. Bottom: a normal distribution, with a small part of each tail coloured.

In our initial implementation we have used the possible bad proportion p to define an assumed "main body" of the distribution as running between the quantiles at p and (1 – p) (from the 1st percentile to the 99th, when p = 1%). We assume all values in this range are good. We then search out from this main body of the distribution, looking for a gap between successive values that is at least α times the (1% – 99%) interquantile range. We are using α=0.5 for now. Once we have identified a gap, we then pick a value within it, currently favouring the end nearer the main distribution and also favouring "round" numbers.
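
In outline, a simplified sketch of that search for a suggested maximum might look like the following (an illustration of the idea, not Miró's code; the exact point chosen within the gap is an assumption, and the preference for "round" numbers is omitted):

import numpy as np

def suggest_max(values, p=0.01, alpha=0.5):
    """Suggest a maximum constraint by searching for a gap beyond the
    assumed-good main body of the distribution (quantiles p .. 1-p)."""
    v = np.sort(np.asarray(values, dtype=float))
    v = v[~np.isnan(v)]
    lo, hi = np.quantile(v, p), np.quantile(v, 1 - p)
    gap_threshold = alpha * (hi - lo)      # e.g. half the 1%-99% interquantile range
    previous = hi
    for x in v[v > hi]:                    # walk outwards from the main body
        if x - previous >= gap_threshold:
            # Cut within the gap, near its lower end.
            return previous + 0.1 * gap_threshold
        previous = x
    return None                            # no convincing gap: suggest no constraint

rng = np.random.default_rng(0)
prices = np.concatenate([rng.normal(300.0, 100.0, 10000), [150000.0, 200000.0]])
print(suggest_max(prices))   # a value just above the main body, well below the outliers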

Let's look at an example.

We have a dataset with 12,680,141 records and 3 prices (floating-point values), Price1, Price2 and Price3, in sterling. If we run ordinary TDDA constraint discovery on this, we get the following results:

Individual Field Constraints
Name    Type Allowed      Min Allowed       Max Allowed
Price1  real                      0.0        198,204.34
Price2  real          –468,550,685.56      2,432,595.87
Price3  real                      0.0  1,390,276,267.42

With ordinary constraint discovery, the min and max constraints are just set to the largest and smallest values in the dataset, so this tells us that the largest value for Price1 in the dataset is nearly £200,000, that Price2 ranges from close to –£500m to about +£2.5m, and that Price3 runs up to about £1.4bn.

If we rerun the constraint generation allowing for up to 1% bad data, we get the following instead.

Individual Field Constraints
Name    Type Allowed  Min Allowed  Max Allowed
Price1  real                  0.0     38,000.0
Price2  real             –3,680.0      3,870.0
Price3  real                  0.0    125,000.0

Let's look at why these values were chosen, and see whether they are reasonable.

Starting with Price1, let's get a few more statistics.

min Price1  percentile 1 Price1  median Price1  percentile 99 Price1  max Price1
      0.00                24.77         299.77               2,856.04  198,204.34

So here our median value is just under £300, while our first percentile value is about £25 and our 99th percentile is a little under £3,000, giving us an interquantile range also somewhat under £3,000 (£2,831.27, to be precise).

The way our algorithm currently works, this interquantile range defines the scale for the gap we need to define something as an outlier—in this case, half of £2,831.27, which is about £1,415.

If we were just looking to find the tightest maximum constraint consistent with 1% of the data being bad (with respect to this constraint), we would obviously just set the upper limit to fractionally above £2,856.04. But this would be fairly nonsensical, as we can see if we look at the data around that threshold. It is a feature of this particular data that most values are duplicated a few times (typically between 3 and 10 times), so we first deduplicate them. Here are the first few distinct values above £2,855:

2,855.02   2,855.42   2,855.80   2,856.04
2,855.16   2,855.50   2,855.82   2,856.08
2,855.22   2,855.64   2,855.86   2,856.13
2,855.38   2,855.76   2,855.91   2,856.14

Obviously, there's nothing special about £2,856.04 (the 99th percentile value), and there is no meaningful gap after it, so that would be a fairly odd place to cut off.

Now let's look at the (distinct) values above £30,000:

30,173.70   31,513.21   39,505.67   44,852.01   67,097.01    72,082.60
30,452.45   32,358.52   39,703.93   45,944.21   60,858.88    72,382.57
30,562.55   32,838.23   40,888.79   47,026.70   60,911.84    72,388.55
30,586.96   33,906.98   40,999.04   47,058.99   63,160.28    72,657.04
30,601.67   34,058.27   41,252.50   47,126.60   63,984.64    72,760.90
30,620.78   35,302.33   41,447.00   49,827.28   64,517.16    90,713.28
31,118.10   36,472.53   41,513.95   51,814.22   67,256.27   103,231.44
31,139.86   36,601.86   42,473.36   53,845.76   68,081.68   123,629.97
31,206.71   36,941.76*  43,384.46   55,871.84   70,782.28   198,204.34
31,449.57   39,034.38   43,510.03   56,393.14   71,285.95

The cutoff the TDDA algorithm has chosen (£38,000) is just after the value marked *, which is the start of the first sizable gap. We are not claiming that there is anything uniquely right or "optimal" about the cutoff that has been chosen in the context of this data, but it looks reasonable. The first few values here are still comparatively close together—with gaps of only a few hundred between each pair of successive values—and the last ones are almost an order of magnitude bigger.

We won't look in detail at Price2 and Price3, where the algorithm has narrowed the range rather more, except to comment briefly on the negative values for Price2, which you might have expected to be excluded. This table shows why:3

countnegative Price2 countzero Price2 countpositive Price2 countnull Price2 countnonnull Price2
1,300 51,640 39,750 15,880 92,690

Although negative prices are not very common, they do account for about 1.2% of the data (critically, more than 1%). Additionally, nearly 15% of the values for Price2 are missing, and we exclude nulls from this calculation, so the truly relevant figure is that 1,300 of 92,690 non-null values are negative, or about 1.4%.

Other Constraints

We will now use a slightly wider dataset, with more field types, to look at how constraint generation works for other kinds of constraints in the presence of bad data. Here are the first 10 records from a dataset with 108,570 records, including a similar set of three price fields (with apologies for the slightly unintuitive colouring of the Colour field).

ID Date Price1 Price2 Price3 Code nItems Parts Colour
af0370b4-16e4-1891-8b30-cbb075438394 2016-11-01 00:38:34 830.25 830.25 NLOM 1 62444 red
126edf08-16e5-1891-851c-d926416373f2 2016-11-01 00:41:21 983.08 0.00 983.08 QLDZ 1 62677 yellow
73462586-16ed-1891-b164-7da16012a3ab 2016-11-01 01:41:20 540.82 0.00 540.82 TKNX 1 62177 62132 red
1cd55aac-170e-1891-aec3-0ffc0ab020e7 2016-11-01 05:35:08 398.89 0.00 398.89 PKWI 1 61734 red
68f74486-170e-1891-8b30-cbb075438394 2016-11-01 05:37:16 314.54 8.21 314.54 PLUA 1 62611 red
28dff654-16fa-1891-bd69-03d25ee96dee 2016-11-01 03:12:18 479.75 479.75 UHEG 1 62128 red
14321670-1703-1891-ab75-77d18aced2f5 2016-11-01 04:16:09 733.41 0.00 733.41 RTKT 2 61829 red
60957d68-1717-1891-a5b2-67fe1e248d60 2016-11-01 06:41:27 537.81 0.00 537.81 OBFZ 1 61939 62371 red
4fea2bb6-171d-1891-80a6-2f322b85f525 2016-11-01 07:23:56 132.60 2.42 135.02 TACG 2 62356 red
204ed3d6-1725-1891-a1fb-15b6ffac52ce 2016-11-01 08:19:52 866.32 866.32 XHLE 2 61939 red

If we run ordinary constraint discovery on this (not allowing for any bad data), these are the results:

Individual Field Constraints
Name Type Allowed Min Allowed Max Allowed Sign Allowed Nulls Allowed Duplicates Allowed Values Allowed # Regular Expressions
ID string length 32 length 36 0 - 2
Date date 1970-01-01 2017-06-03 23:59:06 0
Date:time-before-now timedelta 107 days, 11:28:05 17428 days, 11:27:11 positive
Price1 real 0.0 19,653,405.06 non-negative 0
Price2 real -4,331,261.54 589,023.50
Price3 real 0.0 20,242,428.57 non-negative 0
Code string length 4 length 4 0 - 3
nItems int 1 99 positive -
Parts string length 5 length 65 1 - 5
Colour string length 3 length 6 0 - 6 values 2

There are a few points to note:

  • To make the table more manageable, the regular expressions and field values are not shown by default. We'll see them later.
  • The third line of constraints generated is slightly different from the others: it looks not at the actual values of the dates, but rather at how far in the past or future they are. This is something else we are experimenting with, but it doesn't matter too much for this example.

Let's repeat the process telling the software that up to 1% of the data might be bad.

Individual Field Constraints
Name Type Allowed Min Allowed Max Allowed Sign Allowed Nulls Allowed Duplicates Allowed Values Allowed # Regular Expressions
ID string length 32 length 36 0 no 1
Date date 1970-01-01 01:23:20 2017-06-03 23:59:06 0
Date:time-before-now timedelta 107 days, 11:55:27 17428 days, 11:54:33 positive
Price1 real 0.0 25000.0 non-negative 0
Price2 real -520.0 790.0
Price3 real 0.0 24000.0 non-negative 0
Code string length 4 length 4 0 - 2
nItems int 1 10 positive 0 -
Parts string length 5 length 65 0 - 2
Colour string length 3 length 6 0 - 3 values 1

Now let's consider the (other) constraint kinds in turn.

Sign

Nothing has changed for any of the sign constraints. In general, all we do in this case is check whether less than our nominated proportion p of the (non-null) values is negative, or less than proportion p is positive, and if so write a constraint to that effect; in this example, that makes no difference.

Nulls

The nulls allowed constraint puts an upper limit on the number of nulls allowed in a field. The only two values we ever use are 0 and 1, with 0 meaning that no nulls are allowed and 1 meaning that a single null is allowed. The Parts field originally had a value of 1 here, meaning that there is a single null in this field in the data, and the software therefore wrote a constraint that a maximum of 1 null is permitted. When the software is allowed to assume that 1% of the data might be bad, a more natural constraint is not to allow nulls at all.

No constraint on nulls was produced for the field nItems originally, but now we have one. If we check the number of nulls in nItems, it turns out to be just 37 or around 0.03%. Since that is well below our 1% limit, a "no-nulls" constraint has been generated.

Duplicates

In the original constraint discovery, no constraints banning duplicates were generated, whereas in this case ID did get a "no duplicates" constraint. If we count the number of distinct values in ID, there are, in fact, 108,569 distinct values for 108,570 records, so there is clearly one duplicate, affecting 2 records. Since 2/108,570 is again well below our 1% limit (more like 0.002%), the software generates this constraint.
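
A minimal sketch of how these two decisions can be made (illustrative only; the threshold logic follows the description above, the names are ours):

from collections import Counter

def suggest_null_and_duplicate_constraints(values, p=0.01):
    """Suggest "no nulls" / "no duplicates" constraints when the records that
    would violate them amount to at most a fraction p of all records."""
    n = len(values)
    nulls = sum(1 for v in values if v is None)
    counts = Counter(v for v in values if v is not None)
    affected_by_duplicates = sum(c for c in counts.values() if c > 1)
    constraints = {}
    if nulls <= p * n:
        constraints['max_nulls'] = 0
    if affected_by_duplicates <= p * n:
        constraints['no_duplicates'] = True
    return constraints

ids = list(range(108569)) + [0]             # 108,570 records, one duplicated value
print(suggest_null_and_duplicate_constraints(ids))
# {'max_nulls': 0, 'no_duplicates': True}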

Values Allowed

In the original dataset, a constraint on the values for the field Colour was generated. Specifically, the six values it allowed were:

"red" "yellow" "blue" "green" "CC1734" "???"

When we ran the discovery process allowing for bad data, the result was that only 3 values were allowed, which turn out to be "red", "yellow", and "blue". If we look at the breakdown of the field, we will see why.

Colour count
red 88,258
yellow 10,979
blue 8,350
green 978
CC1734 4
??? 1

The two values we might have picked out as suspicious (or at least, as having a different form from the others) are obviously CC1734 and ???, and constraint generation did exclude those, but also green, which we probably would not have done. There are 978 green values (about 0.9%), which is slightly under our 1% cutoff, so it can be excluded by the algorithm, and is. The algorithm simply works through the values in order, starting with the least numerous one, and removes values until the maximum possible bad proportion p is reached (cumulatively). Values with the same frequency are treated together, which means that if we had had another value (say "purple") with exactly the same frequency as green (978), neither would have been excluded.
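
A simplified sketch of that exclusion rule, reproducing the Colour example above (illustrative only, not Miró's code):

from collections import Counter

def allowed_values(values, p=0.01):
    """Drop the rarest values while their cumulative share stays within p.
    Values tied on frequency are kept or dropped together."""
    counts = Counter(v for v in values if v is not None)
    total = sum(counts.values())
    budget = p * total
    by_freq = {}                            # group values by frequency
    for value, n in counts.items():
        by_freq.setdefault(n, []).append(value)
    excluded, used = set(), 0
    for freq in sorted(by_freq):            # rarest tiers first
        group = by_freq[freq]
        cost = freq * len(group)
        if used + cost > budget:            # this whole tier no longer fits
            break
        excluded.update(group)
        used += cost
    return sorted(set(counts) - excluded)

colours = ['red'] * 88258 + ['yellow'] * 10979 + ['blue'] * 8350 \
          + ['green'] * 978 + ['CC1734'] * 4 + ['???']
print(allowed_values(colours))              # ['blue', 'red', 'yellow']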

Regular Expressions

Sets of regular expressions were generated to characterize each of the four string fields, and in every case the number generated was smaller when assuming the possible presence of bad data than when not.

For the ID field, two regular expressions were generated:

^[0-9a-f]{32}$
^[0-9a-f]{8}\-[0-9a-f]{4}\-1891\-[0-9a-f]{4}\-[0-9a-f]{12}$

The first of these just corresponds to a 32-digit hex number, while the second is a 32-digit hex number broken into five groups of 8, 4, 4, 4, and 12 digits, separated by dashes—i.e. a UUID.4

If we get Rexpy to give us the coverage information, we see that all but three of the IDs are properly formatted UUIDs, the remaining three being plain 32-digit hex strings:

Regular Expression Incremental Coverage
^[0-9a-f]{8}\-[0-9a-f]{4}\-1891\-[0-9a-f]{4}\-[0-9a-f]{12}$ 108,567
^[0-9a-f]{32}$ 3

and indeed, it is these 3 records that are excluded by the constraint generation when assuming bad data.
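
The same kind of incremental-coverage calculation is easy to reproduce with Python's standard re module; the three IDs below are taken from the examples in this post, so treat the counts as purely illustrative:

import re

# The two expressions Rexpy generated for the ID field.
patterns = [
    re.compile(r'^[0-9a-f]{8}\-[0-9a-f]{4}\-1891\-[0-9a-f]{4}\-[0-9a-f]{12}$'),
    re.compile(r'^[0-9a-f]{32}$'),
]

ids = [
    'af0370b4-16e4-1891-8b30-cbb075438394',     # well-formed UUIDs...
    '126edf08-16e5-1891-851c-d926416373f2',
    '374e4f9816e51891b1647da16012a3ab',         # ...and a bare 32-digit hex string
]

# Incremental coverage: count each ID against the first pattern it matches.
coverage = [0] * len(patterns)
for value in ids:
    for i, pattern in enumerate(patterns):
        if pattern.match(value):
            coverage[i] += 1
            break

for pattern, n in zip(patterns, coverage):
    print(n, pattern.pattern)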

We won't go through all the cases, but will look at one more. The field Parts is a list of 5-digit numbers, separated by spaces, with a single one being generated most commonly. Here are the regular expressions that Rexpy generated, together with the coverage information.

Regular Expression Incremental Coverage
^\d{5}$ 99,255
^\d{5} \d{5}$ 8,931
^\d{5} \d{5} \d{5}$ 378
^\d{5} \d{5} \d{5} \d{5}$ 4
^61000 61000 61000 61000 61000 61000 61000 61000 61000 61000 61000$ 1

As you can see, Rexpy has generated five separate regular expressions, where in an ideal world we might have preferred it to produce a single one:

^\d{5}( \d{5})*$

In fact, however, having produced separate expressions for five clearly distinguishable cases turns out to be very helpful for TDDA purposes.

In this case, we can see that the vast bulk of the cases have either one or two 5-digit codes (which are the two regular expressions retained by constraint generation), but we would almost certainly consider the 382 cases with three or four codes to be correct as well. The last one is more interesting. First, the number of codes (eleven) is noticeably larger than for any other record. Secondly, the fact that it is eleven copies of a single, relatively round code is suspicious. (Looking at the data, it turns out that in no other case are any codes repeated when there are multiple codes, suggesting even more strongly that there's something not right with this record.)

Verifying the Constraint Generation Data Against the Suggested Constraints

With the previous approach to constraint discovery, in which we assume that the example data given to us contains only good data, it should always be the case that if we verify the example data against constraints generated from it, they will all pass.5 With the new approach, this is no longer the case, because our hope is that the constraints will help us to identify bad data. We show below the result of running verification against the generated constraints for the example we have been looking at:

Constraints                 47
Records                108,570
Fields                      10
Values               1,085,700
Failing Constraints         15
Failing Records          1,722
Failing Fields              10
Failing Values           1,887

We can also see a little more information about where the failures were in the table below.

Name Failures Type Minimum Maximum Sign Max Nulls Duplicates Values Rex
Values Constraints Allowed Actual Allowed Actual Allowed Actual Allowed Actual Allowed Actual Allowed Actual Allowed Actual Allowed Actual
Colour 983 2 string string length 3 length 3 length 6 length 6 - - - 0 0 - - - 3 values e.g. "green" 1 pattern e.g. "???"
Parts 384 2 string string length 5 length 5 length 65 length 65 - - - 0 1 - - - - - - 2 patterns e.g. "62702 62132 62341"
Price2 281 2 real real -520.00 -4,331,261.54 790.00 589,023.50 - - - - 15880 - - - - - - - - - -
Price3 92 1 real real 0.00 0.00 24,000.00 20,242,428.57 ≥ 0 0 0 - - - - - - - - -
Price1 91 1 real real 0.00 0.00 25,000.00 19,653,405.06 ≥ 0 0 0 - - - - - - - - -
nItems 44 2 int int 1 1 10 99 > 0 0 37 - - - - - - - - -
Code 4 1 string string length 4 length 4 length 4 length 4 - - - 0 0 - - - - - - 2 patterns e.g. "A__Z"
ID 3 2 string string length 32 length 32 length 36 length 36 - - - 0 0 no e.g. "374e4f9816e51891b1647da16012a3ab" - - - 1 pattern e.g. "374e4f9816e51891b1647da16012a3ab"
Date 3 1 date date 1970-01-01 01:23:20 1970-01-01 2017-06-03 23:59:06 2017-06-03 23:59:06 - - - 0 0 - - - - - - - - -
Date:time-before-now 2 1 - - - 109 days, 16:27:17 109 days, 16:28:09 17430 days, 16:26:23 17430 days, 16:27:15 > 0 - 0 - - - - - - - - - -

Because the verification fails in this way, creating an indicator field for each failing constraint and an overall field containing the number of failures for each record, it is easy to narrow down to the data flagged by these constraints, see whether the constraints are useful or overzealous, and adjust them as necessary.

Summary

In this post, we've shown how we're extending Automatic Constraint Generation in TDDA to cover cases where the datasets used are not assumed to be perfect. We think this is quite a significant development. We'll use it a bit more, and when it's solid, extend the open-source tdda library to include this functionality.


  1. As previously, I am aware that, classically, data is the plural of datum, and that purists would prefer the assertion to be "bad data are widespread and pervasive." I apologise to anyone whose sensibilities are offended by my use of the word data in the singular. 

  2. As a further clarification, the proportion p is used for each potential constraint on each field separately, so it's not that "1% of all values might be bad", but rather that "1% of the Ages might be higher than the maximum constraint we generate", and so on. Obviously, we could generalize this approach to allow different proportions for each constraint type, or each field, or both, at the cost of increasing the number of free parameters. 

  3. Like you, I look at those figures and am immediately suspicious that all the counts shown are multiples of 10. But this is one of those cases where our suspicions are wrong: this is a random sample from larger data, and the roundness of these numbers is blind chance. 

  4. In fact, looking carefully, the third group of digits is fixed and starts with 1, indicating that this is a UUID-1, which is something we hadn't noticed in this data until we got Rexpy to generate a regular expression for us, as part of TDDA. 

  5. With the time-delta constraints we are now generating, this is not strictly true, but this need not concern us in this case. 


Obtaining the Python tdda Library

Posted on Thu 14 September 2017 in TDDA • Tagged with tdda, python

This post is a standing post that we plan to try to keep up to date, describing options for obtaining the open-source Python TDDA library that we maintain.

Using pip from PyPI

Assuming you have a working pip setup, you should be able to install the tdda library by typing:

pip install tdda

or, if your permissions don't allow use in this mode

sudo pip install tdda

If pip isn't working, or is associated with a different Python from the one you are using, try:

python -m pip install tdda

or

sudo python -m pip install tdda

The tdda library supports both Python 3 (tested with 3.6 and 3.7) and Python 2 (tested with 2.7). (We'll start testing against 3.8 real soon!)

Upgrading

If you have a version of the tdda library installed and want to upgrade it with pip, add -U to one of the commands above, i.e. use whichever of the following you need for your setup:

pip install -U tdda
sudo pip install -U tdda
python -m pip install -U tdda
sudo python -m pip install -U tdda

Installing from Source

The source for the tdda library is available from Github and can be cloned with

git clone https://github.com/tdda/tdda.git

or

git clone git@github.com:tdda/tdda.git

When installing from source, if you want the command line tdda utility to be available, you need to run

python setup.py install

from the top-level tdda directory after downloading it.

Documentation

The main documentation for the tdda library is available on Read the Docs.

You can also build it yourself if you have downloaded the source from Github. In order to do this, you will need an installation of Sphinx. The HTML documentation is built, starting from the top-level tdda directory, by running:

cd doc
make html

Running TDDA's tests

Once you have installed TDDA (whether using pip or from source), you can run its tests by typing

tdda test

If you have all the dependencies, including optional dependencies, installed, you should get a line of dots and the message OK at the end, something like this:

$ tdda test
........................................................................................................................
----------------------------------------------------------------------
Ran 122 tests in 3.251s

OK

If you don't have some of the optional dependencies installed, some of the dots will be replaced by the letter 's'. For example:

$ tdda test
.................................................................s.............................s........................
----------------------------------------------------------------------
Ran 120 tests in 3.221s

OK (skipped=2)

This does not indicate a problem; it simply means that some of the functionality will be unavailable (usually support for one or more database types).

Using the TDDA examples

The tdda library includes three sets of examples, covering reference testing, automatic constraint discovery and verification, and Rexpy (discovery of regular expressions from examples, outside the context of constraints).

The tdda command line can be used to copy the relevant files into place. To get the examples, first change to a directory where you would like them to be placed, and then use the command:

tdda examples

This should produce the following output:

Copied example files for tdda.referencetest to ./referencetest-examples
Copied example files for tdda.constraints to ./constraints-examples
Copied example files for tdda.rexpy to ./rexpy-examples
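
As a tiny taste of the Rexpy examples, the library's documented rexpy.extract function can also be called directly from Python (the exact regular expression returned may vary between versions):

from tdda import rexpy

examples = ['2009-01-31', '2009-02-02', '2009-04-04']
print(rexpy.extract(examples))   # e.g. ['^\\d{4}\\-\\d{2}\\-\\d{2}$']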

Quick Reference Guides

There are quick reference guides available for the TDDA library. These are often a little behind the current release, but are usually still quite helpful.

These are available from here.

Tutorial from PyData London

There is a video online of a workshop at PyData London 2017. Watching a video of a workshop probably isn't ideal, but it does have a fairly detailed and gentle introduction to using the library, so if you are struggling, it might be a good place to start.