Data Provenance and Data Lineage: the View from the Podcasts

Posted on Thu 30 November 2017 in TDDA • Tagged with data lineage, data provenance, data governance, tdda, constraints

In Episode 49 of the Not So Standard Deviations podcast, the final segment (starting at 59:32) discusses data lineage, after Roger Peng listened to the September 3rd (2017) episode of another podcast, Linear Digressions, which discussed that subject.

This is a topic very close to our hearts, and I thought it would be useful to summarize the discussions on both podcasts, as a precursor to writing up how we approach some of these issues at Stochastic Solutions—in our work, in our Miró software and through the TDDA approaches discussed on this blog.

It probably makes sense to begin by summarizing the Linear Digressions episode, in which Katie Malone explains the idea of data lineage (also known as data provenance) as the tracking of changes to a dataset.

Any dataset starts from one or more "original sources"—the sensors that first measured the quantities recorded, or the system that generated the original data. In almost all cases, a series of transformations is then applied to the data before it is finally used in some given application. For example, in machine learning, Katie describes typical transformation stages as:

  1. Cleaning the data
  2. Making additions, subtractions and merges to the dataset
  3. Aggregating the data in some way
  4. Imputing missing values

She describes this as the process view of data lineage.

An alternative perspective focuses less on the processes than on the resulting sequence of datasets, viewed as snapshots. As her co-host Ben Jaffe points out, this is more akin to the way version control systems view files or collections of files, and diff tools (see below) effectively reconstruct the process view1 from the data.

In terms of tooling, Katie reckons that the tools for tracking data lineage are relatively well developed (but specialist) in large scientific collaborations such as particle physics (she used to work on the LHC at CERN) and genomics, but her sense is that tools are less well developed in many business contexts.

She then describes five reasons to care about data provenance:

  1. (To improve/track/ensure) data quality
  2. (To provide) an audit trail
  3. (To aid) replication (her example was enabling the rebuilding of a predictive model that had been lost, if you knew how it had been produced, subject, obviously, not only to having the data but also to having full details of the parameters, training regime and any random number seeds used)
  4. (To support) attribution (e.g. providing a credit to the original data collector when publishing a paper)
  5. Informational (i.e. to keep track of, and aid navigation within, large collections of datasets).

I recommend listening to the episode.

In Not So Standard Deviations 49, Roger Peng introduces the idea of tracking and versioning data, as a result of listening to Katie and Ben discuss the issue on their podcast. Roger argues that while you can stick a dataset into Git or other version control software, doing so is not terribly helpful because, most of the time, the dataset acts essentially as a blob,2 rather than as a structured entity that diff tools help you to understand.

In this, he is exactly right. In case you're not familiar with version control and diff tools, let me illustrate the point. In a previous post on Rexpy, I added a link to a page about our Miró software between two edits. If I run the relevant git diff command, this is its output:

git diff of two versions of the markdown for a blog post

As you can see, this shows pretty clearly what's changed. Using a visual diff tool, we get an even clearer picture of the changes (especially when the changes are more numerous and complex):

visual diff (opendiff) of two versions of the markdown for a blog post

In contrast, if I do a diff on two Excel files stored in Git, I get the following:

git diff b1b85ddc448a723845c36688480cfe5072f28c1a -- test-excel-sheet1.xlsx 
diff --git a/testdata/test-excel-sheet1.xlsx b/testdata/test-excel-sheet1.xlsx
index 0bd63cb..91e5b0e 100644
Binary files a/testdata/test-excel-sheet1.xlsx and b/testdata/test-excel-sheet1.xlsx differ

This is better than nothing, but gives no insight into what has changed. (In fact, it's worse than it looks, because even changes to the metadata inside an Excel file, such as the location of the selected cell, will cause the files to be shown as different. As a result, there are many hard-to-detect false positives when using diff commands with binary files.)

Going back to the podcast, Hilary Parker then talked about the difficulty of the idea of data provenance in the context of streaming data, but argued there was a bit more hope with batch processes, since at least the datasets used then are "frozen".

Roger then argued that there are good custom tools used in particular places like CERN, but those are not tools he can just pick up and use. He then wondered aloud whether such tools can really exist at all, because they require too much understanding of analytical goals. (I don't agree with this, as the next post will show.) He then rowed back slightly, saying that maybe it's too hard for general data, but more reasonable in a narrower context such as "tidy data".

If you aren't familiar with the term "tidy data", it really just refers to data stored in the same way that relational databases store data in tables, with columns corresponding to variables, rows corresponding to whatever the items being measured are (the observations), consistent types for all the values in a column and a resulting regular, grid-like structure. (This contrasts with, say, JSON data, which is hierarchical, and with many spreadsheets, in which different parts of the sheet are used for different things, and with data in which different observations are in columns and the different variables are in rows.) So "tidy data" is an extremely large subset of the data we use in structured data analysis, at least after initial regularization.

Hilary then mentioned an R package called testdat, which she had worked on at an "unconference". This aimed to check things like missing values and contiguity of dates in datasets. These ideas are very similar to those of constraint generation and verification, which we frequently discuss on this blog, as supported in the TDDA package. But Hilary (and others?) concluded that the package was not really useful, and that what was more important was tools to make writing tests for data easier. (I guess we disagree that general tools like this aren't useful, but we very much support the latter point.)
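For a flavour of what such tests look like with the open-source tdda library, here is a minimal sketch using its Pandas API. It assumes a reasonably recent version of the library, where the discover_df and verify_df convenience functions are available; the data is made up.

import pandas as pd
from tdda.constraints import discover_df, verify_df

df = pd.DataFrame({'age': [22, 35, 47, 61], 'height': [1.62, 1.81, 1.75, 1.68]})

constraints = discover_df(df)              # infer constraints from the example data
with open('example_constraints.tdda', 'w') as f:
    f.write(constraints.to_json())         # save them as a JSON ".tdda" file

verification = verify_df(df, 'example_constraints.tdda')
print(verification)                        # summary of passing and failing constraints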

Roger then raised the idea of tracking changes to such tests as a kind of proxy for tracking changes to the data, though clearly this is a very partial solution.

Both podcasts are interesting and worth a listen, but both of them seemed to feel that there is very little standardization in this area, and that it's really hard.

We have a lot of processes, software and ideas addressing many aspects of these issues, and I'll discuss them and try to relate them to the various points raised here in a subsequent post, fairly soon, I hope.


  1. Strictly speaking, diff tools cannot know what the actual processes used to transform the data were, but construct a set of atomic changes (known as patches) that are capable of transforming the old dataset into the new one. 

  2. a blob, in version control systems like Git, is a Binary Large OBject. Git allows you to store blobs, and will track different versions of them, but all you can really see is whether two versions are the same, whereas for "normal" files (text files), visual diff tools normally allow you to see easily exactly what changes have been made, much like the track changes feature in Word documents. 


Automatic Constraint Generation and Verification White Paper

Posted on Fri 06 October 2017 in TDDA • Tagged with tdda, constraints, verification, bad data

We have a new White Paper available:

Automatic Constraint Generation and Verification

Abstract

Correctness is a key problem at every stage of data science projects: completing an entire analysis without a serious error at some stage is surprisingly hard. Even errors that reverse or completely invalidate the analysis can be hard to detect. Test-Driven Data Analysis (TDDA) attempts to identify, reduce, and aid correction of such errors. A core tool that we use in TDDA is Automatic Constraint Discovery and Verification, the focus of this paper.

Download White Paper here


Constraint Generation in the Presence of Bad Data

Posted on Thu 21 September 2017 in TDDA • Tagged with tdda, constraints, discovery, verification, suggestion, cartoon, bad data

Bad data is widespread and pervasive.1

Only datasets and analytical processes that have been subject to rigorous and sustained quality assurance processes are typically capable of achieving low or zero error rates. "Badness" can take many forms and have various aspects, including incorrect values, missing values, duplicated entries, misencoded values, values that are inconsistent with other entries in the same dataset, and values that are inconsistent with those in other datasets, to name but a few.

Person A: 'After our data quality drive, I am happy to say we know of no remaining bad data.' Person B: 'But every day I find problems, and mail the bad data account. They never get fixed.' A: 'Sure, but that's an unmonitored account.' (Covers ears, closes eyes.) A: 'Like I said, we know of no remaining bad data.' B: (Thinks): 'What is this hell?'

We have previously discussed automatic constraint discovery and verification as a mechanism for finding, specifying, and executing checks on data, and have advocated using these to verify input, output, and intermediate datasets. Until now, however, such constraint discovery has been based on the assumption that the datasets provided to the discovery algorithm contain only good data—a significant limitation.

We have recently been thinking about ways to extend our approach to constraint generation to handle cases in which this assumption is relaxed, so that the "discovery" process can be applied to datasets that include bad data. This article discusses and illustrates one such approach, which we have prototyped, as usual, in our own Miró software. We plan to bring the same approach to the open-source Python tdda library as soon as we have gained further experience using it and convinced ourselves we are going in a good direction.

The most obvious benefit of extending constraint generation to function usefully even when the dataset used contains some bad data is increased applicability. Perhaps a more important benefit is that, in general, more and tighter constraints will be generated, potentially increasing their utility and more clearly highlighting areas that should be of concern.

The Goal and Pitfalls: "Seeing Through" Bad Data

Our aim is to add to constraint generation a second mode of operation whose goal is not to discover constraints that are true over our example data, but rather to generate constraints for which there is reasonable evidence, even if some of them are not actually satisfied by all of the example data. This is not so much constraint discovery as constraint suggestion. We want the software to attempt to "see through" the bad data to find constraints similar to those that would have been discovered had there been only good data. This is hard (and ill specified) because the bad data values are not identified, but we shall not be deterred.

For constraints generated in this way, the role of a human supervisor is particularly important: ideally, the user would look at the constraints produced—both the ordinary ones actually satisfied by the data and the more numerous, tighter suggestions typically produced by the "assuming bad data" version—and decide which to accept on a case-by-case basis.

Extending constraint generation with the ability to "see through" bad data is very powerful, but carries obvious risks: as an example, constraint suggestion might look at lending data and conclude that defaults (non-payment of loans—hopefully a small proportion of customers) are probably outliers that should be excluded from the data. Clearly, that would be the wrong conclusion.

Notwithstanding the risks, our initial experiences with generation in the presence of bad data have been rather positive: the process has generated constraints that have identified problems of which we had been blissfully unaware. The very first time we used it highlighted:

  • a nearly but not-quite unique identifier (that should, of course, have been unique)
  • a number of low-frequency, bad categorical values in a field
  • several numeric fields with much smaller true good ranges than we had realized.

Additionally, even when the constraints discovered go beyond identifying incorrect data, they often highlight data that we are happy to have flagged because they do represent outliers deserving of scrutiny.

Our Approach

Our initial attempt at extending constraint discovery to the case in which there is (or might be) bad data is straightforward: we simply allow that a proportion of the data might be bad, and look for the best constraints we can find on that basis.

With our current implementation, in Miró, the user has to supply an upper limit on the proportion, p, of values that are thought likely to be bad. Eventually, we expect to be able to determine this proportion heuristically, at least by default. In practice, for datasets of any reasonable size, we are mostly just using 1%, which seems to be working fairly well.

We should emphasize that the "bad" proportion p provided is not an estimate of how much of the data is bad, but an upper limit on how much we believe is reasonably likely to be bad.2 As a result, when we use 1%, our goal is not to find the tightest possible constraints that are consistent with 99% of the data, but rather to work with the assumption that at least 99% of the data values are good, and then to find constraints that separate out values that look very different from that 99%. If the proportion of bad values turns out to be only 0.1%, or 0.001%, or none at all, so much the better.

The way this plays out is different for different kinds of constraint. Let's work through some examples.

Minimum and Maximum Constraints

Looking for minimum and maximum constraints in the presence of possible bad values is closely connected to univariate outlier detection. We cannot, however, sensibly make any assumptions about the shape of the data in the general case, so parametric statistics (means, variances etc.) are not really appropriate. We also need to be clear that what we are looking for is not the tails of some smooth distribution, but rather values that look as if they might have been generated by some completely different process, or drawn from some other distribution.

An example of the kind of thing we are looking for is something qualitatively like the situation in the top diagram (though perhaps considerably more extreme), where the red and blue sections of the distribution look completely different from the main body, and we want to find cutoffs somewhere around the values shown. This is the sort of thing we might see, for example, if the main body of values are prices in one currency and the outliers are similar prices, but measured in two other currencies. In contrast, we are not aiming to cut off the tails of a smooth, regular distribution, such as the normal distribution shown in the lower diagram.

Top: a distribution with a double-peaked centre, and two small (coloured) peaks well to the left and right of the centre. Bottom: a normal distribution, with a small part of each tail coloured.

In our initial implementation, we have used the possible bad proportion p to define an assumed "main body" of the distribution as running between the quantiles at p and (1 – p) (from the 1st percentile to the 99th, when p = 1%). We assume all values in this range are good. We then search out from this main body of the distribution, looking for a gap between successive values that is at least α times the (1% – 99%) interquantile range. We are using α = 0.5 for now. Once we have identified a gap, we then pick a value within it, currently favouring the end nearer the main distribution and also favouring "round" numbers.
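To make this concrete, here is a minimal sketch of that kind of maximum-suggestion logic in Python. It is not the Miró implementation: the choice of cut point is simplified (in particular, the preference for "round" numbers is omitted), and the function and parameter names are purely illustrative.

import numpy as np

def suggest_max(values, p=0.01, alpha=0.5):
    # Suggest an upper limit for a numeric field, tolerating a proportion p
    # of possibly-bad values. Illustrative sketch only.
    v = np.sort(np.asarray(values, dtype=float))
    v = v[~np.isnan(v)]                        # ignore missing values
    lo_q, hi_q = np.quantile(v, [p, 1 - p])    # assumed-good "main body"
    gap_threshold = alpha * (hi_q - lo_q)      # a gap this big marks a break
    prev = hi_q
    for x in v[v > hi_q]:                      # search outwards from the main body
        if x - prev > gap_threshold:
            # cut inside the gap, near the end closer to the main body
            return prev + 0.1 * (x - prev)
        prev = x
    return float(v[-1])                        # no gap found: fall back to the maximum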

Let's look at an example.

We have a dataset with 12,680,141 records and 3 prices (floating-point values), Price1, Price2 and Price3, in sterling. If we run ordinary TDDA constraint discovery on this, we get the following results:

Individual Field Constraints

Name    Type Allowed  Min Allowed      Max Allowed
Price1  real          0.0              198,204.34
Price2  real          –468,550,685.56  2,432,595.87
Price3  real          0.0              1,390,276,267.42

With ordinary constraint discovery, the min and max constraints are just set to the largest and smallest values in the dataset, so this tells us that the largest value for Price1 in the dataset is nearly £200,000, that Price2 ranges from close to –£500m to about +£2.5m, and that Price3 runs up to about £1.4bn.

If we rerun the constraint generation allowing for up to 1% bad data, we get the following instead.

Individual Field Constraints

Name    Type Allowed  Min Allowed  Max Allowed
Price1  real          0.0          38,000.0
Price2  real          –3,680.0     3,870.0
Price3  real          0.0          125,000.0

Let's look at why these values were chosen, and see whether they are reasonable.

Starting with Price1, let's get a few more statistics.

min Price1  percentile 1 Price1  median Price1  percentile 99 Price1  max Price1
0.00        24.77                299.77         2,856.04              198,204.34

So here our median value is just under £300, while our first percentile value is about £25 and our 99th percentile is a little under £3,000, giving us an interquantile range also somewhat under £3,000 (£2,831.27, to be precise).

The way our algorithm currently works, this interquantile range defines the scale for the gap we need to define something as an outlier—in this case, half of £2,831.27, which is about £1,415.

If we were just looking to find the tightest maximum constraint consistent with 1% of the data being bad (with respect to this constraint), we would obviously just set the upper limit to fractionally above £2,856.04. But this would be fairly nonsensical, as we can see if we look at the data around that threshold. It is a feature of this particular data that most values are duplicated a few times (typically between 3 and 10 times), so we first deduplicate them. Here are the first 20 distinct values above £2,855:

2,855.02   2,855.42   2,855.80   2,856.04*
2,855.16   2,855.50   2,855.82   2,856.08
2,855.22   2,855.64   2,855.86   2,856.13
2,855.38   2,855.76   2,855.91   2,856.14

Obviously, there's nothing special about £2,856.04 (marked with *), and there is no meaningful gap after it, so that would be a fairly odd place to cut off.

Now let's look at the (distinct) values above £30,000:

30,173.70   31,513.21   39,505.67   44,852.01   67,097.01    72,082.60
30,452.45   32,358.52   39,703.93   45,944.21   60,858.88    72,382.57
30,562.55   32,838.23   40,888.79   47,026.70   60,911.84    72,388.55
30,586.96   33,906.98   40,999.04   47,058.99   63,160.28    72,657.04
30,601.67   34,058.27   41,252.50   47,126.60   63,984.64    72,760.90
30,620.78   35,302.33   41,447.00   49,827.28   64,517.16    90,713.28
31,118.10   36,472.53   41,513.95   51,814.22   67,256.27   103,231.44
31,139.86   36,601.86   42,473.36   53,845.76   68,081.68   123,629.97
31,206.71   36,941.76*  43,384.46   55,871.84   70,782.28   198,204.34
31,449.57   39,034.38   43,510.03   56,393.14   71,285.95

The cutoff the TDDA algorithm has chosen (£38,000) is just after the value marked *, which is the start of the first sizable gap. We are not claiming that there is anything uniquely right or "optimal" about the cutoff that has been chosen in the context of this data, but it looks reasonable. The first few values here are still comparatively close together (with gaps of only a few hundred between successive values), while the last ones are almost an order of magnitude bigger.

We won't look in detail at Price2 and Price3, where the algorithm has narrowed the range rather more, except to comment briefly on the negative values for Price2, which you might have expected to be excluded. This table shows why:3

countnegative Price2  countzero Price2  countpositive Price2  countnull Price2  countnonnull Price2
1,300                 51,640            39,750                15,880            92,690

Although negative prices are not very common, they do account for about 1.2% of the data (critically, more than 1%). Additionally, nearly 15% of the values for Price2 are missing, and we exclude nulls from this calculation, so the truly relevant figure is that 1,300 of 92,690 non-null values are negative, or about 1.4%.

Other Constraints

We will now use a slightly wider dataset, with more field types, to look at how constraint generation works for other kinds of constraints in the presence of bad data. Here are the first 10 records from a dataset with 108,570 records, including a similar set of three price fields (with apologies for the slightly unintuitive colouring of the Colour field).

ID                                    Date                 Price1  Price2  Price3  Code  nItems  Parts        Colour
af0370b4-16e4-1891-8b30-cbb075438394  2016-11-01 00:38:34  830.25          830.25  NLOM  1       62444        red
126edf08-16e5-1891-851c-d926416373f2  2016-11-01 00:41:21  983.08  0.00    983.08  QLDZ  1       62677        yellow
73462586-16ed-1891-b164-7da16012a3ab  2016-11-01 01:41:20  540.82  0.00    540.82  TKNX  1       62177 62132  red
1cd55aac-170e-1891-aec3-0ffc0ab020e7  2016-11-01 05:35:08  398.89  0.00    398.89  PKWI  1       61734        red
68f74486-170e-1891-8b30-cbb075438394  2016-11-01 05:37:16  314.54  8.21    314.54  PLUA  1       62611        red
28dff654-16fa-1891-bd69-03d25ee96dee  2016-11-01 03:12:18  479.75          479.75  UHEG  1       62128        red
14321670-1703-1891-ab75-77d18aced2f5  2016-11-01 04:16:09  733.41  0.00    733.41  RTKT  2       61829        red
60957d68-1717-1891-a5b2-67fe1e248d60  2016-11-01 06:41:27  537.81  0.00    537.81  OBFZ  1       61939 62371  red
4fea2bb6-171d-1891-80a6-2f322b85f525  2016-11-01 07:23:56  132.60  2.42    135.02  TACG  2       62356        red
204ed3d6-1725-1891-a1fb-15b6ffac52ce  2016-11-01 08:19:52  866.32          866.32  XHLE  2       61939        red

If we run ordinary constraint discovery on this (not allowing for any bad data), these are the results:

Individual Field Constraints

Name                  Type Allowed  Min Allowed         Max Allowed           Sign Allowed  Nulls Allowed  Duplicates Allowed  Values Allowed  # Regular Expressions
ID                    string        length 32           length 36                           0              -                                   2
Date                  date          1970-01-01          2017-06-03 23:59:06                 0
Date:time-before-now  timedelta     107 days, 11:28:05  17428 days, 11:27:11  positive
Price1                real          0.0                 19,653,405.06         non-negative  0
Price2                real          -4,331,261.54       589,023.50
Price3                real          0.0                 20,242,428.57         non-negative  0
Code                  string        length 4            length 4                            0              -                                   3
nItems                int           1                   99                    positive                     -
Parts                 string        length 5            length 65                           1              -                                   5
Colour                string        length 3            length 6                            0              -                   6 values        2

There are a few points to note:

  • To make the table more manageable, the regular expressions and field values are not shown by default; we'll see them later.
  • The third line of constraints generated is slightly different from the others: it looks not at the actual values of the dates, but rather at how far in the past or future they are. This is something else we are experimenting with, but it doesn't matter too much for this example.

Let's repeat the process telling the software that up to 1% of the data might be bad.

Individual Field Constraints

Name                  Type Allowed  Min Allowed          Max Allowed           Sign Allowed  Nulls Allowed  Duplicates Allowed  Values Allowed  # Regular Expressions
ID                    string        length 32            length 36                           0              no                                  1
Date                  date          1970-01-01 01:23:20  2017-06-03 23:59:06                 0
Date:time-before-now  timedelta     107 days, 11:55:27   17428 days, 11:54:33  positive
Price1                real          0.0                  25000.0               non-negative  0
Price2                real          -520.0               790.0
Price3                real          0.0                  24000.0               non-negative  0
Code                  string        length 4             length 4                            0              -                                   2
nItems                int           1                    10                    positive      0              -
Parts                 string        length 5             length 65                           0              -                                   2
Colour                string        length 3             length 6                            0              -                   3 values        1

Now let's consider the (other) constraint kinds in turn.

Sign

Nothing has changed for any of the sign constraints. In general, all we do in this case is check whether less than our nominated proportion p of the values is negative, or less than proportion p is positive, and if so write a constraint to that effect; in this example, though, that makes no difference.

Nulls

The nulls allowed constraint puts an upper limit on the number of nulls allowed in a field. The only two values we ever use are 0 and 1, with 0 meaning that no nulls are allowed and 1 meaning that a single null is allowed. The Parts field originally had a value of 1 here, meaning that there is a single null in this field in the data, and therefore the software wrote a constraint that a maximum of 1 null is permitted. When the software is allowed to assume that 1% of the data might be bad, a more natural constraint is not to allow nulls at all.

No constraint on nulls was produced for the field nItems originally, but now we have one. If we check the number of nulls in nItems, it turns out to be just 37 or around 0.03%. Since that is well below our 1% limit, a "no-nulls" constraint has been generated.

Duplicates

In the original constraint discovery, no constraints banning duplicates were generated, whereas in this case ID did get a "no duplicates" constraint. If we count the number of distinct values in ID, there are, in fact, 108,569 for 108,570 records, so there is clearly one duplicate, affecting 2 records. Since 2/108,570 is again well below our 1% limit (more like 0.002%), the software generates this constraint.
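The sign, nulls and duplicates decisions described in the last three sections all come down to the same kind of test: count how many values would break a candidate constraint, and only write the constraint if that count stays within the proportion p. Here is a minimal sketch of that logic for a pandas Series; it is illustrative only, not the Miró implementation, and the names are made up.

import pandas as pd

def suggest_simple_constraints(series, p=0.01):
    # Suggest sign, max-nulls and no-duplicates constraints for a pandas Series,
    # tolerating a proportion p of possibly-bad values. Sketch only.
    constraints = {}
    n = len(series)
    non_null = series.dropna()

    if series.isnull().sum() <= p * n:
        constraints['max_nulls'] = 0           # few enough nulls to disallow them

    if non_null.duplicated().sum() <= p * n:
        constraints['no_duplicates'] = True    # few enough duplicates to disallow them

    if pd.api.types.is_numeric_dtype(series):
        if (non_null < 0).sum() <= p * len(non_null):
            constraints['sign'] = 'non-negative'
        elif (non_null > 0).sum() <= p * len(non_null):
            constraints['sign'] = 'non-positive'

    return constraints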

Values Allowed

In the original dataset, a constraint on the values for the field Colour was generated. Specifically, the six values it allowed were:

"red" "yellow" "blue" "green" "CC1734" "???"

When we ran the discovery process allowing for bad data, the result was that only 3 values were allowed, which turn out to be "red", "yellow", and "blue". If we look at the breakdown of the field, we will see why.

Colour  count
red     88,258
yellow  10,979
blue    8,350
green   978
CC1734  4
???     1

The two values we might have picked out as suspicious (or at least, as having a different form from the others) are obviously CC1734 and ???, and constraint generation did exclude those, but it also excluded green, which we probably would not have done. There are 978 green values (about 0.9%), which is slightly under our 1% cutoff, so it can be excluded by the algorithm, and is. The algorithm simply works through the values in order, starting with the least numerous one, and removes values until the maximum possible bad proportion p is reached (cumulatively). Values with the same frequency are treated together, which means that if we had had another value (say "purple") with exactly the same frequency as green (978), neither would have been excluded.
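A minimal sketch of that cumulative exclusion logic (illustrative only; the function name is made up and the real implementation may differ in detail):

from collections import Counter

def suggest_allowed_values(values, p=0.01):
    # Suggest an allowed-values list for a categorical field, tolerating a
    # proportion p of possibly-bad values. Assumes nulls are already removed.
    counts = Counter(values)
    budget = p * len(values)                   # how many values may be treated as bad
    allowed = dict(counts)
    dropped = 0
    for freq in sorted(set(counts.values())):  # rarest frequencies first
        group = [v for v, c in counts.items() if c == freq]
        cost = freq * len(group)               # equal-frequency values stand or fall together
        if dropped + cost > budget:
            break
        for v in group:
            del allowed[v]
        dropped += cost
    return sorted(allowed)

With the Colour counts above, a budget of 1% of 108,570 values lets this sketch drop "???", "CC1734" and "green" (983 values in total) but not "blue", matching the behaviour described.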

Regular Expressions

Sets of regular expressions were generated to characterize each of the four string fields, and in every case the number generated was smaller when assuming the possible presence of bad data than when not.

For the ID field, two regular expressions were generated:

^[0-9a-f]{32}$
^[0-9a-f]{8}\-[0-9a-f]{4}\-1891\-[0-9a-f]{4}\-[0-9a-f]{12}$

The first of these just corresponds to a 32-digit hex number, while the second is a 32-digit hex number broken into five groups of 8, 4, 4, 4, and 12 digits, separated by dashes—i.e. a UUID.4

If we get Rexpy to give us the coverage information, we see that all but three of the IDs are properly formatted UUIDs, with the remaining three being plain 32-digit hex numbers:

Regular Expression                                           Incremental Coverage
^[0-9a-f]{8}\-[0-9a-f]{4}\-1891\-[0-9a-f]{4}\-[0-9a-f]{12}$  108,567
^[0-9a-f]{32}$                                               3

And indeed, it is these 3 records that are excluded by the constraint generation when assuming bad data.
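If you want to experiment with this kind of regular expression generation yourself, Rexpy is included in the open-source tdda library. A minimal sketch of its use, with a few ID values of the kind shown above:

from tdda import rexpy

examples = [
    'af0370b4-16e4-1891-8b30-cbb075438394',   # UUID-style IDs, as in the table above
    '126edf08-16e5-1891-851c-d926416373f2',
    '374e4f9816e51891b1647da16012a3ab',       # a bare 32-digit hex ID
]
for regex in rexpy.extract(examples):         # returns a list of regular expressions
    print(regex)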

We won't go through all the cases, but will look at one more. The field Parts is a list of 5-digit numbers, separated by spaces, with a single one being generated most commonly. Here are the regular expressions that Rexpy generated, together with the coverage information.

Regular Expression                                                    Incremental Coverage
^\d{5}$                                                               99,255
^\d{5} \d{5}$                                                         8,931
^\d{5} \d{5} \d{5}$                                                   378
^\d{5} \d{5} \d{5} \d{5}$                                             4
^61000 61000 61000 61000 61000 61000 61000 61000 61000 61000 61000$  1

As you can see, Rexpy has generated five separate regular expressions, where in an ideal world we might have preferred it produced a single one:

^\d{5}( \d{5})*$

As it happens, however, producing separate expressions for five clearly distinguishable cases turns out to be very helpful for TDDA purposes.

In this case, we can see that the vast bulk of the cases have either one or two 5-digit codes (which are the two regular expressions retained by constraint generation), but we would almost certainly consider the 382 cases with three or four codes to be correct as well. The last one is more interesting. First, the number of codes (eleven) is noticeably larger than for any other record. Secondly, the fact that it is eleven copies of a single code, and a relatively round one at that, is suspicious. (Looking at the data, it turns out that in no other case are any codes repeated when there are multiple codes, suggesting even more strongly that there's something not right with this record.)
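That last check, that no other record repeats a code within Parts, is easy to reproduce directly. Here is a minimal sketch, assuming the data is in a pandas DataFrame df with the Parts column shown above:

def has_repeated_codes(parts):
    # True if a space-separated Parts value contains the same code more than once.
    if not isinstance(parts, str):            # skip nulls
        return False
    codes = parts.split()
    return len(codes) != len(set(codes))

suspicious = df[df['Parts'].apply(has_repeated_codes)]
print(len(suspicious))                        # 1 for the data described above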

Verifying the Constraint Generation Data Against the Suggested Constraints

With the previous approach to constraint discovery, in which we assume that the example data given to us contains only good data, it should always be the case that if we verify the example data against constraints generated from it, they will all pass.5 With the new approach, this is no longer the case, because our hope is that the constraints will help us to identify bad data. We show below the result of running verification against the generated constraints for the example we have been looking at:

Constraints          47
Records              108,570
Fields               10
Values               1,085,700
Failing Constraints  15
Failing Records      1,722
Failing Fields       10
Failing Values       1,887

We can also see a little more information about where the failures were in the table below.

(Columns: Name; the number of failing Values and failing Constraints; then the Allowed and Actual values for Type, Minimum, Maximum, Sign, Max Nulls, Duplicates, Values and Rex.)
Colour 983 2 string string length 3 length 3 length 6 length 6 - - - 0 0 - - - 3 values e.g. "green" 1 pattern e.g. "???"
Parts 384 2 string string length 5 length 5 length 65 length 65 - - - 0 1 - - - - - - 2 patterns e.g. "62702 62132 62341"
Price2 281 2 real real -520.00 -4,331,261.54 790.00 589,023.50 - - - - 15880 - - - - - - - - - -
Price3 92 1 real real 0.00 0.00 24,000.00 20,242,428.57 ≥ 0 0 0 - - - - - - - - -
Price1 91 1 real real 0.00 0.00 25,000.00 19,653,405.06 ≥ 0 0 0 - - - - - - - - -
nItems 44 2 int int 1 1 10 99 > 0 0 37 - - - - - - - - -
Code 4 1 string string length 4 length 4 length 4 length 4 - - - 0 0 - - - - - - 2 patterns e.g. "A__Z"
ID 3 2 string string length 32 length 32 length 36 length 36 - - - 0 0 no e.g. "374e4f9816e51891b1647da16012a3ab" - - - 1 pattern e.g. "374e4f9816e51891b1647da16012a3ab"
Date 3 1 date date 1970-01-01 01:23:20 1970-01-01 2017-06-03 23:59:06 2017-06-03 23:59:06 - - - 0 0 - - - - - - - - -
Date:time-before-now 2 1 - - - 109 days, 16:27:17 109 days, 16:28:09 17430 days, 16:26:23 17430 days, 16:27:15 > 0 - 0 - - - - - - - - - -

Because the verification fails in this way, and in doing so creates indicator fields for each failing constraint and an overall field with the number of failures for each record, it is then easy to narrow down to the data flagged by these constraints, to see whether the constraints are useful or overzealous, and to adjust them as necessary.

Summary

In this post, we've shown how we're extending Automatic Constraint Generation in TDDA to cover cases where the datasets used are not assumed to be perfect. We think this is quite a significant development. We'll use it a bit more, and when it's solid, extend the open-source tdda library to include this functionality.


  1. As previously, I am aware that, classically, data is the plural of datum, and that purists would prefer the assertion to be "bad data are widespread and pervasive." I apologise to anyone whose sensibilities are offended by my use of the word data in the singular. 

  2. As a further clarification, the proportion p is used for each potential constraint on each field separately, so it's not that "1% of all values" might be bad, but rather that "1% of the Ages might be higher than the maximum constraint we generate", and so on. Obviously, we could generalize this approach to allow different possible proportions for each constraint type, or each field, or both, at the cost of increasing the number of free parameters. 

  3. Like you, I look at those figures and am immediately suspicious that all the counts shown are multiples of 10. But this is one of those cases where our suspicions are wrong: this is a random sample from larger data, and the roundness of these numbers is blind chance. 

  4. In fact, looking carefully, the third group of digits is fixed and starts with 1, indicating that this is a UUID-1, which is something we hadn't noticed in this data until we got Rexpy to generate a regular expression for us, as part of TDDA. 

  5. With the time-delta constraints we are now generating, this is not strictly true, but this need not concern us in this case. 


Obtaining the Python tdda Library

Posted on Thu 14 September 2017 in TDDA • Tagged with tdda, python

This post is a standing post that we plan to try to keep up to date, describing options for obtaining the open-source Python TDDA library that we maintain.

Using pip from PyPI

Assuming you have a working pip setup, you should be able to install the tdda library by typing:

pip install tdda

or, if your permissions don't allow installation in this mode:

sudo pip install tdda

If pip isn't working, or is associated with a different Python from the one you are using, try:

python -m pip install tdda

or

sudo python -m pip install tdda

The tdda library supports both Python 3 (tested with 3.6 and 3.7) and Python 2 (tested with 2.7). (We'll start testing against 3.8 real soon!)

Upgrading

If you have a version of the tdda library installed and want to upgrade it with pip, add -U to one of the commands above, i.e. use whichever of the following you need for your setup:

pip install -U tdda
sudo pip install -U tdda
python -m pip install -U tdda
sudo python -m pip install -U tdda

Installing from Source

The source for the tdda library is available from Github and can be cloned with

git clone https://github.com/tdda/tdda.git

or

git clone git@github.com:tdda/tdda.git

When installing from source, if you want the command line tdda utility to be available, you need to run

python setup.py install

from the top-level tdda directory after downloading it.

Documentation

The main documentation for the tdda library is available on Read the Docs.

You can also build it yourself if you have downloaded the source from Github. In order to do this, you will need an installation of Sphinx. The HTML documentation is built, starting from the top-level tdda directory, by running:

cd doc
make html

Running TDDA's tests

Once you have installed TDDA (whether using pip or from source), you can run its tests by typing

tdda test

If you have all the dependencies, including optional dependencies, installed, you should get a line of dots and the message OK at the end, something like this:

$ tdda test
........................................................................................................................
----------------------------------------------------------------------
Ran 122 tests in 3.251s

OK

If you don't have some of the optional dependencies installed, some of the dots will be replaced by the letter 's'. For example:

$ tdda test
.................................................................s.............................s........................
----------------------------------------------------------------------
Ran 120 tests in 3.221s

OK (skipped=2)

This does not indicate a problem; it simply means that some of the functionality will be unavailable (usually one or more database types).

Using the TDDA examples

The tdda library includes three sets of examples, covering reference testing, automatic constraint discovery and verification, and Rexpy (discovery of regular expressions from examples, outside the context of constraints).

The tdda command line can be used to copy the relevant files into place. To get the examples, first change to a directory where you would like them to be placed, and then use the command:

tdda examples

This should produce the following output:

Copied example files for tdda.referencetest to ./referencetest-examples
Copied example files for tdda.constraints to ./constraints-examples
Copied example files for tdda.rexpy to ./rexpy-examples

Quick Reference Guides

There are quick reference guides available for the TDDA library. These are often a little behind the current release, but are usually still quite helpful.

These are available from here.

Tutorial from PyData London

There is a video online of a workshop at PyData London 2017. Watching a video of a workshop probably isn't ideal, but it does have a fairly detailed and gentle introduction to using the library, so if you are struggling, it might be a good place to start.


GDPR, Consent and Microformats: A Half-Baked Idea

Posted on Fri 08 September 2017 in TDDA • Tagged with tdda

Last night I went to The Protectors of Data Scotland Meetup on the subject of Marketing and GDPR. If you're not familiar with Europe's fast-approaching General Data Protection Regulation, and you keep or process any personal data about humans,1 you probably ought to learn about it. A good place to start is episode 202 of Horace Dediu's The Critical Path podcast, in which he interviews Tim Walters.

During the meeting, I had an idea, and though it is rather less than half-baked, right now it seems just about interesting enough that I thought I'd record it.

One of the key provisions in GDPR is that data processing generally requires consent of the data subject, and that consent is defined as

any freely given, specific, informed and unambiguous indication of his or her wishes by which the data subject, either by a statement or by a clear affirmative action, signifies agreement to personal data relating to them being processed

This is further clarified as follows:

This could include ticking a box when visiting an Internet website, choosing technical settings for information society services or by any other statement or conduct which clearly indicates in this context the data subject’s acceptance of the proposed processing of their personal data.

Silence, pre-ticked boxes or inactivity should therefore not constitute consent.

The Idea in a Nutshell

Briefly, the idea is this:

  • Websites (and potentially apps) requesting consents should include a digital specification of that consent in a standardized format to be defined (probably either an HTML microformat or a standardized JSON bundle).

  • This would allow software to understand the consents being requested unambiguously and to present them in a standardized, uniform, easy-to-understand format. It would also encourage businesses and other organizations to standardize the forms of consent they request. I imagine that if this happened, initially browser extensions and special apps such as password managers would learn to read the format and present the information clearly, but if it were successful, eventually web browsers themselves would do this.

  • Software could also allow people to create one or more templates or default responses, allowing, for example, someone who never wants to receive marketing to make this their default response, and someone who wants as many special offers as possible to have settings that reflect that. Obviously, you might want several different templates for organizations towards which you have different feelings.

  • A very small extension to the idea would extend the format to record the choices made, allowing password managers, browsers, apps etc. to record for the user exactly what consents were given.2

Benefits

I believe such a standard has potential benefits for all parties—businesses and other organizations requesting consent, individuals giving consent, regulators and courts:

  • Businesses and other data processing/capturing organizations would benefit from a clear set of consent kinds, each of which could have a detailed description (perhaps on an EU or W3C document) that could be referenced by a specific label (e.g. marketing_contact_email_organization). Best practice would hopefully quickly move to using these standard categories.

  • Software could present information in a standardized, clear way to users, highlighting non-standard provisions (preferably with standard symbols, a bit like the Creative Commons symbols).

  • By using template responses, users could more easily complete consent forms with less effort and less fear of ticking the wrong boxes.

Digital Specification? Microformat? JSON?

What are we actually talking about here?

The main (textual) content of a web page consists of the actual human-readable text together with annotation ("markup") to specify formatting and layout, as well as special features like the sort of checkboxes used to request consent. In older versions of the web, a web page was literally a text file in a special format originally defined by Tim Berners-Lee (HTML).

For example, in HTML, this one-sentence paragraph with the word very in bold might be written as follows:

<p>For example, in HTML, this one-sentence paragraph with the
word <b>very</b> in bold might be written as follows:</p>

Since the advent of "Web 2.0", many web pages are generated dynamically, with much of the data being sent in a format called JSON. A simple example of some JSON for describing (say) an element in the periodic table might be:

{
    "name": "Lithium",
    "atomicnumber": 3,
    "metal": true,
    "period": 2,
    "group": 1,
    "etymology": "Greek lithos"
}

It would be straightforward3 to develop a format for allowing all the common types of marketing consent (and indeed, many other kinds of processing consent) to be expressed either in JSON or an HTML microformat (which might not be rendered directly by the webpage). As a sketch, the kind of thing a marketing consent request might look like in JSON could be:

{
    "format": "GDPR-Marketing-Consent-Bundle",
    "format_version": "1.0",
    "requesting_organization": "Stochastic Solutions Limited",
    "requesting_organization_data_protection_policy_page:
        "https://stochasticsolutions.com/privacy.html",
    "requesting_organization_partners": [],
    "requested_consents": [
        "marketing_contact_email_organization",
        "marketing_contact_mobile_phone_organization",
        "marketing_contact_physical_mail_organization"
    ],
    "request_date": "2017-09-08",
    "request_url": "https://StochasticSolutios.com/give-us-your-data"
}

Key features I am trying to illustrate here are:

  • The format would include details about the organization making the request
  • The format would have the capacity to list partner organizations in cases in which consent for partner marketing or processing was also requested
  • The format would be granular with a taxonomy of known kinds of consents. These might be parameterized, rather than being simple strings. In this case, I've included a few different contact mechanisms and the suffix "organization" to indicate this is consent for the organization itself, rather than any partners or other randoms.

Undoubtedly, a real implementation would end up a bit bigger than this, and perhaps more hierarchical, but hopefully not too much bigger.

The format could be extended very simply to include the response, which could then be sent back to the site and also made available on the page to the browser/password manager/apps etc. Here is an augmentation of the request format that also captures the responses:

{
    "format": "GDPR-Marketing-Consent-Bundle",
    "format_version": "1.0",
    "requesting_organization": "Stochastic Solutions Limited",
    "requesting_organization_data_protection_policy_page:
        "https://stochasticsolutions.com/privacy.html",
    "requesting_organization_partners": [],
    "requested_consents": {
        "marketing_contact_email_organization": false,
        "marketing_contact_mobile_phone_organization": false
        "marketing_contact_physical_mail_organization: true
    },
    "request_date": "2017-09-08",
    "request_url": "https://StochasticSolutions.com/give-us-your-data"
}

This example indicates consent to marketing contact by paper mail from the organization, but not by phone or email.

Exactly the same could be achieved with an HTML Microformat, perhaps with something like this:

<div id="GDPR-Marketing-Consent-Bundle">
    <span id="format_version">1.0</span>
    <span id="requesting_organization">
        Stochastic Solutions Limited
    </span>
    <span id="requesting_organization_data_protection_policy_page">
        "https://stochasticsolutions.com/privacy.html"
    </span>
    <ol id="requesting_organization_partners">
    </ol>
    <ol id="requested_consents">
        <li>marketing_contact_email_organization</li>
        <li>marketing_contact_mobile_phone_organization</li>
        <li>marketing_contact_physical_mail_organization</li>
    </ol>
    <span id="request_date">2017-09-08</span>
    <span id="request_url">
        https://StochasticSolutions.com/give-us-your-data
    </span>
</div>

(Again, I've no idea whether this is actually what HTML-based microformats typically look like; this is purely illustrative.)
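To give a sense of how client software (a browser extension or password manager, say) might consume such a bundle, here is a minimal Python sketch that reads the hypothetical JSON request format above and fills in responses from a user's saved template. Everything here, from the format itself to the field and function names, is illustrative only.

import json

# A user's saved default preferences (purely illustrative)
MY_DEFAULTS = {
    "marketing_contact_email_organization": False,
    "marketing_contact_mobile_phone_organization": False,
    "marketing_contact_physical_mail_organization": True,
}

def suggest_responses(request_json, defaults=MY_DEFAULTS):
    # Given a consent-request bundle in the hypothetical format above, return
    # suggested responses, defaulting to False (no consent) for any requested
    # consent the user has not expressed a preference about.
    request = json.loads(request_json)
    if request.get("format") != "GDPR-Marketing-Consent-Bundle":
        raise ValueError("not a recognized consent bundle")
    return {
        consent: defaults.get(consent, False)
        for consent in request["requested_consents"]
    }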

Useful?

I don't know whether this idea is useful or feasible, nor whether it is merely a half-baked version of something that a phalanx of people in Brussels has already specified, though I did perform many seconds of arduous due diligence in the form of web searches for terms like "marketing consent microformat" without turning up anything obviously relevant.

It seems to me that if something like this were created and adopted, it might help GDPR and web/app-based consent avoid the ignominious fate of the cookie pop-ups, which were so well intentioned but such a waste of time in practice. Ideally, some kind of collaboration between the relevant part of the EU and the W3C would produce (or at least endorse) any format.

Do get in touch through any of the channels (@tdda0, mail to info@ this domain etc.) if you have thoughts.


  1. So-called personally identifiable information (PII). 

  2. Possibly even on a blockchain, if you want to be terribly au courant and have the possibility of cryptographic verification. 

  3. Technically straightforward; obviously this would require much work and hammering out of special cases and mechanisms for non-standard requirements.