Quick Reference for TDDA Library

Posted on Thu 04 May 2017 in TDDA • Tagged with tdda

A quick-reference guide ("cheat sheet") is now available for the Python TDDA library. This is linked in the sidebar and available here.

We will try to keep it up-to-date as the library evolves.

See you all at PyData London 2017 this weekend (5-6 May 2017), where we'll be running a TDDA tutorial on Friday.


Improving Rexpy

Posted on Thu 09 March 2017 in TDDA • Tagged with tdda, rexpy, regular expressions

Today we are announcing some enhancements to Rexpy, the tdda tool for finding regular expressions from examples. In short, the new version often finds more precise regular expressions than was previously the case, with the only downside being a modest increase in run-time.

Background on Rexpy is available in two previous posts:

  • This post introduced the concept and Python library
  • This post discussed the addition of coverage information—statistics about how many examples each regular expression matched.

Rexpy is also available online at https://rexpy.herokuapp.com.
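
If you prefer to drive Rexpy from Python rather than from the command line, the library exposes an extract function. Here is a minimal sketch (assuming the tdda package has been installed with pip install tdda):

# Minimal sketch of calling Rexpy from Python (assumes pip install tdda).
from tdda import rexpy

examples = ['EH1 1AA', 'B2 8EA']
for pattern in rexpy.extract(examples):   # returns a list of candidate regexes
    print(pattern)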

Weaknesses addressed

Rexpy is not intended to be an entirely general-purpose tool: it is specifically focused on the case of trying to find regular expressions to characterize the sort of structured textual data we most often see in databases and datasets. We are very interested in characterizing things like identifiers, zip codes, phone numbers, URLs, UUIDs, social security numbers and (string) bar codes, and much less interested in characterizing things like sentences, tweets, programs and encrypted text.

Within this focus, there were some obvious shortcomings of Rexpy, a significant subset of which the current release (tdda version 0.3.0) now addresses.

Example 1: Postcodes

Rexpy's tests have always included postcodes, but Rexpy never did a very good job with them. Here is the output from using tdda version 0.2.7:

$ python rexpy.py
EH1 1AA
B2 8EA

^[A-Z0-9]{2,3}\ [0-9A-Z]{3}$

Rexpy's result is completely valid, but not very specific. It has correctly identified that there are two main parts, separated by a space, that the first part is a mixture of two or three characters, each a capital letter or a number, and that the second part is exactly three characters, again all capital letters or numbers. However, it has failed to notice that the first group is one or two letters followed by a single digit, and that the second group is one digit followed by two letters. What a human would probably have written is something more like:

^[A-Z]{1,2}[0-9]\ [0-9][A-Z]{2}$

Let's try Rexpy 0.3.0.

^[A-Z]{1,2}\d\ \d[A-Z]{2}$

Now Rexpy does exactly what we would probably have wanted it to do, and has actually written it slightly more compactly—\d is any digit, i.e. it is precisely equivalent to [0-9].
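
As a quick sanity check (a sketch using only Python's standard re module), both spellings match the original examples:

import re

examples = ['EH1 1AA', 'B2 8EA']
for pattern in (r'^[A-Z]{1,2}\d\ \d[A-Z]{2}$',         # Rexpy 0.3.0's output
                r'^[A-Z]{1,2}[0-9]\ [0-9][A-Z]{2}$'):   # the [0-9] spelling
    assert all(re.match(pattern, x) for x in examples)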

With a few more examples, it still does the perfect thing (in 0.3.0).

$ rexpy
EC1 1BB
W1 0AX
M1 1AE
B33 8TH
CR2 6XH
DN55 1PT

^[A-Z]{1,2}\d{1,2}\ \d[A-Z]{2}$

(Note that the 0.3.0 release of TDDA includes wrapper scripts, rexpy and tdda, that allow the main functions to be used directly from the command line. These are installed when you pip install tdda. So the rexpy above is exactly equivalent to running python rexpy.py.)

We should note, however, that it still doesn't work perfectly if we include general London postcodes (even with 0.3.0):

$ rexpy
EC1A 1BB
W1A 0AX
M1 1AE
B33 8TH
CR2 6XH
DN55 1PT

^[A-Z0-9]{2,4}\ \d[A-Z]{2}$

In this case, the addition of the final letter in the first block for EC1A and W1A has convinced the software that the first block is just a jumble of capital letters and numbers. We might hope that examples like these (at least, if expanded) would result in something like:

^[A-Z]{1,2}\d{1,2}A?\ \d[A-Z]{2}$

though the real pattern for postcodes is actually quite complex, with only certain London postal areas being allowed a trailing letter, and only in cases where there is a single digit in the first group, and that letter can actually be A, C, P or W.

So while it isn't perfect, Rexpy is doing fairly well with postcodes now.

Let's look at another couple of examples.

Example 2: Toy Examples

Looking at the logs from the online version of Rexpy, it is clear that a lot of people (naturally) start by trying the sorts of examples commonly used for teaching regular expressions. Here are some examples motivated by what we tend to see in the logs.

First, let's try a common toy example in the old version of Rexpy (0.2.7):

$ python rexpy.py
ab
abb
abbb
abbbb
abbbbb
abbbbbb

^[a-z]+$

Not so impressive.

Now in the new version (0.3.0):

$ rexpy
ab
abb
abbb
abbbb
abbbbb
abbbbbb

^ab+$

That's more like it!

Example 3: Names

Here's another example it's got better at. First under the old version:

$ python rexpy.py
Albert Einstein
Rosalind Franklin
Isaac Newton

^[A-Za-z]+\ [A-Za-z]{6,8}$

Again, this is not wrong, but Rexpy has singularly failed to notice the pattern of capitalization.

Now, under the new version:

$ rexpy
Albert Einstein
Rosalind Franklin
Isaac Newton

^[A-Z][a-z]+\ [A-Z][a-z]{5,7}$

Better.

Incidentally, it's not doing anything special with the first character of groups. Here are some related examples:

$ rexpy
AlbertEinstein
RosalindFranklin
IsaacNewton

^[A-Z][a-z]+[A-Z][a-z]{5,7}$

Example 4: Identifiers

Some of the examples we used previously were like this (same result under both versions):

$ rexpy
123-AA-22
576-KY-18
989-BR-93

^\d{3}\-[A-Z]{2}\-\d{2}$

What worked less well in the old version were examples like these:

$ python rexpy.py
123AA22
576KY18
989BR93

^[A-Z0-9]{7}$

These work much better under 0.3.0:

$ rexpy
123AA22
576KY18
989BR93

^\d{3}[A-Z]{2}\d{2}$

Some Remaining Areas for Improvement

The changes that we've made in this release of Rexpy appear to be almost unambiguous improvements. Both from trying examples, and from understanding the underlying code changes, we can find almost no cases in which the changes make the results worse, and a great number where the results are improved. Of course, that's not to say that there don't remain areas that could be improved.

Here we summarize a few of the things we still hope to improve:

Alternations of Whole Groups

Rexpy isn't very good at generating alternations at the moment, either at a character or group level. So for example, you might have hoped that in the following example, Rexpy would notice that the middle two letters are always AA or BB (or, possibly, that the letter is repeated).

$ rexpy
123-AA-321
465-BB-763
777-AA-81
434-BB-987
101-BB-773
032-BB-881
094-AA-662

^\d{3}\-[A-Z]{2}\-\d{2,3}$

Unfortunately, it does not. (This probably won't be very hard to change.)
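
For comparison, here is the sort of expression one might hope for from this input, written by hand and checked against the examples with the standard re module (this is not Rexpy output):

import re

examples = ['123-AA-321', '465-BB-763', '777-AA-81', '434-BB-987',
            '101-BB-773', '032-BB-881', '094-AA-662']
hoped_for = r'^\d{3}\-(AA|BB)\-\d{2,3}$'   # hand-written alternation, not what Rexpy produces
assert all(re.match(hoped_for, x) for x in examples)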

Alternations within Groups

Similarly, you might hope that it would do rather better than this:

$ rexpy
Roger
Boger
Coger

^[A-Z][a-z]{4}$

Clearly, we would like this to produce

^[A-Z]oger$

and you might think from the previous examples that it would do this, but it can't combine the fixed oger with the adjacent letter range.

Too Many Expressions (or Combining Results)

Rexpy also produces rather too many regular expressions in many cases, particularly by failing to use optionals when it could. For example:

$ rexpy
Angela Carter
Barbara Kingsolver
Socrates
Cher
Martin Luther King
James Clerk Maxwell

^[A-Z][a-z]+$
^[A-Z][a-z]{5,6}\ [A-Z][a-z]+$
^[A-Z][a-z]{4,5}\ [A-Z][a-z]{4,5}\ [A-Z][a-z]+$

At least in some circumstances, we might prefer that this would produce a single result such as:

^[A-Z][a-z]+(\ [A-Z][a-z]+(\ [A-Z][a-z]+)?)?$

or, even better:

^[A-Z][a-z]+(\ [A-Z][a-z]+)*$

Although we would definitely like Rexpy to be able to produce one of these results, we don't necessarily always want this behaviour. It transpires that in a TDDA context, producing different expressions for the different cases is very often useful. So if we do crack the "combining" problem, we'll probably make it an option (perhaps with levels); that will just leave the issue of deciding on a default!
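
In the meantime, the separate expressions are easy to work with: a few lines of standard-library Python will tell you which examples each expression covers (a sketch, using the expressions shown above):

import re

examples = ['Angela Carter', 'Barbara Kingsolver', 'Socrates', 'Cher',
            'Martin Luther King', 'James Clerk Maxwell']
patterns = [r'^[A-Z][a-z]+$',
            r'^[A-Z][a-z]{5,6}\ [A-Z][a-z]+$',
            r'^[A-Z][a-z]{4,5}\ [A-Z][a-z]{4,5}\ [A-Z][a-z]+$']
for p in patterns:
    covered = [x for x in examples if re.match(p, x)]
    print('%s covers %d examples: %s' % (p, len(covered), ', '.join(covered)))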

Plans

We have ideas on how to address all of these, albeit not perfectly, so expect further improvements.

If you use Rexpy and have feedback, do let us know. You can reach us on Twitter (@tdda0), and there's also a TDDA Slack (#TDDA) that we'd be happy to invite you to.


An Error of Process

Posted on Wed 08 March 2017 in TDDA • Tagged with tdda, errors of process, errors of interpretation

Yesterday, email subscribers to the blog, and some RSS/casual viewers, will have seen a half-finished (in fact, abandoned) post that began to try to characterize success and failure on the crowd-funding platform Kickstarter.

The post was abandoned because I didn't believe its first conclusion, but unfortunately was published by mistake yesterday.

This post explains what happened and tries to salvage a "teachable moment" out of this minor fiasco.

The problem the post was trying to address

Kickstarter is a crowd-funding platform that allows people to back creative projects, usually in exchange for rewards of various kinds. Projects set a funding goal, and backers' pledges are only collected if the aggregate pledges match or exceed the funding goal during a funding period—usually 30 days.

Project Phoenix on Kickstarter, from The Icon Factory, seeks to fund the development of a new version of Twitterrific for Mac. Twitterrific was the first independent Twitter client, and was responsible for many of the things that define Twitter today.1 (You were, and are, cordially invited to break off from reading this post to go and back the project before reading on.)

Ollie is the bird in Twitterrific's icon.

At the time I started the post, the project had pledges of $63,554 towards a funding goal of $75,000 (84%) after 13 days, with 17 days to go. This is what the amount raised over time looked like (using data from Kicktraq):

Given that the amount raised each day was falling, and the cumulative total looked asymptotic, the questions I was interested in were:

  • How likely was the project to succeed (i.e. to reach its funding goal by day 30)? (In fact, it is now fully funded.)
  • How much was the project likely to raise?
  • How likely was the project to reach its stretch goal of $100,000?

The idea was to use some open data from Kickstarter and simple assumptions to try to find out what successful and unsuccessful projects look like.

Data and Assumptions

[This paragraph is unedited from the post yesterday, save that I have made the third item below bold.]

Kickstarter does not have a public API, but is scrapable. The site Web Robots makes available a series of roughly monthly scrapes of Kickstarter data from October 2015 to present; as well as seven older datasets. We have based our analysis on this data, making the following assumptions:

  1. The data is correct and covers all Kickstarter Projects
  2. That we are interpreting the fields in the data correctly
  3. Most critically: if any projects are missing from this data, the missing projects are random. Our analysis is completely invalid if failing projects are removed from the datasets.

[That last point, heavily signalled as critical, turned out not to be the case. As soon as I saw the 99.9% figure below, I went to try to validate that projects didn't go missing from month to month in the scraped data. In fact, they do, all the time, and when I realised this, I abandoned the post. There would have been other ways to try to make the prediction, but they would have been less reliable and required much more work.]

We started with the latest dataset, from 15th February 2017. This included data about 175,085 projects, which break down as follows.

Only projects with a 30-day funding period were included in the comparison, and only those for which we knew the final outcome.

count           is 30 day?
state           no     yes    TOTAL
failed      41,382  42,134   83,516
successful  44,071  31,142   75,213
canceled     6,319   5,463   11,782
suspended      463     363      826
live         2,084   1,664    3,748
TOTAL       94,319  80,766  175,085
-----------------------------------
less live:           1,664
-----------------------------------
Universe            79,102

The table showed that 80,766 of the projects are 30-day, and of these, 79,102 are not live. So this is our starting universe for analysis. NOTE: We deliberately did not exclude suspended or canceled projects, since doing so would have biased our results.

Various fields are available in the JSON data provided by Web Robots. The subset we have used is listed below, together with our interpretation of the meaning of each field; a sketch of how such data might be loaded follows the list:

  • launched_at — Unix timestamp (seconds since 1 January 1970) for the start of the funding period
  • deadline — Unix timestamp for the end of the funding period
  • state — (see above)
  • goal — the amount required to be raised for the project to be funded
  • pledged — the total amount of pledges (today); pledges can only be made during the funding period
  • currency — the currency in which the goal and pledges are made.
  • backers_count — the number of people who have pledged money.
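
As a rough sketch of how the 30-day, non-live universe might be extracted from a scrape (assuming one JSON object per project with the fields above; the real Web Robots dumps may structure the records differently, so adjust as needed):

# Sketch: extract the 30-day, non-live universe from a scrape.
# Assumes one JSON object per line with the fields listed above.
import json

THIRTY_DAYS = 30 * 24 * 60 * 60   # funding period length, in seconds

def thirty_day_non_live(path):
    universe = []
    with open(path) as f:
        for line in f:
            p = json.loads(line)
            if p['state'] == 'live':
                continue                          # exclude live projects
            if p['deadline'] - p['launched_at'] == THIRTY_DAYS:
                universe.append(p)                # keep 30-day projects only
    return universe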

Overall Statistics for 30-day, non-live projects

These are the overall statistics for our 30-day, non-live projects:

succeeded    count        %
no          47,839   60.48%
yes         31,263   39.52%
TOTAL       79,102  100.00%

Just under 40% of them succeed.

But what proportion reach 84% and still fail to reach 100%? According to the detailed data, the answer was just 0.10%, suggesting that 99.90% of 30-day projects that reached 84% of their funding goal at any stage of the campaign went on to be fully funded.

That looked wildly implausible to me, and immediately made me question whether the data I was trying to use was capable of supporting this analysis. In particular, my immediate worry was that projects that looked like they were not going to reach their goal might end up being removed—for whatever reason—more often than those that were on track. Although I have not proved that this is the case, it is clear that projects do quite often disappear between successive scrapes.

To check this, I went back over all the earlier datasets I had collected and extracted the projects that were live in those datasets. There were 47,777 such projects. I then joined those onto the latest dataset (on id) to see how many of them were still present: 15,276 (31.97%) of the once-live projects were not in the latest data.
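
A sketch of that check, assuming the once-live and latest projects have been loaded into pandas DataFrames with an id column (the names here are illustrative, not those used in the original analysis):

import pandas as pd

def missing_fraction(once_live: pd.DataFrame, latest: pd.DataFrame):
    # once_live: projects that were live in any earlier scrape
    # latest:    the most recent scrape
    live_ids = once_live[['id']].drop_duplicates()
    merged = live_ids.merge(latest[['id']].drop_duplicates(),
                            on='id', how='left', indicator=True)
    n_missing = (merged['_merge'] == 'left_only').sum()
    return n_missing, len(merged), 100.0 * n_missing / len(merged)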

It was at this point I abandoned the blog post.

Error of Process

So what did we learn?

The whole motivation for test-driven data analysis is the observation that data analysis is hard to get right, and most of us make mistakes all the time. We have previously classified these mistakes as

  • errors of interpretation (where we or a consumer of our analysis misunderstand the data, the methods, or our results)
  • errors of implementation (bugs)
  • errors of process (where we make a mistake in using our analytical process, and this leads to a false result being generated or propagated)
  • errors of applicability (where we use an analytical process with data that does not satisfy the requirements or assumptions (explicit or implicit) of the analysis).

We are trying to develop methodologies and tools to reduce the likelihood and impact of each of these kinds of errors.

While we wouldn't normally regard this blog as an analytical process, it's perhaps close enough that we can view this particular error through the TDDA lens. I was writing up the analysis as I did it, fully expecting to generate a useful post. Although I got as far as writing into the entry the (very dubious) claim that 99.9% of 30-day projects that reach 84% funding at any stage on Kickstarter go on to be fully funded, that result immediately smelled wrong and I went off to try to see whether my assumptions about the data were correct. So I was trying hard to avoid an error of interpretation.

But an error of process occurred. This blog is published using Pelican, a static site generator that I mostly quite like. The way Pelican works is posts are (usually) written in Markdown with some metadata at the top. One of the bits of metadata is a Status field, which can either be set to Draft or Published.

When writing the posts, before publishing, you can either run a local webserver to view the output, or actually post them to the main site (on Github Pages, in this case). As long as their status is set to Draft, the posts don't show up as part of the blog on either site (local or on Github), but have to be accessed through a special draft URL. Unfortunately, the draft URL is a little hard to guess, so I generally work with posts with status set to Published until I push them to Github to allow other people to review them before setting them live.

What went wrong here is that the abandoned post had its status left as Published, which was fine until I started the next post (due tomorrow) and pushed that (as draft) to Github. Needless to say, a side-effect of pushing the site with a draft of tomorrow's post was that the abandoned post got pushed too, with its status as Published. Oops!

So the learning for me is that I either have to be more careful with Status (which is optimistic) or I need to add some protection in the publishing process to stop this happening. Realistically, that probably means creating a new Status—Internal—which will get the make process to transmogrify into Published when compiling locally, and Draft when pushing to Github. That should avoid repeats of this particular error of process.
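
One hypothetical way to do that transmogrification is a small script that the make process runs before Pelican, rewriting Status: Internal according to the build target. (Internal is not a real Pelican status, and the file layout and names below are assumptions, not an existing part of this blog's setup.)

# Hypothetical pre-build step: rewrite "Status: Internal" posts before Pelican
# runs. A real setup would probably rewrite copies rather than the sources.
import glob
import re
import sys

def transmogrify(target):
    # Local previews should show Internal posts; the Github build should not.
    replacement = 'Status: Published' if target == 'local' else 'Status: Draft'
    for path in glob.glob('content/*.md'):
        with open(path) as f:
            text = f.read()
        new = re.sub(r'^Status:\s*Internal\s*$', replacement, text,
                     flags=re.MULTILINE)
        if new != text:
            with open(path, 'w') as f:
                f.write(new)

if __name__ == '__main__':
    transmogrify(sys.argv[1] if len(sys.argv) > 1 else 'local')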


  1. good things, like birds and @names and retweeting; not the abuse. 


Errors of Interpretation: Bad Graphs with Dual Scales

Posted on Mon 20 February 2017 in TDDA • Tagged with tdda, errors of interpretation, graphs

It is a primary responsibility of analysts to present findings and data clearly, in ways that minimize the likelihood of misinterpretation. Graphs should help with this, but all too often, if drawn badly (whether deliberately or through oversight), they can make misinterpretation highly likely. I want to illustrate this danger with an unfortunate graph I came across recently in a very interesting—and good, and insightful—article on the US Election.

Take a look at this graph, taken from an article called The Road to Trumpsville: The Long, Long Mistreatment of the American Working Class, by Jeremy Grantham.1

Exhibit 1: Corporate Profits and Employee Compensation

In the article, this graph ("Exhibit 1") is described as follows by Grantham:

The combined result is shown in Exhibit 1: the share of GDP going to labor hit historical lows as recently as 2014 and the share going to corporate profits hit a simultaneous high.

Is that what you interpret from the graph? I agree with these words, but they don't really sum up my first reading of the graph. Rather, I think the natural reading of the graph is as follows:

Wow: Labor's share and Capital's share of GDP crossed over, apparently for good, around 2002. Before then, Capital's share was mostly materially lower than Labor's (though they were nearly equal, briefly, in 1965, and crossed for a few years around 1995), but over the 66-year period shown Capital's share increased while Labor's fell, until Capital is now taking about four times as much as Labor.

I think something like that is what most people will read from the graph, unless they read it particularly carefully.

But that is not what this graph is saying. In fact, this is one of the most misleading graphs I have ever come across.

If you look carefully, the two lines use different scales: the red one, for Labor, uses the scale on the right, which runs from 23% to about 34%, whereas the blue line for Capital, uses the scale on the left, which runs from 3% to 11%.

Dual-scale graphs are always difficult to read; so difficult, in fact, that my personal recommendation is

Never plot data on two different scales on the same graph.

Not everyone agrees with this, but most people accept that dual-scale graphs are confusing and hard to read. Even, however, by the standards of dual scale graphs, this is bad.

Here are the problems, in roughly decreasing order of importance:

  1. The two lines are showing commensurate2 figures of roughly the same order of magnitude, so could and should have been on the same scale: this isn't a case of showing price against volume, where the units are different, or even a case in which one series is measured in millimetres and another in miles: these are both percentages, of the same thing, all between 4% and 32%.
  2. The graphs cross over when the data doesn't. The very strong suggestion from the graphs that we go from Labor's share of GDP exceeding that of Capital to being radically lower than that of Capital is entirely false.
  3. Despite measuring the same quantity, the magnification is different on the two axes (i.e. the distance on the page between ticks is different, and the percentage-point gap represented by ticks on the two scales is different). As a consequence slopes (gradients) are not comparable.
  4. Neither scale goes to zero.
  5. The position of the two series relative to their scales is inconsistent: the Labor graph goes right down to the x-axis at its minimum (23%) while the Capital graph—whose minimum is also very close to an integer percentage—does not. This adds further to the impression that Labor's share has been absolutely annihilated.
  6. There are no gridlines to help you read the data. (Sure, gridlines are chart junk3, but are especially important when different scales are used, so you have some hope of reading the values.)

I want to be clear: I am not accusing Jeremy Grantham of deliberately plotting the data in a misleading way. I do not believe he intended to distort or manipulate. I suspect he's plotted it this way because stock graphs, which may well be the graphs he most often looks at,4 are frequently plotted with false zeros. Despite this, he has unfortunately plotted the graphs in a way5 that visually distorts the data in almost exactly the way I would choose to do if I wanted to make the points he is making appear even stronger than they are.

I don't have the source numbers, so I have gone through a rather painful exercise of reading the numbers off the graph (at slightly coarser granularity) so that I can replot the graph as it should, in my opinion, have been plotted in the first place. (I apologise if I have misread any values; transcribing numbers from graphs is tedious and error-prone.) This is the result:

Exhibit 1 (revised): Same Data, with single, zero-based scale (redrawn approximation)

Even after I'd looked carefully at the scales and appreciated all the distortions in the original graph, I was quite shocked to see the data presented neutrally. To be clear: Grantham's textual summary of the data is accurate: a few years ago, Capital's share of GDP (from his figures) was at an all-time high—albeit not dramatically higher than in 1949 or around 1966—and Labor's share of GDP, a few years ago, was at an all-time low of around 23%, down from 30%. But the true picture just doesn't look like the graph Grantham showed. (Again: I feel a bit bad about going on about this graph from such a good article; but the graph encapsulates so many problematical practices that it makes a perfect illustration.)
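
If you want to redraw a dual-scale chart like this on a single, zero-based scale, a minimal matplotlib sketch follows; the values are illustrative placeholders, not the figures read off Grantham's chart:

import matplotlib.pyplot as plt

# Illustrative placeholder values only -- not the GMO/Grantham figures.
years   = [1950, 1970, 1990, 2010, 2015]
labor   = [30.0, 29.0, 28.0, 24.0, 23.5]   # share of GDP, %
capital = [ 8.0,  6.0,  7.0, 10.0, 10.5]   # share of GDP, %

fig, ax = plt.subplots()
ax.plot(years, labor, label='Labor share of GDP (%)')
ax.plot(years, capital, label='Corporate profits share of GDP (%)')
ax.set_ylim(0, 35)      # one zero-based scale for both series
ax.set_ylabel('% of GDP')
ax.grid(True)           # gridlines help the reader recover values
ax.legend()
plt.show()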

How to Lie with Statistics

In 1954, Darrell Huff published a book called How to Lie with Statistics6. Chapter 5 is called The Gee-Whiz Graph. His first example is the following (neutrally presented) graph:

Exhibit 2 (neutral): Sales Data, zero-based scale (redrawn from original)

As Huff says:

That is very well if all you want to do is convey information. But suppose you wish to win an argument, shock a reader, move him into action, sell him something. For that, this chart lacks schmaltz. Chop off the bottom.

Exhibit 2 (non-zero-based): Sales Data, non-zero-based scale (redrawn from original)

Huff continues:

That's more like it. (You've saved paper7 too, something to point out if any carping fellow objects to your misleading graphics.)

But there's more, folks:

Now that you have practised to deceive, why stop with truncating? You have one more trick available that's worth a dozen of that. It will make your modest rise of ten per cent look livelier than one hundred percent is entitled to look. Simply change the proportion between the ordinate and the abscissa:

Exhibit 2 (non-zero-based, expanded): Sales Data, non-zero-based scale, expanded effect (redrawn from original)

Both of these unfortunate practices are present in Exhibit 1, and that's before we even get to dual scales.

Errors of Interpretation

In our various overviews of test-driven data analysis (e.g., this summary), we have described four major classes of errors:

  • errors of interpretation
  • errors of implementation (bugs)
  • errors of process
  • errors of applicability

Errors of interpretation can occur at any point in the process: not only are we, the analysts, susceptible to misinterpreting our inputs, our methods, our intermediate results and our outputs, but the recipients of our insights and analyses are in even greater danger of misinterpreting our results, because they have not worked through the process and seen all that we did. As analysts, we have a special responsibility to make our results as clear as possible, and hard to misinterpret. We should assume not that the reader will be diligent, unhurried and careful, reading every number and observing every subtlety, but that she or he will be hurried and will rely on us to have brought out the salient points and to have helped the reader towards the right conclusions.

The purpose of a graph is to allow a reader to assimilate large quantities of data, and to understand patterns therein, more quickly and more easily than is possible from tables of numbers. There are strong conventions about how to do that, based on known human strengths and weaknesses as well as commonsense "fair treatment" of different series.

However well intentioned, Exhibit 1 fails in every respect: I would guess very few casual readers would get an accurate impression from the data as presented.

If data scientists had the equivalent of a Hippocratic Oath, it would be something like:

First, do not mislead.


  1. The Road to Trumpsville: The Long, Long Mistreatment of the American Working Class, by Jeremy Grantham, in the GMO Quarterly Newsletter, 4Q, 2016. https://www.gmo.com/docs/default-source/public-commentary/gmo-quarterly-letter.pdf 

  2. two variables are commensurate if they are measured in the same units and it is meaningful to make a direct comparison between them. 

  3. Tufte describes all ink on a graph that is not actually plotting data as "chart junk", and advocates "maximizing data ink" (the amount of the ink on a graph actually devoted to plotting the data points) and minimizing chart junk. These are excellent principles. The Visual Display of Quantitative Information, Edward R. Tufte, Graphics Press (Cheshire, Connecticut) 1983. 

  4. Mr Grantham works for GMO, a "global investment management firm". https://gmo.com 

  5. or chosen to use a plot drawn this way, if he isn't responsible for the plot himself 

  6. How to Lie with Statistics, Darrell Huff, published Victor Gollancz, 1954. Republished, 1973, by Pelican Books. 

  7. Obviously the "saving paper" argument had more force in 1954, and the constant references to "him", "he" and "fellows" similarly stood out less than they do today. 


TDDA 1-pager

Posted on Fri 10 February 2017 in TDDA • Tagged with tdda

We have written a 1-page summary of some of the core ideas in TDDA.

It is available as a PDF from stochasticsolutions.com/pdf/TDDA-One-Pager.pdf.