Introducing Rexpy: Automatic Discovery of Regular Expressions

Posted on Fri 11 November 2016 in TDDA • Tagged with tdda, constraints, pandas, regular expressions

Motivation

There's a Skyscanner data feed we have been working with for a year or so. It's produced some six million records so far, each of which has a transaction ID consisting of three parts—a four-digit alphanumeric transaction type, a numeric timestamp and a UUID, with the three parts separated by hyphens. Things like this:

adyt-1466611238-cf68496e-40f1-455e-94d9-ea13a96ff044
ooqt-1466602219-012da468-a820-11e6-8ba1-b8f6b118f191
z65e-1448755954-2d677190-ecda-4279-acb2-31a31ec8e86e

The only thing that really matters is that the transaction IDs are unique, but if everything is working correctly, the three parts should have the right structure and match data that we have in other fields in the feed.

We're pretty familiar with this data; or so we thought . . .

We've added a command to our data analysis software—Miró—for characterizing the patterns in string fields. The command is rex and when we run it on the field (tid), by saying:

[1] load skyscanner/transactions.miro
transactions.miro: 6,020,946 records; 6,020,946 (100%) selected; 57 fields.

[2] rex tid

this is the output:

^[A-Za-z0-9]{2,4}[\-\_]{1,3}\d{10}\-$
^dckx\-1466604137\-1aada032aa7348e1ac0fcfdd02a80f9c$
^[A-Za-z0-9]{2,4}[\-\_]{1,3}\d{10}\-[0-9a-f]{8}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{12}$
^q\_\_s\-\d{10}\-[0-9a-f]{8}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{12}$

The rex command has found four patterns that, between them, characterize all the values in the field, and it has expressed these in the form of regular expressions. (If you aren't familiar with regular expressions, you might want to read the linked Wikipedia article.)

The third pattern in the list is more-or-less the one we thought would characterize all the transaction IDs. It reads:

start¹ (^)
then 2--4 letters or numbers ([A-Za-z0-9]{2,4})
then a mixture one to three hyphens and underscores ([\-\_]{1,3})
then 10 digits (\d{10})
then a hyphen (\-)
then a UUID, which is a hyphen-separated collection of 28 hex digits, in groups of 8, 4, 4, 4 and 12 ([0-9a-f]{8}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{12})
and end² ($).

The only difference between this and what we expected is that it turns out that some of the "alphanumeric transaction types" end with two underscores rather than being stricly alphanumeric, and the rex command has expressed this as "2-4 alphanumeric characters followed by 1-3 hyphens or underscores", rather than "2-4 characters that are alphanumeric or underscores, followed by a hyphen", which would have been more like the way we think about it.

What are the other three expressions?

The first one is the same, but without the UUID. Occasionally the UUID is missing (null), and when this happens the UUID portion of the tid is blank. It is possible to write a regular expression that combines these two cases, but rex doesn't quite yet know how to do this. The way to do it would be to make the UUID optional by enclosing it in parentheses and following it by a question mark:

([0-9a-f]{8}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{12})?

which reads zero or one UUIDs, given that we know the pattern inside the parentheses corresponds to a UUID.

The last pattern is another one we could unify with the other two: it is the same except that it identifies a particular transaction type that again uses underscores, but now in the middle: q__s. So we could replace those three (and might, in an ideal world, want rex to find)

^[A-Za-z0-9\_]{4}[\-]\d{10}\-[0-9a-f]{8}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{12}$

This is exactly what we described, except with the inclusion of _ as an allowable character in the 4-character transaction type at the start.

But what about the second pattern?

^dckx\-1466604137\-1aada032aa7348e1ac0fcfdd02a80f9c$

It's actually a single, completely specific transaction ID, that matches the main pattern except for omitting the hyphens in the UUID. This shouldn't be possible—by which I mean, the UUID generator should never omit the hyphens. But clearly either it did for this one transaction, or something else stripped them out later. Either way, it shows that among the several million transactions, there a bad transaction ID.

A further check in Miró shows that this occurs just a single time (i.e. it is not duplicated):

[4]>  select tid = "dckx-1466604137-1aada032aa7348e1ac0fcfdd02a80f9c"
transactions.miro: 6,020,946 records; 1 (0%) selected; 57 fields.
Selection:
(= tid "dckx-1466604137-1aada032aa7348e1ac0fcfdd02a80f9c")

So we've learnt something interesting and useful about our data, and Miró's rex command has helped us produce regular expressions that characterize all our transaction IDs. It hasn't done a perfect job, but it was pretty useful, and it's easy for us to merge the three main patterns by hand. We plan to extend the functionality to cover these cases better over coming weeks.

Rexpy outside Miró

If you don't happen to have Miró, or want to find regular expressions from data outside the context of a Miró dataset, you can use equivalent functionality directly from the rexpy module of our open-source, MIT-licenced tdda package, available from Github:

git clone https://github.com/tdda/tdda.git

This provides both a Python API for finding regular expressions from example data, and a command line tool.

The Rexpy Command Line

If you just run rexpy.py from the command line, with no arguments, it expects you to type (or paste) a set of strings, and when you finish (with CTRL-D on unix-like systems, or CTRL-Z on Windows systems) it will spit out the regular expressions it thinks you need to match them: For example:

$ python rexpy.py
EH1 7JQ
WC1 4AA
G2 3PQ
^[A-Za-z0-9]{2,3}\ [A-Za-z0-9]{3}$

Or, of course, you can pipe input into it. If, for example, I do that in a folder full of photos from a Nikon camera, I get

$ ls *.* | python ~/python/tdda/rexpy/rexpy.py
^DSC\_\d{4}\.[A-Za-z]{3}$

(because these are all files like DSC_1234.NEF or DSC_9346.xmp).

You can also give it a filename as the first argument, in which case it will read strings (one per line) from a file.

So given the file ids.txt (which is in the rexpy/examples subdirectory in the tdda repository), containing:

123-AA-971
12-DQ-802
198-AA-045
1-BA-834

we can use rexpy on it by saying:

$ python rexpy.py examples/ids.txt
^\d{1,3}\-[A-Z]{2}\-\d{3}$

and if there is a header line you want to skip, you can add either -h or --header to tell Rexpy to skip that.

You can also give a filename as a second command line argument, in which case Rexpy will write the results (one per line) to that file.

Motivation for Rexpy

There's an old joke among programmers, generally attributed to Jamie Zawinski that bears repeating:

Some people, when confronted with a problem, think "I know, I'll use regular expressions."

Now they have two problems.

— Jamie Zawinski

As powerful as regular expressions are, even their best friends would probably concede that they can be hard to write, harder to read and harder still to debug.

Despite this, regular expressions are an attractive way to specify constraints on string fields for two main reasons:

First, regular expressions constitute a fast, powerful, and near-ubiquitous mechanism for describing a wide variety of possible structures in string data.
Secondly, regular expressions include the concept of capture groups. You may recall that in a grouped³ regular expression, one or more subcomponents is enclosed in parentheses and its matched value can be extracted. As we will see below, this is particulary interesting in the context of test-driven data analysis.

For concreteness, suppose we have a field containing strings such as:

123-AA-971
12-DQ-802
198-AA-045
1-BA-834

One obvious regular expression to describe these would be:

^\d+\-[A-Z]{2}\-\d{3}$

(start [^] with one or more digits [\d+], then a hyphen [\-], then two capital letters [A-Z]{2}, then another hyphen [\-], then three digits [\d{3}] then stop [$]).

But it is in the nature of regular expressions that there are both more and less specific formulations we could use to describe the same strings, ranging from the fairly specific:

^1[2389]{0,2}\-(AA|DQ|BA)\-\d{3}$

(start [^] with a 1 [1], followed by up to two digits chosen from {2, 3, 8, 9} [[2389]{0,2}], followed by a hyphen [\-] then AA, DA or BA [(AA|DQ|BA)], followed by another hypen [\-] then three digits [\d{3}] and finish [$])

to the fully general

^.*$

(start [^], then zero or more characters [.*], then stop [$]).

Going back to our first formulation

^\d+\-[A-Z]{2}\-\d{3}$

one possible grouped equivalent is

^(\d+)\-([A-Z]{2})\-(\d{3})$

The three parenthesized sections are known as groups, and regular expression implementations usually provide a way of looking up the value of these groups when a particular string is matched by it. For example, in Python we might say:

import re

pattern = re.compile(r'^(\d+)\-([A-Z]{2})\-(\d{3})$')
m = re.match(pattern, '123-ZQ-987')
if m:
    print('1: "%s"  2: "%s"  3: "%s"' % (m.group(1), m.group(2),
                                         m.group(3)))

m = re.match(pattern, '00-FT-020')
if m:
    print('1: "%s"  2: "%s"  3: "%s"' % (m.group(1), m.group(2),
                                         m.group(3)))

If we run this, the output is:

1: "123"  2: "ZQ"  3: "987"
1: "00"  2: "FT"  3: "020"

The Big Idea: Automatic Discovery of Constraints on String Structure

In the context of test-driven data analysis, the idea is probably obvious: for string fields with some kind of structure—telephone numbers, post codes, zip codes, UUIDs, more-or-less any kind of structured identifier, airline codes, airport codes, national insurance numbers, credit card numbers, bank sort codes, social security numbers—we would like to specify constraints on the values in the field using regular expressions. A natural extension to the TDDA constraints file format introduced in the last post would be something along the lines of:⁴

"regex": "^\\d+\\-[A-Z]{2}\\-\\d{3}$"

if there is a single regular expression that usefully matches all the allowed field values. If a field contains strings in multiple formats that are so different that using a single regular expression would be unhelpful, we might instead provide a list of regular expressions, such as:

"regex": ["^\\d+\\-[A-Z]{2}\\-\\d{3}$", "^[A-Z]{5}\\+\\d{5}$"]

which would mean each field values should match at least one of the regular expressions in the list.

Just as with the automatic discovery of other types of constraints, we want the TDDA constraint discovery library to be able to suggest suitable regular expression constraints on string fields, where appropriate. This is where Rexpy comes in:

Rexpy is a library for finding regular expressions that usefully characterize a given corpus of strings.

We are choosing to build Rexpy as a stand-alone module because it has clear utility outside the context of constraint generation.

The Other Idea: Automatically discovered Quasi-Fields

We can imagine going beyond simply using (and automatically discovering) regular expressions to describe constraints on string data. Once we have useful regular expressions that characterize some string data—and more particularly, in cases where we have a single regular expression that usefully describes the structure of the string—we can tag meaningful subcomponents. In the example we used above, we had three groups:

(\d+) — the digits at the start of the identifier
([A-Z]{2}) — the pair of letters in the middle
(\d{3}) — the three digits at the end

It's not totally trivial to work out which subcomponents are useful to tag, but I think we could probably find pretty good heuristics that would do a reasonable job, at least in simple cases. Once we know the groups, we can potentially start to treat them as quasi-fields in their own right. So in this case, if we had a field ID containing string identifiers like those shown, we might create from that three quasi fields as follows:

ID_qf1, of type int, values 123, 12, 198, and 1
ID_qf2, of type string, values AA, DQ, AA, and BA
ID_qf3, of type int, values 971, 802, 45 and 834

Once we have these quasi fields, we can potentially subject them to the usual TDDA constraint generation process, which might suggest extra, stronger constraints. For example, we might find the the numbers in ID_qf3 are unique, or form a contiguous sequence, and we might find that although our regular expression only specified that there were two letters in the middle, in fact the only combinations found in the (full) data were AA, DQ, BA and BB.

I don't want to suggest that all of this is easy: there are three or four non-trivial steps to get from where Rexpy is today to this full vision:

First, it has to get better at merging related regular expressions into a useful single regular expression with optional components and alternations than it is today.
Secondly, it would have to be able to identify good subcomponents for grouping.
Thirdly, it would have to do useful type inference on the groups it identifies.
Finally, it would have to be extended to create the quasi fields and apply the TDDA discovery process to them.

But none of this seems hopelessly difficult. So continue to watch this space.

The Rexpy API

Assuming you have cloned the TDDA library somewhere on your PYTHONPATH, you should then be able to use it through the API as follows:

from tdda import rexpy

corpus = ['123-AA-971', '12-DQ-802', '198-AA-045', '1-BA-834']
results = rexpy.extract(corpus)
print('Number of regular expressions found: %d' % len(results))
for rex in results:
    print('   ' +  rex)

which produces:

$ python ids.py
Number of regular expressions found: 1
   ^\d{1,3}\-[A-Z]{2}\-\d{3}$

In general, Rexpy returns a list of regular expressions, and at the moment it is not very good at merging them. There's quite a lot of unused code that, I hope, will soon allow it to do so. But even as it is, it can do a reasonable job of characterizing simple strings. Within reason, the more examples you give it, the better it can do, and it is reasonably performant with hundreds of thousands or even millions of strings.

Rexpy: the Pandas interface

There's also a Pandas binding. You can say:

from tdda import rexpy
results = rexpy.pdextract(df['A'])

to find regular expressions for the strings in column A of a dataframe df, or

from tdda import rexpy
results = rexpy.pdextract([df['A'], df['B'], df['C']])

to find a single set of regular expressions that match all the strings from columns A, B and C. In all cases, null (pandas.np.NaN) values are ignored. The results are returned as a (Python) list of regular expressions, as strings.

Final Words

Take it for a spin and let us know how you get on.

As always, follow us or tweet at us (@tdda0) if you want to hear more, and watch out for the TDDA Slack team, which will be opening up very soon.

More precisely, ^ matches start of line; by default, rex always starts regular expressions with ^ and finishes them with $ on the assumption that strings will be presented one-per-line ↩
Again, more precisely, $ matches end of line. ↩
Grouped regular expressions are also referred to variously as marked or tagged regular expressions, and the groups are also sometimes known as subexpressions. ↩
One thing to notice here is that in JSON we need extra backslashes in the regular expression. This is because regular expressions themselves make fairly liberal use of backslashes, and JSON uses backslash as an escape character. We could avoid this in Python by using raw strings, which are introduced with an r prefix (e.g. '^(\d+)\-([A-Z]{2})\-(\d{3})$'). In such strings, backslashes are not treated in any special way. Since JSON has no equivalent mechanism, we have to escape all our backslashes, leading to the ugliness above. ↩

The TDDA Constraints File Format

Posted on Fri 04 November 2016 in TDDA • Tagged with tdda, constraints, pandas

Background

We recently extended the tdda library to include support for automatic discovery of constraints from datasets, and for verification of datasets against constraints. Yesterday's post—Constraint Discovery and Verification for Pandas DataFrames—describes these developments and the API.

The library we published is intended to be a base for producing various implementations of the constraint discovery and verification process, and uses a JSON file format (extension .tdda) to save constraints in a form that should be interchangable between implementations. We currently have two compatible implementations—the open-source Pandas code in the library and the implementation in our own analytical software, Miró.

This post describes the .tdda JSON file format. The bulk of it is merely a snapshot of the documentation shipped with the library in the Github repository (visible on Github). We intend to keep that file up to date as we expand the format.

The TDDA JSON File Format

The TDDA constraints library (Repository https://github.com/tdda/tdda, module constraints) uses a JSON file to store constraints.

This document describes that file format.

Purpose

TDDA files describe constraints on a dataset, with a view to verifying the dataset to check whether any or all of the specified constraints are satisfied.

A dataset is assumed to consist of one or more fields (also known as columns), each of which has a (different) name¹ and a well-defined type.² Each field has a value for each of a number of records (also known as rows). In some cases, values may be null (or missing).³ Even a field consisting entirely of nulls can be considered to have a type.

Familiar examples of datasets include:

tables in relational databases
DataFrames in (Pandas and R)
flat ("CSV") files (subject to type inference or assigment)
sheets in spreadsheets, or areas within sheets, if the columns have names, are not merged, and have values with consistent meanings and types over an entire column
more generally, many forms of tabular data.

In principle, TDDA files are intended to be capable of supporting any kind of constraint regarding datasets. Today, we are primarily concerned with * field types * minimum and maximum values (or in the case of string fields, minumum and maximum string lengths) * whether nulls are allowed, * whether duplicate values are allowed within a field * the allowed values for a field.

The format also has support for describing relations between fields.

Future extensions we can already foresee include:

dataset-level constraints (e.g. numbers of records; required or disallowed fields)
sortedness of fields, of field values or both
regular expressions to which string fields should conform
constraints on subsets of the data (e.g. records dated after July 2016 should not have null values for the ID field)
constraints on substructure within fields (e.g. constraints on tagged subexpressions from regular expressions to which string fields are expected to conform)
potentially checksums (though this is more suitable for checking the integreity of transfer of a specific dataset, than for use across multiple related datasets)
constraints between datasets, most obviously key relations (e.g. every value field KA in dataset A should also occur in field KB in dataset B).

The motivation for generating, storing and verifying datasets against such sets of constraints is that they can provide a powerful way of detecting bad or unexpected inputs to, or outputs from, a data analysis process. They can also be valuable as checks on intermediate results. While manually generating constraints can be burdensome, automatic discovery of constraints from example datasets, potentially followed by manual removal of over-specific constraints, provides a good cost/benefit ratio in many situations.

Filename and encoding

The preferred extension for TDDA Constraints files is .tdda.
TDDA constraints files must be encoded as UTF-8.
TDDA files must be valid JSON.

Example

This is an extremely simple example TDDA file:

{
    "fields": {
        "a": {
            "type": "int",
            "min": 1,
            "max": 9,
            "sign": "positive",
            "max_nulls": 0,
            "no_duplicates": true
        },
        "b": {
            "type": "string",
            "min_length": 3,
            "max_length": 3,
            "max_nulls": 1,
            "no_duplicates": true,
            "allowed_values": [
                "one",
                "two"
            ]
        }
    }
}

General Structure

A TDDA file is a JSON dictionary. There are currently two supported top-level keys:

fields: constraints for individual fields, keyed on the field name. (In TDDA, we generally refer to dataset columns as fields.)
field_groups: constraints specifying relations between multiple fields (two, for now). field_groups constraints are keyed on a comma-separated list of the names of the fields to which they relate, and order is significant.

Both top-level keys are optional.

In future, we expect to add further top-level keys (e.g. for possible constraints on the number of rows, required or disallowed fields etc.)

The order of constraints in the file is immaterial (of course; this is JSON), though writers may choose to present fields in a particular order, e.g. dataset order or sorted on fieldname.

Field Constraints

The value of a field constraints entry (in the fields section) is a dictionary keyed on constraint kind. For example, the constraints on field a in the example above are specified as:

"a": {
    "type": "int",
    "min": 1,
    "max": 9,
    "sign": "positive",
    "max_nulls": 0,
    "no_duplicates": true
}

The TDDA library currently recognizes the following kinds of constraints:

type
min
max
min_length
max_length
sign
sign
max_nulls
no_duplicates
allowed_values

Other constraint libraries are free to define their own, custom kinds of constraints. We will probably recommend that non-standard constraints have names beginning with colon-terminated prefix. For example, if we wanted to support more specific Pandas type constraints, we would probably use a key such as pandas:type for this.

The value of a constraint is often simply a scalar value, but can be a list or a dictionary; when it is a dictionary, it should include a key value, with the principle value associated with the constraint (true, if there is no specific value beyond the name of the constraint).

If the value of a constraint (the scalar value, or the value key if the value is a dictionary) is null, this is taken to indicate the absence of a constraint. A constraint with value null should be completely ignored, so that a constraints file including null-valued constraints should produce identical results to one omitting those constraints. (This obviously means that we are discouraging using null as a meaningful constraint value, though a string "null" is fine, and in fact we use this for sign constraints.)

The semantics and values of the standard field constraint types are as follows:

type: the allowed (standard, TDDA) type of the field. This can be a single value from bool (boolean), int (integer; whole-numbered); real (floating point values); string (unicode in Python3; byte string in Python2) or date (any kind of date or date time, with or without timezone information). It can also be a list of such allowed values (in which case, order is not significant).

It is up to the generation and verification libraries to map between the actual types in whatever dataset/dataframe/table/... object is used and these TDDA constraint types, though over time we may provide further guidance.

Examples:
- {"type": "int"}
- {"type": ["int", "real"]}
min: the minimum allowed value for a field. This is often a simple value, but in the case of real fields, it can be convenient to specify a level of precision. In particular, a minimum value can have precision (default: fuzzy):
- closed: all non-null values in the field must be greater than or equal to the value specified.
- open: all non-null values in the field must be strictly greater than the value specified.
- fuzzy: when the precision is specified as fuzzy, the verifier should allow a small degree of violation of the constraint without generating a failure. Verifiers take a parameter, epsilon, which specifies how the fuzzy constraints should be taken to be: epsilon is a fraction of the constraint value by which field values are allowed to exceed the constraint without being considered to fail the constraint. This defaults to 0.01 (i.e. 1%). Notice that this means that constraint values of zero are never fuzzy.
Examples are:
- {"min": 1},
- {"min": 1.2},
- {"min": {"value": 3.4}, {"precision": "fuzzy"}}.
JSON, does not—of course—have a date type. TDDA files specifying dates should use string representations in one of the following formats:
- YYYY-MM-DD for dates without times
- YYYY-MM-DD hh:mm:ss for date-times without timezone
- YYYY-MM-DD hh:mm:ss [+-]ZZZZ for date-times with timezone.
We recommend that writers use precisely these formats, but that readers offer some flexibility in reading, e.g. accepting / as well as - to separate date components, and T as well as space to separate the time component from the date.
max: the maximum allowed value for a field. Much like min, but for maximum values. Examples are:
- {"max": 1},
- {"max": 1.2},
- {"max": {"value": 3.4}, {"precision": "closed"}}.
Dates should be formatted as for min values.
min_length: the minimum allowed length of strings in a string field. How unicode strings are counted is up to the implementation. Example:
- {"min_length": 2}
max_length: the minimum allowed length of strings in a string field. How unicode strings are counted is up to the implementation.
- {"max_length": 22}
sign: For numeric fields, the allowed sign of (non-null) values. Although this overlaps with minimum and maximum values, it it often useful to have a separate sign constraint, which carries semantically different information. Allowed values are:
- positive: All values must be greater than zero
- non-negative: No value may be less than zero
- zero: All values must be zero
- non-positive: No value may be greater than zero
- negative: All values must be negative
- null: No signed values are allowed, i.e. the field must be entirely null.
Example:
- {"sign": "non-negative"}
max_nulls: The maximum number of nulls allowed in the field. This can be any non-negative value. We recommend only writing values of zero (no nulls values are allowed) or 1 (At most a single null is allowed) into this constraint, but checking against any value found.

Example:
- {"max_nulls": 0}
no_duplicates: When this constraint is set on a field (with value true), it means that each non-null value must occur only once in the field. The current implementation only uses this constraint for string fields.

Example:
- {"no_duplicates": true}
allowed_values: The value of this constraint is a list of allowed values for the field. The order is not significant.

Example:
- {"allowed_values": ["red", "green", "blue"]}

MultiField Constraints

Multifield constraints are not yet being generated by this implementation, though our (proprietary) Miró implementation does produce them. The currently planned constraint types for field relations cover field equality and inequality for pairs of fields, with options to specify null relations too.

A simple example would be:

"field_groups": {
    "StartDate,EndDate": {"lt": true}
}

This is a less-than constraint, to be interpreted as

StartDate < EndDate wherever StartDate and EndDate and both non-null.

The plan is to support the obvious five equality and inequality relations:

lt: first field value is strictly less than the second field value for each record
lte: first field value is less than or equal to the second field value for each record
eq: first field value is equal to the second field value for each record
gte: first field value is greater than or equal to the second field value for each record
gt: first field value is strictly greater than the second field value for each record.

In the case of equality (only), we will probably also support a precision parameter with values fuzzy or precise.

There should probably also be an option to specify relations between null values in pairs of columns, either as a separate constraint or as a quality on each of the above.

Pandas, of course, allows multiple columns to have the same name. This format makes no concessions to such madness, though there is nothing to stop a verifier or generator sharing constraints across all columns with the same name. The Pandas generators and verifiers in this library do not currently attempt to do this. ↩
Pandas also allows columns of mixed type. Again, this file format does not recognize such columns, and it would probably be sensible not to use type constraints for columns of mixed type. ↩
Pandas uses not-a-number (pandas.np.NaN) to represent null values for numeric, string and boolean fields; it uses a special not-a-time (pandas.NaT) value to represent null date (Timestamp) values. ↩

Constraint Discovery and Verification for Pandas DataFrames

Posted on Thu 03 November 2016 in TDDA • Tagged with tdda, constraints, pandas

Background

In a previous post, Constraints and Assertions, we introduced the idea of using constraints to verify input, output and intermediate datasets for an analytical process. We also demonstrated that candidate constraints can be automatically generated from example datasets. We prototyped this in our own software (Miró) expressing constraints as lisp S-expressions.

Improving and Extending the Approach: Open-Source Pandas Code

We have now taken the core ideas, polished them a little and made them available through an open-source library, currently on Github. We will push it to PyPI when it has solidified a little further.

The constraint code I'm referring to is available in the constraints module of the tdda repository for the tdda user on github. So if you issue the command

git clone https://github.com/tdda/tdda.git

in a directory somewhere on your PYTHONPATH, this should enable you to use it.

The TDDA library:

is under an MIT licence;
runs under Python2 and Python3;
other than Pandas itself, has no dependencies outside the standard library unless you want to use feather files (see below);
includes a base layer to help with building constraint verification and discovery libraries for various systems;
includes Pandas implementations of constraint discovery and verification through a (Python) API;
uses a new JSON format (normally stored in .tdda files) for saving constraints;
also includes a prototype command-line tool for verifying a dataframe stored in feather format against a .tdda file of constraints. Feather is a file format developed by Wes McKinney (the original creator of Pandas) and Hadley Wickham (of ggplot and tidyverse fame) for dataframes that allows interchange between R and Pandas while preserving type information and exact values. It is based on the Apache Arrow project. It can be used directly or using our ugly-but-useful extension library pmmif, which allows extra metadata (including extended type information) to be saved alongside a .feather file, in a companion .pmm file.

Testing

All the constraint-handling code is in the constraints module within the TDDA repository.

After you've cloned the repository, it's probably a good idea to run the tests. There are two sets, and both should run under Python2 or Python3.

$ cd tdda/constraints
$ python testbase.py
.....
----------------------------------------------------------------------
Ran 5 tests in 0.003s

OK

$ python testpdconstraints.py
.....................
----------------------------------------------------------------------
Ran 21 tests in 0.123s

OK

There is example code (which we'll walk through below) in the examples subdirectory of constraints.

Basic Use

Constraint Discovery

Here is some minimal code for getting the software to discover constraints satisfied by a Pandas DataFrame:

import pandas as pd
from tdda.constraints.pdconstraints import discover_constraints

df = pd.DataFrame({'a': [1,2,3], 'b': ['one', 'two', pd.np.NaN]})
constraints = discover_constraints(df)
with open('/tmp/example_constraints.tdda', 'w') as f:
    f.write(constraints.to_json())

(This is the core of the example code in tdda/constraints/examples/simple_discovery.py, and is included in the docstring for the discover_constraints function.)

Hopefully the code is fairly self-explanatory, but walking through the lines after the imports:

We first generate a trivial 3-row DataFrame with an integer column a and a string column b. The string column includes a Pandas null (NaN).
We then pass that DataFrame to the discover_constraints function from tdda.constraints.pdconstraints, and it returns a DatasetConstraints object, which is defined in tdda.constraints.base.
The resulting constraints object has a to_json() method which converts the structured constraints into a JSON string.
In the example, we write that to /tmp/example_constraints.tdda; we encourage everyone to use the .tdda extension for these JSON constraint files.

This is what happens if we run the example file:

$ cd tdda/constraints/examples

$ python simple_discovery.py
Written /tmp/example_constraints.tdda successfully.

$ cat /tmp/example_constraints.tdda
{
    "fields": {
        "a": {
            "type": "int",
            "min": 1,
            "max": 9,
            "sign": "positive",
            "max_nulls": 0,
            "no_duplicates": true
        },
        "b": {
            "type": "string",
            "min_length": 3,
            "max_length": 3,
            "max_nulls": 1,
            "no_duplicates": true,
            "allowed_values": [
                "one",
                "two"
            ]
        }
    }
}

As you can see, in this case, the system has 'discovered' six constraints for each field, and it's rather easy to read what they are and at least roughly what they mean. We'll do a separate post describing the .tdda JSON file format, but it's documented in tdda/constraints/tdda_json_file_format.md in the repository (which—through almost unfathomable power of Github—means you can see it formatted here).

Constraint Verification

Now that we have a .tdda file, we can use it to verify a DataFrame.

First, let's look at code that should lead to a successful verification (this code is in tdda/constraints/examples/simple_verify_pass.py).

import pandas as pd
from tdda.constraints.pdconstraints import verify_df

df = pd.DataFrame({'a': [2, 4], 'b': ['one', pd.np.NaN]})
v = verify_df(df, 'example_constraints.tdda')

print('Passes: %d' % v.passes)
print('Failures: %d\n\n\n' % v.failures)
print(str(v))
print('\n\n')
print(v.to_frame())

Again, hopefully the code is fairly self-explanatory, but:

df is a DataFrame that is different from the one we used to generate the constraints, but is nevertheless consistent with the constraints.
The verify_df function from tdda.constraints.pdconstraints takes a DataFrame and the location of a .tdda file and verifies the DataFrame against the constraints in the file. The function returns a PandasVerification object. The PandasVerification class is a subclass of Verification from tdda.constraints.base, adding the ability to turn the verification object into a DataFrame.
All verification objects include passes and failures attributes, which respectively indicate the number of constraints that passed and the number that failed. So the simplest complete verification is simply to check that v.failures == 0.
Verification methods also include a __str__ method. Its output currently includes a section for fields and a summary.
the .to_frame() method converts a PandasVerification into a Pandas DataFrame.

If we run this, the result is as follows:

$ python simple_verify_pass.py
Passes: 12
Failures: 0



FIELDS:

a: 0 failures  6 passes  type ✓  min ✓  max ✓  sign ✓  max_nulls ✓  no_duplicates ✓

b: 0 failures  6 passes  type ✓  min_length ✓  max_length ✓  max_nulls ✓  no_duplicates ✓  allowed_values ✓

SUMMARY:

Passes: 12
Failures: 0



  field  failures  passes  type   min min_length   max max_length  sign  \
0     a         0       6  True  True        NaN  True        NaN  True
1     b         0       6  True   NaN       True   NaN       True   NaN

  max_nulls no_duplicates allowed_values
0      True          True            NaN
1      True          True           True

In the DataFrame produced from the Verification

True indicates a constraint that was satisfied in the dataset;
False indicates a constraint that was not satisfied in the dataset;
NaN (null) indicates a constraint that was not present for that field.

As you would expect, the field, failures and passes columns are, respectively, the name of the field, the number of failures and the number of passes for that field.

If we now change the DataFrame definition to:

df = pd.DataFrame({'a': [0, 1, 2, 10, pd.np.NaN],
                   'b': ['one', 'one', 'two', 'three', pd.np.NaN]})

(as is the case in tdda/constraints/examples/simple_verify_fail.py), we now expect some constraint failures. If we run this, we see:

$ python simple_verify_fail.py
Passes: 5
Failures: 7



FIELDS:

a: 4 failures  2 passes  type ✓  min ✗  max ✗  sign ✗  max_nulls ✗  no_duplicates ✓

b: 3 failures  3 passes  type ✓  min_length ✓  max_length ✗  max_nulls ✓  noh_duplicates ✗  allowed_values ✗

SUMMARY:

Passes: 5
Failures: 7



  field  failures  passes  type    min min_length    max max_length   sign  \
0     a         4       2  True  False        NaN  False        NaN  False
1     b         3       3  True    NaN       True    NaN      False    NaN

  max_nulls no_duplicates allowed_values
0     False          True            NaN
1      True         False          False

Final Notes

There are more options and there's more to say about the Pandas implementation, but that's probably enough for one post. We'll have follow-ups on the file format, more options, and the foibles of Pandas.

If you want to hear more, follow us on twitter at @tdda0.

WritableTestCase: Example Use

Posted on Sun 18 September 2016 in TDDA • Tagged with tdda

In my PyCon UK talk yesterday I promised to update the and document the copy of writabletestcase.WritableTestCase on GitHub.

The version I've put up is not quite as powerful as the example I showed in the talk—that will follow—but has the basic functionality.

I've now added examples to the repository and, below, show how these work.

The library is available with

git clone https://github.com/tdda/tdda.git

WritableTestCase extends unittest.TestCase, from the Python's standard library, in three main ways:

It provides methods for testing strings produced in memory or files written to disk against reference results in files. When a test fails, rather than just showing a hard-to-read difference, it writes the actual result to file (if necessary) and then shows the diff command needed to compare it—something like this:
```
Compare with "diff /path/to/actual-output /path/to/expected-output"
```
Obviously, the diff command can be replaced with a graphical diff tool, an open command or whatever.

Although this shouldn't be necessary (see below), you also have the option, after verification, or replacing diff with cp to copy the actual output as the new reference output.
Secondly, the code supports excluding lines of the output contain nominated strings. This is often handy for excluding things like date stamps, version numbers, copyright notices etc. These often appear in output, and vary, without affecting the semantics.

(The version of the library I showed at PyCon had more powerful variants of this, which I'll add later.)
Thirdly, if you verify that the new output is correct, the library supports re-running with the -w flag to overwrite the expected ("reference") results with the ones generated by the code.

Obviously, if this feature is abused, the value of the tests will be lost, but provided you check the output carefully before re-writing, this is a significant convenience.

The example code is in the examples subdirectory, called test_using_writabletestcase.py. It has two test functions, one of which generates HTML output as a string, and the other of which produces some slightly different HTML output as a file. In each case, the output produced by the function is not identical to the expected "reference" output (in examples/reference), but differs only on lines containing "Copyright" and "Version". Since these are passed into the test as exclusions, the tests should pass.

Here is the example code:

# -*- coding: utf-8 -*-
"""
test_using_writabletestcase.py: A simple example of how to use
tdda.writabletestcase.WritableTestCase.

Source repository: https://github.com/tdda/tdda

License: MIT

Copyright (c) Stochastic Solutions Limited 2016
"""
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals

import os
import tempfile

from tdda import writabletestcase
from tdda.examples.generators import generate_string, generate_file


class TestExample(writabletestcase.WritableTestCase):
    def testExampleStringGeneration(self):
        """
        This test uses generate_string() from tdda.examples.generators
        to generate some HTML as a string.

        It is similar to the reference HTML in
        tdda/examples/reference/string_result.html except that the
        Copyright and version lines are slightly different.

        As shipped, the test should pass, because the ignore_patterns
        tell it to ignore those lines.

        Make a change to the generation code in the generate_string
        function in generators.py to change the HTML output.

        The test should then fail and suggest a diff command to run
        to see the difference.

        Rerun with

            python test_using_writabletestcase.py -w

        and it should re-write the reference output to match your
        modified results.
        """
        actual = generate_string()
        this_dir = os.path.abspath(os.path.dirname(__file__))
        expected_file = os.path.join(this_dir, 'reference',
                                     'string_result.html')
        self.check_string_against_file(actual, expected_file,
                                       ignore_patterns=['Copyright',
                                                        'Version'])


    def testExampleFileGeneration(self):
        """
        This test uses generate_file() from tdda.examples.generators
        to generate some HTML as a file.

        It is similar to the reference HTML in
        tdda/examples/reference/file_result.html except that the
        Copyright and version lines are slightly different.

        As shipped, the test should pass, because the ignore_patterns
        tell it to ignore those lines.

        Make a change to the generation code in the generate_file function
        in generators.py to change the HTML output.

        The test should then fail and suggest a diff command to run
        to see the difference.

        Rerun with

            python test_using_writabletestcase.py -w

        and it should re-write the reference output to match your
        modified results.
        """
        outdir = tempfile.gettempdir()
        outpath = os.path.join(outdir, 'file_result.html')
        generate_file(outpath)
        this_dir = os.path.abspath(os.path.dirname(__file__))
        expected_file = os.path.join(this_dir, 'reference',
                                     'file_result.html')
        self.check_file(outpath, expected_file,
                        ignore_patterns=['Copyright', 'Version'])


if __name__ == '__main__':
    writabletestcase.main(argv=writabletestcase.set_write_from_argv())

If you download it, and try running it, you should output similar to the following:

$ python test_using_writabletestcase.py
..
----------------------------------------------------------------------
Ran 2 tests in 0.004s

OK

The reference output files it compares against are:

examples/reference/string_result.html
examples/reference/file_result.html

To see what happens when there's a difference, try editing one or both of the main functions that generate the HTML in generators.py. They're most just using explicit strings, so the simplest thing is just to change a word or something in the output.

If I change It's to It is in the generate_string() function and rerun, I get this output:

$ python test_using_writabletestcase.py
.
File check failed.
Compare with "diff /var/folders/w7/lhtph66x7h33t9pns0616qk00000gn/T/string_result.html /Users/njr/python/tdda/examples/reference/string_result.html".

Note exclusions:
Copyright
Version
F
======================================================================
FAIL: testExampleStringGeneration (__main__.TestExample)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_using_writabletestcase.py", line 55, in testExampleStringGeneration
    'Version'])
  File "/Users/njr/python/tdda/writabletestcase.py", line 294, in check_string_against_file
    self.assertEqual(failures, 0)
AssertionError: 1 != 0

----------------------------------------------------------------------
Ran 2 tests in 0.005s

FAILED (failures=1)
1 godel:$

If I then run the diff command it suggests, the output is:

$ diff /var/folders/w7/lhtph66x7h33t9pns0616qk00000gn/T/string_result.html /Users/njr/python/tdda/examples/reference/string_result.html
5,6c5,6
<     Copyright (c) Stochastic Solutions, 2016
<     Version 1.0.0
—
>     Copyright (c) Stochastic Solutions Limited, 2016
>     Version 0.0.0
29c29
<     It is not terribly exciting.
—
>     It's not terribly exciting.

Here you can see the differences that are excluded, and the change I actually made.

(The version I showed at PyCon has an option to see the only the non-excluded differences, but this version doesn't; that will come!)

If I now run again using -w, to re-write the reference output, it shows:

$ python test_using_writabletestcase.py -w
.Expected file /Users/njr/python/tdda/examples/reference/string_result.html written.
.
----------------------------------------------------------------------
Ran 2 tests in 0.003s

OK

And, of course, if I run a third time, without -w, the test now passes:

$ python test_using_writabletestcase.py
..
----------------------------------------------------------------------
Ran 2 tests in 0.003s

OK

So that's a quick overview of it works.

Slides and Rough Transcript of TDDA talk from PyCon UK 2016

Posted on Sat 17 September 2016 in TDDA • Tagged with tdda

Python UK 2016, Cardiff.

I gave a talk on test-driven data analysis at PyCon UK 2016, Cardiff, today.

The slides (which are kind-of useless without the words) are available here.

More usefully, a rough transcript, with thumbnail slides, is available here.

Older Posts Newer Posts