Motivation
There's a Skyscanner data feed we have been
working with for a year or so. It's produced some six million records
so far, each of which has a transaction ID consisting of three
parts—a four-digit alphanumeric transaction type, a numeric
timestamp and a UUID, with the three parts separated by
hyphens. Things like this:
adyt-1466611238-cf68496e-40f1-455e-94d9-ea13a96ff044
ooqt-1466602219-012da468-a820-11e6-8ba1-b8f6b118f191
z65e-1448755954-2d677190-ecda-4279-acb2-31a31ec8e86e
The only thing that really matters is that the transaction IDs are unique,
but if everything is working correctly, the three parts should have the
right structure and match data that we have in other fields in the feed.
We're pretty familiar with this data; or so we thought . . .
We've added a command to our data analysis
software—Miró—for
characterizing the patterns in string fields. The command is rex,
and when we run it on the field (tid) by saying:
[1] load skyscanner/transactions.miro
transactions.miro: 6,020,946 records; 6,020,946 (100%) selected; 57 fields.
[2] rex tid
this is the output:
^[A-Za-z0-9]{2,4}[\-\_]{1,3}\d{10}\-$
^dckx\-1466604137\-1aada032aa7348e1ac0fcfdd02a80f9c$
^[A-Za-z0-9]{2,4}[\-\_]{1,3}\d{10}\-[0-9a-f]{8}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{12}$
^q\_\_s\-\d{10}\-[0-9a-f]{8}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{12}$
The rex command has found four patterns that, between them,
characterize all the values in the field, and it has expressed these in
the form of regular expressions. (If you aren't familiar with regular
expressions, you might want to read the linked Wikipedia article.)
The third pattern in the list is more-or-less the one we thought would
characterize all the transaction IDs. It reads:
- start (^)
- then 2-4 letters or numbers ([A-Za-z0-9]{2,4})
- then a mixture of one to three hyphens and underscores ([\-\_]{1,3})
- then 10 digits (\d{10})
- then a hyphen (\-)
- then a UUID, which is a hyphen-separated collection of 32 hex digits, in
  groups of 8, 4, 4, 4 and 12
  ([0-9a-f]{8}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{12})
- and end ($).
The only difference between this and what we expected is that it turns
out that some of the "alphanumeric transaction types" end with two
underscores rather than being strictly alphanumeric, and the rex command
has expressed this as "2-4 alphanumeric characters followed by
1-3 hyphens or underscores", rather than "2-4 characters that are
alphanumeric or underscores, followed by a hyphen", which would have
been more like the way we think about it.
What are the other three expressions?
The first one is the same, but without the UUID. Occasionally the UUID
is missing (null), and when this happens the UUID portion of the tid
is blank. It is possible to write a regular expression that combines
these two cases, but rex doesn't quite yet know how to do this. The
way to do it would be to make the UUID optional by enclosing it in
parentheses and following it by a question mark:
([0-9a-f]{8}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{12})?
which reads zero or one UUIDs, given that we know the pattern inside the
parentheses corresponds to a UUID.
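To see that this combined form behaves as intended, here is a minimal Python
sketch; this is not something rex produces, and the UUID-less example value is
made up to illustrate the blank-UUID case:
import re

# Sketch: the transaction-ID pattern with the UUID made optional by
# enclosing it in a group followed by a question mark.
pattern = re.compile(
    r'^[A-Za-z0-9]{2,4}[\-\_]{1,3}\d{10}\-'
    r'([0-9a-f]{8}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{12})?$'
)

print(bool(pattern.match('adyt-1466611238-cf68496e-40f1-455e-94d9-ea13a96ff044')))  # True
print(bool(pattern.match('adyt-1466611238-')))                                      # True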
The last pattern is another one we could unify with the other two:
it is the same except that it identifies a particular transaction type
that again uses underscores, but now in the middle: q__s. So we could
replace those three with a single expression (which, in an ideal world, we
might want rex to find):
^[A-Za-z0-9\_]{4}[\-]\d{10}\-[0-9a-f]{8}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{12}$
This is exactly what we described, except with the inclusion of _ as an
allowable character in the 4-character transaction type at the start.
But what about the second pattern?
^dckx\-1466604137\-1aada032aa7348e1ac0fcfdd02a80f9c$
It's actually a single, completely specific transaction ID that matches
the main pattern except for omitting the hyphens in the UUID.
This shouldn't be possible—by which I mean, the UUID generator should
never omit the hyphens. But clearly either it did for this one transaction,
or something else stripped them out later. Either way, it shows that among
the several million transactions there is a bad transaction ID.
A further check in Miró shows that this occurs just a single time (i.e.
it is not duplicated):
[4]> select tid = "dckx-1466604137-1aada032aa7348e1ac0fcfdd02a80f9c"
transactions.miro: 6,020,946 records; 1 (0%) selected; 57 fields.
Selection:
(= tid "dckx-1466604137-1aada032aa7348e1ac0fcfdd02a80f9c")
So we've learnt something interesting and useful about our data, and
Miró's rex command has helped us produce regular expressions that
characterize all our transaction IDs. It hasn't done a perfect job,
but it was pretty useful, and it's easy for us to merge the three
main patterns by hand. We plan to extend the functionality to cover these
cases better over the coming weeks.
Rexpy outside Miró
If you don't happen to have Miró, or want to find regular expressions from
data outside the context of a Miró dataset, you can use equivalent
functionality directly from the rexpy module of our open-source,
MIT-licensed tdda package, available from GitHub:
git clone https://github.com/tdda/tdda.git
This provides both a Python API for finding regular expressions from
example data, and a command line tool.
The Rexpy Command Line
If you just run rexpy.py from the command line, with no arguments,
it expects you to type (or paste) a set of strings, and when you finish
(with CTRL-D on Unix-like systems, or CTRL-Z on Windows systems)
it will spit out the regular expressions it thinks you need to match them.
For example:
$ python rexpy.py
EH1 7JQ
WC1 4AA
G2 3PQ
^[A-Za-z0-9]{2,3}\ [A-Za-z0-9]{3}$
Or, of course, you can pipe input into it. If, for example, I do that
in a folder full of photos from a Nikon camera, I get
$ ls *.* | python ~/python/tdda/rexpy/rexpy.py
^DSC\_\d{4}\.[A-Za-z]{3}$
(because these are all files like DSC_1234.NEF or DSC_9346.xmp).
You can also give it a filename as the first argument, in which case
it will read strings (one per line) from a file.
So given the file ids.txt (which is in the rexpy/examples subdirectory
in the tdda repository), containing:
123-AA-971
12-DQ-802
198-AA-045
1-BA-834
we can use rexpy on it by saying:
$ python rexpy.py examples/ids.txt
^\d{1,3}\-[A-Z]{2}\-\d{3}$
and if there is a header line you want to skip, you can add either -h
or --header to tell Rexpy to skip it.
You can also give a filename as a second command line argument, in which
case Rexpy will write the results (one per line) to that file.
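For example (the output filename here is purely illustrative):
$ python rexpy.py examples/ids.txt ids-patterns.txt
would write the single pattern shown above to ids-patterns.txt.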
Motivation for Rexpy
There's an old joke among programmers, generally attributed to
Jamie Zawinski, that bears repeating:
Some people, when confronted with a problem, think
"I know, I'll use regular expressions."
Now they have two problems.
— Jamie Zawinski
As powerful as regular expressions are, even their
best friends would probably concede that they can be hard
to write, harder to read and harder still to debug.
Despite this, regular expressions are an attractive way to specify
constraints on string fields for two main reasons:
- First, regular expressions constitute a fast, powerful, and
  near-ubiquitous mechanism for describing a wide variety of
  possible structures in string data.
- Secondly, regular expressions include the concept of
  capture groups. You may recall that in a grouped regular expression,
  one or more subcomponents is enclosed in parentheses and its matched
  value can be extracted. As we will see below, this is particularly
  interesting in the context of test-driven data analysis.
For concreteness, suppose we have a field containing strings such as:
123-AA-971
12-DQ-802
198-AA-045
1-BA-834
One obvious regular expression to describe these would be:
^\d+\-[A-Z]{2}\-\d{3}$
(start [^] with one or more digits [\d+], then a hyphen [\-], then two
capital letters [A-Z]{2}, then another hyphen [\-], then three digits
[\d{3}], then stop [$]).
But it is in the nature of regular expressions that there are both more
and less specific formulations we could use to describe the same strings,
ranging from the fairly specific:
^1[2389]{0,2}\-(AA|DQ|BA)\-\d{3}$
(start [^] with a 1 [1], followed by up to two digits chosen from
{2, 3, 8, 9} [[2389]{0,2}], followed by a hyphen [\-], then AA, DQ or BA
[(AA|DQ|BA)], followed by another hyphen [\-], then three digits [\d{3}],
and finish [$])
to the fully general:
^.*$
(start [^], then zero or more characters [.*], then stop [$]).
Going back to our first formulation, one possible grouped equivalent is:
^(\d+)\-([A-Z]{2})\-(\d{3})$
The three parenthesized sections are known as groups, and
regular expression implementations usually provide a way of looking up
the value of each group when a particular string matches the pattern.
For example, in Python we might say:
import re
pattern = re.compile(r'^(\d+)\-([A-Z]{2})\-(\d{3})$')
m = re.match(pattern, '123-ZQ-987')
if m:
    print('1: "%s" 2: "%s" 3: "%s"' % (m.group(1), m.group(2),
                                       m.group(3)))
m = re.match(pattern, '00-FT-020')
if m:
    print('1: "%s" 2: "%s" 3: "%s"' % (m.group(1), m.group(2),
                                       m.group(3)))
If we run this, the output is:
1: "123" 2: "ZQ" 3: "987"
1: "00" 2: "FT" 3: "020"
The Big Idea: Automatic Discovery of Constraints on String Structure
In the context of test-driven data analysis, the idea is probably
obvious: for string fields with some kind of structure—telephone
numbers, post codes, zip codes, UUIDs, more-or-less any kind of
structured identifier, airline codes, airport codes, national
insurance numbers, credit card numbers, bank sort codes, social
security numbers—we would like to specify constraints on the values
in the field using regular expressions. A natural extension to the
TDDA constraints file format introduced in the last post would be
something along the lines of:
"regex": "^\\d+\\-[A-Z]{2}\\-\\d{3}$"
if there is a single regular expression that usefully matches all the
allowed field values. If a field contains strings in multiple
formats that are so different that using a single regular expression
would be unhelpful, we might instead provide a list of regular
expressions, such as:
"regex": ["^\\d+\\-[A-Z]{2}\\-\\d{3}$", "^[A-Z]{5}\\+\\d{5}$"]
which would mean each field value should match at least one of the regular
expressions in the list.
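To make the intended semantics concrete, here is a minimal Python sketch of
that "match at least one" rule; this is an illustration, not the tdda
library's own verification code:
import re

regexes = [r'^\d+\-[A-Z]{2}\-\d{3}$', r'^[A-Z]{5}\+\d{5}$']

def satisfies_regex_constraint(value, regexes):
    # A value passes if it matches at least one expression in the list.
    return any(re.match(rx, value) for rx in regexes)

print(satisfies_regex_constraint('123-AA-971', regexes))       # True
print(satisfies_regex_constraint('ABCDE+12345', regexes))      # True
print(satisfies_regex_constraint('not-like-either', regexes))  # False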
Just as with the automatic discovery
of other types of constraints,
we want the TDDA constraint discovery library to be able to suggest
suitable regular expression constraints on string fields, where appropriate.
This is where Rexpy comes in:
Rexpy is a library for finding regular expressions that usefully
characterize a given corpus of strings.
We are choosing to build Rexpy as a stand-alone module because it has
clear utility outside the context of constraint generation.
The Other Idea: Automatically Discovered Quasi-Fields
We can imagine going beyond simply using (and automatically
discovering) regular expressions to describe constraints on string data.
Once we have useful regular expressions that characterize some
string data—and more particularly, in cases where we have a single
regular expression that usefully describes the structure of the string—we
can tag meaningful subcomponents. In the example we used above, we had
three groups:
- (\d+) — the digits at the start of the identifier
- ([A-Z]{2}) — the pair of letters in the middle
- (\d{3}) — the three digits at the end
It's not totally trivial to work out which subcomponents are useful to tag,
but I think we could probably find pretty good heuristics that would do
a reasonable job, at least in simple cases. Once we know the groups, we
can potentially start to treat them as quasi-fields in their own right.
So in this case, if we had a field ID containing string identifiers
like those shown, we might create from it three quasi fields as follows:
- ID_qf1, of type int, values 123, 12, 198, and 1
- ID_qf2, of type string, values AA, DQ, AA, and BA
- ID_qf3, of type int, values 971, 802, 45, and 834
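To make the idea concrete, here is a small sketch of how such quasi fields
might be derived using the grouped regular expression from earlier; this is
illustrative only, not part of Miró or rexpy:
import re

# Grouped regular expression from earlier; each group becomes a quasi field.
pattern = re.compile(r'^(\d+)\-([A-Z]{2})\-(\d{3})$')
ids = ['123-AA-971', '12-DQ-802', '198-AA-045', '1-BA-834']

ID_qf1, ID_qf2, ID_qf3 = [], [], []
for value in ids:
    m = pattern.match(value)
    if m:
        ID_qf1.append(int(m.group(1)))   # leading digits, as int
        ID_qf2.append(m.group(2))        # middle letter pair, as string
        ID_qf3.append(int(m.group(3)))   # trailing digits, as int

print(ID_qf1)  # [123, 12, 198, 1]
print(ID_qf2)  # ['AA', 'DQ', 'AA', 'BA']
print(ID_qf3)  # [971, 802, 45, 834]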
Once we have these quasi fields, we can potentially subject them to the
usual TDDA constraint generation process, which might suggest extra,
stronger constraints. For example, we might find that the numbers in
ID_qf3 are unique, or form a contiguous sequence, and we might find
that although our regular expression only specified that there were two
letters in the middle, in fact the only combinations found in the
(full) data were AA, DQ, BA and BB.
I don't want to suggest that all of this is easy: there are three or four
non-trivial steps to get from where Rexpy is today to this full vision:
- First, it has to get better at merging related regular expressions
  into a useful single regular expression with optional components
  and alternations than it is today.
- Secondly, it would have to be able to identify good subcomponents
  for grouping.
- Thirdly, it would have to do useful type inference on the
  groups it identifies.
- Finally, it would have to be extended to create the quasi fields
  and apply the TDDA discovery process to them.
But none of this seems hopelessly difficult. So continue to watch this space.
The Rexpy API
Assuming you have cloned the TDDA library somewhere on your PYTHONPATH,
you should then be able to use it through the API as follows:
from tdda import rexpy
corpus = ['123-AA-971', '12-DQ-802', '198-AA-045', '1-BA-834']
results = rexpy.extract(corpus)
print('Number of regular expressions found: %d' % len(results))
for rex in results:
    print(' ' + rex)
which produces:
$ python ids.py
Number of regular expressions found: 1
^\d{1,3}\-[A-Z]{2}\-\d{3}$
In general, Rexpy returns a list of regular expressions, and at the moment
it is not very good at merging them. There's quite a lot of unused code
that, I hope, will soon allow it to do so. But even as it is, it can do
a reasonable job of characterizing simple strings. Within reason, the
more examples you give it, the better it can do, and it is reasonably
performant with hundreds of thousands or even millions of strings.
Rexpy: the Pandas interface
There's also a Pandas binding. You can say:
from tdda import rexpy
results = rexpy.pdextract(df['A'])
to find regular expressions for the strings in column A of a dataframe df, or
from tdda import rexpy
results = rexpy.pdextract([df['A'], df['B'], df['C']])
to find a single set of regular expressions that match all the strings from
columns A, B and C. In all cases, null (pandas.np.NaN) values
are ignored. The results are returned as a (Python) list of regular
expressions, as strings.
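Putting that together, a minimal self-contained example might look like this
(the DataFrame and its values are invented for illustration, and the exact
expression returned could differ):
import pandas as pd
from tdda import rexpy

# Illustrative data only; any pandas string column will do.
df = pd.DataFrame({'A': ['123-AA-971', '12-DQ-802', '198-AA-045', '1-BA-834']})

results = rexpy.pdextract(df['A'])
for rex in results:
    print(rex)   # e.g. ^\d{1,3}\-[A-Z]{2}\-\d{3}$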
Final Words
Take it for a spin and let us know how you get on.
As always, follow us or tweet at us (@tdda0)
if you want to hear more, and watch out for the TDDA Slack
team, which will be opening up very soon.