Coverage information for Rexpy
Posted on Tue 31 January 2017 in TDDA
Rexpy Stats
We previously added rexpy to the Python tdda module. Rexpy is used to find regular expressions from example strings.
One of the most common requests from Rexpy users has been for information regarding how many examples each resulting regular expression matches.
We have now added a few methods to Rexpy to support this.
Currently this is only available in the Python library for Rexpy, which is part of the tdda module and can be obtained with either
pip install tdda
or
git clone https://github.com/tdda/tdda.git
Needless to say, we also plan to use this functionality in the online version of Rexpy in the future.
Rexpy: Quick Recap
The following example shows simple use of Rexpy from Python:
$ python
>>> from tdda import rexpy
>>>
>>> corpus = ['123-AA-971', '12-DQ-802', '198-AA-045', '1-BA-834']
>>> for r in rexpy.extract(corpus):
...     print(r)
...
^\d{1,3}\-[A-Z]{2}\-\d{3}$
In this case, Rexpy found a single regular expression that matched all the strings, but in many cases it returns a list of regular expressions, each covering some subset of the examples.
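As a quick sanity check, the pattern can be tested against the examples with Python's standard re module. This is only a sketch: the pattern string is copied from the output above rather than regenerated by Rexpy, and the variable names and the extra non-matching string are our own.

```python
import re

corpus = ['123-AA-971', '12-DQ-802', '198-AA-045', '1-BA-834']

# Pattern copied from the Rexpy output above (not regenerated here).
pattern = re.compile(r'^\d{1,3}\-[A-Z]{2}\-\d{3}$')

# Every example in the corpus should match the pattern.
matches = [bool(pattern.match(s)) for s in corpus]
print(matches)  # [True, True, True, True]

# A string with four leading digits has a different shape and is rejected.
print(bool(pattern.match('1234-AA-971')))  # False
```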
The way the algorithm currently works, in most cases[1] each example will match only one regular expression, but in general, some examples might match more than one pattern. So we've designed the new functionality to work even when this is the case. We've provided three new methods on the Extractor class, which gives a more powerful API than the simple extract function.
Here's an example based on one of Rexpy's tests:
>>> urls2 = [
...     'stochasticsolutions.com/',
...     'apple.com/',
...     'stochasticsolutions.com/',  # actual duplicate
...     'http://www.stochasticsolutions.co.uk/',
...     'http://www.google.co.uk/',
...     'http://www.google.com',
...     'http://www.google.com/',
...     'http://www.guardian.co.uk/',
...     'http://www.guardian.com',
...     'http://www.guardian.com/',
...     'http://www.stochasticsolutions.com',
...     'web.stochasticsolutions.com',
...     'https://www.stochasticsolutions.com',
...     'tdda.info',
...     'gov.uk',
...     'http://web.web',
... ]
>>> x = rexpy.Extractor(urls2)
>>> for r in x.results.rex:
...     print(r)
...
^[a-z]{3,4}\.[a-z]{2,4}$
^[a-z]+\.com\/$
^[a-z]{3,4}[\.\/\:]{1,3}[a-z]+\.[a-z]{3}$
^[a-z]{4,5}\:\/\/www\.[a-z]+\.com$
^http\:\/\/www\.[a-z]{6,8}\.com\/$
^http\:\/\/www\.[a-z]+\.co\.uk\/$
As you can see, Rexpy has produced six different regular expressions,
some of which should probably be collapsed together. The Extractor
object we have created has three new methods available.
The New Coverage Methods
The simplest new method is coverage(dedup=False), which returns a list of the number of matches for each regular expression returned, in the same order as the regular expressions in x.results.rex. So:
>>> print(x.coverage())
[2, 3, 2, 4, 2, 3]
is the list of frequencies for the six regular expressions given, in order. So the pairings are illustrated by:
>>> for k, n in zip(x.results.rex, x.coverage()):
...     print('%d examples are matched by %s' % (n, k))
...
2 examples are matched by ^[a-z]{3,4}\.[a-z]{2,4}$
3 examples are matched by ^[a-z]+\.com\/$
2 examples are matched by ^[a-z]{3,4}[\.\/\:]{1,3}[a-z]+\.[a-z]{3}$
4 examples are matched by ^[a-z]{4,5}\:\/\/www\.[a-z]+\.com$
2 examples are matched by ^http\:\/\/www\.[a-z]{6,8}\.com\/$
3 examples are matched by ^http\:\/\/www\.[a-z]+\.co\.uk\/$
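To make the meaning of these counts concrete, here is a minimal sketch that reproduces them with the standard re module alone. It is not Rexpy's implementation: coverage_sketch is a hypothetical helper of our own, the patterns are copied from the output above, and the URLs are written with the http:// schemes that those patterns require.

```python
import re

# The sixteen example URLs (one exact duplicate, one https:// entry).
URLS = [
    'stochasticsolutions.com/',
    'apple.com/',
    'stochasticsolutions.com/',
    'http://www.stochasticsolutions.co.uk/',
    'http://www.google.co.uk/',
    'http://www.google.com',
    'http://www.google.com/',
    'http://www.guardian.co.uk/',
    'http://www.guardian.com',
    'http://www.guardian.com/',
    'http://www.stochasticsolutions.com',
    'web.stochasticsolutions.com',
    'https://www.stochasticsolutions.com',
    'tdda.info',
    'gov.uk',
    'http://web.web',
]

# Patterns copied from the Rexpy output above, in the same order.
PATTERNS = [
    r'^[a-z]{3,4}\.[a-z]{2,4}$',
    r'^[a-z]+\.com\/$',
    r'^[a-z]{3,4}[\.\/\:]{1,3}[a-z]+\.[a-z]{3}$',
    r'^[a-z]{4,5}\:\/\/www\.[a-z]+\.com$',
    r'^http\:\/\/www\.[a-z]{6,8}\.com\/$',
    r'^http\:\/\/www\.[a-z]+\.co\.uk\/$',
]


def coverage_sketch(patterns, examples):
    """For each pattern, count the examples it matches (duplicates included)."""
    compiled = [re.compile(p) for p in patterns]
    return [sum(1 for s in examples if c.match(s)) for c in compiled]


print(coverage_sketch(PATTERNS, URLS))  # [2, 3, 2, 4, 2, 3]
```

The counts add up to 16 here, which is another way of seeing that no example is matched by more than one of these six patterns.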
The optional dedup parameter, when set to True, requests deduplicated frequencies, i.e. ignoring any duplicate strings passed in (remembering that Rexpy strips whitespace from both ends of input strings). In this case, there is just one duplicate string (stochasticsolutions.com/). So:
>>> print(x.coverage(dedup=True))
[2, 2, 2, 4, 2, 3]
where the second number (the matches for ^[a-z]+\.com\/$) is now 2, because stochasticsolutions.com/ has been deduplicated.
We can also find the total number of examples, with or without duplicates, by calling the n_examples(dedup=False) method:
>>> print(x.n_examples())
16
>>> print(x.n_examples(dedup=True))
15
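The effect of deduplication can be illustrated without Rexpy at all. The sketch below assumes only what is described above (whitespace is stripped from both ends, then exact duplicates are ignored when dedup is true); count_examples and count_matches are hypothetical stand-ins of our own, not Rexpy methods, and the URLs are written with the http:// schemes that the patterns require.

```python
import re

URLS = [
    'stochasticsolutions.com/',
    'apple.com/',
    'stochasticsolutions.com/',  # actual duplicate
    'http://www.stochasticsolutions.co.uk/',
    'http://www.google.co.uk/',
    'http://www.google.com',
    'http://www.google.com/',
    'http://www.guardian.co.uk/',
    'http://www.guardian.com',
    'http://www.guardian.com/',
    'http://www.stochasticsolutions.com',
    'web.stochasticsolutions.com',
    'https://www.stochasticsolutions.com',
    'tdda.info',
    'gov.uk',
    'http://web.web',
]


def prepare(examples, dedup=False):
    """Strip surrounding whitespace and optionally drop exact duplicates."""
    stripped = [s.strip() for s in examples]
    return sorted(set(stripped)) if dedup else stripped


def count_examples(examples, dedup=False):
    """Total number of examples, with or without duplicates."""
    return len(prepare(examples, dedup))


def count_matches(pattern, examples, dedup=False):
    """Number of (optionally deduplicated) examples matched by one pattern."""
    return sum(1 for s in prepare(examples, dedup) if re.match(pattern, s))


print(count_examples(URLS), count_examples(URLS, dedup=True))  # 16 15

dot_com_slash = r'^[a-z]+\.com\/$'
print(count_matches(dot_com_slash, URLS))              # 3
print(count_matches(dot_com_slash, URLS, dedup=True))  # 2
```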
What we will usually be most interested in, though, is sorting the regular expressions from highest to lowest coverage, ignoring any examples matched by an earlier pattern in cases where they do overlap.
That's exactly what the incremental_coverage(dedup=False) method does for us.
It returns an ordered dictionary whose keys are the regular expressions and whose values are the incremental counts.
>>> for (k, n) in x.incremental_coverage().items():
...     print('%d: %s' % (n, k))
...
4: ^[a-z]{4,5}\:\/\/www\.[a-z]+\.com$
3: ^[a-z]+\.com\/$
3: ^http\:\/\/www\.[a-z]+\.co\.uk\/$
2: ^[a-z]{3,4}[\.\/\:]{1,3}[a-z]+\.[a-z]{3}$
2: ^[a-z]{3,4}\.[a-z]{2,4}$
2: ^http\:\/\/www\.[a-z]{6,8}\.com\/$
Each line shows a regular expression together with the number of our sixteen input strings (including duplicates) that are matched by this expression and not matched by any previous expression. (As noted earlier, that caveat probably won't make any difference at the moment, but it will in future versions.) So, to be explicit, this is saying:
- The regular expression that matches most examples is ^[a-z]{4,5}\:\/\/www\.[a-z]+\.com$ which matches 4 of the 16 strings.
- Of the remaining 12 examples, 3 are matched by ^[a-z]+\.com\/$.
- Of the remaining 9 examples, 3 more are matched by ^http\:\/\/www\.[a-z]+\.co\.uk\/$
- and so on.
Note that when two regular expressions have the same coverage, Rexpy breaks the tie by sorting them as strings.
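The ordering and tie-breaking just described can be sketched as a simple greedy loop. This is our own illustration under stated assumptions, not Rexpy's code: we assume that at each step the pattern matching the most not-yet-matched examples is chosen, with ties broken by ordinary string comparison of the patterns. incremental_coverage_sketch is a hypothetical name, and the URLs are written with the http:// schemes that the patterns require.

```python
import re
from collections import OrderedDict

URLS = [
    'stochasticsolutions.com/', 'apple.com/', 'stochasticsolutions.com/',
    'http://www.stochasticsolutions.co.uk/', 'http://www.google.co.uk/',
    'http://www.google.com', 'http://www.google.com/',
    'http://www.guardian.co.uk/', 'http://www.guardian.com',
    'http://www.guardian.com/', 'http://www.stochasticsolutions.com',
    'web.stochasticsolutions.com', 'https://www.stochasticsolutions.com',
    'tdda.info', 'gov.uk', 'http://web.web',
]

PATTERNS = [
    r'^[a-z]{3,4}\.[a-z]{2,4}$',
    r'^[a-z]+\.com\/$',
    r'^[a-z]{3,4}[\.\/\:]{1,3}[a-z]+\.[a-z]{3}$',
    r'^[a-z]{4,5}\:\/\/www\.[a-z]+\.com$',
    r'^http\:\/\/www\.[a-z]{6,8}\.com\/$',
    r'^http\:\/\/www\.[a-z]+\.co\.uk\/$',
]


def incremental_coverage_sketch(patterns, examples, dedup=False):
    """Greedy ordering: most newly matched examples first, ties by string."""
    remaining = [s.strip() for s in examples]
    if dedup:
        remaining = sorted(set(remaining))
    unused = list(patterns)
    result = OrderedDict()
    while unused:
        # Count only examples not already claimed by an earlier pattern.
        counts = {p: sum(1 for s in remaining if re.match(p, s))
                  for p in unused}
        # Highest count wins; equal counts fall back to string order.
        best = min(unused, key=lambda p: (-counts[p], p))
        result[best] = counts[best]
        remaining = [s for s in remaining if not re.match(best, s)]
        unused.remove(best)
    return result


for pattern, n in incremental_coverage_sketch(PATTERNS, URLS).items():
    print('%d: %s' % (n, pattern))
```

On this input the sketch prints the patterns in the same order, and with the same counts, as the incremental_coverage() output shown above.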
We can get the deduplicated numbers if we prefer:
>>> for (k, n) in x.incremental_coverage(dedup=True).items():
...     print('%d: %s' % (n, k))
...
4: ^[a-z]{4,5}\:\/\/www\.[a-z]+\.com$
3: ^http\:\/\/www\.[a-z]+\.co\.uk\/$
2: ^[a-z]+\.com\/$
2: ^[a-z]{3,4}[\.\/\:]{1,3}[a-z]+\.[a-z]{3}$
2: ^[a-z]{3,4}\.[a-z]{2,4}$
2: ^http\:\/\/www\.[a-z]{6,8}\.com\/$
That's all the new functionality for now. Let us know how you get on, and if you find any problems. And tweet your email address to @tdda0 if you want to join the TDDA Slack to discuss anything around the subject of test-driven data analysis.
[NOTE: This post was updated on 10.2.2017 after an update to the rexpy library changed function and attribute names from "sequential" (which was not very descriptive) to "incremental", which is better.]
[1] In fact, probably in all cases, currently.