Rexpy for Generating Regular Expressions: Postcodes
Posted on Wed 20 February 2019 in TDDA
Rexpy is a powerful tool we created that generates regular expressions from examples. It's available online at https://rexpy.herokuapp.com and forms part of our open-source TDDA library.
Miró users can use the built-in rex command.
This post illustrates using Rexpy to find regular expressions for UK postcodes.
A regular expression for Postcodes
If someone asked you what a UK postcode looks like, and you don't live in London, you'd probably say something like:
A couple of letters, then a number then a space, then a number then a couple of letters.
About the simplest way to get Rexpy to generate a regular expression is to give it at least two examples. You can do this online at https://rexpy.herokuapp.com or using the open-source TDDA library.
If you give it EH1 3LH and BB2 5NR, Rexpy generates [A-Z]{2}\d \d[A-Z]{2}.
[Screenshot: the online version of Rexpy generating this expression from the two examples.]
This is the regular-expression equivalent of what we said:
- [A-Z]{2} means exactly two ({2}) characters from the range [A-Z], i.e. two capital letters
- \d means a digit (which is the same as [0-9], i.e. a single character from the range 0 to 9)
- the gap ( ) is a space character
- \d is another digit
- [A-Z]{2} is two more letters.
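As a quick check of that reading, here is a minimal sketch using Python's standard re module rather than Rexpy itself. The variable names are purely illustrative, and the long-hand pattern is our own expansion of the generated one, not something Rexpy outputs.

```python
import re

# The pattern Rexpy generated, and the same thing spelled out long-hand:
# {2} expanded into repetition, and \d written as [0-9].
# (For ASCII input these are equivalent; Python's \d also accepts
# other Unicode digits.)
generated = re.compile(r'[A-Z]{2}\d \d[A-Z]{2}')
longhand = re.compile(r'[A-Z][A-Z][0-9] [0-9][A-Z][A-Z]')

examples = ['EH1 3LH', 'BB2 5NR']

# For each example, record whether each pattern matches the whole string.
results = [(s, bool(generated.fullmatch(s)), bool(longhand.fullmatch(s)))
           for s in examples]
for row in results:
    print(row)
```

Both patterns should accept both examples, which is what we would expect if the bullet-point reading is right.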
This doesn't cover all postcodes, but it's a good start.
Other cases
An easy way to try out the regular expression we generated is to use the grep command.¹
This is built into all Unix and Linux systems, and is available on Windows if you install a Linux distribution under WSL.
If we try matching a few postcodes using this regular expression, we'll see that many—but not all—postcodes match the pattern.
- On Linux, the particular variant of grep we need is grep -P, to tell it we're using Perl-style regular expressions.
- On Unix (e.g. Macintosh), we need to use grep -E (or egrep) to tell it we're using "extended" regular expressions.
If we write a few postcodes to a file:
$ cat > postcodes
HA2 6QD
IP4 2LS
PR1 9BW
BB2 5NR
G1 9PU
DH9 6DU
RG22 4EX
EC1A 1AB
OL14 8DQ
CT2 7UD
we can then use grep to find the lines that match:
$ grep -E '[A-Z]{2}\d \d[A-Z]{2}' postcodes
HA2 6QD
IP4 2LS
PR1 9BW
BB2 5NR
DH9 6DU
CT2 7UD
(Use -P instead of -E on Linux.)
More relevantly, for present purposes, we can also add the -v flag, which asks for the match to be "inVerted", i.e. to show the lines that fail to match:
$ grep -v -E '[A-Z]{2}\d \d[A-Z]{2}' postcodes
G1 9PU
RG22 4EX
EC1A 1AB
OL14 8DQ
- The first of these, a Glasgow postcode, fails because it only has a single letter at the start.
- The second and fourth fail because they have two digits after the letters.
- The third fails because it's a London postcode with an extra letter, A, after the EC1.
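If grep isn't to hand, the same two checks can be sketched in Python with the standard re module. This is only an approximation of grep's behaviour, and the names below are ours. It uses re.search because grep reports a line if the pattern matches anywhere in it.

```python
import re

# The ten postcodes written to the file above.
postcodes = ['HA2 6QD', 'IP4 2LS', 'PR1 9BW', 'BB2 5NR', 'G1 9PU',
             'DH9 6DU', 'RG22 4EX', 'EC1A 1AB', 'OL14 8DQ', 'CT2 7UD']

pattern = re.compile(r'[A-Z]{2}\d \d[A-Z]{2}')

# Like plain grep: keep lines that contain a match somewhere.
matching = [p for p in postcodes if pattern.search(p)]

# Like grep -v: keep lines that contain no match at all.
failing = [p for p in postcodes if not pattern.search(p)]

print(matching)
print(failing)
```

The second list should contain the same four postcodes that grep -v reported.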
Let's add an example of each in turn:
If we first add the Glasgow postcode, Rexpy generates ^[A-Z]{1,2}\d \d[A-Z]{2}$.
Here [A-Z]{1,2} means one or two capital letters, and we've checked the anchor checkbox to get Rexpy to add ^ at the start and $ at the end of the regular expression.²
If we use this with our grep command, we get:
$ grep -v -E '^[A-Z]{1,2}\d \d[A-Z]{2}$' postcodes
RG22 4EX
EC1A 1AB
OL14 8DQ
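To see what the ^ and $ anchors change, here is a small sketch using Python's re module. The sentence containing a postcode is an invented example, not taken from any real data.

```python
import re

unanchored = re.compile(r'[A-Z]{2}\d \d[A-Z]{2}')
anchored = re.compile(r'^[A-Z]{2}\d \d[A-Z]{2}$')

# A hypothetical line with other text around the postcode.
line = 'Send it to EH1 3LH, please'

# Without anchors, the pattern is found inside the longer line.
found_unanchored = bool(unanchored.search(line))

# With anchors, the whole line has to be the postcode, so this fails...
found_anchored = bool(anchored.search(line))

# ...while a line consisting of just the postcode still matches.
bare_anchored = bool(anchored.search('EH1 3LH'))

print(found_unanchored, found_anchored, bare_anchored)
```

For our file of postcodes, with one postcode per line and nothing else, the anchors make no difference to which lines match, but they stop the pattern from accepting longer strings that merely contain something postcode-like.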
If we now add in an example with two digits in the first part of the postcode, say RG22 4EX, Rexpy further refines the expression to ^[A-Z]{1,2}\d{1,2} \d[A-Z]{2}$, which is good for all(?) non-London postcodes. If we repeat the grep with this new pattern:
$ grep -v -E '^[A-Z]{1,2}\d{1,2} \d[A-Z]{2}$' postcodes
EC1A 1AB
only the London example now fails.
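The progression so far can be summarised in one go. The sketch below, again using Python's re module with illustrative names of our own, lists the postcodes each successive pattern fails to match. The first pattern is written here in its anchored form.

```python
import re

postcodes = ['HA2 6QD', 'IP4 2LS', 'PR1 9BW', 'BB2 5NR', 'G1 9PU',
             'DH9 6DU', 'RG22 4EX', 'EC1A 1AB', 'OL14 8DQ', 'CT2 7UD']

# The successive patterns from the text, in the order they were generated.
patterns = [
    r'^[A-Z]{2}\d \d[A-Z]{2}$',         # from EH1 3LH and BB2 5NR
    r'^[A-Z]{1,2}\d \d[A-Z]{2}$',       # after adding G1 9PU
    r'^[A-Z]{1,2}\d{1,2} \d[A-Z]{2}$',  # after adding RG22 4EX
]

# For each pattern, collect the postcodes it does not match (like grep -v).
failures = {p: [pc for pc in postcodes if not re.search(p, pc)]
            for p in patterns}

for p, failed in failures.items():
    print(p, failed)
```

Each added example shrinks the list of failures, from four postcodes to three to one.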
In a perfect world, just by adding EC1A 1AB, Rexpy would produce our ideal regular expression, something like ^[A-Z]{1,2}\d{1,2}[A-Z]? \d[A-Z]{2}$.
(Here, the ? is equivalent to {0,1}, meaning that the term before it can occur zero times or once, i.e. it is optional.)
Unfortunately, that's not what happens. Instead, Rexpy produces:
^[A-Z0-9]{2,4} \d[A-Z]{2}$
Rexpy has concluded that the first part is just a jumble of capital letters and digits, so the expression accepts any mixture of two to four of them there.
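To make the problem concrete, the sketch below tries this looser pattern against a few strings using Python's re module. Apart from the first two, the candidates are invented strings chosen to look wrong; they are not real postcodes.

```python
import re

loose = re.compile(r'^[A-Z0-9]{2,4} \d[A-Z]{2}$')

# Two real examples from above, then three made-up strings that
# do not have the shape of a UK postcode.
candidates = ['EC1A 1AB', 'G1 9PU', '1111 1AB', 'AAAA 1AB', '9Z 1AB']

accepted = [s for s in candidates if loose.search(s)]
print(accepted)
```

Every candidate is accepted, including the ones that start with a digit or contain no digits at all in the first part.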
In this case, we'd probably fix up the regular expression by hand, or separately pass in the special Central London postcodes and all the rest. If we feed in a few London postcodes on their own, we get:
^[A-Z]{2}\d[A-Z] \d[A-Z]{2}$
which is also a useful start.
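One way to do the fixing up by hand is to join the two separately generated patterns with alternation. The sketch below is our own assumption about how that might look in Python's re module. It is not something Rexpy produces, and it has only been checked against the ten postcodes in this post.

```python
import re

# The two patterns Rexpy generated, with their anchors removed
# so they can be combined.
general = r'[A-Z]{1,2}\d{1,2} \d[A-Z]{2}'
london = r'[A-Z]{2}\d[A-Z] \d[A-Z]{2}'

# Hand-built combination: either form, anchored at both ends.
combined = re.compile(rf'^(?:{general}|{london})$')

postcodes = ['HA2 6QD', 'IP4 2LS', 'PR1 9BW', 'BB2 5NR', 'G1 9PU',
             'DH9 6DU', 'RG22 4EX', 'EC1A 1AB', 'OL14 8DQ', 'CT2 7UD']

unmatched = [p for p in postcodes if not combined.search(p)]
print(unmatched)

# Invented strings that the looser [A-Z0-9]{2,4} pattern would let through.
rejected = [s for s in ['1111 1AB', 'AAAA 1AB'] if not combined.search(s)]
print(rejected)
```

All ten postcodes match the combined pattern, while the jumbled strings are turned away, which is closer to what we were hoping for.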
Have fun with Rexpy!
By the way: if you're in easy reach of Edinburgh, we're running a training course on the TDDA library as part of the Fringe of the Edinburgh DataFest, on 20th March. This will include use of Rexpy. You should come!
1. grep stands for global regular expression print, and the e in egrep stands for extended.
2. Sometimes, regular expressions match any line that contains the pattern anywhere in it, rather than requiring the pattern to match the whole line. In such cases, using the anchored form of the regular expression, ^[A-Z]{2}\d \d[A-Z]{2}$, means that matching lines must not contain anything before or after the text that matches the regular expression. (You can think of ^ as matching the start of the string, or line, and $ as matching the end.)