Rexpy for Generating Regular Expressions: Postcodes

Posted on Wed 20 February 2019 in TDDA

Rexpy is a powerful tool we created that generates regular expressions from examples. It's available online at https://rexpy.herokuapp.com and forms part of our open-source TDDA library.

Miró users can use the built-in rex command.

This post illustrates using Rexpy to find regular expressions for UK postcodes.

A regular expression for Postcodes

If someone asked you what a UK postcode looks like, and you don't live in London, you'd probably say something like:

A couple of letters, then a number then a space, then a number then a couple of letters.

About the simplest way to get Rexpy to generate a regular expression is to give it at least two examples. You can do this online at https://rexpy.herokuapp.com or using the open-source TDDA library.

If you give it EH1 3LH and BB2 5NR, Rexpy generates [A-Z]{2}\d \d[A-Z]{2}, as illustrated here, using the online version of rexpy:

$Rexpy online, with EH1 3LH and BB2 5NR as inputs, produces [A-Z]{2}\d \d[A-Z]{2}$

This is the regular-expression equivalent of what we said:

[A-Z]{2} means exactly two ({2}) characters from the range [A-Z], i.e. two capital letters
\d means a digit (which is the same as [0-9]—two characters from the range 0 to 9)
the gap () is a space character
\d is another digit
[A-Z]{2} is two more letters.

This doesn't cover all postcodes, but it's a good start.

Other cases

Any easy way to try out the regular expression we generated is to use the grep command¹. This is built into all Unix and Linux systems, and is available on Windows if you install a Linux distribution under WSL.

If we try matching a few postcodes using this regular expression, we'll see that many—but not all—postcodes match the pattern.

On Linux, the particular variant of grep we need is grep -P, to tell it we're using Perl-style regular expressions.
On Unix (e.g. Macintosh), we need to use grep -E (or egrep) to tell it we're using "extended" regular expressions

If we write a few postcodes to a file:

$ cat > postcodes
HA2 6QD
IP4 2LS
PR1 9BW
BB2 5NR
G1 9PU
DH9 6DU
RG22 4EX
EC1A 1AB
OL14 8DQ
CT2 7UD

we can then use grep to find the lines that match:

$ grep -E '[A-Z]{2}\d \d[A-Z]{2}' postcodes
HA2 6QD
IP4 2LS
PR1 9BW
BB2 5NR
DH9 6DU
CT2 7UD

(Use -P instead of -E on Linux.)

More relevantly, for present purposes, we can also add the -v flag, to ask the match to be "inVerted", i.e. to show lines that fail to match:

$ grep -v -E '[A-Z]{2}\d \d[A-Z]{2}' postcodes
G1 9PU
RG22 4EX
EC1A 1AB
OL14 8DQ

The first of these, a Glasgow postcode, fails because it only has a single letter at the start.
The second and fourth fail because they have two digits after the letters.
The third fails because it's a London postcode with an extra letter, A after the EC1.

Let's add an example of each in turn:

If we first add the Glasgow postcode, Rexpy generates ^[A-Z]{1,2}\d \d[A-Z]{2}$.

$Rexpy online, adding G1 9PU, produces [A-Z]{1,2}\d \d[A-Z]{2}$

Here [A-Z]{1,2} in brackets means 1–2 capital letters, and we've checked the anchor checkbox, to get it to add in ^ at the start and $ at the end of the regular expression.² If we use this with our grep command, we get:

$ grep -v -E '^[A-Z]{1,2}\d \d[A-Z]{2}$' postcodes
RG22 4EX
EC1A 1AB
OL14 8DQ

If we now add in an example with two digits in the first part of the postcode—say RG22 4EX—rexpy further refines the expression to ^[A-Z]{1,2}\d{1,2} \d[A-Z]{2}$, which is good for all(?) non-London postcodes. If we repeat the grep with this new pattern:

$ grep -v -E '^[A-Z]{1,2}\d{1,2} \d[A-Z]{2}$' postcodes
EC1A 1AB

only the London example now fails.

In a perfect world, just by adding EC1A 1AB, Rexpy would produce our ideal regular expression—something like ^[A-Z]{1,2}\d[A-Z]? \d[A-Z]{2}$. (Here, the ? is the equivalent to {0,1}, meaning that the term before can occur zero times or once, i.e. it is optional.)

Unfortunately, that's not what happens. Instead, Rexpy produces:

^[A-Z0-9]{2,4} \d[A-Z]{2}$

Unfortunately, Rexpy has concluded that the first part is just a jumble of capital letters and numbers and is saying that the first part can be any mixture of 2-4 letters and numbers.

In this case, we'd probably fix up the regular expression by hand, or separately pass in the special Central London postcodes and all the rest. If we feed in a few London postcodes on their own, we get:

^[A-Z]{2}\d[A-Z] \d[A-Z]{2}$

which is also a useful start.

Have fun with Rexpy!

By the way: if you're in easy reach of Edinburgh, we're running a training course on the TDDA library as part of the Fringe of the Edinburgh DataFest, on 20th March. This will include use of Rexpy. You should come!

Training Course on Testing Data and Data Processes

grep stands for global regular expression print, and the e in egrep stands for extended. ↩
Sometimes, regular expressions match any line that contains the pattern anywhere in them, rather than requiring the pattern to match the whole line. In such cases, using the anchored form of the regular expression, ^[A-Z]{2}\d \d[A-Z]{2}$, means that matching lines must not contain anything before or after the text that matches the regular expression. (You can think of ^ as matching the start of the string, or line, and $ as matching the end.) ↩