Quick Reference for TDDA Library

Posted on Thu 04 May 2017 in TDDA • Tagged with tdda

A quick-reference guide ("cheat sheet") is now available for the Python TDDA library. This is linked in the sidebar and available here.

We will try to keep it up-to-date as the library evolves.

See you all at PyData London 2017 this weekend (5-6 May 2017), where we'll be running a TDDA tutorial on Friday.


Improving Rexpy

Posted on Thu 09 March 2017 in TDDA • Tagged with tdda, rexpy, regular expressions

Today we are announcing some enhancements to Rexpy, the tdda tool for finding regular expressions from examples. In short, the new version often finds more precise regular expressions than was previously the case, with the only downside being a modest increase in run-time.

Background on Rexpy is available in two previous posts:

  • This post introduced the concept and Python library
  • This post discussed the addition of coverage information—statistics about how many examples each regular expression matched.

Rexpy is also available online at https://rexpy.herokuapp.com.
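
If you prefer to drive Rexpy from Python rather than from the command line, the library exposes an extract function. Here is a minimal sketch (assuming the tdda package has been installed with pip install tdda):

# Minimal sketch of calling Rexpy from Python (assumes pip install tdda).
from tdda import rexpy

examples = ['EH1 1AA', 'B2 8EA']
for pattern in rexpy.extract(examples):   # returns a list of candidate regexes
    print(pattern)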

Weaknesses addressed

Rexpy is not intended to be an entirely general-purpose tool: it is specifically focused on the case of trying to find regular expressions to characterize the sort of structured textual data we most often see in databases and datasets. We are very interested in characterizing things like identifiers, zip codes, phone numbers, URLs, UUIDs, social security numbers and (string) bar codes, and much less interested in characterizing things like sentences, tweets, programs and encrypted text.

Within this focus, there were some obvious shortcomings of Rexpy, a significant subset of which the current release (tdda version 0.3.0) now addresses.

Example 1: Postcodes

Rexpy's tests have always included postcodes, but Rexpy never did a very good job with them. Here is the output from using tdda version 0.2.7:

$ python rexpy.py
EH1 1AA
B2 8EA

^[A-Z0-9]{2,3}\ [0-9A-Z]{3}$

Rexpy's result is completely valid, but not very specific. It has correctly identified that there are two main parts, separated by a space, that the first part is a mixture of two or three characters, each a capital letter or a number, and that the second part is exactly three characters, again all capital letters or numbers. However, it has failed to notice that the first group is one or two letters followed by a single digit, and that the second group is one digit followed by two letters. What a human would probably have written is something more like:

^[A-Z]{1,2}[0-9]\ [0-9][A-Z]{2}$

Let's try Rexpy 0.3.0.

^[A-Z]{1,2}\d\ \d[A-Z]{2}$

Now Rexpy does exactly what we would probably have wanted it to do, and has actually written it slightly more compactly—\d is any digit, i.e. it is precisely equivalent to [0-9].
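
As a quick sanity check (a sketch using only Python's standard re module), both spellings match the original examples:

import re

examples = ['EH1 1AA', 'B2 8EA']
for pattern in (r'^[A-Z]{1,2}\d\ \d[A-Z]{2}$',         # Rexpy 0.3.0's output
                r'^[A-Z]{1,2}[0-9]\ [0-9][A-Z]{2}$'):   # the [0-9] spelling
    assert all(re.match(pattern, x) for x in examples)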

With a few more examples, it still does the perfect thing (in 0.3.0).

$ rexpy
EC1 1BB
W1 0AX
M1 1AE
B33 8TH
CR2 6XH
DN55 1PT

^[A-Z]{1,2}\d{1,2}\ \d[A-Z]{2}$

(Note that the 0.3.0 release of TDDA includes wrapper scripts, rexpy and tdda, that allow the main functions to be used directly from the command line. These are installed when you pip install tdda. So the rexpy above is exactly equivalent to running python rexpy.py.)

We should note, however, that it still doesn't work perfectly if we include general London postcodes (even with 0.3.0):

$ rexpy
EC1A 1BB
W1A 0AX
M1 1AE
B33 8TH
CR2 6XH
DN55 1PT

^[A-Z0-9]{2,4}\ \d[A-Z]{2}$

In this case, the addition of the final letter in the first block for EC1A and W1A has convinced the software that the first block is just a jumble of capital letters and numbers. We might hope that examples like these (at least, if expanded) would result in something like:

^[A-Z]{1,2}\d{1,2}A?\ \d[A-Z]{2}$

though the real pattern for postcodes is actually quite complex, with only certain London postal areas being allowed a trailing letter, and only in cases where there is a single digit in the first group, and that letter can actually be A, C, P or W.

So while it isn't perfect, Rexpy is doing fairly well with postcodes now.

Let's look at another couple of examples.

Example 2: Toy Examples

Looking at the logs from the online version of Rexpy, it is clear that a lot of people (naturally) start by trying the sorts of examples commonly used for teaching regular expressions. Here are some examples motivated by what we tend to see in the logs.

First, let's try a common toy example in the old version of Rexpy (0.2.7):

$ python rexpy.py
ab
abb
abbb
abbbb
abbbbb
abbbbbb

^[a-z]+$

Not so impressive.

Now in the new version (0.3.0):

$ rexpy
ab
abb
abbb
abbbb
abbbbb
abbbbbb

^ab+$

That's more like it!

Example 3: Names

Here's another example it's got better at. First under the old version:

$ python rexpy.py
Albert Einstein
Rosalind Franklin
Isaac Newton

^[A-Za-z]+\ [A-Za-z]{6,8}$

Again, this is not wrong, but Rexpy has singularly failed to notice the pattern of capitalization.

Now, under the new version:

$ rexpy
Albert Einstein
Rosalind Franklin
Isaac Newton

^[A-Z][a-z]+\ [A-Z][a-z]{5,7}$

Better.

Incidentally, it's not doing anything special with the first character of groups. Here are some related examples:

$ rexpy
AlbertEinstein
RosalindFranklin
IsaacNewton

^[A-Z][a-z]+[A-Z][a-z]{5,7}$

Example 4: Identifiers

Some of the examples we used previously were like this (same result under both versions):

$ rexpy
123-AA-22
576-KY-18
989-BR-93

^\d{3}\-[A-Z]{2}\-\d{2}$

What worked less well in the old version were examples like these:

$ python rexpy.py
123AA22
576KY18
989BR93

^[A-Z0-9]{7}$

These work much better under 0.3.0:

$ rexpy
123AA22
576KY18
989BR93

^\d{3}[A-Z]{2}\d{2}$

Some Remaining Areas for Improvement

The changes that we've made in this release of Rexpy appear to be almost unambiguous improvements. Both from trying examples, and from understanding the underlying code changes, we can find almost no cases in which the changes make the results worse, and a great number where the results are improved. Of course, that's not to say that there don't remain areas that could be improved.

Here we summarize a few of the things we still hope to improve:

Alternations of Whole Groups

Rexpy isn't very good at generating alternations at the moment, either at a character or group level. So for example, you might have hoped that in the following example, Rexpy would notice that the middle two letters are always AA or BB (or, possibly, that the letter is repeated).

$ rexpy
123-AA-321
465-BB-763
777-AA-81
434-BB-987
101-BB-773
032-BB-881
094-AA-662

^\d{3}\-[A-Z]{2}\-\d{2,3}$

Unfortunately, it does not. (This probably won't be very hard to change.)
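
For comparison, here is the sort of expression one might hope for from this input, written by hand and checked against the examples with the standard re module (this is not Rexpy output):

import re

examples = ['123-AA-321', '465-BB-763', '777-AA-81', '434-BB-987',
            '101-BB-773', '032-BB-881', '094-AA-662']
hoped_for = r'^\d{3}\-(AA|BB)\-\d{2,3}$'   # hand-written alternation, not what Rexpy produces
assert all(re.match(hoped_for, x) for x in examples)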

Alternations within Groups

Similarly, you might hope that it would do rather better than this:

$ rexpy
Roger
Boger
Coger

^[A-Z][a-z]{4}$

Clearly, we would like this to produce

^[A-Z]oger$

and you might think from the previous examples that it would do this, but it can't combine the fixed oger with the adjacent letter range.

Too Many Expressions (or Combining Results)

Rexpy also produces rather too many regular expressions in many cases, particularly by failing to use optionals when it could. For example:

$ rexpy
Angela Carter
Barbara Kingsolver
Socrates
Cher
Martin Luther King
James Clerk Maxwell

^[A-Z][a-z]+$
^[A-Z][a-z]{5,6}\ [A-Z][a-z]+$
^[A-Z][a-z]{4,5}\ [A-Z][a-z]{4,5}\ [A-Z][a-z]+$

At least in some circumstances, we might prefer that this would produce a single result such as:

^[A-Z][a-z]+(\ [A-Z][a-z]+(\ [A-Z][a-z]+)?)?$

or, even better:

^[A-Z][a-z]+(\ [A-Z][a-z]+)*$

Although we would definitely like Rexpy to be able to produce one of these results, we don't necessarily always want this behaviour. It transpires that in a TDDA context, producing different expressions for the different cases is very often useful. So if we do crack the "combining" problem, we'll probably make it an option (perhaps with levels); that will just leave the issue of deciding on a default!
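
In the meantime, the separate expressions are easy to work with: a few lines of standard-library Python will tell you which examples each expression covers (a sketch, using the expressions shown above):

import re

examples = ['Angela Carter', 'Barbara Kingsolver', 'Socrates', 'Cher',
            'Martin Luther King', 'James Clerk Maxwell']
patterns = [r'^[A-Z][a-z]+$',
            r'^[A-Z][a-z]{5,6}\ [A-Z][a-z]+$',
            r'^[A-Z][a-z]{4,5}\ [A-Z][a-z]{4,5}\ [A-Z][a-z]+$']
for p in patterns:
    covered = [x for x in examples if re.match(p, x)]
    print('%s covers %d examples: %s' % (p, len(covered), ', '.join(covered)))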

Plans

We have ideas on how to address all of these, albeit not perfectly, so expect further improvements.

If you use Rexpy and have feedback, do let us know. You can reach us on Twitter (@tdda0), and there's also a TDDA Slack (#TDDA) that we'd be happy to invite you to.


An Error of Process

Posted on Wed 08 March 2017 in TDDA • Tagged with tdda, errors of process, errors of interpretation

Yesterday, email subscribers to the blog, and some RSS/casual viewers, will have seen a half-finished (in fact, abandoned) post that began to try to characterize success and failure on the crowd-funding platform Kickstarter.

The post was abandoned because I didn't believe its first conclusion, but unfortunately was published by mistake yesterday.

This post explains what happened and tries to salvage a "teachable moment" out of this minor fiasco.

The problem the post was trying to address

Kickstarter is a crowd-funding platform that allows people to back creative projects, usually in exchange for rewards of various kinds. Projects set a funding goal, and backers' pledges are only collected if the aggregate pledges match or exceed the funding goal during a funding period—usually 30 days.

Project Phoenix on Kickstarter, from The Icon Factory, seeks to fund the development of a new version of Twitterrific for Mac. Twitterrific was the first independent Twitter client, and was responsible for many of the things that define Twitter today.1 (You were, and are, cordially invited to break off from reading this post to go and back the project before reading on.)

Ollie is the bird in Twitterrific's icon.

At the time I started the post, the project had pledges of $63,554 towards a funding goal of $75,000 (84%) after 13 days, with 17 days to go. This is what the amount raised over time looked like (using data from Kicktraq):

Given that the amount raised each day was falling, and the cumulative total looked asymptotic, the questions I was interested in were:

  • How likely was the project to succeed (i.e. to reach its funding goal by day 30)? (In fact, it is now fully funded.)
  • How much was the project likely to raise?
  • How likely was the project to reach its stretch goal of $100,000?

The idea was to use some open data from Kickstarter and simple assumptions to try to find out what successful and unsuccessful projects look like.

Data and Assumptions

[This paragraph is unedited from the post yesterday, save that I have made the third item below bold.]

Kickstarter does not have a public API, but is scrapable. The site Web Robots makes available a series of roughly monthly scrapes of Kickstarter data from October 2015 to present; as well as seven older datasets. We have based our analysis on this data, making the following assumptions:

  1. The data is correct and covers all Kickstarter Projects
  2. That we are interpreting the fields in the data correctly
  3. Most critically: if any projects are missing from this data, the missing projects are random. Our analysis is completely invalid if failing projects are removed from the datasets.

[That last point, heavily signalled as critical, turned out not to be the case. As soon as I saw the 99.9% figure below, I went to try to validate that projects didn't go missing from month to month in the scraped data. In fact, they do, all the time, and when I realised this, I abandoned the post. There would have been other ways to try to make the prediction, but they would have been less reliable and required much more work.]

We started with the latest dataset, from 15th February 2017. This included data about 175,085 projects, which break down as follows.

Only projects with a 30-day funding period were included in the comparison, and only those for which we knew the final outcome.

count           is 30 day?
state           no     yes    TOTAL
failed      41,382  42,134   83,516
successful  44,071  31,142   75,213
canceled     6,319   5,463   11,782
suspended      463     363      826
live         2,084   1,664    3,748
TOTAL       94,319  80,766  175,085
-----------------------------------
less live:           1,664
-----------------------------------
Universe            79,102

The table showed that 80,766 of the projects are 30-day, and of these, 79,102 are not live. So this is our starting universe for analysis. NOTE: We deliberately did not exclude suspended or canceled projects, since doing so would have biased our results.

Various fields are available in the JSON data provided by Web Robots. The subset we have used is listed below, together with our interpretation of the meaning of each field; a sketch of how such data might be loaded follows the list:

  • launched_at — Unix timestamp (seconds since 1 January 1970) for the start of the funding period
  • deadline — Unix timestamp for the end of the funding period
  • state — (see above)
  • goal — the amount required to be raised for the project to be funded
  • pledged — the total amount of pledges (today); pledges can only be made during the funding period
  • currency — the currency in which the goal and pledges are made.
  • backers_count — the number of people who have pledged money.
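
As a rough sketch of how the 30-day, non-live universe might be extracted from a scrape (assuming one JSON object per project with the fields above; the real Web Robots dumps may structure the records differently, so adjust as needed):

# Sketch: extract the 30-day, non-live universe from a scrape.
# Assumes one JSON object per line with the fields listed above.
import json

THIRTY_DAYS = 30 * 24 * 60 * 60   # funding period length, in seconds

def thirty_day_non_live(path):
    universe = []
    with open(path) as f:
        for line in f:
            p = json.loads(line)
            if p['state'] == 'live':
                continue                          # exclude live projects
            if p['deadline'] - p['launched_at'] == THIRTY_DAYS:
                universe.append(p)                # keep 30-day projects only
    return universe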

Overall Statistics for 30-day, non-live projects

These are the overall statistics for our 30-day, non-live projects:

succeeded    count        %
no          47,839   60.48%
yes         31,263   39.52%
TOTAL       79,102  100.00%

Just under 40% of them succeed.

But what proportion reach 84% and still fail to reach 100%? According to the detailed data, the answer was just 0.10%, suggesting that 99.90% of 30-day projects that reached 84% of their funding goal at any stage of the campaign went on to be fully funded.

That looked wildly implausible to me, and immediately made me question whether the data I was trying to use was capable of supporting this analysis. In particular, my immediate worry was that projects that looked like they were not going to reach their goal might end up being removed—for whatever reason—more often than those that were on track. Although I have not proved that this is the case, it is clear that projects do quite often disappear between successive scrapes.

To check this, I went back over all the earlier datasets I had collected and extracted the projects that were live in those datasets. There were 47,777 such projects. I then joined those onto the latest dataset (on id) to see how many of them were still present: 15,276 (31.97%) of the once-live projects were not in the latest data.
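
A sketch of that check, assuming the once-live and latest projects have been loaded into pandas DataFrames with an id column (the names here are illustrative, not those used in the original analysis):

import pandas as pd

def missing_fraction(once_live: pd.DataFrame, latest: pd.DataFrame):
    # once_live: projects that were live in any earlier scrape
    # latest:    the most recent scrape
    live_ids = once_live[['id']].drop_duplicates()
    merged = live_ids.merge(latest[['id']].drop_duplicates(),
                            on='id', how='left', indicator=True)
    n_missing = (merged['_merge'] == 'left_only').sum()
    return n_missing, len(merged), 100.0 * n_missing / len(merged)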

It was at this point I abandoned the blog post.

Error of Process

So what did we learn?

The whole motivation for test-driven data analysis is the observation that data analysis is hard to get right, and most of us make mistakes all the time. We have previously classified these mistakes as

  • errors of interpretation (where we or a consumer of our analysis misunderstand the data, the methods, or our results)
  • errors of implementation (bugs)
  • errors of process (where we make a mistake in using our analytical process, and this leads to a false result being generated or propagated)
  • errors of applicability (where we use an analytical process with data that does not satisfy the requirements or assumptions (explicit or implicit) of the analysis).

We are trying to develop methodologies and tools to reduce the likelihood and impact of each of these kinds of errors.

While we wouldn't normally regard this blog as an analytical process, it's perhaps close enough that we can view this particular error through the TDDA lens. I was writing up the analysis as I did it, fully expecting to generate a useful post. Although I got as far as writing into the entry the (very dubious) claim that 99.9% of 30-day projects that reach 84% funding at any stage on Kickstarter go on to be fully funded, that result immediately smelled wrong and I went off to try to see whether my assumptions about the data were correct. So I was trying hard to avoid an error of interpretation.

But an error of process occurred. This blog is published using Pelican, a static site generator that I mostly quite like. The way Pelican works is posts are (usually) written in Markdown with some metadata at the top. One of the bits of metadata is a Status field, which can either be set to Draft or Published.

When writing the posts, before publishing, you can either run a local webserver to view the output, or actually post them to the main site (on Github Pages, in this case). As long as their status is set to Draft, the posts don't show up as part of the blog on either site (local or on Github), but have to be accessed through a special draft URL. Unfortunately, the draft URL is a little hard to guess, so I generally work with posts with status set to Published until I push them to Github to allow other people to review them before setting them live.

What went wrong here is that the abandoned post had its status left as Published, which was fine until I started the next post (due tomorrow) and pushed that (as draft) to Github. Needless to say, a side-effect of pushing the site with a draft of tomorrow's post was that the abandoned post got pushed too, with its status as Published. Oops!

So the learning for me is that I either have to be more careful with Status (which is optimistic) or I need to add some protection in the publishing process to stop this happening. Realistically, that probably means creating a new Status—Internal—which will get the make process to transmogrify into Published when compiling locally, and Draft when pushing to Github. That should avoid repeats of this particular error of process.
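
One hypothetical way to do that transmogrification is a small script that the make process runs before Pelican, rewriting Status: Internal according to the build target. (Internal is not a real Pelican status, and the file layout and names below are assumptions, not an existing part of this blog's setup.)

# Hypothetical pre-build step: rewrite "Status: Internal" posts before Pelican
# runs. A real setup would probably rewrite copies rather than the sources.
import glob
import re
import sys

def transmogrify(target):
    # Local previews should show Internal posts; the Github build should not.
    replacement = 'Status: Published' if target == 'local' else 'Status: Draft'
    for path in glob.glob('content/*.md'):
        with open(path) as f:
            text = f.read()
        new = re.sub(r'^Status:\s*Internal\s*$', replacement, text,
                     flags=re.MULTILINE)
        if new != text:
            with open(path, 'w') as f:
                f.write(new)

if __name__ == '__main__':
    transmogrify(sys.argv[1] if len(sys.argv) > 1 else 'local')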


  1. good things, like birds and @names and retweeting; not the abuse. 


Errors of Interpretation: Bad Graphs with Dual Scales

Posted on Mon 20 February 2017 in TDDA • Tagged with tdda, errors of interpretation, graphs

It is a primary responsibility of analysts to present findings and data clearly, in ways that minimize the likelihood of misinterpretation. Graphs should help with this, but all too often, if drawn badly (whether deliberately or through oversight), they can make misinterpretation highly likely. I want to illustrate this danger with an unfortunate graph I came across recently in a very interesting—and good, and insightful—article on the US Election.

Take a look at this graph, taken from an article called The Road to Trumpsville: The Long, Long Mistreatment of the American Working Class, by Jeremy Grantham.1

Exhibit 1: Corporate Profits and Employee Compensation

In the article, this graph ("Exhibit 1") is described as follows by Grantham:

The combined result is shown in Exhibit 1: the share of GDP going to labor hit historical lows as recently as 2014 and the share going to corporate profits hit a simultaneous high.

Is that what you interpret from the graph? I agree with these words, but they don't really sum up my first reading of the graph. Rather, I think the natural reading of the graph is as follows:

Wow: Labor's share and Capital's share of GDP crossed over, apparently for good, around 2002. Before then, Capital's share was mostly materially lower than Labor's (though they were nearly equal, briefly, in 1965, and crossed for a few years around 1995), but over the 66-year period shown Capital's share increased while Labor's fell, until Capital is now taking about four times as much as Labor.

I think something like that is what most people will read from the graph, unless they read it particularly carefully.

But that is not what this graph is saying. In fact, this is one of the most misleading graphs I have ever come across.

If you look carefully, the two lines use different scales: the red one, for Labor, uses the scale on the right, which runs from 23% to about 34%, whereas the blue line for Capital, uses the scale on the left, which runs from 3% to 11%.

Dual-scale graphs are always difficult to read; so difficult, in fact, that my personal recommendation is

Never plot data on two different scales on the same graph.

Not everyone agrees with this, but most people accept that dual-scale graphs are confusing and hard to read. Even, however, by the standards of dual scale graphs, this is bad.

Here are the problems, in roughly decreasing order of importance:

  1. The two lines are showing commensurate2 figures of roughly the same order of magnitude, so could and should have been on the same scale: this isn't a case of showing price against volume, where the units are different, or even a case in which one series is measured in millimetres and another in miles: these are both percentages, of the same thing, all between 4% and 32%.
  2. The graphs cross over when the data doesn't. The very strong suggestion from the graphs that we go from Labor's share of GDP exceeding that of Capital to being radically lower than that of Capital is entirely false.
  3. Despite measuring the same quantity, the magnification is different on the two axes (i.e. the distance on the page between ticks is different, and the percentage-point gap represented by ticks on the two scales is different). As a consequence slopes (gradients) are not comparable.
  4. Neither scale goes to zero.
  5. The position of the two series relative to their scales is inconsistent: the Labor graph goes right down to the x-axis at its minimum (23%) while the Capital graph—whose minimum is also very close to an integer percentage—does not. This adds further to the impression that Labor's share has been absolutely annihilated.
  6. There are no gridlines to help you read the data. (Sure, gridlines are chart junk3, but are especially important when different scales are used, so you have some hope of reading the values.)

I want to be clear: I am not accusing Jeremy Grantham of deliberately plotting the data in a misleading way. I do not believe he intended to distort or manipulate. I suspect he's plotted it this way because stock graphs, which may well be the graphs he most often looks at,4 are frequently plotted with false zeros. Despite this, he has unfortunately plotted the graphs in a way5 that visually distorts the data in almost exactly the way I would choose to do if I wanted to make the points he is making appear even stronger than they are.

I don't have the source numbers, so I have gone through a rather painful exercise of reading the numbers off the graph (at slightly coarser granularity) so that I can replot the graph as it should, in my opinion, have been plotted in the first place. (I apologise if I have misread any values; transcribing numbers from graphs is tedious and error-prone.) This is the result:

Exhibit 1 (revised): Same Data, with single, zero-based scale (redrawn approximation)

Even after I'd looked carefully at the scales and appreciated all the distortions in the original graph, I was quite shocked to see the data presented neutrally. To be clear: Grantham's textual summary of the data is accurate: a few years ago, Capital's share of GDP (from his figures) was at an all-time high—albeit not dramatically higher than in 1949 or around 1966—and Labor's share of GDP, a few years ago, was at an all-time low of around 23%, down from 30%. But the true picture just doesn't look like the graph Grantham showed. (Again: I feel a bit bad about going on about this graph from such a good article; but the graph encapsulates so many problematical practices that it makes a perfect illustration.)
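
If you want to redraw a dual-scale chart like this on a single, zero-based scale, a minimal matplotlib sketch follows; the values are illustrative placeholders, not the figures read off Grantham's chart:

import matplotlib.pyplot as plt

# Illustrative placeholder values only -- not the GMO/Grantham figures.
years   = [1950, 1970, 1990, 2010, 2015]
labor   = [30.0, 29.0, 28.0, 24.0, 23.5]   # share of GDP, %
capital = [ 8.0,  6.0,  7.0, 10.0, 10.5]   # share of GDP, %

fig, ax = plt.subplots()
ax.plot(years, labor, label='Labor share of GDP (%)')
ax.plot(years, capital, label='Corporate profits share of GDP (%)')
ax.set_ylim(0, 35)      # one zero-based scale for both series
ax.set_ylabel('% of GDP')
ax.grid(True)           # gridlines help the reader recover values
ax.legend()
plt.show()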

How to Lie with Statistics

In 1954, Darrell Huff published a book called How to Lie with Statistics6. Chapter 5 is called The Gee-Whiz Graph. His first example is the following (neutrally presented) graph:

Exhibit 2 (neutral): Sales Data, zero-based scale (redrawn from original)

As Huff says:

That is very well if all you want to do is convey information. But suppose you wish to win an argument, shock a reader, move him into action, sell him something. For that, this chart lacks schmaltz. Chop off the bottom.

Exhibit 2 (non-zero-based): Sales Data, non-zero-based scale (redrawn from original)

Huff continues:

That's more like it. (You've saved paper7 too, something to point out if any carping fellow objects to your misleading graphics.)

But there's more, folks:

Now that you have practised to deceive, why stop with truncating? You have one more trick available that's worth a dozen of that. It will make your modest rise of ten per cent look livelier than one hundred percent is entitled to look. Simply change the proportion between the ordinate and the abscissa:

Exhibit 2 (non-zero-based, expanded): Sales Data, non-zero-based scale, expanded effect (redrawn from original)

Both of these unfortunate practices are present in Exhibit 1, and that's before we even get to dual scales.

Errors of Interpretation

In our various overviews of test-driven data analysis (e.g., this summary), we have described four major classes of errors:

  • errors of interpretation
  • errors of implementation (bugs)
  • errors of process
  • errors of applicability

Errors of interpretation can occur at any point in the process: not only are we, the analysts, susceptible to misinterpreting our inputs, our methods, our intermediate results and our outputs, but the recipients of our insights and analyses are in even greater danger of misinterpreting our results, because they have not worked through the process and seen all that we did. As analysts, we have a special responsibility to make our results as clear as possible, and hard to misinterpret. We should assume not that the reader will be diligent, unhurried and careful, reading every number and observing every subtlety, but that she or he will be hurried and will rely on us to have brought out the salient points and to have helped the reader towards the right conclusions.

The purpose of a graph is to allow a reader to assimilate large quantities of data, and to understand patterns therein, more quickly and more easily than is possible from tables of numbers. There are strong conventions about how to do that, based on known human strengths and weaknesses as well as commonsense "fair treatment" of different series.

However well intentioned, Exhibit 1 fails in every respect: I would guess very few casual readers would get an accurate impression from the data as presented.

If data scientists had the equivalent of a Hippocratic Oath, it would be something like:

First, do not mislead.


  1. The Road to Trumpsville: The Long, Long Mistreatment of the American Working Class, by Jeremy Grantham, in the GMO Quarterly Newsletter, 4Q, 2016. https://www.gmo.com/docs/default-source/public-commentary/gmo-quarterly-letter.pdf 

  2. two variables are commensurate if they are measured in the same units and it is meaningful to make a direct comparison between them. 

  3. Tufte describes all ink on a graph that is not actually plotting data as "chart junk", and advocates "maximizing data ink" (the amount of the ink on a graph actually devoted to plotting the data points) and minimizing chart junk. These are excellent principles. The Visual Display of Quantitative Information, Edward R. Tufte, Graphics Press (Cheshire, Connecticut) 1983. 

  4. Mr Grantham works for GMO, a "global investment management firm". https://gmo.com 

  5. or chosen to use a plot drawn this way, if he isn't responsible for the plot himself 

  6. How to Lie with Statistics, Darrell Huff, published Victor Gollancz, 1954. Republished, 1973, by Pelican Books. 

  7. Obviously the "saving paper" argument had more force in 1954, and the constant references to "him", "he" and "fellows" similarly stood out less than they do today. 


TDDA 1-pager

Posted on Fri 10 February 2017 in TDDA • Tagged with tdda

We have written a 1-page summary of some of the core ideas in TDDA.

It is available as a PDF from stochasticsolutions.com/pdf/TDDA-One-Pager.pdf.