GDPR, Consent and Microformats: A Half-Baked Idea

Posted on Fri 08 September 2017 in TDDA • Tagged with tdda

Last night I went to The Protectors of Data Scotland Meetup on the subject of Marketing and GDPR. If you're not familiar with Europe's fast-approaching General Data Protection Regulation, and you keep or process any personal data about humans,1 you probably ought to learn about it. A good place to start is episode 202 of Horace Dediu's The Critical Path podcast, in which he interviews Tim Walters.

During the meeting, I had an idea, and though it is rather less than half-baked, right now it seems just about interesting enough that I thought I'd record it.

One of the key provisions in GDPR is that data processing generally requires consent of the data subject, and that consent is defined as

any freely given, specific, informed and unambiguous indication of his or her wishes by which the data subject, either by a statement or by a clear affirmative action, signifies agreement to personal data relating to them being processed

This is further clarified as follows:

This could include ticking a box when visiting an Internet website, choosing technical settings for information society services or by any other statement or conduct which clearly indicates in this context the data subject’s acceptance of the proposed processing of their personal data.

Silence, pre-ticked boxes or inactivity should therefore not constitute consent.

The Idea in a Nutshell

Briefly, the idea is this:

  • Websites (and potentially apps) requesting consents should include a digital specification of that consent in a standardized format to be defined (probably either an HTML microformat or a standardized JSON bundle).

  • This would allow software to understand the consents being requested unambiguously and present them in a standardized, uniform, easy-to-understand format. It would also encourage businesses and other organizations to standardize the forms of consent they request. I imagine that if this happened, initially browser extensions and special apps such as password managers would learn to read the format and present the information clearly, but if it were successful, eventually web browsers themselves would do this.

  • Software could also allow people to create one or more templates or default responses, allowing, for example, someone who never wants to receive marketing to make this their default response, and someone who wants as many special offers as possible to have settings that reflect that. Obviously, you might want several different templates for organizations towards which you have different feelings.

  • A very small extension to the idea would extend the format to record the choices made, allowing password managers, browsers, apps etc. to record for the user exactly what consents were given.2

Benefits

I believe such a standard has potential benefits for all parties—businesses and other organizations requesting consent, individuals giving consent, regulators and courts:

  • Businesses and other data processing/capturing organizations would benefit from a clear set of consent kinds, each of which could have a detailed description (perhaps on an EU or W3C document) that could be referenced by a specific label (e.g. marketing_contact_email_organization). Best practice would hopefully quickly move to using these standard categories.

  • Software could present information in a standardized, clear way to users, highlighting non-standard provisions (preferably with standard symbols, a bit like the Creative Commons symbols).

  • By using template responses, users could more easily complete consent forms with less effort and less fear of ticking the wrong boxes.

Digital Specification? Microformat? JSON?

What are we actually talking about here?

The main (textual) content of a web page consists of the actual human-readable text together with annotation ("markup") to specify formatting and layout, as well as special features like the sort of checkboxes used to request consent. In older versions of the web, a web page was literally a text file in a special format originally defined by Tim Berners-Lee (HTML).

For example, in HTML, this one-sentence paragraph with the word very in bold might be written as follows:

<p>For example, in HTML, this one-sentence paragraph with the
word <b>very</b> in bold might be written as follows:</p>

Since the advent of "Web 2.0", many web pages are generated dynamically, with much of the data being sent in a format called JSON. A simple example of some JSON for describing (say) an element in the periodic table might be

{
    "name": "Lithium",
    "atomicnumber": 3,
    "metal": true,
    "period": 2,
    "group": 1,
    "etymology": "Greek lithos"
}

It would be straightforward3 to develop a format for allowing all the common types of marketing consent (and indeed, many other kinds of processing consent) to be expressed either in JSON or an HTML microformat (which might not be rendered directly by the webpage). As a sketch, the kind of thing a marketing consent request might look like in JSON could be:

{
    "format": "GDPR-Marketing-Consent-Bundle",
    "format_version": "1.0",
    "requesting_organization": "Stochastic Solutions Limited",
    "requesting_organization_data_protection_policy_page":
        "https://stochasticsolutions.com/privacy.html",
    "requesting_organization_partners": [],
    "requested_consents": [
        "marketing_contact_email_organization",
        "marketing_contact_mobile_phone_organization",
        "marketing_contact_physical_mail_organization"
    ],
    "request_date": "2017-09-08",
    "request_url": "https://StochasticSolutions.com/give-us-your-data"
}

Key features I am trying to illustrate here are:

  • The format would include details about the organization making the request
  • The format would have the capacity to list partner organizations in cases in which consent for partner marketing or processing was also requested
  • The format would be granular with a taxonomy of known kinds of consents. These might be parameterized, rather than being simple strings. In this case, I've included a few different contact mechanisms and the suffix "organization" to indicate this is consent for the organization itself, rather than any partners or other randoms.

Undoubtedly, a real implementation would end up a bit bigger than this, and perhaps more hierarchical, but hopefully not too much bigger.

The format could be extended very simply to include the response, which could then be sent back to the site and also made available on the page to the browser/password manager/apps etc. Here is an augmentation of the request format that also captures the responses:

{
    "format": "GDPR-Marketing-Consent-Bundle",
    "format_version": "1.0",
    "requesting_organization": "Stochastic Solutions Limited",
    "requesting_organization_data_protection_policy_page":
        "https://stochasticsolutions.com/privacy.html",
    "requesting_organization_partners": [],
    "requested_consents": {
        "marketing_contact_email_organization": false,
        "marketing_contact_mobile_phone_organization": false,
        "marketing_contact_physical_mail_organization": true
    },
    "request_date": "2017-09-08",
    "request_url": "https://StochasticSolutions.com/give-us-your-data"
}

This example indicates consent to marketing contact by paper mail from the organization, but not by phone or email.
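To make the template idea concrete, here is a minimal Python sketch of how a browser extension or password manager might turn a request bundle into a response bundle using a stored default. Everything here is an assumption for illustration: the function name apply_template, the template dictionary and the rule that unlisted consents default to false are not part of any existing standard.

```python
import copy
import json

# Hypothetical user template: default answers keyed by consent label.
# The labels mirror the illustrative taxonomy used in the sketches above.
NO_MARKETING_EXCEPT_POST = {
    "marketing_contact_email_organization": False,
    "marketing_contact_mobile_phone_organization": False,
    "marketing_contact_physical_mail_organization": True,
}


def apply_template(request, template):
    """Return a response bundle: a copy of the request in which each
    requested consent is mapped to True or False.

    Consents missing from the template default to False, on the grounds
    that silence should not constitute consent.
    """
    response = copy.deepcopy(request)
    response["requested_consents"] = {
        label: bool(template.get(label, False))
        for label in request["requested_consents"]
    }
    return response


# A cut-down request bundle in the illustrative format.
request = json.loads("""
{
    "format": "GDPR-Marketing-Consent-Bundle",
    "format_version": "1.0",
    "requesting_organization": "Stochastic Solutions Limited",
    "requesting_organization_partners": [],
    "requested_consents": [
        "marketing_contact_email_organization",
        "marketing_contact_mobile_phone_organization",
        "marketing_contact_physical_mail_organization"
    ],
    "request_date": "2017-09-08"
}
""")

response = apply_template(request, NO_MARKETING_EXCEPT_POST)
print(json.dumps(response["requested_consents"], indent=4))
```

The request is left untouched and the response is a separate object, so software could keep both as a record of what was asked and what was answered.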

Exactly the same could be achieved with an HTML Microformat, perhaps with something like this:

<div id="GDPR-Marketing-Consent-Bundle">
    <span id="format_version">1.0</span>
    <span id="requesting_organization">
        Stochastic Solutions Limited
    </span>
    <span id="requesting_organization_data_protection_policy_page">
        https://stochasticsolutions.com/privacy.html
    </span>
    <ol id="requesting_organization_partners">
    </ol>
    <ol id="requested_consents">
        <li>marketing_contact_email_organization</li>
        <li>marketing_contact_mobile_phone_organization</li>
        <li>marketing_contact_physical_mail_organization</li>
    </ol>
    <span id="request_date">2017-09-08</span>
    <span id="request_url">
        https://StochasticSolutions.com/give-us-your-data
    </span>
</div>

(Again, I've no idea whether this is actually what HTML-based microformats typically look like; this is purely illustrative.)
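For what it's worth, reading such markup would not be hard. The following is a sketch only, using Python's standard html.parser module, of how software might pull the requested consents out of a well-formed version of the markup above. The class name ConsentBundleParser is made up for the example, and the id values are simply those from the illustration, not from any real microformat.

```python
from html.parser import HTMLParser


class ConsentBundleParser(HTMLParser):
    """Collect the text of each <li> inside <ol id="requested_consents">."""

    def __init__(self):
        super().__init__()
        self.in_consent_list = False
        self.in_item = False
        self.consents = []

    def handle_starttag(self, tag, attrs):
        if tag == "ol" and dict(attrs).get("id") == "requested_consents":
            self.in_consent_list = True
        elif tag == "li" and self.in_consent_list:
            self.in_item = True

    def handle_endtag(self, tag):
        if tag == "ol":
            self.in_consent_list = False
        elif tag == "li":
            self.in_item = False

    def handle_data(self, data):
        # Ignore the whitespace between tags; keep only item text.
        if self.in_item and data.strip():
            self.consents.append(data.strip())


# Cut-down version of the illustrative markup.
html = """
<div id="GDPR-Marketing-Consent-Bundle">
    <ol id="requesting_organization_partners">
    </ol>
    <ol id="requested_consents">
        <li>marketing_contact_email_organization</li>
        <li>marketing_contact_mobile_phone_organization</li>
        <li>marketing_contact_physical_mail_organization</li>
    </ol>
</div>
"""

parser = ConsentBundleParser()
parser.feed(html)
print(parser.consents)
```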

Useful?

I don't know whether this idea is useful or feasible, nor whether it is merely a half-baked version of something that a phalanx of people in Brussels has already specified, though I did perform many seconds of arduous due diligence in the form of web searches for terms like "marketing consent microformat" without turning up anything obviously relevant.

It seems to me that if something like this were created and adopted, it might help make GDPR and web/app-based consent avoid the ignominious fate of the cookie pop-ups that were so well intentioned but such a waste of time in practice. Ideally, some kind of collaboration between the relevant part of the EU and W3C would produce (or at least endorse) any format.

Do get in touch through any of the channels (@tdda0, mail to info@ this domain etc.) if you have thoughts.


  1. So-called personally identifiable information (PII)

  2. Possibly even on a blockchain, if you want to be terribly au courant and have the possibility of cryptographic verification. 

  3. technically straightforward; obviously this would require much work and hammering out of special cases and mechanisms for non-standard requirements. 


Quick Reference for TDDA Library

Posted on Thu 04 May 2017 in TDDA • Tagged with tdda

A quick-reference guide ("cheat sheet") is now available for the Python TDDA library. This is linked in the sidebar and available here.

We will try to keep it up-to-date as the library evolves.

See you all at PyData London 2017 this weekend (5-6 May 2017), where we'll be running a TDDA tutorial on Friday.


Improving Rexpy

Posted on Thu 09 March 2017 in TDDA • Tagged with tdda, rexpy, regular expressions

Today we are announcing some enhancements to Rexpy, the tdda tool for finding regular expressions from examples. In short, the new version often finds more precise regular expressions than was previously the case, with the only downside being a modest increase in run-time.

Background on Rexpy is available in two previous posts:

  • This post introduced the concept and Python library
  • This post discussed the addition of coverage information: statistics about how many examples each regular expression matched.

Rexpy is also available online at https://rexpy.herokuapp.com.

Weaknesses addressed

Rexpy is not intended to be an entirely general-purpose tool: it is specifically focused on the case of trying to find regular expressions to characterize the sort of structured textual data we most often see in databases and datasets. We are very interested in characterizing things like identifiers, zip codes, phone numbers, URLs, UUIDs, social security numbers and (string) bar codes, and much less interested in characterizing things like sentences, tweets, programs and encrypted text.

Within this focus, there were some obvious shortcomings of Rexpy, a significant subset of which the current release (tdda version 0.3.0) now addresses.

Example 1: Postcodes

Rexpy's tests have always included postcodes, but Rexpy never did a very good job with them. Here is the output from using tdda version 0.2.7:

$ python rexpy.py
EH1 1AA
B2 8EA

^[A-Z0-9]{2,3}\ [0-9A-F]{3}$

Rexpy's result is completely valid, but not very specific. It has correctly identified that there are two main parts, separated by a space, and that the first part is a mixture of two or three characters, each a capital letter or a number, and that the second part is exactly three characters, again all capital letters or numbers. However, it has failed to notice that the first group starts with a letter and follows this with a single digit and that the second group is one digit followed by two letters. What a human would probably have written is something more like:

^[A-Z]{1,2}[0-9]\ [0-9][A-Z]{2}$

Let's try Rexpy 0.3.0.

^[A-Z]{1,2}\d\ \d[A-Z]{2}$

Now Rexpy does exactly what we would probably have wanted it to do, and has actually written it slightly more compactly: \d is any digit, i.e. it is precisely equivalent to [0-9].

With a few more examples, it still does the perfect thing (in 0.3.0).

$ rexpy
EC1 1BB
W1 0AX
M1 1AE
B33 8TH
CR2 6XH
DN55 1PT

^[A-Z]{1,2}\d{1,2}\ \d[A-Z]{2}$

(Note that the 0.3.0 release of TDDA includes wrapper scripts, rexpy and tdda, that allow the main functions to be used directly from the command line. These are installed when you pip install tdda. So the rexpy above is exactly equivalent to running python rexpy.py.)

We should note, however, that it still doesn't work perfectly if we include general London postcodes (even with 0.3.0):

$ rexpy
EC1A 1BB
W1A 0AX
M1 1AE
B33 8TH
CR2 6XH
DN55 1PT

^[A-Z0-9]{2,4}\ \d[A-Z]{2}$

In this case, the addition of the final letter in the first block for EC1A and W1A has convinced the software that the first block is just a jumble of capital letters and numbers. We might hope that examples like these (at least, if expanded) would result in something like:

^[A-Z]{1,2}\d{1,2}A?\ \d[A-Z]{2}$

though the real pattern for postcodes is actually quite complex, with only certain London postal areas being allowed a trailing letter, and only in cases where there is a single digit in the first group, and that letter can actually be A, C, P or W.

So while it isn't perfect, Rexpy is doing fairly well with postcodes now.
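The trade-off between the three postcode patterns above can be checked with Python's standard re module. This is just a sanity check written for this post, not part of Rexpy. The final string, 1234 5AB, is an invented non-postcode used to show how permissive the broader pattern is.

```python
import re

# Patterns quoted above.
narrow = re.compile(r"^[A-Z]{1,2}\d{1,2}\ \d[A-Z]{2}$")        # 0.3.0, first six examples
broad = re.compile(r"^[A-Z0-9]{2,4}\ \d[A-Z]{2}$")             # 0.3.0, with EC1A and W1A
hoped_for = re.compile(r"^[A-Z]{1,2}\d{1,2}A?\ \d[A-Z]{2}$")   # what we might hope for

postcodes = ["EC1A 1BB", "W1A 0AX", "M1 1AE", "B33 8TH", "CR2 6XH", "DN55 1PT"]

# The narrow pattern rejects exactly the two postcodes with a trailing letter.
rejected_by_narrow = [p for p in postcodes if not narrow.match(p)]
print(rejected_by_narrow)

# Both of the other patterns accept every example...
print(all(broad.match(p) for p in postcodes))
print(all(hoped_for.match(p) for p in postcodes))

# ...but only the broad one also accepts an obviously non-postcode string.
not_a_postcode = "1234 5AB"
print(bool(broad.match(not_a_postcode)), bool(hoped_for.match(not_a_postcode)))
```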

Let's look at another couple of examples.

Example 2: Toy Examples

Looking at the logs from Rexpy online, it is clear that a lot of people (naturally) start by trying the sorts of examples commonly used for teaching regular expressions. Here are some examples motivated by what we tend to see in logs.

First, let's try a common toy example in the old version of Rexpy (0.2.7):

$ python rexpy.py
ab
abb
abbb
abbbb
abbbbb
abbbbbb

^[a-z]+$

Not so impressive.

Now in the new version (0.3.0):

$ rexpy
ab
abb
abbb
abbbb
abbbbb
abbbbbb

^ab+$

That's more like it!

Example 3: Names

Here's another example it's got better at. First under the old version:

$ python rexpy.py
Albert Einstein
Rosalind Franklin
Isaac Newton

^[A-Za-z]+\ [A-Za-z]{6,8}$

Again, this is not wrong, but Rexpy has singularly failed to notice the pattern of capitalization.

Now, under the new version:

$ rexpy
Albert Einstein
Rosalind Franklin
Isaac Newton

^[A-Z][a-z]+\ [A-Z][a-z]{5,7}$

Better.

Incidentally, it's not doing anything special with the first character of groups. Here are some related examples:

$ rexpy
AlbertEinstein
RosalindFranklin
IsaacNewton

^[A-Z][a-z]+[A-Z][a-z]{5,7}$

Example 4: Identifiers

Some of the examples we used previously were like this (same result under both versions):

$ rexpy
123-AA-22
576-KY-18
989-BR-93

^\d{3}\-[A-Z]{2}\-\d{2}$

What worked less well in the old version were examples like these:

$ python rexpy.py
123AA22
576KY18
989BR93

^[A-Z0-9]{7}$

These work much better under 0.3.0:

$ rexpy
123AA22
576KY18
989BR93

^\d{3}[A-Z]{2}\d{2}$

Some Remaining Areas for Improvement

The changes that we've made in this release of Rexpy appear to be almost unambiguous improvements. Both from trying examples, and from understanding the underlying code changes, we can find almost no cases in which the changes make the results worse, and a great number where the results are improved. Of course, that's not to say that there don't remain areas that could be improved.

Here we summarize a few of the things we still hope to improve:

Alternations of Whole Groups

Rexpy isn't very good at generating alternations at the moment, either at a character or group level. So for example, you might have hoped that in the following example, Rexpy would notice that the middle two letters are always AA or BB (or, possibly, that the letter is repeated).

$ rexpy
123-AA-321
465-BB-763
777-AA-81
434-BB-987
101-BB-773
032-BB-881
094-AA-662

^\d{3}\-[A-Z]{2}\-\d{2,3}$

Unfortunately, it does not. (This probably won't be very hard to change.)
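To be concrete about what we might have hoped for, here are two hand-written candidates (not Rexpy output) checked against the same examples with Python's re module: one using an alternation of the two pairs, the other using a backreference to express "the letter is repeated". The test strings 123-AB-321 and 123-CC-321 are invented.

```python
import re

examples = ["123-AA-321", "465-BB-763", "777-AA-81", "434-BB-987",
            "101-BB-773", "032-BB-881", "094-AA-662"]

# Hand-written alternatives; Rexpy 0.3.0 does not produce either of these.
alternation = re.compile(r"^\d{3}\-(AA|BB)\-\d{2,3}$")
repeated = re.compile(r"^\d{3}\-([A-Z])\1\-\d{2,3}$")

# Both candidates accept all seven examples.
print(all(alternation.match(s) for s in examples))
print(all(repeated.match(s) for s in examples))

# A mixed pair is rejected by both; a new repeated pair only by the alternation.
print(bool(alternation.match("123-AB-321")), bool(repeated.match("123-AB-321")))
print(bool(alternation.match("123-CC-321")), bool(repeated.match("123-CC-321")))
```

Which of the two is "right" depends on the data, which is part of why generating alternations automatically needs some care.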

Alternations within Groups

Similarly, you might hope that it would do rather better than this:

$ rexpy
Roger
Boger
Coger

^[A-Z][a-z]{4}$

Clearly, we would like this to produce

^[A-Z]oger$

and you might think from the previous examples that it would do this, but it can't combine the fixed oger with the adjacent letter range.

Too Many Expressions (or Combining Results)

Rexpy also produces rather too many regular expressions in many cases, particularly by failing to use optionals when it could. For example:

$ rexpy
Angela Carter
Barbara Kingsolver
Socrates
Cher
Martin Luther King
James Clerk Maxwell

^[A-Z][a-z]+$
^[A-Z][a-z]{5,6}\ [A-Z][a-z]+$
^[A-Z][a-z]{4,5}\ [A-Z][a-z]{4,5}\ [A-Z][a-z]+$

At least in some circumstances, we might prefer that this would produce a single result such as:

^[A-Z][a-z]+(\ [A-Z][a-z]+(\ [A-Z][a-z]+)?)?$

or, even better:

^[A-Z][a-z]+(\ [A-Z][a-z]+)*$

Although we would definitely like Rexpy to be able to produce one of these results, we don't necessarily always want this behaviour. It transpires that in a TDDA context, producing different expressions for the different cases is very often useful. So if we do crack the "combining" problem, we'll probably make it an option (perhaps with levels); that will just leave the issue of deciding on a default!
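As a quick check of the kind of single combined expression discussed above, the following sketch (hand-written, using Python's re module) confirms that a nested-optional form and a repeated-group form both cover all six names, and shows where they differ. The four-word name is an extra example added purely for illustration.

```python
import re

names = ["Angela Carter", "Barbara Kingsolver", "Socrates", "Cher",
         "Martin Luther King", "James Clerk Maxwell"]

# One to three capitalized words, using nested optional groups.
nested_optional = re.compile(r"^[A-Z][a-z]+(\ [A-Z][a-z]+(\ [A-Z][a-z]+)?)?$")

# One or more capitalized words, using a repeated group.
repeated_group = re.compile(r"^[A-Z][a-z]+(\ [A-Z][a-z]+)*$")

print(all(nested_optional.match(n) for n in names))
print(all(repeated_group.match(n) for n in names))

# The repeated group generalizes beyond three words; the nested form does not.
four_words = "John Ronald Reuel Tolkien"
print(bool(nested_optional.match(four_words)), bool(repeated_group.match(four_words)))
```

The difference illustrates the point about defaults: the more general expression says less about the data actually seen.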

Plans

We have ideas on how to address all of these, albeit not perfectly, so expect further improvements.

If you use Rexpy and have feedback, do let us know. You can reach us on Twitter (@tdda0), and there's also a TDDA Slack (#TDDA) that we'd be happy to invite you to.


An Error of Process

Posted on Wed 08 March 2017 in TDDA • Tagged with tdda, errors of process, errors of interpretation

Yesterday, email subscribers to the blog, and some RSS/casual viewers, will have seen a half-finished (in fact, abandoned) post that began to try to characterize success and failure on the crowd-funding platform Kickstarter.

The post was abandoned because I didn't believe its first conclusion, but unfortunately was published by mistake yesterday.

This post explains what happened and tries to salvage a "teachable moment" out of this minor fiasco.

The problem the post was trying to address

Kickstarter is a crowd-funding platform that allows people to back creative projects, usually in exchange for rewards of various kinds. Projects set a funding goal and backers only pay out if the aggregate pledges made match or exceed the funding goal during a funding period—usually 30 days.

Project Phoenix on Kickstarter, from The Icon Factory, seeks to fund the development of a new version of Twitterrific for Mac. Twitterrific was the first independent Twitter client, and was responsible for many of the things that define Twitter today.1 (You were, and are, cordially invited to break off from reading this post to go and back the project before reading on.)

Ollie is the bird in Twitterrific's icon.

At the time I started the post, the project had pledges of $63,554 towards a funding goal of $75,000 (84%) after 13 days, with 17 days to go. This is what the amount raised over time looked like (using data from Kicktraq):

Given that the amount raised each day was falling, and the total looked asymptotic, questions I was interested in were:

  • How likely was the project to succeed (i.e. to reach its funding goal by day 30)? (In fact, it is now fully funded.)
  • How much was the project likely to raise?
  • How likely was the project to reach its stretch goal of $100,000?

The idea was to use some open data from Kickstarter and simple assumptions to try to find out what successful and unsuccessful projects look like.

Data and Assumptions

[This paragraph is unedited from the post yesterday, save that I have made the third item below bold.]

Kickstarter does not have a public API, but is scrapable. The site Web Robots makes available a series of roughly monthly scrapes of Kickstarter data from October 2015 to the present, as well as seven older datasets. We have based our analysis on this data, making the following assumptions:

  1. The data is correct and covers all Kickstarter Projects
  2. That we are interpreting the fields in the data correctly
  3. Most critically: if any projects are missing from this data, the missing projects are random. Our analysis is completely invalid if failing projects are removed from the datasets.

[That last point, heavily signalled as critical, turned out not to be the case. As soon as I saw the 99.9% figure below, I went to try to validate that projects didn't go missing from month to month in the scraped data. In fact, they do, all the time, and when I realised this, I abandoned the post. There would have been other ways to try to make the prediction, but they would have been less reliable and required much more work.]

We started with the latest dataset, from 15th February 2017. This included data about 175,085 projects, which break down as follows.

Only projects with a 30-day funding period were included in the comparison, and only those for which we knew the final outcome.

count           is 30 day?
state           no     yes    TOTAL
failed      41,382  42,134   83,516
successful  44,071  31,142   75,213
canceled     6,319   5,463   11,782
suspended      463     363      826
live         2,084   1,664    3,748
TOTAL       94,319  80,766  175,085
-----------------------------------
less live:           1,664
-----------------------------------
Universe            79,102

The table showed that 80,766 of the projects are 30-day, and of these, 79,102 are not live. So this is our starting universe for analysis. NOTE: We deliberately did not exclude suspended or canceled projects, since doing so would have biased our results.
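The arithmetic behind the table is easy to re-derive, which is a cheap consistency check on the transcription. The sketch below simply re-adds the counts from the table; the dictionary layout is my own.

```python
# Counts from the table above: (not 30-day, 30-day) for each state.
counts = {
    "failed": (41_382, 42_134),
    "successful": (44_071, 31_142),
    "canceled": (6_319, 5_463),
    "suspended": (463, 363),
    "live": (2_084, 1_664),
}

total_other = sum(no for no, _ in counts.values())
total_30_day = sum(yes for _, yes in counts.values())

# Starting universe: 30-day projects that are no longer live.
universe = total_30_day - counts["live"][1]

print(total_other, total_30_day, total_other + total_30_day)
print(universe)
```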

Various fields are available in the JSON data provided by Web Robots. The subset we have used is listed below, together with our interpretation of the meaning of each field:

  • launched_at — Unix timestamp (seconds since 1 January 1970) for the start of the funding period
  • deadline — Unix timestamp for the end of the funding period
  • state — (see above)
  • goal — the amount required to be raised for the project to be funded
  • pledged — the total amount of pledges (today); pledges can only be made during the funding period
  • currency — the currency in which the goal and pledges are made.
  • backers_count — the number of people who have pledged money.
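Given that interpretation of launched_at and deadline, deciding whether a project has a 30-day funding period might look something like the sketch below. The half-day tolerance and the example record are assumptions made for illustration; they are not taken from the Web Robots data or from the original analysis.

```python
from datetime import datetime, timezone

SECONDS_PER_DAY = 24 * 60 * 60


def funding_days(project):
    """Length of the funding period in days, from the two Unix timestamps."""
    return (project["deadline"] - project["launched_at"]) / SECONDS_PER_DAY


def is_30_day(project, tolerance_days=0.5):
    """True if the funding period is 30 days, give or take the tolerance.

    The tolerance is an assumption: it allows for deadlines that are not
    set to exactly the same time of day as the launch.
    """
    return abs(funding_days(project) - 30) <= tolerance_days


# Invented record using the field names listed above.
launched = int(datetime(2017, 1, 1, tzinfo=timezone.utc).timestamp())
project = {
    "launched_at": launched,
    "deadline": launched + 30 * SECONDS_PER_DAY,
    "state": "successful",
}

print(funding_days(project), is_30_day(project))
```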

Overall Statistics for 30-day, non-live projects

These are the overall statistics for our 30-day, non-live projects:

succeeded    count        %
no          47,839   60.48%
yes         31,263   39.52%
TOTAL       79,102  100.00%

Just under 40% of them succeed.

But what proportion reach 84% and still fail to reach 100%? According to the detailed data, the answer was just 0.10%, suggesting that 99.90% of 30-day projects that reached 84% of their funding goal, at any stage of the campaign, went on to be fully funded.

That looked wildly implausible to me, and immediately made me question whether the data I was trying to use was capable of supporting this analysis. In particular, my immediate worry was that projects that looked like they were not going to reach their goal might end up being removed (for whatever reason) more often than those that were on track. Although I have not proved that this is the case, it is clear that projects do quite often disappear between successive scrapes.

To check this, I went back over all the earlier datasets I had collected and extracted the projects that were live in those datasets. There were 47,777 such projects. I then joined those onto the latest dataset to see how many of them were in the latest dataset. 15,276 (31.97%) of the once-live projects were not in the latest data (based on joining on id).
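The check itself amounts to an anti-join on project id. A minimal sketch with Python sets follows; the ids are toy values, and only the two totals in the final line come from the figures above.

```python
def missing_from_latest(once_live_ids, latest_ids):
    """Ids of projects seen live in earlier scrapes but absent from the latest."""
    return set(once_live_ids) - set(latest_ids)


# Toy ids, purely to show the shape of the check.
once_live_ids = [101, 102, 103, 104]
latest_ids = [102, 104, 105]

missing = missing_from_latest(once_live_ids, latest_ids)
print(sorted(missing))

# The proportion reported above, recomputed from the two counts.
share_missing = 100 * 15_276 / 47_777
print(f"{share_missing:.2f}%")
```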

It was at this point I abandoned the blog post.

Error of Process

So what did we learn?

The whole motivation for test-driven data analysis is the observation that data analysis is hard to get right, and most of us make mistakes all the time. We have previously classified these mistakes as

  • errors of interpretation (where we or a consumer of our analysis misunderstand the data, the methods, or our results)
  • errors of implementation (bugs)
  • errors of process (where we make a mistake in using our analytical process, and this leads to a false result being generated or propagated)
  • errors of applicability (where we use an analytical process with data that does not satisfy the requirements or assumptions (explicit or implicit) of the analysis).

We are trying to develop methodologies and tools to reduce the likelihood and impact of each of these kinds of errors.

While we wouldn't normally regard this blog as an analytical process, it's perhaps close enough that we can view this particular error through the TDDA lens. I was writing up the analysis as I did it, fully expecting to generate a useful post. Although I got as far as writing into the entry the (very dubious) claim that 99.9% of 30-day projects that reach 84% funding at any stage on Kickstarter go on to be fully funded, that result immediately smelled wrong and I went off to try to see whether my assumptions about the data were correct. So I was trying hard to avoid an error of interpretation.

But an error of process occurred. This blog is published using Pelican, a static site generator that I mostly quite like. The way Pelican works is posts are (usually) written in Markdown with some metadata at the top. One of the bits of metadata is a Status field, which can either be set to Draft or Published.

When writing the posts, before publishing, you can either run a local webserver to view the output, or actually post them to the main site (on Github Pages, in this case). As long as their status is set to Draft, the posts don't show up as part of the blog in either site (local or on Github), but have to be accessed through a special draft URL. Unfortunately, the draft URL is a little hard to guess, so I generally work with posts with status set to Published until I push them to Github to allow other people to review them before setting them live.

What went wrong here is that the abandoned post had its status left as Published, which was fine until I started the next post (due tomorrow) and pushed that (as draft) to Github. Needless to say, a side-effect of pushing the site with a draft of tomorrow's post was that the abandoned post got pushed too, with its status as Published. Oops!

So the learning for me is that I either have to be more careful with Status (which is optimistic) or I need to add some protection in the publishing process to stop this happening. Realistically, that probably means creating a new Status, Internal, which I will get the make process to transmogrify into Published when compiling locally, and Draft when pushing to Github. That should avoid repeats of this particular error of process.
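A sketch of what that transmogrification step might look like, as a small Python function that the make process could run over each post's source before handing it to Pelican. Internal is not a status Pelican knows about; the function name, the target names "local" and "github", and the decision to rewrite only the first matching metadata line are all assumptions of this sketch.

```python
import re

# Matches a metadata line such as "Status: Internal" (case-insensitive).
STATUS_LINE = re.compile(r"^(Status:[ \t]*)Internal[ \t]*$",
                         re.IGNORECASE | re.MULTILINE)


def resolve_internal_status(markdown_text, target):
    """Rewrite 'Status: Internal' to a status Pelican understands.

    target is "local" (compile for the local webserver) or anything else
    (treated as a push to the public site, so the post stays a draft).
    """
    replacement = "Published" if target == "local" else "Draft"
    return STATUS_LINE.sub(lambda m: m.group(1) + replacement,
                           markdown_text, count=1)


post = "Title: An Abandoned Post\nStatus: Internal\n\nBody text.\n"

print(resolve_internal_status(post, "local"))
print(resolve_internal_status(post, "github"))
```

Posts whose status is already Draft or Published pass through unchanged, so the step only affects posts that opt in.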


  1. good things, like birds and @names and retweeting; not the abuse. 


Errors of Interpretation: Bad Graphs with Dual Scales

Posted on Mon 20 February 2017 in TDDA • Tagged with tdda, errors of interpretation, graphs

It is a primary responsibility of analysts to present findings and data clearly, in ways that minimize the likelihood of misinterpretation. Graphs should help this, but all too often, if drawn badly (whether deliberately or through oversight) they can make misinterpretation highly likely. I want to illustrate this danger with an unfortunate graph I came across recently in a very interesting (and good, and insightful) article on the US Election.

Take a look at this graph, taken from an article called The Road to Trumpsville: The Long, Long Mistreatment of the American Working Class, by Jeremy Grantham.1

Exhibit 1: Corporate Profits and Employee Compensation

In the article, this graph ("Exhibit 1") is described as follows by Grantham:

The combined result is shown in Exhibit 1: the share of GDP going to labor hit historical lows as recently as 2014 and the share going to corporate profits hit a simultaneous high.

Is that what you interpret from the graph? I agree with these words, but they don't really sum up my first reading of the graph. Rather, I think the natural reading of the graph is as follows:

Wow: Labor's share and Capital's share of GDP crossed over, apparently for good, around 2002. Before then, Capital's share was mostly materially lower than Labor's (though they were nearly equal, briefly, in 1965, and crossed for a few years in 1995), but over the 66-year period shown Capital's share increased while Labor's fell, until now Capital is taking about four times as much as Labor.

I think something like that is what most people will read from the graph, unless they read it particularly carefully.

But that is not what this graph is saying. In fact, this is one of the most misleading graphs I have ever come across.

If you look carefully, the two lines use different scales: the red one, for Labor, uses the scale on the right, which runs from 23% to about 34%, whereas the blue line, for Capital, uses the scale on the left, which runs from 3% to 11%.

Dual-scale graphs are always difficult to read; so difficult, in fact, that my personal recommendation is

Never plot data on two different scales on the same graph.

Not everyone agrees with this, but most people accept that dual-scale graphs are confusing and hard to read. Even, however, by the standards of dual-scale graphs, this is bad.

Here are the problems, in roughly decreasing order of importance:

  1. The two lines are showing commensurate2 figures of roughly the same order of magnitude, so could and should have been on the same scale: this isn't a case of showing price against volume, where the units are different, or even a case in which one size is in millimetres and another in miles: these are both percentages, of the same thing, all between 4% and 32%.
  2. The graphs cross over when the data doesn't. The very strong suggestion from the graphs that we go from Labor's share of GDP exceeding that of Capital to being radically lower than that of Capital is entirely false.
  3. Despite measuring the same quantity, the magnification is different on the two axes (i.e. the distance on the page between ticks is different, and the percentage-point gap represented by ticks on the two scales is different). As a consequence slopes (gradients) are not comparable.
  4. Neither scale goes to zero.
  5. The position of the two series relative to their scales is inconsistent: the Labor graph goes right down to the x-axis at its minimum (23%) while the Capital graph—whose minimum is also very close to an integer percentage—does not. This adds further to the impression that Labor's share has been absolutely annihilated.
  6. There are no gridlines to help you read the data. (Sure, gridlines are chart junk3, but are especially important when different scales are used, so you have some hope of reading the values.)
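The second and third problems can be made concrete with a little arithmetic. The sketch below maps a value to its relative height on each axis, using the axis ranges read off the graph (3% to 11% on the left, 23% to about 34% on the right). The two sample shares, 24% for Labor and 10% for Capital, are illustrative values chosen for the example rather than figures taken from the article.

```python
def axis_position(value, axis_min, axis_max):
    """Relative height of a value on an axis: 0.0 at the bottom, 1.0 at the top."""
    return (value - axis_min) / (axis_max - axis_min)


LEFT_AXIS = (3.0, 11.0)    # Capital's scale, in percent of GDP
RIGHT_AXIS = (23.0, 34.0)  # Labor's scale, in percent of GDP

# Illustrative shares: Labor is well over twice Capital...
labor_height = axis_position(24.0, *RIGHT_AXIS)
capital_height = axis_position(10.0, *LEFT_AXIS)

# ...yet Labor is drawn near the bottom and Capital near the top.
print(round(labor_height, 3), round(capital_height, 3))

# One percentage point is taller on the left axis than on the right
# by this factor, so slopes of the two lines are not comparable.
magnification = (RIGHT_AXIS[1] - RIGHT_AXIS[0]) / (LEFT_AXIS[1] - LEFT_AXIS[0])
print(magnification)
```

On a single shared scale the heights would preserve the ordering of the values, and the lines could only cross where the data does.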

I want to be clear: I am not accusing Jeremy Grantham of deliberately plotting the data in a misleading way. I do not believe he intended to distort or manipulate. I suspect he's plotted it this way because stock graphs, which may well be the graphs he most often looks at,4 are frequently plotted with false zeros. Despite this, he has unfortunately plotted the graphs in a way5 that visually distorts the data in almost exactly the way I would choose to do if I wanted to make the points he is making appear even stronger than they are.

I don't have the source numbers, so I have gone through a rather painful exercise, of reading the numbers off the graph (at slightly coarser granularity) so that I can replot the graph as it should, in my opinion, have been plotted in the first place. (I apologise if I have misread any values; transcribing numbers from graphs is tedious and error-prone.) This is the result:

Exhibit 1 (revised): Same Data, with single, zero-based scale (redrawn approximation)

Even after I'd looked carefully at the scales and appreciated all the distortions in the original graph, I was quite shocked to see the data presented neutrally. To be clear: Grantham's textual summary of the data is accurate: a few years ago, Capital's share of GDP (from his figures) was at an all-time high, albeit not dramatically higher than in 1949 or about 1966, and Labor's share of GDP, a few years ago, was at an all-time low around 23%, down from 30%. But the true picture just doesn't look like the graph Grantham showed. (Again: I feel a bit bad about going on about this graph from such a good article; but the graph encapsulates so many problematic practices that it makes a perfect illustration.)

How to Lie with Statistics

In 1954, Darrell Huff published a book called How to Lie with Statistics6. Chapter 5 is called The Gee-Whiz Graph. His first example is the following (neutrally presented) graph:

Exhibit 2 (neutral): Sales Data, zero-based scale (redrawn from original)

As Huff says:

That is very well if all you want to do is convey information. But suppose you wish to win an argument, shock a reader, move him into action, sell him something. For that, this chart lacks schmaltz. Chop off the bottom.

Exhibit 2 (non-zero-based): Sales Data, non-zero-based scale (redrawn from original)

Huff continues:

That's more like it. (You've saved paper7 too, something to point out if any carping fellow objects to your misleading graphics.)

But there's more, folks:

Now that you have practised to deceive, why stop with truncating? You have one more trick available that's worth a dozen of that. It will make your modest rise of ten per cent look livelier than one hundred percent is entitled to look. Simply change the proportion between the ordinate and the abscissa:

Exhibit 2 (non-zero-based, expanded): Sales Data, non-zero-based scale, expanded effect (redrawn from original)

Both of these unfortunate practices are present in Exhibit 1, and that's before we even get to dual scales.
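
The size of both effects is easy to quantify. The sketch below uses made-up numbers (a rise from 20 to 22 on an axis that starts at 18), not Huff's data, and the function names are mine. The ratio of drawn growth to true growth is in the spirit of what Tufte calls the "lie factor".

```python
def drawn_growth(start, end, axis_base=0.0):
    """Relative growth in drawn height when the axis starts at axis_base."""
    if start <= axis_base:
        raise ValueError("start must lie above axis_base")
    return (end - axis_base) / (start - axis_base) - 1.0


def exaggeration(start, end, axis_base):
    """Drawn growth divided by true growth (1.0 means no distortion)."""
    return drawn_growth(start, end, axis_base) / drawn_growth(start, end, 0.0)


def drawn_slope(rise, run, units_high, units_wide, height=1.0, width=1.0):
    """Slope on the page of a line covering rise x run data units, in a
    plot area of height x width that shows units_high x units_wide."""
    return (rise / units_high * height) / (run / units_wide * width)


# Illustrative values only: a ten per cent rise, axis truncated at 18.
start, end, axis_base = 20.0, 22.0, 18.0

print(f"true growth:  {drawn_growth(start, end):.0%}")
print(f"drawn growth: {drawn_growth(start, end, axis_base):.0%}")
print(f"exaggeration: {exaggeration(start, end, axis_base):.1f}x")

# Changing the proportions: halving the plot width doubles every slope.
wide = drawn_slope(2.0, 12.0, units_high=6.0, units_wide=12.0, width=1.0)
narrow = drawn_slope(2.0, 12.0, units_high=6.0, units_wide=12.0, width=0.5)
print(f"slope multiplier from narrowing: {narrow / wide:.1f}x")
```

With these numbers the line's height above the axis goes from 2 to 4, so a ten per cent rise is drawn as a doubling. Squeezing the horizontal axis then steepens the line further without changing a single data value.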

Errors of Interpretation

In our various overviews of test-driven data analysis (e.g., this summary), we have described four major classes of errors:

  • errors of interpretation
  • errors of implementation (bugs)
  • errors of process
  • errors of applicability

Errors of interpretation can occur at any point in the process: not only are we, the analysts, susceptible to misinterpreting our inputs, our methods, our intermediate results and our outputs, but the recipients of our insights and analyses are in even greater danger of misinterpreting our results, because they have not worked through the process and seen all that we did. As analysts, we have a special responsibility to make our results as clear as possible, and hard to misinterpret. We should assume not that the reader will be diligent, unhurried and careful, reading every number and observing every subtlety, but that she or he will be hurried and will rely on us to have brought out the salient points and to have helped the reader towards the right conclusions.

The purpose of a graph is to allow a reader to assimilate large quantities of data, and to understand patterns therein, more quickly and more easily than is possible from tables of numbers. There are strong conventions about how to do that, based on known human strengths and weaknesses as well as commonsense "fair treatment" of different series.

However well intentioned, Exhibit 1 fails in every respect: I would guess very few casual readers would get an accurate impression from the data as presented.

If data scientists had the equivalent of a Hippocratic Oath, it would be something like:

First, do not mislead.


  1. The Road to Trumpsville: The Long, Long Mistreatment of the American Working Class, by Jeremy Grantham, in the GMO Quarterly Newsletter, 4Q, 2016. https://www.gmo.com/docs/default-source/public-commentary/gmo-quarterly-letter.pdf 

  2. Two variables are commensurate if they are measured in the same units and it is meaningful to make a direct comparison between them. 

  3. Tufte describes all ink on a graph that is not actually plotting data as "chart junk", and advocates "maximizing data ink" (the amount of the ink on a graph actually devoted to plotting the data points) and minimizing chart junk. These are excellent principles. The Visual Display of Quantitative Information, Edward R. Tufte, Graphics Press (Cheshire, Connecticut) 1983. 

  4. Mr Grantham works for GMO, a "global investment management firm". https://gmo.com 

  5. Or chosen to use such a plot, if he isn't responsible for the plot himself. 

  6. How to Lie with Statistics, Darrell Huff, published by Victor Gollancz, 1954. Republished, 1973, by Pelican Books. 

  7. Obviously the "saving paper" argument had more force in 1954, and the constant references to "him", "he" and "fellows" similarly stood out less than they do today.