Jupyter Notebooks Considered Harmful: The Parables of Anne and Beth

Posted on Thu 14 November 2024 in TDDA • Tagged with TDDA, reproducibility, process

I have long considered writing a post about the various problems I see with computational notebooks such as Jupyter Notebooks. As part of a book I am writing on TDDA, I created four parables about good and bad development practices for analytical workflows. They were not intended to form this post, but the way they turned out fits the theme quite well.

Situation Report

Anne and Beth are data scientists, working in parallel roles in different organizations. Each was previously tasked with analysing data up to the end of the second quarter of 2024. Their analyses were successful and popular. Even though there had never been any suggestion that the analysis would need to be updated, on the last Friday of October Anne and Beth each receive an urgent request to “rerun the numbers using data up to the end of Q3”. The following parables show four different ways this might play out for our two protagonists from this common initial situation.

Parable #1: The Parable of Parameterization

Beth

Beth locates the Jupyter Notebook she used to run the numbers previously and copies it with a new name ending Q3. She changes the copy to use the new data and tries to run it but discovers that the Notebook does not work if run from start to finish. (During development, Beth jumped about in the Notebook, changing steps and rerunning cells out of order as she worked until the answers looked plausible.)

Beth spends the rest of Friday trying to make the analysis work on the new data, cursing her former self and trying to remember exactly what she did previously.

At 16:30, Beth texts her partner saying she might be a little late home.


Anne

Anne begins by typing

make test

to run the tests she created based on her Q2 analysis. They pass. She then puts the data in the data subdirectory, calling it data-2024Q3, and types

make analysis-2024Q3.pdf

Her Makefile’s pattern rule matches the Q3 data to the target output and runs her parameterized script, which performs the analysis using the new data. It produces the PDF, and issues a message confirming that consistency checks on the outputs passed. After checking the document, Anne issues it.

At 09:30, Anne starts to plan the rest of her Friday.

Computational notebooks, such as Jupyter Notebooks,1 have taken the data science community by storm, to the point that it is now often assumed that analyses will be performed in a Notebook. Although I almost never use them myself, I do use a web interface (Salvador) to my own Miró that from a distance looks like a version of Jupyter from a different solar system. Notebooks are excellent tools for ad hoc analysis, particularly data exploration, and offer clear benefits including the ability to embed graphical output, easy shareability, web approaches to handling wide tables, and facilitation of annotation of analysis in a spirit somewhat akin to literate programming. I do not wish to take away anyone's Notebooks, notwithstanding the title of this post. I do, however, see several key problems with the way Notebooks are used and abused. Briefly, these are:

  • Lack of Parameterization. I see Notebook users constantly copying Notebooks and editing them to work with new data, instead of writing parameterized code that handles different inputs. Anne's process uses the same program to process the Q2 and Q3 data. Beth's process uses a modified copy, which is significantly less manageable and offers more scope for error (particularly errors of process).

  • Lack of Automated testing. While it is possible to write tests for Notebooks, and some tools and guides exist e.g. Remedios 2021,2 in my experience it is rare for this to be done even by the standards of data science, where testing is less common than I would like it to be.

  • Out-of-order execution. In Notebooks, individual cells can be executed in any order, and state is maintained between execution steps. Cells may fail to work as intended (or at all) if the state has not been set up correctly before they are run. When this happens, other cells can be executed to patch up the state and then the failing cell can be run again. Not only can critical setup code end up lower down a Notebook than code that uses it, causing a problem if the Notebook is cleared and re-run; the key setup code can also be altered or deleted after it has been used to set the state. This is my most fundamental reservation about Notebooks, and it is not merely a theoretical concern. I have known many analysts who routinely leave Notebooks in inconsistent states that prevent them from running straight through to produce the results. Notebooks are fragile.

  • Interaction with Source Control Systems. Notebooks can be stored in source control systems like Git, but some care is needed. Again, in my experience, Notebooks tend not to be under version control, with the copy-paste-edit pattern (for whole Notebooks) being more common.

In my view, Notebooks should be treated as prototypes to be converted to parameterized, tested scripts immediately after development. This will often involve converting the code in cells (or groups of cells) into parameterized functions, something else that Notebooks seem to discourage. This is probably because cells provide a subset of the benefits of a callable function by visually grouping a block of code and allowing it to be executed in isolation. Cells do not, however, provide other key benefits of functions and classes, such as separate scopes, parameters, enhanced reusability, enhanced testability, and abstraction.

Anne's process has four key features that differ from Beth's.

  • Anne runs her code from a command line using make. If you are not familiar with the make utility, it is well worth learning about, but the critical point here is that Anne's setup allows her to use her process on new data without editing any code: in this case her make command (is intended to) get expanded to python analyse.py 2024Q3 and uses the parameter 2024Q3 both to locate the input data and to name the matching report generated.

  • Anne also benefits from tests she previously wrote, so has confidence that her code is behaving as expected on known data. This is the essence of what we mean by reference testing. While you might think that if Anne has not changed anything since she last ran the tests, they are bound to pass (as they do in this parable), this is not necessarily the case.

  • Anne's code also includes computational checks on the outputs. Of course, such checks can be included in a Notebook just as easily as they can be in scripts. The reason they are not is entirely because I am making one analyst a paragon of good practice and the other a caricature of sloppiness.

  • Finally, unlike Beth, Anne takes the time to check her outputs before sending them on. Once again, this is because Anne cares about getting correct results, and wants to find any problems herself, not because she does not use a Notebook.
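To make the first of these points concrete, here is a minimal sketch of what a parameterized script in the spirit of analyse.py might look like. It is not Anne's actual code: the `.csv` extension, the function names and the exact directory layout are assumptions made for illustration. The only things taken from the parable are that a period label such as 2024Q3 locates the input in the data subdirectory and names the matching report.

```python
import argparse
import re
from pathlib import Path

# Period labels are assumed to look like 2024Q3 (four-digit year, Q, quarter).
PERIOD_PATTERN = r"\d{4}Q[1-4]"


def paths_for_period(period, base_dir="."):
    """Derive the input and output paths from a period label such as 2024Q3."""
    if not re.fullmatch(PERIOD_PATTERN, period):
        raise ValueError(f"Period should look like 2024Q3, not {period!r}")
    base = Path(base_dir)
    # Hypothetical layout: data/data-2024Q3.csv in, analysis-2024Q3.pdf out.
    input_path = base / "data" / f"data-{period}.csv"
    output_path = base / f"analysis-{period}.pdf"
    return input_path, output_path


def main(argv):
    """Parse the command-line arguments and report what would be run."""
    parser = argparse.ArgumentParser(description="Run the quarterly analysis")
    parser.add_argument("period", help="period label, e.g. 2024Q3")
    args = parser.parse_args(argv)
    input_path, output_path = paths_for_period(args.period)
    # The real analysis and report generation would go here.
    print(f"Would analyse {input_path} and write {output_path}")
    return input_path, output_path
```

Used as a script, the entry point would call `main(sys.argv[1:])`, so that `python analyse.py 2024Q3` needs no code edits when the quarter changes. The same function can be called from tests with an explicit argument list, such as `main(["2024Q2"])`.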

Parable #2: The Parable of Testing

Beth

Beth copies, renames and edits her previous Notebook and is pleased to see it runs without error on the Q3 data. She issues the results and plans the rest of her Friday.

The following week, Beth's inbox is flooded with people saying her results are “obviously wrong”. Beth is surprised since she merely copied the Q2 analysis, updated the input and output file paths, and ran it. She opens her old Q2 Notebook and reruns all cells. She is dismayed to see all the values and graphs in the second half of the Notebook change.

Beth has some remedial work to do.


Anne

Anne runs her tests but one fails. On examining the failing test, Anne realises that a change she made to one of her helper libraries means that a transformation she had previously applied in the main analysis is now done by the library, so should be removed from her analysis code.

After making this change, the tests (which were all based on the Q2 data) pass. Anne commits the change to source control before typing

make analysis-2024Q3.pdf

to analyse the new data. After sense checking the results, Anne issues them.

In the first parable, Beth's code would not run from start to finish; this time, it runs but produces different answers from when she ran it before using the Q2 data. This could be because she had failed to clear and re-run the Notebook to generate her final Q2 results, but here I am assuming that her results changed for the same reason as Anne's: they had both updated a helper library that their code used. Whereas Anne's tests detected the fact that her previous results had changed, Beth only discovered this when other people noticed her Q3 results did not look right (though had she checked her results, she might have noticed that something looked wrong). Anne is in a slightly better position than Beth to diagnose what went wrong, because her “correct” (previous) results are stored as part of her tests. Now that Beth has updated her Notebook, it may be harder for her to recover the old results. Even if she has access to the old and new results, Beth is probably in a worse position than Anne, because Anne has at least one test highlighting how the result has changed. This should allow Anne to make faster progress and gain confidence that her fix is correct more easily.
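The idea of storing the previous “correct” results as part of the tests can be sketched in a few lines. The following is a simplified illustration using only the standard library, not the API of any particular testing tool; `summarise` is a hypothetical stand-in for the real analysis and the reference file name is invented.

```python
import json
import tempfile
from pathlib import Path


def summarise(prices):
    """Hypothetical stand-in for the analysis: a few summary figures."""
    return {
        "n": len(prices),
        "total": round(sum(prices), 2),
        "mean": round(sum(prices) / len(prices), 2),
    }


def check_against_reference(actual, ref_path, rewrite=False):
    """Compare results with a stored reference; return a list of differences."""
    ref_path = Path(ref_path)
    if rewrite:
        # Deliberately accept the current results as the new reference.
        ref_path.write_text(json.dumps(actual, indent=2, sort_keys=True))
        return []
    expected = json.loads(ref_path.read_text())
    keys = sorted(set(expected) | set(actual))
    return [
        f"{key}: expected {expected.get(key)!r}, got {actual.get(key)!r}"
        for key in keys
        if expected.get(key) != actual.get(key)
    ]


with tempfile.TemporaryDirectory() as tmp:
    ref = Path(tmp) / "q2-summary.json"
    q2_prices = [1.5, 2.5, 4.0]
    # Store the verified Q2 results once, then re-check them on later runs.
    check_against_reference(summarise(q2_prices), ref, rewrite=True)
    unchanged = check_against_reference(summarise(q2_prices), ref)
    # Simulate a helper library that has started applying an extra scaling.
    drifted = check_against_reference(summarise([p * 2 for p in q2_prices]), ref)
    print(unchanged)
    print(drifted)
```

When the helper library's behaviour changes, the list of differences says which figures moved and by how much, which is the diagnostic advantage described above.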

Parable #3: The Parable of Units

Beth

Again, Beth copies, renames and updates her previous Notebook and is happy to see it runs straight through on the Q3 data. She issues the results and looks forward to a low-stress day.

Around 16:00, Beth's phone rings and a stressed executive tells her the answers “can't be right” and need to be fixed quickly. Beth is puzzled. She opens her Q2 Notebook, re-runs it and the output is stable. That, at least, is good.

Beth now compares the Q2 and Q3 datasets and notices that the values in the PurchasePrice column are some three orders of magnitude larger in Q3 than in Q2, as if the data is in different units. She checks with her data supplier to confirm that this is the case, then sends some rebarbative emails, with the subject Garbage In, Garbage Out! Beth grumpily adds a cell to her Q3 notebook dividing the relevant column by 1,000. She then adds _fixed to the Q3 notebook's name to encourage her to copy that one next time. She wonders why everyone else is so lazy and incompetent.


Anne

As usual, Anne first runs her tests, which pass. She then runs the analysis on the Q3 data by issuing the command

make analysis-2024Q3.pdf

The code stops with an error:

Input Data Check failed for: PurchasePrice_kGBP
Max expected: GBP 10.0k
Max found: GBP 7,843.21k

(Anne's code creates a renamed copy of the column after load because she had noticed, while analysing Q2, that the prices were in thousands of pounds.)

Anne checks with her data supplier, who confirms a change of units, which will continue going forward. Anne persuades her data provider to change the field name for clarity, and to reissue the data.

Anne adds different code paths based on the supplied column names and adds tests for the new case. Once they pass, and she has received the updated data, Anne commits the change, runs and checks the analysis and issues the results.

This time, Anne is saved by virtue of having added checks to the input data, which Beth clearly did not (though, again, such checks could easily be included in a Notebook). This builds directly on the ideas of other articles in this blog, whether implemented through TDDA-style constraints or more directly as explicit checks on input values in the code.
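As a sketch of the “explicit checks on input values” variant, the function below stops with a message in the same shape as the one Anne saw. The function and exception names are hypothetical, and a real check would probably cover minimum values, nulls and types as well.

```python
class InputDataCheckError(ValueError):
    """Raised when input data falls outside the expected range."""


def check_max(values, column_name, max_expected):
    """Stop with a clear error if any value exceeds the expected maximum.

    Values and max_expected are assumed to be in thousands of pounds.
    """
    max_found = max(values)
    if max_found > max_expected:
        raise InputDataCheckError(
            f"Input Data Check failed for: {column_name}\n"
            f"Max expected: GBP {max_expected:,.1f}k\n"
            f"Max found: GBP {max_found:,.2f}k"
        )
    return max_found
```

Calling `check_max(prices, "PurchasePrice_kGBP", 10.0)` immediately after loading the data means that a change of units halts the run before any results are produced, rather than after they have been issued.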

Anne (being the personification of good practice) also noticed the ambiguity in the PurchasePrice variable and created a renamed copy of it for clarity. Note, however, that her check would have worked if she had not created a renamed variable.

A third difference is that Anne has effected a systematic improvement in her data feed by getting the supplier to rename the field. This reduces the likelihood that the unit will be changed without flagging it, decreases the chances of its being misinterpreted, and allows Anne to have two paths through her single script, coping with data in either format safely. By re-sourcing the updated data, Anne also confirms that the promised change has actually been made, and that the new data looks correct.

Finally, Beth now has different copies of her code and has to be careful to copy the right one next time (hence _fixed). Anne's old code only exists in the version control system, and crucially, her new code safely handles both cases.

Parable #4: The Parable of Applicability

Beth

Beth dusts off her Jupyter Notebook and, as usual, copies it with a new name ending Q3. She makes the necessary changes to use the new data but it crashes with the error:

ZeroDivisionError: division by zero

After a few hours tracing through the code, Beth eventually realises that there is no data in Q3 for a category that had been quite numerous in the Q2 data. Her Notebook indexes other categories against the missing category by calculating their ratios. On checking with her data provider, Beth confirms that the data is correct, so adds extra code to the Q3 version of the Notebook to handle the case. She also makes a mental note to try to remember to copy the Q3 notebook in future.


Anne

Anne runs her tests by typing

make test

and enjoys watching the progress of the “green line of goodness”. She then runs the analysis on the Q3 data by typing

make analysis-2024Q3.pdf

but it stops with an error message: “There is no data for Reference Store 001. If this is right, you need to choose a different Reference Store.” After establishing that the data is indeed correct, Anne updates the code to handle this situation, checks that the existing tests pass, and adds a couple of regression tests to make sure that it copes not only with the default reference store having no data, but also with alternative reference stores. When all tests pass, she runs the analysis in the usual way, checks it, commits her updated code and issues the results.

As in the previous parable, the most important difference between Beth's approach and Anne's is that Beth's fix for their common problem is ad hoc and leads to further code proliferation3 and its concomitant risks if the analysis is run again. In contrast, Anne's code becomes more general and robust as it handles the new case along with the old, and she adds extra tests (regression tests) to try to ensure that nothing breaks the handling of this case in future. The “green line of goodness” mentioned is the name some testers use for the line of dots (sometimes green) that many test frameworks print, one for each test that passes.
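The contrast between a bare ZeroDivisionError and Anne's explicit message can be illustrated with a small sketch. The parable does not say how Anne's code finally handles the missing store, so this shows only the explicit guard and the kind of regression tests described; the function name, the store identifiers other than 001, and the sales figures are invented.

```python
import unittest


class NoReferenceDataError(ValueError):
    """Raised when the chosen reference store has no data."""


def index_against_reference(sales_by_store, reference_store="001"):
    """Express each store's sales as a ratio to the reference store's sales."""
    reference = sales_by_store.get(reference_store, 0)
    if reference == 0:
        # Fail with an actionable message instead of a ZeroDivisionError.
        raise NoReferenceDataError(
            f"There is no data for Reference Store {reference_store}. "
            "If this is right, you need to choose a different Reference Store."
        )
    return {store: sales / reference for store, sales in sales_by_store.items()}


class TestReferenceStore(unittest.TestCase):
    def test_default_reference_store_has_no_data(self):
        with self.assertRaises(NoReferenceDataError):
            index_against_reference({"002": 50.0, "003": 100.0})

    def test_alternative_reference_store(self):
        result = index_against_reference(
            {"002": 50.0, "003": 100.0}, reference_store="002"
        )
        self.assertEqual(result, {"002": 1.0, "003": 2.0})
```

Tests like these stay in the suite, so any later change that breaks the handling of an empty reference store shows up the next time `make test` is run.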


So there we have it. On a more constructive note, I have been following the progress of Quarto with interest. Quarto is a development of RMarkdown, a different style of computational document popular in the R and Reproducible Research communities.4 To my mind it has fewer of the problems highlighted here. It also supports Python and much of the Python data stack as first-class citizens, and in fact integrates closely with Jupyter, which it uses behind the scenes for many Python-based workflows. I have been using it over the last couple of days, and though it is still distinctly rough around the edges, I think it offers a very promising way forward, with excellent output options that include PDF (via LaTeX), HTML, Word documents and many other formats. It's both an interesting alternative to Notebooks and (perhaps more realistically) a reasonable target for migrating code from Notebook prototypes. I use most of the cells to call functions imported at the start, promoting code re-use and parameterization, which avoids another of the pitfalls of Notebooks (in practice) discussed above.


  1. I am using the term Jupyter Notebook to cover both what are now called Jupyter Notebooks (“The Classic Notebook Interface”) and JupyterLab (the “Next-Generation Notebook Interface”). This is both because most people I know continue to call them Jupyter Notebooks, even when using JupyterLab, and because “Notebooks” reads better in the text. I will capitalize Notebook when it refers to a computational notebook, as opposed to a paper notebook.

  2. Remedios 2021, How to Test Jupyter Notebooks with Pytest and Nbmake, 2021-12-14. 

  3. analysis_fixed_fixed2_final.ipynb 

  4. Reproducible research is very much a sister movement to TDDA (a much larger sister movement). Its goals are similar and it is wholly congruent with all the ideas of TDDA. To the extent that there is divergence, some of it simply arises from separate evolution, and some from the fact that the focus of reproducible research is more on allowing other people to access your code and data, to run it themselves and verify the outputs, or to write their own analysis to verify your results even more strongly, or to use your code on their data as a different sort of validation. I sometimes call TDDA “reproducible research for solipsists”, because of its greater focus on testing, and helping to discover and eliminate problems even if no second person is involved. Another related area I have recently become aware of is Veridical data science, as developed by Bin Yu and Rebecca Barter. The link is to their book of that name.


An Adware Malware Story Featuring Safari, Notification Centre, and Box Plots

Posted on Sun 22 September 2024 in misc

This is not, primarily, an article about TDDA, but I thought it was worth publishing here anyway. Itʼs a story about a kind of adware/malware incident I had this morning—with digressions about box plots.

Digression

I was doing some research for a book (on TDDA), looking up information on box plots, also known as box-and-whisker diagrams. When I first came across box plots, I assumed the “box” in the name was a reference to the literal “box” part of a traditional box plot. If you are not familiar with box plots, they typically look like the ones shown in Wikipedia:1

Box plots for the Michelson-Morley Experiment

There are variations, but typically the central line represents the median, the “box” delineates the interquartile range, the whiskers extend to either the minimum and maximum or, sometimes, other percentiles, such as 1 and 99. When the minimum and maximum are not used, outliers beyond those extents can be shown as individual points, as seen here for experiments 1 and 3.

At some point after learning about box plots, I became aware of the statistician George Box (he of “All models are wrong, but some models are useful”2 fame) and ended up believing that box plots had in fact been invented by him (and should, therefore, arguably be called “Box plots” rather than “box plots”). Whether someone misinformed me or my brain simply put 2 and 2 together to make about 15, Tufte5 (who advocates his own “reduced box plot”, in line with his principle of maximizing data ink and minimizing chart junk) states definitively that the box plot was in fact a refinement by John Tukey3 of Mary Eleanor Spearʼs “range bars”4. So I was wrong.

Back to the malware

Anyway, back to the malware. I was clicking about on image search results for searches like box plot "George Box" and hit a site that gave one of the ubiquitous “Are you a human?” prompts that sometimes, but not always, act as a gateway to solving CAPTCHAs to train AI models. But this one didnʼt seem to work. I closed the tab and moved on, but soon after started getting highly suspicious looking pop-up notifications like these:

Malware/ad-ware notifications from ask you

These are comically obviously not legitimate warnings from anything that I would knowingly allow on a computer, which made me less alarmed than I might otherwise have been. But clearly something had happened as a result of clicking an image search result and an “I am not a robot” dialogue.

I wonʼt bore you with a blow-by-blow account of what I did, but the key points are that

  • Killing that tab did not stop the notifications.
  • Nor did stopping and restarting Safari while bringing back all the old windows.
  • Nor did stopping and restarting Safari without bringing back any tabs or windows.
  • Nor did deleting todayʼs history from Safari.

So I did some web searches, almost all of the results of which advocated downloading sketchy “anti-malware” products to clean my Mac, which I was never going to do. Eventually, I came across the suggestion that it might be a site that had requested and been given permission to put notifications in Notification Centre. I think I was only half-aware that this was a possible behaviour, but it turns out that (on MacOS Ventura 13.6.9, with Safari 17.6) Safari → Settings → Websites has a Notifications section on the left that looks like this:

Malware/ad-ware notifications from ask you

I must have been aware of this at various points, because I had set the four websites at the bottom of the list to Deny, but I had not noticed the Allow websites to ask for permission to send notifications checkbox, which was enabled (but is now disabled). The top one (looks suspicious, dunnit?) was set to Allow when I went in. I have a strong suspicion that the site tricked me into giving it permission by getting me to click something that did not appear to be asking for such permission. I suspect it hides its URL by using a newline or a lot of whitespace, which is why it does not show up in the screenshot above.

Setting that top (blank-looking) site to Deny and (as a belt-and-braces and preventative measure) unchecking the checkbox so that sites are not even allowed to ask for permission to put popups in Notification Centre had the desired effect of making popups stop. I believe this constitutes a full fix and that no data was being exfiltrated from the Mac, despite the malicious notification. I will probably also Remove at least that top site (with the Remove button) in the future, but will leave it there for now in case Apple (or anyone else) can tell me how to find out what site it comes from.

I also found (but cannot now find again) an option to reset the notifications from those sites. This was the extremely confusing dialogue for the site in question.

Reset notifications dialogue for

I think whatʼs going on here is that the text the site uses to identify itself to Safari when asking for permission consists of the following, including the new lines:

ask you

Confirm that you're not a robot, you need to click Allow

This makes reading the dialogue quite hard and confusing. Looking more carefully at Notification Centre, I also see this:

Ask you permission which text request

I don't quite understand whether this is an image forming a notification, or an image included in some other notification, but offset, or something else. Whatever it is, it consists of (or includes) white text saying

ask you

Confirm that you're not a
robot, you need to click Allow

with a little bit of light grey background around the letters.

I donʼt entirely understand why the site would have used barely readable white text on a light grey background like this, but I presume somehow this text was involved in getting me to click the “I am not a robot” dialogue (which I believe to be the only click I performed on the site).

Anyway, the long and the short of it is that if anyone else runs into this, my recommendations (which do not come from a security expert, so use your own judgement) are:

  1. Donʼt download a random binary from the internet to remove spyware.
  2. Try to find the Safari Preference for Notifications under Websites and see if you have a sketchy-looking entry like mine. If so, set that to Deny.
  3. Probably also remove that site with the Remove button.
  4. Consider turning off the ability for sites to request permission to put notifications in Notification Centre if this is not something you want, or that no site you care about needs.

  1. Taken from Wikipedia entry on box plots, own work by Wikipediaʼs User:Schutz (public domain). 

  2. George Box (1919-2013): a wit, a kind man and a statistician, obituary by Julian Champkin, 4 April 2013. https://significancemagazine.com/george-box-1919-2013-a-wit-a-kind-man-and-a-statistician-2/

  3. Exploratory data analysis, John Tukey, Reading/Addison-Wesley, 1977. 

  4. Charting Statistics, Mary Eleanor Spear, McGraw Hill, 1952. 

  5. The Visual Display of Quantitative Information, Edward R. Tufte, Graphics Press, 1984. 


PyData London 2024 TDDA Tutorial

Posted on Sun 21 July 2024 in TDDA • Tagged with TDDA, tutorial

PyData London had its tenth conference in 2024, and it was excellent.

I gave a tutorial on TDDA, and the video is available on YouTube and below:

The slides are also available here.


Learning the Hard Way: Regression to the Mean

Posted on Thu 20 June 2024 in TDDA • Tagged with TDDA, reproducibility, errors, interpretation

I was at the tenth PyData London Conference last weekend, which was excellent, as always. One of the keynote speakers was Rebecca Bilbro who gave a rather brilliant (and cleverly titled) talk called Mistakes Were Made: Data Science 10 Years In.

The title is, of course, a reference to the tendency many of us have to be more willing to admit that mistakes were made, than to say "I made mistakes". So I thought I'd share a mistake I made fairly early in my data science career, probably around 1996 or 1997. This is not one of those interview-style "I-sometimes-worry-I'm-a-bit-too-much-of-a-perfectionist"-style admissions that we have all heard; this one was bad.

My company at the time, Quadstone, was under contract to analyse a large retailer's customer base for relationship marketing, using loyalty-card data. We had done all sorts of work in the area with this retailer, and one day the relationship manager we were working with decided that it would be good to incentivise more spending among the retailer's less active, lower spending customers. This is fairly standard. The idea was to set a reasonably high, but achievable, target spend level for each of these customers over a period of a few weeks. Those customers who hit their individual target would receive a large number of loyalty points worth a reasonable amount of money.

We had been tracking spend carefully, placing customers on a behavioural segmentation, and had enough data that we felt relatively confident we knew what would be good individualized stretch goals for customers (wrongly, as events would prove). We set the targets at levels such that the retailer should break even if people just met them (foregoing profit, but not losing much money), and estimated how many people we thought would hit the target if the campaign did not have much effect, and then estimated volumes and costs for various higher levels of campaign success.

I'm sure many of you can already see how this will go, and even more of you will have been attuned to the problem by the title of this post. We, however, had not seen the title of this post, and although I knew about the phenomenon of regression to the mean, I had not really internalized it. I didn't know it in my bones. I had not been bitten by it. I did not see the trap we were walking into.

As Confucius apparently did not say:

I hear and I forget. I see and I remember. I do and I understand.

(Probably not Confucius; possibly Xunzi.)

Well, I certainly now understand.

On the positive side, our treated group increased its level of spend by a decent amount, and a large number of the group earned many extra loyalty points. Although I don't believe we had developed uplift modelling at the time of this work, we were very aware that we needed a randomized control group in order to understand the behaviour change we had driven, and we had kept one. To our dismay, the level of spend in the control group, though lower than that in the treated group, also increased quite considerably. In fact, it increased enough that the return on investment for the activity was negative, rather than positive. It was at this point (just before admitting to the client what had happened, and negotiating with them about exactly who should shoulder the loss1) that a little voice in my head started saying regression to the mean, regression to the mean, regression to the mean, almost like a more analytical version of Long John Silver's parrot.

So (for those of you who don't know), what is regression to the mean? And why did it occur in this case? And why should we, in fact, have predicted that?

Allow me to lead you through the gory details.

Background: Control Groups

We all know that marketers can't honestly claim the credit for all the sales from people included in a direct marketing campaign, because (in almost all circumstances) some of them would have bought anyway. As with randomized control trials in medicine, in order to understand the true effect of our campaign, we need to divide our target population, uniformly at random,2 into a treatment group, who receive the marketing treatment in question, and a control group, who remain untreated. The two groups do not need to be the same size, but both need to be big enough to allow us to measure the outcome accurately, and indeed to measure the difference between the behaviour of the two groups. This is slightly problematical, because we don't know the effect size before we take action. Happily, however, if the effect is too small to measure, it is pretty much guaranteed to be uninteresting and not to achieve a meaningful return on investment, so we can size the two groups by calculating the minimum effect we need to be able to detect in order to achieve a sufficiently positive ROI.

The effect size is the difference between the outcome in the treated group and the control group—usually a difference in response rate, for a binary outcome, or a difference in a continuous variable such as revenue. Things become more interesting when there are negative effects in play, which is sometimes the case with intrusive marketing or when retention activity is being undertaken. There can be negative effects for a subpopulation or, in the worst cases, for the population as a whole. When these happen, a company is literally spending money to drive customers away, which is usually undesirable.

Let's suppose, for simplicity, that we have selected an ideal target population of 2 million and we mail half of them (chosen on the toss of a fair coin) and keep the other 1 million as controls. If we then send a motivational mailing to the 1 million encouraging them to spend more, with or without an incentive to do so, we can measure their average weekly spend in a pre-period (say six weeks) and their average weekly spend in a post-period, which for simplicity we will also take to be six weeks. In this case, we will assume that there was no financial incentive: it was simply a motivational mail along the lines of "we're really great: come and give us more of your money". (Good creatives would craft the message more attractively than this.) Let's suppose we do this and that the results for the treated group of one million are as follows:


Before   After
£50      £60

This is not enough information to say whether the campaign worked, and the reason has nothing to do with statistical errors: at the scale of 1 million, you can guarantee the errors will be insignificant. It's also not primarily because we are measuring revenue rather than profit, nor because we haven't taken into account the cost of action (though those are things we should do). We can see that our 1 million customers spent, on average, £10 per week more in the post-period than in the pre-period (a cool £60m in increased revenue over six weeks), but we don't know about causality: did our marketing campaign cause the effect?

To answer this, we need to look at what happened in the control group.


                     Before   After
Mailed (Treated)     £50      £60
Unmailed (Control)   £50      £55

We immediately see that the spend in the pre-period was the same in both groups, as must be the case for a proper treatment-control split. We also see that the spend in the control group rose to £55.

We now have enough evidence to say with a very high degree of confidence that the treatment caused a £5 per week increase in spend, but that some other factor (perhaps seasonality, TV ads, or a mis-step by our competitors) caused the other £5 increase per week across the larger population of 2 million customers. I should emphasize that this is a valid conclusion regardless of how the population of 2 million was chosen, as long as the treatment-control split was unbiased.
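The arithmetic behind that conclusion can be written down in a few lines. This is only a sketch using the figures from the table above; the function name `uplift` and the variable names are mine, chosen for illustration.

```python
# Average weekly spend (GBP) taken from the treated/control table above.
treated_before, treated_after = 50.0, 60.0
control_before, control_after = 50.0, 55.0


def uplift(t_before, t_after, c_before, c_after):
    """Change in the treated group minus change in the control group."""
    return (t_after - t_before) - (c_after - c_before)


# Part of the increase attributable to the mailing.
effect_per_week = uplift(treated_before, treated_after,
                         control_before, control_after)

# Part of the increase that happened anyway (seen in the controls).
background_per_week = control_after - control_before

# Scale up to the 1 million treated customers over the six-week post-period.
n_treated = 1_000_000
weeks = 6
incremental_revenue = effect_per_week * n_treated * weeks

print(effect_per_week, background_per_week, incremental_revenue)
```

On these numbers, only half of the £60m rise in revenue among treated customers can be credited to the campaign; the rest would have arrived without it.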

Behavioural Segmentation

Now let us take a similar treatment group drawn uniformly, at random, from a larger population of 10 million. We segment the treatment population by average weekly spend in the pre-period and plot the increase or decrease in spend between the pre- and post-periods in each segment for our treatment group. The graph below shows a possible outcome.

A bar graph showing a split of the treated population using average spend bands for the pre-period of £0, and £0.01 to £10, and ten-pound intervals up to £90, and finally a bar for over £90. The vertical scale is the change in spend between the pre- and post-periods, quantified by the difference between them (post-spend minus pre-spend). The bars decrease monotonically, with the £0 group increasing spend by about £15, and these increases dropping to zero for the £60–70 per week group, and being negative to the tune of about £10 a week for those spending over £90 in the pre-period.

For people who are not steeped in regression to the mean, this graph may appear somewhat alarming. Depending on the distribution of the population, this might well represent an overall increase in spending (since probably more of the people are on the left of the graph, where the change in spend is positive). But I can almost guarantee that any marketing director would declare this to be a disaster, saying (possibly in more colourful language) "Look at the damage you have wreaked on my best customers!"

But would this be a reasonable reaction? Would it, in fact, be accurate? At this point, we have no idea whether the campaign caused the higher-spending customers' spend to decline, or whether something else did. To assess that, we need once more to look at the same information for the control group (9 million people, in this case). That's shown below.

The same graph as above, but now showing the change in spend for the control group as well as for the treated group. The same general pattern is seen in the control group, but the increases are smaller in the control group (starting at £8 for the group that had spent £0 in the six weeks before the mailing, and going down to −£15 for the group spending over £90 in the pre-period). So change in spend is more positive, or less negative, in the treated group than in the control group in every behavioural segment.

What we clearly see is that in every segment the change in spend was either more positive or less negative in the treated group than in the control group. So the campaign did have a positive effect in every segment. Not so embarrassing. (If only this had been our case!)
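As a rough check, the uplift can be computed for the two segments whose values are stated in the graph descriptions. The figures below are the approximate changes in weekly spend quoted there for the £0 and over-£90 segments; the values for the segments in between are not given, so they are left out of this sketch rather than guessed.

```python
# Approximate change in average weekly spend (GBP, post minus pre),
# as read from the graph descriptions for the two end segments only.
changes = {
    "£0": {"treated": 15.0, "control": 8.0},
    "over £90": {"treated": -10.0, "control": -15.0},
}

# Uplift = change in treated group minus change in control group.
uplift_by_segment = {
    segment: c["treated"] - c["control"] for segment, c in changes.items()
}

for segment, value in uplift_by_segment.items():
    print(f"{segment}: uplift of {value:+.2f} per week")
```

Even in the over-£90 segment, where treated spend fell, the uplift is positive because control spend fell further.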

Regression to the mean

To understand more clearly what's going on here, it's helpful to look at the same data but focus only on the control group.

The same graph as the last one, but now with the treated group removed.

Remember, this is the control group: we have not done anything to this population. This is a classic case of regression to the mean. I would confidently predict that for almost any customer base, if we allocate people to segments on the basis of a behavioural characteristic over some period, then measure that same characteristic for the same people, using the same segment allocations, at a later period, we would see a pattern like this: the segments with low rates of the behaviour in question in the first place would increase that behaviour (at least, relative to the population as a whole), and the people in the segments that exhibited the behaviour more would fall back, on average.

Why?

Mixing Effects

When you segment a population on the basis of a behaviour, many of the people are captured exhibiting their typical behaviours. But inevitably, you capture some of the people exhibiting behaviour that is for them atypical.

Consider the first bar in the example above—people who spent nothing in the six weeks before the mailing. Ignoring the possibility of people returning goods, it is impossible for the average spend of this group to decline. In fact, if even a single person from this group buys something in the post-campaign period, the average spend for that segment will increase. In terms of the mixing I am talking about, some of the people will have completely lapsed, and will never spend again, while others were in an atypically low spending period for them: maybe they were on holiday, or trying out a competitor or didn't use their loyalty card and so their spending was not tracked. The thing that's special about this first group is that they literally cannot be in an atypically high spending period when they were assigned to segments, because they weren't spending anything.

It's less clear-cut, but a similar argument pertains to the group on the far right of the graph. Some of those are people who routinely spend over £90 a week at this retailer. But others will have had atypically high spend when we assigned them to segments: maybe they had a huge party and bought lots of alcohol for it, or maybe they shopped for someone else over that period. With the highest-spending group, there will probably be a small number of people whose spend was atypically low during the period we assigned them to segments, but there are likely to be far more people for whom this spend was atypically high at the right side of the distribution. So in this case, we can see it's likely that the average spend of these higher-spending segments will decline (relative to the population as a whole) if we measure them at a later time period.[3]

For people in the middle of the distribution, the story is similar but more balanced. Some people will have their typical spend where we measured it, and there will be others whom we captured at atypically high or atypically low spending periods, but those will tend to cancel out.

These mixing effects give the best explanation I know of the phenomenon of regression to the mean. It is always something to look for when you assign people to segments based on a behaviour and then look for changes in the people in those segments at a later time.
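A small simulation can make the mixing argument concrete. The sketch below is not the analysis from the campaign. It assumes hypothetical customers whose typical weekly spend is uniform between £0 and £100, with independent Gaussian noise (standard deviation £20) added in each period and spend floored at zero. Nobody is treated, so any pattern in the output comes purely from the segmentation. All names and parameters are illustrative assumptions.

```python
import random
from statistics import mean


def simulate(n_customers=100_000, seed=0):
    """Mean change in spend (post minus pre) per pre-period spend band.

    Band 0 covers pre-period spend below £10 (including £0), bands 1-8
    are ten-pound intervals, and band 9 is £90 and over.
    """
    rng = random.Random(seed)
    changes_by_band = {}
    for _ in range(n_customers):
        typical = rng.uniform(0.0, 100.0)  # assumed long-run weekly spend
        # Observed spend in each period is typical spend plus noise,
        # floored at zero because spend cannot be negative.
        pre = max(0.0, typical + rng.gauss(0.0, 20.0))
        post = max(0.0, typical + rng.gauss(0.0, 20.0))
        # Static segmentation: the band is fixed by pre-period spend only.
        band = min(int(pre // 10), 9)
        changes_by_band.setdefault(band, []).append(post - pre)
    return {band: mean(vals) for band, vals in sorted(changes_by_band.items())}


mean_change = simulate()
for band, change in mean_change.items():
    print(f"band {band}: mean change {change:+.2f}")
```

Under these assumptions, the lowest band should show a rise in average spend and the highest band a fall, even though no customer's typical behaviour has changed between the two periods.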

So how did we lose so much money?

The reason our campaign worked out so poorly was that we did not take into account regression to the mean when we set the targets, because we didn't think of it. Because we targeted more people with below-median spend than above-median spend, regression to the mean meant that although spend increased quite strongly among our treated customers, it also increased quite strongly for the control group in each segment. In that regard, the uplifts were less similar across the spend segments than I have shown here; something, in fact, that I now know to be characteristic of retail, unlike many other areas. The most active shoppers are often also the most responsive to marketing campaigns.

The campaign produced an uplift in all segments, but much more of the increase in spend than we expected was due to regression to the mean in the population we targeted, with the result that the value of the loyalty points given out significantly exceeded the incremental profit contribution from the people awarded those points.

This was a tough one. But at least I will remember it forever.


  1. Without getting too specific, the loss was a six-figure sterling sum, back when that was a more significant amount of money than it is today. It was not really a material amount for the retailer, which had a significant fraction of the UK population as regular customers; but it was a highly material amount for Quadstone: more, in fact, than the four founders had invested in the company, though probably less than our entire first-round funding. And retailers don't get to be big and dominant by treating six-figure losses with equanimity. 

  2. Uniformly at random means that, conceptually, a coin is tossed or a die is rolled to determine whether someone is allocated to the control group or the treated group. The coin or die does not need to be fair: it's fine to allocate all the 1's on the die to control and all the 2-6's to treated, or to use a weighted coin, as long as the procedure does not use any other information to determine the allocation. For example, choosing to put two thirds of the men in control (chosen randomly) and only one third of the women in control is no good, because now, if there's a difference, it is hard to separate out the effect of the treatment from the effect of sex. (If the volume suffices, you could assess the uplift independently for men and women, in this particular case, but that quickly gets complicated, and there is more than enough scope for errors without such complications.) 

  3. One possible confusion is that I'm describing a static, rather than a dynamic, segmentation here: people are allocated to a segment on the basis of their spend in the pre-period and remain in that segment when assessed in the post-period. If we reassigned people on the basis of their later behaviour, we would not see this effect if the spend distribution were static. 


Name Styles

Posted on Mon 04 March 2024 in TDDA • Tagged with TDDA, names

This is just a bit of fun, but I've always been interested in the different kinds of names allowed, encouraged, and used in different areas of computing and data.

A few years ago, I tweeted some well-known naming styles and a collection of lesser-known naming styles. I was playing about with the same idea while thinking about metadata standards today and came up with this. Just as I often think one of the boxes on the ubiquitous 2x2 "Boston-Box"-style matrices makes no sense, I think some of the boxes on the evil-good-lawful-chaotic breakdown (which I gather comes from Dungeons and Dragons) make little sense, so forgive me if some of this looks slightly forced. But I think it's fun.

Evil-Good-Lawful-Chaotic 3x3 matrix classification of name styles. LAWFUL GOOD: CamelCase, dromedaryCase, snake_case.  NEUTRAL GOOD: kebab-case, SCREAMING_SNAKE_CASE.  CHAOTIC GOOD: Train-Case, SCREAMING-KEBAB-CASE.  LAWFUL NEUTRAL: Pascal_Snake_Case, camel Snake Case, flatcase, UPPERFLATCASE.  NEUTRAL: reservedcase_, private ish case.  CHAOTIC NEUTRAL: space case.  LAWFUL EVIL: double quoted case, single quoted case, __dunder_case__.  NEUTRAL EVIL: path/case.extended, colon:caseflatcase, path/case, endash-kebab-case, quoted embedded newline case.  CHAOTIC EVIL: teRRorIsT nOTe CAse, alternating_separator-case, curly double quoted case, curly single quoted case, unquoted embedded, newline case.