Jupyter Notebooks Considered Harmful: The Parables of Anne and Beth

Posted on Thu 14 November 2024 in TDDA

I have long considered writing a post about the various problems I see with computational notebooks such as Jupyter Notebooks. As part of a book I am writing on TDDA, I created four parables about good and bad development practices for analytical workflows. They were not intended to form this post, but the way they turned out fits the theme quite well.

Situation Report

Anne and Beth are data scientists, working in parallel roles in different organizations. Each was previously tasked with analysing data up to the end of the second quarter of 2024. Their analyses were successful and popular. Even though there had never been any suggestion that the analysis would need to be updated, on the last Friday of October Anne and Beth each receive an urgent request to “rerun the numbers using data up to the end of Q3”. The following parables show four different ways this might play out for our two protagonists from this common initial situation.

Parable #1: The Parable of Parameterization

Beth

Beth locates the Jupyter Notebook she used to run the numbers previously and copies it with a new name ending Q3. She changes the copy to use the new data and tries to run it but discovers that the Notebook does not work if run from start to finish. (During development, Beth jumped about in the Notebook, changing steps and rerunning cells out of order as she worked until the answers looked plausible.)

Beth spends the rest of Friday trying to make the analysis work on the new data, cursing her former self and trying to remember exactly what she did previously.

At 16:30, Beth texts her partner saying she might be a little late home.


Anne

Anne begins by typing

make test

to run the tests she created based on her Q2 analysis. They pass. She then puts the data in the data subdirectory, calling it data-2024Q3, and types

make analysis-2024Q3.pdf

Her Makefile’s pattern rule matches the Q3 data to the target output and runs her parameterized script, which performs the analysis using the new data. It produces the PDF, and issues a message confirming that consistency checks on the outputs passed. After checking the document, Anne issues it.

At 09:30, Anne starts to plan the rest of her Friday.

Computational notebooks, such as Jupyter Notebooks,1 have taken the data science community by storm, to the point that it is now often assumed that analyses will be performed in a Notebook. Although I almost never use them myself, I do use a web interface (Salvador) to my own Miró that from a distance looks like a version of Jupyter from a different solar system. Notebooks are excellent tools for ad hoc analysis, particularly data exploration, and offer clear benefits, including the ability to embed graphical output, easy shareability, web approaches to handling wide tables, and support for annotating analyses in a spirit somewhat akin to literate programming. I do not wish to take away anyone's Notebooks, notwithstanding the title of this post. I do, however, see several key problems with the way Notebooks are used and abused. Briefly, these are:

  • Lack of Parameterization. I see Notebook users constantly copying Notebooks and editing them to work with new data, instead of writing parameterized code that handles different inputs. Anne's process uses the same program to process the Q2 and Q3 data. Beth's process uses a modified copy, which is significantly less manageable and offers more scope for error (particularly errors of process).

  • Lack of Automated Testing. While it is possible to write tests for Notebooks, and tools and guides exist to help (e.g. Remedios 2021),2 in my experience it is rare for this to be done, even by the standards of data science, where testing is less common than I would like it to be.

  • Out-of-order execution. In Notebooks, individual cells can be executed in any order, and state is maintained between execution steps. Cells may fail to work as intended (or at all) if the state has not been set up correctly before they are run. When this happens, other cells can be executed to patch up the state and then the failing cell can be run again. Not only can critical setup code end up lower down a Notebook than the code that uses it, causing a problem if the Notebook is cleared and re-run; worse, the key setup code can be altered or deleted after it has been used to set the state. This is my most fundamental reservation about Notebooks, and it is not merely a theoretical concern. I have known many analysts who routinely leave Notebooks in inconsistent states that prevent them from running straight through to produce the results. Notebooks are fragile.

  • Interaction with Source Control Systems. Notebooks can be stored in source control systems such as Git, but some care is needed. Again, in my experience, Notebooks tend not to be kept under version control, with the copy-paste-edit pattern (for whole Notebooks) being more common.

In my view, Notebooks should be treated as prototypes to be converted to parameterized, tested scripts immediately after development. This will often involve converting the code in cells (or groups of cells) into parameterized functions, something else that Notebooks seem to discourage. This is probably because cells provide a subset of the benefits of a callable function, by visually grouping a block of code and allowing it to be executed in isolation. Cells do not, however, provide other key benefits of functions and classes, such as separate scopes, parameters, enhanced reusability, enhanced testability, and abstraction.
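As a minimal sketch of such a conversion (the file layout and column names here are purely illustrative, not taken from either analyst's code), a cell with hard-wired values might become:

import pandas as pd

def category_totals(quarter, datadir="data"):
    # Formerly a cell with a hard-coded path such as data/data-2024Q2;
    # the quarter parameter now selects the input file instead.
    df = pd.read_csv(f"{datadir}/data-{quarter}")
    return df.groupby("Category")["PurchasePrice"].sum()

Unlike the original cell, this function has its own scope and parameters, can be imported by a script or a test, and works unchanged for any quarter.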

Anne's process has four key features that differ from Beth's.

  • Anne runs her code from a command line using make. If you are not familiar with the make utility, it is well worth learning about, but the critical point here is that Anne's setup allows her to use her process on new data without editing any code: in this case, her make command expands (in effect) to python analyse.py 2024Q3, using the parameter 2024Q3 both to locate the input data and to name the report it generates. A sketch of such a Makefile follows this list.

  • Anne also benefits from the tests she previously wrote, so she has confidence that her code is behaving as expected on known data. This is the essence of what we mean by reference testing. While you might think that if Anne has not changed anything since she last ran the tests, they are bound to pass (as they do in this parable), this is not necessarily the case.

  • Anne's code also includes computational checks on the outputs. Of course, such checks can be included in a Notebook just as easily as they can be in scripts. The reason they are not is entirely because I am making one analyst a paragon of good practice and the other a caricature of sloppiness.

  • Finally, unlike Beth, Anne takes the time to check her outputs before sending them on. Once again, this is because Anne cares about getting correct results, and wants to find any problems herself, not because she does not use a Notebook.
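For concreteness, here is a minimal sketch of how a Makefile like Anne's might be laid out. The test runner, the data location and the analyse.py interface are assumptions based only on the commands quoted above:

.PHONY: test
test:
	python -m pytest

# Pattern rule: the stem % matches the quarter, e.g. 2024Q3,
# and is available in the recipe as $*.
analysis-%.pdf: data/data-%
	python analyse.py $*

When Anne types make analysis-2024Q3.pdf, the pattern rule binds the stem to 2024Q3, so a single rule and a single parameterized script serve every quarter.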

Parable #2: The Parable of Testing

Beth

Beth copies, renames and edits her previous Notebook and is pleased to see it runs without error on the Q3 data. She issues the results and plans the rest of her Friday.

The following week, Beth's inbox is flooded with people saying her results are “obviously wrong”. Beth is surprised since she merely copied the Q2 analysis, updated the input and output file paths, and ran it. She opens her old Q2 Notebook and reruns all cells. She is dismayed to see all the values and graphs in the second half of the Notebook change.

Beth has some remedial work to do.


Anne

Anne runs her tests, but one fails. On examining the failing test, Anne realises that a change she made to one of her helper libraries means that a transformation she had previously applied in the main analysis is now done by the library, so it should be removed from her analysis code.

After making this change, the tests (which were all based on the Q2 data) pass. Anne commits the change to source control before typing

make analysis-2024Q3.pdf

to analyse the new data. After sense checking the results, Anne issues them.

In the first parable, Beth's code would not run from start to finish; this time, it runs but produces different answers from when she ran it before using the Q2 data. This could be because she had failed to clear and re-run the Notebook to generate her final Q2 results, but here I am assuming that her results changed for the same reason as Anne's: both had updated a helper library that their code used. Whereas Anne's tests detected the fact that her previous results had changed, Beth only discovered this when other people noticed her Q3 results did not look right (though had she checked her results, she might have noticed that something looked wrong).

Anne is in a slightly better position than Beth to diagnose what went wrong, because her “correct” (previous) results are stored as part of her tests. Now that Beth has updated her Notebook, it may be harder for her to recover the old results. Even if she has access to the old and new results, Beth is probably in a worse position than Anne, because Anne has at least one test highlighting how the result has changed. This should allow her to make faster progress and gain confidence that her fix is correct more easily.
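A reference test of the kind that saved Anne can be very simple. This sketch uses pytest with the illustrative category_totals function from earlier; the stored-results file is likewise hypothetical:

import pandas as pd
from analyse import category_totals

def test_category_totals_2024q2():
    # Recompute the Q2 results and compare them with stored, verified results.
    actual = category_totals("2024Q2")
    expected = pd.read_csv("tests/expected-2024Q2.csv",
                           index_col="Category")["PurchasePrice"]
    pd.testing.assert_series_equal(actual, expected, check_dtype=False)

Because the verified Q2 results live alongside the tests, any change (in the analysis code or in a helper library) that alters them is caught immediately, and the old results remain available for diagnosis.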

Parable #3: The Parable of Units

Beth

Again, Beth copies, renames and updates her previous Notebook and is happy to see it runs straight through on the Q3 data. She issues the results and looks forward to a low-stress day.

Around 16:00, Beth's phone rings and a stressed executive tells her the answers “can't be right” and need to be fixed quickly. Beth is puzzled. She opens her Q2 Notebook, re-runs it and the output is stable. That, at least, is good.

Beth now compares the Q2 and Q3 datasets and notices that the values in the PurchasePrice column are some three orders of magnitude larger in Q3 than in Q2, as if the data is in different units. She checks with her data supplier to confirm that this is the case, then sends some rebarbative emails, with the subject Garbage In, Garbage Out! Beth grumpily adds a cell to her Q3 notebook dividing the relevant column by 1,000. She then adds _fixed to the Q3 notebook's name to encourage her to copy that one next time. She wonders why everyone else is so lazy and incompetent.


Anne

As usual, Anne first runs her tests, which pass. She then runs the analysis on the Q3 data by issuing the command

make analysis-2024Q3.pdf

The code stops with an error:

Input Data Check failed for: PurchasePrice_kGBP
Max expected: GBP 10.0k
Max found: GBP 7,843.21k

(Anne's code creates a renamed copy of the column after loading, because she had noticed, while analysing Q2, that the prices were in thousands of pounds.)

Anne checks with her data supplier, who confirms a change of units, which will continue going forward. Anne persuades her data provider to change the field name for clarity, and to reissue the data.

Anne adds different code paths based on the supplied column names and adds tests for the new case. Once they pass, and she has received the updated data, Anne commits the change, runs and checks the analysis and issues the results.

This time, Anne is saved by virtue of having added checks to the input data, which Beth clearly had not (though, again, such checks could easily be included in a Notebook). This builds directly on the ideas of other articles in this blog, whether implemented through TDDA-style constraints or more directly as explicit checks on input values in the code.
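Such a check need not be elaborate. This sketch reproduces the behaviour of the error message shown above; the threshold comes from that message, and the function name is illustrative:

def check_purchase_prices(df, max_expected=10.0):
    # Fail fast if the renamed price column exceeds the plausible range (in kGBP).
    found = df["PurchasePrice_kGBP"].max()
    if found > max_expected:
        raise ValueError(
            f"Input Data Check failed for: PurchasePrice_kGBP\n"
            f"Max expected: GBP {max_expected:.1f}k\n"
            f"Max found: GBP {found:,.2f}k"
        )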

Anne (being the personification of good practice) also noticed the ambiguity in the PurchasePrice variable and created a renamed copy of it for clarity. Note, however, that her check would have worked even if she had not created the renamed variable.

A third difference is that Anne has effected a systematic improvement in her data feed by getting the supplier to rename the field. This reduces the likelihood that the unit will be changed without flagging it, decreases chances of its being misinterpreted, and allows Anne to have two paths through her single script, coping with data in either format safely. By re-sourcing the updated data, Anne also confirms that the promised change has actually been made, and that the new data looks correct.
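The two code paths might look something like this; the new field name PurchasePrice_GBP is an assumption, since the parable says only that the supplier renamed the field for clarity:

def normalise_prices(df):
    # New-format data: prices supplied in pounds, so convert to thousands.
    if "PurchasePrice_GBP" in df.columns:
        df["PurchasePrice_kGBP"] = df["PurchasePrice_GBP"] / 1000.0
    # Old-format data: PurchasePrice was already in thousands of pounds.
    elif "PurchasePrice" in df.columns:
        df["PurchasePrice_kGBP"] = df["PurchasePrice"]
    else:
        raise ValueError("No recognised purchase price column found")
    return df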

Finally, Beth now has different copies of her code and has to be careful to copy the right one next time (hence _fixed). Anne's old code only exists in the version control system, and crucially, her new code safely handles both cases.

Parable #4: The Parable of Applicability

Beth

Beth dusts off her Jupyter Notebook and, as usual, copies it with a new name ending Q3. She makes the necessary changes to use the new data but it crashes with the error:

ZeroDivisionError: division by zero

After a few hours tracing through the code, Beth eventually realises that there is no data in Q3 for a category that had been quite numerous in the Q2 data. Her Notebook indexes other categories against the missing category by calculating their ratios. On checking with her data provider, Beth confirms that the data is correct, so adds extra code to the Q3 version of the Notebook to handle the case. She also makes a mental note to try to remember to copy the Q3 notebook in future.


Anne

Anne runs her tests by typing

make test

and enjoys watching the progress of the “green line of goodness”. She then runs the analysis on the Q3 data by typing

make analysis-2024Q3.pdf

but it stops with an error message: “There is no data for Reference Store 001. If this is right, you need to choose a different Reference Store.” After establishing that the data is indeed correct, Anne updates the code to handle this situation, checks that the existing tests pass, and adds a couple of regression tests to make sure that the code copes not only with the default reference store having no data, but also with alternative reference stores having none. When all tests pass, she runs the analysis in the usual way, checks it, commits her updated code and issues the results.

As in the previous parable, the most important difference between Beth's approach and Anne's is that Beth's fix for their common problem is ad hoc and leads to further code proliferation3 and its concomitant risks if the analysis is run again. In contrast, Anne's code becomes more general and robust, as it handles the new case along with the old, and she adds regression tests to try to ensure that nothing breaks the handling of this case in future. The “green line of goodness” mentioned above is the name some testers use for the line of dots (sometimes green) that many test frameworks issue as each test passes.
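For completeness, a sketch of the kind of guard Anne might have added; the Store column name and the function shape are assumptions, though the message is the one from the parable:

def choose_reference_store(df, reference_store="001"):
    # Fail with a clear, actionable message instead of a ZeroDivisionError
    # when the chosen reference store has no data in this period.
    if df[df["Store"] == reference_store].empty:
        raise SystemExit(
            f"There is no data for Reference Store {reference_store}. "
            "If this is right, you need to choose a different Reference Store."
        )
    return reference_store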


So there we have it. On a more constructive note, I have been following the progress of Quarto with interest. Quarto is a development of RMarkdown, a different style of computational document popular in the R and Reproducible Research communities.4 To my mind it has fewer of the problems highlighted here. It also supports Python and much of the Python data stack as first-class citizens, and in fact integrates closely with Jupyter, which it uses behind the scenes for many Python-based workflows. I have been using it over the last couple of days, and though it is still distinctly rough around the edges, I think it offers a very promising way forward, with excellent output options that include PDF (via LaTeX), HTML, Word documents and many other formats. It is both an interesting alternative to Notebooks and (perhaps more realistically) a reasonable target for migrating code from Notebook prototypes. I use most of the cells to call functions imported at the start, promoting code re-use and parameterization, which avoids another of the practical pitfalls of Notebooks discussed above.
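For a flavour of the format, a minimal Quarto document is a plain-text .qmd file of roughly this shape (the title and cell contents here are illustrative):

---
title: "Q3 Analysis"
format: pdf
---

Some explanatory text, written in markdown.

```{python}
from analyse import category_totals
category_totals("2024Q3").plot(kind="bar")
```

Because the document is plain text, it diffs cleanly under version control, and quarto render executes it from top to bottom every time, which removes the out-of-order execution problem by construction.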


  1. I am using the term Jupyter Notebook to cover both what are now called Jupyter Notebooks (“The Classic Notebook Interface”) and JupyterLab (the “Next-Generation Notebook Interface”). This is both because most people I know continue to call them Jupyter Notebooks, even when using JupyterLab, and because “Notebooks” reads better in the text. I will capitalize Notebook when it refers to a computational notebook, as opposed to a paper notebook. 

  2. Remedios 2021, How to Test Jupyter Notebooks with Pytest and Nbmake, 2021-12-14. 

  3. analysis_fixed_fixed2_final.ipynb 

  4. Reproducible research is very much a sister movement to TDDA (a much larger sister movement). Its goals are similar and it is wholly congruent with all the ideas of TDDA. To the extent that there is divergence, some of it simply arises from separate evolution, and some from the fact that the focus of reproducible research is more on allowing other people to access your code and data, to run it themselves and verify the outputs, or to write their own analysis to verify your results even more strongly, or to use your code on their data as a different sort of validation. I sometimes call TDDA “reproducible research for solipsists”, because of its greater focus on testing, and on helping to discover and eliminate problems even if no second person is involved. Another related area I have recently become aware of is Veridical Data Science, as developed by Bin Yu and Rebecca Barter. The link is to their book of that name.