I have long considered writing a post about the various problems I see
with computational notebooks such as Jupyter Notebooks. As part of a book
I am writing on TDDA, I created four parables about good and bad
development practices for analytical workflows. They were not intended
to form this post, but the way they turned out fits the theme quite well.
Situation Report
Anne and Beth are data scientists, working in parallel roles in
different organizations. Each was previously tasked with analysing
data up to the end of the second quarter of 2024. Their analyses
were successful and popular. Even though there had never been any
suggestion that the analysis would need to be updated,
on the last Friday of October Anne and Beth each receive an urgent request
to “rerun the numbers using data up to the
end of Q3”. The following parables show four different ways this
might play out for our two protagonists from this common initial
situation.
Parable #1: The Parable of Parameterization
Beth
Beth locates the Jupyter Notebook she used to run the numbers
previously and copies it with a new name ending Q3. She changes the
copy to use the new data and tries to run it but discovers that the
Notebook does not work if run from start to finish. (During
development, Beth jumped about in the Notebook, changing steps and
rerunning cells out of order as she worked until the answers looked
plausible.)
Beth spends the rest of Friday trying to make the analysis work on
the new data, cursing her former self and trying to remember exactly
what she did previously.
At 16:30, Beth texts her partner saying she might be a little late
home.
Anne
Anne begins by typing
make test
to run the tests she created based on her Q2 analysis. They pass. She
then puts the data in the data subdirectory, calling it data-2024Q3, and types
make analysis-2024Q3.pdf
Her Makefile's pattern rule matches the Q3 data to the target output
and runs her parameterized script, which performs the analysis using
the new data. It produces the PDF, and issues a message confirming
that consistency checks on the outputs passed. After checking the
document, Anne issues it.
At 09:30, Anne starts to plan the rest of her Friday.
Computational notebooks, such as Jupyter Notebooks, have
taken the data science community by storm, to the point that it is now
often assumed that analyses will be performed in a Notebook. Although
I almost never use them myself, I do use a web interface (Salvador) to
my own Miró that from a
distance looks like a version of Jupyter from a different solar system.
Notebooks are excellent tools for ad hoc analysis, particularly data
exploration, and offer clear benefits including the ability to embed
graphical output, easy shareability, web approaches to handling wide
tables, and support for annotating analyses in a spirit
somewhat akin to literate programming. I do not wish to take away
anyone's Notebooks, notwithstanding the title of this post. I do,
however, see several key problems with the way Notebooks are used and
abused. Briefly, these are:
-
Lack of Parameterization.
I see Notebook users constantly copying
Notebooks and editing them to work with new data, instead of
writing parameterized code that handles different inputs. Anne's
process uses the same program to process the Q2 and Q3 data.
Beth's process uses a modified copy, which is significantly less
manageable and offers more scope for error (particularly
errors of process).
-
Lack of Automated Testing. While it is possible to write tests
for Notebooks, and some tools and guides exist
(e.g. Remedios 2021),
in my experience it is rare for this to be
done even by the standards of data science, where testing is less
common than I would like it to be.
-
Out-of-order execution. In Notebooks, individual cells can be
executed in any order, and state is maintained between execution
steps. Cells may fail to work as intended (or at all) if the state
has not been set up correctly before they are run. When this
happens, other cells can be executed to patch up the state and
then the failing cell can be run again. Not only can critical
setup code end up lower down a Notebook than code that uses it,
causing a problem if the Notebook is cleared and re-run; the key
setup code can also be altered or deleted after it has been used to set
the state. This is my most fundamental reservation about
Notebooks, and it is not merely a theoretical concern. I have known
many analysts who routinely leave Notebooks in inconsistent states
that prevent them from running straight through to produce the
results. Notebooks are fragile.
-
Interaction with Source Control Systems. Notebooks can be stored
in source control systems like Git, but some care is
needed. Again, in my experience, Notebooks tend not to be kept under
version control, with the copy-paste-edit pattern
(for whole Notebooks) being more common.
In my view, Notebooks should be treated as prototypes to be converted
to parameterized, tested scripts immediately after development. This
will often involve converting the code in cells (or groups of cells)
into parameterized functions, something else that Notebooks seem to
discourage. This is probably because cells provide a subset of the
benefits of a callable function by visually grouping a block of code
and allowing it to be executed in isolation. Cells do not, however,
provide other key benefits of functions and classes, such as separate
scopes, parameters, enhanced reusability, enhanced testability, and
abstraction.
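To make that conversion concrete, here is a minimal sketch of what promoting a cell to functions might look like. The column names, file layout and function names are illustrative assumptions, not details from the parables.

# Before (a typical Notebook cell): hard-coded path, results left in
# global state for later cells to pick up.
#
#     df = pd.read_csv("data/data-2024Q2.csv")
#     summary = df.groupby("category")["revenue"].sum()
#
# After: the same work as a pair of parameterized, testable functions.
import pandas as pd

def load_sales(period: str, data_dir: str = "data") -> pd.DataFrame:
    """Load the data for a period such as '2024Q3' (path convention assumed)."""
    return pd.read_csv(f"{data_dir}/data-{period}.csv")

def summarise_by_category(df: pd.DataFrame) -> pd.DataFrame:
    """Aggregate the (illustrative) revenue column by category."""
    return df.groupby("category", as_index=False)["revenue"].sum()

Functions like these can be imported into a Notebook for exploration and into a script or test suite unchanged.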
Anne's process has four key features that differ from Beth's.
-
Anne runs her code from a command line using make. If you
are not familiar with the make utility, it is well worth
learning about, but the critical point here is
that Anne's setup allows her to use her process on new data without
editing any code: in this case her make command expands,
in effect, to python analyse.py 2024Q3 and uses the parameter
2024Q3 both to locate the input data and to name the
matching report it generates. (A minimal sketch of such a
parameterized script appears after this list.)
-
Anne also benefits from tests she previously wrote, so has
confidence that her code is behaving as expected on known data.
This is the essence of what we mean by
reference testing.
While you might think that if Anne has not changed anything
since she last ran the tests, they are bound to pass (as they do in
this parable), this is not necessarily the case.
-
Anne's code also includes computational checks on the outputs. Of
course, such checks can be included in a Notebook just as easily as
they can be in scripts. The reason they are not is entirely because
I am making one analyst a paragon of good practice and the other a
caricature of sloppiness.
-
Finally, unlike Beth, Anne takes the time to check her outputs
before sending them on. Once again, this is because Anne cares
about getting correct results, and wants to find any problems
herself, not because she does not use a Notebook.
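For concreteness, here is a minimal sketch of the kind of parameterized driver that make analysis-2024Q3.pdf might invoke as python analyse.py 2024Q3. It is not Anne's actual code: the file layout and the helper bodies are placeholder assumptions.

# analyse.py -- illustrative sketch of a parameterized driver script.
import sys
from pathlib import Path

import pandas as pd

def run_analysis(data_path: Path) -> pd.DataFrame:
    """Placeholder for the real analysis; here it just loads the data."""
    return pd.read_csv(data_path)

def check_outputs(results: pd.DataFrame) -> None:
    """Placeholder consistency checks on the outputs."""
    if results.empty:
        raise ValueError("Consistency check failed: analysis produced no rows")

def write_report(results: pd.DataFrame, report_path: Path) -> None:
    """Placeholder report writer; a real one might render a PDF."""
    report_path.with_suffix(".csv").write_text(results.to_csv(index=False))

def main(period: str) -> None:
    data_path = Path("data") / f"data-{period}.csv"   # locate the input from the period
    report_path = Path(f"analysis-{period}.pdf")      # name the output to match
    results = run_analysis(data_path)
    check_outputs(results)
    write_report(results, report_path)

if __name__ == "__main__":
    main(sys.argv[1])   # e.g. python analyse.py 2024Q3

Because the quarter arrives as a parameter, running the Q4 analysis later requires no edits at all, only a new make target name.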
Parable #2: The Parable of Testing
Beth
Beth copies, renames and edits her previous Notebook
and is pleased to see it runs without error on the Q3 data.
She issues the results and plans the rest of her Friday.
The following week, Beth's inbox is flooded with
people saying her results are “obviously wrong”.
Beth is surprised since she merely copied the Q2 analysis,
updated the input and output file paths, and ran it.
She opens her old Q2 Notebook and reruns all cells. She is
dismayed to see all the values and graphs in the second
half of the Notebook change.
Beth has
some remedial work to do.
Anne
Anne runs her tests but one fails.
On examining the failing test, Anne realises
that a change she made to one of her helper libraries means that
a transformation she had previously applied in the main analysis
is now done by the library, so should be removed from her analysis
code.
After making this change, the tests (which
were all based on the Q2 data) pass. Anne commits the
change to source control before typing
make analysis-2024Q3.pdf
to analyse the new data.
After sense checking the results,
Anne issues them.
In the first parable, Beth's code would not run from start to finish;
this time, it runs but produces different answers from when she ran it
before using the Q2 data. This could be because she had failed to
clear and re-run the Notebook to generate her final Q2 results, but
here I am assuming that her results changed for the same reason as
Anne's: they had both updated a helper library that their code used.
Whereas Anne's tests detected the fact that her previous results had
changed, Beth only discovered this when other people noticed her Q3
results did not look right (though had she checked her results, she
might have noticed that something looked wrong). Anne is in a slightly
better position than Beth to diagnose what went wrong, because her
“correct” (previous) results are stored as part of her tests. Now
that Beth has updated her Notebook, it may be harder for her to
recover the old results. Even if she has access to the old and new
results, Beth is probably in a worse position than Anne because Anne
has at least one test highlighting how the result has changed. This
should allow her to make faster progress and gain confidence that her
fix is correct more easily.
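A reference test of this kind might look something like the sketch below, written with pytest and pandas. It assumes the verified Q2 results were saved as a reference file when they were first produced; the paths and function names are illustrative assumptions.

# test_reference.py -- illustrative reference test (not Anne's actual suite).
from pathlib import Path

import pandas as pd
import pandas.testing as pdt

from analyse import run_analysis   # the hypothetical driver sketched earlier

def test_q2_results_match_reference():
    actual = run_analysis(Path("data") / "data-2024Q2.csv")
    expected = pd.read_csv("tests/reference/results-2024Q2.csv")
    # Any change to the results -- for example, one caused by an updated
    # helper library -- makes this fail and shows exactly what differs.
    pdt.assert_frame_equal(actual.reset_index(drop=True),
                           expected.reset_index(drop=True),
                           check_dtype=False)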
Parable #3: The Parable of Units
Beth
Again, Beth copies, renames and updates her previous Notebook
and is happy to see it runs straight through on the Q3 data.
She issues the results and looks forward to a low-stress day.
Around 16:00, Beth's phone rings and a stressed executive
tells her the answers “can't be right” and need to be fixed quickly.
Beth is puzzled. She opens her Q2 Notebook, re-runs it and the output
is stable. That, at least, is good.
Beth now compares the Q2 and Q3 datasets
and notices that the values in the
PurchasePrice
column are some three orders of magnitude larger
in Q3 than in Q2, as if the data is in different units.
She checks with her data supplier to confirm that this is the case,
then sends some rebarbative emails, with the subject
Garbage In, Garbage Out!
Beth grumpily adds a cell to her Q3 notebook dividing the
relevant column by 1,000.
She then adds _fixed
to the Q3 notebook's name
to encourage her to copy that one next time.
She wonders why everyone else is so lazy and incompetent.
Anne
As usual, Anne first runs her tests, which pass.
She then runs the analysis on the Q3 data by issuing
the command
make analysis-2024Q3.pdf
The code stops with an error:
Input Data Check failed for: PurchasePrice_kGBP
Max expected: GBP 10.0k
Max found: GBP 7,843.21k
(Anne's code creates a renamed copy of the column after load because
she had noticed, while analysing Q2, that the prices were in thousands
of pounds.)
Anne checks with her data supplier, who confirms a change of units,
which will continue going forward.
Anne persuades her data provider to change the field name for clarity,
and to reissue the data.
Anne adds different code paths based on the supplied column names
and adds tests for the new case.
Once they pass, and she has received the updated data,
Anne commits the change,
runs and checks the analysis and issues the results.
This time, Anne is saved by virtue of having added checks to the input
data, which Beth clearly had not (though, again, such checks could easily be
included in a Notebook). This builds directly on the ideas of
other articles in this blog, whether implemented through TDDA-style
constraints or more directly as explicit checks on input values in the
code.
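An explicit check of this sort might be as simple as the sketch below; the threshold and the function itself are illustrative assumptions, though the column name and error message follow the parable. The same effect could be achieved by verifying the input data against TDDA-style constraints.

import pandas as pd

def check_purchase_price(df: pd.DataFrame, max_expected_kgbp: float = 10.0) -> None:
    """Fail loudly if PurchasePrice_kGBP looks as if it is in the wrong units."""
    max_found = df["PurchasePrice_kGBP"].max()
    if max_found > max_expected_kgbp:
        raise ValueError(
            "Input Data Check failed for: PurchasePrice_kGBP\n"
            f"Max expected: GBP {max_expected_kgbp:.1f}k\n"
            f"Max found: GBP {max_found:,.2f}k"
        )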
Anne (being the personification of good practice) also noticed
the ambiguity in the PurchasePrice
variable and created
a renamed copy of it for clarity. Note, however, that her check
would have worked even if she had not created the renamed variable.
A third difference is that Anne has effected a systematic
improvement in her data feed by getting the supplier to rename
the field. This reduces the likelihood that the units will be
changed again without being flagged, decreases the chances of their being
misinterpreted, and allows Anne to have two paths through her
single script, coping with data in either format safely.
By re-sourcing the updated data, Anne also confirms
that the promised change has actually been made,
and that the new data looks correct.
Finally, Beth now has different copies
of her code and has to be careful to copy the right one next time
(hence _fixed). Anne's old code
only exists in the version control system, and crucially, her new
code safely handles both cases.
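One way Anne's two code paths might be arranged is sketched below. The parable does not say what name the supplier adopted, so the renamed field PurchasePriceGBP here is purely hypothetical, as is the helper function itself.

import pandas as pd

def standardise_purchase_price(df: pd.DataFrame) -> pd.DataFrame:
    """Return a frame with PurchasePrice_kGBP whichever form of the feed arrived."""
    df = df.copy()
    if "PurchasePriceGBP" in df.columns:        # new feed: plain pounds (assumed name)
        df["PurchasePrice_kGBP"] = df["PurchasePriceGBP"] / 1_000
    elif "PurchasePrice" in df.columns:         # old feed: thousands of pounds
        df["PurchasePrice_kGBP"] = df["PurchasePrice"]
    else:
        raise KeyError("No recognised purchase price column in the input data")
    return df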
Parable #4: The Parable of Applicability
Beth
Beth dusts off her Jupyter Notebook
and, as usual, copies it with a new name ending Q3.
She makes the necessary changes to use the new data
but it crashes with the error:
ZeroDivisionError: division by zero
After a few hours tracing through the code,
Beth eventually realises that there is no data in Q3 for
a category that had been quite numerous in the Q2 data.
Her Notebook indexes other categories against the missing category
by calculating their ratios. On checking with her data provider,
Beth confirms that the data is correct, so adds extra code to the
Q3 version of the Notebook to handle the case.
She also makes a mental note to try to remember to copy the Q3 notebook
in future.
Anne
Anne runs her tests by typing
make test
and enjoys watching the progress of the “green line of goodness”.
She then runs the analysis on the Q3 data by typing
make analysis-2024Q3.pdf
but it stops with an error message: There is
no data for Reference Store 001. If this is right, you
need to choose a different Reference Store.
After establishing that the data is indeed correct,
Anne updates the code to handle this situation, checks
that the existing tests pass, and adds a couple of regression tests
to make sure that it copes not only with the default reference
store having no data, but also with alternative reference stores.
When all tests pass, she runs the analysis
in the usual way, checks it, commits her updated code
and issues the results.
As in the previous parable, the most important difference between
Beth's approach and Anne's is that Beth's fix for their common problem
is ad hoc and leads to further code proliferation and its
concomitant risks if the analysis is run again. In contrast, Anne's
code becomes more general and robust as it handles the new case along
with the old, and she adds extra tests (regression tests)
to try to ensure that nothing breaks the handling of this case in
future. The “green line of goodness” mentioned is the name some
testers use for the line of dots (sometimes green) many test
frameworks issue each time a test passes.
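By way of illustration, the applicability check and one of the regression tests might look something like this sketch; the data layout (a store column of sales) and the function are assumptions for illustration, with the error message taken from the parable.

import pandas as pd
import pytest

def index_against_reference(df: pd.DataFrame, reference_store: str = "001") -> pd.Series:
    """Index each store's sales against the reference store's sales."""
    totals = df.groupby("store")["sales"].sum()
    if reference_store not in totals.index or totals[reference_store] == 0:
        raise ValueError(
            f"There is no data for Reference Store {reference_store}. "
            "If this is right, you need to choose a different Reference Store."
        )
    return totals / totals[reference_store]

def test_missing_reference_store_is_reported_clearly():
    df = pd.DataFrame({"store": ["002", "003"], "sales": [10.0, 20.0]})
    with pytest.raises(ValueError, match="Reference Store 001"):
        index_against_reference(df)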
So there we have it. On a more constructive note, I have been
following the progress of Quarto with interest.
Quarto is a development of RMarkdown, a different style of computational
document popular in the R and Reproducible Research communities.
To my mind it has fewer of the problems highlighted here.
It also supports Python and much of the Python data stack as first-class
citizens, and in fact integrates closely with Jupyter, which it uses
behind the scenes for many Python-based workflows.
I have been using it over the last couple of days, and though it is
still distinctly rough around the edges, I think it offers a very
promising way forward, with excellent output options that include
PDF (via LaTeX), HTML, Word documents and many other formats.
It's both an interesting alternative to Notebooks and
(perhaps more realistically) a reasonable target for migrating
code from Notebook prototypes. I use most of the cells to
call functions imported at the start, promoting code re-use
and parameterization, which avoids another of the pitfalls of
Notebooks (in practice) discussed above.