Best Practices for Notebook Users
Posted on Tue 17 December 2024 in misc
In a previous post, I discussed some of the challenges, dangers and weaknesses of Jupyter Notebooks, JupyterLab and their ilk. I used The Parables of Anne and Beth as a device to illustrate what I think of as good and bad practices for data science. A reasonable criticism of this was that it did not really offer anything to help people who might wish to continue using computational notebooks, but to work in such a way as to limit the harms identified.
Although it probably rings slightly hollow, my goal is absolutely to improve the quality of data science, data analysis, data engineering, and really all data work, and I very much see the attractions and strengths of Jupyter, despite being critical of certain dark patterns I see around its use.
Here are some suggested best practices for notebook users that I hope might be helpful and constructive. It's true that if you adopt all of them, I might have succeeded in prising your Notebook from your hands, but if you adopt any of them, as you use notebooks, I think you will be safer and more successful. I'm very much in favour of half a loaf.
Subject to the vagaries of the web, the checkboxes should be clickable online, should you find them useful as an actual checklist, and there's a PDF version available too.
Notebook Best Practices
- NBP1 Ensure that your Notebook runs correctly after completion
  - Develop the Notebook
  - Take a temporary copy of the Notebook
  - Clear the Notebook
  - Run and confirm the results match the temporary copy
  - Fix (if this is not the case)
  - Clear the new Notebook again
  - Commit the new Notebook to version control
  - Rerun (so the Notebook contains the results)
  - Delete the temporary copy
- NBP2 Store the (cleared) Notebook in version control (see NBP1)
- NBP3 Parameterize inputs and outputs at the top of the Notebook
  - e.g. Set and use variables such as INPATH and OUTPATH
  - Write the most important outputs to file (if not already done)
- NBP4 Replace some individual cells or groups of cells with functions
  - Move them into an importable file, import and use
  - Prioritize potentially re-usable code and code requiring testing
- NBP5 Write some tests
  - Create a reference (regression) test for the whole process
  - Create unit tests for the individual functions
- NBP6 Consider restructuring/extracting the code as a standalone script
- NBP7 Allow the parameters to be set from the command line
  - Alternatively, read from a configuration file (e.g. .json or .toml)
- NBP8 Consider using safer alternatives like Marimo and Quarto
NBP Version 1.0.
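As an illustration of the "clear" and "confirm the results match" steps in NBP1, here is a minimal sketch using only the Python standard library. It assumes the notebook is a standard .ipynb JSON file (nbformat 4), in which each code cell carries an `outputs` list and an `execution_count`. The function names (`clear_outputs`, `outputs_match`, `clear_file`) are invented for this example and are not part of Jupyter.

```python
import copy
import json
from pathlib import Path


def clear_outputs(notebook: dict) -> dict:
    """Return a copy of a parsed notebook with all code-cell outputs removed."""
    cleared = copy.deepcopy(notebook)
    for cell in cleared.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleared


def code_outputs(notebook: dict) -> list:
    """Collect each code cell's outputs, ignoring execution counters.

    Counters depend on the order cells happened to be run in, so they are
    dropped before comparing a reference copy with a fresh rerun.
    """
    collected = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        outputs = copy.deepcopy(cell.get("outputs", []))
        for output in outputs:
            output.pop("execution_count", None)
        collected.append(outputs)
    return collected


def outputs_match(reference: dict, rerun: dict) -> bool:
    """True if both notebooks have identical outputs in every code cell."""
    return code_outputs(reference) == code_outputs(rerun)


def clear_file(path: str) -> None:
    """Clear a notebook file in place (hypothetical helper for this sketch)."""
    p = Path(path)
    notebook = json.loads(p.read_text(encoding="utf-8"))
    cleared = clear_outputs(notebook)
    p.write_text(json.dumps(cleared, indent=1) + "\n", encoding="utf-8")
```

An exact comparison like this will report a mismatch for outputs that legitimately vary between runs (timestamps, random numbers, memory addresses), so treat a failure as a prompt to look rather than as proof of a bug. In practice, `jupyter nbconvert --clear-output --inplace` and `jupyter nbconvert --to notebook --execute` cover the clearing and rerunning steps from the command line.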
A printable PDF copy is available.
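To make NBP3, NBP4, NBP6 and NBP7 a little more concrete, here is one possible shape for code that has been lifted out of a notebook. It is only a sketch: the CSV column-totalling task, the default file names and the function names (`column_totals`, `run`, `parse_args`, `main`) are all invented for illustration. Only `INPATH` and `OUTPATH` come from the checklist.

```python
import argparse
import csv

# NBP3: inputs and outputs are named once, at the top.
INPATH = "input.csv"    # hypothetical default input file
OUTPATH = "totals.csv"  # hypothetical default output file


def column_totals(rows):
    """NBP4: former notebook cell, now an importable, testable function.

    `rows` is an iterable of dicts mapping column name to a numeric string.
    Returns a dict mapping column name to the sum of that column.
    """
    totals = {}
    for row in rows:
        for name, value in row.items():
            totals[name] = totals.get(name, 0.0) + float(value)
    return totals


def run(inpath, outpath):
    """Read the input CSV, total its columns, and write the result to file."""
    with open(inpath, newline="", encoding="utf-8") as f:
        totals = column_totals(csv.DictReader(f))
    with open(outpath, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["column", "total"])
        for name, total in totals.items():
            writer.writerow([name, total])
    return totals


def parse_args(argv=None):
    """NBP7: the same parameters, settable from the command line."""
    parser = argparse.ArgumentParser(description="Total the columns of a CSV file.")
    parser.add_argument("--inpath", default=INPATH)
    parser.add_argument("--outpath", default=OUTPATH)
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    run(args.inpath, args.outpath)

# NBP6: saved as a standalone script, the file would end with
#     if __name__ == "__main__":
#         main()
```

With this layout, a notebook can still `import` and call `run(INPATH, OUTPATH)`, a unit test (NBP5) can call `column_totals` directly on a couple of hand-written rows, and a reference test can run `main([...])` on a small known input and compare the output file with a stored expected copy.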