Test-Driven Document Development

Posted on Tue 02 September 2025 in TDDA

Summary

Computational documents attempt to guarantee that results included within them—such as graphs—correspond to the code and data claimed to generate them. They typically achieve this by generating the outputs from the code at the time the document is generated or viewed. This solves significant problems, including those of code rusting (exhibiting changed behaviour) and of unintentional inclusion of stale, incorrect, or unvalidated results. There is, however, a danger of what I term co-rusting, whereby the code and its outputs drift away from correctness (rust) together, without the author realizing. This is likely if the code continues to generate output (i.e., does not crash or report an error).

Computational documents are an important part of reproducible research, within which the main approach to avoiding co-rusting tends to be the use of reproducible environments, which aim to prevent rusting by pinning down as much of the computational environment as possible.

Test-Driven Document Development (TDDD) builds on computational documents by adding automated tests that fail when results change (materially). If these tests are run as part of the build process for the document, the possibility of co-rusting is reduced or eliminated. TDDD can be viewed as the application of test-driven data analysis (TDDA) to the process of document creation, essentially considering the generation of a document as an analytical process that should be supported by reference tests.

The tests can be created by hand, but the Gentest functionality of the tdda tool turns out to be powerful for implementing the tests needed by TDDD, whatever language is used to generate the results.

Background: Computational Documents

Computational documents include one or more results generated by computer code, and provide some guarantee that each result matches its generating code. This is usually achieved by including the code in the document and generating the output either as part of document production (compilation, e.g., Quarto, or in a more limited way, cog) or on-the-fly, for computational notebooks (interpretation, like Jupyter Notebooks / JupyterLab and marimo).

Here is a simple Quarto computational document that calculates the number of potential UK postcodes as defined by a regular expression describing valid ones.¹ This number is quoted in a book I am writing on TDDA. Prior to today, it was pasted into the book by copying the output from an interactive Python session where I calculated it. I probably inserted the thousand separators by hand (another error-prone process). Today I not only changed the number to be included from a calculation when the book is compiled, but also added reference tests to detect if it changes. (source)

---
title: "Quarto Postcodes (inline)"
format:
  html:
    code-fold: true
  pdf:
    toc: false
jupyter: python3
---

```{python}
from letters import nL

RE = r'^[A-Z]{1,2}[0-9]{1,2}[A-Z]? [0-9][A-Z]{2}$'

def n_poss_postcodes_for_re():
    """
    Number of strings matching:
      ^[A-Z]{1,2}[0-9]{1,2}[A-Z]? [0-9][A-Z]{2}$
    """
    n_postal_areas = nL + nL * nL  # 1 or two letters
    n_postal_districts = 10 + 100  # Any one or two digit number
                                   # 0 and 0x aren't used, but match the regex
    n_subdistricts = nL + 1        # Not all letters are used,
                                   # and only for some London codes,
                                   # but for our regex...
                                   # The +1 is for ones not using a subdistrict

    n_outcodes = n_postal_areas * n_postal_districts * n_subdistricts
    n_incodes = 10 * nL * nL       # Digit then two letters
    n_postcodes = n_outcodes * n_incodes

    return n_postcodes


if __name__ == '__main__':
    n = n_poss_postcodes_for_re()
```
The number of postcode-like strings matching

    `{python} RE`

is `{python} f'{n:,}'`

This document is written in a dialect of Markdown defined by Quarto. It has a header at the top, containing metadata, then a fenced Python block containing the code (which defines two variables used later in the document), and some text that uses those two variables (RE and n) to say how many postcodes match. It has a confected dependency on another Python file, letters.py, defining the number of letters, nL, in English:

nL = 26

It can be compiled with:

    quarto render postcodes1.qmd

producing this page and this document. This is a rather simple computational document, which shows the code and one important output number that is “guaranteed” to be generated from the code shown. It would be usual to include graphs or tables of some sort, but this is a minimal example so I really wanted only a single number.

The version of the code actually used to generate the number in the book does not import nL from letters.py, but includes the line nL = 26 in the main program. That's because I'm not trying to make it fail in the book. I've written it this way for the post to give me an easy way to demonstrate co-rusting, which is an entirely real phenomenon. A change in a dependency is a common reason for rusting. (If you do not believe in code rusting or co-rusting, try reading Why Code Rusts; if that doesn't convince you, this article may not be for you.)

Writing Tests For the Code

We will begin by writing tests for essentially the same code, just written as a standalone Python program rather than embedded in a Quarto document.

Here is the same code as an actual Python script, postcodes.py, together with some slightly different behaviour after calling the postcode-counting function.

import json
from letters import nL
from tdda.utils import dict_to_tex_macros

RE = r'^[A-Z]{1,2}[0-9]{1,2}[A-Z]? [0-9][A-Z]{2}$'

def n_poss_postcodes_for_re():
    """
    Number of strings matching:
      ^[A-Z]{1,2}[0-9]{1,2}[A-Z]? [0-9][A-Z]{2}$
    """
    n_postal_areas = nL + nL * nL  # 1 or two letters
    n_postal_districts = 10 + 100  # Any one or two digit number
                                   # 0 and 0x aren't used, but match the regex
    n_subdistricts = nL + 1        # Not all letters are used,
                                   # and only for some London codes,
                                   # but for our regex...
                                   # The +1 is for ones not using a subdistrict

    n_outcodes = n_postal_areas * n_postal_districts * n_subdistricts
    n_incodes = 10 * nL * nL       # Digit then two letters
    n_postcodes = n_outcodes * n_incodes

    return n_postcodes


if __name__ == '__main__':
    n = n_poss_postcodes_for_re()
    d = {'n': n, 'n_str': f'{n:,}', 'postcodeRE': RE}
    json_path = 'postcodes.json'
    with open(json_path, 'w') as f:
        json.dump(d, f, indent=4)
    dict_to_tex_macros(d, 'postcodes-defs.tex', verbose=False)

If we run this code, it produces no output but writes two files. The first is a JSON file, postcodes.json:

{
    "n": 14094194400,
    "n_str": "14,094,194,400",
    "postcodeRE": "^[A-Z]{1,2}[0-9]{1,2}[A-Z]? [0-9][A-Z]{2}$"
}

We have chosen to write into this the values we might want in the document (in this case, the number both as a number and as a formatted string, as well as the relevant regular expression).

There's a second file, postcodes-defs.tex, which we will use later when we use LaTeX as a TDDD engine. This contains the same values, but now as TeX macros:

\def\n{14094194400}
\def\nStr{14,094,194,400}
\def\postcodeRE{\^[A-Z]\{1,2\}[0-9]\{1,2\}[A-Z]? [0-9][A-Z]\{2\}\$}
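The mapping from dictionary items to macros can be sketched in a few lines. This is not the implementation of tdda's dict_to_tex_macros: the function names below (tex_macro_name, tex_escape, to_tex_macros) are invented for illustration, and the rules are inferred only from the output shown above, namely that snake_case keys become camelCase macro names and that the characters {, }, ^ and $ are backslash-escaped. The value '1,234' is a placeholder.

```python
def tex_macro_name(key):
    """Turn a snake_case key into a camelCase macro name: n_str -> nStr."""
    first, *rest = key.split('_')
    return first + ''.join(part[:1].upper() + part[1:] for part in rest)


def tex_escape(value):
    """Backslash-escape the characters that appear escaped in the output above."""
    return ''.join('\\' + c if c in '{}^$' else c for c in str(value))


def to_tex_macros(d):
    """Return one \\def line per dictionary item (illustrative stand-in only)."""
    return '\n'.join(
        '\\def\\%s{%s}' % (tex_macro_name(key), tex_escape(value))
        for key, value in d.items()
    )


RE = r'^[A-Z]{1,2}[0-9]{1,2}[A-Z]? [0-9][A-Z]{2}$'
print(to_tex_macros({'n_str': '1,234', 'postcodeRE': RE}))
# \def\nStr{1,234}
# \def\postcodeRE{\^[A-Z]\{1,2\}[0-9]\{1,2\}[A-Z]? [0-9][A-Z]\{2\}\$}
```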

If you have the tdda library installed, you have as part of it a tool called Gentest, which can write tests in Python for essentially any command-line program, script, or command, in any language.

The line below instructs Gentest to generate tests for running the Python program postcodes.py.

$ tdda gentest 'python postcodes.py'

This produces the following output:


Running command 'python postcodes.py' to generate output (run 1 of 2).
Saved (empty) output to stdout to /Users/njr/blogs/tdda-code/tddd-postcodes/ref/python_postcodes_py/STDOUT.
Saved (empty) output to stderr to /Users/njr/blogs/tdda-code/tddd-postcodes/ref/python_postcodes_py/STDERR.
Copied $(pwd)/postcodes-defs.tex to $(pwd)/ref/python_postcodes_py/postcodes-defs.tex
Copied $(pwd)/postcodes.json to $(pwd)/ref/python_postcodes_py/postcodes.json

Running command 'python postcodes.py' to generate output (run 2 of 2).
Saved (empty) output to stdout to /Users/njr/blogs/tdda-code/tddd-postcodes/ref/python_postcodes_py/2/STDOUT.
Saved (empty) output to stderr to /Users/njr/blogs/tdda-code/tddd-postcodes/ref/python_postcodes_py/2/STDERR.
Copied $(pwd)/postcodes-defs.tex to $(pwd)/ref/python_postcodes_py/2/postcodes-defs.tex
Copied $(pwd)/postcodes.json to $(pwd)/ref/python_postcodes_py/2/postcodes.json

Test script written as /Users/njr/blogs/tdda-code/tddd-postcodes/test_python_postcodes_py.py
Command execution took: 0.44s

SUMMARY:

Directory to run in:        /Users/njr/blogs/tdda-code/tddd-postcodes
Shell command:              python postcodes.py
Test script generated:      /Users/njr/blogs/tdda-code/tddd-postcodes/test_python_postcodes_py.py
Reference files:
    $(pwd)/postcodes-defs.tex
    $(pwd)/postcodes.json
Check stdout:               yes (was empty)
Check stderr:               yes (was empty)
Expected exit code:         0
Clobbering permitted:       yes
Number of times script ran: 2
Number of tests written:    6

If you run tdda gentest without specifying a command, you get a wizard, which asks what command to run and also gives you various other options that can alternatively be passed on the command line.

The output is intended to be self-explanatory, but to elaborate, what Gentest has done is:

Run the command twice;
Recorded what was printed (both on the normal output stream, stdout, and also, separately, on the error output stream, stderr);
Taken copies of any files created (in our case, the .json and .tex files);
Noted the exit code from the program (here 0, indicating successful completion);
Looked to see whether there were any differences between the two runs, and whether anything in the output looked highly dependent on the environment or context. Here nothing did, but if it had, Gentest would have generated tests that attempted to factor out things that look as if they might vary from run to run. (Examples include timestamps, run durations, hostnames etc.);
Written a test script, test_python_postcodes_py.py. When run, this executes the command under test and compares its behaviour and outputs to those it collected when generating the tests. The tests only pass if the behaviour and outputs are identical, other than anything Gentest decided was not fixed. In this case, there was nothing that Gentest classed as not fixed.
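For comparison, the core of these steps (run the command, keep reference copies of the files it writes, and later check that a fresh run still matches) can be sketched by hand using only the standard library. This is a minimal illustration, not the code Gentest generates: the function names, the toy script toy.py and its output file out.txt are all made up for the example, and it does none of Gentest's detection of run-to-run variation.

```python
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path


def run_command(command, cwd):
    """Run the command in cwd, returning (exit code, stdout, stderr)."""
    result = subprocess.run(command, cwd=cwd, capture_output=True, text=True)
    return result.returncode, result.stdout, result.stderr


def save_reference(cwd, refdir, filenames):
    """Copy the files the command wrote into the reference directory."""
    Path(refdir).mkdir(parents=True, exist_ok=True)
    for name in filenames:
        shutil.copy(Path(cwd) / name, Path(refdir) / name)


def changed_files(cwd, refdir, filenames):
    """Return the names of files whose contents differ from the reference."""
    return [
        name for name in filenames
        if (Path(cwd) / name).read_bytes() != (Path(refdir) / name).read_bytes()
    ]


def demo():
    """Record reference output from a toy script, then detect a change."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / 'toy.py'    # hypothetical script under test
        refdir = Path(tmp) / 'ref'
        command = [sys.executable, 'toy.py']

        script.write_text(
            "import pathlib; pathlib.Path('out.txt').write_text('26\\n')\n")
        code, out, err = run_command(command, tmp)
        save_reference(tmp, refdir, ['out.txt'])

        # Unchanged script: a fresh run matches the reference.
        run_command(command, tmp)
        before = changed_files(tmp, refdir, ['out.txt'])

        # "Rusted" script: the output file no longer matches.
        script.write_text(
            "import pathlib; pathlib.Path('out.txt').write_text('52\\n')\n")
        run_command(command, tmp)
        after = changed_files(tmp, refdir, ['out.txt'])
        return code, before, after


print(demo())
# (0, [], ['out.txt'])
```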

The generated code is in test_python_postcodes_py.py.

If we run this test script, thus:

$ python test_python_postcodes_py.py

we get

......
----------------------------------------------------------------------
Ran 6 tests in 0.439s

OK

which shows that our tests have passed, meaning that the output is unchanged. I'm not going to go through the tests, but by all means look at them.

Simulated Co-Rusting

Let's look at what happens if our code's behaviour changes as a result of rusting. We will simulate this by replacing letters.py with letters52.py, which records the number of upper- and lower-case letters in English.²

    cp letters52.py letters.py

If we do this and run the tests again, we get two test failures and some suggested diff commands to run to understand them:

..2 lines are different, starting at line 1
Compare with:
    diff /Users/njr/blogs/tdda-code/tddd-postcodes/postcodes-defs.tex /Users/njr/blogs/tdda-code/tddd-postcodes/ref/python_postcodes_py/postcodes-defs.tex

F2 lines are different, starting at line 2
Compare with:
    diff /Users/njr/blogs/tdda-code/tddd-postcodes/postcodes.json /Users/njr/blogs/tdda-code/tddd-postcodes/ref/python_postcodes_py/postcodes.json

F..
======================================================================
FAIL: test_postcodes_defs_tex (__main__.Test_PYTHON_POSTCODES.test_postcodes_defs_tex)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/njr/blogs/tdda-code/tddd-postcodes/test_python_postcodes_py.py", line 52, in test_postcodes_defs_tex
    self.assertTextFileCorrect(os.path.join(self.cwd, 'postcodes-defs.tex'),
    ~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                               os.path.join(self.refdir, 'postcodes-defs.tex'),
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                               encoding='ascii')
                               ^^^^^^^^^^^^^^^^^
AssertionError: False is not true : 2 lines are different, starting at line 1
Compare with:
    diff /Users/njr/blogs/tdda-code/tddd-postcodes/postcodes-defs.tex /Users/njr/blogs/tdda-code/tddd-postcodes/ref/python_postcodes_py/postcodes-defs.tex


======================================================================
FAIL: test_postcodes_json (__main__.Test_PYTHON_POSTCODES.test_postcodes_json)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/njr/blogs/tdda-code/tddd-postcodes/test_python_postcodes_py.py", line 57, in test_postcodes_json
    self.assertTextFileCorrect(os.path.join(self.cwd, 'postcodes.json'),
    ~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                               os.path.join(self.refdir, 'postcodes.json'),
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                               encoding='ascii')
                               ^^^^^^^^^^^^^^^^^
AssertionError: False is not true : 2 lines are different, starting at line 2
Compare with:
    diff /Users/njr/blogs/tdda-code/tddd-postcodes/postcodes.json /Users/njr/blogs/tdda-code/tddd-postcodes/ref/python_postcodes_py/postcodes.json


----------------------------------------------------------------------
Ran 6 tests in 0.443s

FAILED (failures=2)

and if we run the second suggested diff command (on the JSON files), we see:

2,3c2,3
<     "n": 434464659200,
<     "n_str": "434,464,659,200",
---
>     "n": 14094194400,
>     "n_str": "14,094,194,400",

This shows us that, with the changed dependency, the code is now producing well over 400 billion potential postcodes, rather than the 14 billion we expected. (The lack of a newline at the end of stdout is not significant, and is ignored by the test.) So, as we hoped, the test detected the rusting of our code, and the co-rusting of its output.

The first suggested diff command (on the TeX files) shows exactly the same differences in the TeX macros written:

1,2c1,2
< \def\n{434464659200}
< \def\nStr{434,464,659,200}
---
> \def\n{14094194400}
> \def\nStr{14,094,194,400}

If we run the Quarto file postcodes1.qmd with the change, there is no obvious problem: the code and the result continue to match, but are now different from what I intended and originally validated. Here are the HTML and PDF.
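For reference, both figures in the diffs above can be reproduced directly from the counting logic. The sketch below restates the function from postcodes.py with the alphabet size as a parameter; the name n_poss_postcodes and the parameter n_letters are introduced here for illustration (the original reads nL from letters.py).

```python
def n_poss_postcodes(n_letters):
    """Count strings matching the postcode regex for a given alphabet size."""
    n_postal_areas = n_letters + n_letters ** 2   # one or two letters
    n_postal_districts = 10 + 100                 # one or two digits
    n_subdistricts = n_letters + 1                # optional trailing letter
    n_incodes = 10 * n_letters ** 2               # digit then two letters
    return n_postal_areas * n_postal_districts * n_subdistricts * n_incodes


for n_letters in (26, 52):
    print(f'nL = {n_letters}: {n_poss_postcodes(n_letters):,}')
# nL = 26: 14,094,194,400
# nL = 52: 434,464,659,200
```

Doubling the alphabet multiplies the count by roughly 31, because the number of letters enters the product several times over.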

A TDDD Version of the Quarto Doc

We can make the Quarto document more robust (and have the benefit of keeping the code in a script, rather than forcing it into the document) by using this Quarto file, postcodes2.qmd.

---
title: "Quarto Postcodes (with inclusion)"
format:
  html:
    code-fold: true
  pdf:
    toc: false
jupyter: python3
---

{{< include _postcodes.py.qmd >}}
```{python}
with open('ref/python_postcodes_py/postcodes.json') as f:
    ref = json.load(f)
assert d == ref
```
The number of postcode-like strings matching

    `{python} ref['postcodeRE']`

is `{python} ref['n_str']`

The include line at the top pulls in the file _postcodes.py.qmd. This file is just our script, in Quarto Markdown fences, with a filename beginning with an underscore, which Quarto expects for included files so that they are not rendered as standalone documents. We construct the file automatically as part of the build process (in the Makefile).
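The Makefile rule is not shown here, but the construction is simple enough to sketch in Python. The helper names script_as_qmd and write_include are hypothetical, and the sketch assumes the include file is nothing more than the script wrapped in a Quarto python fence.

```python
from pathlib import Path

FENCE = '`' * 3  # three backticks, built this way so the example stays readable


def script_as_qmd(source):
    """Wrap Python source in a Quarto executable code fence."""
    return f'{FENCE}{{python}}\n{source.rstrip()}\n{FENCE}\n'


def write_include(script_path, include_path):
    """Write include_path as the fenced version of script_path."""
    source = Path(script_path).read_text()
    Path(include_path).write_text(script_as_qmd(source))


# Usage, with the file names from this post:
#     write_include('postcodes.py', '_postcodes.py.qmd')
print(script_as_qmd('nL = 26'))
```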

After the inclusion, we read the JSON file that Gentest saved in its reference directory into Python as a dictionary called ref, and then check that this reference dictionary is equal to the one we generated when we ran the code as part of the Quarto rendering process. The Makefile runs the tests (outside Quarto) immediately before rendering so, if the assertion passes, we actually know two things:

The tests passed when we ran them outside Quarto (showing that the code produces the results we previously validated as OK), and
When we ran the same code inside Quarto, its results (or at least, the results in the dictionary) were also the same as the reference results in the test.
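The ordering that matters here, tests first and rendering only if they pass, can be expressed as a small Python stand-in for the Makefile (which is not shown in this post). The function name build is hypothetical; the real commands would be the test script and quarto render, as noted in the comments, while the commands actually run below merely simulate a passing and a failing step.

```python
import subprocess
import sys


def build(test_command, render_command):
    """Run the tests; render the document only if they all pass."""
    tests = subprocess.run(test_command)
    if tests.returncode != 0:
        print('Tests failed: not rendering.')
        return False
    rendered = subprocess.run(render_command)
    return rendered.returncode == 0


# With the files from this post, the two commands would be
#     ['python', 'test_python_postcodes_py.py']  and
#     ['quarto', 'render', 'postcodes2.qmd'].
# The stand-in commands below only simulate success and failure.
ok = [sys.executable, '-c', 'raise SystemExit(0)']
bad = [sys.executable, '-c', 'raise SystemExit(1)']
print(build(bad, ok), build(ok, ok))
```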

The rest of the Quarto document is the same as the first version except that we use the results from the dictionary (since those are validated) and choose to use the preformatted string ref['n_str'] rather than formatting it inline. (This makes no difference.)

In this case, and many others, it makes no difference whether we use ref (the results read from the reference JSON file) or d as the source of our values, because the assertion checked that they were identical. The reason I've used ref is that, in some other cases, we allow non-material differences between the actual and reference results, typically things like datestamps indicating run-time, machine names etc. (If those are different, we need to use a slightly different assertion.) By using the reference results, we ensure that the document does not change each time we compile it if there are no material differences.
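One possible shape for such an assertion is to compare the two dictionaries while skipping keys known to vary from run to run. This is only a sketch: the helper material_differences, the key run_at and its timestamp values are all invented for illustration and do not appear in the postcode example.

```python
def material_differences(actual, ref, ignore=()):
    """Return the keys whose values differ, skipping those listed in ignore."""
    keys = (set(actual) | set(ref)) - set(ignore)
    missing = object()  # sentinel, so a key absent from one side counts as a difference
    return sorted(
        k for k in keys if actual.get(k, missing) != ref.get(k, missing)
    )


# Hypothetical results that differ only in a run timestamp.
ref = {'n': 14094194400, 'run_at': '2025-09-02T10:00:00'}
d = {'n': 14094194400, 'run_at': '2025-09-03T09:30:00'}

assert d != ref                                                # strict equality fails
assert material_differences(d, ref, ignore=['run_at']) == []   # but nothing material changed
```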

Discussion

Look at the JSON and TeX macros
Change the letters to be 52
Show the test failing
Show how to use the script code in Quarto
Do the LaTeX version.

All current valid postcodes match this expression, but many strings that match it do not exist and some would probably not be considered valid. ↩
By way of full disclosure, when I actually replaced letters.py with letters52.py and ran the tests they passed, to my dismay. This happened not because of a problem with the tests, but because I created letters52.py and letters26.py by copying letters.py and failed to update the contents of letters52.py. If you were to look back in the Git history for the repo, you'd see that. I mention this simply as a further demonstration that all humans are prone to error, which is some of the reason TDDD and TDDA are helpful! Of course, some humans are less error-prone than others! ↩