Title: Reference Testing with TDDA: A Guide for LLMs and Coding Bots
Date: 2026-05-20T12:00:00
Tags: reference testing, LLMs, coding bots, gentest, pytest, unittest
Category: TDDA
Status: published


_[Raw markdown](https://tdda.info/md/076-reference-testing-with-tdda-a-guide-for-llms-and-coding-bots.md) for direct use in LLM context windows._

This post is addressed primarily to LLMs and coding agents—it was
distilled by Claude from [_Test-Driven Data Analysis_](https://www.routledge.com/Test-Driven-Data-Analysis/Radcliffe/p/book/9781032897158)
(Radcliffe, CRC Press, 2026; [extra resources](https://book.tdda.info)),
the [tdda source code](https://github.com/tdda/tdda),
the [documentation](https://tdda.readthedocs.io), and the
[man pages](https://tdda.readthedocs.io/en/latest/cli.html)
(also available as [markdown](https://github.com/tdda/tdda/blob/master/doc/source/cli.md)).
If you are a human who wants a dense, no-nonsense introduction to
reference testing with TDDA, you are welcome too. It was written in
response to repeated direct observations of coding agents—including
Claude—using `-W` to make failing tests pass without verifying that
the new output was correct.

### What Is a Reference Test?

A reference test is a semantic regression test. 'Semantic' here
means the comparison understands equivalence: outputs that differ
in timestamps, row order, version strings, hostnames, file paths,
or other legitimately variable content can still pass — only
meaningful changes cause failures. You run your
analytical process or function, capture its outputs, inspect them
and decide they are correct, then store them as _reference outputs_.
From that point on, the test reruns the process and checks that the
outputs still match the reference. If something changes—a library
update, a refactoring, a subtle bug—the test catches it.

The `tdda.referencetest` library makes this easy for Python code.
It extends `unittest` (and works with `pytest`) and provides
assertion methods for:

 - strings and text files (`assertStringCorrect`, `assertTextFileCorrect`)
 - data frames (`assertDataFrameCorrect`, `assertDataFramesEquivalent`)
 - binary files (`assertBinaryFileCorrect`)

It also provides `tdda gentest`, a command-line tool that watches
your process run and generates a test suite automatically.

The key insight: reference testing is particularly well suited to
analytical code because the outputs are often large, complex, and
hard to specify in advance. You develop the process, carefully inspect
the output and verify _once_ that it is right, store it, and thereafter
the test does the checking.


### The `-W` (`--write-all`) Trap

**Do not blindly run `-W` (or `--write-all`) merely to make failing tests pass.**

This is the most important thing in this document.

When a reference test fails, `tdda.referencetest` tells you what
differs between the actual output and the stored reference. It also
offers you a way to update the reference to match the new output:
pass `-W` (or `--write-all` under `pytest`) when running the tests.

Running `-W` overwrites the reference files with whatever the code
currently produces. After this, the tests pass. **This tells you
nothing about whether the output is correct.** The tests will pass
even if the output is completely wrong, because you just told them
that the wrong output is the new reference.

The correct workflow when tests fail is:

1. Read the failure message. It tells you what changed.
2. Run the `diff` command suggested in the failure output.
3. Look at the actual differences. Are they expected? Are they correct?
4. **Only if you have verified the new output is correct:** update
   the references. With unittest, use `-1W` (tagged tests only,
   recommended) or `-W` (all); with pytest, use
   `--tagged --write-all -s` or `--write-all -s`. See
   [Running a Subset of Tests with Tags](#running-a-subset-of-tests-with-tags)
   for the full commands. If you have a tame human to hand, this is a
   good moment to involve them—humans are often better at judging
   whether output is actually correct, and can get quite sweary when
   you overwrite correct reference results with nonsense.
5. Run the tests again (without `-W`) to confirm they pass.
6. Check the updated reference files into version control.

If you skip step 3 and go straight to `-W`, you have not tested
anything. You have merely synchronized the reference to whatever
the code happens to produce right now.
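
Where the suggested `diff` command is awkward to run (for example, from inside a Python session), the inspection in step 3 can be approximated with the standard library. The following is a minimal sketch using `difflib`; the function name `reference_diff` and the sample strings are illustrative assumptions, not part of `tdda`.

```python
import difflib


def reference_diff(actual: str, expected: str) -> list[str]:
    """Return unified-diff lines going from the reference to the actual output."""
    return list(difflib.unified_diff(
        expected.splitlines(),
        actual.splitlines(),
        fromfile='expected (reference)',
        tofile='actual (new output)',
        lineterm='',
    ))


# Illustrative data only: one line changed between reference and actual.
expected = 'Total: 42 records\nStatus: OK\n'
actual = 'Total: 41 records\nStatus: OK\n'
changes = reference_diff(actual, expected)
for line in changes:
    print(line)
```

An empty list means the two texts are identical. Anything else is a change that still has to be judged correct or incorrect before any reference is rewritten.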

#### Safe use of `-W`: the git audit pattern

If the reference files are clean in git before you run `-W`, you can use
`git diff` afterwards to inspect exactly what changed, and
`git checkout -- path/to/testdata/` to revert all reference changes
at once if anything looks wrong. This makes `-W` a controlled and
auditable operation—but only if the working tree was clean before you
ran it. Always check before running `-W`, not after.
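
The "check before, not after" rule can be automated. This is a small sketch, assuming `git` is on the `PATH` and the reference files live under a single directory; the helper names `is_clean` and `references_are_clean` are hypothetical and not part of `tdda`.

```python
import subprocess


def is_clean(porcelain_output: str) -> bool:
    """True if `git status --porcelain` reported no changes."""
    return porcelain_output.strip() == ''


def references_are_clean(ref_dir: str) -> bool:
    """Ask git whether anything under ref_dir is modified or untracked."""
    result = subprocess.run(
        ['git', 'status', '--porcelain', '--', ref_dir],
        capture_output=True, text=True, check=True,
    )
    return is_clean(result.stdout)
```

Call `references_are_clean('path/to/testdata/')` before any rewrite and stop if it returns `False`. Otherwise a later `git diff` would mix earlier edits with the changes made by `-W`.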


### Unit-Enhanced Reference Tests

The test code and command-line flags differ between reference tests
built on `unittest` and those built on `pytest`. The sections below
cover the `unittest` variants first; see
[Writing Reference Tests with pytest](#writing-reference-tests-with-pytest)
for the pytest equivalents. Where flags differ, both are given.

| Task | unittest | pytest |
|----|----|-----|
| Run all tests | `python tests.py -F` | `pytest tests/ --log-failures` |
| Run only tagged tests | `python tests.py -F -1` | `pytest tests/ --log-failures --tagged` |
| Rewrite all references | `-W` | `--write-all -s` |
| Rewrite tagged only | `-1W` | `--tagged --write-all -s` |

Full syntax and explanations for each are in the sections below; this
table is a quick reference.

A partial structural defence against careless `-W` use is
_unit-enhanced reference tests_: after the reference assertion, add
one or more specific assertions about things that must be true
regardless (shown here in `unittest` style):

```python
def test_output(self):
    result = run_my_process(input_data)
    self.assertStringCorrect(result, 'expected.txt')
    # These survive a careless -W rewrite:
    self.assertIn('Total: 42 records', result)
    self.assertTrue(result.strip().endswith('OK'))
```

The reference assertion runs first. If it fails, `tdda` writes the
actual output and suggests a diff command—the normal workflow. If
you then carelessly rewrite with `-W`, the subsequent assertions will
still fail if the output is wrong in ways they cover.

This is not a complete defence—you have to choose the assertions
carefully—but it makes it much harder to accidentally accept a broken
result. Choose assertions that reflect the core correctness property
the test was designed to verify.

This pattern emerged from the author's direct experience of coding
agents (including Claude) repeatedly using `-W` to make tests pass
without verifying the results. It is recommended for any test where
the reference output has semantic structure that can be spot-checked.
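
The effect can be simulated without `tdda` at all. The sketch below uses plain equality as a stand-in for the reference assertion (an assumption made to keep it self-contained) and shows that, after a careless rewrite, the reference-only check passes while the unit-enhanced check still fails. All function names are hypothetical.

```python
def reference_check(actual: str, reference: str) -> None:
    """Stand-in for a reference assertion: plain equality with the stored text."""
    if actual != reference:
        raise AssertionError('output differs from reference')


def unit_enhanced_check(actual: str, reference: str) -> None:
    """Reference check followed by spot checks that a rewrite cannot change."""
    reference_check(actual, reference)
    if 'Total: 42 records' not in actual:
        raise AssertionError('record count is wrong')
    if not actual.strip().endswith('OK'):
        raise AssertionError('output does not end with OK')


def passes(check, actual: str, reference: str) -> bool:
    """Run a check and report whether it passed."""
    try:
        check(actual, reference)
    except AssertionError:
        return False
    return True


# Simulate a careless rewrite: the reference now holds the wrong output.
wrong_output = 'Total: 0 records\nFAILED\n'
rewritten_reference = wrong_output

print(passes(reference_check, wrong_output, rewritten_reference))
print(passes(unit_enhanced_check, wrong_output, rewritten_reference))
```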


### The `-F` (`--log-failures`) Flag

Always pass the log-failures flag when running tests. It logs the IDs
of any failing tests to a timestamped file
(`YYYY-MM-DDTHHMMSS-failing-tests.txt`) in your system temp directory
(overridable with `$TDDA_FAIL_DIR`). This enables the `tdda tag`
workflow: `tdda tag` reads the most recent such file and adds `@tag`
to the failing tests, so you can re-run and regenerate references for
just those tests.

Without the flag, no failures file is written and `tdda tag` has
nothing to work with.
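
To see which failures file `tdda tag` would pick up, the lookup described above can be sketched in a few lines. This assumes only what the text states: the filename pattern, the system temp directory, and the `$TDDA_FAIL_DIR` override. The helper name `latest_failures_file` is hypothetical, and the file's internal format is not described here, so the sketch does not parse it.

```python
import glob
import os
import tempfile


def latest_failures_file(directory=None):
    """Return the newest *-failing-tests.txt path, or None if there is none."""
    directory = (directory
                 or os.environ.get('TDDA_FAIL_DIR')
                 or tempfile.gettempdir())
    candidates = glob.glob(os.path.join(directory, '*-failing-tests.txt'))
    # Names start with YYYY-MM-DDTHHMMSS, so the lexically largest
    # basename is also the most recent.
    return max(candidates, key=os.path.basename, default=None)


# Demonstration in a throwaway directory with two made-up timestamps.
with tempfile.TemporaryDirectory() as demo_dir:
    for stamp in ('2026-05-19T090000', '2026-05-20T143201'):
        path = os.path.join(demo_dir, f'{stamp}-failing-tests.txt')
        open(path, 'w').close()
    newest = os.path.basename(latest_failures_file(demo_dir))

print(newest)
```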

#### With unittest (running directly with Python)

Pass `-F` (or `--log-failures`):

```
python tests/test_mycode.py -F
```

#### With pytest

Pass `--log-failures`:

```
pytest tests/ --log-failures
```

#### Permanent default

To avoid passing the flag every time, add this to `~/.tdda.toml`:

```toml
[referencetest]
log_failures = true
```

This modifies the user's global configuration. Consult your human
before doing it.


### The Kicker: the `-W` Problem is Not Restricted to TDDA

The anti-pattern described above — rewriting expected outputs to make
tests pass without verifying the new output is correct — applies far
beyond `tdda.referencetest`. LLM coding agents routinely treat passing
tests as the goal rather than as evidence of correctness. Whether
rewriting a reference file, updating a pytest snapshot, regenerating
Jest snapshots, or changing a hardcoded expected value in an assertion,
the same question applies first: is the new result actually correct?

**Green tests after any kind of expected-value rewrite tell you nothing
about correctness. They tell you only that the code now matches
whatever you told it to match.**

The correct workflow is the same regardless of framework:

1. A test fails. Read the failure. What changed?
2. Is the change correct, or is it a bug?
3. Only if correct: update the expected value.
4. If you're not sure: ask your human.

The specific value of `tdda.referencetest` is that it makes step 1
easy — the diff tooling is built in, and `-F`/`tdda tag`/`-1W` limit
the blast radius. But the discipline is universal.


### Running a Subset of Tests with Tags

To run only some tests, use the `@tag` decorator:

```python
from tdda.referencetest import ReferenceTestCase, tag

class TestMyProcess(ReferenceTestCase):

    @tag
    def test_main_output(self):
        result = run_my_process()
        self.assertStringCorrect(result, 'expected_output.txt')

    def test_other_thing(self):
        ...
```

`@tag` can decorate individual test methods or entire test classes.
The flags to run only tagged tests differ between unittest and pytest.

#### With unittest (running directly with Python)

```
python tests/test_mycode.py -F -1        # run only tagged tests
python tests/test_mycode.py -F -1W       # regenerate references for tagged tests only
```

`-1W` combines `-1` and `-W` (`--write-all`). This is the safe way to
regenerate, because it limits the blast radius to tests you have
explicitly chosen and tagged.

#### With pytest

```
pytest tests/ --log-failures --tagged               # run only tagged tests
pytest tests/ --log-failures --tagged --write-all -s  # regenerate references for tagged tests only
```

Pass `-s` to prevent pytest from capturing output, so that `tdda`
can report which reference files were written.

#### The full workflow with `tdda tag`

The `-F` → `tdda tag` → `-1W` workflow lets you rewrite only the
references that actually failed, without manually deciding which tests
to tag:

1. Run tests with `-F` (or `--log-failures`) to record failing test IDs
2. Run `tdda tag` to add `@tag` to those tests automatically
3. Inspect the diffs to verify the new output is correct
4. Run `-1W` (or `--tagged --write-all -s`) to rewrite only those references
5. Run `make untag` (or the sed command below) to remove the tags

This is always preferable to bare `-W`, which rewrites every reference
file regardless of whether the test failed.
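
For an agent that assembles these commands programmatically, the flag combinations above can be captured in one function. This is a sketch, not part of `tdda`: the function name and its `verified` guard are hypothetical, and it uses only the flags already listed in this post.

```python
def tdda_test_command(runner, path, tagged=False, write=False, verified=False):
    """Build the argv for a test run, refusing unverified rewrites."""
    if write and not verified:
        raise ValueError('refusing to rewrite references: output not verified')
    if runner == 'unittest':
        argv = ['python', path, '-F']
        # Combine short flags the way the post does: -1, -W or -1W.
        short = ('1' if tagged else '') + ('W' if write else '')
        if short:
            argv.append('-' + short)
        return argv
    if runner == 'pytest':
        argv = ['pytest', path, '--log-failures']
        if tagged:
            argv.append('--tagged')
        if write:
            argv += ['--write-all', '-s']
        return argv
    raise ValueError(f'unknown runner: {runner!r}')


print(tdda_test_command('unittest', 'tests/test_mycode.py', tagged=True))
print(tdda_test_command('pytest', 'tests/', tagged=True, write=True,
                        verified=True))
```

The `verified` argument does nothing clever. It simply forces the caller to state explicitly that step 3 has been done before a rewrite command can be built.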

#### Removing stale tags

Before adding new tags, remove any stale `@tag` decorators from
previous sessions. There is usually a `make untag` target that does
this, or you can use:

```
# macOS (BSD sed):
sed -i '' '/^[[:space:]]*@tag[[:space:]]*$/d' tests/test_mycode.py

# Linux (GNU sed):
sed -i '/^[[:space:]]*@tag[[:space:]]*$/d' tests/test_mycode.py
```


### Writing Reference Tests with `unittest`

A minimal test file:

```python
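import os

# Hypothetical module under test; substitute your own imports.
from mycode import input_data, produce_dataframe, run_my_process
from tdda.referencetest import ReferenceTestCase, tag  # tag marks tests for -1 / --tagged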

TESTDIR = os.path.join(os.path.dirname(os.path.abspath(__file__)),
                       'testdata')

class TestMyProcess(ReferenceTestCase):

    def test_output(self):
        result = run_my_process(input_data)
        self.assertStringCorrect(result,
                                 os.path.join(TESTDIR, 'expected.txt'))

    def test_dataframe(self):
        df = produce_dataframe()
        self.assertDataFrameCorrect(df,
                                    os.path.join(TESTDIR, 'expected.csv'))

if __name__ == '__main__':
    ReferenceTestCase.main()
```

When running under `pytest`, the `if __name__ == '__main__':` block is
simply ignored—the same test file works with both runners unchanged.

Run it:

```
python test_myprocess.py -F           # run all tests
python test_myprocess.py -F -1        # run only tagged tests
python test_myprocess.py -F -1W       # regenerate references for tagged tests
```

The first time you run with `-1W` after writing a new test (tagged
with `@tag`, so that `-1` selects it), it _writes_ the reference file.
Subsequent runs compare against it.

**After writing references with `-1W`, always inspect the files that
were written.** The fact that the test now passes means only that the
reference matches the output. It says nothing about whether either is
correct.


### Writing Reference Tests with `pytest`

The same test classes work under `pytest`, with different flags:

```
pytest tests/ --log-failures                           # run all tests
pytest tests/ --log-failures --tagged                  # run only tagged tests
pytest tests/ --log-failures --tagged --write-all -s   # regenerate references for tagged tests
```

Note:

 - Use `--write-all` instead of `-W`.
 - Use `--tagged` instead of `-1`.
 - Pass `-s` to prevent `pytest` from capturing output, so that `tdda`
   can report which reference files were written.
 - The short flags `-W` and `-1` are `tdda` extensions; they only work
   when running the test file directly with Python, not under `pytest`.


### Assertion API: Text and Strings

**`assertStringCorrect(string, ref_path, ...)`**
Check an in-memory string against a reference text file.

**`assertTextFileCorrect(actual_path, ref_path, ...)`**
Check a text file on disk against a reference text file.

**`assertTextFilesCorrect(actual_paths, ref_paths, ...)`**
Check multiple text files against corresponding reference files.

All three share these optional parameters for handling variable output:

| Parameter | Effect |
|----|----|
| `lstrip=True` | Strip leading whitespace from each line before comparing |
| `rstrip=True` | Strip trailing whitespace from each line before comparing |
| `ignore_substrings=['foo','bar']` | Ignore any line in the _expected_ file containing one of these substrings; the corresponding actual line can be anything |
| `ignore_patterns=[r'pattern']` | Lines differing only in substrings matching these regexes pass; text outside the match must be identical in both |
| `remove_lines=['foo']` | Remove lines containing these substrings from _both_ actual and expected before comparing |
| `preprocess=fn` | Apply `fn(list_of_lines)` to both actual and expected (as lists of strings) before comparing |
| `max_permutation_cases=N` | Pass if lines differ only in order, up to N permutations; `None` = unlimited |

#### `ignore_substrings`—ignore whole lines by substring

Lines in the expected output containing the substring are skipped.
The match is against the expected file only—the actual output can
have anything on those lines (or nothing):

```python
# Reference file contains:
#   Copyright (c) Stochastic Solutions Limited, 2016
#   Version 0.0.0
# Actual output has current year and version—but we don't care:
self.assertStringCorrect(actual, 'expected.html',
    ignore_substrings=['Copyright', 'Version'])
```

#### `ignore_patterns`—ignore variable substrings within a line

Lines pass if they differ only in parts matching the regex.
Everything outside the match must be identical in both files:

```python
# Actual:   "Generated: 2026-05-20T14:32:01 by pipeline v2.3.1"
# Expected: "Generated: 2024-01-15T09:00:00 by pipeline v1.0.0"
# Both lines still match with:
self.assertStringCorrect(actual, 'expected.txt',
    ignore_patterns=[
        r'\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}',
        r'v\d+\.\d+\.\d+',
    ])
```

`ignore_patterns` is stricter than `ignore_substrings`: the non-matching
parts of each line must agree exactly, so you cannot accidentally mask
a real change in the surrounding text.
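
The rule can be made concrete with a small stdlib sketch: mask every regex match in both lines, then require what remains to be identical. This illustrates the stated behaviour only. It is not `tdda`'s implementation, which may differ in detail, and the function name `lines_match_ignoring` is hypothetical.

```python
import re


def lines_match_ignoring(actual: str, expected: str, patterns) -> bool:
    """Compare two lines after masking every regex match in both."""
    for pattern in patterns:
        actual = re.sub(pattern, '<MASKED>', actual)
        expected = re.sub(pattern, '<MASKED>', expected)
    return actual == expected


patterns = [
    r'\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}',
    r'v\d+\.\d+\.\d+',
]
expected_line = 'Generated: 2024-01-15T09:00:00 by pipeline v1.0.0'
same_text = 'Generated: 2026-05-20T14:32:01 by pipeline v2.3.1'
changed_text = 'Generated: 2026-05-20T14:32:01 by workflow v2.3.1'

print(lines_match_ignoring(same_text, expected_line, patterns))
print(lines_match_ignoring(changed_text, expected_line, patterns))
```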

#### `remove_lines`—strip lines from both files

Lines containing the substring are removed from both actual and expected
before comparing. Use this for optional or ephemeral lines that should
not appear in the reference at all:

```python
# Both files have lines like "WARNING: cache miss" that are
# present sometimes and absent other times:
self.assertStringCorrect(actual, 'expected.txt',
    remove_lines=['WARNING: cache miss'])
```

Unlike `ignore_substrings`, which matches against the expected file
only, `remove_lines` strips matching lines from both sides, so such a
line may appear in either file, in both, or in neither.

#### `preprocess`—transform both files before comparing

Takes a function that accepts a list of strings (lines) and returns
a transformed list. Applied to both actual and expected:

```python
def strip_timestamps(lines):
    # remove leading timestamp prefix "2026-05-20 14:32:01 " from each line
    return [line[20:] if len(line) > 20 else line for line in lines]

self.assertStringCorrect(actual, 'expected.txt',
    preprocess=strip_timestamps)
```
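
The slice-based function above removes the first 20 characters of any line longer than 20 characters, whether or not it starts with a timestamp. Where some lines carry no prefix, a regex-anchored variant may be safer. This is a sketch under the same assumed `YYYY-MM-DD HH:MM:SS ` prefix format; the name `strip_timestamp_prefix` is hypothetical.

```python
import re

TIMESTAMP_PREFIX = re.compile(r'^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} ')


def strip_timestamp_prefix(lines):
    """Drop a leading 'YYYY-MM-DD HH:MM:SS ' prefix where one is present."""
    return [TIMESTAMP_PREFIX.sub('', line) for line in lines]


sample_lines = [
    '2026-05-20 14:32:01 loaded 10 rows',
    'no timestamp on this long line here',
]
stripped = strip_timestamp_prefix(sample_lines)
print(stripped)
```

It has the same shape as any `preprocess` function (a list of lines in, a list of lines out), so it can be passed as `preprocess=strip_timestamp_prefix`.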

#### `max_permutation_cases`—allow reordered lines

Pass if the lines are a permutation of each other, up to the given
number of permutations. Use `None` for unlimited:

```python
# Output order is non-deterministic, but the set of lines is fixed:
self.assertStringCorrect(actual, 'expected.txt',
    max_permutation_cases=None)
```
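
What "a permutation of each other" means can be shown with a multiset comparison. This stdlib sketch illustrates the unlimited case only; it does not model the `max_permutation_cases` limit or `tdda`'s actual algorithm, and the function name is hypothetical.

```python
from collections import Counter


def same_lines_any_order(actual: str, expected: str) -> bool:
    """True if both texts contain the same lines the same number of times."""
    return Counter(actual.splitlines()) == Counter(expected.splitlines())


print(same_lines_any_order('b\na\nc\n', 'a\nb\nc\n'))
print(same_lines_any_order('a\na\nb\n', 'a\nb\nb\n'))
```

Using a `Counter` rather than a `set` matters here: a duplicated or missing line still counts as a difference.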


### Assertion API: DataFrames

The DataFrame assertion methods work with Pandas 2.x and 3.x (all three
backends: `numpy_nullable`, `pyarrow`, and `original`) and with Polars.
You can even compare DataFrames across engines—e.g. a Pandas actual
against a Polars reference—with the `engine` parameter if needed.

**`assertDataFramesEquivalent(df, ref_df, ...)`**
Compare two in-memory DataFrames (Pandas or Polars).

**`assertDataFrameCorrect(df, ref_path, ...)`**
Compare an in-memory DataFrame against a reference file (CSV or Parquet).

**`assertStoredDataFrameCorrect(actual_path, ref_path, ...)`**
Compare two DataFrames both stored on disk.

**`assertStoredDataFramesCorrect(actual_paths, ref_paths, ...)`**
Compare multiple pairs of on-disk DataFrames.

#### `check_data` and `check_types`—exclude columns

The most common use is excluding columns whose values are legitimately
variable (random seeds, run IDs, timestamps):

```python
# Exclude the 'random' column from both value and type checks:
columns = self.all_fields_except(['random'])
self.assertDataFrameCorrect(df, 'expected.csv',
                            check_data=columns,
                            check_types=columns)
```

`check_data`, `check_types`, and `check_order` all accept the same forms:

- `None` or `True`: check all fields (default)
- `False`: skip entirely
- a list of field names to check
- a function taking a DataFrame and returning a list of field names
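
The function form in the last bullet might look like the sketch below. The column names are illustrative, and a `SimpleNamespace` stands in for a real DataFrame so that the example runs without Pandas or Polars; only the `.columns` attribute is assumed.

```python
from types import SimpleNamespace


def fields_to_check(df):
    """Return every column name except the run-specific ones."""
    variable = {'random', 'run_id'}
    return [name for name in df.columns if name not in variable]


# Stand-in for a DataFrame: only the .columns attribute is used here.
fake_df = SimpleNamespace(
    columns=['country', 'date', 'revenue', 'random', 'run_id'])
selected = fields_to_check(fake_df)
print(selected)
```

Such a function would then be passed as, for example, `check_data=fields_to_check`.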

#### `sortby`—sort before comparing

Use when row order is non-deterministic:

```python
self.assertDataFrameCorrect(df, 'expected.csv',
                            sortby=['country', 'date'])
```

#### `condition`—filter rows before comparing

Use when only a subset of rows is relevant to the test:

```python
# Only compare rows where status is 'complete':
self.assertDataFrameCorrect(df, 'expected.csv',
    condition=lambda df: df['status'] == 'complete')
```

#### `precision`—floating-point tolerance

Default is 7 decimal places. Loosen it when values come via CSV
(which can lose precision):

```python
self.assertDataFrameCorrect(df, 'expected.csv', precision=5)
```
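
A short illustration of why CSV round-trips call for a looser tolerance. The sketch assumes a writer that keeps six decimal places and a comparison rule of "absolute difference no larger than 10^-precision". Both are assumptions made for illustration; `tdda`'s exact rule is not specified here and may differ.

```python
def agree_to_places(a: float, b: float, places: int) -> bool:
    """Assumed rule: values agree if they differ by at most 10**-places."""
    return abs(a - b) <= 10 ** -places


value = 1 / 3
via_csv = float(f'{value:.6f}')  # simulated CSV round-trip at six decimals

print(value == via_csv)
print(agree_to_places(value, via_csv, 7))
print(agree_to_places(value, via_csv, 5))
```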

#### `type_matching`—dtype strictness

- `'strict'` (default for Parquet): dtypes must be identical
- `'medium'` (default for CSV): same underlying type (int, float, datetime)
  but different bit width or nullability allowed
- `'loose'`: anything Pandas can compare

```python
# CSV round-trips can change bit width or nullability (e.g. int32 to int64); medium allows this:
self.assertDataFrameCorrect(df, 'expected.csv', type_matching='medium')
```

#### `fuzzy_nulls`—treat different null types as equal

```python
# NaN (np.nan) and None treated as equivalent:
self.assertDataFramesEquivalent(df, ref_df, fuzzy_nulls=True)
```

#### `engine`—Pandas or Polars

Inferred automatically from the DataFrames. Only needed when comparing
across types (a Pandas actual against a Polars reference or vice versa):

```python
self.assertDataFramesEquivalent(pandas_df, polars_df, engine='pandas')
```


### `tdda diff`—Understanding DataFrame Failures

When a DataFrame assertion fails, the failure message suggests one or
more `diff` commands. For tabular data, it often suggests both a raw
`diff` and a `tdda diff`:

```
Compare with:
    diff /tmp/actual-expected.csv /path/to/testdata/expected.csv
Compare with:
    tdda diff /tmp/actual-expected.csv /path/to/testdata/expected.csv
```

`tdda diff` uses the same comparison logic as the assertion methods and
produces a structured summary: which columns differ, how many rows, and
a table showing the differing values side by side. It is much easier to
read than raw `diff` for anything beyond a handful of rows. Always prefer
it for DataFrame failures. Example output:

```
Columns with differences: 1 / 12
Rows with differences:    3 / 1000

Values:
  Row   Column    Actual    Expected
   42   revenue   1500.50   1500.00
  108   revenue      0.00       NaN
  731   revenue    999.99   1000.00
```

It accepts field-selection options on the command line, as the assertion methods do through their parameters:

```
tdda diff actual.csv expected.csv --xfields random,run_id
```

### Assertion API: Binary Files

**`assertBinaryFileCorrect(actual_path, ref_path)`**
Check that a binary file is byte-for-byte identical to a reference file.
No options for partial matching—if you need that, extract the relevant
data and use a string or DataFrame assertion instead.


### Generating Tests Automatically with Gentest

If you have a command-line process—a script, a shell command, an R
program—`tdda gentest` can generate a test suite for it:

```
tdda gentest 'python my_analysis.py input.csv' testsuite.py
```

Gentest runs the command multiple times, captures all outputs (stdout,
stderr, exit code, any files written), detects which parts vary between
runs, and writes a test script that checks the stable parts. The
generated script uses `tdda.referencetest` and can be run and maintained
like any other reference test.

Inspect the generated test and the reference outputs before trusting
them. Gentest is good at generating structurally correct tests; you
still need to verify that the reference outputs are actually correct.


### The Reference Test Checklist

☐ **Create at least one reference test** for every analytical process you write.  
☐ **Run tests before making changes**, so you know the baseline.  
☐ **Run tests after making changes**, before assuming they worked.  
☐ **When a test fails**, read the diff before doing anything else.  
☐ **Never run `-W` without first verifying the new output is correct.**  
☐ **Prefer `-1W` (or `--tagged --write-all -s`)** over bare `-W`—rewrite only the references that actually failed.  
☐ **Use `-F` and `tdda tag`** to automatically tag failing tests for targeted reruns and rewrites.  
☐ **After writing references, inspect the files.** Tests passing after `-W` or `--write-all` is not evidence of correctness.  
☐ **Ensure reference files are clean in git** before running `-W`, so you can use `git diff` to review changes and revert with `git checkout -- testdata/` if needed.  
☐ **Consider unit-enhanced reference tests** for anything with checkable semantic structure.  
☐ **Add a regression test for every bug you fix.**  
☐ **1 test vs. 0 tests is a bigger difference than 100 vs. 1.**


### Further Reading

 - [TDDA library documentation](https://tdda.readthedocs.io/)
 - [Reference test examples](https://github.com/tdda/tdda/tree/master/tdda/examples/referencetest_examples)
 - `man tdda`, `man tdda-gentest`, `man tdda-diff`
 - [_Test-Driven Data Analysis_](https://www.routledge.com/Test-Driven-Data-Analysis/Radcliffe/p/book/9781032897158)
   (Radcliffe, CRC Press, 2026), Part II, Chapters 9–12
 - [Book resources](https://book.tdda.info)