Reference Testing with TDDA: A Guide for LLMs and Coding Bots

Posted on Wed 20 May 2026 in TDDA • Tagged with reference testing, LLMs, coding bots, gentest, pytest, unittest

Raw markdown for direct use in LLM context windows.

This post is addressed primarily to LLMs and coding agents—it was distilled by Claude from Test-Driven Data Analysis (Radcliffe, CRC Press, 2026; extra resources), the tdda source code, the documentation, and the man pages (also available as markdown). If you are a human who wants a dense, no-nonsense introduction to reference testing with TDDA, you are welcome too. It was written in response to repeated direct observations of coding agents—including Claude—using -W to make failing tests pass without verifying that the new output was correct.

What Is a Reference Test?

A reference test is a semantic regression test. 'Semantic' here means the comparison understands equivalence: outputs that differ in timestamps, row order, version strings, hostnames, file paths, or other legitimately variable content can still pass — only meaningful changes cause failures. You run your analytical process or function, capture its outputs, inspect them and decide they are correct, then store them as reference outputs. From that point on, the test reruns the process and checks that the outputs still match the reference. If something changes—a library update, a refactoring, a subtle bug—the test catches it.

The tdda.referencetest library makes this easy for Python code. It extends unittest (and works with pytest) and provides assertion methods for:

  • strings and text files (assertStringCorrect, assertTextFileCorrect)
  • data frames (assertDataFrameCorrect, assertDataFramesEquivalent)
  • binary files (assertBinaryFileCorrect)

It also provides tdda gentest, a command-line tool that watches your process run and generates a test suite automatically.

The key insight: reference testing is particularly well suited to analytical code because the outputs are often large, complex, and hard to specify in advance. You develop the process, carefully inspect and verify the output is right once, store it, and thereafter the test does the checking.

The -W (--write-all) Trap

Do not blindly run -W (or --write-all) merely to make failing tests pass.

This is the most important thing in this document.

When a reference test fails, tdda.referencetest tells you what differs between the actual output and the stored reference. It also offers you a way to update the reference to match the new output: pass -W (or --write-all under pytest) when running the tests.

Running -W overwrites the reference files with whatever the code currently produces. After this, the tests pass. This tells you nothing about whether the output is correct. The tests will pass even if the output is completely wrong, because you just told them that the wrong output is the new reference.

The correct workflow when tests fail is:

  1. Read the failure message. It tells you what changed.
  2. Run the diff command suggested in the failure output.
  3. Look at the actual differences. Are they expected? Are they correct?
  4. Only if you have verified the new output is correct: update the references. With unittest, use -1W (tagged tests only, recommended) or -W (all); with pytest, use --tagged --write-all -s or --write-all -s. See Running a Subset of Tests with Tags for the full commands. If you have a tame human to hand, this is a good moment to involve them—humans are often better at judging whether output is actually correct, and can get quite sweary when you overwrite correct reference results with nonsense.
  5. Run the tests again (without -W) to confirm they pass.
  6. Check the updated reference files into version control.

If you skip step 3 and go straight to -W, you have not tested anything. You have merely synchronized the reference to whatever the code happens to produce right now.
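A toy model makes the trap concrete. The sketch below is not tdda code: `check_reference` and its `write` argument are hypothetical stand-ins for a reference assertion and the -W flag, assuming the reference is a plain text file.

```python
import tempfile
from pathlib import Path


def check_reference(actual: str, ref_path: Path, write: bool = False) -> bool:
    """Toy stand-in for a reference assertion (hypothetical, not tdda's API).

    With write=True it behaves like -W: it overwrites the reference with
    whatever was produced, after which the comparison trivially succeeds.
    """
    if write:
        ref_path.write_text(actual)
    return ref_path.read_text() == actual


def total_buggy(values):
    # Deliberate bug: the first value is skipped.
    return f"Total: {sum(values[1:])}"


with tempfile.TemporaryDirectory() as tmp:
    ref = Path(tmp) / "expected.txt"
    ref.write_text("Total: 6")          # the verified-correct reference

    output = total_buggy([1, 2, 3])     # produces "Total: 5"
    failed_as_it_should = not check_reference(output, ref)
    passes_after_rewrite = check_reference(output, ref, write=True)
    reference_now = ref.read_text()

print(failed_as_it_should, passes_after_rewrite, reference_now)
# prints: True True Total: 5
```

The test goes green after the rewrite, yet the stored reference now records the bug.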

Safe use of -W: the git audit pattern

If the reference files are clean in git before you run -W, you can use git diff afterwards to inspect exactly what changed, and git checkout -- path/to/testdata/ to revert all reference changes at once if anything looks wrong. This makes -W a controlled and auditable operation—but only if the working tree was clean before you ran it. Always check before running -W, not after.
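One way to automate the "check before, not after" step is to ask git whether anything under the reference directory is modified or untracked. This is a minimal sketch, assuming git is on the PATH and the script runs inside the repository; the function names and the `testdata` path are illustrative, not part of tdda.

```python
import subprocess


def dirty_paths(porcelain_output: str) -> list[str]:
    """Return the paths listed in `git status --porcelain` output."""
    # Each non-empty line is "XY <path>": two status characters, a space, the path.
    return [line[3:] for line in porcelain_output.splitlines() if line.strip()]


def references_are_clean(testdata_dir: str) -> bool:
    """True if git reports no modified or untracked files under testdata_dir."""
    result = subprocess.run(
        ["git", "status", "--porcelain", "--", testdata_dir],
        capture_output=True, text=True, check=True,
    )
    return not dirty_paths(result.stdout)


# Parsing example on canned output (no repository needed):
sample = " M testdata/expected.txt\n?? testdata/new.csv\n"
print(dirty_paths(sample))
# prints: ['testdata/expected.txt', 'testdata/new.csv']
```

If `references_are_clean('path/to/testdata')` returns False, commit or stash first, so that a later git diff shows only what the rewrite changed.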

Unit-Enhanced Reference Tests

The test code and command-line flags differ between reference tests built on unittest and those built on pytest. The sections below cover the unittest variants first; see Writing Reference Tests with pytest for the pytest equivalents. Where flags differ, both are given.

Task                     unittest                pytest
Run all tests            python tests.py -F      pytest tests/ --log-failures
Run only tagged tests    python tests.py -F -1   pytest tests/ --log-failures --tagged
Rewrite all references   -W                      --write-all -s
Rewrite tagged only      -1W                     --tagged --write-all -s

Full syntax and explanations for each are in the sections below; this table is a quick reference.

A partial structural defence against careless -W use is unit-enhanced reference tests: after the reference assertion, add one or more specific assertions about things that must be true regardless (shown here in unittest style):

def test_output(self):
    result = run_my_process(input_data)
    self.assertStringCorrect(result, 'expected.txt')
    # These survive a careless -W rewrite:
    self.assertIn('Total: 42 records', result)
    self.assertTrue(result.strip().endswith('OK'))

The reference assertion runs first. If it fails, tdda writes the actual output and suggests a diff command—the normal workflow. If you then carelessly rewrite with -W, the subsequent assertions will still fail if the output is wrong in ways they cover.

This is not a complete defence—you have to choose the assertions carefully—but it makes it much harder to accidentally accept a broken result. Choose assertions that reflect the core correctness property the test was designed to verify.

This pattern emerged from the author's direct experience of coding agents (including Claude) repeatedly using -W to make tests pass without verifying the results. It is recommended for any test where the reference output has semantic structure that can be spot-checked.

The -F (--log-failures) Flag

Always pass the log-failures flag when running tests. It logs the IDs of any failing tests to a timestamped file (YYYY-MM-DDTHHMMSS-failing-tests.txt) in your system temp directory (overridable with $TDDA_FAIL_DIR). This enables the tdda tag workflow: tdda tag reads the most recent such file and adds @tag to the failing tests, so you can re-run and regenerate references for just those tests.

Without the flag, no failures file is written and tdda tag has nothing to work with.
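If you want to look at the failure log yourself, the naming scheme described above is enough to find the newest one. This is a sketch based only on that description (file name pattern, temp directory, $TDDA_FAIL_DIR override); `latest_failures_file` is a hypothetical helper, and nothing is assumed about the file's contents.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


def latest_failures_file() -> Optional[Path]:
    """Return the newest *-failing-tests.txt file, or None if there is none."""
    # Directory choice follows the text: $TDDA_FAIL_DIR if set, else system temp.
    fail_dir = Path(os.environ.get("TDDA_FAIL_DIR") or tempfile.gettempdir())
    candidates = sorted(fail_dir.glob("*-failing-tests.txt"))
    # YYYY-MM-DDTHHMMSS prefixes sort chronologically as plain strings.
    return candidates[-1] if candidates else None


print(latest_failures_file())
```

A result of None after a failing run is a hint that the log-failures flag was not passed.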

With unittest (running directly with Python)

Pass -F (or --log-failures):

python tests/test_mycode.py -F

With pytest

Pass --log-failures:

pytest tests/ --log-failures

Permanent default

To avoid passing the flag every time, add this to ~/.tdda.toml:

[referencetest]
log_failures = true

This modifies the user's global configuration. Consult your human before doing it.

The Kicker: the -W Problem is Not Restricted to TDDA

The anti-pattern described above — rewriting expected outputs to make tests pass without verifying the new output is correct — applies far beyond tdda.referencetest. LLM coding agents routinely treat passing tests as the goal rather than as evidence of correctness. Whether rewriting a reference file, updating a pytest snapshot, regenerating Jest snapshots, or changing a hardcoded expected value in an assertion, the same question applies first: is the new result actually correct?

Green tests after any kind of expected-value rewrite tell you nothing about correctness. They tell you only that the code now matches whatever you told it to match.

The correct workflow is the same regardless of framework:

  1. A test fails. Read the failure. What changed?
  2. Is the change correct, or is it a bug?
  3. Only if correct: update the expected value.
  4. If you're not sure: ask your human.

The specific value of tdda.referencetest is that it makes step 1 easy — the diff tooling is built in, and -F/tdda tag/-1W limit the blast radius. But the discipline is universal.

Running a Subset of Tests with Tags

To run only some tests, use the @tag decorator:

from tdda.referencetest import ReferenceTestCase, tag

class TestMyProcess(ReferenceTestCase):

    @tag
    def test_main_output(self):
        result = run_my_process()
        self.assertStringCorrect(result, 'expected_output.txt')

    def test_other_thing(self):
        ...

@tag can decorate individual test methods or entire test classes. The flags to run only tagged tests differ between unittest and pytest.

With unittest (running directly with Python)

python tests/test_mycode.py -F -1        # run only tagged tests
python tests/test_mycode.py -F -1W       # regenerate references for tagged tests only

-1W combines -1 and -W (--write-all). This is the safe way to regenerate, because it limits the blast radius to tests you have explicitly chosen and tagged.

With pytest

pytest tests/ --log-failures --tagged               # run only tagged tests
pytest tests/ --log-failures --tagged --write-all -s  # regenerate references for tagged tests only

Pass -s to prevent pytest from capturing output, so that tdda can report which reference files were written.

The full workflow with tdda tag

The -F/tdda tag/-1W workflow lets you rewrite only the references that actually failed, without manually deciding which tests to tag:

  1. Run tests with -F (or --log-failures) to record failing test IDs
  2. Run tdda tag to add @tag to those tests automatically
  3. Inspect the diffs to verify the new output is correct
  4. Run -1W (or --tagged --write-all -s) to rewrite only those references
  5. Run make untag (or the sed command below) to remove the tags

This is always preferable to bare -W, which rewrites every reference file regardless of whether the test failed.

Removing stale tags

Before adding new tags, remove any stale @tag decorators from previous sessions. There is usually a make untag target that does this, or you can use:

# macOS (BSD sed):
sed -i '' '/^[[:space:]]*@tag[[:space:]]*$/d' tests/test_mycode.py

# Linux (GNU sed):
sed -i '/^[[:space:]]*@tag[[:space:]]*$/d' tests/test_mycode.py
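Where the two sed dialects are a nuisance, the same deletion can be done portably in Python. This is a sketch mirroring the sed expression above; `remove_tag_lines` is a hypothetical helper, not a tdda function.

```python
import re

# Same idea as the sed pattern: a line containing only @tag and whitespace.
TAG_LINE = re.compile(r"^[ \t]*@tag[ \t]*$")


def remove_tag_lines(source: str) -> str:
    """Drop every line consisting solely of '@tag' (plus optional whitespace)."""
    kept = [line for line in source.splitlines(keepends=True)
            if not TAG_LINE.match(line.rstrip("\r\n"))]
    return "".join(kept)


example = (
    "class TestMyProcess(ReferenceTestCase):\n"
    "    @tag\n"
    "    def test_main_output(self):\n"
    "        ...\n"
)
print(remove_tag_lines(example))
```

To apply it to a file, read the text with `pathlib.Path.read_text()`, pass it through `remove_tag_lines`, and write it back. Like the sed version, it leaves the `tag` import in place.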

Writing Reference Tests with unittest

A minimal test file:

import os
from tdda.referencetest import ReferenceTestCase, tag

TESTDIR = os.path.join(os.path.dirname(os.path.abspath(__file__)),
                       'testdata')

class TestMyProcess(ReferenceTestCase):

    def test_output(self):
        result = run_my_process(input_data)
        self.assertStringCorrect(result,
                                 os.path.join(TESTDIR, 'expected.txt'))

    def test_dataframe(self):
        df = produce_dataframe()
        self.assertDataFrameCorrect(df,
                                    os.path.join(TESTDIR, 'expected.csv'))

if __name__ == '__main__':
    ReferenceTestCase.main()

When running under pytest, the if __name__ == '__main__': block is simply ignored—the same test file works with both runners unchanged.

Run it:

python test_myprocess.py -F           # run all tests
python test_myprocess.py -F -1        # run only tagged tests
python test_myprocess.py -F -1W       # regenerate references for tagged tests

The first time you run with -1W after writing (and tagging) a new test, it writes the reference file. Subsequent runs compare against it.

After writing references with -1W, always inspect the files that were written. The fact that the test now passes means only that the reference matches the output. It says nothing about whether either is correct.

Writing Reference Tests with pytest

The same test classes work under pytest, with different flags:

pytest tests/ --log-failures                           # run all tests
pytest tests/ --log-failures --tagged                  # run only tagged tests
pytest tests/ --log-failures --tagged --write-all -s   # regenerate references for tagged tests

Note:

  • Use --write-all instead of -W.
  • Use --tagged instead of -1.
  • Pass -s to prevent pytest from capturing output, so that tdda can report which reference files were written.
  • The short flags -W and -1 are tdda extensions; they only work when running the test file directly with Python, not under pytest.

Assertion API: Text and Strings

assertStringCorrect(string, ref_path, ...) Check an in-memory string against a reference text file.

assertTextFileCorrect(actual_path, ref_path, ...) Check a text file on disk against a reference text file.

assertTextFilesCorrect(actual_paths, ref_paths, ...) Check multiple text files against corresponding reference files.

All three share these optional parameters for handling variable output:

Parameter                         Effect
lstrip=True                       Strip leading whitespace from each line before comparing
rstrip=True                       Strip trailing whitespace from each line before comparing
ignore_substrings=['foo','bar']   Ignore any line in the expected file containing one of these substrings; the corresponding actual line can be anything
ignore_patterns=[r'pattern']      Lines differing only in substrings matching these regexes pass; text outside the match must be identical in both
remove_lines=['foo']              Remove lines containing these substrings from both actual and expected before comparing
preprocess=fn                     Apply fn(list_of_lines) to both actual and expected (as lists of strings) before comparing
max_permutation_cases=N           Pass if lines differ only in order, up to N permutations; None = unlimited

ignore_substrings—ignore whole lines by substring

Lines in the expected output containing the substring are skipped. The match is against the expected file only—the actual output can have anything on those lines (or nothing):

# Reference file contains:
#   Copyright (c) Stochastic Solutions Limited, 2016
#   Version 0.0.0
# Actual output has current year and version—but we don't care:
self.assertStringCorrect(actual, 'expected.html',
    ignore_substrings=['Copyright', 'Version'])

ignore_patterns—ignore variable substrings within a line

Lines pass if they differ only in parts matching the regex. Everything outside the match must be identical in both files:

# Actual:   "Generated: 2026-05-20T14:32:01 by pipeline v2.3.1"
# Expected: "Generated: 2024-01-15T09:00:00 by pipeline v1.0.0"
# Both lines still match with:
self.assertStringCorrect(actual, 'expected.txt',
    ignore_patterns=[
        r'\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}',
        r'v\d+\.\d+\.\d+',
    ])

ignore_patterns is stricter than ignore_substrings: the non-matching parts of each line must agree exactly, so you cannot accidentally mask a real change in the surrounding text.
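The rule can be modelled in a few lines of plain Python: mask every regex match on both sides, then require the remainders to be identical. This is only an illustration of the semantics described above, not tdda's implementation; `equal_ignoring_patterns` and `MASK` are hypothetical names.

```python
import re

MASK = "<ignored>"


def equal_ignoring_patterns(actual: str, expected: str, patterns) -> bool:
    """True if the two lines are identical once every regex match is masked."""
    for pattern in patterns:
        actual = re.sub(pattern, lambda m: MASK, actual)
        expected = re.sub(pattern, lambda m: MASK, expected)
    return actual == expected


patterns = [
    r'\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}',
    r'v\d+\.\d+\.\d+',
]
expected = "Generated: 2024-01-15T09:00:00 by pipeline v1.0.0"

# Only the timestamp and version differ, so the lines are equivalent:
same = equal_ignoring_patterns(
    "Generated: 2026-05-20T14:32:01 by pipeline v2.3.1", expected, patterns)

# "pipeline" became "workflow" outside any match, so this is a real change:
changed = equal_ignoring_patterns(
    "Generated: 2026-05-20T14:32:01 by workflow v2.3.1", expected, patterns)

print(same, changed)
# prints: True False
```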

remove_lines—strip lines from both files

Lines containing the substring are removed from both actual and expected before comparing. Use this for optional or ephemeral lines that should not appear in the reference at all:

# Both files have lines like "WARNING: cache miss" that are
# present sometimes and absent other times:
self.assertStringCorrect(actual, 'expected.txt',
    remove_lines=['WARNING: cache miss'])

Unlike ignore_substrings, remove_lines strips from both sides, so the reference file also need not contain these lines.

preprocess—transform both files before comparing

Takes a function that accepts a list of strings (lines) and returns a transformed list. Applied to both actual and expected:

def strip_timestamps(lines):
    # remove leading timestamp prefix "2026-05-20 14:32:01 " from each line
    return [line[20:] if len(line) > 20 else line for line in lines]

self.assertStringCorrect(actual, 'expected.txt',
    preprocess=strip_timestamps)
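The fixed-width slice above also truncates any line longer than 20 characters that does not begin with a timestamp. A regex-based variant, sketched below under the same assumed "YYYY-MM-DD HH:MM:SS " prefix, only touches lines that really start with one; `strip_timestamp_prefix` is an illustrative name.

```python
import re

TIMESTAMP_PREFIX = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} ")


def strip_timestamp_prefix(lines):
    """Remove a leading 'YYYY-MM-DD HH:MM:SS ' prefix where one is present."""
    return [TIMESTAMP_PREFIX.sub("", line) for line in lines]


print(strip_timestamp_prefix([
    "2026-05-20 14:32:01 loaded 1000 rows",
    "summary: 3 warnings",
]))
# prints: ['loaded 1000 rows', 'summary: 3 warnings']
```

It takes and returns a list of lines, so it can be passed as preprocess= in the same way as strip_timestamps.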

max_permutation_cases—allow reordered lines

Pass if the lines are a permutation of each other, up to the given number of permutations. Use None for unlimited:

# Output order is non-deterministic, but the set of lines is fixed:
self.assertStringCorrect(actual, 'expected.txt',
    max_permutation_cases=None)
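What "a permutation of each other" means can be shown with a multiset comparison. This sketch illustrates only the unlimited case and makes no attempt to model how tdda counts permutations against N; `same_lines_any_order` is a hypothetical helper.

```python
from collections import Counter


def same_lines_any_order(actual: str, expected: str) -> bool:
    """True if both texts contain the same lines the same number of times."""
    return Counter(actual.splitlines()) == Counter(expected.splitlines())


reordered = same_lines_any_order("b\na\nc", "a\nb\nc")
different = same_lines_any_order("a\na\nb", "a\nb\nb")
print(reordered, different)
# prints: True False
```

The second case fails even though both texts use the same distinct lines, because the counts differ.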

Assertion API: DataFrames

The DataFrame assertion methods work with Pandas 2.x and 3.x (all three backends: numpy_nullable, pyarrow, and original) and with Polars. You can even compare DataFrames across engines—e.g. a Pandas actual against a Polars reference—with the engine parameter if needed.

assertDataFramesEquivalent(df, ref_df, ...) Compare two in-memory DataFrames (Pandas or Polars).

assertDataFrameCorrect(df, ref_path, ...) Compare an in-memory DataFrame against a reference file (CSV or Parquet).

assertStoredDataFrameCorrect(actual_path, ref_path, ...) Compare two DataFrames both stored on disk.

assertStoredDataFramesCorrect(actual_paths, ref_paths, ...) Compare multiple pairs of on-disk DataFrames.

check_data and check_types—exclude columns

The most common use is excluding columns whose values are legitimately variable (random seeds, run IDs, timestamps):

# Exclude the 'random' column from both value and type checks:
columns = self.all_fields_except(['random'])
self.assertDataFrameCorrect(df, 'expected.csv',
                            check_data=columns,
                            check_types=columns)

check_data, check_types, and check_order all accept the same forms:

  • None or True: check all fields (default)
  • False: skip entirely
  • a list of field names to check
  • a function taking a DataFrame and returning a list of field names

sortby—sort before comparing

Use when row order is non-deterministic:

self.assertDataFrameCorrect(df, 'expected.csv',
                            sortby=['country', 'date'])

condition—filter rows before comparing

Use when only a subset of rows is relevant to the test:

# Only compare rows where status is 'complete':
self.assertDataFrameCorrect(df, 'expected.csv',
    condition=lambda df: df['status'] == 'complete')

precision—floating-point tolerance

Default is 7 decimal places. Loosen it when values come via CSV (which can lose precision):

self.assertDataFrameCorrect(df, 'expected.csv', precision=5)

type_matching—dtype strictness

  • 'strict' (default for Parquet): dtypes must be identical
  • 'medium' (default for CSV): same underlying type (int, float, datetime) but different bit width or nullability allowed
  • 'loose': anything Pandas can compare

# CSV round-trips can change int64 to float64, so use medium:
self.assertDataFrameCorrect(df, 'expected.csv', type_matching='medium')

fuzzy_nulls—treat different null types as equal

# NaN (np.nan) and None treated as equivalent:
self.assertDataFramesEquivalent(df, ref_df, fuzzy_nulls=True)

engine—Pandas or Polars

Inferred automatically from the DataFrames. Only needed when comparing across types (a Pandas actual against a Polars reference or vice versa):

self.assertDataFramesEquivalent(pandas_df, polars_df, engine='pandas')

tdda diff—Understanding DataFrame Failures

When a DataFrame assertion fails, the failure message suggests one or more diff commands. For tabular data, it often suggests both a raw diff and a tdda diff:

Compare with:
    diff /tmp/actual-expected.csv /path/to/testdata/expected.csv
Compare with:
    tdda diff /tmp/actual-expected.csv /path/to/testdata/expected.csv

tdda diff uses the same comparison logic as the assertion methods and produces a structured summary: which columns differ, how many rows, and a table showing the differing values side by side. It is much easier to read than raw diff for anything beyond a handful of rows. Always prefer it for DataFrame failures. Example output:

Columns with differences: 1 / 12
Rows with differences:    3 / 1000

Values:
  Row   Column    Actual    Expected
   42   revenue   1500.50   1500.00
  108   revenue      0.00       NaN
  731   revenue    999.99   1000.00

It accepts the same field-selection flags as the assertion methods:

tdda diff actual.csv expected.csv --xfields random,run_id

Assertion API: Binary Files

assertBinaryFileCorrect(actual_path, ref_path) Check that a binary file is byte-for-byte identical to a reference file. No options for partial matching—if you need that, extract the relevant data and use a string or DataFrame assertion instead.

Generating Tests Automatically with Gentest

If you have a command-line process—a script, a shell command, an R program—tdda gentest can generate a test suite for it:

tdda gentest 'python my_analysis.py input.csv' testsuite.py

Gentest runs the command multiple times, captures all outputs (stdout, stderr, exit code, any files written), detects which parts vary between runs, and writes a test script that checks the stable parts. The generated script uses tdda.referencetest and can be run and maintained like any other reference test.

Inspect the generated test and the reference outputs before trusting them. Gentest is good at generating structurally correct tests; you still need to verify that the reference outputs are actually correct.
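The idea of detecting which parts vary between runs can be illustrated by running a command twice and comparing its output line by line. This is a deliberately simplified sketch, not Gentest's algorithm; the command, `run_once`, and `varying_line_numbers` are all made up for the example.

```python
import subprocess
import sys

# Hypothetical command under test: one stable line, one that changes per run.
COMMAND = [
    sys.executable, "-c",
    "import uuid; print('rows: 1000'); print('run id:', uuid.uuid4())",
]


def run_once(command):
    """Run the command and return its stdout split into lines."""
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout.splitlines()


def varying_line_numbers(first, second):
    """1-based numbers of lines that differ (compares up to the shorter run)."""
    return [i for i, (a, b) in enumerate(zip(first, second), start=1) if a != b]


run_a, run_b = run_once(COMMAND), run_once(COMMAND)
varying = varying_line_numbers(run_a, run_b)
print(varying)
# prints: [2]
```

Line 1 is stable and safe to check exactly. Line 2 varies, so a test would need to ignore or pattern-match it.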

The Reference Test Checklist

  • Create at least one reference test for every analytical process you write.
  • Run tests before making changes, so you know the baseline.
  • Run tests after making changes, before assuming they worked.
  • When a test fails, read the diff before doing anything else.
  • Never run -W without first verifying the new output is correct.
  • Prefer -1W (or --tagged --write-all -s) over bare -W: rewrite only the references that actually failed.
  • Use -F and tdda tag to automatically tag failing tests for targeted reruns and rewrites.
  • After writing references, inspect the files. Tests passing after -W or --write-all is not evidence of correctness.
  • Ensure reference files are clean in git before running -W, so you can use git diff to review changes and revert with git checkout -- testdata/ if needed.
  • Consider unit-enhanced reference tests for anything with checkable semantic structure.
  • Add a regression test for every bug you fix.
  • 1 test vs. 0 tests is a bigger difference than 100 vs. 1.


TDDA: The Book, the 3.0 Library, and the PyData London 2026 Tutorial

Posted on Tue 19 May 2026 in TDDA • Tagged with library, talk, book

This blog has been quite quiet, but there is a great deal of news and it may be less quiet for a while.

The Book

Today, 19th May 2026, sees the world-wide release of Test-Driven Data Analysis, from CRC Press.

The cover of the book Test-Driven Data Analysis by Nicholas J. Radcliffe. It is published by Chapman and Hall, part of CRC Press, from Taylor & Francis Group, and is part of the DATA SCIENCE SERIES. The cover is black with mostly white text and a white graphic. The graphic is a 3-row by 4-column grid of squares. Each square contains a number of dots laid out on a regular 32x32 grid. The top-left square has 1024 dots (“full”) and working along each row in turn, the number of dots roughly halves each time, apparently at random (and, actually, pseudo-randomly). The last row’s boxes have six, two, two, and one dot.

It is available from all good booksellers and all sellers of good books, and until 30th June 2026 the code 26SMA1 will give a 20% discount from the publisher's site.

The book covers:

  • the TDDA methodology
    • including areas not obviously amenable to software support, such as errors of interpretation, errors of applicability, errors of process, and errors of judgement
  • the TDDA command-line tools for
    • data validation,
    • reference-test generation with Gentest (tests for code in any language),
    • a diff tool for on-disk data frames (as parquet files and flat files),
    • tools for working with the tdda.serial format and also with CSVW (CSV on the Web) and Frictionless.
  • Reference testing with tdda.referencetest under unittest or pytest
  • Test-Driven Document Development (TDDD)
  • APIs for all functionality

Resources from the book are available at book.tdda.info, including

  • 22 Checklists
  • All figures
  • Glossary
  • Data Profiles
  • Data Dictionaries
  • TDDD tests for the book.

Examples from the book are available from the tdda library by using the tdda command:

tdda examples book

The whole of TDDA is really built around the encapsulation of the data-analysis cycle shown below, and the diagram shows how the book covers these ideas.

The main part of the diagram consists of six circles from
left to right.
The first five circles have failure mode text
under them and an error class below that.
1. CHOOSE APPROACH.
Failure: 'Fail to understand data, problem domain, or methods',
ERROR OF INTERPRETATION (error of formulation).
Ch 13.
2. DEVELOP ANALYTICAL PROCESS.
Failure: 'Mistakes during coding' and the associated
ERROR OF IMPLEMENTATION (bug).
Ch 9-12.
3. RUN ANALYTICAL PROCESS.
Failure: 'Use the software incorrectly'
ERROR OF PROCESS (operator error).
Ch 16.
4. PRODUCE ANALYTICAL RESULTS.
Failure: 'Mismatch between development data or assumptions
and deployment data'
ERROR OF APPLICABILITY (category error).
Ch 1-7 & 17.
5. INTERPRET ANALYTICAL RESULTS.
Failure: 'Misinterpret the results'
ERROR OF INTERPRETATION (communication error).
Ch 14 & 15.
6. 'First, Do No Harm'.
ERROR OF JUDGEMENT.
Ch 17.
Arrows lead to FAILURE and SUCCESS boxes.
Remedies and book chapters sit underneath the main diagram.

The TDDA Library, Version 3.0

Top Line: Three Machines illustrating
1. constraint discovery and data validation: an input hopper takes training
data and produces constraints, or training data + constraints to produce
data validations at the output chute.
2. Rexpy, which takes strings in its input hopper and produces
regular expressions at the output chute,
3. TDDA gentest, which takes code in the input hopper and produces a Python
reference-test script as output.
Bottom Line: 4. tdda diff which compares data in flat files and parquet
files to detect (semantic) differences.
5. tdda.serial, which is a format for describing flat-file formats and
a suite of tools for working with tdda.serial, CSVW, and Frictionless
6. tdda.referencetest, for semantic testing of complex analytical results.

Version 3.0 of the library and command-line tools is a major upgrade.

All the main features have upgrades:

  • Data validation using constraints, which can be generated from training data.

  • Inference of regular expressions from example strings.

  • Automatic generation of tests for almost any non-GUI code in any language (Gentest).
    "Gentest writes tests so you don't have to."™

  • Enhanced test support for complex results in both Python's unittest and in pytest with reference testing.

New features include:

  • Support for Pandas 3.0, including all three backends (original, numpy_nullable, and pyarrow).

  • Support for Polars DataFrames in most areas of the library.

  • Comprehensive Parquet support, replacing feather format.

  • tdda diff: find and visualize differences between datasets in flat files (like CSV files) and parquet files, with control over specificity and scope.

  • Flat-file metadata support: the new tdda.serial format allows the format of CSV and other flat files to be described for accurate reading across libraries. This includes inference of flat-file formats, Python code generation, helper functions for reading and writing flat files with metadata, and conversion between tdda.serial, CSVW (CSV on the Web), and Frictionless.

  • Text utilities for Unicode, including glyph counting and extended normalization forms beyond canonical composition and decomposition (NFC, NFD), and kompatibility normalization (NFKC and NFKD). Form NFTK performs further kompatibility normalization including accent stripping.

  • Man pages for all commands

  • Upgraded documentation for command line tools and the API.

PyData London TDDA Tutorial, 5th June 2026, 14:10

I'll be giving a 90-minute hands-on tutorial on TDDA on 5th June 2026 at PyData London. Do come along if you can. PyData is always great, for experts and novices and all levels of technical interest and proficiency. It would be great to see you there.

Get tickets from PyData.

And if you have something to share, prepare a 5-minute Lightning Talk. They are always a highlight of the conference.


Test-Driven Document Development

Posted on Tue 02 September 2025 in TDDA • Tagged with TDDD

Summary

Computational documents attempt to guarantee that results included within them—such as graphs—correspond to the code and data claimed to generate them. They typically achieve this by generating the outputs from the code at the time the document is generated or viewed. This solves significant problems, including those of code rusting (exhibiting changed behaviour) and of unintentional inclusion of stale, incorrect, or unvalidated results. There is, however, a danger of what I term co-rusting, whereby the code and its outputs drift away from correctness (rust) together, without the author realizing. This is likely if the code continues to generate output (i.e., does not crash or report an error).

Computational documents are an important part of reproducible research, within which the main approach to avoiding co-rusting tends to be the use of reproducible environments, which aim to prevent rusting by pinning down as much of the computational environment as possible.

Test-Driven Document Development (TDDD) builds on computational documents by adding automated tests that fail when results change (materially). If these tests are run as part of the build process for the document, the possibility of co-rusting is reduced or eliminated. TDDD can be viewed as the application of test-driven data analysis (TDDA) to the process of document creation, essentially considering the generation of a document as an analytical process that should be supported by reference tests.

The tests can be created by hand, but the Gentest functionality of the tdda tool turns out to be powerful for implementing the tests needed by TDDD, whatever language is used to generate the results.

Background: Computational Documents

Computational documents include one or more results generated by computer code, and provide some guarantee that each result matches its generating code. This is usually achieved by including the code in the document and generating the output either as part of document production (compilation, e.g., Quarto, or in a more limited way, cog) or on-the-fly, for computational notebooks (interpretation, like Jupyter Notebooks / JupyterLab and marimo).

Here is a simple Quarto computational document that calculates the number of potential UK postcodes as defined by a regular expression describing valid ones.[1] This number is quoted in a book I am writing on TDDA. Prior to today, it was pasted into the book by copying the output from an interactive Python session where I calculated it. I probably inserted the thousand separators by hand (another error-prone process). Today I not only changed the book so that the number is calculated and included when the book is compiled, but also added reference tests to detect if it changes. (source)

---
title: "Quarto Postcodes (inline)"
format:
  html:
    code-fold: true
  pdf:
    toc: false
jupyter: python3
---

```{python}
from letters import nL

RE = r'^[A-Z]{1,2}[0-9]{1,2}[A-Z]? [0-9][A-Z]{2}$'

def n_poss_postcodes_for_re():
    """
    Number of strings matching:
      ^[A-Z]{1,2}[0-9]{1,2}[A-Z]? [0-9][A-Z]{2}$
    """
    n_postal_areas = nL + nL * nL  # 1 or two letters
    n_postal_districts = 10 + 100  # Any one or two digit number
                                   # 0 and 0x aren't used, but match the regex
    n_subdistricts = nL + 1        # Not all letters are used,
                                   # and only for some London codes,
                                   # but for our regex...
                                   # The +1 is for ones not using a subdistrict

    n_outcodes = n_postal_areas * n_postal_districts * n_subdistricts
    n_incodes = 10 * nL * nL       # Digit then two letters
    n_postcodes = n_outcodes * n_incodes

    return n_postcodes


if __name__ == '__main__':
    n = n_poss_postcodes_for_re()
```
The number of postcode-like strings matching

    `{python} RE`

is `{python} f'{n:,}'`

This document is written in a dialect of Markdown defined by Quarto. It has a header at the top, containing metadata, then a fenced Markdown Python block containing the code (which defines two variables used later in the document), and some text that uses those two variables (RE and n) to say how many postcodes match. It has a confected dependency on another Python file, letters.py, defining the number of letters, nL, in English:

nL = 26

It can be compiled with:

    quarto render postcodes1.qmd

producing this page and this document. This is a rather simple computational document, which shows the code and one important output number that is “guaranteed” to be generated from the code shown. It would be usual to include graphs or tables of some sort, but this is a minimal example, so I really wanted only a single number.
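As a sanity check on the counting argument in the code, the same formula can be compared with a brute-force count over a deliberately tiny alphabet. This is only a sketch and is not part of the original code: the function names, the two-letter alphabet, and the single digit are assumptions chosen so that exhaustive enumeration is fast.

```python
import itertools
import re


def n_matching_by_formula(n_letters, n_digits):
    """Count strings matching the postcode-like pattern, by formula."""
    n_areas = n_letters + n_letters ** 2        # one or two letters
    n_districts = n_digits + n_digits ** 2      # one or two digits
    n_subdistricts = n_letters + 1              # optional single letter
    n_incodes = n_digits * n_letters ** 2       # digit then two letters
    return n_areas * n_districts * n_subdistricts * n_incodes


def n_matching_by_brute_force(letters, digits):
    """Count matches by trying every candidate string of length 6 to 9."""
    pattern = re.compile(
        rf'[{letters}]{{1,2}}[{digits}]{{1,2}}[{letters}]? '
        rf'[{digits}][{letters}]{{2}}'
    )
    alphabet = letters + digits + ' '
    return sum(
        1
        for length in range(6, 10)              # shortest is 6, longest is 9
        for chars in itertools.product(alphabet, repeat=length)
        if pattern.fullmatch(''.join(chars))
    )


small_formula = n_matching_by_formula(2, 1)
small_brute = n_matching_by_brute_force('AB', '0')
print(small_formula, small_brute)               # 144 144
print(f'{n_matching_by_formula(26, 10):,}')     # 14,094,194,400
```

Agreement on the small alphabet does not prove the formula for 26 letters and 10 digits, but it does catch the most likely slips, such as forgetting the "+ 1" for codes with no subdistrict letter.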

The version of the code actually used to generate the number in the book does not import nL from letters.py, but includes the line nL = 26 in the main program. That's because I'm not trying to make it fail in the book. I've written it this way for the post to give me an easy way to demonstrate co-rusting, which is an entirely real phenomenon. A change in a dependency is a common reason for rusting. (If you do not believe in code rusting or co-rusting, try reading Why Code Rusts; if that doesn't convince you, this article may not be for you.)

Writing Tests For the Code

We will begin by writing tests for essentially the same code, just written as a standalone Python program rather than embedded in a Quarto document.

Here is the same code as an actual Python script, postcodes.py, together with some slightly different behaviour after calling the postcode-counting function.

import json
from letters import nL
from tdda.utils import dict_to_tex_macros

RE = r'^[A-Z]{1,2}[0-9]{1,2}[A-Z]? [0-9][A-Z]{2}$'

def n_poss_postcodes_for_re():
    """
    Number of strings matching:
      ^[A-Z]{1,2}[0-9]{1,2}[A-Z]? [0-9][A-Z]{2}$
    """
    n_postal_areas = nL + nL * nL  # 1 or two letters
    n_postal_districts = 10 + 100  # Any one or two digit number
                                   # 0 and 0x aren't used, but match the regex
    n_subdistricts = nL + 1        # Not all letters are used,
                                   # and only for some London codes,
                                   # but for our regex...
                                   # The +1 is for ones not using a subdistrict

    n_outcodes = n_postal_areas * n_postal_districts * n_subdistricts
    n_incodes = 10 * nL * nL       # Digit then two letters
    n_postcodes = n_outcodes * n_incodes

    return n_postcodes


if __name__ == '__main__':
    n = n_poss_postcodes_for_re()
    d = {'n': n, 'n_str': f'{n:,}', 'postcodeRE': RE}
    json_path = 'postcodes.json'
    with open(json_path, 'w') as f:
        json.dump(d, f, indent=4)
    dict_to_tex_macros(d, 'postcodes-defs.tex', verbose=False)

If we run this code, it produces no output but writes two files. The first is a JSON file, postcodes.json:

{
    "n": 14094194400,
    "n_str": "14,094,194,400",
    "postcodeRE": "^[A-Z]{1,2}[0-9]{1,2}[A-Z]? [0-9][A-Z]{2}$"
}

We have chosen to write into this the values we might want in the document (in this case, the number both as a number and as a formatted string, as well as the relevant regular expression).

There's a second file, postcodes-defs.tex, which we will use later when we use LaTeX as a TDDD engine. This contains the same values, but now as TeX macros:

\def\n{14094194400}
\def\nStr{14,094,194,400}
\def\postcodeRE{\^[A-Z]\{1,2\}[0-9]\{1,2\}[A-Z]? [0-9][A-Z]\{2\}\$}
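The source of dict_to_tex_macros is not shown here, so the following is only a guess at the kind of transformation involved, written to reproduce the shape of the lines above. The assumptions are that snake_case keys become camelCase macro names and that the characters ^, {, } and $ are backslash-escaped. The names to_tex_macros, macro_name and tex_escape are hypothetical and are not tdda's API.

```python
# Characters assumed to need escaping, inferred from the \postcodeRE line
TEX_SPECIALS = {'^': r'\^', '{': r'\{', '}': r'\}', '$': r'\$'}


def macro_name(key):
    """Convert a snake_case key to a camelCase macro name (n_str -> nStr)."""
    head, *rest = key.split('_')
    return head + ''.join(part[:1].upper() + part[1:] for part in rest)


def tex_escape(value):
    """Backslash-escape the characters listed in TEX_SPECIALS."""
    return ''.join(TEX_SPECIALS.get(c, c) for c in str(value))


def to_tex_macros(d):
    """Return one \\def line per dictionary item."""
    return [f'\\def\\{macro_name(k)}{{{tex_escape(v)}}}' for k, v in d.items()]


RE = r'^[A-Z]{1,2}[0-9]{1,2}[A-Z]? [0-9][A-Z]{2}$'
macro_lines = to_tex_macros({'n_str': '1,234', 'postcodeRE': RE})
print('\n'.join(macro_lines))
```

The point of writing macros rather than pasting values is the same as for the JSON file: the document picks up whatever the code produced, and the file itself can be checked by a test.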

If you have the tdda library installed, you have as part of it a tool called Gentest, which can write tests in Python for essentially any command-line program, script, or command, in any language.

The line below instructs Gentest to generate tests for running the Python program postcodes.py.

$ tdda gentest 'python postcodes.py'

This produces the following output:


Running command 'python postcodes.py' to generate output (run 1 of 2).
Saved (empty) output to stdout to /Users/njr/blogs/tdda-code/tddd-postcodes/ref/python_postcodes_py/STDOUT.
Saved (empty) output to stderr to /Users/njr/blogs/tdda-code/tddd-postcodes/ref/python_postcodes_py/STDERR.
Copied $(pwd)/postcodes-defs.tex to $(pwd)/ref/python_postcodes_py/postcodes-defs.tex
Copied $(pwd)/postcodes.json to $(pwd)/ref/python_postcodes_py/postcodes.json

Running command 'python postcodes.py' to generate output (run 2 of 2).
Saved (empty) output to stdout to /Users/njr/blogs/tdda-code/tddd-postcodes/ref/python_postcodes_py/2/STDOUT.
Saved (empty) output to stderr to /Users/njr/blogs/tdda-code/tddd-postcodes/ref/python_postcodes_py/2/STDERR.
Copied $(pwd)/postcodes-defs.tex to $(pwd)/ref/python_postcodes_py/2/postcodes-defs.tex
Copied $(pwd)/postcodes.json to $(pwd)/ref/python_postcodes_py/2/postcodes.json

Test script written as /Users/njr/blogs/tdda-code/tddd-postcodes/test_python_postcodes_py.py
Command execution took: 0.44s

SUMMARY:

Directory to run in:        /Users/njr/blogs/tdda-code/tddd-postcodes
Shell command:              python postcodes.py
Test script generated:      /Users/njr/blogs/tdda-code/tddd-postcodes/test_python_postcodes_py.py
Reference files:
    $(pwd)/postcodes-defs.tex
    $(pwd)/postcodes.json
Check stdout:               yes (was empty)
Check stderr:               yes (was empty)
Expected exit code:         0
Clobbering permitted:       yes
Number of times script ran: 2
Number of tests written:    6

If you run tdda gentest without specifying a command, you get a wizard, which asks what command to run and also gives you various other options that can alternatively be passed on the command line.

The output is intended to be self-explanatory, but to elaborate, what Gentest has done is:

  • Run the command twice;
  • Recorded what was printed (both on the normal output stream stdout and also, separately, what was printed on the error output stream stderr);
  • Taken copies of any files created (in our case, the .json and .tex files);
  • Noted the exit code from the program (here 0, indicating successful completion);
  • Looked to see whether there were any differences between the two runs, and whether anything in the output looked highly dependent on the environment or context. Here nothing did, but if it had, Gentest would have generated tests that attempted to factor out things that look as if they might vary from run to run. (Examples include timestamps, run durations, hostnames etc.);
  • Written a test script, test_python_postcodes_py.py. When run, this executes the command under test and compares its behaviour and outputs to those it collected when generating the tests. The tests only pass if the behaviour and outputs are identical other than anything Gentest decided was not fixed. In this case, there was nothing that Gentest classed as not fixed.
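The run-twice-and-compare idea in the steps above can be sketched with the standard library alone. This is not Gentest's implementation: it ignores created files and makes no attempt to normalize timestamps or hostnames, and the function names run_once and unstable_parts are invented for illustration.

```python
import subprocess
import sys


def run_once(command):
    """Run a command and record its exit code, stdout and stderr."""
    result = subprocess.run(command, capture_output=True, text=True)
    return {
        'exit_code': result.returncode,
        'stdout': result.stdout,
        'stderr': result.stderr,
    }


def unstable_parts(command):
    """Run the command twice and list the parts that differ between runs."""
    first, second = run_once(command), run_once(command)
    return first, [key for key in first if first[key] != second[key]]


# A trivially deterministic command, standing in for 'python postcodes.py'
command = [sys.executable, '-c', 'print("hello")']
reference, varying = unstable_parts(command)
print(reference['exit_code'], repr(reference['stdout']), varying)
```

Anything listed in varying is a candidate for exclusion or normalization in the generated tests; everything else can be compared exactly against the stored reference.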

The code generated is in test_python_postcodes_py.py

If we run this test script, thus:

$ python test_python_postcodes_py.py

we get

......
----------------------------------------------------------------------
Ran 6 tests in 0.439s

OK

which shows that our tests have passed, meaning that the output is unchanged. I'm not going to go through the tests, but by all means look at them.

Simulated Co-Rusting

Let's look at what happens if our code's behaviour changes as a result of rusting. We will simulate this by replacing letters.py with letters52.py, which records the number of upper- and lower-case letters in English.2

    cp letters52.py letters.py

If we do this and run the tests again, we get two test failures and some suggested diff commands to run to understand them:

..2 lines are different, starting at line 1
Compare with:
    diff /Users/njr/blogs/tdda-code/tddd-postcodes/postcodes-defs.tex /Users/njr/blogs/tdda-code/tddd-postcodes/ref/python_postcodes_py/postcodes-defs.tex

F2 lines are different, starting at line 2
Compare with:
    diff /Users/njr/blogs/tdda-code/tddd-postcodes/postcodes.json /Users/njr/blogs/tdda-code/tddd-postcodes/ref/python_postcodes_py/postcodes.json

F..
======================================================================
FAIL: test_postcodes_defs_tex (__main__.Test_PYTHON_POSTCODES.test_postcodes_defs_tex)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/njr/blogs/tdda-code/tddd-postcodes/test_python_postcodes_py.py", line 52, in test_postcodes_defs_tex
    self.assertTextFileCorrect(os.path.join(self.cwd, 'postcodes-defs.tex'),
    ~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                               os.path.join(self.refdir, 'postcodes-defs.tex'),
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                               encoding='ascii')
                               ^^^^^^^^^^^^^^^^^
AssertionError: False is not true : 2 lines are different, starting at line 1
Compare with:
    diff /Users/njr/blogs/tdda-code/tddd-postcodes/postcodes-defs.tex /Users/njr/blogs/tdda-code/tddd-postcodes/ref/python_postcodes_py/postcodes-defs.tex


======================================================================
FAIL: test_postcodes_json (__main__.Test_PYTHON_POSTCODES.test_postcodes_json)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/njr/blogs/tdda-code/tddd-postcodes/test_python_postcodes_py.py", line 57, in test_postcodes_json
    self.assertTextFileCorrect(os.path.join(self.cwd, 'postcodes.json'),
    ~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                               os.path.join(self.refdir, 'postcodes.json'),
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                               encoding='ascii')
                               ^^^^^^^^^^^^^^^^^
AssertionError: False is not true : 2 lines are different, starting at line 2
Compare with:
    diff /Users/njr/blogs/tdda-code/tddd-postcodes/postcodes.json /Users/njr/blogs/tdda-code/tddd-postcodes/ref/python_postcodes_py/postcodes.json


----------------------------------------------------------------------
Ran 6 tests in 0.443s

FAILED (failures=2)

and if we run the second suggested diff command (on the JSON files), we see:

2,3c2,3
<     "n": 434464659200,
<     "n_str": "434,464,659,200",
---
>     "n": 14094194400,
>     "n_str": "14,094,194,400",

This shows that, with the changed dependency, the code now produces well over 400 billion potential postcodes, rather than the 14 billion we expected. (The lack of a newline at the end of stdout is not significant, and is ignored by the test.) So, as we hoped, the test detected the rusting of our code, and the co-rusting of its output.
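The two numbers in the diff can be reproduced by making the alphabet size a parameter of the counting function. This is a minimal sketch; the name n_poss_postcodes and its parameter are introduced here for illustration and do not appear in postcodes.py.

```python
def n_poss_postcodes(n_letters):
    """Same counting argument as n_poss_postcodes_for_re, with nL as input."""
    n_postal_areas = n_letters + n_letters * n_letters   # 1 or 2 letters
    n_postal_districts = 10 + 100                        # 1 or 2 digits
    n_subdistricts = n_letters + 1                       # optional letter
    n_outcodes = n_postal_areas * n_postal_districts * n_subdistricts
    n_incodes = 10 * n_letters * n_letters               # digit, two letters
    return n_outcodes * n_incodes


for n_letters in (26, 52):
    print(n_letters, f'{n_poss_postcodes(n_letters):,}')
# 26 14,094,194,400
# 52 434,464,659,200
```

Doubling the alphabet multiplies the result by roughly 31, because the letter count enters the formula about five times over.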

The first diff command shows exactly the same differences in the TeX macros written:

1,2c1,2
< \def\n{434464659200}
< \def\nStr{434,464,659,200}
---
> \def\n{14094194400}
> \def\nStr{14,094,194,400}

If we run the Quarto file postcodes1.qmd with the change, there is no obvious problem: the code and the result continue to match, but are now different from what I intended and originally validated. Here are the html and pdf.

A TDDD Version of the Quarto Doc

We can make the Quarto document more robust (and have the benefit of keeping the code in a script, rather than forcing it into the document) by using this Quarto file, postcodes2.qmd.

---
title: "Quarto Postcodes (with inclusion)"
format:
  html:
    code-fold: true
  pdf:
    toc: false
jupyter: python3
---

{{< include _postcodes.py.qmd >}}
```{python}
with open('ref/python_postcodes_py/postcodes.json') as f:
    ref = json.load(f)
assert d == ref
```
The number of postcode-like strings matching

    `{python} ref['postcodeRE']`

is `{python} ref['n_str']`


The include line at the top imports the file _postcodes.py.qmd. This file is just our script, in Quarto Markdown fences, with a filename beginning with an underscore, which Quarto requires for inclusions for some reason. We construct the file automatically as part of the build process (in the Makefile).

After the inclusion, we read the JSON file that Gentest saved in its reference directory into Python as a dictionary called ref, and then check that this reference dictionary is equal to the one we generated when we ran the code as part of the Quarto rendering process. The Makefile runs the tests (outside Quarto) immediately before rendering, so if the assertion passes, we actually know two things:

  1. The tests passed when we ran them outside Quarto (showing that the code produces the results we previously validated as OK), and

  2. When we ran the same code inside Quarto, its results (or at least, the results in the dictionary) were also the same as the reference results in the test.

The rest of the Quarto document is the same as the first version except that we use the results from the dictionary (since those are validated) and choose to use the preformatted string ref['n_str'] rather than formatting it inline. (This makes no difference.)

In this case, and many others, it makes no difference whether we use ref (the results read from the reference JSON file) or d as the source of our values, because the assertion checked that they were identical. The reason I've used ref is that in some other cases, we allow non-material differences between the actual and reference results, typically things like datestamps indicating run-time, machine names etc. (If those are different, we need to use a slightly different assertion.) By using the reference results, we ensure that the document does not change each time we compile it if there are no material differences.
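One possible shape for such a "slightly different assertion" is to compare the two dictionaries after dropping the keys regarded as non-material. This is a sketch under assumptions: the helper names and the run_at and host keys are invented for illustration, and nothing here is taken from tdda.

```python
def material_view(results, ignore):
    """Copy of results without the keys regarded as non-material."""
    return {k: v for k, v in results.items() if k not in ignore}


def assert_materially_equal(actual, ref, ignore=()):
    """Raise AssertionError if actual and ref differ outside ignored keys."""
    assert material_view(actual, ignore) == material_view(ref, ignore)


ref = {'n': 14094194400, 'run_at': '2025-06-01T09:00:00', 'host': 'alpha'}
d = {'n': 14094194400, 'run_at': '2025-06-02T17:30:00', 'host': 'beta'}

# Passes: the only differences are in keys we have declared non-material
assert_materially_equal(d, ref, ignore={'run_at', 'host'})
```

With an assertion like this, d and ref can differ in the ignored keys, which is why rendering from ref keeps the compiled document stable from one build to the next.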

Discussion

Next:

  • Look at the JSON and TeX macros
  • Change the letters to be 52
  • Show the test failing
  • Show how to use the script code in Quarto
  • Do the LaTeX version.

  1. All current valid postcodes match this expression, but many strings that match it do not exist and some would probably not be considered valid.

  2. By way of full disclosure, when I actually replaced letters.py with letters52.py and ran the tests they passed, to my dismay. This happened not because of a problem with the tests, but because I created letters52.py and letters26.py by copying letters.py and failed to update the contents of letters52.py. If you were to look back in the Git history for the repo, you'd see that. I mention this simply as a further demonstration that all humans are prone to error, which is some of the reason TDDD and TDDA are helpful! Of course, some humans are less error-prone than others!


tdda.serial: Metadata for Flat Files (CSV Files)

Posted on Mon 23 June 2025 in misc

Almost all data scientists and data engineers have to work with flat files (CSV files) from time to time. Despite their many problems, CSVs are too ubiquitous, too universal, and (whisper it) have too many strengths for them to be likely to disappear. Even if they did, they would quickly be reinvented. The problems with them are widely known and discussed, and will be familiar to almost everyone who works with them. They include issues with encodings, types, quoting, nulls, headers, and with dates and times. My favourite summary of them remains Jesse Donat's Falsehoods Programmers Believe about CSVs. I wrote about them on this blog nearly four years ago (Flat Files).

Over the last year or so I've been writing a book on test-driven data analysis. The only remaining chapter without a full draft discusses the same topics as this post—metadata for CSV files and new parts of the TDDA software that assist with its creation and use. This post documents my current thinking, plans and ambitions in this area, and shows some of what is already implemented.1

A Metadata Format for Flat Files: tdda.serial

The core of the new work is a new format, tdda.serial, for describing data in CSV files.

The previous post showed an example (“XMD”) metadata file used by the Miró software from my company Stochastic Solutions, which was as follows:2

    <?xml version="1.0" encoding="UTF-8"?>
    <dataformat>
        <sep>,</sep>                     <!-- field separator -->
        <null></null>                    <!-- NULL marker -->
        <quoteChar>"</quoteChar>         <!-- Quotation mark -->
        <encoding>UTF-8</encoding>       <!-- any python coding name -->
        <allowApos>True</allowApos>      <!-- allow apostophes in strings -->
        <skipHeader>False</skipHeader>   <!-- ignore the first line of file -->
        <pc>False</pc>                   <!-- Convert 1.2% to 0.012 etc. -->
        <excel>False</excel>             <!-- pad short lines with NULLs -->
        <dateFormat>eurodt</dateFormat>  <!-- Miró date format name -->
        <fields>
            <field extname="mc id" name="ID" type="string"/>
            <field extname="mc nm" name="MachineName" type="int"/>
            <field extname="secs" name="TimeToManufacture" type="real"/>
            <field extname="commission date" name="DateOfCommission"
                   type="date"/>
            <field extname="mc cp" name="Completion Time" type="date"
                   format="rdt"/>
            <field extname="sh dt" name="ShipDate" type="date" format="rd"/>
            <field extname="qa passed?" name="Passed QA" type="bool"/>
        </fields>
        <requireAllFields>False</requireAllFields>
        <banExtraFields>False</banExtraFields>
    </dataformat>

Here is one equivalent way of expressing essentially the same information in the (evolving) tdda.serial format:

{
    "format": "http://tdda.info/ns/tdda.serial",
    "writer": "tdda.serial-2.2.15",
    "tdda.serial": {
        "encoding": "UTF-8",
        "delimiter": "|",
        "quote_char": "\"",
        "escape_char": "\\",
        "stutter_quotes": false,
        "null_indicators": "",
        "accept_percentages_as_floats": false,
        "header_row_count": 1,
        "map_missing_trailing_cols_to_null": false,
        "fields": {
            "mc id": {
                "name": "ID",
                "fieldtype": "int"
            },
            "mc nm": {
                "name": "Name",
                "fieldtype": "string"
            },
            "secs": {
                "name": "TimeToManufacture",
                "fieldtype": "int"
            },
            "commission date": {
                "name": "DateOfCommission",
                "fieldtype": "date",
                "format": "iso8601date"
            },
            "mc cp": {
                "name": "CompletionTime",
                "fieldtype": "datetime",
                "format": "iso8601datetime"
            },
            "sh dt":  {
                "name": "ShipDate",
                "fieldtype": "date",
                "format": "iso8601date"
            },
            "qa passed?": {
                "name": "PassedQA",
                "fieldtype": "bool",
                "true_values": "yes",
                "false_values": "no"
            }
        }
    }
}

The details don't matter too much at this stage, and may yet change, but briefly here we see the file (typically with a .serial extension), describing:

  • the text encoding used for the data (UTF-8);
  • the field separator (pipe, |);
  • the quote character (double quote, ");
  • the escape character (\), which is used to escape double quotes in double-quoted strings, among other things;
  • whether quotes are stuttered or escaped within quoted strings;
  • the string used to denote null values (this can be a single string or a list);
  • the number of header rows;
  • an explicit note not to accept percentages in the file as floating-point values;
  • whether or not lines with too few fields should be regarded as having nulls for the apparently missing fields. (Excel usually does not write values after the last non-empty cell in each row on a worksheet.)
  • information about individual fields. In this case, a dictionary is used to map names in the flat file to names to be used in the dataset. Numbers can also be used to indicate column position, particularly if there is no header, though they have to be quoted because this is JSON. Field types are also specified, together with any extra information required, e.g. the non-standard true and false values for the boolean field qa passed? (in the file), which becomes PassedQA once read. Formats for the date and time fields are also specified here.

When the fields are presented as a dictionary, as here, this allows for the possibility that there are other fields in the file, for which metadata is not provided. If a list is used instead, the field list is taken to be complete. In this case, external names can be provided using a csvname attribute, if they are different.

Pretty much everything is optional, and, where appropriate, defaults can be put in the main section and over-ridden on a per-field basis. This is useful if, for example, one or two fields use different null markers from the default, or if multiple date formats are used. (The format key will probably change to dateformat and boolformat to make this overriding work better.)
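The default-and-override behaviour described here can be pictured as a simple lookup: use the per-field value if there is one, otherwise fall back to the main section. The sketch below is an assumption about how that might look, not tdda's code. In particular, placing null_indicators inside a field entry is inferred from the paragraph above, and field_setting is a hypothetical helper.

```python
import json

SERIAL_TEXT = '''
{
    "tdda.serial": {
        "null_indicators": "",
        "fields": {
            "mc id": {"name": "ID", "fieldtype": "int"},
            "secs": {"name": "TimeToManufacture", "fieldtype": "int",
                     "null_indicators": ["?", "n/a"]}
        }
    }
}
'''


def field_setting(serial, csv_name, key):
    """Look up key for one field, falling back to the main-section default."""
    section = serial['tdda.serial']
    field = section['fields'].get(csv_name, {})
    return field.get(key, section.get(key))


serial = json.loads(SERIAL_TEXT)
print(repr(field_setting(serial, 'mc id', 'null_indicators')))  # default
print(repr(field_setting(serial, 'secs', 'null_indicators')))   # override
```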

Here is a simple example of its use with Pandas. Suppose we have the following pipe-separated flat file, with the name machines.psv.

mc id|mc nm|secs|commission date|mc cp|sh dt|qa passed?
1111111|"Machine 1"|86400|2025-06-01|2025-06-07T12:34:56|2025-06-21|yes
2222222|"Machine 2"||2025-06-02|2025-06-08T12:34:57|2025-06-22
3333333|"Machine 3"|86399|2025-06-03|2025-06-09T12:34:55|2025-06-22|no

Then we can use the following Python code to load the data, informed by the metadata in machines.serial (the example shown above).

from tdda.serial import csv_to_pandas

df = csv_to_pandas('machines.psv', 'machines.serial')
print(df, '\n')
df.info()

This produces the following output:

$ python pd-read-machines.py
        ID       Name  TimeToManufacture DateOfCommission      CompletionTime   ShipDate  PassedQA
0  1111111  Machine 1              86400       2025-06-01 2025-06-07 12:34:56 2025-06-21      True
1  2222222  Machine 2               <NA>       2025-06-02 2025-06-08 12:34:57 2025-06-22      <NA>
2  3333333  Machine 3              86399       2025-06-03 2025-06-09 12:34:55 2025-06-22     False

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   ID                 3 non-null      Int64
 1   Name               3 non-null      string
 2   TimeToManufacture  2 non-null      Int64
 3   DateOfCommission   3 non-null      datetime64[ns]
 4   CompletionTime     3 non-null      datetime64[ns]
 5   ShipDate           3 non-null      datetime64[ns]
 6   PassedQA           2 non-null      boolean
dtypes: Int64(2), boolean(1), datetime64[ns](3), string(1)
memory usage: 288.0 bytes

There's nothing particularly special here, but Pandas has read the file correctly, using the metadata to understand:

  • the pipe separator;
  • the date and time formats;
  • the yes/no format of PassedQA;
  • the null indicator;
  • the intended, more usable internal field names;
  • field types, here defaulting to nullable types.

As with pandas.read_csv, we can choose whether to prefer nullable types, but the default using tdda.serial is to do so. In this case, the date formats and null indicators would be fine anyway with the Pandas defaults, but we could instead have specified, say, European dates and ? for nulls.

This code:

from tdda.serial import load_metadata, serial_to_pandas_read_csv_args
from rich import print as rprint

md = load_metadata('machines.serial')
kwargs = serial_to_pandas_read_csv_args(md)
rprint(kwargs)

shows the parameters actually passed to pandas.read_csv:

{
    'dtype': {'ID': 'Int64', 'Name': 'string', 'TimeToManufacture': 'Int64', 'PassedQA': 'boolean'},
    'date_format': {'DateOfCommission': 'ISO8601', 'CompletionTime': 'ISO8601', 'ShipDate': 'ISO8601'},
    'parse_dates': ['DateOfCommission', 'CompletionTime', 'ShipDate'],
    'sep': '|',
    'encoding': 'UTF-8',
    'escapechar': '\\',
    'quotechar': '"',
    'doublequote': False,
    'na_values': [''],
    'keep_default_na': False,
    'names': ['ID', 'Name', 'TimeToManufacture', 'DateOfCommission', 'CompletionTime', 'ShipDate', 'PassedQA'],
    'header': 0,
    'true_values': ['yes'],
    'false_values': ['no']
}
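To make these parameters concrete, here is roughly what they ask a reader to do, spelled out by hand with Python's standard csv module. This is a sketch for illustration only, not how tdda.serial or Pandas work internally. Dates are left as strings to keep it short, and padding the short second row with a null mirrors the <NA> values in the Pandas output rather than any documented rule.

```python
import csv
import io

PSV = '''mc id|mc nm|secs|commission date|mc cp|sh dt|qa passed?
1111111|"Machine 1"|86400|2025-06-01|2025-06-07T12:34:56|2025-06-21|yes
2222222|"Machine 2"||2025-06-02|2025-06-08T12:34:57|2025-06-22
3333333|"Machine 3"|86399|2025-06-03|2025-06-09T12:34:55|2025-06-22|no
'''
NAMES = ['ID', 'Name', 'TimeToManufacture', 'DateOfCommission',
         'CompletionTime', 'ShipDate', 'PassedQA']
BOOLS = {'yes': True, 'no': False}      # the non-standard boolean values


def read_machines(text):
    """Read the pipe-separated data into a list of dictionaries."""
    reader = csv.reader(io.StringIO(text), delimiter='|', quotechar='"',
                        escapechar='\\', doublequote=False)
    next(reader)                                   # skip the one header row
    records = []
    for row in reader:
        row += [''] * (len(NAMES) - len(row))      # pad short rows
        rec = {k: (None if v == '' else v) for k, v in zip(NAMES, row)}
        for key in ('ID', 'TimeToManufacture'):
            rec[key] = None if rec[key] is None else int(rec[key])
        rec['PassedQA'] = BOOLS.get(rec['PassedQA'])   # None stays None
        records.append(rec)
    return records


for rec in read_machines(PSV):
    print(rec['ID'], rec['Name'], rec['TimeToManufacture'], rec['PassedQA'])
```

Every line of special handling in this function corresponds to one entry in the metadata, which is the argument for keeping that knowledge in a file rather than in each reader's code.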

We can do very similar things using Polars (and “soon”, other libraries). Here's a way to read the file with Polars:

from tdda.serial import csv_to_polars

df = csv_to_polars('machines.psv', 'machines.serial',
                   map_other_bools_to_string=True)
print(df)

which produces: Output from polars. There are two warnings (about polars not understanding escaping or alternate bool values), and PassedQA is read as a string, because that was specified in the parameters. There's then the data table showing the types as i64, str, i64, three datetimes (with microsecond resolution) and PassedQA as str. Nulls are shown for the second row for TimeToManufacture and PassedQA. The transformed field names are used.

This does mostly the same thing as the Pandas version, but issues two warnings. The first is because an escape character is specified, which the Polars CSV reader doesn't really understand. The second warning is because the Polars CSV reader can't handle non-standard booleans. By default, when these are specified for Polars, tdda.serial will issue a warning but still call polars.read_csv to read the file, because they might not, in fact, be used. The parameter passed in the Python code above (map_other_bools_to_string=True) tells tdda.serial to direct Polars to read this column as a string instead (as it would if we didn't specify a type). Of course, it would be possible to have the reader then go through and turn the strings into booleans after reading, but that feels like more than a metadata library should do.

The warnings helpfully tell you what to look out for as possible issues when the file is read. This is an example of a principle I'm trying to use throughout tdda.serial: when there's something in the serial metadata that a given reader might not be able to handle correctly, issue a warning, and possibly provide an option to control that behaviour.
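The warn-by-default principle can be sketched with Python's warnings module. The function to_reader_args, its supported and strict parameters, and the message text are all hypothetical; they illustrate the pattern and are not tdda.serial's interface.

```python
import warnings


def to_reader_args(metadata, supported, strict=False):
    """Keep supported settings; warn (or raise, if strict) about the rest."""
    args = {}
    for key, value in metadata.items():
        if key in supported:
            args[key] = value
        elif strict:
            raise ValueError(f'reader cannot handle {key!r}')
        else:
            warnings.warn(f'reader cannot handle {key!r}; ignoring it')
    return args


metadata = {'delimiter': '|', 'quote_char': '"', 'escape_char': '\\'}
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    args = to_reader_args(metadata, supported={'delimiter', 'quote_char'})

print(args)
print([str(w.message) for w in caught])
```

Warning rather than failing suits the case described above, where an unsupported setting may be specified in the metadata but never actually exercised by the data.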

We can do the same thing as we did for Pandas and look at the arguments generated for Polars, using the following, very similar, Python code:

from tdda.serial import (load_metadata, serial_to_polars_read_csv_args)
from rich import print as rprint

md = load_metadata('machines.serial')
kwargs = serial_to_polars_read_csv_args(md, map_other_bools_to_string=True)
rprint(kwargs)

This produces

{
    'separator': '|',
    'quote_char': '"',
    'null_values': [''],
    'encoding': 'UTF-8',
    'schema': {
        'ID': Int64,
        'Name': String,
        'TimeToManufacture': Int64,
        'DateOfCommission': Datetime,
        'CompletionTime': Datetime,
        'ShipDate': Datetime,
        'PassedQA': String
    },
    'new_columns': [
        'ID',
        'Name',
        'TimeToManufacture',
        'DateOfCommission',
        'CompletionTime',
        'ShipDate',
        'PassedQA'
    ]
}

The only subtlety here is that the types in Schema are actual polars types (pl.Int64 etc.) rather than strings, hence their not being quoted. (They're not prefixed because repr(pl.Int64) is the string "Int64", which prints as Int64.) The library can also write a tdda.serial file containing the Polars arguments explicitly. It looks like this:

{
    "format": "http://tdda.info/ns/tdda.serial",
    "writer": "tdda.serial-2.2.15",
    "polars.read_csv": {
        "separator": "|",
        "quote_char": "\"",
        "null_values": [
            ""
        ],
        "encoding": "UTF-8",
        "schema": {
            "ID": "Int64",
            "Name": "String",
            "TimeToManufacture": "Int64",
            "DateOfCommission": "Datetime",
            "CompletionTime": "Datetime",
            "ShipDate": "Datetime",
            "PassedQA": "String"
        },
        "new_columns": [
            "ID",
            "Name",
            "TimeToManufacture",
            "DateOfCommission",
            "CompletionTime",
            "ShipDate",
            "PassedQA"
        ]
    }
}

Here, because we need to serialize the tdda.serial file as JSON, the polars types are mapped to their string names. The tdda library takes care of the conversion in both directions.

A single .serial file can contain multiple flavours of metadata—tdda.serial, polars.read_csv, pandas.read_csv etc. When it does, a call to load_metadata can specify a preferred flavour, or let the library choose. My hope, however, is that in most cases the tdda.serial section will contain enough information to work as well as a library-specific specification.
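Choosing between flavours might look something like the sketch below. The signature of load_metadata is not shown here, so this does not attempt to reproduce it: choose_flavour and its fallback order are assumptions made for illustration.

```python
def choose_flavour(doc, preferred=None,
                   fallback_order=('tdda.serial', 'pandas.read_csv',
                                   'polars.read_csv')):
    """Return (flavour, section) for the preferred flavour, else a fallback."""
    candidates = ([preferred] if preferred else []) + list(fallback_order)
    for flavour in candidates:
        if flavour in doc:
            return flavour, doc[flavour]
    raise KeyError('no recognised metadata flavour found')


doc = {
    'format': 'http://tdda.info/ns/tdda.serial',
    'tdda.serial': {'delimiter': '|'},
    'polars.read_csv': {'separator': '|'},
}
print(choose_flavour(doc, preferred='polars.read_csv')[0])  # polars.read_csv
print(choose_flavour(doc)[0])                               # tdda.serial
```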

Goals for tdda.serial

Image showing a circle with tdda.serial in the middle and arrows leading in and out for three formats (CSVW, tdda.serial, and Frictionless), five libraries (DuckDB, Python csv, Pandas, Polars, and Apache Arrow) and Excel. Pandas, CSVW, tdda.serial and Polars are bold for both input and output.

When I went to write down the goals for tdda.serial, I was surprised at how long the list was. Not all of this is implemented but here is the current state of the goals for tdda.serial. (The image above shows the vision for it, with the bold parts mostly implemented, and the rest currently only planned.)

  • Describe Flat File Formats. Allow accurate representation, full or partial, of flat-file formats used (or potentially used) by one or more concrete flat files.
    • It primarily targets comma-separated values (.csv) and related formats (tab-separated, pipe-separated etc.), but also potentially other tabular data. It could, for example, be used to describe things like date formats and numeric subtypes for tabular data stored in JSON or JSON Lines.
    • Full or partial is important. When reading data, it is often convenient only to specify things that are causing problems. On write, fuller specifications are, of course, desirable.
  • Read Flat Files. Assist with reading flat files correctly, based on metadata in .serial files and other formats (like CSVW), primarily using data in the "tdda.serial" format.
    • Convert metadata currently stored as tdda.serial to dictionaries of arguments for other libraries that work with CSVs.
    • Provide an API to get such libraries to read flat-file data correctly, guided by the metadata.
    • Generate code to get such libraries to read flat-file data correctly, guided by the metadata. Assist with writing flat files in documented formats.
    • Interoperate, where possible, with other metadata formats like CSVW and Frictionless.
  • Generate tdda.serial Metadata Files. Assist with generating metadata describing the format of CSV files based on the write arguments provided to the writing software.
  • Write Flat Files. Assist with getting libraries to write CSV files using a format specified in a tdda.serial file.
    • This provides a second way of increasing interoperability: we can help readers to read from a specific format, and writers to write to that same format.
  • Assist/Support other Software Reading, Writing, and otherwise handling Flat Files.
    • DataFrame Libraries
      • Pandas
      • Polars
      • Apache Arrow
    • Databases
      • DuckDB
      • SQLite
      • Postgres
      • MySQL
    • Miscellaneous
      • Python csv
      • tdda
  • Support Library-specific Read/Write Metadata. Provide a mechanism for documenting library-specific read/write parameters for CSV files explicitly:
    • For storing the library-specific write parameters used with pandas.to_csv, polars.write_csv in .serial files (and the ability to use such parameters)
    • For storing the library-specific read parameters required to read a flat file with high fidelity using, e.g. pandas.read_csv , polars.read_csv etc.
  • Assist with Format Choice. Provide a mechanism for helping to choose a good CSV format for a concrete dataset to be written, e.g. choosing null indicators that are not likely to be confused with serialized non-null values in the dataset.
  • SERDE Verification. Provide mechanisms for checking whether a dataset can be round-tripped successfully to a flat file (i.e. that the same library, at least, can write data to a flat file, read it back, and recover identical, equivalent, or similar data).3
  • CLI Tools. Provide an associated command-line tool, tdda diff, and equivalent API functions, to check whether two datasets are equivalent.

    • In the case of the command-line tool this is two datasets on disk (flat files, parquet files etc.). It might also be possible to compare two database tables, in the same or different RDBMS instances, or data in a database table and in a file on disk, though this is not yet implemented. (The next post will discuss tdda diff further.)
    • In the case of the API, this can also include in-memory data structures such as data frames.
  • Provide Metadata Format Conversions. Provide mechanisms for converting between different library-specific flat-file parameters and tdda's tdda.serial format, as well as between the tdda.serial format, csvw, and (perhaps) frictionless.

  • Generate Validation Statistics and Validate using them. (Potentially) write additional data for a concrete dataset that can be used for further validation that it has been read correctly, e.g. summary statistics, checksums etc.
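
The SERDE Verification and Format Choice goals in the list above can be illustrated with a small round-trip check. This is only a sketch using plain pandas, not the tdda library's own mechanism; the helper name round_trips and the sample data are invented for illustration. It shows how a string value that collides with a default null indicator ("NA") is silently lost on a default round trip, and survives when the reader is told not to treat it as null.

```python
import io

import pandas as pd


def round_trips(df, **read_kwargs):
    """Write df to CSV in memory, read it back, and compare.

    Hypothetical helper: returns (identical?, recovered frame).
    """
    buf = io.StringIO()
    df.to_csv(buf, index=False)
    buf.seek(0)
    recovered = pd.read_csv(buf, **read_kwargs)
    return df.equals(recovered), recovered


# "NA" here is a genuine string value (e.g. a country code), not a null.
df = pd.DataFrame({"code": ["NA", "UK", "FR"], "n": [1, 2, 3]})

# Default read: "NA" is one of pandas's default null markers.
ok_default, back_default = round_trips(df)

# Stricter read: no default null markers at all.
ok_strict, back_strict = round_trips(df, keep_default_na=False, na_values=[])

print("default read round-trips:", ok_default)
print("nulls introduced by default read:", int(back_default["code"].isna().sum()))
print("strict read recovers codes:", back_strict["code"].tolist())
```

Recording the read arguments that make the round trip succeed is exactly the kind of information a metadata file can carry alongside the data.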

Discussion

The usual observation when proposing something new like this is that the last thing the world needs is another “standard”. As Randall Munroe puts it (https://imgs.xkcd.com/927):

HOW STANDARDS PROLIFERATE: (See: A/C chargers, character encodings, instant messaging, etc.) Cartoon. Panel 1: SITUATION: There are 14 competing standards. Panel 2: (Conversation between two people.) 14? Ridiculous! We need to develop one universal standard that covers everyone's use cases. (Yeah.) Panel 3 (SOON:): SITUATION: There are 15 competing standards.

In this case, however, I don't think there are all that many recognized ways of describing flat-file formats. I was involved in one (the .fdd flat-file description data format) while at Quadstone, and I currently use the XMD format above at Stochastic Solutions, but pretty much no one else does. While working with a friend, Neil Skilling, he ran across the CSVW standard, developed under the auspices of W3C, and that led to my finding the Python frictionless project. At first I thought one of those might be the solution I was looking for, but in fact they have goals and designs that are different enough that they don't quite fulfill the most important goals for tdda.serial, as impressive as both projects are. Reluctantly, therefore, I began working on tdda.serial, which aims to interoperate with and support CSVW (and, to some extent, frictionless), but also to handle other cases.

The biggest single difference between the focus of tdda.serial and that of CSVW is that tdda.serial is primarily concerned with documenting a format that might be used by many flat files (different concrete datasets sharing the same structure and formatting), whereas CSVW is primarily concerned with documenting either a single specific CSV file or a specific collection of CSV files, usually each having a different structure. This seems like a rather subtle difference, but in fact turns out to be quite consequential.

Here's the first example CSVW file from csvw.org:

{
  "@context": ["http://www.w3.org/ns/csvw", {"@language": "en"}],
  "tables": [{
    "url": "http://opendata.leeds.gov.uk/downloads/gritting/grit_bins.csv",
    "tableSchema": {
      "columns": [
      {
        "name": "location",
        "datatype": "integer"
      },
      {
        "name": "easting",
        "datatype": "decimal",
        "propertyUrl": "http://data.ordnancesurvey.co.uk/ontology/spatialrelations/easting"
      },
      {
        "name": "northing",
        "datatype": "decimal",
        "propertyUrl": "http://data.ordnancesurvey.co.uk/ontology/spatialrelations/northing"
      }
      ],
      "aboutUrl": "#{location}"
    }
  }],
  "dialect": {
    "header": false
  }
}

Notice that the CSVW file caters for multiple CSV files (a list of tables in the tables element), and that the location of the table is provided as a URL (which is a required element in CSVW). In the context of CSV on the web, this makes complete sense. It's specified as being a URL, but can be a file: URL, or a simple path. One convention, for a CSVW file documenting a single dataset, seems to be that the metadata for grit_bins.csv is stored in grit_bins-metadata.json in the same directory as the CSV file itself (locally, or on the web).
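
To make the structure of such a file concrete, here is a minimal sketch of pulling the table URL, column names, and header setting out of CSVW metadata and turning them into keyword arguments for pandas.read_csv. It uses only the standard json module. The function name csvw_to_read_csv_kwargs and the datatype mapping are assumptions made for illustration, not part of CSVW or of the tdda library, and the JSON is an abbreviated copy of the example above.

```python
import json

# Abbreviated version of the example above (propertyUrl and aboutUrl omitted).
CSVW_TEXT = """
{
  "@context": ["http://www.w3.org/ns/csvw", {"@language": "en"}],
  "tables": [{
    "url": "http://opendata.leeds.gov.uk/downloads/gritting/grit_bins.csv",
    "tableSchema": {
      "columns": [
        {"name": "location", "datatype": "integer"},
        {"name": "easting", "datatype": "decimal"},
        {"name": "northing", "datatype": "decimal"}
      ]
    }
  }],
  "dialect": {"header": false}
}
"""

# Assumed, deliberately tiny mapping from CSVW datatypes to pandas dtypes.
CSVW_TO_PANDAS_DTYPE = {"integer": "Int64", "decimal": "float64"}


def csvw_to_read_csv_kwargs(metadata, table_index=0):
    """Return (url, kwargs) for one table described in CSVW metadata."""
    table = metadata["tables"][table_index]
    columns = table["tableSchema"]["columns"]
    # A table-level dialect, if present, takes precedence over the top-level one.
    dialect = table.get("dialect", metadata.get("dialect", {}))
    kwargs = {
        "names": [col["name"] for col in columns],
        # CSVW assumes a header row unless the dialect says otherwise.
        "header": 0 if dialect.get("header", True) else None,
        "dtype": {
            col["name"]: CSVW_TO_PANDAS_DTYPE[col["datatype"]]
            for col in columns
            if col.get("datatype") in CSVW_TO_PANDAS_DTYPE
        },
    }
    return table["url"], kwargs


url, kwargs = csvw_to_read_csv_kwargs(json.loads(CSVW_TEXT))
print(url)
print(kwargs)
```

Note that the URL comes out of the metadata itself: the metadata file points at the data, rather than the other way round.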

What is significant, however, is that this establishes either a one-to-one relationship between CSV files and CSVW metadata files or, if the CSVW file contains metadata about several files, a one-to-one relationship between CSV files and table entries in a CSVW file. Here, for example, is Example 5 from the CSVW Primer:

{
  "@context": "http://www.w3.org/ns/csvw",
  "tables": [{
    "url": "countries.csv"
  }, {
    "url": "country-groups.csv"
  }, {
    "url": "unemployment.csv"
  }]
}

The metadata “knows” the data file (or data files) that it describes. In contrast, the main concern of tdda.serial is to describe a format and structure that might well be used for many specific (“concrete”) flat files. The relationship is almost reversed as shown here:

Left: The CSVW file above, containing three CSV URLs, having arrows from each filename (URL) to that CSV file, as a named icon. Right: Three CSV files named machines1.csv, machines2.csv, and machines3.csv, each with arrows to a single tdda.serial file (the one shown above).

Even though the URL (url) is a mandatory parameter in CSVW, there is nothing to prevent us from taking a CSVW file (particularly one describing a single table) and using its metadata to define a format to be used with other flat files. In doing so, however, we would clearly be going against the grain of the design of CSVW. As an example of how it then does not quite fit, sometimes we want the metadata to describe exactly the fields in the data, and other times we want it to be a partial specification. In the XMD file, there are explicit parameters to say whether or not extra fields are allowed, and whether all fields are required. In the case of the tdda.serial file, we use a list of fields when we are describing all the fields allowed and required in a flat file, and a dictionary when we are providing information only on a subset, not necessarily in order.4 This sort of flexibility is harder in CSVW, which always uses a list to specify the fields. I could propose and use extensions, or try to get extensions added to the standard, but the former seems undesirable, and the latter hard and unlikely. (It does not look as if there have been any revisions to CSVW since 2022.) There are, in fact, many details of CSVW that are problematical for even the first two libraries I've looked at (Pandas and Polars), so unfortunately I think something different is needed.
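
The list-versus-dictionary distinction described above can be sketched as follows. This is a hypothetical illustration of the idea only: the function field_problems, the shape of the specifications, and the decision to treat a field named in a partial specification as required are all assumptions, not the tdda.serial format or the tdda API.

```python
def field_problems(spec, actual_columns):
    """Check column names against a full (list) or partial (dict) spec.

    Hypothetical semantics:
      - list: the data must contain exactly these fields, in this order;
      - dict: only the named fields are checked; order and extra
        fields are ignored.
    """
    actual = list(actual_columns)
    problems = []
    if isinstance(spec, dict):
        for name in spec:
            if name not in actual:
                problems.append(f"missing field: {name}")
    else:
        expected = list(spec)
        if actual != expected:
            problems.append(f"expected exactly {expected}, got {actual}")
    return problems


# Full specification: any reordering or extra field is reported.
print(field_problems(["id", "name"], ["name", "id"]))

# Partial specification: extra fields and ordering are ignored.
print(field_problems({"name": "string"}, ["id", "name", "notes"]))
```

Using the container type to carry the full-or-partial distinction avoids separate flags of the kind the XMD format uses.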

Library-specific Support in tdda.serial

Another goal for tdda.serial is that it should be useful even for people who are only using a single library—e.g. Pandas. In such cases, there is typically a function or method for writing CSV files (pandas.DataFrame.to_csv), and another for reading them (pandas.read_csv). Both typically have many optional arguments, and in keeping with Postel's Law (the Robustness Principle), they typically have more flexibility in read formats than in write formats. In the case of Pandas, the read function's signature is:

pandas.read_csv(
    filepath_or_buffer, *, sep=<no_default>,
    delimiter=None, header='infer', names=<no_default>, index_col=None,
    usecols=None, dtype=None, engine=None, converters=None,
    true_values=None, false_values=None, skipinitialspace=False,
    skiprows=None, skipfooter=0, nrows=None, na_values=None,
    keep_default_na=True, na_filter=True, verbose=<no_default>,
    skip_blank_lines=True, parse_dates=None,
    infer_datetime_format=<no_default>, keep_date_col=<no_default>,
    date_parser=<no_default>, date_format=None, dayfirst=False,
    cache_dates=True, iterator=False, chunksize=None,
    compression='infer', thousands=None, decimal='.',
    lineterminator=None, quotechar='"', quoting=0, doublequote=True,
    escapechar=None, comment=None, encoding=None,
    encoding_errors='strict', dialect=None, on_bad_lines='error',
    delim_whitespace=<no_default>, low_memory=True, memory_map=False,
    float_precision=None, storage_options=None,
    dtype_backend=<no_default>
)

(49 parameters), while the write method's signature is:

DataFrame.to_csv(
    path_or_buf=None, *, sep=',', na_rep='',
    float_format=None, columns=None, header=True, index=True,
    index_label=None, mode='w', encoding=None, compression='infer',
    quoting=None, quotechar='"', lineterminator=None, chunksize=None,
    date_format=None, doublequote=True, escapechar=None, decimal='.',
    errors='strict', storage_options=None
)

(22 parameters).
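
The parameter counts quoted above depend on the installed pandas version. A minimal sketch for checking them locally uses the standard inspect module; the helper name count_parameters is invented for illustration. Counting on the unbound method pandas.DataFrame.to_csv includes self, which may account for a difference of one from a count of the arguments listed in the signature.

```python
import inspect

import pandas as pd


def count_parameters(func):
    """Number of parameters in a callable's signature."""
    return len(inspect.signature(func).parameters)


n_read = count_parameters(pd.read_csv)
# Unbound method: the count includes `self`.
n_write = count_parameters(pd.DataFrame.to_csv)

print(f"pandas {pd.__version__}: read_csv has {n_read} parameters, "
      f"DataFrame.to_csv has {n_write} (including self)")
```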

The tdda library's command-line tools allow a tdda.serial specification to be converted to parameters for pandas.read_csv, returning them as a dictionary that can be passed in using **kwargs. It can also generate Python code to do the read using pandas.read_csv or directly perform the read, saving the result to parquet.
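
As a sketch of what passing such a dictionary looks like, the example below unpacks a handwritten dictionary of arguments into pandas.read_csv. The dictionary is an assumption standing in for whatever a conversion tool might return; it is not actual output from the tdda library, and the sample data is invented.

```python
import io

import pandas as pd

# Handwritten stand-in for a dictionary of read arguments derived
# from metadata (hypothetical values, not produced by tdda).
read_kwargs = {
    "sep": ";",        # semicolon-separated fields
    "decimal": ",",    # comma as the decimal mark
    "dtype": {"id": "int64"},
}

text = "id;price\n1;3,50\n2;4,25\n"

# The dictionary is unpacked into keyword arguments.
df = pd.read_csv(io.StringIO(text), **read_kwargs)
print(df)
```

Keeping the arguments in a dictionary means the same description can be stored, inspected, and reused for every file that shares the format.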

Similarly, the library can take a set of arguments for DataFrame.to_csv and create a tdda.serial file describing the format used (or write the data and metadata together).

For a user working with a single library, however, converting to and from tdda.serial's metadata description might be unnecessarily cumbersome and may work imperfectly. This is because different libraries represent data differently, and are based on slightly different conceptions of CSV files. While I am going to make some effort to make tdda.serial universal, it is likely that there will always be some cases in which there is a loss of fidelity moving between any specific library's arguments and the .serial representation.

For these reasons, the tdda library also supports directly writing arguments for a given library. That is why the tdda.serial metadata description is one level down inside the tdda.serial file, under a tdda.serial key. It is also possible to have sections for pandas.read_csv, polars.read_csv with exactly the arguments they need.
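
One way to picture this layout is a lookup that prefers a library-specific section when one exists and falls back to the generic description otherwise. The sketch below is hypothetical throughout: only the section names tdda.serial and pandas.read_csv come from the text above, while the keys inside the generic section, the mapping, and the function read_csv_arguments are invented for illustration and do not reflect the real file format or API.

```python
# Hypothetical, deliberately tiny mapping from generic keys to pandas ones.
GENERIC_TO_PANDAS = {"delimiter": "sep", "encoding": "encoding"}


def read_csv_arguments(serial, library="pandas.read_csv"):
    """Pick read arguments from a parsed metadata dictionary.

    Prefers the library-specific section (exact arguments, no loss of
    fidelity); otherwise translates the generic section.
    """
    if library in serial:
        return dict(serial[library])
    generic = serial.get("tdda.serial", {})
    return {
        GENERIC_TO_PANDAS[key]: value
        for key, value in generic.items()
        if key in GENERIC_TO_PANDAS
    }


serial = {
    "tdda.serial": {"delimiter": "|", "encoding": "utf-8"},
    "pandas.read_csv": {"sep": "|", "encoding": "utf-8", "keep_default_na": False},
}

print(read_csv_arguments(serial))
print(read_csv_arguments({"tdda.serial": {"delimiter": "|"}}))
```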


  1. The functionality used in this post is not in the release version of the tdda library, but is there on a branch called detectreport, so can be accessed if anyone is particularly keen.

  2. In fact, in writing this post, I updated the previous one to use a slightly more sensible example than previously; this is the new, slightly more useful example.

  3. CSV is not a very suitable format for perfect round-tripping of data for reasons including numeric rounding, multiple types for the same data, and equivalent representations such as string and categoricals. Even using a typed format such as parquet, some of these details may change on round-tripping and most software needs a library-specific format in order to achieve perfect fidelity when serializing and deserializing data. 

  4. This precise mechanism may change, but it is important for tdda.serial's purpose that it supports both full and partial field schema specification.


TDDA and Quality for LLMs

Posted on Mon 23 December 2024 in misc

It is December 2024 as I write, and large language models (LLMs) are having an extended moment as I have been writing a book on test-driven data analysis. Several people have suggested that I should write about LLMs or artificial intelligence (AI), a term that for many people now means either LLMs or LLMs and the other forms of generative AI.

Training Inference

Size Training Data

Inputs

Goal

First do no harm.

Strong AI.

Beliefs. Hallucinations.

Stochastic hypothesis generators.

Rhydwaith

  1. LLMs are neural networks that (loosely) predict the next word.*
  2. Given some text, they predict the next word
  3. You generate sentences by appending each predicted word to the input and iterating.

Mary had a -> little
Mary had a little -> lamb,
Mary had a little lamb, -> its
Mary had a little lamb, its -> fleece

or

Mary had a -> seizure
Mary had a seizure -> last
Mary had a seizure last -> night
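
The predict-append-repeat loop described above can be sketched with a toy stand-in for the model. Here the "model" is just a lookup table keyed on the words so far, which is an assumption made purely for illustration: a real LLM computes probabilities over possible next tokens with a neural network rather than looking anything up.

```python
# Toy "model": maps the words so far to a single next word.
TOY_MODEL = {
    ("Mary", "had", "a"): "little",
    ("Mary", "had", "a", "little"): "lamb,",
    ("Mary", "had", "a", "little", "lamb,"): "its",
    ("Mary", "had", "a", "little", "lamb,", "its"): "fleece",
}


def generate(prompt, model, max_words=10):
    """Repeatedly predict the next word and append it to the input."""
    words = prompt.split()
    for _ in range(max_words):
        next_word = model.get(tuple(words))
        if next_word is None:  # the toy model has no prediction
            break
        words.append(next_word)
    return " ".join(words)


print(generate("Mary had a", TOY_MODEL))
```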

  1. LLMs are trained on unimaginably large corpuses of data, mainly from the web.

  2. LLMs have trillions of parameters—knobs that can be set to different values

  3. With any given parameter settings, the LLMs will predict the next word
  4. Some knob settings match the next-word associations better than others
  5. Training an LLM consists of optimizing the knob settings
  6. (Most of) the parameters (knobs) are called “weights”.
  7. During training, the current weights are used to predict the next word

    • When it is “wrong” (differs from the input), the weights are adjusted
    • Even when it is “right”, the weights are usually adjusted
    • The raw prediction is not a single word, but probabilities for possible words
    • There is always an error, which can always be reduced.
    • The adjustments are calculated to try to reduce the errors over time.
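
The points above about weights, probabilities, and error can be sketched with a drastically simplified training loop. Everything here is an assumption for illustration: three "weights" (one score per candidate word for a single fixed context), a softmax to turn scores into probabilities, cross-entropy as the error, and plain gradient descent. Real LLM training follows the same broad idea at vastly larger scale and with a far more elaborate network.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train_step(weights, target, learning_rate=0.5):
    """One adjustment: returns (new weights, error before the adjustment)."""
    probs = softmax(weights)
    error = -math.log(probs[target])  # cross-entropy for the true word
    # Gradient of the error with respect to each score is (prob - 1)
    # for the true word and (prob - 0) for every other word.
    new_weights = [
        w - learning_rate * (p - (1.0 if i == target else 0.0))
        for i, (w, p) in enumerate(zip(weights, probs))
    ]
    return new_weights, error


vocabulary = ["little", "seizure", "lamb"]
weights = [0.0, 0.0, 0.0]            # start with no preference
target = vocabulary.index("little")  # the word that actually came next

errors = []
for _ in range(20):
    weights, error = train_step(weights, target)
    errors.append(error)

print("first error:", round(errors[0], 4))
print("last error: ", round(errors[-1], 4))
```

Even once "little" is the most probable word, its probability is below 1, so the error stays above zero and each step still adjusts the weights.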