Test-Driven Data Analysis

TDDA Book Online Serialization

Posted on Wed 27 May 2026 in TDDA • Tagged with tdda, book

The cover of the book Test-Driven Data Analysis by Nicholas J. Radcliffe. It is published by Chapman and Hall, part of CRC Press, from Taylor & Francis Group, and is part of the DATA SCIENCE SERIES. The cover is black with mostly white text and a white graphic. The graphic is a 3-row by 4-column grid of squares. Each square contains a number of dots laid out on a regular 32x32 grid. The top-left square has 1024 dots (“full”) and working along each row in turn, the number of dots roughly halves each time, apparently at random (and, actually, pseudo-randomly). The last row’s boxes have six, two, two, and one dot.

As announced a few days ago, my book, Test-Driven Data Analysis, is now available for sale from all good booksellers and all sellers of good books, around the world.

The book is aimed at analysts, data scientists, engineers, researchers and anyone else interested in making analytical processes more reliable, testable, and reproducible.

My main goal in writing it has always been to encourage wider adoption of the ideas. With that in mind, I am delighted to be able to announce that the full content of the book will be available online.

Channelling my inner Charles Dickens, I am releasing one chapter per week. All the auxiliary material is already available, together with Chapter 1, at the TDDA Book's site.

A new chapter will appear each week until mid-September 2026. You can sign up to get notifications as chapters are released using this link.

If you would like a physical copy, the publisher is offering a 20% discount with code 26SMA1 (until 30th June 2026) if you order directly from its site.

The book is structured around the analytical lifecycle, its common failure modes, and the remedies for them:

The main part of the diagram consists of six circles from left to right. The first five circles have failure-mode text under them and an error class below that.
1. CHOOSE APPROACH. Failure: 'Fail to understand data, problem domain, or methods'. ERROR OF INTERPRETATION (error of formulation). Ch 13.
2. DEVELOP ANALYTICAL PROCESS. Failure: 'Mistakes during coding'. ERROR OF IMPLEMENTATION (bug). Ch 9-12.
3. RUN ANALYTICAL PROCESS. Failure: 'Use the software incorrectly'. ERROR OF PROCESS (operator error). Ch 16.
4. PRODUCE ANALYTICAL RESULTS. Failure: 'Mismatch between development data or assumptions and deployment data'. ERROR OF APPLICABILITY (category error). Ch 1-7 & 17.
5. INTERPRET ANALYTICAL RESULTS. Failure: 'Misinterpret the results'. ERROR OF INTERPRETATION (communication error). Ch 14 & 15.
6. ‘First, Do No Harm’. ERROR OF JUDGEMENT. Ch 17.
Arrows lead to FAILURE and SUCCESS boxes. Remedies and book chapters sit underneath the main diagram.

The analytical lifecycle, common failure modes, and the remedies discussed in the book.

In addition to the book itself, you will find at the site links to:

All the book's examples, together with TDDA tests for most of them
All the book's Figures (in colour and B&W)
All the book's checklists (in various forms)
All the Profiles in the book (and quite a lot more)
All the Data Dictionaries mentioned in the book
The book's Glossary
The book's Errata (not, unfortunately, empty)

Top line: three machines illustrating
1. constraint discovery and data validation: an input hopper takes training data and produces constraints, or training data + constraints to produce data validations at the output chute;
2. Rexpy, which takes strings in its input hopper and produces regular expressions at the output chute;
3. TDDA gentest, which takes code in the input hopper and produces a Python reference-test script as output.
Bottom line:
4. tdda diff, which compares data in flat files and parquet files to detect (semantic) differences;
5. tdda.serial, which is a format for describing flat-file formats and a suite of tools for working with tdda.serial, CSVW, and Frictionless;
6. tdda.referencetest, for semantic testing of complex analytical results.

Some of the principal TDDA tools and capabilities covered in the book.

Data Validation with tdda Constraints: A Guide for LLMs and Coding Agents

Posted on Fri 22 May 2026 in TDDA • Tagged with data validation, constraints, LLMs, coding bots, tdda, pandas, parquet

Raw markdown for direct use in LLM context windows.

This post is addressed primarily to LLMs and coding agents. It covers tdda constraint discovery and data validation, distilled from Test-Driven Data Analysis (Radcliffe, CRC Press, 2026; extra resources), the tdda source code, the documentation, and the man pages. If you are a human who wants a dense, no-nonsense introduction to data validation with tdda, you are welcome too.

If you are a coding agent, the fast path is:

1. Profile the data (e.g. tdda profiles); discuss findings with the human.
2. Run tdda discover -x data.csv constraints.tdda to generate constraints from known-good data.
3. Read the generated .tdda file; edit it by hand. Discovery generates constraints that fit the training data: some will be too tight (stale date ranges, over-fitted min/max), some too loose (weak regex patterns), and some redundant or overlapping. Only you, in consultation with your tame human, can fix this. If you skip it, you will get both false positives and, worse, false negatives.
4. Run tdda detect data.csv constraints.tdda bad.csv --key id_field to find and write failing records, or tdda verify data.csv constraints.tdda for a quick field-level summary without writing individual failures.
Never skip step 3. If you skip it, you will miss bad data. See The Workflow.

Data Validation with Constraints in TDDA

The tdda constraint discovery tool examines a dataset, finds patterns that hold across it, and encodes them as a JSON file (.tdda). You then validate new data against that file. Think of it as unit tests for data: you capture what "good" looks like once, and thereafter test that new data matches.

The core tool is the tdda command-line utility, with three subcommands:

tdda discover—generate constraints from training data
tdda verify—check whether data satisfies constraints (summary)
tdda detect—find the individual records that fail constraints

Install:

pip install tdda
tdda examples  # download example data and code

The Workflow

There are two phases: development and deployment. Using only deployment (skipping development) is the single most common mistake and is covered separately below.

Development phase (training data, then holdout data)

Step 0: Decide the operating point. Before doing anything else, discuss the false-positive/false-negative trade-off with a suitable human. In a safety-critical pipeline you may want to work hard to avoid false negatives even at the cost of more alerts; in a high-volume low-stakes pipeline the balance may be very different. This decision should drive every subsequent adaptation choice—it is not something to assume a default for.

Step 1: Profile and discuss. Before discovering constraints, produce a data profile—frequency distributions, null counts, summary statistics, outlier analysis. Use whatever tools you have (ydata-profiling, custom pandas code, etc.); the profiles at book.tdda.info/profiles show the sort of thing you need. Ideally, discuss the profile and data with a suitable human expert. The profile helps you understand what "valid" looks like before formalizing it. Profiling is not part of tdda.

Step 2: Discover. Run tdda discover on known-good training data to generate a .tdda constraints file automatically. The -x flag enables regex generation for string fields; -G suppresses grouping (usually produces simpler patterns). Both together: -xG.

tdda discover -xG data.csv constraints.tdda

Step 3: Read. Read the generated .tdda file. This is a named step, not a preamble to editing. Understand what was discovered before touching it.

Step 4: Adapt. Edit the constraints by hand. The vocabulary of adaptation is: Tighten / Relax / Add / Delete / Choose Among. This step is not optional. Auto-generated constraints are always a first draft—they will have stale date ranges, unnecessary no_duplicates constraints, and over-fitted or under-specified regex patterns. See Hand-Editing the .tdda File.

Step 5: Validate against holdout. Apply the adapted constraints to holdout data—data not used in discovery. Adapt further as needed. This is where you discover that your constraints are too tight (false positives on valid data) or too loose (missing real problems).
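
One way to make sure holdout data really is "not used in discovery" is to split the known-good data before running tdda discover. The sketch below uses only the standard library; the function name split_rows, the 80/20 split, and the fixed seed are illustrative assumptions, not part of tdda.

```python
import csv
import io
import random


def split_rows(rows, holdout_fraction=0.2, seed=42):
    """Shuffle rows reproducibly and split them into (training, holdout)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_holdout = max(1, int(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]


# Small in-memory CSV standing in for a known-good data file.
lines = "".join(f"{i},{i * 10}\n" for i in range(1, 11))
source = io.StringIO("id,amount\n" + lines)

training, holdout = split_rows(list(csv.DictReader(source)))
print(len(training), len(holdout))
```

Each part would then be written to its own file, with discovery run on the training part only and the adapted constraints verified against the holdout part. For data with a strong time dimension, a chronological split may be a more realistic test than a random one.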

Deployment phase (operational data)

Step 6: Verify. Run tdda verify on each incoming batch of operational data. Fast and terse—reports which constraints fail and for how many records.

Step 7: Monitor. Classify failures:

True positives—bad data caught correctly. Act: reject the batch, fix the root cause, or improve normalisation, cleansing, or the upstream pipeline.
False positives—valid data flagged wrongly. Relax or remove the offending constraints.
False negatives—bad data that passed through. Tighten or add constraints.

Step 8: Refine. Adapt the constraints based on what monitoring reveals (same vocabulary as step 4). Loop back to step 7. Data changes, pipelines change, and edge cases surface over time — constraints must evolve with them. Alert fatigue is a real risk: too many false positives desensitise reviewers. Filter recurring known-benign failures, but don't suppress so aggressively that real problems hide.

What happens when you skip the development phase

The reduced process skips steps 3–5: you discover, then go straight to deployment without reading, adapting, or validating against holdout.

The result:

Many more false negatives. This is the dominant failure mode. Constraints were generated mechanically from imperfect training data and never tightened. Bad data that wasn't in the training set passes through undetected. You are systematically blind in the more dangerous direction.
More false positives. Training data rarely covers the full breadth of valid values, so valid operational data trips constraints that were set too tight against the training sample.

False positives are annoying. False negatives are bad data propagating downstream. The reduced process makes both worse, but the false-negative problem is structurally larger because nothing in the reduced process ever tightens the constraints.

A Worked Example: Elements 92 to 118

The periodic table makes a good illustration because everyone knows the domain. The tdda examples command installs sample datasets; one of them is elements92.csv — the first 92 elements.

Run discovery on the 92-element training set:

tdda discover -xG elements92.csv elements92.tdda

The -x flag enables regex generation for string fields and -G suppresses grouping, which keeps the generated patterns simple. Three fields from the result:

"Z": {
    "type": "int", "min": 1, "max": 92,
    "sign": "positive", "max_nulls": 0, "no_duplicates": true
},
"ChemicalSeries": {
    "type": "string", "min_length": 7, "max_length": 20,
    "max_nulls": 0,
    "allowed_values": ["Actinoid", "Alkali metal", "Alkaline earth metal",
                       "Halogen", "Lanthanoid", "Metalloid", "Noble gas",
                       "Nonmetal", "Poor metal", "Transition metal"],
    "rex": ["^[A-Z][a-z]+$", "^[A-Z][a-z]+ [a-z]{3,5}$",
            "^Alkaline earth metal$"]
},
"AtomicWeight": {
    "type": "real", "min": 1.007947, "max": 238.028913,
    "sign": "positive", "max_nulls": 0
}

Now verify against elements118.csv, which holds all 118 elements, including the heavier, mostly synthetic ones beyond element 92:

tdda verify -f elements118.csv elements92.tdda

Z:             1 failure   max ✗
Symbol:        2 failures  max_length ✗  rex ✗
AtomicWeight:  2 failures  max ✗  max_nulls ✗
...
Failing Fields: 11/16   Failing Constraints: 17/80

Seventeen constraints fail — all training-data artefacts. Here is what to do with the three fields shown:

Z. The atomic number running to 92 is an artefact of the training set. Remove max entirely, or set it to something like 200 if you want a sanity-check upper bound. Keep sign, min, max_nulls, and no_duplicates — those are domain facts.

ChemicalSeries. This field has both allowed_values and rex. The set of chemical series is closed — new elements join existing series — so allowed_values is exactly right. Remove rex, min_length, and max_length: they add nothing when allowed_values is present and will only generate false positives if a value is ever formatted slightly differently.

AtomicWeight. The max of 238 is uranium's weight — again a training artefact. Oganesson (element 118) weighs ~294. Remove max or set a generous upper bound. Keep sign: that is a genuine domain constraint (atomic weights are positive) and acts as a safeguard even if min/max are later adjusted.

After adapting, verify again. The 17 failures should drop to zero on the holdout data — and that result is meaningful because you reviewed each change rather than just deleting constraints to make the number go down.
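
Because a .tdda file is plain JSON, the three adaptations above can also be scripted, which leaves a reviewable record of what was changed. This is a minimal sketch, assuming the "fields" layout shown in the excerpt; the dictionary below is an abbreviated stand-in for the real file (only two of the ten allowed values are listed), and the to_delete mapping is a hypothetical helper, not a tdda feature.

```python
import json

# Abbreviated stand-in for the discovered constraints file.
discovered = {
    "fields": {
        "Z": {"type": "int", "min": 1, "max": 92, "sign": "positive",
              "max_nulls": 0, "no_duplicates": True},
        "ChemicalSeries": {"type": "string", "min_length": 7,
                           "max_length": 20, "max_nulls": 0,
                           "allowed_values": ["Actinoid", "Alkali metal"],
                           "rex": ["^[A-Z][a-z]+$"]},
        "AtomicWeight": {"type": "real", "min": 1.007947, "max": 238.028913,
                         "sign": "positive", "max_nulls": 0},
    }
}

# Constraints to delete per field, following the reasoning in the text.
to_delete = {
    "Z": ["max"],
    "ChemicalSeries": ["rex", "min_length", "max_length"],
    "AtomicWeight": ["max"],
}

adapted = json.loads(json.dumps(discovered))  # deep copy via JSON round-trip
for field, names in to_delete.items():
    for name in names:
        adapted["fields"][field].pop(name, None)

print(json.dumps(adapted["fields"]["Z"], indent=4))
```

The script only applies decisions already made by reading the file; it does not replace the reading.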

Reading and Editing the `.tdda` File

A .tdda file is JSON. The top-level structure:

{
    "creation_metadata": { ... },
    "dataset": {
        "required_fields": ["*"],
        "allowed_fields": []
    },
    "fields": {
        "field_name": { ... },
        ...
    }
}

The dataset section controls which fields must be present (required_fields) and which extra fields are permitted (allowed_fields). Wildcards * and ? are supported. The default required_fields: ["*"] means all fields listed in fields are required. allowed_fields: [] means no extra fields are permitted.

Per-field constraints:

Constraint	Types	Notes
`type`	all	`int`, `real`, `string`, `bool`, `date`
`min`	int, real, date	date as ISO 8601 string
`max`	int, real, date	same
`sign`	int, real	`positive`, `non-negative`, `zero`, `non-positive`, `negative`
`max_nulls`	all	`0` = no nulls allowed
`no_duplicates`	all	`true` if values must be unique
`min_length`	string	length in Unicode code points
`max_length`	string	same
`allowed_values`	string	generated when ≤ 20 distinct values in training data
`rex`	string	list of regex patterns; value must match at least one

type: date covers both dates and datetimes. Naive only (no timezone). Dates stored as ISO 8601 strings in the JSON.
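
To make the meaning of the numeric constraints in the table concrete, here is a small pure-Python illustration. It is not tdda's implementation: the function check_field is hypothetical, None stands in for null, and the choice to ignore nulls when checking min, max, sign, and no_duplicates is an assumption of this sketch.

```python
def check_field(values, constraints):
    """Return the names of the constraints that the values violate."""
    non_null = [v for v in values if v is not None]
    sign_tests = {
        "positive": lambda v: v > 0,
        "non-negative": lambda v: v >= 0,
        "zero": lambda v: v == 0,
        "non-positive": lambda v: v <= 0,
        "negative": lambda v: v < 0,
    }
    failed = []
    if "min" in constraints and any(v < constraints["min"] for v in non_null):
        failed.append("min")
    if "max" in constraints and any(v > constraints["max"] for v in non_null):
        failed.append("max")
    if "sign" in constraints:
        sign_ok = sign_tests[constraints["sign"]]
        if not all(sign_ok(v) for v in non_null):
            failed.append("sign")
    n_nulls = len(values) - len(non_null)
    if "max_nulls" in constraints and n_nulls > constraints["max_nulls"]:
        failed.append("max_nulls")
    if constraints.get("no_duplicates") and len(set(non_null)) < len(non_null):
        failed.append("no_duplicates")
    return failed


z_constraints = {"min": 1, "max": 92, "sign": "positive",
                 "max_nulls": 0, "no_duplicates": True}
print(check_field([1, 2, 92], z_constraints))         # []
print(check_field([1, 2, 118, None], z_constraints))  # ['max', 'max_nulls']
```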

Editing the file

Auto-generated constraints are always a first draft. Always edit before deploying. For LLMs this is the step most likely to be skipped and most likely to matter.

The vocabulary: Tighten / Relax / Add / Delete / Choose Among.

min and max on dates. Remove max dates that will become stale as new data arrives (open_date max lags permanently). Set min from domain knowledge (e.g. bank founding date), not just from training data.

min and max on numeric fields. Adjust to domain-meaningful bounds. If account numbers are always 8-digit and start with 1, set "min": 10000000, "max": 19999999. The constraint should reflect what is valid, not just what happened to appear in training data.

sign. Keep it even when min/max make it redundant—it acts as a safeguard if those are later loosened or removed.

no_duplicates. Remove if duplicates are legitimately possible (shared phone numbers, email addresses across accounts, etc.).

allowed_values vs rex. Don't keep both. If the set of values is closed, use allowed_values and remove rex, min_length, max_length. If open-ended, write a tighter regex.

rex patterns. Auto-generated patterns are often too loose. Replace with domain-specific patterns:

"rex": ["^[A-Z]{1,2}[0-9]{1,2}[A-Z]? [0-9][A-Z]{2}$"]

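Before committing a hand-written pattern to the constraints file, it can help to try it against a few values that should and should not match. The sketch below uses Python's re module as a stand-in; how tdda itself applies rex patterns is not shown here, and the sample values (which assume the pattern is meant for UK-postcode-shaped strings) are illustrative.

```python
import re

# The pattern from the text above.
pattern = re.compile(r"^[A-Z]{1,2}[0-9]{1,2}[A-Z]? [0-9][A-Z]{2}$")

# Illustrative values: two that should match and three that should not.
candidates = ["EH1 1AA", "SW1A 1AA", "eh1 1aa", "EH11AA", ""]
matches = {value: bool(pattern.match(value)) for value in candidates}

for value, ok in matches.items():
    print(f"{value!r:12} {'match' if ok else 'no match'}")
```

Values that fail unexpectedly point to a pattern that is too tight; bad values that pass point to one that is too loose.
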
Precautionary principle. Every adaptation decision should be guided by the false-positive/false-negative trade-off established in Step 0. When uncertain, prefer tightening: in most contexts bad data propagating downstream undetected is worse than a spurious alert. But in high-volume pipelines where alert fatigue is a real risk, or where false positives have operational cost, the balance shifts. Apply the trade-off your human set; don't invent one.

The Three CLI Commands

CSV and flat-file input

CSV files are read using csv_to_pandas, which handles type inference, date parsing, and null markers. If the file has a companion .serial metadata file (or CSVW or Frictionless metadata), appending : to the filename tells tdda to find and use it automatically—giving accurate types rather than guessed ones. See post 077 for full details.

tdda discover data.csv: constraints.tdda   # auto-find metadata
tdda verify   data.csv: constraints.tdda
tdda detect   data.csv: constraints.tdda bad.csv --key id

`tdda discover`

Generates constraints from training data and writes a .tdda file.

# Basic (no regex generation)
tdda discover data.csv constraints.tdda

# With regex generation, ungrouped (recommended for string fields)
tdda discover -xG data.csv constraints.tdda

# From Parquet
tdda discover -xG data.parquet constraints.tdda

# From database
tdda discover -xG postgres:tablename constraints.tdda

# Write to stdout
tdda discover -xG data.csv

# Also write an HTML report
tdda discover -xG data.csv constraints.tdda -r html -o constraints

Key flags:

-x / --rex: enable regex generation for string fields
-G / --no-group-rex: do not group patterns (simpler output; default is grouped)
--no-md: omit creation metadata from the output file
--no-ar: omit allowed_fields and required_fields from the dataset section
-r FORMAT: also write a report in html, md, txt, json, yaml, or toml

`tdda verify`

Checks whether data satisfies constraints. Reports at the field level—how many records failed each constraint. Does not identify which records failed.

tdda verify data.csv constraints.tdda

# Show only fields with failures
tdda verify -f data.csv constraints.tdda

tdda verify data.csv: constraints.tdda

Key flags:

-f / --fields—report only fields with failures
-a / --all—report all fields including those with no failures
--epsilon E—tolerance for floating-point comparisons (default: 1e-6)

`tdda detect`

Finds and writes the individual records that fail constraints. Use this when you need to identify and act on specific failing records.

tdda detect data.csv constraints.tdda bad.csv --key account_id

# With text report alongside the output CSV
tdda detect data.csv constraints.tdda bad.csv -r txt --key account_id

# From/to Parquet
tdda detect data.parquet constraints.tdda bad.parquet -r txt --key id

The output file contains all failing records. By default it includes all original columns plus an n_failures count per row. The --key flag adds named field(s) to the text report for identification.

Key flags:

--key FIELD [FIELD ...]—key fields for the text report
-r FORMAT—report format(s): html, md, txt, json, yaml, toml
--per-constraint—write one flag column per failing constraint (default: on)
--no-per-constraint—omit per-constraint flag columns
--output-fields [FIELD ...]—original columns to include; no args = all
--write-all-records—include passing records in the output
--index—include row-number index in output

If no records fail, no output file is created (and any existing file at that path is deleted).

The Design Philosophy: Bring Data to the Constraints

The tdda library has a deliberately small set of constraint types. It does not have cross-column constraints, aggregate constraints, or constraints on non-tabular data. This is a design choice.

The answer to "how do I constrain X" is almost always: derive a column or take a measurement that reduces X to something tdda can handle natively. There are three patterns.

Pattern 1: Derived columns for cross-column constraints

For constraints that involve more than one column, compute a new column that captures the constraint, then discover and adapt constraints on that column. Two approaches:

Boolean column (convention: True = bad):

df['no_tel'] = df['home_tel'].isnull() & df['mobile_tel'].isnull()
# Constraint: max_nulls = 0, allowed_values = [False]
# Or cast to int and constrain sign = zero:
df['no_tel'] = df['no_tel'].astype(int)
# Constraint: sign = zero

Numeric column (constrain with sign):

import datetime
now = datetime.datetime.now()
df['open_secs_in_future'] = (
    (df['open_date'] - now).dt.total_seconds()
)
# Constraint: sign = negative (open dates must be in the past)

After adding derived columns, run tdda discover on the augmented DataFrame, then edit the generated constraints—keep type, max_nulls, sign; remove training-specific min/max on the derived columns.

Pattern 2: Roll-up constraints for aggregate checks

Problems invisible at the individual-record level often show up in counts, sums, and proportions per group. Compute the aggregates, write them as a small dataset, and discover constraints on that.

import pandas as pd
from tdda.serial.io import read_df, write_df

df = read_df('data.parquet')

# Whole-table statistics
stats = pd.DataFrame({'n_records': [len(df)]})
write_df(stats, 'stats.csv')
# tdda discover stats.csv stats.tdda

# Grouped statistics
dfg = (df.groupby('region', observed=True)
         .agg(count=('id', 'count'),
              total=('amount', 'sum'))
         .reset_index())
write_df(dfg, 'regional_stats.csv')
# tdda discover regional_stats.csv regional_stats.tdda

This catches fraud patterns (abnormally high counts per entity), data drift (proportions shifting), and coverage gaps (expected groups missing).

Pattern 3: Regularizing measurements for non-tabular data

Constraint discovery works on tabular data. For anything else—transaction logs, images, JSON documents, text files, time series—the approach is to extract a tabular dataset of measurements from the source data and validate that.

This is not a workaround. It is the intended design. The key insight is that most data quality problems manifest as anomalies in well-chosen measurements, and a small set of measurements often catches a large fraction of real problems.

Transaction logs → customer-level features:

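import pandas as pd

# Reference date for recency; assumes 'date' is a datetime column
today = pd.Timestamp.today().normalize()

# transactions: one row per transaction, with id, customer_id, amount, date
features = (transactions
    .groupby('customer_id')
    .agg(
        n_transactions=('id', 'count'),
        total_spend=('amount', 'sum'),
        max_transaction=('amount', 'max'),
        days_since_last=('date', lambda x: (today - x.max()).days),
    )
    .reset_index())
# discover constraints on features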

Other sources:

Arrays / time series: extract min, max, mean, null count, trend sign
Images: extract EXIF metadata, pixel statistics, checksum
JSON / XML: flatten to tabular using pandas json_normalize, or extract specific fields by path
Text files: line count, word count, pattern match counts, encoding checks

Even crude measurements catch many real problems. Start simple. The goal is not to capture every possible constraint—it is to catch most real failures with the least machinery.
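
As an illustration of the text-file case in the list above, the following sketch reduces a file's raw bytes to one row of measurements. The function name measure_text, the choice of measurements, and the ERROR pattern are all assumptions made for the example; a table of such rows (one per file) is what constraints would then be discovered on.

```python
import re


def measure_text(raw: bytes, pattern: str = r"\bERROR\b") -> dict:
    """Reduce a text file's raw bytes to one row of simple measurements."""
    try:
        text = raw.decode("utf-8")
        valid_utf8 = True
    except UnicodeDecodeError:
        # Keep measuring, but record that the encoding check failed.
        text = raw.decode("utf-8", errors="replace")
        valid_utf8 = False
    return {
        "n_bytes": len(raw),
        "n_lines": len(text.splitlines()),
        "n_words": len(text.split()),
        "n_pattern_matches": len(re.findall(pattern, text)),
        "valid_utf8": valid_utf8,
    }


sample = b"INFO start\nERROR disk full\nINFO done\n"
row = measure_text(sample)
print(row)
```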

Python API

For pipeline integration, use the Python API directly.

import pandas as pd
from tdda.constraints.pd.constraints import discover_df, verify_df, detect_df

df = pd.read_parquet('data.parquet')

# Discover
constraints = discover_df(df)
constraints.write_constraints_file('constraints.tdda')

# Verify
result = verify_df(df, 'constraints.tdda')
print(result)

# Detect—returns DataFrame of failing records
failures = detect_df(df, 'constraints.tdda', output_fields=[])
# output_fields=[] includes all original columns
# output_fields=None includes only index and failure columns

For CSV/Parquet I/O with metadata support (see post 077):

from tdda.serial.io import read_df, write_df
df = read_df('data.csv:')        # auto-find .serial metadata
write_df(df, 'output.parquet')

Database Support

The tdda CLI connects to PostgreSQL, MySQL, SQLite, and MongoDB. Database tables work throughout: use DBTYPE:tablename anywhere you would use a CSV or Parquet path. Connection parameters go in ~/.tdda_db_conn_DBTYPE (a JSON file):

{
    "dbtype": "postgres",
    "db": "mydb",
    "host": "localhost",
    "port": "5432",
    "user": "myuser",
    "password": "secret"
}

Use password_env_var instead of password to avoid cleartext credentials. Set file permissions to 600.
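
A minimal sketch of writing such a connection file with owner-only permissions follows. Several things here are assumptions: that password_env_var holds the name of an environment variable, the variable name TDDA_PG_PASSWORD, and the use of a temporary directory in place of the home directory so the example has no side effects. The permission handling is POSIX-specific.

```python
import json
import os
import stat
import tempfile

# Hypothetical connection details; no cleartext password is stored.
conn_params = {
    "dbtype": "postgres",
    "db": "mydb",
    "host": "localhost",
    "port": "5432",
    "user": "myuser",
    "password_env_var": "TDDA_PG_PASSWORD",  # assumed: name of an env var
}

# A temporary directory stands in for the home directory here.
with tempfile.TemporaryDirectory() as home:
    path = os.path.join(home, ".tdda_db_conn_postgres")
    # Create the file readable and writable by the owner only.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as f:
        json.dump(conn_params, f, indent=4)
    os.chmod(path, 0o600)  # enforce 600 regardless of umask
    mode = stat.S_IMODE(os.stat(path).st_mode)
    print(oct(mode))
```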

Reference tables as DBTYPE:tablename or DBTYPE:schema.tablename:

tdda discover -x postgres:accounts constraints.tdda
tdda verify -f postgres:accounts constraints.tdda
tdda detect postgres:accounts constraints.tdda bad.csv

For custom derived-column constraints on database tables, create a SQL view with the derived columns and run tdda against the view.
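
The view approach can be sketched with Python's built-in sqlite3 module. The table, column, and view names below are illustrative (they mirror the no_tel example from Pattern 1), and the sketch stops at creating and querying the view; pointing tdda at it is left to the CLI as described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (
        id INTEGER PRIMARY KEY,
        home_tel TEXT,
        mobile_tel TEXT
    );
    INSERT INTO accounts VALUES
        (1, '0131 496 0000', NULL),
        (2, NULL, NULL);

    -- Derived column: 1 when both numbers are missing (1 = bad).
    CREATE VIEW accounts_checked AS
    SELECT *,
           (home_tel IS NULL AND mobile_tel IS NULL) AS no_tel
    FROM accounts;
""")

rows = conn.execute(
    "SELECT id, no_tel FROM accounts_checked ORDER BY id"
).fetchall()
print(rows)
conn.close()
```

A constraint such as sign = zero on no_tel would then flag account 2, in the same way as the derived DataFrame column.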

Checklist

☐ Profile and discuss before discovering. Understand what "valid" means before encoding it. Involve a human at this stage.

☐ Discover on known-good data. Remove known anomalies from training data before running discover. The better the training data, the better the starting constraints.

☐ Read the generated .tdda file. Not skimming—reading. Before touching it.

☐ Adapt before deploying. Tighten/Relax/Add/Delete/Choose Among. Auto-generated constraints are a first draft, not a finished product.

☐ Validate against holdout data. Apply adapted constraints to data not used in discovery. Adapt further. Do not skip this.

☐ Never skip the development phase. Discover → deploy without read/adapt/validate generates many more false negatives. Bad data propagates downstream undetected. The constraints were never tight enough and nothing in the reduced process fixes that.

☐ Know your trade-off. Discuss false-positive/false-negative tolerance with your human before adapting. Safety-critical pipelines minimise false negatives at the cost of more alerts; high-volume pipelines may need the opposite. Apply that decision consistently; don't assume a default.

☐ Use derived columns for cross-column constraints. Boolean (True = bad) or numeric (constrain with sign). Discover on the augmented DataFrame.

☐ Use roll-up constraints for aggregate checks. Compute counts, sums, and proportions per group; discover on the result. Problems invisible at record level often show up here.

☐ Take measurements to validate non-tabular data. Extract a tabular dataset of measurements. Start simple: even crude measurements catch many real problems.

☐ Classify monitoring failures. True positives: act/reject/fix. False positives: relax constraints. False negatives: tighten or add constraints.

☐ Refine over time. Data changes, pipelines change, edge cases surface. Keep monitoring and adapting.

☐ Beware alert fatigue. Filter recurring known-benign failures. Don't suppress so aggressively that real problems hide.

CSV Metadata and tdda.serial: A Guide for LLMs and Coding Agents

Posted on Thu 21 May 2026 in TDDA • Tagged with tdda.serial, CSV, metadata, LLMs, pandas, polars, frictionless, csvw

Raw markdown for direct use in LLM context windows.

This post is addressed primarily to LLMs and coding agents. It covers the tdda.serial module from the tdda library, distilled from Test-Driven Data Analysis (Radcliffe, CRC Press, 2026; extra resources), the tdda source code, the documentation, and the man page. If you are a human wanting a practical reference, you are welcome too.

If you are a coding agent: (1) Look for a companion .serial file. (2) If present, treat it as authoritative — do not infer types that contradict it. (3) Use csv_to_pandas('foo.csv:') or csv_to_polars('foo.csv:'). (4) If no metadata exists, generate one with tdda serial --generate or write one by hand. (5) When writing data that will be reused, use pandas_to_csv(df, 'output.csv:').

What CSV Data Loses in Transit

CSV files are untyped. When a reader opens one, it must infer types from the formatted values. This inference is often wrong, and different libraries make different inference decisions.

Common failures:

Integer columns with nulls become floats. Even in Pandas 3.0, with its two nullable backends, the default read_csv promotes a nullable integer column to float64 because NaN is a float. A column of 1, 2, null becomes 1.0, 2.0, NaN.
Non-standard null markers are read as strings. A column using - to indicate missing values produces a string column rather than a nullable int or float.
Dates become strings. Unless the reader is told to parse dates, a date column like 2024-03-15 is read as a str or object column.
The Pandas index round-trip. df.to_csv() writes the Pandas index as an unnamed first column by default. pd.read_csv() does not restore it as an index. The resulting DataFrame has an extra unnamed column.
Different libraries, different results. The same CSV file may read as different types in Pandas and Polars, or between Pandas with the original backend and Pandas with the numpy_nullable backend.

A minimal demonstration: write a single-column Pandas DataFrame with a nullable integer and read it back using all defaults.

import pandas as pd

df = pd.DataFrame({'a': pd.array([1, 2, None], dtype='Int64')})
df.to_csv('test.csv')
df2 = pd.read_csv('test.csv')
print(df2.dtypes)   # a is float64, there's also an unnamed index column
print(df2)

Output:

Unnamed: 0    int64
a           float64
dtype: object
   Unnamed: 0    a
0           0  1.0
1           1  2.0
2           2  NaN

The round-trip has lost type information and added a spurious column. This happens with default settings, not unusual data.
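
For comparison, the same round trip can be made lossless by stating the choices explicitly on both sides. This sketch uses only standard pandas arguments (index=False on writing, dtype on reading) and an in-memory buffer instead of a file; the point is that the reader needs information that the CSV itself does not carry, which is exactly what a metadata file records.

```python
import io

import pandas as pd

df = pd.DataFrame({'a': pd.array([1, 2, None], dtype='Int64')})

buffer = io.StringIO()
df.to_csv(buffer, index=False)   # do not write the index as a column
buffer.seek(0)

# State the intended type explicitly instead of relying on inference.
df2 = pd.read_csv(buffer, dtype={'a': 'Int64'})
print(df2.dtypes)
print(df2)
```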

When Metadata Is Worth Using

CSV metadata — a companion file that records the format used — addresses these problems. It is not always worth the overhead.

Use metadata when:

The CSV uses non-standard formats: an unusual null indicator, a non-ISO date format, a non-comma separator, a non-UTF-8 encoding.
The data will be read by code written separately from the code that wrote it, or by a different person or system.
The data will be read again some time after it was written.
You are writing data-producing code and the format is non-obvious. The marginal cost of writing a .serial file alongside the write code is small, and it serves as machine-readable documentation.

Don't use metadata when:

The CSV is clean, standard, and will be read once by the same code that wrote it.
It is a simple file that whatever reads it handles correctly in practice.
It is a throwaway file that nothing downstream will consume.

A note for LLMs: the cost of generating a .serial file alongside data-writing code is near zero at code-generation time. The threshold for "worth it" is therefore lower for LLM-generated code than for humans typing it out. Lean toward including metadata when the format is non-standard or the data leaves the immediate script.

Three Metadata Formats

Three metadata formats are in use for CSV files. Understanding their philosophical differences helps you choose and explains why they don't translate perfectly to each other.

`tdda.serial` (`.serial` files)

A tdda.serial file describes a format, not a specific file. One .serial file can apply to any number of CSV files that share the same format. It is not URL-centric. It can be hand-written or generated. It has a strong, flexible date format system. It is the native format of the tdda.serial module. You can use @ as a glob wildcard in .serial filenames to indicate which files a metadata file is intended for — sales_@.serial would match sales_2024.csv, sales_2025.csv etc.

A .serial file is a JSON file. By convention, it has the same stem name as the data file it accompanies (foo.csv → foo.serial), but this is not required. Any .serial file can be applied to any compatible flat file.

CSVW

CSVW is a W3C standard for describing CSV files on the web. It is designed around a one-to-one relationship between a metadata file and a specific named data file, identified by URL. A CSVW file that describes foo.csv contains a url pointing to that specific file.

CSVW is comprehensive and has W3C backing, but tooling is sparse and fragmented — the tools listed on the CSVW site are mostly RDF-focused, not CSV-focused. It is a heavy format, and date handling is less flexible than tdda.serial. If you receive a CSVW file, tdda.serial can use it; if you are creating new metadata, tdda.serial is a simpler and more practical choice.

Frictionless

Frictionless is a data packaging ecosystem with good Python tooling (pip install frictionless). Its primary abstractions are resources (a single dataset plus its metadata) and packages (a collection of resources). This makes it well-suited to supplying data as a self-described package, but less suited to describing a shared format applied to many files. A Frictionless schema is reusable but is not commonly used standalone. If you are distributing a dataset for others to consume, Frictionless is a reasonable choice. If you are describing an internal format, use tdda.serial.

Interoperability

The tdda.serial library reads and writes all three formats, and they can be used interchangeably in most tdda contexts. If you receive a CSVW or Frictionless file, you can pass it to csv_to_pandas exactly as you would a .serial file. The tdda serial command converts between formats.

The `.serial` File Format

A .serial file is a JSON object. Here is the full top-level structure:

{
    "format": "http://tdda.info/ns/tdda.serial",
    "writer": "tdda.serial-3.0.0",
    "tdda.serial": { ... },
    "pandas.read_csv": { ... },
    "pandas.DataFrame.to_csv": { ... },
    "polars.read_csv": { ... }
}

The format key is required; all others are optional. Only the tdda.serial section is described here — library-specific sections contain verbatim keyword arguments for the corresponding function and are generated by tdda serial --to pd.r etc.

Dataset-level keys in `tdda.serial`

All are optional. Omitted keys fall back to library defaults.

Key	Type	Description
`encoding`	string	Text encoding, e.g. `"UTF-8"`, `"latin-1"`
`delimiter`	string	Field separator, e.g. `","`, `"|"`, `";"`
`quote_char`	string	Quote character, almost always `"\""` or `"'"`
`escape_char`	string	Escape character; `"\\"` means backslash
`stutter_quotes`	bool	If true, embedded quotes are doubled (`""`)
`null_indicator`	string or array	Null marker(s), e.g. `""`, `"-"`, `"NULL"`
`date_format`	string	Default format for `date` fields
`datetime_format`	string	Default format for `datetime` fields
`header_row_count`	int	Number of header rows (default: 1)
`header_row`	int	Zero-based index of the column-name row (default: 0)
`decimal_point`	string	Decimal point character (default: `"."`)
`thou_sep`	string	Thousands separator, e.g. `","`
`true_values`	string or array	Values interpreted as `true` for bool fields
`false_values`	string or array	Values interpreted as `false` for bool fields
`quoting`	string	Quoting style (see below)
`fields`	array or object	Per-field descriptions

The quoting field accepts Python csv module constants (QUOTE_ALL, QUOTE_MINIMAL, QUOTE_NONNUMERIC, QUOTE_NONE, QUOTE_NOTNULL, QUOTE_STRINGS) and also QUOTE_STRINGS_ONLY, which quotes only string values (not nulls, numbers, dates, or booleans). QUOTE_STRINGS_ONLY is similar to what JSON does.
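As a reminder of what the standard constants mean, the sketch below uses Python's own csv module to write the same row under three of them. It illustrates the semantics the names refer to, not how tdda.serial applies them; the sample row is invented.

```python
import csv
import io


def write_row(row, quoting):
    """Write one row with the given csv quoting constant; return the text."""
    buf = io.StringIO()
    csv.writer(buf, quoting=quoting, lineterminator='\n').writerow(row)
    return buf.getvalue().rstrip('\n')


row = ['Helium', 2, 'noble, inert']
print(write_row(row, csv.QUOTE_MINIMAL))     # Helium,2,"noble, inert"
print(write_row(row, csv.QUOTE_NONNUMERIC))  # "Helium",2,"noble, inert"
print(write_row(row, csv.QUOTE_ALL))         # "Helium","2","noble, inert"
```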

The `fields` entry

Fields can be specified as an array (complete and ordered) or an object/dictionary (partial, keyed by the CSV column name).

Array form — used when the complete field list is known and ordered:

"fields": [
    {"name": "id",    "fieldtype": "int"},
    {"name": "price", "fieldtype": "float"},
    {"name": "date",  "fieldtype": "date"}
]

Object form — used for partial specifications or when internal names differ from CSV column names:

"fields": {
    "commission date": {"name": "DateOfCommission", "fieldtype": "date"},
    "passed qa?":      {"name": "PassedQA", "fieldtype": "bool",
                        "true_values": "yes", "false_values": "no"}
}

In object form the dictionary key is the name as it appears in the CSV; the optional name key gives the internal (DataFrame column) name.

Per-field keys

Key	Description
`name`	Internal name (DataFrame column name). Required in array form.
`fieldtype`	Type of the field (see table below)
`csvname`	CSV column name when different from `name` (array form)
`format`	Date/datetime format for this field; overrides dataset-level
`null_indicator`	Null marker(s) for this field; overrides dataset-level
`true_values`	True value(s) for bool fields
`false_values`	False value(s) for bool fields
`description`	Human-readable description
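The relationship between the two forms can be sketched as a small conversion. The helper fields_object_to_array is hypothetical (it is not part of tdda.serial), and the conversion is only meaningful when the object form happens to list every column in file order, since array form is complete and ordered.

```python
def fields_object_to_array(fields):
    """Convert object-form fields to array form (hypothetical helper).

    The dictionary key (the CSV column name) becomes 'csvname' when it
    differs from the internal 'name'; if 'name' is absent, the key is used.
    """
    result = []
    for csv_name, spec in fields.items():
        entry = dict(spec)                 # copy, so the input is untouched
        entry.setdefault('name', csv_name)
        if entry['name'] != csv_name:
            entry['csvname'] = csv_name
        result.append(entry)
    return result


fields = {
    "commission date": {"name": "DateOfCommission", "fieldtype": "date"},
    "Z": {"fieldtype": "int"},
}
array_form = fields_object_to_array(fields)
for entry in array_form:
    print(entry)
```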

Field types

Value	Description
`bool`	Boolean
`int`	Integer
`float`	Floating-point
`number`	Unspecified numeric
`string`	Text
`date`	Date (no time component)
`datetime`	Date and time
`datetime_tz`	Date and time with timezone
`time`	Time only
`iso8601`	ISO 8601 date or datetime (unspecified)

Date format specifications

Four forms are accepted:

Named ISO 8601 formats: iso8601-date (2000-12-31), iso8601-datetime (2000-12-31T12:34:56), iso8601-datetime-tz (2000-12-31T12:34:56+00:00), iso8601 (any of the above). These are the recommended choices for new data.
YYYY/MM/DD-style specifiers: Tokens: YYYY, YY, MM (month or minute, by context), DD, HH, SS, SS.S (fractional), MON (Jan), MONTH (January), +ZZ:ZZ (timezone), AM/PM. Examples: YYYY-MM-DD, DD/MM/YYYY HH:MM:SS, MM/DD/YY, YYYY-MM-DDTHH:MM:SS.S+ZZ:ZZ.
Unambiguous literal examples: any actual date/time value in which the day is ≥ 13 (so it cannot be a month) and the year is either 4 digits or ≥ 60 (so it cannot be a day). So 2000-12-31T12:34:56 is accepted; 01/02/2000 is not (ambiguous: day-first or month-first?).
Python strftime strings: %Y-%m-%dT%H:%M:%S etc.
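To show how the token-style specifiers relate to strftime strings, here is a rough sketch of a translator for a subset of the tokens (YYYY, YY, MM, DD, HH, SS). It is not the tdda.serial implementation: the function name to_strftime is hypothetical, and the rule used to resolve MM (minutes once an HH token has been seen, otherwise month) is an assumption about what "by context" means.

```python
import re
from datetime import datetime


def to_strftime(spec):
    """Translate a subset of YYYY/MM/DD-style tokens to a strftime string."""
    simple = {'YYYY': '%Y', 'YY': '%y', 'DD': '%d', 'HH': '%H', 'SS': '%S'}
    out = []
    seen_hour = False
    # Longest tokens first; any other single character is kept literally.
    for tok in re.findall(r'YYYY|YY|MM|DD|HH|SS|.', spec):
        if tok == 'MM':
            out.append('%M' if seen_hour else '%m')
        elif tok in simple:
            seen_hour = seen_hour or tok == 'HH'
            out.append(simple[tok])
        else:
            out.append(tok)
    return ''.join(out)


fmt = to_strftime('DD/MM/YYYY HH:MM:SS')
print(fmt)                                           # %d/%m/%Y %H:%M:%S
print(datetime.strptime('31/12/2000 12:34:56', fmt))  # 2000-12-31 12:34:56
```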

A complete example

This .serial file describes a CSV with non-standard settings — a hyphen as null indicator and ISO 8601 dates — matching the elements3-old.csv file distributed with the tdda library:

{
    "format": "http://tdda.info/ns/tdda.serial",
    "tdda.serial": {
        "encoding": "UTF-8",
        "delimiter": ",",
        "quote_char": "\"",
        "escape_char": "\\",
        "stutter_quotes": false,
        "null_indicator": "-",
        "date_format": "YYYY-MM-DD",
        "header_row_count": 1,
        "fields": [
            {"name": "Z",               "fieldtype": "int"},
            {"name": "Name",            "fieldtype": "string"},
            {"name": "Symbol",          "fieldtype": "string"},
            {"name": "Period",          "fieldtype": "int"},
            {"name": "Group",           "fieldtype": "int"},
            {"name": "AtomicWeight",    "fieldtype": "float"},
            {"name": "ApproxDiscovery", "fieldtype": "date"}
        ]
    }
}

An LLM that knows a CSV file's format can write a .serial file like this directly, without running format inference (tdda serial --generate). When you can examine the data, this is usually faster and more reliable than inference. Remember that metadata describes the intended format: values that do not conform are data errors, not type-inference hints.
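Since a .serial file is plain JSON, writing one by hand from Python needs nothing beyond the standard library. The sketch below uses only keys described above; the helper name write_serial, the output filename, and the shortened field list are illustrative.

```python
import json
import os
import tempfile

metadata = {
    "format": "http://tdda.info/ns/tdda.serial",
    "tdda.serial": {
        "encoding": "UTF-8",
        "delimiter": ",",
        "null_indicator": "-",
        "date_format": "YYYY-MM-DD",
        "fields": [
            {"name": "Z", "fieldtype": "int"},
            {"name": "Name", "fieldtype": "string"},
            {"name": "ApproxDiscovery", "fieldtype": "date"},
        ],
    },
}


def write_serial(md, path):
    """Write a metadata dictionary to `path` as indented JSON."""
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(md, f, indent=4)
        f.write('\n')


# Round-trip through a temporary file to confirm the result is valid JSON.
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, 'elements.serial')
    write_serial(metadata, path)
    with open(path, encoding='utf-8') as f:
        reloaded = json.load(f)

print(reloaded == metadata)  # True
```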

Reading with `tdda.serial`

Reading a format you know

When you have a .serial file (or can write one), use csv_to_pandas or csv_to_polars:

from tdda.serial import csv_to_pandas, csv_to_polars

# Explicit metadata path
df = csv_to_pandas('elements3-old.csv', md_path='elements3-old.serial')
df = csv_to_polars('elements3-old.csv', md_path='elements3-old.serial')

# Auto-locate metadata (same stem name, same directory)
df = csv_to_pandas('elements3-old.csv', find_md=True)
df = csv_to_polars('elements3-old.csv', find_md=True)

# Colon suffix — equivalent to find_md=True
df = csv_to_pandas('elements3-old.csv:')
df = csv_to_polars('elements3-old.csv:')

# Colon with explicit metadata path
df = csv_to_pandas('elements3-old.csv:elements3-old.serial')

The auto-locate (find_md=True / colon suffix) searches for metadata in priority order: foo.csv.serial, foo.serial, wildcard matches using @ as a glob character (e.g. @.serial), then CSVW and Frictionless naming conventions.
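The first two steps of that priority order can be sketched as follows. This is an illustration only: first_existing_metadata is a hypothetical helper, and the wildcard, CSVW and Frictionless lookups are deliberately left out.

```python
import tempfile
from pathlib import Path


def first_existing_metadata(csv_path):
    """Return the first existing candidate .serial path, or None.

    Covers only foo.csv.serial and then foo.serial.
    """
    csv_path = Path(csv_path)
    candidates = [
        csv_path.with_name(csv_path.name + '.serial'),  # foo.csv.serial
        csv_path.with_suffix('.serial'),                # foo.serial
    ]
    for candidate in candidates:
        if candidate.exists():
            return candidate
    return None


with tempfile.TemporaryDirectory() as d:
    csv_file = Path(d) / 'foo.csv'
    csv_file.write_text('id\n1\n')
    (Path(d) / 'foo.serial').write_text('{}')
    found_one = first_existing_metadata(csv_file).name
    (Path(d) / 'foo.csv.serial').write_text('{}')
    found_both = first_existing_metadata(csv_file).name

print(found_one, found_both)  # foo.serial foo.csv.serial
```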

Pandas backends: csv_to_pandas defaults to the numpy_nullable backend. The backend parameter overrides this:

df = csv_to_pandas('foo.csv:', backend='original')    # traditional Pandas dtypes
df = csv_to_pandas('foo.csv:', backend='pyarrow')     # Arrow-backed dtypes
df = csv_to_pandas('foo.csv:', backend='numpy_nullable')  # default

Polars note: polars.read_csv is less flexible than pandas.read_csv for unusual formats. In particular, Polars can only parse ISO 8601 dates directly. csv_to_polars works around this by reading problematic fields as strings and converting them in a post-processing step.
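The read-as-strings-then-convert idea is independent of any particular library. The sketch below shows it with the standard library only; it is not the code csv_to_polars runs, and the column names, date format and null marker are invented for the example.

```python
import csv
import io
from datetime import datetime

raw = "id,commissioned\n1,31/12/2000\n2,-\n"

# Step 1: read everything as strings.
rows = list(csv.DictReader(io.StringIO(raw)))


def parse_date(text, fmt='%d/%m/%Y', null='-'):
    """Convert a string to a date, mapping the null marker to None."""
    return None if text == null else datetime.strptime(text, fmt).date()


# Step 2: convert the problematic column in a post-processing pass.
for row in rows:
    row['commissioned'] = parse_date(row['commissioned'])

print(rows[0]['commissioned'], rows[1]['commissioned'])  # 2000-12-31 None
```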

Reading an unfamiliar format

When you don't know a CSV file's format, use tdda serial --generate to infer it:

tdda serial --generate foo.csv foo.serial

This reads foo.csv, applies heuristics, and writes foo.serial. The result is a starting point, not a guarantee — inspect and correct it before relying on it. Key override switches:

--sep C              Set field delimiter to C
--nulls S            Set null indicator(s)
--date-format FMT    Set default date format
--quote-char Q       Set quote character
--escape             Use backslash escaping
--stutter            Use quote stuttering
--encoding ENC       Set encoding
--sample-lines N     Use N lines for inference (default: 1000)

For LLMs: if you can read the CSV file directly, you can often write the .serial by hand more quickly and reliably than by running inference. Use --generate when the format is complex or uncertain.

After generating or writing the .serial, read with csv_to_pandas or csv_to_polars as shown above.

Writing with `tdda.serial`

The write wrapper is currently available for Pandas only.

`pandas_to_csv`

from tdda.serial import pandas_to_csv

# Write CSV and generate accompanying .serial metadata automatically
info = pandas_to_csv(df, 'output.csv', auto_md_outpath=True)
# Writes output.csv and output.serial

# Colon suffix — equivalent
info = pandas_to_csv(df, 'output.csv:')

# Explicit metadata output path
info = pandas_to_csv(df, 'output.csv', md_outpath='output.serial')

# Use an existing .serial to specify the write format
info = pandas_to_csv(df, 'output.csv', md_inpath='format.serial')

# Use an existing .serial for write format and write a .serial for readers
info = pandas_to_csv(df, 'output.csv',
                     md_inpath='shared-format.serial',
                     md_outpath='output.serial')

The return value is a WriteInfo object showing the path written, the metadata output path, and the keyword arguments passed to to_csv.

Any keyword arguments you pass that to_csv accepts (such as sep, na_rep, encoding, date_format) are forwarded to to_csv and also reflected in the written .serial file. For example:

info = pandas_to_csv(df, 'output.psv',
                     auto_md_outpath=True,
                     sep='|',
                     na_rep='NULL',
                     encoding='latin-1')

writes a pipe-separated file with NULL as the null marker and records these settings in output.serial.

pandas_to_csv sets index=False by default — it does not write the Pandas index as a column. This is almost always what you want.

Writing from Polars

There is currently no polars_to_csv wrapper, though one is planned. For now, write using the native df.write_csv() method and generate or write the .serial file separately:

# Write the CSV
df.write_csv('output.csv')

# Generate a .serial from the written file
# (run in shell or via subprocess)
# tdda serial --generate output.csv output.serial

# Or write the .serial by hand if you know the format

If you need the Pandas write behaviour for a Polars DataFrame, convert first: pandas_to_csv(df.to_pandas(), 'output.csv:', use_pyarrow=True).

Writing a format-only `.serial` (no field info)

A .serial file can record only the format conventions without any field-level detail. This is useful as a shared "house format" that specifies separator, encoding, null indicator etc., leaving field names and types to be inferred by the reader. Generate one with:

tdda serial --generate "" format.serial --sep "|" --nulls "NULL" --encoding "latin-1"

(Empty filename generates a fieldless metadata file.)

Or write it by hand — it is a small JSON file:

{
    "format": "http://tdda.info/ns/tdda.serial",
    "tdda.serial": {
        "encoding": "latin-1",
        "delimiter": "|",
        "null_indicator": "NULL"
    }
}

Use md_inpath='format.serial' when writing any file in this format.

The `tdda serial` CLI: Conversion and Code Generation

The tdda serial command converts between metadata formats and generates Python code for reading files without requiring the tdda library.

Format conversion

tdda serial infile outfile [--to FORMAT]

Format is inferred from filename when it follows conventions; use --to when it doesn't. Format abbreviations:

Short	Long form
`.`	`tdda.serial` (default)
`pd.r`	`pandas.read_csv`
`pd.w`	`pandas.DataFrame.to_csv`
`pl.r`	`polars.read_csv`
`pl.w`	`polars.DataFrame.write_csv`
`csvw`	CSVW
`fl`	Frictionless
`fl.r`	Frictionless resource
`fl.p`	Frictionless package

Examples:

# Convert between formats (inferred from filenames)
tdda serial foo.serial foo-metadata.json        # tdda.serial → CSVW
tdda serial foo-metadata.json foo.serial        # CSVW → tdda.serial

# Explicit format
tdda serial --to csvw foo.serial foo-out.json
tdda serial --to pl.r foo.serial foo-pl.serial  # add Polars read_csv section

# Generate from a CSV file
tdda serial --generate foo.csv foo.serial

# Pandas backend when converting to Pandas sections
tdda serial --to pd.r --backend a foo.serial foo-pdr.serial  # PyArrow dtypes

Generating Python code

Use a .py output extension to generate a standalone read_data() function that does not require tdda to be installed:

tdda serial foo.serial foo_reader.py --to pd.r

This produces something like:

import pandas as pd

def read_data(inpath):
    return pd.read_csv(inpath, sep=',', encoding='UTF-8',
        escapechar='\\', quotechar='"',
        dtype={'id': 'Int64', 'price': 'Float64'},
        na_values='-', keep_default_na=False)

This is useful when sharing code with users who do not have tdda installed, or when you want to hard-wire the read parameters.

The --for FILE flag sets the data path in CSVW or Frictionless output (CSVW requires a url):

tdda serial --to csvw foo.serial foo-metadata.json --for foo.csv

Using Metadata with Other `tdda` Tools

The colon syntax works with all tdda command-line tools that accept CSV files. Adding : to a CSV path tells tdda to find and use metadata automatically; adding :path specifies the metadata explicitly.

# tdda verify — validate data against constraints
tdda verify foo.csv elements118.tdda              # no metadata: types may be wrong
tdda verify foo.csv: elements118.tdda             # auto-find metadata
tdda verify foo.csv:foo.serial elements118.tdda   # explicit metadata

# tdda diff — compare two datasets
tdda diff old.csv: new.csv:                       # use metadata for both

# tdda discover — generate constraints
tdda discover foo.csv: foo.tdda

Without metadata, type mismatches cause spurious failures in verify and diff. A Group column read as string (because the null marker isn't recognised) fails type, min, max, and sign constraints — four failures from one formatting oversight.
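The mechanism behind such a failure is easy to reproduce. In this simplified sketch (infer_type is a hypothetical helper, not tdda's inference code), a single unrecognised null marker is enough to turn an integer column into a string column.

```python
def infer_type(values, null_indicators=()):
    """Return 'int' if every non-null value parses as an integer, else 'string'."""
    for value in values:
        if value in null_indicators:
            continue            # recognised nulls do not affect the type
        try:
            int(value)
        except ValueError:
            return 'string'
    return 'int'


group = ['1', '18', '-', '2']
print(infer_type(group))                          # string
print(infer_type(group, null_indicators=('-',)))  # int
```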

CSVW and Frictionless files work in the same positions as .serial files. A CSVW file that contains the data path can be specified instead of the CSV file itself:

tdda verify foo-metadata.json elements118.tdda

Checklist and Recommended Agent Behaviour

☐ When you encounter a CSV with a .serial file alongside it, use csv_to_pandas('foo.csv:') or csv_to_polars('foo.csv:') rather than bare pd.read_csv.

☐ When writing CSV data that will be shared or reused, use pandas_to_csv(df, 'output.csv:') to write a .serial alongside it. If the format has non-standard settings, pass them as keyword arguments and they will be recorded in the .serial.

☐ When writing data-producing code, consider whether the CSV will be read later or by other code. If so, add auto_md_outpath=True. For LLM-generated code the overhead is negligible.

☐ When reading an unfamiliar CSV, run tdda serial --generate foo.csv foo.serial, inspect the result, correct if needed, then use csv_to_pandas('foo.csv:'). If you can read the file directly, writing the .serial by hand is often faster.

☐ When running tdda verify or tdda diff on CSV files, add the colon suffix: foo.csv: rather than foo.csv, if a suitably-named metadata file is available. Without it, type mismatches produce spurious failures.

☐ CSVW and Frictionless files are accepted wherever .serial files are. If you receive data with CSVW or Frictionless metadata, pass it directly to csv_to_pandas or use the colon syntax.

☐ Don't add metadata to simple throwaway scripts where the CSV is standard (comma, UTF-8, no dates, no unusual nulls) and nothing downstream will read it. The overhead is not worth it.

☐ To share read code without requiring tdda, generate a standalone Python reader: tdda serial foo.serial foo_reader.py --to pd.r.

☐ To convert between metadata formats (tdda.serial ↔ CSVW ↔ Frictionless), use tdda serial infile outfile.

Reference Testing with TDDA: A Guide for LLMs and Coding Bots

Posted on Wed 20 May 2026 in TDDA • Tagged with reference testing, LLMs, coding bots, gentest, pytest, unittest

Raw markdown for direct use in LLM context windows.

This post is addressed primarily to LLMs and coding agents—it was distilled by Claude from Test-Driven Data Analysis (Radcliffe, CRC Press, 2026; extra resources), the tdda source code, the documentation, and the man pages (also available as markdown). If you are a human who wants a dense, no-nonsense introduction to reference testing with TDDA, you are welcome too. It was written in response to repeated direct observations of coding agents—including Claude—using -W to make failing tests pass without verifying that the new output was correct.

What Is a Reference Test?

A reference test is a semantic regression test. 'Semantic' here means the comparison understands equivalence: outputs that differ in timestamps, row order, version strings, hostnames, file paths, or other legitimately variable content can still pass — only meaningful changes cause failures. You run your analytical process or function, capture its outputs, inspect them and decide they are correct, then store them as reference outputs. From that point on, the test reruns the process and checks that the outputs still match the reference. If something changes—a library update, a refactoring, a subtle bug—the test catches it.

The tdda.referencetest library makes this easy for Python code. It extends unittest (and works with pytest) and provides assertion methods for:

strings and text files (assertStringCorrect, assertTextFileCorrect)
data frames (assertDataFrameCorrect, assertDataFramesEquivalent)
binary files (assertBinaryFileCorrect)

It also provides tdda gentest, a command-line tool that watches your process run and generates a test suite automatically.

The key insight: reference testing is particularly well suited to analytical code because the outputs are often large, complex, and hard to specify in advance. You develop the process, carefully inspect and verify the output is right once, store it, and thereafter the test does the checking.

The `-W` (`--write-all`) Trap

Do not blindly run -W (or --write-all) merely to make failing tests pass.

This is the most important thing in this document.

When a reference test fails, tdda.referencetest tells you what differs between the actual output and the stored reference. It also offers you a way to update the reference to match the new output: pass -W (or --write-all under pytest) when running the tests.

Running -W overwrites the reference files with whatever the code currently produces. After this, the tests pass. This tells you nothing about whether the output is correct. The tests will pass even if the output is completely wrong, because you just told them that the wrong output is the new reference.

The correct workflow when tests fail is:

1. Read the failure message. It tells you what changed.
2. Run the diff command suggested in the failure output.
3. Look at the actual differences. Are they expected? Are they correct?
4. Only if you have verified the new output is correct: update the references. With unittest, use -1W (tagged tests only, recommended) or -W (all); with pytest, use --tagged --write-all -s or --write-all -s. See Running a Subset of Tests with Tags for the full commands. If you have a tame human to hand, this is a good moment to involve them: humans are often better at judging whether output is actually correct, and can get quite sweary when you overwrite correct reference results with nonsense.
5. Run the tests again (without -W) to confirm they pass.
6. Check the updated reference files into version control.

If you skip step 3 and go straight to -W, you have not tested anything. You have merely synchronized the reference to whatever the code happens to produce right now.

Safe use of `-W`: the git audit pattern

If the reference files are clean in git before you run -W, you can use git diff afterwards to inspect exactly what changed, and git checkout -- path/to/testdata/ to revert all reference changes at once if anything looks wrong. This makes -W a controlled and auditable operation—but only if the working tree was clean before you ran it. Always check before running -W, not after.
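An agent can make the "check before, not after" rule mechanical. The following is a minimal sketch, assuming git is on the PATH and the current directory is inside the repository; both function names are hypothetical. It relies on git status --porcelain printing nothing when the given path has no modified or untracked files.

```python
import subprocess


def is_clean(porcelain_output):
    """True if `git status --porcelain` reported no changes."""
    return porcelain_output.strip() == ''


def reference_dir_is_clean(path):
    """Check that `path` has no uncommitted or untracked changes."""
    result = subprocess.run(
        ['git', 'status', '--porcelain', '--', path],
        capture_output=True, text=True, check=True,
    )
    return is_clean(result.stdout)


print(is_clean(''))                            # True
print(is_clean(' M testdata/expected.txt\n'))  # False
```

Calling reference_dir_is_clean('path/to/testdata/') before any rewrite, and refusing to proceed when it returns False, keeps the later git diff meaningful.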

Unit-Enhanced Reference Tests

The test code and command-line flags differ between reference tests built on unittest and those built on pytest. The sections below cover the unittest variants first; see Writing Reference Tests with pytest for the pytest equivalents. Where flags differ, both are given.

Task	unittest	pytest
Run all tests	`python tests.py -F`	`pytest tests/ --log-failures`
Run only tagged tests	`python tests.py -F -1`	`pytest tests/ --log-failures --tagged`
Rewrite all references	`-W`	`--write-all -s`
Rewrite tagged only	`-1W`	`--tagged --write-all -s`

Full syntax and explanations for each are in the sections below; this table is a quick reference.

A partial structural defence against careless -W use is unit-enhanced reference tests: after the reference assertion, add one or more specific assertions about things that must be true regardless (shown here in unittest style):

def test_output(self):
    result = run_my_process(input_data)
    self.assertStringCorrect(result, 'expected.txt')
    # These survive a careless -W rewrite:
    self.assertIn('Total: 42 records', result)
    self.assertTrue(result.strip().endswith('OK'))

The reference assertion runs first. If it fails, tdda writes the actual output and suggests a diff command—the normal workflow. If you then carelessly rewrite with -W, the subsequent assertions will still fail if the output is wrong in ways they cover.

This is not a complete defence—you have to choose the assertions carefully—but it makes it much harder to accidentally accept a broken result. Choose assertions that reflect the core correctness property the test was designed to verify.

This pattern emerged from the author's direct experience of coding agents (including Claude) repeatedly using -W to make tests pass without verifying the results. It is recommended for any test where the reference output has semantic structure that can be spot-checked.

The `-F` (`--log-failures`) Flag

Always pass the log-failures flag when running tests. It logs the IDs of any failing tests to a timestamped file (YYYY-MM-DDTHHMMSS-failing-tests.txt) in your system temp directory (overridable with $TDDA_FAIL_DIR). This enables the tdda tag workflow: tdda tag reads the most recent such file and adds @tag to the failing tests, so you can re-run and regenerate references for just those tests.

Without the flag, no failures file is written and tdda tag has nothing to work with.
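If you want to look at the failures file yourself, something like the sketch below would find the newest one. It assumes only what is stated above (the filename pattern, the temp directory default and the $TDDA_FAIL_DIR override); the function name is hypothetical and this is not how tdda tag is implemented.

```python
import os
import tempfile


def latest_failures_file(environ=os.environ):
    """Return the newest *-failing-tests.txt path, or None if there is none.

    Relies on the timestamped filenames sorting chronologically.
    """
    fail_dir = environ.get('TDDA_FAIL_DIR') or tempfile.gettempdir()
    names = sorted(name for name in os.listdir(fail_dir)
                   if name.endswith('-failing-tests.txt'))
    return os.path.join(fail_dir, names[-1]) if names else None


# Demonstrate with two fake failure logs in a scratch directory.
with tempfile.TemporaryDirectory() as d:
    for stamp in ('2026-05-19T101500', '2026-05-20T143201'):
        with open(os.path.join(d, stamp + '-failing-tests.txt'), 'w') as f:
            f.write('')
    newest = os.path.basename(latest_failures_file({'TDDA_FAIL_DIR': d}))

print(newest)  # 2026-05-20T143201-failing-tests.txt
```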

With unittest (running directly with Python)

Pass -F (or --log-failures):

python tests/test_mycode.py -F

With pytest

Pass --log-failures:

pytest tests/ --log-failures

Permanent default

To avoid passing the flag every time, add this to ~/.tdda.toml:

[referencetest]
log_failures = true

This modifies the user's global configuration. Consult your human before doing it.

The Kicker: the `-W` Problem is Not Restricted to TDDA

The anti-pattern described above — rewriting expected outputs to make tests pass without verifying the new output is correct — applies far beyond tdda.referencetest. LLM coding agents routinely treat passing tests as the goal rather than as evidence of correctness. Whether rewriting a reference file, updating a pytest snapshot, regenerating Jest snapshots, or changing a hardcoded expected value in an assertion, the same question applies first: is the new result actually correct?

Green tests after any kind of expected-value rewrite tell you nothing about correctness. They tell you only that the code now matches whatever you told it to match.

The correct workflow is the same regardless of framework:

1. A test fails. Read the failure. What changed?
2. Is the change correct, or is it a bug?
3. Only if correct: update the expected value.
4. If you're not sure: ask your human.

The specific value of tdda.referencetest is that it makes step 1 easy — the diff tooling is built in, and -F/tdda tag/-1W limit the blast radius. But the discipline is universal.

Running a Subset of Tests with Tags

To run only some tests, use the @tag decorator:

from tdda.referencetest import ReferenceTestCase, tag

class TestMyProcess(ReferenceTestCase):

    @tag
    def test_main_output(self):
        result = run_my_process()
        self.assertStringCorrect(result, 'expected_output.txt')

    def test_other_thing(self):
        ...

@tag can decorate individual test methods or entire test classes. The flags to run only tagged tests differ between unittest and pytest.

With unittest (running directly with Python)

python tests/test_mycode.py -F -1        # run only tagged tests
python tests/test_mycode.py -F -1W       # regenerate references for tagged tests only

-1W combines -1 and -W (--write-all). This is the safe way to regenerate, because it limits the blast radius to tests you have explicitly chosen and tagged.

With pytest

pytest tests/ --log-failures --tagged               # run only tagged tests
pytest tests/ --log-failures --tagged --write-all -s  # regenerate references for tagged tests only

Pass -s to prevent pytest from capturing output, so that tdda can report which reference files were written.

The full workflow with `tdda tag`

The -F → tdda tag → -1W workflow lets you rewrite only the references that actually failed, without manually deciding which tests to tag:

1. Run tests with -F (or --log-failures) to record failing test IDs
2. Run tdda tag to add @tag to those tests automatically
3. Inspect the diffs to verify the new output is correct
4. Run -1W (or --tagged --write-all -s) to rewrite only those references
5. Run make untag (or the sed command below) to remove the tags

This is always preferable to bare -W, which rewrites every reference file regardless of whether the test failed.

Removing stale tags

Before adding new tags, remove any stale @tag decorators from previous sessions. There is usually a make untag target that does this, or you can use:

# macOS (BSD sed):
sed -i '' '/^[[:space:]]*@tag[[:space:]]*$/d' tests/test_mycode.py

# Linux (GNU sed):
sed -i '/^[[:space:]]*@tag[[:space:]]*$/d' tests/test_mycode.py

Writing Reference Tests with `unittest`

A minimal test file:

import os
from tdda.referencetest import ReferenceTestCase, tag

TESTDIR = os.path.join(os.path.dirname(os.path.abspath(__file__)),
                       'testdata')

class TestMyProcess(ReferenceTestCase):

    def test_output(self):
        result = run_my_process(input_data)
        self.assertStringCorrect(result,
                                 os.path.join(TESTDIR, 'expected.txt'))

    def test_dataframe(self):
        df = produce_dataframe()
        self.assertDataFrameCorrect(df,
                                    os.path.join(TESTDIR, 'expected.csv'))

if __name__ == '__main__':
    ReferenceTestCase.main()

When running under pytest, the if __name__ == '__main__': block is simply ignored—the same test file works with both runners unchanged.

Run it:

python test_myprocess.py -F           # run all tests
python test_myprocess.py -F -1        # run only tagged tests
python test_myprocess.py -F -1W       # regenerate references for tagged tests

The first time you run with -1W after writing a new test, it writes the reference file. Subsequent runs compare against it.

After writing references with -1W, always inspect the files that were written. The fact that the test now passes means only that the reference matches the output. It says nothing about whether either is correct.

Writing Reference Tests with `pytest`

The same test classes work under pytest, with different flags:

pytest tests/ --log-failures                           # run all tests
pytest tests/ --log-failures --tagged                  # run only tagged tests
pytest tests/ --log-failures --tagged --write-all -s   # regenerate references for tagged tests

Note:

Use --write-all instead of -W.
Use --tagged instead of -1.
Pass -s to prevent pytest from capturing output, so that tdda can report which reference files were written.
The short flags -W and -1 are tdda extensions; they only work when running the test file directly with Python, not under pytest.

Assertion API: Text and Strings

assertStringCorrect(string, ref_path, ...) Check an in-memory string against a reference text file.

assertTextFileCorrect(actual_path, ref_path, ...) Check a text file on disk against a reference text file.

assertTextFilesCorrect(actual_paths, ref_paths, ...) Check multiple text files against corresponding reference files.

All three share these optional parameters for handling variable output:

Parameter	Effect
`lstrip=True`	Strip leading whitespace from each line before comparing
`rstrip=True`	Strip trailing whitespace from each line before comparing
`ignore_substrings=['foo','bar']`	Ignore any line in the expected file containing one of these substrings; the corresponding actual line can be anything
`ignore_patterns=[r'pattern']`	Lines differing only in substrings matching these regexes pass; text outside the match must be identical in both
`remove_lines=['foo']`	Remove lines containing these substrings from both actual and expected before comparing
`preprocess=fn`	Apply `fn(list_of_lines)` to both actual and expected (as lists of strings) before comparing
`max_permutation_cases=N`	Pass if lines differ only in order, up to N permutations; `None` = unlimited

`ignore_substrings`—ignore whole lines by substring

Lines in the expected output containing the substring are skipped. The match is against the expected file only—the actual output can have anything on those lines (or nothing):

# Reference file contains:
#   Copyright (c) Stochastic Solutions Limited, 2016
#   Version 0.0.0
# Actual output has current year and version—but we don't care:
self.assertStringCorrect(actual, 'expected.html',
    ignore_substrings=['Copyright', 'Version'])

`ignore_patterns`—ignore variable substrings within a line

Lines pass if they differ only in parts matching the regex. Everything outside the match must be identical in both files:

# Actual:   "Generated: 2026-05-20T14:32:01 by pipeline v2.3.1"
# Expected: "Generated: 2024-01-15T09:00:00 by pipeline v1.0.0"
# Both lines still match with:
self.assertStringCorrect(actual, 'expected.txt',
    ignore_patterns=[
        r'\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}',
        r'v\d+\.\d+\.\d+',
    ])

ignore_patterns is stricter than ignore_substrings: the non-matching parts of each line must agree exactly, so you cannot accidentally mask a real change in the surrounding text.
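A simplified model of this rule is to mask every regex match on both sides and then require the remainders to be identical. The sketch below implements that model with the standard library; lines_match is a hypothetical helper, and tdda's real comparison may differ in detail.

```python
import re


def lines_match(actual, expected, patterns):
    """True if the lines are identical once regex matches are masked out."""
    for pattern in patterns:
        actual = re.sub(pattern, '<ignored>', actual)
        expected = re.sub(pattern, '<ignored>', expected)
    return actual == expected


patterns = [r'\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}', r'v\d+\.\d+\.\d+']
actual = "Generated: 2026-05-20T14:32:01 by pipeline v2.3.1"
expected = "Generated: 2024-01-15T09:00:00 by pipeline v1.0.0"

print(lines_match(actual, expected, patterns))                           # True
print(lines_match(actual.replace('by', 'via'), expected, patterns))      # False
```

The second call fails because the change lies outside the ignored regions, which is exactly the protection that ignore_substrings does not give.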

`remove_lines`—strip lines from both files

Lines containing the substring are removed from both actual and expected before comparing. Use this for optional or ephemeral lines that should not appear in the reference at all:

# Both files have lines like "WARNING: cache miss" that are
# present sometimes and absent other times:
self.assertStringCorrect(actual, 'expected.txt',
    remove_lines=['WARNING: cache miss'])

Unlike ignore_substrings, remove_lines strips from both sides, so the reference file also need not contain these lines.

`preprocess`—transform both files before comparing

Takes a function that accepts a list of strings (lines) and returns a transformed list. Applied to both actual and expected:

def strip_timestamps(lines):
    # remove leading timestamp prefix "2026-05-20 14:32:01 " from each line
    return [line[20:] if len(line) > 20 else line for line in lines]

self.assertStringCorrect(actual, 'expected.txt',
    preprocess=strip_timestamps)

`max_permutation_cases`—allow reordered lines

Pass if the lines are a permutation of each other, up to the given number of permutations. Use None for unlimited:

# Output order is non-deterministic, but the set of lines is fixed:
self.assertStringCorrect(actual, 'expected.txt',
    max_permutation_cases=None)
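
The underlying check, that two sets of lines are reorderings of each other, can be sketched with a multiset comparison. This is only an illustration of the idea; it does not model the limit that max_permutation_cases imposes, and is_permutation is a hypothetical helper:

```python
from collections import Counter


def is_permutation(actual_lines, expected_lines):
    """True if both lists hold the same lines the same number of times."""
    return Counter(actual_lines) == Counter(expected_lines)


expected = ['alpha', 'beta', 'beta', 'gamma']
reordered = ['beta', 'gamma', 'alpha', 'beta']
missing_one = ['beta', 'gamma', 'alpha']

reordered_ok = is_permutation(reordered, expected)
missing_ok = is_permutation(missing_one, expected)
```

Using Counter rather than set means duplicate lines still have to appear the right number of times.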

Assertion API: DataFrames

The DataFrame assertion methods work with Pandas 2.x and 3.x (all three backends: numpy_nullable, pyarrow, and original) and with Polars. You can even compare DataFrames across engines—e.g. a Pandas actual against a Polars reference—with the engine parameter if needed.

assertDataFramesEquivalent(df, ref_df, ...): Compare two in-memory DataFrames (Pandas or Polars).

assertDataFrameCorrect(df, ref_path, ...): Compare an in-memory DataFrame against a reference file (CSV or Parquet).

assertStoredDataFrameCorrect(actual_path, ref_path, ...): Compare two DataFrames both stored on disk.

assertStoredDataFramesCorrect(actual_paths, ref_paths, ...): Compare multiple pairs of on-disk DataFrames.

`check_data` and `check_types`—exclude columns

The most common use is excluding columns whose values are legitimately variable (random seeds, run IDs, timestamps):

# Exclude the 'random' column from both value and type checks:
columns = self.all_fields_except(['random'])
self.assertDataFrameCorrect(df, 'expected.csv',
                            check_data=columns,
                            check_types=columns)

check_data, check_types, and check_order all accept the same forms:
- None or True: check all fields (default)
- False: skip entirely
- a list of field names to check
- a function taking a DataFrame and returning a list of field names
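
One way to picture how these four forms reduce to a list of field names is the sketch below. It is an assumption about the shape of the logic, not tdda's code: resolve_fields is a hypothetical helper, and a plain dict of column lists stands in for a DataFrame so that the example runs without Pandas or Polars:

```python
def resolve_fields(spec, df):
    """Turn a check_data-style specification into a list of field names.

    df is a stand-in for a DataFrame: a dict mapping column names to
    lists of values.
    """
    if spec is None or spec is True:
        return list(df)          # check every field
    if spec is False:
        return []                # check nothing
    if callable(spec):
        return list(spec(df))    # let the function choose the fields
    return list(spec)            # an explicit list of field names


df = {'id': [1, 2], 'amount': [9.5, 3.0], 'random': [0.42, 0.17]}

everything = resolve_fields(None, df)
nothing = resolve_fields(False, df)
explicit = resolve_fields(['id'], df)
all_but_random = resolve_fields(
    lambda d: [name for name in d if name != 'random'], df)
```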

`sortby`—sort before comparing

Use when row order is non-deterministic:

self.assertDataFrameCorrect(df, 'expected.csv',
                            sortby=['country', 'date'])
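
The effect is the same as sorting both sides on the named fields and then comparing row by row. A small sketch with lists of dicts standing in for DataFrames (sort_rows is a hypothetical helper, not part of tdda):

```python
from operator import itemgetter


def sort_rows(rows, sortby):
    """Sort a list of row dicts by the given field names, in order."""
    return sorted(rows, key=itemgetter(*sortby))


actual = [
    {'country': 'UK', 'date': '2026-01-02', 'n': 5},
    {'country': 'FR', 'date': '2026-01-01', 'n': 3},
    {'country': 'UK', 'date': '2026-01-01', 'n': 4},
]
expected = [
    {'country': 'FR', 'date': '2026-01-01', 'n': 3},
    {'country': 'UK', 'date': '2026-01-01', 'n': 4},
    {'country': 'UK', 'date': '2026-01-02', 'n': 5},
]

keys = ['country', 'date']
same_when_sorted = sort_rows(actual, keys) == sort_rows(expected, keys)
```

The sort keys need to identify rows uniquely; otherwise rows that tie can still come out in different orders.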

`condition`—filter rows before comparing

Use when only a subset of rows is relevant to the test:

# Only compare rows where status is 'complete':
self.assertDataFrameCorrect(df, 'expected.csv',
    condition=lambda df: df['status'] == 'complete')

`precision`—floating-point tolerance

Default is 7 decimal places. Loosen it when values come via CSV (which can lose precision):

self.assertDataFrameCorrect(df, 'expected.csv', precision=5)
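
To see why a tolerance is needed at all, and what loosening it does, here is a sketch that reads "n decimal places" as an absolute tolerance of 10 to the power of minus n. That reading is my assumption for illustration; tdda's exact comparison rule may differ, and floats_agree is a hypothetical helper:

```python
def floats_agree(a, b, precision=7):
    """True if a and b differ by less than 10**-precision."""
    return abs(a - b) < 10 ** -precision


# Exact equality fails for values that are "obviously" the same.
exact = (0.1 + 0.2 == 0.3)
close = floats_agree(0.1 + 0.2, 0.3)

# Two values differing in the sixth decimal place:
tight = floats_agree(1500.123456, 1500.123461)               # precision 7
loose = floats_agree(1500.123456, 1500.123461, precision=5)  # precision 5
```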

`type_matching`—dtype strictness

'strict' (default for Parquet): dtypes must be identical
'medium' (default for CSV): same underlying type (int, float, datetime) but different bit width or nullability allowed
'loose': anything Pandas can compare

# CSV round-trips can change bit width or nullability (e.g. Int64 to int64), so use medium:
self.assertDataFrameCorrect(df, 'expected.csv', type_matching='medium')

`fuzzy_nulls`—treat different null types as equal

# NaN (np.nan) and None treated as equivalent:
self.assertDataFramesEquivalent(df, ref_df, fuzzy_nulls=True)
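
The reason this option exists is that different null representations do not compare equal to each other, and NaN does not even compare equal to itself. A minimal sketch of null-tolerant comparison for single values, using hypothetical helpers rather than anything from tdda:

```python
import math


def is_null(value):
    """True for None and for floating-point NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def equal_with_fuzzy_nulls(a, b):
    """Compare two values, treating every kind of null as the same."""
    if is_null(a) and is_null(b):
        return True
    return a == b


nan = float('nan')
plain_nan_equal = (nan == nan)                     # NaN != NaN in Python
fuzzy_nan_none = equal_with_fuzzy_nulls(nan, None)
fuzzy_zero_none = equal_with_fuzzy_nulls(0.0, None)
```

Note that a genuine value such as 0.0 is still different from a null.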

`engine`—Pandas or Polars

Inferred automatically from the DataFrames. Only needed when comparing across types (a Pandas actual against a Polars reference or vice versa):

self.assertDataFramesEquivalent(pandas_df, polars_df, engine='pandas')

`tdda diff`—Understanding DataFrame Failures

When a DataFrame assertion fails, the failure message suggests one or more diff commands. For tabular data, it often suggests both a raw diff and a tdda diff:

Compare with:
    diff /tmp/actual-expected.csv /path/to/testdata/expected.csv
Compare with:
    tdda diff /tmp/actual-expected.csv /path/to/testdata/expected.csv

tdda diff uses the same comparison logic as the assertion methods and produces a structured summary: which columns differ, how many rows, and a table showing the differing values side by side. It is much easier to read than raw diff for anything beyond a handful of rows. Always prefer it for DataFrame failures. Example output:

Columns with differences: 1 / 12
Rows with differences:    3 / 1000

Values:
  Row   Column    Actual    Expected
   42   revenue   1500.50   1500.00
  108   revenue      0.00       NaN
  731   revenue    999.99   1000.00

tdda diff accepts field-selection flags that correspond to the field-selection parameters of the assertion methods:

tdda diff actual.csv expected.csv --xfields random,run_id
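
To show what the counts in such a summary mean, here is a small sketch that computes the differing columns, differing rows, and side-by-side values for two tiny datasets. It only illustrates the shape of the summary; it is not how tdda diff works internally, and summarize_differences is a hypothetical helper operating on lists of dicts:

```python
def summarize_differences(actual_rows, expected_rows):
    """Collect (row, column, actual, expected) for every differing cell.

    Assumes both inputs have the same number of rows and the same columns.
    """
    diffs = []
    for index, (actual, expected) in enumerate(zip(actual_rows,
                                                   expected_rows)):
        for column in expected:
            if actual[column] != expected[column]:
                diffs.append((index, column,
                              actual[column], expected[column]))
    return {
        'columns': sorted({column for _, column, _, _ in diffs}),
        'rows': sorted({index for index, _, _, _ in diffs}),
        'values': diffs,
    }


actual_rows = [
    {'id': 1, 'revenue': 1500.50},
    {'id': 2, 'revenue': 20.00},
    {'id': 3, 'revenue': 999.99},
]
expected_rows = [
    {'id': 1, 'revenue': 1500.00},
    {'id': 2, 'revenue': 20.00},
    {'id': 3, 'revenue': 1000.00},
]
summary = summarize_differences(actual_rows, expected_rows)
```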

Assertion API: Binary Files

assertBinaryFileCorrect(actual_path, ref_path) Check that a binary file is byte-for-byte identical to a reference file. No options for partial matching—if you need that, extract the relevant data and use a string or DataFrame assertion instead.

Generating Tests Automatically with Gentest

If you have a command-line process—a script, a shell command, an R program—tdda gentest can generate a test suite for it:

tdda gentest 'python my_analysis.py input.csv' testsuite.py

Gentest runs the command multiple times, captures all outputs (stdout, stderr, exit code, any files written), detects which parts vary between runs, and writes a test script that checks the stable parts. The generated script uses tdda.referencetest and can be run and maintained like any other reference test.

Inspect the generated test and the reference outputs before trusting them. Gentest is good at generating structurally correct tests; you still need to verify that the reference outputs are actually correct.
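
The step of detecting which parts vary between runs can be illustrated with a few lines of standard-library Python. This is a toy version of the idea, not Gentest's algorithm: the command, the helper names, and the line-by-line comparison are all assumptions made for the example:

```python
import subprocess
import sys


def run_once(cmd):
    """Run a command and return its stdout as a list of lines."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.stdout.splitlines()


def varying_line_numbers(runs):
    """Return indices of lines that are not identical across all runs.

    Assumes every run produced the same number of lines.
    """
    return [index for index, versions in enumerate(zip(*runs))
            if len(set(versions)) > 1]


# Hypothetical command: the first line is stable, the second contains a
# freshly generated UUID and so differs on every run.
cmd = [sys.executable, '-c',
       "import uuid; print('rows: 10'); print('run id:', uuid.uuid4())"]

runs = [run_once(cmd) for _ in range(2)]
unstable = varying_line_numbers(runs)
```

A generated test would then check the stable lines exactly and exclude or pattern-match the unstable ones, which is why the reference outputs still need a human look.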

The Reference Test Checklist

☐ Create at least one reference test for every analytical process you write.
☐ Run tests before making changes, so you know the baseline.
☐ Run tests after making changes, before assuming they worked.
☐ When a test fails, read the diff before doing anything else.
☐ Never run -W without first verifying the new output is correct.
☐ Prefer -1W (or --tagged --write-all -s) over bare -W—rewrite only the references that actually failed.
☐ Use -F and tdda tag to automatically tag failing tests for targeted reruns and rewrites.
☐ After writing references, inspect the files. Tests passing after -W or --write-all is not evidence of correctness.
☐ Ensure reference files are clean in git before running -W, so you can use git diff to review changes and revert with git checkout -- testdata/ if needed.
☐ Consider unit-enhanced reference tests for anything with checkable semantic structure.
☐ Add a regression test for every bug you fix.
☐ 1 test vs. 0 tests is a bigger difference than 100 vs. 1.

TDDA: The Book, the 3.0 Library, and the PyData London 2026 Tutorial

Posted on Tue 19 May 2026 in TDDA • Tagged with library, talk, book

This blog has been quite quiet, but there is a great deal of news and it may be less quiet for a while.

The Book

Today, 19th May 2026, sees the world-wide release of Test-Driven Data Analysis, from CRC Press.

It is available from all good booksellers and all sellers of good books, and until 30th June 2026 the code 26SMA1 will give a 20% discount from the publisher's site.

The book covers:

the TDDA methodology
- including areas not obviously amenable to software support, such as errors of interpretation, errors of applicability, errors of process, and errors of judgement
the TDDA command-line tools for
- data validation,
- reference-test generation with Gentest (tests for code in any language),
- comparing on-disk data frames (Parquet files and flat files) with a diff tool,
- working with the tdda.serial format and also with CSVW (CSV on the Web) and Frictionless.
Reference testing with tdda.referencetest under unittest or pytest
Test-Driven Document Development (TDDD)
APIs for all functionality

Resources from the book are available at book.tdda.info, including

22 Checklists
All figures
Glossary
Data Profiles
Data Dictionaries
TDDD tests for the book.

Examples from the book are available from the tdda library by using the tdda command:

tdda examples book

The whole of TDDA is really built around the encapsulation of the data-analysis cycle shown below, and the diagram shows how the book covers these ideas.

The TDDA Library, Version 3.0

Top line: three machines illustrating
1. constraint discovery and data validation: an input hopper takes training data and produces constraints, or takes training data + constraints and produces data validations at the output chute;
2. Rexpy, which takes strings in its input hopper and produces regular expressions at the output chute;
3. TDDA gentest, which takes code in the input hopper and produces a Python reference-test script as output.
Bottom line:
4. tdda diff, which compares data in flat files and Parquet files to detect (semantic) differences;
5. tdda.serial, which is a format for describing flat-file formats and a suite of tools for working with tdda.serial, CSVW, and Frictionless;
6. tdda.referencetest, for semantic testing of complex analytical results.

Version 3.0 of the library and command-line tools is a major upgrade.

All the main features have upgrades:

Data validation using constraints, which can be generated from training data.
Inference of regular expressions from example strings.
Automatic generation of tests for almost any non-GUI code in any language (Gentest).
"Gentest writes tests so you don't have to."™
Enhanced test support for complex results, with reference testing in both Python's unittest and pytest.

New features include:

Support for Pandas 3.0, including all three backends (original, numpy_nullable, and pyarrow).
Support for Polars DataFrames in most areas of the library.
Comprehensive Parquet support, replacing feather format.
tdda diff: find and visualize differences between datasets in flat files (like CSV files) and parquet files, with control over specificity and scope.
Flat-file metadata support: the new tdda.serial format allows the format of CSV and other flat files to be described for accurate reading across libraries. This includes inference of flat-file formats, Python code generation, helper functions for reading and writing flat files with metadata, and conversion between tdda.serial, CSVW (CSV on the Web), and Frictionless.
Text utilities for Unicode, including glyph counting and extended normalization forms beyond canonical composition and decomposition (NFC, NFD), and kompatibility normalization (NFKC and NFKD). Form NFTK performs further kompatibility normalization including accent stripping.
Man pages for all commands
Upgraded documentation for command line tools and the API.
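
For readers unfamiliar with the standard normalization forms mentioned in the text-utilities item above, Python's built-in unicodedata module shows what they do. This sketch uses only the standard library, not the tdda text utilities. The strip_accents helper is a common approximation of accent stripping and is my own assumption; the exact rules of the NFTK form are not reproduced here:

```python
import unicodedata


def strip_accents(text):
    """Decompose with NFKD, then drop combining marks such as accents."""
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed
                   if not unicodedata.combining(ch))


# NFC composes 'e' + combining acute accent into a single code point.
composed = unicodedata.normalize('NFC', 'e\u0301')

# NFD does the reverse: one code point becomes base letter + accent.
decomposed = unicodedata.normalize('NFD', '\u00e9')

# NFKC also replaces compatibility characters, such as the 'fi' ligature.
compat = unicodedata.normalize('NFKC', '\ufb01')

plain = strip_accents('caf\u00e9 \ufb01n')
```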

PyData London TDDA Tutorial, 5th June 2026, 14:10

I'll be giving a 90-minute hands-on tutorial on TDDA on 5th June 2026 at PyData London. Do come along if you can. PyData is always great, for experts and novices and all levels of technical interest and proficiency. It would be great to see you there.

Get tickets from PyData.

And if you have something to share, prepare a 5-minute Lightning Talk. They are always a highlight of the conference.

Older Posts

Data Validation with Constraints in TDDA

The Workflow

Development phase (training data, then holdout data)

Deployment phase (operational data)

What happens when you skip the development phase

A Worked Example: Elements 92 to 118

Reading and Editing the .tdda File

Editing the file

The Three CLI Commands

CSV and flat-file input

tdda discover

tdda verify

tdda detect

The Design Philosophy: Bring Data to the Constraints

Pattern 1: Derived columns for cross-column constraints

Pattern 2: Roll-up constraints for aggregate checks

Pattern 3: Regularizing measurements for non-tabular data

Python API

Database Support

Checklist

Further Reading

What CSV Data Loses in Transit

When Metadata Is Worth Using

Three Metadata Formats

tdda.serial (.serial files)

CSVW

Frictionless

Interoperability

The .serial File Format

Dataset-level keys in tdda.serial

The fields entry

Per-field keys

Field types

Date format specifications

A complete example

Reading with tdda.serial

Reading a format you know

Reading an unfamiliar format

Writing with tdda.serial

pandas_to_csv

Writing from Polars

Writing a format-only .serial (no field info)

The tdda serial CLI: Conversion and Code Generation

Format conversion

Generating Python code

Using Metadata with Other tdda Tools

Checklist and Recommended Agent Behaviour

Further Reading

What Is a Reference Test?

The -W (--write-all) Trap

Safe use of -W: the git audit pattern

Unit-Enhanced Reference Tests

The -F (--log-failures) Flag

With unittest (running directly with Python)

With pytest

Permanent default

The Kicker: the -W Problem is Not Restricted to TDDA

Running a Subset of Tests with Tags

With unittest (running directly with Python)

With pytest

The full workflow with tdda tag

Removing stale tags

Writing Reference Tests with unittest

Writing Reference Tests with pytest

Assertion API: Text and Strings

ignore_substrings—ignore whole lines by substring

ignore_patterns—ignore variable substrings within a line

remove_lines—strip lines from both files

preprocess—transform both files before comparing

max_permutation_cases—allow reordered lines

Assertion API: DataFrames

check_data and check_types—exclude columns

sortby—sort before comparing

condition—filter rows before comparing

precision—floating-point tolerance

type_matching—dtype strictness

fuzzy_nulls—treat different null types as equal

engine—Pandas or Polars

tdda diff—Understanding DataFrame Failures

Assertion API: Binary Files
