TDDA: The Book, the 3.0 Library, and the PyData London 2026 Tutorial

Posted on Tue 19 May 2026 in TDDA

This blog has been quite quiet, but there is a great deal of news and it may be less quiet for a while.

The Book

Today, 19th May 2026, sees the world-wide release of Test-Driven Data Analysis, from CRC Press.

The cover of the book Test-Driven Data Analysis by Nicholas J. Radcliffe. It is published by Chapman and Hall, part of CRC Press, from Taylor & Francis Group, and is part of the DATA SCIENCE SERIES. The cover is black with mostly white text and a white graphic. The graphic is a 3-row by 4-column grid of squares. Each square contains a number of dots laid out on a regular 32x32 grid. The top-left square has 1024 dots (“full”) and working along each row in turn, the number of dots roughly halves each time, apparently at random (and, actually, pseudo-randomly). The last row’s boxes have six, two, two, and one dot.

It is available from all good booksellers and all sellers of good books, and until 30th June 2026 the code 26SMA1 will give a 20% discount from the publisher's site.

The book covers:

  • the TDDA methodology
    • including areas not obviously amenable to software support, such as errors of interpretation, errors of applicability, errors of process, and errors of judgement
  • the TDDA command-line tools for
    • data validation,
    • reference-test generation with Gentest (tests for code in any language),
    • diffing on-disk data frames (stored as parquet files and flat files),
    • working with the tdda.serial format and also with CSVW (CSV on the Web) and Frictionless
  • reference testing with tdda.referencetest under unittest or pytest
  • Test-Driven Document Development (TDDD)
  • APIs for all functionality

Resources from the book are available at book.tdda.info, including

  • 22 Checklists
  • All figures
  • Glossary
  • Data Profiles
  • Data Dictionaries
  • TDDD tests for the book.

Examples from the book are available from the tdda library by using the tdda command:

tdda examples book

TDDA is built around the data-analysis cycle shown below, and the diagram also shows which chapters of the book cover each stage of that cycle.

The main part of the diagram consists of six circles from left to right. The first five circles have failure-mode text under them and an error class below that.
1. CHOOSE APPROACH. Failure: 'Fail to understand data, problem domain, or methods'. ERROR OF INTERPRETATION (error of formulation). Ch 13.
2. DEVELOP ANALYTICAL PROCESS. Failure: 'Mistakes during coding'. ERROR OF IMPLEMENTATION (bug). Ch 9-12.
3. RUN ANALYTICAL PROCESS. Failure: 'Use the software incorrectly'. ERROR OF PROCESS (operator error). Ch 16.
4. PRODUCE ANALYTICAL RESULTS. Failure: 'Mismatch between development data or assumptions and deployment data'. ERROR OF APPLICABILITY (category error). Ch 1-7 & 17.
5. INTERPRET ANALYTICAL RESULTS. Failure: 'Misinterpret the results'. ERROR OF INTERPRETATION (communication error). Ch 14 & 15.
6. 'First, Do No Harm'. ERROR OF JUDGEMENT. Ch 17.
Arrows lead to FAILURE and SUCCESS boxes. Remedies and book chapters sit underneath the main diagram.

The TDDA Library, Version 3.0

Top Line: Three machines illustrating
1. constraint discovery and data validation: an input hopper takes training data and produces constraints, or takes training data + constraints and produces data validations at the output chute,
2. Rexpy, which takes strings in its input hopper and produces regular expressions at the output chute,
3. TDDA gentest, which takes code in the input hopper and produces a Python reference-test script as output.
Bottom Line:
4. tdda diff, which compares data in flat files and parquet files to detect (semantic) differences,
5. tdda.serial, which is a format for describing flat-file formats and a suite of tools for working with tdda.serial, CSVW, and Frictionless,
6. tdda.referencetest, for semantic testing of complex analytical results.

Version 3.0 of the library and command-line tools is a major upgrade.

All the main features have upgrades:

  • Data validation using constraints, which can be generated from training data.

  • Inference of regular expressions from example strings.

  • Automatic generation of tests for almost any non-GUI code in any language (Gentest).
    "Gentest writes tests so you don't have to."™

  • Enhanced test support for complex results, through reference testing, under both Python's unittest and pytest.
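
For readers new to the idea behind the first bullet above, here is a minimal sketch of constraint discovery and validation in plain Python. It is not the tdda API: the helper names discover and verify, the constraint kinds, and the sample data are all assumptions made up for illustration, and the real library supports far more than this.

```python
# Illustrative sketch only: this is NOT the tdda API.
# discover() and verify() are hypothetical helper names.

def discover(rows):
    """Infer simple per-field constraints (type, min, max, nulls) from training rows."""
    constraints = {}
    fields = {name for row in rows for name in row}
    for name in sorted(fields):
        values = [row.get(name) for row in rows]
        present = [v for v in values if v is not None]
        c = {"allow_null": len(present) < len(values)}
        if present:
            c["type"] = type(present[0]).__name__
            c["min"] = min(present)
            c["max"] = max(present)
        constraints[name] = c
    return constraints


def verify(rows, constraints):
    """Return (row_index, field, reason) for every constraint failure found."""
    failures = []
    for i, row in enumerate(rows):
        for name, c in constraints.items():
            v = row.get(name)
            if v is None:
                if not c["allow_null"]:
                    failures.append((i, name, "null"))
                continue
            if "type" in c and type(v).__name__ != c["type"]:
                failures.append((i, name, "type"))
            elif "min" in c and not (c["min"] <= v <= c["max"]):
                failures.append((i, name, "range"))
    return failures


# Made-up training data from which constraints are discovered.
training = [
    {"age": 34, "country": "GB"},
    {"age": 51, "country": "FR"},
    {"age": 19, "country": "DE"},
]
constraints = discover(training)

# Made-up new data to validate against those constraints.
new_data = [
    {"age": 40, "country": "FR"},    # satisfies every constraint
    {"age": 130, "country": "GB"},   # age is above the training maximum
    {"age": None, "country": "DE"},  # nulls were never seen in training
]
failures = verify(new_data, constraints)
print(failures)
```

The point of the sketch is the two-step shape: constraints are learned from data believed to be good, then later data is checked against them, so that surprises surface as explicit failures rather than silently flowing into results.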

New features include:

  • Support for Pandas 3.0, including all three backends (original, numpy_nullable, and pyarrow).

  • Support for Polars DataFrames in most areas of the library.

  • Comprehensive Parquet support, replacing feather format.

  • tdda diff: find and visualize differences between datasets in flat files (like CSV files) and parquet files, with control over specificity and scope.

  • Flat-file metadata support: the new tdda.serial format allows the format of CSV and other flat files to be described for accurate reading across libraries. This includes inference of flat-file formats, Python code generation, helper functions for reading and writing flat files with metadata, and conversion between tdda.serial, CSVW (CSV on the Web), and Frictionless.

  • Text utilities for Unicode, including glyph counting and extended normalization forms that go beyond canonical composition and decomposition (NFC and NFD) and kompatibility normalization (NFKC and NFKD). Form NFTK performs further kompatibility normalization, including accent stripping.

  • Man pages for all commands.

  • Upgraded documentation for the command-line tools and the API.
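
As background to the Unicode bullet above, the four standard normalization forms can be explored with Python's built-in unicodedata module. The sketch below uses only the standard library, not tdda's text utilities. The strip_accents helper is a hypothetical name showing one common way to remove accents, and it should not be read as the definition of NFTK.

```python
import unicodedata


def strip_accents(text):
    """Hypothetical helper: kompatibility-decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# 'fiancée' written with the fi ligature (U+FB01) and a precomposed e-acute (U+00E9).
s = "\ufb01anc\u00e9e"

# Number of code points in each standard normalization form.
lengths = {
    form: len(unicodedata.normalize(form, s))
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}
print(lengths)           # {'NFC': 6, 'NFD': 7, 'NFKC': 7, 'NFKD': 8}
print(strip_accents(s))  # fiancee
```

The canonical forms (NFC, NFD) leave the ligature alone and only compose or decompose the accent, while the kompatibility forms (NFKC, NFKD) also expand the ligature into separate letters. None of the four removes the accent, which is why accent stripping needs a further step.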

PyData London TDDA Tutorial, 5th June 2026, 14:10

I'll be giving a 90-minute hands-on tutorial on TDDA on 5th June 2026 at PyData London. Do come along if you can. PyData is always great, for experts and novices and all levels of technical interest and proficiency. It would be great to see you there.

Get tickets from PyData.

And if you have something to share, prepare a 5-minute Lightning Talk. They are always a highlight of the conference.