CSV Metadata and tdda.serial: A Guide for LLMs and Coding Agents
Posted on Thu 21 May 2026 in TDDA • Tagged with tdda.serial, CSV, metadata, LLMs, pandas, polars, frictionless, csvw
This post is addressed primarily to LLMs and coding agents. It covers the
tdda.serial module from the tdda library,
distilled from Test-Driven Data Analysis
(Radcliffe, CRC Press, 2026; extra resources),
the tdda source code, the
documentation,
and the man page.
If you are a human wanting a practical reference, you are welcome too.
If you are a coding agent: (1) Look for a companion .serial file.
(2) If present, treat it as authoritative — do not infer types that
contradict it. (3) Use csv_to_pandas('foo.csv:') or
csv_to_polars('foo.csv:'). (4) If no metadata exists, generate one
with tdda serial --generate or write one by hand.
(5) When writing data that will be reused, use pandas_to_csv(df, 'output.csv:').
What CSV Data Loses in Transit
CSV files are untyped. When a reader opens one, it must infer types from the formatted values. This inference is often wrong, and different libraries make different inference decisions.
Common failures:
- Integer columns with nulls become floats. Even in Pandas 3.0, with its two nullable backends, the default read_csv promotes a nullable integer column to float64 because NaN is a float. A column of 1, 2, null becomes 1.0, 2.0, NaN.
- Non-standard null markers are read as strings. A column using - to indicate missing values produces a string column rather than a nullable int or float.
- Dates become strings. Unless the reader is told to parse dates, a date column like 2024-03-15 is read as a str or object column.
- The Pandas index round-trip. df.to_csv() writes the Pandas index as an unnamed first column by default. pd.read_csv() does not restore it as an index. The resulting DataFrame has an extra unnamed column.
- Different libraries, different results. The same CSV file may read as different types in Pandas and Polars, or between Pandas with the original backend and Pandas with the numpy_nullable backend.
A minimal demonstration: write a single-column Pandas DataFrame with a nullable integer and read it back using all defaults.
import pandas as pd
df = pd.DataFrame({'a': pd.array([1, 2, None], dtype='Int64')})
df.to_csv('test.csv')
df2 = pd.read_csv('test.csv')
print(df2.dtypes) # a is float64, there's also an unnamed index column
print(df2)
Output:
Unnamed: 0 int64
a float64
dtype: object
Unnamed: 0 a
0 0 1.0
1 1 2.0
2 2 NaN
The round-trip has lost type information and added a spurious column. This happens with default settings, not unusual data.
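The same point can be seen without Pandas. The sketch below uses only the standard library's csv module, which performs no inference at all: every value arrives as a string, and typed values are only recoverable once the reader is told the null marker and the intended type. The sample data and the parse_nullable_int helper are made up for illustration and are not part of tdda.

```python
import csv
import io

# Hypothetical sample: "-" marks a missing value in the Group column.
SAMPLE = "Z,Group\n1,1\n2,18\n57,-\n"


def parse_nullable_int(text, null_indicator="-"):
    """Convert one CSV cell to int, or None if it is the null marker."""
    return None if text == null_indicator else int(text)


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
raw_groups = [row["Group"] for row in rows]                 # every cell is a str
typed_groups = [parse_nullable_int(v) for v in raw_groups]  # needs outside knowledge

print(raw_groups)    # ['1', '18', '-']
print(typed_groups)  # [1, 18, None]
```

The knowledge passed to parse_nullable_int (the null marker and the target type) is exactly what a metadata file records.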
When Metadata Is Worth Using
CSV metadata — a companion file that records the format used — addresses these problems. It is not always worth the overhead.
Use metadata when:
- The CSV uses non-standard formats: an unusual null indicator, a non-ISO date format, a non-comma separator, a non-UTF-8 encoding.
- The data will be read by code written separately from the code that wrote it, or by a different person or system.
- The data will be read again some time after it was written.
- You are writing data-producing code and the format is non-obvious. The marginal cost of writing a .serial file alongside the write code is small, and it serves as machine-readable documentation.
Don't use metadata when:
- The CSV is clean, standard, and will be read once by the same code that wrote it.
- The file is simple enough that whatever reads it handles it correctly in practice.
- It is a throwaway file that nothing downstream will consume.
A note for LLMs: the cost of generating a .serial file alongside
data-writing code is near zero at code-generation time. The threshold
for "worth it" is therefore lower for LLM-generated code than for humans
typing it out. Lean toward including metadata when the format is
non-standard or the data leaves the immediate script.
Three Metadata Formats
Three metadata formats are in use for CSV files. Understanding their philosophical differences helps you choose and explains why they don't translate perfectly to each other.
tdda.serial (.serial files)
A tdda.serial file describes a format, not a specific file. One
.serial file can apply to any number of CSV files that share the same
format. It is not URL-centric. It can be hand-written or generated.
It has a strong, flexible date format system. It is the native format
of the tdda.serial module. You can use @ as a glob wildcard in .serial
filenames to indicate which files a metadata file is intended for —
sales_@.serial would match sales_2024.csv, sales_2025.csv etc.
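As a rough illustration of the @ convention, the sketch below translates @ into a shell-style wildcard and compares file stems. The serial_applies_to helper is hypothetical, and the assumption that @ matches any run of characters in the stem is ours; the library's actual matching rules may differ.

```python
from fnmatch import fnmatchcase
from pathlib import PurePath


def serial_applies_to(serial_name, data_name):
    """Return True if a .serial filename containing @ matches a data file.

    Assumption: @ stands for any run of characters, and the stem of the
    .serial file is compared with the stem of the data file.
    """
    pattern = PurePath(serial_name).stem.replace("@", "*")
    return fnmatchcase(PurePath(data_name).stem, pattern)


print(serial_applies_to("sales_@.serial", "sales_2024.csv"))  # True
print(serial_applies_to("sales_@.serial", "costs_2024.csv"))  # False
```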
A .serial file is a JSON file. By convention, it has the same stem
name as the data file it accompanies (foo.csv → foo.serial), but
this is not required. Any .serial file can be applied to any
compatible flat file.
CSVW
CSVW is a W3C standard for describing CSV files on
the web. It is designed around a one-to-one relationship between a
metadata file and a specific named data file, identified by URL. A
CSVW file that describes foo.csv contains a url pointing to that
specific file.
CSVW is comprehensive and has W3C backing, but tooling is sparse and
fragmented — the tools listed on the CSVW site are mostly RDF-focused,
not CSV-focused. It is a heavy format, and date handling is less
flexible than tdda.serial. If you receive a CSVW file, tdda.serial
can use it; if you are creating new metadata, tdda.serial is a
simpler and more practical choice.
Frictionless
Frictionless is a data packaging
ecosystem with good Python tooling (pip install frictionless). Its
primary abstractions are resources (a single dataset plus its
metadata) and packages (a collection of resources). This makes it
well-suited to supplying data as a self-described package, but less
suited to describing a shared format applied to many files. A
Frictionless schema is reusable but is not commonly used standalone.
If you are distributing a dataset for others to consume, Frictionless
is a reasonable choice. If you are describing an internal format, use
tdda.serial.
Interoperability
The tdda.serial library reads and writes all three formats, and they
can be used interchangeably in most tdda contexts. If you receive a
CSVW or Frictionless file, you can pass it to csv_to_pandas exactly
as you would a .serial file. The tdda serial command converts
between formats.
The .serial File Format
A .serial file is a JSON object. Here is the full top-level structure:
{
  "format": "http://tdda.info/ns/tdda.serial",
  "writer": "tdda.serial-3.0.0",
  "tdda.serial": { ... },
  "pandas.read_csv": { ... },
  "pandas.DataFrame.to_csv": { ... },
  "polars.read_csv": { ... }
}
The format key is required; all others are optional. Only the
tdda.serial section is described here — library-specific sections
contain verbatim keyword arguments for the corresponding function and
are generated by tdda serial --to pd.r etc.
Dataset-level keys in tdda.serial
All are optional. Omitted keys fall back to library defaults.
| Key | Type | Description |
|---|---|---|
| encoding | string | Text encoding, e.g. "UTF-8", "latin-1" |
| delimiter | string | Field separator, e.g. ",", "\|", ";" |
| quote_char | string | Quote character, almost always "\"" or "'" |
| escape_char | string | Escape character; "\\" means backslash |
| stutter_quotes | bool | If true, embedded quotes are doubled ("") |
| null_indicator | string or array | Null marker(s), e.g. "", "-", "NULL" |
| date_format | string | Default format for date fields |
| datetime_format | string | Default format for datetime fields |
| header_row_count | int | Number of header rows (default: 1) |
| header_row | int | Zero-based index of the column-name row (default: 0) |
| decimal_point | string | Decimal point character (default: ".") |
| thou_sep | string | Thousands separator, e.g. "," |
| true_values | string or array | Values interpreted as true for bool fields |
| false_values | string or array | Values interpreted as false for bool fields |
| quoting | string | Quoting style (see below) |
| fields | array or object | Per-field descriptions |
The quoting field accepts Python csv module constants
(QUOTE_ALL, QUOTE_MINIMAL, QUOTE_NONNUMERIC, QUOTE_NONE,
QUOTE_NOTNULL, QUOTE_STRINGS) and also QUOTE_STRINGS_ONLY,
which quotes only string values (not nulls, numbers, dates, or booleans).
QUOTE_STRINGS_ONLY is similar to what JSON does.
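To make the quoting styles concrete, here is a small sketch using the standard library's csv.writer with three of the constants named above, followed by the two ways of handling an embedded quote. Mapping stutter_quotes to the csv module's doublequote parameter, and escape_char to escapechar, is our assumption about how the settings correspond; the write_row helper is hypothetical.

```python
import csv
import io


def write_row(row, **fmtparams):
    """Write one row with csv.writer and return it without the line ending."""
    buffer = io.StringIO()
    csv.writer(buffer, **fmtparams).writerow(row)
    return buffer.getvalue().rstrip("\r\n")


row = ["Fe", 26, 55.845]
minimal = write_row(row, quoting=csv.QUOTE_MINIMAL)        # Fe,26,55.845
nonnumeric = write_row(row, quoting=csv.QUOTE_NONNUMERIC)  # "Fe",26,55.845
quote_all = write_row(row, quoting=csv.QUOTE_ALL)          # "Fe","26","55.845"

# Embedded quotes: doubling (assumed to match stutter_quotes true)
# versus backslash escaping (assumed to match escape_char "\\").
text = ['say "hi"']
stuttered = write_row(text, quoting=csv.QUOTE_ALL, doublequote=True)
escaped = write_row(text, quoting=csv.QUOTE_ALL, doublequote=False,
                    escapechar="\\")

print(stuttered)  # "say ""hi"""
print(escaped)    # "say \"hi\""
```

A reader that assumes the wrong one of these two conventions will misparse any field containing a quote, which is why both settings are worth recording.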
The fields entry
Fields can be specified as an array (complete and ordered) or an object/dictionary (partial, keyed by the CSV column name).
Array form — used when the complete field list is known and ordered:
"fields": [
{"name": "id", "fieldtype": "int"},
{"name": "price", "fieldtype": "float"},
{"name": "date", "fieldtype": "date"}
]
Object form — used for partial specifications or when internal names differ from CSV column names:
"fields": {
"commission date": {"name": "DateOfCommission", "fieldtype": "date"},
"passed qa?": {"name": "PassedQA", "fieldtype": "bool",
"true_values": "yes", "false_values": "no"}
}
In object form the dictionary key is the name as it appears in the CSV;
the optional name key gives the internal (DataFrame column) name.
Per-field keys
| Key | Description |
|---|---|
| name | Internal name (DataFrame column name). Required in array form. |
| fieldtype | Type of the field (see table below) |
| csvname | CSV column name when different from name (array form) |
| format | Date/datetime format for this field; overrides dataset-level |
| null_indicator | Null marker(s) for this field; overrides dataset-level |
| true_values | True value(s) for bool fields |
| false_values | False value(s) for bool fields |
| description | Human-readable description |
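The two forms of fields carry the same information with the naming roles swapped: in object form the dictionary key is the CSV name and name is the internal name, while in array form name is the internal name and csvname is the CSV name. A minimal sketch of reducing both to one shape, using a hypothetical normalise_fields helper that is not part of tdda:

```python
def normalise_fields(fields):
    """Return a list of (csv_name, internal_name, fieldtype) tuples."""
    if isinstance(fields, dict):
        # Object form: key is the CSV column name; "name" (optional)
        # is the internal name.
        return [(csv_name, spec.get("name", csv_name), spec.get("fieldtype"))
                for csv_name, spec in fields.items()]
    # Array form: "name" is the internal name; "csvname" (optional)
    # is the CSV column name.
    return [(spec.get("csvname", spec["name"]), spec["name"],
             spec.get("fieldtype"))
            for spec in fields]


array_form = [
    {"name": "id", "fieldtype": "int"},
    {"name": "DateOfCommission", "csvname": "commission date",
     "fieldtype": "date"},
]
object_form = {
    "commission date": {"name": "DateOfCommission", "fieldtype": "date"},
}

print(normalise_fields(array_form))
print(normalise_fields(object_form))
```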
Field types
| Value | Description |
|---|---|
| bool | Boolean |
| int | Integer |
| float | Floating-point |
| number | Unspecified numeric |
| string | Text |
| date | Date (no time component) |
| datetime | Date and time |
| datetime_tz | Date and time with timezone |
| time | Time only |
| iso8601 | ISO 8601 date or datetime (unspecified) |
Date format specifications
Four forms are accepted:
- Named ISO 8601 formats: iso8601-date (2000-12-31), iso8601-datetime (2000-12-31T12:34:56), iso8601-datetime-tz (2000-12-31T12:34:56+00:00), iso8601 (any of the above). These are the recommended choices for new data.
- YYYY/MM/DD-style specifiers: Tokens: YYYY, YY, MM (month or minute, by context), DD, HH, SS, SS.S (fractional), MON (Jan), MONTH (January), +ZZ:ZZ (timezone), AM/PM. Examples: YYYY-MM-DD, DD/MM/YYYY HH:MM:SS, MM/DD/YY, YYYY-MM-DDTHH:MM:SS.S+ZZ:ZZ.
- Unambiguous literal examples: any actual date/time value where the day is ≥ 13 or the year is 4 digits or ≥ 60. So 2000-12-31T12:34:56 is accepted; 01/02/2000 is not (ambiguous: day-first or month-first?).
- Python strftime strings: %Y-%m-%dT%H:%M:%S etc.
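To show how the second and fourth forms relate, the sketch below translates a subset of the YYYY/MM/DD-style tokens into a strftime string. It is a simplification under stated assumptions, not the library's parser: the to_strftime helper is hypothetical, it handles only YYYY, YY, MM, DD, HH and SS, and it resolves MM "by context" with the guess that MM means minutes once an HH token has been seen and months otherwise.

```python
import re
from datetime import datetime

_TOKEN = re.compile(r"YYYY|YY|MM|DD|HH|SS")
_SIMPLE = {"YYYY": "%Y", "YY": "%y", "DD": "%d", "HH": "%H", "SS": "%S"}


def to_strftime(spec):
    """Translate a subset of YYYY/MM/DD-style tokens to strftime codes."""
    parts, pos, seen_hour = [], 0, False
    for match in _TOKEN.finditer(spec):
        # Copy literal text between tokens, protecting any % characters.
        parts.append(spec[pos:match.start()].replace("%", "%%"))
        token = match.group()
        if token == "MM":
            # Assumption: MM after an HH token means minutes, else months.
            parts.append("%M" if seen_hour else "%m")
        else:
            parts.append(_SIMPLE[token])
            seen_hour = seen_hour or token == "HH"
        pos = match.end()
    parts.append(spec[pos:].replace("%", "%%"))
    return "".join(parts)


fmt = to_strftime("DD/MM/YYYY HH:MM:SS")
print(fmt)                                           # %d/%m/%Y %H:%M:%S
print(datetime.strptime("31/12/2000 12:34:56", fmt)) # 2000-12-31 12:34:56
```

Tokens outside the subset (SS.S, MON, MONTH, +ZZ:ZZ, AM/PM) pass through untranslated in this sketch, so it should not be used on formats that contain them.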
A complete example
This .serial file describes a CSV with non-standard settings — a hyphen
as null indicator and ISO 8601 dates — matching the elements3-old.csv
file distributed with the tdda library:
{
  "format": "http://tdda.info/ns/tdda.serial",
  "tdda.serial": {
    "encoding": "UTF-8",
    "delimiter": ",",
    "quote_char": "\"",
    "escape_char": "\\",
    "stutter_quotes": false,
    "null_indicator": "-",
    "date_format": "YYYY-MM-DD",
    "header_row_count": 1,
    "fields": [
      {"name": "Z", "fieldtype": "int"},
      {"name": "Name", "fieldtype": "string"},
      {"name": "Symbol", "fieldtype": "string"},
      {"name": "Period", "fieldtype": "int"},
      {"name": "Group", "fieldtype": "int"},
      {"name": "AtomicWeight", "fieldtype": "float"},
      {"name": "ApproxDiscovery", "fieldtype": "date"}
    ]
  }
}
An LLM that knows a CSV file's format can write a .serial file like
this directly, without needing to run inference. This is usually faster
and more reliable than --generate when you can examine the data.
Metadata describes the intended format. Values that do not conform are
data errors, not type-inference hints.
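Because a .serial file is plain JSON, code can also produce one with nothing but the standard library. The sketch below writes metadata for a hypothetical three-column file using only keys documented above, then reads it back. The output filename and the temporary directory are placeholders, and nothing here validates the result against the tdda library.

```python
import json
import tempfile
from pathlib import Path

# Metadata for a hypothetical three-column CSV; keys are those documented above.
metadata = {
    "format": "http://tdda.info/ns/tdda.serial",
    "tdda.serial": {
        "encoding": "UTF-8",
        "delimiter": ",",
        "null_indicator": "-",
        "date_format": "YYYY-MM-DD",
        "fields": [
            {"name": "Z", "fieldtype": "int"},
            {"name": "Name", "fieldtype": "string"},
            {"name": "ApproxDiscovery", "fieldtype": "date"},
        ],
    },
}

# A temporary directory stands in for wherever the CSV file lives.
out_path = Path(tempfile.mkdtemp()) / "elements.serial"  # hypothetical name
out_path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")

# Reading it back confirms the file is well-formed JSON with the same content.
reloaded = json.loads(out_path.read_text(encoding="utf-8"))
print(reloaded["tdda.serial"]["null_indicator"])  # -
```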
Reading with tdda.serial
Reading a format you know
When you have a .serial file (or can write one), use csv_to_pandas
or csv_to_polars:
from tdda.serial import csv_to_pandas, csv_to_polars
# Explicit metadata path
df = csv_to_pandas('elements3-old.csv', md_path='elements3-old.serial')
df = csv_to_polars('elements3-old.csv', md_path='elements3-old.serial')
# Auto-locate metadata (same stem name, same directory)
df = csv_to_pandas('elements3-old.csv', find_md=True)
df = csv_to_polars('elements3-old.csv', find_md=True)
# Colon suffix — equivalent to find_md=True
df = csv_to_pandas('elements3-old.csv:')
df = csv_to_polars('elements3-old.csv:')
# Colon with explicit metadata path
df = csv_to_pandas('elements3-old.csv:elements3-old.serial')
Auto-location (find_md=True or the colon suffix) searches for metadata
in priority order: foo.csv.serial, foo.serial, wildcard matches using
@ as a glob character (e.g. @.serial), then CSVW and Frictionless
naming conventions.
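The first two steps of that priority order can be sketched as follows. The candidate_paths and find_metadata helpers are hypothetical and cover only foo.csv.serial and foo.serial; the wildcard, CSVW and Frictionless steps are left out because their exact rules are not spelled out here.

```python
import tempfile
from pathlib import Path


def candidate_paths(csv_path):
    """First two documented candidates, highest priority first."""
    csv_path = Path(csv_path)
    return [
        csv_path.with_name(csv_path.name + ".serial"),  # foo.csv.serial
        csv_path.with_suffix(".serial"),                # foo.serial
    ]


def find_metadata(csv_path):
    """Return the first candidate that exists, or None."""
    return next((p for p in candidate_paths(csv_path) if p.exists()), None)


workdir = Path(tempfile.mkdtemp())
csv_file = workdir / "foo.csv"
csv_file.write_text("a\n1\n")

(workdir / "foo.serial").write_text("{}")
found_first = find_metadata(csv_file)    # only foo.serial exists so far

(workdir / "foo.csv.serial").write_text("{}")
found_second = find_metadata(csv_file)   # foo.csv.serial now takes priority

print(found_first.name, found_second.name)  # foo.serial foo.csv.serial
```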
Pandas backends: csv_to_pandas defaults to the numpy_nullable
backend. The backend parameter overrides this:
df = csv_to_pandas('foo.csv:', backend='original') # traditional Pandas dtypes
df = csv_to_pandas('foo.csv:', backend='pyarrow') # Arrow-backed dtypes
df = csv_to_pandas('foo.csv:', backend='numpy_nullable') # default
Polars note: polars.read_csv is less flexible than pandas.read_csv
for unusual formats. In particular, Polars can only parse ISO 8601 dates
directly. csv_to_polars works around this by reading problematic fields
as strings and converting them in a post-processing step.
Reading an unfamiliar format
When you don't know a CSV file's format, use tdda serial --generate
to infer it:
tdda serial --generate foo.csv foo.serial
This reads foo.csv, applies heuristics, and writes foo.serial. The
result is a starting point, not a guarantee — inspect and correct it
before relying on it. Key override switches:
--sep C Set field delimiter to C
--nulls S Set null indicator(s)
--date-format FMT Set default date format
--quote-char Q Set quote character
--escape Use backslash escaping
--stutter Use quote stuttering
--encoding ENC Set encoding
--sample-lines N Use N lines for inference (default: 1000)
For LLMs: if you can read the CSV file directly, you can often write
the .serial by hand more quickly and reliably than inference would. Use
--generate when the format is complex or uncertain.
After generating or writing the .serial, read with csv_to_pandas or
csv_to_polars as shown above.
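An agent that drives the generate step from Python might assemble the command line like this. The sketch only builds the argument list from switches listed above and does not run anything; the generate_command helper is hypothetical, and placing the switches after the two paths follows the examples in this post rather than any documented requirement.

```python
import shlex


def generate_command(csv_path, serial_path, sep=None, nulls=None, encoding=None):
    """Build an argv list for `tdda serial --generate` with optional overrides."""
    argv = ["tdda", "serial", "--generate", csv_path, serial_path]
    for flag, value in (("--sep", sep), ("--nulls", nulls),
                        ("--encoding", encoding)):
        if value is not None:
            argv += [flag, value]
    return argv


argv = generate_command("foo.csv", "foo.serial", sep="|", nulls="NULL")
print(shlex.join(argv))  # shell-quoted form, for logging or review
```

With tdda installed, the list could be passed to subprocess.run(argv, check=True). Passing a list rather than a single string avoids shell-quoting problems with separators such as |.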
Writing with tdda.serial
The write wrapper is currently available for Pandas only.
pandas_to_csv
from tdda.serial import pandas_to_csv
# Write CSV and generate accompanying .serial metadata automatically
info = pandas_to_csv(df, 'output.csv', auto_md_outpath=True)
# Writes output.csv and output.serial
# Colon suffix — equivalent
info = pandas_to_csv(df, 'output.csv:')
# Explicit metadata output path
info = pandas_to_csv(df, 'output.csv', md_outpath='output.serial')
# Use an existing .serial to specify the write format
info = pandas_to_csv(df, 'output.csv', md_inpath='format.serial')
# Use an existing .serial for write format and write a .serial for readers
info = pandas_to_csv(df, 'output.csv',
                     md_inpath='shared-format.serial',
                     md_outpath='output.serial')
The return value is a WriteInfo object showing the path written, the
metadata output path, and the keyword arguments passed to to_csv.
Any keyword arguments you pass that to_csv accepts (such as sep,
na_rep, encoding, date_format) are forwarded to to_csv and also
reflected in the written .serial file. For example:
info = pandas_to_csv(df, 'output.psv',
                     auto_md_outpath=True,
                     sep='|',
                     na_rep='NULL',
                     encoding='latin-1')
writes a pipe-separated file with NULL as the null marker and records
these settings in output.serial.
pandas_to_csv sets index=False by default — it does not write the
Pandas index as a column. This is almost always what you want.
Writing from Polars
There is currently no polars_to_csv wrapper, though one is planned.
For now, write using the native df.write_csv() method and generate
or write the .serial file separately:
# Write the CSV
df.write_csv('output.csv')
# Generate a .serial from the written file
# (run in shell or via subprocess)
# tdda serial --generate output.csv output.serial
# Or write the .serial by hand if you know the format
If you need the Pandas write behaviour for a Polars DataFrame, convert
first: pandas_to_csv(df.to_pandas(), 'output.csv:', use_pyarrow=True).
Writing a format-only .serial (no field info)
A .serial file can record only the format conventions without any
field-level detail. This is useful as a shared "house format" that
specifies separator, encoding, null indicator etc., leaving field names
and types to be inferred by the reader. Generate one with:
tdda serial --generate "" format.serial --sep "|" --nulls "NULL" --encoding "latin-1"
(Empty filename generates a fieldless metadata file.)
Or write it by hand — it is a small JSON file:
{
  "format": "http://tdda.info/ns/tdda.serial",
  "tdda.serial": {
    "encoding": "latin-1",
    "delimiter": "|",
    "null_indicator": "NULL"
  }
}
Use md_inpath='format.serial' when writing any file in this format.
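To show what such a house format pins down, the sketch below applies the three settings by hand with the standard library's csv module. It is an illustration only and says nothing about how pandas_to_csv works internally; the rows_to_text helper and the sample rows are hypothetical.

```python
import csv
import io

# The "tdda.serial" section of the format-only file shown above.
house_format = {"encoding": "latin-1", "delimiter": "|", "null_indicator": "NULL"}


def rows_to_text(header, rows, fmt):
    """Render rows as delimited text, writing None as the null indicator."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer, delimiter=fmt["delimiter"], lineterminator="\n")
    writer.writerow(header)
    for row in rows:
        writer.writerow([fmt["null_indicator"] if v is None else v for v in row])
    return buffer.getvalue()


# "caf\u00e9" is "café": a non-ASCII value to make the encoding matter.
text = rows_to_text(["id", "name"], [[1, "caf\u00e9"], [2, None]], house_format)
data = text.encode(house_format["encoding"])  # bytes as they would be written

print(text)
```

None of the three settings depends on the column names or types, which is what lets one format-only file serve many differently shaped CSV files.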
The tdda serial CLI: Conversion and Code Generation
The tdda serial command converts between metadata formats and generates
Python code for reading files without requiring the tdda library.
Format conversion
tdda serial infile outfile [--to FORMAT]
Format is inferred from filename when it follows conventions; use --to
when it doesn't. Format abbreviations:
| Short | Long form |
|---|---|
| . | tdda.serial (default) |
| pd.r | pandas.read_csv |
| pd.w | pandas.DataFrame.to_csv |
| pl.r | polars.read_csv |
| pl.w | polars.DataFrame.write_csv |
| csvw | CSVW |
| fl | Frictionless |
| fl.r | Frictionless resource |
| fl.p | Frictionless package |
Examples:
# Convert between formats (inferred from filenames)
tdda serial foo.serial foo-metadata.json # tdda.serial → CSVW
tdda serial foo-metadata.json foo.serial # CSVW → tdda.serial
# Explicit format
tdda serial --to csvw foo.serial foo-out.json
tdda serial --to pl.r foo.serial foo-pl.serial # add Polars read_csv section
# Generate from a CSV file
tdda serial --generate foo.csv foo.serial
# Pandas backend when converting to Pandas sections
tdda serial --to pd.r --backend a foo.serial foo-pdr.serial # PyArrow dtypes
Generating Python code
Use a .py output extension to generate a standalone read_data()
function that does not require tdda to be installed:
tdda serial foo.serial foo_reader.py --to pd.r
This produces something like:
import pandas as pd

def read_data(inpath):
    return pd.read_csv(inpath, sep=',', encoding='UTF-8',
                       escapechar='\\', quotechar='"',
                       dtype={'id': 'Int64', 'price': 'Float64'},
                       na_values='-', keep_default_na=False)
This is useful when sharing code with users who do not have tdda
installed, or when you want to hard-wire the read parameters.
The --for FILE flag sets the data path in CSVW or Frictionless output
(CSVW requires a url):
tdda serial --to csvw foo.serial foo-metadata.json --for foo.csv
Using Metadata with Other tdda Tools
The colon syntax works with all tdda command-line tools that accept
CSV files. Adding : to a CSV path tells tdda to find and use
metadata automatically; adding :path specifies the metadata explicitly.
# tdda verify — validate data against constraints
tdda verify foo.csv elements118.tdda # no metadata: types may be wrong
tdda verify foo.csv: elements118.tdda # auto-find metadata
tdda verify foo.csv:foo.serial elements118.tdda # explicit metadata
# tdda diff — compare two datasets
tdda diff old.csv: new.csv: # use metadata for both
# tdda discover — generate constraints
tdda discover foo.csv: foo.tdda
Without metadata, type mismatches cause spurious failures in verify
and diff. A Group column read as string (because the null marker
isn't recognised) fails type, min, max, and sign constraints — four
failures from one formatting oversight.
CSVW and Frictionless files work in the same positions as .serial files.
A CSVW file that contains the data path can be specified instead of the
CSV file itself:
tdda verify foo-metadata.json elements118.tdda
Checklist and Recommended Agent Behaviour
☐ When you encounter a CSV with a .serial file alongside it, use
csv_to_pandas('foo.csv:') or csv_to_polars('foo.csv:') rather than
bare pd.read_csv.
☐ When writing CSV data that will be shared or reused, use
pandas_to_csv(df, 'output.csv:') to write a .serial alongside it.
If the format has non-standard settings, pass them as keyword arguments
and they will be recorded in the .serial.
☐ When writing data-producing code, consider whether the CSV will
be read later or by other code. If so, add auto_md_outpath=True.
For LLM-generated code the overhead is negligible.
☐ When reading an unfamiliar CSV, run
tdda serial --generate foo.csv foo.serial, inspect the result, correct
if needed, then use csv_to_pandas('foo.csv:'). If you can read the
file directly, writing the .serial by hand is often faster.
☐ When running tdda verify or tdda diff on CSV files, add the
colon suffix: foo.csv: rather than foo.csv, if a suitably-named
metadata file is available. Without it, type mismatches produce
spurious failures.
☐ CSVW and Frictionless files are accepted wherever .serial files
are. If you receive data with CSVW or Frictionless metadata, pass it
directly to csv_to_pandas or use the colon syntax.
☐ Don't add metadata to simple throwaway scripts where the CSV is standard (comma, UTF-8, no dates, no unusual nulls) and nothing downstream will read it. The overhead is not worth it.
☐ To share read code without requiring tdda, generate a standalone
Python reader: tdda serial foo.serial foo_reader.py --to pd.r.
☐ To convert between metadata formats (tdda.serial ↔ CSVW ↔
Frictionless), use tdda serial infile outfile.
Further Reading
- tdda.serial documentation
- tdda serial man page
- tdda library source
- Test-Driven Data Analysis (Radcliffe, CRC Press, 2026), Chapter 8
- Book resources




