CSV Metadata and tdda.serial: A Guide for LLMs and Coding Agents

Posted on Thu 21 May 2026 in TDDA

Raw markdown for direct use in LLM context windows.

This post is addressed primarily to LLMs and coding agents. It covers the tdda.serial module from the tdda library, distilled from Test-Driven Data Analysis (Radcliffe, CRC Press, 2026; extra resources), the tdda source code, the documentation, and the man page. If you are a human wanting a practical reference, you are welcome too.

If you are a coding agent: (1) Look for a companion .serial file. (2) If present, treat it as authoritative — do not infer types that contradict it. (3) Use csv_to_pandas('foo.csv:') or csv_to_polars('foo.csv:'). (4) If no metadata exists, generate one with tdda serial --generate or write one by hand. (5) When writing data that will be reused, use pandas_to_csv(df, 'output.csv:').

What CSV Data Loses in Transit

CSV files are untyped. When a reader opens one, it must infer types from the formatted values. This inference is often wrong, and different libraries make different inference decisions.

Common failures:

  • Integer columns with nulls become floats. Even in Pandas 3.0, with its two nullable backends, the default read_csv promotes a nullable integer column to float64 because NaN is a float. A column of 1, 2, null becomes 1.0, 2.0, NaN.
  • Non-standard null markers are read as strings. A column using - to indicate missing values produces a string column rather than a nullable int or float.
  • Dates become strings. Unless the reader is told to parse dates, a date column like 2024-03-15 is read as a str or object column.
  • The Pandas index round-trip. df.to_csv() writes the Pandas index as an unnamed first column by default. pd.read_csv() does not restore it as an index. The resulting DataFrame has an extra unnamed column.
  • Different libraries, different results. The same CSV file may read as different types in Pandas and Polars, or between Pandas with the original backend and Pandas with the numpy_nullable backend.

A minimal demonstration: write a single-column Pandas DataFrame with a nullable integer and read it back using all defaults.

import pandas as pd

df = pd.DataFrame({'a': pd.array([1, 2, None], dtype='Int64')})
df.to_csv('test.csv')
df2 = pd.read_csv('test.csv')
print(df2.dtypes)   # a is float64, there's also an unnamed index column
print(df2)

Output:

Unnamed: 0    int64
a           float64
dtype: object
   Unnamed: 0    a
0           0  1.0
1           1  2.0
2           2  NaN

The round-trip has lost type information and added a spurious column. This happens with default settings, not unusual data.

When Metadata Is Worth Using

CSV metadata — a companion file that records the format used — addresses these problems. It is not always worth the overhead.

Use metadata when:

  • The CSV uses non-standard formats: an unusual null indicator, a non-ISO date format, a non-comma separator, a non-UTF-8 encoding.
  • The data will be read by code written separately from the code that wrote it, or by a different person or system.
  • The data will be read again some time after it was written.
  • You are writing data-producing code and the format is non-obvious. The marginal cost of writing a .serial file alongside the write code is small, and it serves as machine-readable documentation.

Don't use metadata when:

  • The CSV is clean, standard, and will be read once by the same code that wrote it.
  • It is a simple file that whatever reads it handles correctly in practice.
  • It is a throwaway file that nothing downstream will consume.

A note for LLMs: the cost of generating a .serial file alongside data-writing code is near zero at code-generation time. The threshold for "worth it" is therefore lower for LLM-generated code than for humans typing it out. Lean toward including metadata when the format is non-standard or the data leaves the immediate script.

Three Metadata Formats

Three metadata formats are in use for CSV files. Understanding their philosophical differences helps you choose and explains why they don't translate perfectly to each other.

tdda.serial (.serial files)

A tdda.serial file describes a format, not a specific file. One .serial file can apply to any number of CSV files that share the same format. It is not URL-centric. It can be hand-written or generated. It has a strong, flexible date format system. It is the native format of the tdda.serial module. You can use @ as a glob wildcard in .serial filenames to indicate which files a metadata file is intended for — sales_@.serial would match sales_2024.csv, sales_2025.csv etc.

A .serial file is a JSON file. By convention, it has the same stem name as the data file it accompanies (foo.csvfoo.serial), but this is not required. Any .serial file can be applied to any compatible flat file.

CSVW

CSVW is a W3C standard for describing CSV files on the web. It is designed around a one-to-one relationship between a metadata file and a specific named data file, identified by URL. A CSVW file that describes foo.csv contains a url pointing to that specific file.

CSVW is comprehensive and has W3C backing, but tooling is sparse and fragmented — the tools listed on the CSVW site are mostly RDF-focused, not CSV-focused. It is a heavy format, and date handling is less flexible than tdda.serial. If you receive a CSVW file, tdda.serial can use it; if you are creating new metadata, tdda.serial is a simpler and more practical choice.

Frictionless

Frictionless is a data packaging ecosystem with good Python tooling (pip install frictionless). Its primary abstractions are resources (a single dataset plus its metadata) and packages (a collection of resources). This makes it well-suited to supplying data as a self-described package, but less suited to describing a shared format applied to many files. A Frictionless schema is reusable but is not commonly used standalone. If you are distributing a dataset for others to consume, Frictionless is a reasonable choice. If you are describing an internal format, use tdda.serial.

Interoperability

The tdda.serial library reads and writes all three formats, and they can be used interchangeably in most tdda contexts. If you receive a CSVW or Frictionless file, you can pass it to csv_to_pandas exactly as you would a .serial file. The tdda serial command converts between formats.

The .serial File Format

A .serial file is a JSON object. Here is the full top-level structure:

{
    "format": "http://tdda.info/ns/tdda.serial",
    "writer": "tdda.serial-3.0.0",
    "tdda.serial": { ... },
    "pandas.read_csv": { ... },
    "pandas.DataFrame.to_csv": { ... },
    "polars.read_csv": { ... }
}

The format key is required; all others are optional. Only the tdda.serial section is described here — library-specific sections contain verbatim keyword arguments for the corresponding function and are generated by tdda serial --to pd.r etc.

Dataset-level keys in tdda.serial

All are optional. Omitted keys fall back to library defaults.

Key Type Description
encoding string Text encoding, e.g. "UTF-8", "latin-1"
delimiter string Field separator, e.g. ",", "\|", ";"
quote_char string Quote character, almost always "\"" or "'"
escape_char string Escape character; "\\" means backslash
stutter_quotes bool If true, embedded quotes are doubled ("")
null_indicator string or array Null marker(s), e.g. "", "-", "NULL"
date_format string Default format for date fields
datetime_format string Default format for datetime fields
header_row_count int Number of header rows (default: 1)
header_row int Zero-based index of the column-name row (default: 0)
decimal_point string Decimal point character (default: ".")
thou_sep string Thousands separator, e.g. ","
true_values string or array Values interpreted as true for bool fields
false_values string or array Values interpreted as false for bool fields
quoting string Quoting style (see below)
fields array or object Per-field descriptions

The quoting field accepts Python csv module constants (QUOTE_ALL, QUOTE_MINIMAL, QUOTE_NONNUMERIC, QUOTE_NONE, QUOTE_NOTNULL, QUOTE_STRINGS) and also QUOTE_STRINGS_ONLY, which quotes only string values (not nulls, numbers, dates, or booleans). QUOTE_STRINGS_ONLY is similar to what JSON does.

The fields entry

Fields can be specified as an array (complete and ordered) or an object/dictionary (partial, keyed by the CSV column name).

Array form — used when the complete field list is known and ordered:

"fields": [
    {"name": "id",    "fieldtype": "int"},
    {"name": "price", "fieldtype": "float"},
    {"name": "date",  "fieldtype": "date"}
]

Object form — used for partial specifications or when internal names differ from CSV column names:

"fields": {
    "commission date": {"name": "DateOfCommission", "fieldtype": "date"},
    "passed qa?":      {"name": "PassedQA", "fieldtype": "bool",
                        "true_values": "yes", "false_values": "no"}
}

In object form the dictionary key is the name as it appears in the CSV; the optional name key gives the internal (DataFrame column) name.

Per-field keys

Key Description
name Internal name (DataFrame column name). Required in array form.
fieldtype Type of the field (see table below)
csvname CSV column name when different from name (array form)
format Date/datetime format for this field; overrides dataset-level
null_indicator Null marker(s) for this field; overrides dataset-level
true_values True value(s) for bool fields
false_values False value(s) for bool fields
description Human-readable description

Field types

Value Description
bool Boolean
int Integer
float Floating-point
number Unspecified numeric
string Text
date Date (no time component)
datetime Date and time
datetime_tz Date and time with timezone
time Time only
iso8601 ISO 8601 date or datetime (unspecified)

Date format specifications

Four forms are accepted:

  1. Named ISO 8601 formats: iso8601-date (2000-12-31), iso8601-datetime (2000-12-31T12:34:56), iso8601-datetime-tz (2000-12-31T12:34:56+00:00), iso8601 (any of the above). These are the recommended choices for new data.

  2. YYYY/MM/DD-style specifiers: Tokens: YYYY, YY, MM (month or minute, by context), DD, HH, SS, SS.S (fractional), MON (Jan), MONTH (January), +ZZ:ZZ (timezone), AM/PM. Examples: YYYY-MM-DD, DD/MM/YYYY HH:MM:SS, MM/DD/YY, YYYY-MM-DDTHH:MM:SS.S+ZZ:ZZ.

  3. Unambiguous literal examples: any actual date/time value where the day is ≥ 13 or the year is 4 digits or ≥ 60. So 2000-12-31T12:34:56 is accepted; 01/02/2000 is not (ambiguous: day-first or month-first?).

  4. Python strftime strings: %Y-%m-%dT%H:%M:%S etc.

A complete example

This .serial file describes a CSV with non-standard settings — a hyphen as null indicator and ISO 8601 dates — matching the elements3-old.csv file distributed with the tdda library:

{
    "format": "http://tdda.info/ns/tdda.serial",
    "tdda.serial": {
        "encoding": "UTF-8",
        "delimiter": ",",
        "quote_char": "\"",
        "escape_char": "\\",
        "stutter_quotes": false,
        "null_indicator": "-",
        "date_format": "YYYY-MM-DD",
        "header_row_count": 1,
        "fields": [
            {"name": "Z",               "fieldtype": "int"},
            {"name": "Name",            "fieldtype": "string"},
            {"name": "Symbol",          "fieldtype": "string"},
            {"name": "Period",          "fieldtype": "int"},
            {"name": "Group",           "fieldtype": "int"},
            {"name": "AtomicWeight",    "fieldtype": "float"},
            {"name": "ApproxDiscovery", "fieldtype": "date"}
        ]
    }
}

An LLM that knows a CSV file's format can write a .serial file like this directly, without needing to run inference. This is usually faster and more reliable than --generate when you can examine the data. Metadata describes the intended format. Values that do not conform are data errors, not type-inference hints.

Reading with tdda.serial

Reading a format you know

When you have a .serial file (or can write one), use csv_to_pandas or csv_to_polars:

from tdda.serial import csv_to_pandas, csv_to_polars

# Explicit metadata path
df = csv_to_pandas('elements3-old.csv', md_path='elements3-old.serial')
df = csv_to_polars('elements3-old.csv', md_path='elements3-old.serial')

# Auto-locate metadata (same stem name, same directory)
df = csv_to_pandas('elements3-old.csv', find_md=True)
df = csv_to_polars('elements3-old.csv', find_md=True)

# Colon suffix — equivalent to find_md=True
df = csv_to_pandas('elements3-old.csv:')
df = csv_to_polars('elements3-old.csv:')

# Colon with explicit metadata path
df = csv_to_pandas('elements3-old.csv:elements3-old.serial')

The auto-locate (find_md=True / colon suffix) searches for metadata in priority order: foo.csv.serial, foo.serial, wildcard matches using @ as a glob character (e.g. @.serial), then CSVW and Frictionless naming conventions.

Pandas backends: csv_to_pandas defaults to the numpy_nullable backend. The backend parameter overrides this:

df = csv_to_pandas('foo.csv:', backend='original')    # traditional Pandas dtypes
df = csv_to_pandas('foo.csv:', backend='pyarrow')     # Arrow-backed dtypes
df = csv_to_pandas('foo.csv:', backend='numpy_nullable')  # default

Polars note: polars.read_csv is less flexible than pandas.read_csv for unusual formats. In particular, Polars can only parse ISO 8601 dates directly. csv_to_polars works around this by reading problematic fields as strings and converting them in a post-processing step.

Reading an unfamiliar format

When you don't know a CSV file's format, use tdda serial --generate to infer it:

tdda serial --generate foo.csv foo.serial

This reads foo.csv, applies heuristics, and writes foo.serial. The result is a starting point, not a guarantee — inspect and correct it before relying on it. Key override switches:

--sep C              Set field delimiter to C
--nulls S            Set null indicator(s)
--date-format FMT    Set default date format
--quote-char Q       Set quote character
--escape             Use backslash escaping
--stutter            Use quote stuttering
--encoding ENC       Set encoding
--sample-lines N     Use N lines for inference (default: 1000)

For LLMs: if you can read the CSV file directly, you can often write the .serial by hand more quickly and reliably than inference. Use --generate when the format is complex or uncertain.

After generating or writing the .serial, read with csv_to_pandas or csv_to_polars as shown above.

Writing with tdda.serial

The write wrapper is currently available for Pandas only.

pandas_to_csv

from tdda.serial import pandas_to_csv

# Write CSV and generate accompanying .serial metadata automatically
info = pandas_to_csv(df, 'output.csv', auto_md_outpath=True)
# Writes output.csv and output.serial

# Colon suffix — equivalent
info = pandas_to_csv(df, 'output.csv:')

# Explicit metadata output path
info = pandas_to_csv(df, 'output.csv', md_outpath='output.serial')

# Use an existing .serial to specify the write format
info = pandas_to_csv(df, 'output.csv', md_inpath='format.serial')

# Use an existing .serial for write format and write a .serial for readers
info = pandas_to_csv(df, 'output.csv',
                     md_inpath='shared-format.serial',
                     md_outpath='output.serial')

The return value is a WriteInfo object showing the path written, the metadata output path, and the keyword arguments passed to to_csv.

Any keyword arguments you pass that to_csv accepts (such as sep, na_rep, encoding, date_format) are forwarded to to_csv and also reflected in the written .serial file. For example:

info = pandas_to_csv(df, 'output.psv',
                     auto_md_outpath=True,
                     sep='|',
                     na_rep='NULL',
                     encoding='latin-1')

writes a pipe-separated file with NULL as the null marker and records these settings in output.serial.

pandas_to_csv sets index=False by default — it does not write the Pandas index as a column. This is almost always what you want.

Writing from Polars

There is currently no polars_to_csv wrapper, though one is planned. For now, write using the native df.write_csv() method and generate or write the .serial file separately:

# Write the CSV
df.write_csv('output.csv')

# Generate a .serial from the written file
# (run in shell or via subprocess)
# tdda serial --generate output.csv output.serial

# Or write the .serial by hand if you know the format

If you need the Pandas write behaviour for a Polars DataFrame, convert first: pandas_to_csv(df.to_pandas(), 'output.csv:', use_pyarrow=True).

Writing a format-only .serial (no field info)

A .serial file can record only the format conventions without any field-level detail. This is useful as a shared "house format" that specifies separator, encoding, null indicator etc., leaving field names and types to be inferred by the reader. Generate one with:

tdda serial --generate "" format.serial --sep "|" --nulls "NULL" --encoding "latin-1"

(Empty filename generates a fieldless metadata file.)

Or write it by hand — it is a small JSON file:

{
    "format": "http://tdda.info/ns/tdda.serial",
    "tdda.serial": {
        "encoding": "latin-1",
        "delimiter": "|",
        "null_indicator": "NULL"
    }
}

Use md_inpath='format.serial' when writing any file in this format.

The tdda serial CLI: Conversion and Code Generation

The tdda serial command converts between metadata formats and generates Python code for reading files without requiring the tdda library.

Format conversion

tdda serial infile outfile [--to FORMAT]

Format is inferred from filename when it follows conventions; use --to when it doesn't. Format abbreviations:

Short Long form
. tdda.serial (default)
pd.r pandas.read_csv
pd.w pandas.DataFrame.to_csv
pl.r polars.read_csv
pl.w polars.DataFrame.write_csv
csvw CSVW
fl Frictionless
fl.r Frictionless resource
fl.p Frictionless package

Examples:

# Convert between formats (inferred from filenames)
tdda serial foo.serial foo-metadata.json        # tdda.serial → CSVW
tdda serial foo-metadata.json foo.serial        # CSVW → tdda.serial

# Explicit format
tdda serial --to csvw foo.serial foo-out.json
tdda serial --to pl.r foo.serial foo-pl.serial  # add Polars read_csv section

# Generate from a CSV file
tdda serial --generate foo.csv foo.serial

# Pandas backend when converting to Pandas sections
tdda serial --to pd.r --backend a foo.serial foo-pdr.serial  # PyArrow dtypes

Generating Python code

Use a .py output extension to generate a standalone read_data() function that does not require tdda to be installed:

tdda serial foo.serial foo_reader.py --to pd.r

This produces something like:

import pandas as pd

def read_data(inpath):
    return pd.read_csv(inpath, sep=',', encoding='UTF-8',
        escapechar='\\', quotechar='"',
        dtype={'id': 'Int64', 'price': 'Float64'},
        na_values='-', keep_default_na=False)

This is useful when sharing code with users who do not have tdda installed, or when you want to hard-wire the read parameters.

The --for FILE flag sets the data path in CSVW or Frictionless output (CSVW requires a url):

tdda serial --to csvw foo.serial foo-metadata.json --for foo.csv

Using Metadata with Other tdda Tools

The colon syntax works with all tdda command-line tools that accept CSV files. Adding : to a CSV path tells tdda to find and use metadata automatically; adding :path specifies the metadata explicitly.

# tdda verify — validate data against constraints
tdda verify foo.csv elements118.tdda              # no metadata: types may be wrong
tdda verify foo.csv: elements118.tdda             # auto-find metadata
tdda verify foo.csv:foo.serial elements118.tdda   # explicit metadata

# tdda diff — compare two datasets
tdda diff old.csv: new.csv:                       # use metadata for both

# tdda discover — generate constraints
tdda discover foo.csv: foo.tdda

Without metadata, type mismatches cause spurious failures in verify and diff. A Group column read as string (because the null marker isn't recognised) fails type, min, max, and sign constraints — four failures from one formatting oversight.

CSVW and Frictionless files work in the same positions as .serial files. A CSVW file that contains the data path can be specified instead of the CSV file itself:

tdda verify foo-metadata.json elements118.tdda

When you encounter a CSV with a .serial file alongside it, use csv_to_pandas('foo.csv:') or csv_to_polars('foo.csv:') rather than bare pd.read_csv.

When writing CSV data that will be shared or reused, use pandas_to_csv(df, 'output.csv:') to write a .serial alongside it. If the format has non-standard settings, pass them as keyword arguments and they will be recorded in the .serial.

When writing data-producing code, consider whether the CSV will be read later or by other code. If so, add auto_md_outpath=True. For LLM-generated code the overhead is negligible.

When reading an unfamiliar CSV, run tdda serial --generate foo.csv foo.serial, inspect the result, correct if needed, then use csv_to_pandas('foo.csv:'). If you can read the file directly, writing the .serial by hand is often faster.

When running tdda verify or tdda diff on CSV files, add the colon suffix: foo.csv: rather than foo.csv, if a suitably-named metadata file is available. Without it, type mismatches produce spurious failures.

CSVW and Frictionless files are accepted wherever .serial files are. If you receive data with CSVW or Frictionless metadata, pass it directly to csv_to_pandas or use the colon syntax.

Don't add metadata to simple throwaway scripts where the CSV is standard (comma, UTF-8, no dates, no unusual nulls) and nothing downstream will read it. The overhead is not worth it.

To share read code without requiring tdda, generate a standalone Python reader: tdda serial foo.serial foo_reader.py --to pd.r.

To convert between metadata formats (tdda.serial ↔ CSVW ↔ Frictionless), use tdda serial infile outfile.

Further Reading