Title: CSV Metadata and tdda.serial: A Guide for LLMs and Coding Agents
Date: 2026-05-21T12:00:00
Tags: tdda.serial, CSV, metadata, LLMs, pandas, polars, frictionless, csvw
Category: TDDA
Status: published


_[Raw markdown](https://tdda.info/md/077-csv-metadata-and-tdda-serial-a-guide-for-llms.md) for direct use in LLM context windows._

This post is addressed primarily to LLMs and coding agents. It covers the
`tdda.serial` module from the [tdda library](https://github.com/tdda/tdda),
distilled from [_Test-Driven Data Analysis_](https://www.routledge.com/Test-Driven-Data-Analysis/Radcliffe/p/book/9781032897158)
(Radcliffe, CRC Press, 2026; [extra resources](https://book.tdda.info)),
the [tdda source code](https://github.com/tdda/tdda), the
[documentation](https://tdda.readthedocs.io/en/latest/tdda.serial.html),
and the [man page](https://github.com/tdda/tdda/blob/master/tdda/man/tdda-serial.md).
If you are a human wanting a practical reference, you are welcome too.

**If you are a coding agent:** (1) Look for a companion `.serial` file.
(2) If present, treat it as authoritative — do not infer types that
contradict it. (3) Use `csv_to_pandas('foo.csv:')` or
`csv_to_polars('foo.csv:')`. (4) If no metadata file exists, generate one
with `tdda serial --generate` or write one by hand.
(5) When writing data that will be reused, use `pandas_to_csv(df, 'output.csv:')`.


### What CSV Data Loses in Transit

CSV files are untyped. When a reader opens one, it must infer types from
the formatted values. This inference is often wrong, and different libraries
make different inference decisions.

Common failures:

- **Integer columns with nulls become floats.** Even in Pandas 3.0, with
  its two nullable backends, the default `read_csv` promotes a nullable
  integer column to `float64` because `NaN` is a float. A column of
  `1, 2, null` becomes `1.0, 2.0, NaN`.
- **Non-standard null markers are read as strings.** A column using `-`
  to indicate missing values produces a `string` column rather than a
  nullable `int` or `float`.
- **Dates become strings.** Unless the reader is told to parse dates,
  a date column like `2024-03-15` is read as a `str` or `object` column.
- **The Pandas index round-trip.** `df.to_csv()` writes the Pandas index
  as an unnamed first column by default. `pd.read_csv()` does not restore
  it as an index. The resulting DataFrame has an extra unnamed column.
- **Different libraries, different results.** The same CSV file may read
  as different types in Pandas and Polars, or between Pandas with the
  original backend and Pandas with the `numpy_nullable` backend.

A minimal demonstration: write a single-column Pandas DataFrame with a
nullable integer and read it back using all defaults.

```python
import pandas as pd

df = pd.DataFrame({'a': pd.array([1, 2, None], dtype='Int64')})
df.to_csv('test.csv')
df2 = pd.read_csv('test.csv')
print(df2.dtypes)   # a is float64, there's also an unnamed index column
print(df2)
```

Output:
```
Unnamed: 0    int64
a           float64
dtype: object
   Unnamed: 0    a
0           0  1.0
1           1  2.0
2           2  NaN
```

The round-trip has lost type information and added a spurious column.
Both problems arise with default settings and entirely ordinary data.
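
Two of the other failures listed above (non-standard null markers and unparsed dates) can be reproduced in the same way. The following is a minimal sketch using plain Pandas only; the column names `n` and `d` and the data are invented for illustration.

```python
import io

import pandas as pd

# Invented data: '-' marks a missing value in column n; d holds ISO dates.
CSV_TEXT = "n,d\n1,2024-03-15\n-,2024-03-16\n3,2024-03-17\n"

# All defaults: '-' is not recognised as null, so n is read as text,
# and d is left as text because nothing asked for date parsing.
naive = pd.read_csv(io.StringIO(CSV_TEXT))

# Telling the reader about the format fixes both columns.
informed = pd.read_csv(io.StringIO(CSV_TEXT), na_values=['-'], parse_dates=['d'])

print(naive.dtypes)
print(informed.dtypes)
```

The second call works only because the caller already knows the null marker and which column holds dates. That knowledge is what a metadata file records.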


### When Metadata Is Worth Using

CSV metadata — a companion file that records the format used — addresses
these problems. It is not always worth the overhead.

**Use metadata when:**

- The CSV uses non-standard formats: an unusual null indicator, a non-ISO
  date format, a non-comma separator, a non-UTF-8 encoding.
- The data will be read by code written separately from the code that
  wrote it, or by a different person or system.
- The data will be read again some time after it was written.
- You are writing data-producing code and the format is non-obvious.
  The marginal cost of writing a `.serial` file alongside the write code
  is small, and it serves as machine-readable documentation.

**Don't use metadata when:**

- The CSV is clean, standard, and will be read once by the same code
  that wrote it.
- The file is simple enough that whatever reads it handles it correctly in practice.
- It is a throwaway file that nothing downstream will consume.

**A note for LLMs:** the cost of generating a `.serial` file alongside
data-writing code is near zero at code-generation time. The threshold
for "worth it" is therefore lower for LLM-generated code than for humans
typing it out. Lean toward including metadata when the format is
non-standard or the data leaves the immediate script.


### Three Metadata Formats

Three metadata formats are in use for CSV files. Understanding their
philosophical differences helps you choose and explains why they don't
translate perfectly to each other.

#### `tdda.serial` (`.serial` files)

A `tdda.serial` file describes a **format**, not a specific file. One
`.serial` file can apply to any number of CSV files that share the same
format. It is not URL-centric, can be hand-written or generated, and has
a strong, flexible date format system. It is the native format of the
`tdda.serial` module. You can use `@` as a glob wildcard in `.serial`
filenames to indicate which files a metadata file is intended for:
`sales_@.serial` would match `sales_2024.csv`, `sales_2025.csv`, etc.

A `.serial` file is a JSON file. By convention, it has the same stem
name as the data file it accompanies (`foo.csv` → `foo.serial`), but
this is not required. Any `.serial` file can be applied to any
compatible flat file.
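
To illustrate the `@` wildcard idea, here is a minimal sketch that treats `@` in a `.serial` filename as a glob `*` matched against the stem of the data file. The helper `serial_matches` is hypothetical and is not part of `tdda`; the library's actual matching rules may differ in detail.

```python
from fnmatch import fnmatchcase
from pathlib import PurePath


def serial_matches(serial_name, csv_name):
    """Hypothetical helper: does this .serial filename cover this CSV file?

    Assumes '@' behaves like the glob wildcard '*' and that matching is
    done on file stems (names without their final extension).
    """
    pattern = PurePath(serial_name).stem.replace('@', '*')
    return fnmatchcase(PurePath(csv_name).stem, pattern)


print(serial_matches('sales_@.serial', 'sales_2024.csv'))    # True
print(serial_matches('sales_@.serial', 'returns_2024.csv'))  # False
```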

#### CSVW

[CSVW](https://csvw.org/) is a W3C standard for describing CSV files on
the web. It is designed around a **one-to-one relationship between a
metadata file and a specific named data file**, identified by URL. A
CSVW file that describes `foo.csv` contains a `url` pointing to that
specific file.

CSVW is comprehensive and has W3C backing, but tooling is sparse and
fragmented: the tools listed on the CSVW site are mostly RDF-focused,
not CSV-focused. It is a heavy format, and its date handling is less
flexible than that of `tdda.serial`. If you receive a CSVW file, `tdda.serial`
can use it; if you are creating new metadata, `tdda.serial` is a
simpler and more practical choice.

#### Frictionless

[Frictionless](https://frictionlessdata.io/) is a data packaging
ecosystem with good Python tooling (`pip install frictionless`). Its
primary abstractions are **resources** (a single dataset plus its
metadata) and **packages** (a collection of resources). This makes it
well-suited to *supplying* data as a self-described package, but less
suited to describing a shared format applied to many files. A
Frictionless `schema` is reusable but is not commonly used standalone.
If you are distributing a dataset for others to consume, Frictionless
is a reasonable choice. If you are describing an internal format, use
`tdda.serial`.

#### Interoperability

The `tdda.serial` library reads and writes all three formats, and they
can be used interchangeably in most `tdda` contexts. If you receive a
CSVW or Frictionless file, you can pass it to `csv_to_pandas` exactly
as you would a `.serial` file. The `tdda serial` command converts
between formats.
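
For orientation, the sketch below builds a minimal description of the same single-column file in each of the three styles, as Python dictionaries. The file name `foo.csv` and the column `id` are invented, and these are hand-written minimal examples rather than the exact output of `tdda serial`. The point is only that the CSVW and Frictionless versions name a data file, while the `tdda.serial` version does not.

```python
import json

# Minimal CSVW description: tied to one data file through "url".
csvw = {
    "@context": "http://www.w3.org/ns/csvw",
    "url": "foo.csv",
    "tableSchema": {"columns": [{"name": "id", "datatype": "integer"}]},
}

# Minimal Frictionless resource: tied to one data file through "path".
frictionless_resource = {
    "name": "foo",
    "path": "foo.csv",
    "schema": {"fields": [{"name": "id", "type": "integer"}]},
}

# Minimal tdda.serial description: describes a format, names no data file.
tdda_serial = {
    "format": "http://tdda.info/ns/tdda.serial",
    "tdda.serial": {"fields": [{"name": "id", "fieldtype": "int"}]},
}

for label, md in [("CSVW", csvw),
                  ("Frictionless", frictionless_resource),
                  ("tdda.serial", tdda_serial)]:
    print(label)
    print(json.dumps(md, indent=4))
```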


### The `.serial` File Format

A `.serial` file is a JSON object. Here is the full top-level structure:

```json
{
    "format": "http://tdda.info/ns/tdda.serial",
    "writer": "tdda.serial-3.0.0",
    "tdda.serial": { ... },
    "pandas.read_csv": { ... },
    "pandas.DataFrame.to_csv": { ... },
    "polars.read_csv": { ... }
}
```

The `format` key is required; all others are optional. Only the
`tdda.serial` section is described here — library-specific sections
contain verbatim keyword arguments for the corresponding function and
are generated by `tdda serial --to pd.r` etc.

#### Dataset-level keys in `tdda.serial`

All are optional. Omitted keys fall back to library defaults.

| Key | Type | Description |
|-----|------|-------------|
| `encoding` | string | Text encoding, e.g. `"UTF-8"`, `"latin-1"` |
| `delimiter` | string | Field separator, e.g. `","`, `"\|"`, `";"` |
| `quote_char` | string | Quote character, almost always `"\""` or `"'"` |
| `escape_char` | string | Escape character; `"\\"` means backslash |
| `stutter_quotes` | bool | If true, embedded quotes are doubled (`""`) |
| `null_indicator` | string or array | Null marker(s), e.g. `""`, `"-"`, `"NULL"` |
| `date_format` | string | Default format for `date` fields |
| `datetime_format` | string | Default format for `datetime` fields |
| `header_row_count` | int | Number of header rows (default: 1) |
| `header_row` | int | Zero-based index of the column-name row (default: 0) |
| `decimal_point` | string | Decimal point character (default: `"."`) |
| `thou_sep` | string | Thousands separator, e.g. `","` |
| `true_values` | string or array | Values interpreted as `true` for bool fields |
| `false_values` | string or array | Values interpreted as `false` for bool fields |
| `quoting` | string | Quoting style (see below) |
| `fields` | array or object | Per-field descriptions |

The `quoting` key accepts the names of the Python `csv` module constants
(`QUOTE_ALL`, `QUOTE_MINIMAL`, `QUOTE_NONNUMERIC`, `QUOTE_NONE`,
`QUOTE_NOTNULL`, `QUOTE_STRINGS`) and also `QUOTE_STRINGS_ONLY`,
which quotes only string values (not nulls, numbers, dates, or booleans).
`QUOTE_STRINGS_ONLY` is similar to JSON, where only strings are quoted.
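
As a reminder of what the standard quoting styles mean, this short sketch writes the same row with three of the Python `csv` constants. It uses the standard library only; `render` is a throwaway helper written for this example, and `QUOTE_STRINGS_ONLY` is not shown because it is not part of the `csv` module.

```python
import csv
import io


def render(row, quoting):
    """Write one row with the given csv quoting style and return the text."""
    buf = io.StringIO()
    csv.writer(buf, quoting=quoting, lineterminator='\n').writerow(row)
    return buf.getvalue().rstrip('\n')


row = ['a,b', 'plain', 3, 2.5]

print(render(row, csv.QUOTE_MINIMAL))     # "a,b",plain,3,2.5
print(render(row, csv.QUOTE_NONNUMERIC))  # "a,b","plain",3,2.5
print(render(row, csv.QUOTE_ALL))         # "a,b","plain","3","2.5"
```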

#### The `fields` entry

Fields can be specified as an **array** (complete and ordered) or an
**object/dictionary** (partial, keyed by the CSV column name).

**Array form** — used when the complete field list is known and ordered:

```json
"fields": [
    {"name": "id",    "fieldtype": "int"},
    {"name": "price", "fieldtype": "float"},
    {"name": "date",  "fieldtype": "date"}
]
```

**Object form** — used for partial specifications or when internal names
differ from CSV column names:

```json
"fields": {
    "commission date": {"name": "DateOfCommission", "fieldtype": "date"},
    "passed qa?":      {"name": "PassedQA", "fieldtype": "bool",
                        "true_values": "yes", "false_values": "no"}
}
```

In object form the dictionary key is the name as it appears in the CSV;
the optional `name` key gives the internal (DataFrame column) name.

#### Per-field keys

| Key | Description |
|-----|-------------|
| `name` | Internal name (DataFrame column name). Required in array form. |
| `fieldtype` | Type of the field (see table below) |
| `csvname` | CSV column name when different from `name` (array form) |
| `format` | Date/datetime format for this field; overrides dataset-level |
| `null_indicator` | Null marker(s) for this field; overrides dataset-level |
| `true_values` | True value(s) for bool fields |
| `false_values` | False value(s) for bool fields |
| `description` | Human-readable description |
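
The relationship between the two forms of `fields` can be sketched as a small function that builds a mapping from CSV column name to internal name. `rename_map` is a hypothetical helper written for this post, not a `tdda` function, and the `Unit Price` column is invented.

```python
def rename_map(fields):
    """Hypothetical helper: map CSV column names to internal names.

    Object form: the dictionary key is the CSV name, 'name' is optional.
    Array form:  'name' is required, 'csvname' is optional.
    """
    if isinstance(fields, dict):
        return {csv_name: spec.get('name', csv_name)
                for csv_name, spec in fields.items()}
    return {spec.get('csvname', spec['name']): spec['name'] for spec in fields}


object_form = {
    "commission date": {"name": "DateOfCommission", "fieldtype": "date"},
    "passed qa?": {"name": "PassedQA", "fieldtype": "bool"},
}
array_form = [
    {"name": "id", "fieldtype": "int"},
    {"name": "price", "csvname": "Unit Price", "fieldtype": "float"},
]

print(rename_map(object_form))
print(rename_map(array_form))
```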

#### Field types

| Value | Description |
|-------|-------------|
| `bool` | Boolean |
| `int` | Integer |
| `float` | Floating-point |
| `number` | Unspecified numeric |
| `string` | Text |
| `date` | Date (no time component) |
| `datetime` | Date and time |
| `datetime_tz` | Date and time with timezone |
| `time` | Time only |
| `iso8601` | ISO 8601 date or datetime (unspecified) |

#### Date format specifications

Four forms are accepted:

1. **Named ISO 8601 formats:**
   `iso8601-date` (`2000-12-31`), `iso8601-datetime` (`2000-12-31T12:34:56`),
   `iso8601-datetime-tz` (`2000-12-31T12:34:56+00:00`), `iso8601` (any of the above).
   These are the recommended choices for new data.

2. **YYYY/MM/DD-style specifiers:**
   Tokens: `YYYY`, `YY`, `MM` (month or minute, by context), `DD`, `HH`,
   `SS`, `SS.S` (fractional), `MON` (Jan), `MONTH` (January), `+ZZ:ZZ`
   (timezone), `AM`/`PM`. Examples: `YYYY-MM-DD`, `DD/MM/YYYY HH:MM:SS`,
   `MM/DD/YY`, `YYYY-MM-DDTHH:MM:SS.S+ZZ:ZZ`.

3. **Unambiguous literal examples:** any actual date/time value where
   the day is ≥ 13 and the year is either 4 digits or ≥ 60, so that no
   component can be mistaken for another. So `2000-12-31T12:34:56`
   is accepted; `01/02/2000` is not (ambiguous: day-first or month-first?).

4. **Python `strftime` strings:** `%Y-%m-%dT%H:%M:%S` etc.
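
To show how the second and fourth forms relate, here is a rough sketch that translates a small subset of the `YYYY/MM/DD`-style tokens into a `strftime` string. The function `to_strftime` is hypothetical, and the rule it uses for `MM` (minute if an `HH` has already appeared, otherwise month) is an assumption made for this sketch, not a description of how `tdda` resolves the ambiguity.

```python
import re
from datetime import datetime

# Only a subset of the tokens listed above is handled here.
SIMPLE_TOKENS = {'YYYY': '%Y', 'YY': '%y', 'DD': '%d', 'HH': '%H', 'SS': '%S'}


def to_strftime(spec):
    """Hypothetical translator from a YYYY/MM/DD-style spec to strftime."""
    out = []
    seen_hour = False
    for token in re.findall(r'YYYY|YY|MM|DD|HH|SS|.', spec):
        if token == 'MM':
            # Assumption: MM after an HH means minutes, otherwise month.
            out.append('%M' if seen_hour else '%m')
        elif token in SIMPLE_TOKENS:
            out.append(SIMPLE_TOKENS[token])
            seen_hour = seen_hour or token == 'HH'
        else:
            out.append(token)  # separators and literals pass through
    return ''.join(out)


fmt = to_strftime('DD/MM/YYYY HH:MM:SS')
print(fmt)                                           # %d/%m/%Y %H:%M:%S
print(datetime.strptime('31/12/2000 12:34:56', fmt))  # 2000-12-31 12:34:56
```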

#### A complete example

This `.serial` file describes a CSV that uses a hyphen as its null
indicator and ISO 8601 dates, matching the `elements3-old.csv`
file distributed with the `tdda` library:

```json
{
    "format": "http://tdda.info/ns/tdda.serial",
    "tdda.serial": {
        "encoding": "UTF-8",
        "delimiter": ",",
        "quote_char": "\"",
        "escape_char": "\\",
        "stutter_quotes": false,
        "null_indicator": "-",
        "date_format": "YYYY-MM-DD",
        "header_row_count": 1,
        "fields": [
            {"name": "Z",               "fieldtype": "int"},
            {"name": "Name",            "fieldtype": "string"},
            {"name": "Symbol",          "fieldtype": "string"},
            {"name": "Period",          "fieldtype": "int"},
            {"name": "Group",           "fieldtype": "int"},
            {"name": "AtomicWeight",    "fieldtype": "float"},
            {"name": "ApproxDiscovery", "fieldtype": "date"}
        ]
    }
}
```

An LLM that knows a CSV file's format can write a `.serial` file like
this directly, without needing to run inference. This is usually faster
and more reliable than `--generate` when you can examine the data.

Metadata describes the intended format, so values that do not conform
are data errors, not type-inference hints.
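
To make concrete what such a file tells a reader, here is a toy reader built on the standard library that applies a cut-down version of the metadata above. It is an illustration of the idea only, not how `tdda.serial` is implemented; `read_rows` and `CONVERTERS` are hypothetical names, and the two data rows are invented rather than copied from `elements3-old.csv`.

```python
import csv
import io
import json
from datetime import date

SERIAL_TEXT = """
{
    "format": "http://tdda.info/ns/tdda.serial",
    "tdda.serial": {
        "delimiter": ",",
        "null_indicator": "-",
        "fields": [
            {"name": "Z",               "fieldtype": "int"},
            {"name": "Name",            "fieldtype": "string"},
            {"name": "Group",           "fieldtype": "int"},
            {"name": "ApproxDiscovery", "fieldtype": "date"}
        ]
    }
}
"""

# Invented rows in the same style as the described format.
CSV_TEXT = (
    "Z,Name,Group,ApproxDiscovery\n"
    "1,Alpha,1,1900-01-01\n"
    "2,Beta,-,1950-06-30\n"
)

CONVERTERS = {'int': int, 'float': float, 'string': str,
              'date': date.fromisoformat}


def read_rows(csv_text, serial):
    """Toy reader: apply delimiter, null marker and field types."""
    md = serial['tdda.serial']
    nulls = md.get('null_indicator', '')
    nulls = {nulls} if isinstance(nulls, str) else set(nulls)
    types = {f['name']: f['fieldtype'] for f in md['fields']}
    reader = csv.DictReader(io.StringIO(csv_text),
                            delimiter=md.get('delimiter', ','))
    return [{k: None if v in nulls else CONVERTERS[types[k]](v)
             for k, v in row.items()}
            for row in reader]


rows = read_rows(CSV_TEXT, json.loads(SERIAL_TEXT))
for row in rows:
    print(row)
```

In this sketch a value such as `x` in the `Z` column raises a `ValueError` instead of silently turning the column into strings, which mirrors the principle that non-conforming values are data errors.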


### Reading with `tdda.serial`

#### Reading a format you know

When you have a `.serial` file (or can write one), use `csv_to_pandas`
or `csv_to_polars`:

```python
from tdda.serial import csv_to_pandas, csv_to_polars

# Explicit metadata path
df = csv_to_pandas('elements3-old.csv', md_path='elements3-old.serial')
df = csv_to_polars('elements3-old.csv', md_path='elements3-old.serial')

# Auto-locate metadata (same stem name, same directory)
df = csv_to_pandas('elements3-old.csv', find_md=True)
df = csv_to_polars('elements3-old.csv', find_md=True)

# Colon suffix — equivalent to find_md=True
df = csv_to_pandas('elements3-old.csv:')
df = csv_to_polars('elements3-old.csv:')

# Colon with explicit metadata path
df = csv_to_pandas('elements3-old.csv:elements3-old.serial')
```

Auto-location (`find_md=True` or the colon suffix) searches for metadata
in priority order: `foo.csv.serial`, `foo.serial`, wildcard matches using
`@` as a glob character (e.g. `@.serial`), then CSVW and Frictionless
naming conventions.
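
The first two steps of that search order can be sketched with `pathlib`. This is an illustration only: `candidate_metadata_paths` and `find_metadata` are hypothetical helpers, and the wildcard, CSVW and Frictionless steps are deliberately left out.

```python
import tempfile
from pathlib import Path


def candidate_metadata_paths(csv_path):
    """Hypothetical: the first two candidates, in priority order."""
    p = Path(csv_path)
    return [p.with_name(p.name + '.serial'),  # foo.csv.serial
            p.with_suffix('.serial')]         # foo.serial


def find_metadata(csv_path):
    """Return the first candidate that exists, or None."""
    for candidate in candidate_metadata_paths(csv_path):
        if candidate.exists():
            return candidate
    return None


with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / 'foo.csv'
    (Path(tmp) / 'foo.serial').write_text('{}')
    first = find_metadata(csv_path).name       # only foo.serial exists
    (Path(tmp) / 'foo.csv.serial').write_text('{}')
    second = find_metadata(csv_path).name      # foo.csv.serial now wins

print(first, second)
```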

**Pandas backends:** `csv_to_pandas` defaults to the `numpy_nullable`
backend. The `backend` parameter overrides this:

```python
df = csv_to_pandas('foo.csv:', backend='original')    # traditional Pandas dtypes
df = csv_to_pandas('foo.csv:', backend='pyarrow')     # Arrow-backed dtypes
df = csv_to_pandas('foo.csv:', backend='numpy_nullable')  # default
```

**Polars note:** `polars.read_csv` is less flexible than `pandas.read_csv`
for unusual formats. In particular, Polars can only parse ISO 8601 dates
directly. `csv_to_polars` works around this by reading problematic fields
as strings and converting them in a post-processing step.

#### Reading an unfamiliar format

When you don't know a CSV file's format, use `tdda serial --generate`
to infer it:

```
tdda serial --generate foo.csv foo.serial
```

This reads `foo.csv`, applies heuristics, and writes `foo.serial`. The
result is a starting point, not a guarantee — inspect and correct it
before relying on it. Key override switches:

```
--sep C              Set field delimiter to C
--nulls S            Set null indicator(s)
--date-format FMT    Set default date format
--quote-char Q       Set quote character
--escape             Use backslash escaping
--stutter            Use quote stuttering
--encoding ENC       Set encoding
--sample-lines N     Use N lines for inference (default: 1000)
```

For LLMs: if you can read the CSV file directly, you can often write
the `.serial` by hand more quickly and reliably than inference can
produce it. Use `--generate` when the format is complex or uncertain.

After generating or writing the `.serial`, read with `csv_to_pandas` or
`csv_to_polars` as shown above.
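
When writing the `.serial` by hand, a quick look at the delimiter can be automated with the standard library. The sketch below uses `csv.Sniffer`, which is Python's own heuristic and unrelated to the inference behind `tdda serial --generate`. The sample data, the null marker and the field types are invented, with the last two chosen by eye.

```python
import csv
import json

# Invented sample: pipe-separated, with '-' used for a missing amount.
SAMPLE = "id|name|amount\n1|alpha|3.5\n2|beta|-\n"

# Ask the stdlib sniffer to choose among a few plausible delimiters.
dialect = csv.Sniffer().sniff(SAMPLE, delimiters=",;|\t")

serial = {
    "format": "http://tdda.info/ns/tdda.serial",
    "tdda.serial": {
        "delimiter": dialect.delimiter,
        "null_indicator": "-",  # chosen by inspecting the sample
        "fields": [
            {"name": "id", "fieldtype": "int"},
            {"name": "name", "fieldtype": "string"},
            {"name": "amount", "fieldtype": "float"},
        ],
    },
}

print(json.dumps(serial, indent=4))
```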


### Writing with `tdda.serial`

The write wrapper is currently available for Pandas only.

#### `pandas_to_csv`

```python
from tdda.serial import pandas_to_csv

# Write CSV and generate accompanying .serial metadata automatically
info = pandas_to_csv(df, 'output.csv', auto_md_outpath=True)
# Writes output.csv and output.serial

# Colon suffix — equivalent
info = pandas_to_csv(df, 'output.csv:')

# Explicit metadata output path
info = pandas_to_csv(df, 'output.csv', md_outpath='output.serial')

# Use an existing .serial to specify the write format
info = pandas_to_csv(df, 'output.csv', md_inpath='format.serial')

# Use an existing .serial for write format and write a .serial for readers
info = pandas_to_csv(df, 'output.csv',
                     md_inpath='shared-format.serial',
                     md_outpath='output.serial')
```

The return value is a `WriteInfo` object showing the path written, the
metadata output path, and the keyword arguments passed to `to_csv`.

Any keyword arguments you pass that `to_csv` accepts (such as `sep`,
`na_rep`, `encoding`, `date_format`) are forwarded to `to_csv` and also
reflected in the written `.serial` file. For example:

```python
info = pandas_to_csv(df, 'output.psv',
                     auto_md_outpath=True,
                     sep='|',
                     na_rep='NULL',
                     encoding='latin-1')
```

writes a pipe-separated file with `NULL` as the null marker and records
these settings in `output.serial`.

`pandas_to_csv` sets `index=False` by default — it does not write the
Pandas index as a column. This is almost always what you want.
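
The effect of that default can be seen with plain Pandas. This sketch does not call `pandas_to_csv`; it only compares `to_csv` with and without `index=False` on a small invented DataFrame.

```python
import io

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})

# Pandas default: the index is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)

# index=False: only the data columns are written.
without_index = io.StringIO()
df.to_csv(without_index, index=False)

cols_default = list(pd.read_csv(io.StringIO(with_index.getvalue())).columns)
cols_no_index = list(pd.read_csv(io.StringIO(without_index.getvalue())).columns)

print(cols_default)   # ['Unnamed: 0', 'a']
print(cols_no_index)  # ['a']
```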

#### Writing from Polars

There is currently no `polars_to_csv` wrapper, though one is planned.
For now, write using the native `df.write_csv()` method and generate
or write the `.serial` file separately:

```python
# Write the CSV
df.write_csv('output.csv')

# Generate a .serial from the written file
# (run in shell or via subprocess)
# tdda serial --generate output.csv output.serial

# Or write the .serial by hand if you know the format
```

If you need the Pandas write behaviour for a Polars DataFrame, convert
first: `pandas_to_csv(df.to_pandas(), 'output.csv:', use_pyarrow=True)`.

#### Writing a format-only `.serial` (no field info)

A `.serial` file can record only the format conventions without any
field-level detail. This is useful as a shared "house format" that
specifies separator, encoding, null indicator etc., leaving field names
and types to be inferred by the reader. Generate one with:

```
tdda serial --generate "" format.serial --sep "|" --nulls "NULL" --encoding "latin-1"
```

(An empty filename generates a fieldless metadata file.)

Or write it by hand — it is a small JSON file:

```json
{
    "format": "http://tdda.info/ns/tdda.serial",
    "tdda.serial": {
        "encoding": "latin-1",
        "delimiter": "|",
        "null_indicator": "NULL"
    }
}
```

Use `md_inpath='format.serial'` when writing any file in this format.
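
For readers who want to see how such a house format lines up with `to_csv`, here is a small sketch that maps the three keys above onto the corresponding keyword arguments (`sep`, `na_rep`, `encoding`). The function `to_csv_kwargs` is hypothetical and covers only these three keys; it is not how `tdda` performs the translation.

```python
import json

FORMAT_SERIAL = """
{
    "format": "http://tdda.info/ns/tdda.serial",
    "tdda.serial": {
        "encoding": "latin-1",
        "delimiter": "|",
        "null_indicator": "NULL"
    }
}
"""


def to_csv_kwargs(serial):
    """Hypothetical: derive a few DataFrame.to_csv keyword arguments."""
    md = serial.get('tdda.serial', {})
    kwargs = {}
    if 'delimiter' in md:
        kwargs['sep'] = md['delimiter']
    if 'encoding' in md:
        kwargs['encoding'] = md['encoding']
    if 'null_indicator' in md:
        nulls = md['null_indicator']
        # to_csv writes a single null marker; use the first if several given.
        kwargs['na_rep'] = nulls if isinstance(nulls, str) else nulls[0]
    return kwargs


kwargs = to_csv_kwargs(json.loads(FORMAT_SERIAL))
print(kwargs)
```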


### The `tdda serial` CLI: Conversion and Code Generation

The `tdda serial` command converts between metadata formats and generates
Python code for reading files without requiring the `tdda` library.

#### Format conversion

```
tdda serial infile outfile [--to FORMAT]
```

The format is inferred from the filename when it follows the naming
conventions; use `--to` when it does not. Format abbreviations:

| Short | Long form |
|-------|-----------|
| `.` | `tdda.serial` (default) |
| `pd.r` | `pandas.read_csv` |
| `pd.w` | `pandas.DataFrame.to_csv` |
| `pl.r` | `polars.read_csv` |
| `pl.w` | `polars.DataFrame.write_csv` |
| `csvw` | CSVW |
| `fl` | Frictionless |
| `fl.r` | Frictionless resource |
| `fl.p` | Frictionless package |

Examples:

```
# Convert between formats (inferred from filenames)
tdda serial foo.serial foo-metadata.json        # tdda.serial → CSVW
tdda serial foo-metadata.json foo.serial        # CSVW → tdda.serial

# Explicit format
tdda serial --to csvw foo.serial foo-out.json
tdda serial --to pl.r foo.serial foo-pl.serial  # add Polars read_csv section

# Generate from a CSV file
tdda serial --generate foo.csv foo.serial

# Pandas backend when converting to Pandas sections
tdda serial --to pd.r --backend a foo.serial foo-pdr.serial  # PyArrow dtypes
```

#### Generating Python code

Use a `.py` output extension to generate a standalone `read_data()`
function that does not require `tdda` to be installed:

```
tdda serial foo.serial foo_reader.py --to pd.r
```

This produces something like:

```python
import pandas as pd

def read_data(inpath):
    return pd.read_csv(inpath, sep=',', encoding='UTF-8',
        escapechar='\\', quotechar='"',
        dtype={'id': 'Int64', 'price': 'Float64'},
        na_values='-', keep_default_na=False)
```

This is useful when sharing code with users who do not have `tdda`
installed, or when you want to hard-wire the read parameters.
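
A reader of this shape can be tried on in-memory data. The sketch below adapts the function above (dropping `encoding`, which is not needed for a text buffer) and feeds it three invented rows. It is an illustration of how the hard-wired parameters behave, not actual `tdda serial` output.

```python
import io

import pandas as pd


def read_data(inpath):
    # Same parameters as the example reader above, minus encoding.
    return pd.read_csv(inpath, sep=',',
                       escapechar='\\', quotechar='"',
                       dtype={'id': 'Int64', 'price': 'Float64'},
                       na_values='-', keep_default_na=False)


# Invented data using '-' as the null marker in both columns.
df = read_data(io.StringIO("id,price\n1,2.5\n-,3.0\n3,-\n"))

print(df.dtypes)
print(df)
```

With the nullable `Int64` dtype, the `id` column stays an integer column despite the missing value, instead of being promoted to `float64`.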

The `--for FILE` flag sets the data path in CSVW or Frictionless output
(CSVW requires a `url`):

```
tdda serial --to csvw foo.serial foo-metadata.json --for foo.csv
```


### Using Metadata with Other `tdda` Tools

The colon syntax works with all `tdda` command-line tools that accept
CSV files. Adding `:` to a CSV path tells `tdda` to find and use
metadata automatically; adding `:path` specifies the metadata explicitly.

```
# tdda verify — validate data against constraints
tdda verify foo.csv elements118.tdda              # no metadata: types may be wrong
tdda verify foo.csv: elements118.tdda             # auto-find metadata
tdda verify foo.csv:foo.serial elements118.tdda   # explicit metadata

# tdda diff — compare two datasets
tdda diff old.csv: new.csv:                       # use metadata for both

# tdda discover — generate constraints
tdda discover foo.csv: foo.tdda
```

Without metadata, type mismatches cause spurious failures in `verify`
and `diff`. A `Group` column read as string (because the null marker
isn't recognised) fails type, min, max, and sign constraints — four
failures from one formatting oversight.

CSVW and Frictionless files work in the same positions as `.serial` files.
A CSVW file that contains the data path can be specified instead of the
CSV file itself:

```
tdda verify foo-metadata.json elements118.tdda
```


### Checklist and Recommended Agent Behaviour

☐ **When you encounter a CSV with a `.serial` file alongside it**, use
`csv_to_pandas('foo.csv:')` or `csv_to_polars('foo.csv:')` rather than
bare `pd.read_csv`.

☐ **When writing CSV data that will be shared or reused**, use
`pandas_to_csv(df, 'output.csv:')` to write a `.serial` alongside it.
If the format has non-standard settings, pass them as keyword arguments
and they will be recorded in the `.serial`.

☐ **When writing data-producing code**, consider whether the CSV will
be read later or by other code. If so, add `auto_md_outpath=True`.
For LLM-generated code the overhead is negligible.

☐ **When reading an unfamiliar CSV**, run
`tdda serial --generate foo.csv foo.serial`, inspect the result, correct
if needed, then use `csv_to_pandas('foo.csv:')`. If you can read the
file directly, writing the `.serial` by hand is often faster.

☐ **When running `tdda verify` or `tdda diff` on CSV files**, add the
colon suffix: `foo.csv:` rather than `foo.csv`, if a suitably-named
metadata file is available. Without it, type mismatches produce
spurious failures.

☐ **CSVW and Frictionless files** are accepted wherever `.serial` files
are. If you receive data with CSVW or Frictionless metadata, pass it
directly to `csv_to_pandas` or use the colon syntax.

☐ **Don't add metadata to simple throwaway scripts** where the CSV
is standard (comma, UTF-8, no dates, no unusual nulls) and nothing
downstream will read it. The overhead is not worth it.

☐ **To share read code without requiring `tdda`**, generate a standalone
Python reader: `tdda serial foo.serial foo_reader.py --to pd.r`.

☐ **To convert between metadata formats** (tdda.serial ↔ CSVW ↔
Frictionless), use `tdda serial infile outfile`.


### Further Reading

 - [tdda.serial documentation](https://tdda.readthedocs.io/en/latest/tdda.serial.html)
 - [`tdda serial` man page](https://tdda.readthedocs.io/en/latest/cli.html#tdda-serial)
 - [tdda library source](https://github.com/tdda/tdda/tree/master/tdda/serial)
 - [_Test-Driven Data Analysis_](https://www.routledge.com/Test-Driven-Data-Analysis/Radcliffe/p/book/9781032897158)
   (Radcliffe, CRC Press, 2026), Chapter 8
 - [Book resources](https://book.tdda.info)