The TDDA Constraints File Format
Posted on Fri 04 November 2016 in TDDA
Background
We recently extended the tdda library to include support for automatic discovery of constraints from datasets, and for verification of datasets against constraints. Yesterday's post—Constraint Discovery and Verification for Pandas DataFrames—describes these developments and the API.
The library we published is intended to be a base for producing various implementations of the constraint discovery and verification process, and uses a JSON file format (extension .tdda) to save constraints in a form that should be interchangeable between implementations.
We currently have two compatible implementations—the open-source
Pandas code in the library and the implementation in our own analytical
software, Miró.
This post describes the .tdda JSON file format. The bulk of it is merely a snapshot of the documentation shipped with the library in the Github repository (visible on Github).
We intend to keep that file up to date as we expand the format.
The TDDA JSON File Format
The TDDA constraints library (Repository https://github.com/tdda/tdda, module constraints) uses a JSON file to store constraints.
This document describes that file format.
Purpose
TDDA files describe constraints on a dataset, with a view to verifying the dataset to check whether any or all of the specified constraints are satisfied.
A dataset is assumed to consist of one or more fields (also known as columns), each of which has a (different) name1 and a well-defined type.2 Each field has a value for each of a number of records (also known as rows). In some cases, values may be null (or missing).3 Even a field consisting entirely of nulls can be considered to have a type.
Familiar examples of datasets include:
- tables in relational databases
- DataFrames (in Pandas and R)
- flat ("CSV") files (subject to type inference or assignment)
- sheets in spreadsheets, or areas within sheets, if the columns have names, are not merged, and have values with consistent meanings and types over an entire column
- more generally, many forms of tabular data.
In principle, TDDA files are intended to be capable of supporting any kind of constraint regarding datasets. Today, we are primarily concerned with:
- field types
- minimum and maximum values (or in the case of string fields, minimum and maximum string lengths)
- whether nulls are allowed
- whether duplicate values are allowed within a field
- the allowed values for a field.
The format also has support for describing relations between fields.
Future extensions we can already foresee include:
- dataset-level constraints (e.g. numbers of records; required or disallowed fields)
- sortedness of fields, of field values or both
- regular expressions to which string fields should conform
- constraints on subsets of the data (e.g. records dated after July 2016 should not have null values for the ID field)
- constraints on substructure within fields (e.g. constraints on tagged subexpressions from regular expressions to which string fields are expected to conform)
- potentially checksums (though this is more suitable for checking the integrity of transfer of a specific dataset than for use across multiple related datasets)
- constraints between datasets, most obviously key relations (e.g. every value of field KA in dataset A should also occur in field KB in dataset B).
The motivation for generating, storing and verifying datasets against such sets of constraints is that they can provide a powerful way of detecting bad or unexpected inputs to, or outputs from, a data analysis process. They can also be valuable as checks on intermediate results. While manually generating constraints can be burdensome, automatic discovery of constraints from example datasets, potentially followed by manual removal of over-specific constraints, provides a good cost/benefit ratio in many situations.
Filename and encoding
- The preferred extension for TDDA Constraints files is .tdda.
- TDDA constraints files must be encoded as UTF-8.
- TDDA files must be valid JSON.
Example
This is an extremely simple example TDDA file:
{
"fields": {
"a": {
"type": "int",
"min": 1,
"max": 9,
"sign": "positive",
"max_nulls": 0,
"no_duplicates": true
},
"b": {
"type": "string",
"min_length": 3,
"max_length": 3,
"max_nulls": 1,
"no_duplicates": true,
"allowed_values": [
"one",
"two"
]
}
}
}
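To make the structure concrete, here is a minimal sketch, not the tdda library's API, of loading a constraints document like the one above and checking a list of values against a few of the constraints on field a. The helper name check_min_max is hypothetical, used only for illustration.

```python
import json

# The example constraints file from above, loaded as a Python dictionary.
EXAMPLE = json.loads("""
{
    "fields": {
        "a": {"type": "int", "min": 1, "max": 9,
              "max_nulls": 0, "no_duplicates": true}
    }
}
""")

def check_min_max(values, constraints):
    """Hypothetical helper: return the names of failing constraints,
    checking only min, max and max_nulls against a list of values
    (None standing in for null)."""
    failures = []
    non_null = [v for v in values if v is not None]
    if "min" in constraints and any(v < constraints["min"] for v in non_null):
        failures.append("min")
    if "max" in constraints and any(v > constraints["max"] for v in non_null):
        failures.append("max")
    if "max_nulls" in constraints:
        if values.count(None) > constraints["max_nulls"]:
            failures.append("max_nulls")
    return failures
```

For example, check_min_max([1, 5, 9], EXAMPLE["fields"]["a"]) returns an empty list, while check_min_max([0, 5, None], EXAMPLE["fields"]["a"]) reports both the min and max_nulls constraints as failing.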
General Structure
A TDDA file is a JSON dictionary. There are currently two supported top-level keys:
- fields: constraints for individual fields, keyed on the field name. (In TDDA, we generally refer to dataset columns as fields.)
- field_groups: constraints specifying relations between multiple fields (two, for now). field_groups constraints are keyed on a comma-separated list of the names of the fields to which they relate, and order is significant.
Both top-level keys are optional.
In future, we expect to add further top-level keys (e.g. for possible constraints on the number of rows, required or disallowed fields etc.)
The order of constraints in the file is immaterial (of course; this is JSON), though writers may choose to present fields in a particular order, e.g. dataset order or sorted on fieldname.
Field Constraints
The value of a field constraints entry (in the fields section) is a dictionary keyed on constraint kind. For example, the constraints on field a in the example above are specified as:
"a": {
"type": "int",
"min": 1,
"max": 9,
"sign": "positive",
"max_nulls": 0,
"no_duplicates": true
}
The TDDA library currently recognizes the following kinds of constraints:
type
min
max
min_length
max_length
sign
max_nulls
no_duplicates
allowed_values
Other constraint libraries are free to define their own, custom kinds of constraints. We will probably recommend that non-standard constraints have names beginning with a colon-terminated prefix. For example, if we wanted to support more specific Pandas type constraints, we would probably use a key such as pandas:type for this.
The value of a constraint is often simply a scalar value, but can be a list or a dictionary; when it is a dictionary, it should include a key value, with the principal value associated with the constraint (true, if there is no specific value beyond the name of the constraint).
If the value of a constraint (the scalar value, or the value key if the value is a dictionary) is null, this is taken to indicate the absence of a constraint. A constraint with value null should be completely ignored, so that a constraints file including null-valued constraints should produce identical results to one omitting those constraints. (This obviously means that we discourage using null as a meaningful constraint value, though the string "null" is fine, and in fact we use it for sign constraints.)
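The scalar-or-dictionary convention, together with the rule that null-valued constraints are ignored, can be sketched as follows. This is an illustrative helper, not part of the library, and the function name is an assumption.

```python
def constraint_value(raw):
    """Hypothetical helper: extract the principal value of a constraint.

    Returns (value, present): present is False when the constraint is
    null-valued and should be ignored entirely; for dictionary-valued
    constraints, the principal value is taken from the "value" key,
    defaulting to True."""
    if isinstance(raw, dict):
        raw = raw.get("value", True)  # dictionary form: look up "value"
    if raw is None:                   # null ==> no constraint at all
        return None, False
    return raw, True
```

Note that the string "null" passes through unchanged (it is a meaningful value, e.g. for sign constraints); only the JSON null switches the constraint off.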
The semantics and values of the standard field constraint types are as follows:
- type: the allowed (standard, TDDA) type of the field. This can be a single value from bool (boolean), int (integer; whole-numbered), real (floating-point values), string (unicode in Python 3; byte string in Python 2) or date (any kind of date or date-time, with or without timezone information). It can also be a list of such allowed values (in which case order is not significant). It is up to the generation and verification libraries to map between the actual types in whatever dataset/dataframe/table/... object is used and these TDDA constraint types, though over time we may provide further guidance.
Examples:
{"type": "int"}
{"type": ["int", "real"]}
- min: the minimum allowed value for a field. This is often a simple value, but in the case of real fields, it can be convenient to specify a level of precision. In particular, a minimum value can have a precision (default: fuzzy) of:
closed: all non-null values in the field must be greater than or equal to the value specified.
open: all non-null values in the field must be strictly greater than the value specified.
fuzzy: when the precision is specified as fuzzy, the verifier should allow a small degree of violation of the constraint without generating a failure. Verifiers take a parameter, epsilon, which specifies how strictly fuzzy constraints should be enforced: epsilon is the fraction of the constraint value by which field values are allowed to violate the constraint without being considered to fail it. This defaults to 0.01 (i.e. 1%). Notice that this means that constraint values of zero are never fuzzy.
Examples: {"min": 1}, {"min": 1.2}, {"min": {"value": 3.4, "precision": "fuzzy"}}.
JSON does not, of course, have a date type. TDDA files specifying dates should use string representations in one of the following formats:
YYYY-MM-DD for dates without times
YYYY-MM-DD hh:mm:ss for date-times without timezone
YYYY-MM-DD hh:mm:ss [+-]ZZZZ for date-times with timezone.
We recommend that writers use precisely these formats, but that readers offer some flexibility in reading, e.g. accepting / as well as - to separate date components, and T as well as space to separate the time component from the date.
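A reader offering the recommended flexibility might normalize incoming date strings before parsing. The following is a sketch using only the Python standard library; the function name is hypothetical, not part of the tdda library.

```python
from datetime import datetime

def parse_tdda_date(s):
    """Hypothetical reader helper: parse a TDDA date string, accepting
    '/' as well as '-' between date components, and 'T' as well as a
    space between the date and time parts."""
    s = s.strip().replace("/", "-").replace("T", " ")
    for fmt in ("%Y-%m-%d %H:%M:%S %z",   # date-time with timezone
                "%Y-%m-%d %H:%M:%S",      # date-time without timezone
                "%Y-%m-%d"):              # date only
        try:
            return datetime.strptime(s, fmt)
        except ValueError:
            pass
    raise ValueError("Unrecognized TDDA date: %r" % s)
```

A writer, by contrast, should emit only the canonical forms listed above.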
- max: the maximum allowed value for a field. Much like min, but for maximum values. Examples: {"max": 1}, {"max": 1.2}, {"max": {"value": 3.4, "precision": "closed"}}. Dates should be formatted as for min values.
- min_length: the minimum allowed length of strings in a string field. How unicode strings are counted is up to the implementation. Example: {"min_length": 2}
- max_length: the maximum allowed length of strings in a string field. How unicode strings are counted is up to the implementation. Example: {"max_length": 22}
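The fuzzy precision rule described under min might be implemented along the following lines. This is a sketch, not the library's implementation, with epsilon defaulting to 1% as described above.

```python
def satisfies_min(value, min_value, precision="fuzzy", epsilon=0.01):
    """Illustrative check of one non-null value against a min constraint.

    closed: value >= min_value
    open:   value >  min_value
    fuzzy:  allow values to undershoot min_value by epsilon * |min_value|;
            since the allowance scales with the constraint value, a min
            of zero is never fuzzy."""
    if precision == "open":
        return value > min_value
    if precision == "closed":
        return value >= min_value
    return value >= min_value - abs(min_value) * epsilon  # fuzzy (default)
```

A corresponding check for max constraints would allow values to overshoot by the same fraction.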
- sign: for numeric fields, the allowed sign of (non-null) values. Although this overlaps with minimum and maximum values, it is often useful to have a separate sign constraint, which carries semantically different information. Allowed values are:
positive: all values must be greater than zero
non-negative: no value may be less than zero
zero: all values must be zero
non-positive: no value may be greater than zero
negative: all values must be negative
null: no signed values are allowed, i.e. the field must be entirely null.
Example: {"sign": "non-negative"}
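The six sign values map directly onto simple predicates over the non-null values, as this illustrative (not library) helper shows:

```python
def satisfies_sign(values, sign):
    """Illustrative check of a list of values (None for null) against
    a sign constraint; only non-null values are tested."""
    non_null = [v for v in values if v is not None]
    tests = {
        "positive":     lambda v: v > 0,
        "non-negative": lambda v: v >= 0,
        "zero":         lambda v: v == 0,
        "non-positive": lambda v: v <= 0,
        "negative":     lambda v: v < 0,
    }
    if sign == "null":                 # the field must be entirely null
        return not non_null
    return all(tests[sign](v) for v in non_null)
```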
- max_nulls: the maximum number of nulls allowed in the field. This can be any non-negative value. We recommend only writing values of zero (no nulls are allowed) or one (at most a single null is allowed) into this constraint, but checking against any value found. Example: {"max_nulls": 0}
- no_duplicates: when this constraint is set on a field (with value true), it means that each non-null value must occur only once in the field. The current implementation only uses this constraint for string fields. Example: {"no_duplicates": true}
- allowed_values: the value of this constraint is a list of allowed values for the field. The order is not significant. Example: {"allowed_values": ["red", "green", "blue"]}
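The three constraints above (max_nulls, no_duplicates, allowed_values) are straightforward to verify together. The helper below is a sketch under the same assumptions as before: None stands for null, and the function is not part of the tdda library.

```python
def verify_field(values, constraints):
    """Illustrative verifier for max_nulls, no_duplicates and
    allowed_values; returns the set of failing constraint names."""
    failures = set()
    non_null = [v for v in values if v is not None]
    if "max_nulls" in constraints:
        if values.count(None) > constraints["max_nulls"]:
            failures.add("max_nulls")
    if constraints.get("no_duplicates"):
        # Each non-null value may occur at most once.
        if len(set(non_null)) < len(non_null):
            failures.add("no_duplicates")
    if "allowed_values" in constraints:
        if not set(non_null) <= set(constraints["allowed_values"]):
            failures.add("allowed_values")
    return failures
```

Note that nulls are excluded before the no_duplicates and allowed_values checks, matching the "non-null value" wording above.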
MultiField Constraints
Multifield constraints are not yet being generated by this implementation, though our (proprietary) Miró implementation does produce them. The currently planned constraint types for field relations cover field equality and inequality for pairs of fields, with options to specify null relations too.
A simple example would be:
"field_groups": {
"StartDate,EndDate": {"lt": true}
}
This is a less-than constraint, to be interpreted as StartDate < EndDate wherever StartDate and EndDate are both non-null.
The plan is to support the obvious five equality and inequality relations:
lt: first field value is strictly less than the second field value for each record
lte: first field value is less than or equal to the second field value for each record
eq: first field value is equal to the second field value for each record
gte: first field value is greater than or equal to the second field value for each record
gt: first field value is strictly greater than the second field value for each record.
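These five relations could be checked with a small helper like the following. It is illustrative only (integers stand in for dates, and the function is not part of any published implementation); it follows the rule above that records with a null in either field are skipped.

```python
import operator

# Map the planned relation names to Python comparison operators.
RELATIONS = {"lt": operator.lt, "lte": operator.le, "eq": operator.eq,
             "gte": operator.ge, "gt": operator.gt}

def check_field_group(rows, fields_key, relation):
    """Illustrative check of a two-field relation such as
    "StartDate,EndDate": {"lt": true}, applied to every record in
    which both values are non-null (None for null)."""
    first, second = fields_key.split(",")
    op = RELATIONS[relation]
    return all(op(row[first], row[second])
               for row in rows
               if row[first] is not None and row[second] is not None)
```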
In the case of equality (only), we will probably also support a precision parameter with values fuzzy or precise.
There should probably also be an option to specify relations between null values in pairs of columns, either as a separate constraint or as a qualification on each of the above.
1. Pandas, of course, allows multiple columns to have the same name. This format makes no concessions to such madness, though there is nothing to stop a verifier or generator sharing constraints across all columns with the same name. The Pandas generators and verifiers in this library do not currently attempt to do this. ↩
2. Pandas also allows columns of mixed type. Again, this file format does not recognize such columns, and it would probably be sensible not to use type constraints for columns of mixed type. ↩
3. Pandas uses not-a-number (pandas.np.NaN) to represent null values for numeric, string and boolean fields; it uses a special not-a-time (pandas.NaT) value to represent null date (Timestamp) values. ↩