Constraint Discovery and Verification for Pandas DataFrames
Posted on Thu 03 November 2016 in TDDA
Background
In a previous post, Constraints and Assertions, we introduced the idea of using constraints to verify input, output and intermediate datasets for an analytical process. We also demonstrated that candidate constraints can be automatically generated from example datasets. We prototyped this in our own software (Miró), expressing constraints as Lisp S-expressions.
Improving and Extending the Approach: Open-Source Pandas Code
We have now taken the core ideas, polished them a little and made them available through an open-source library, currently on GitHub. We will push it to PyPI when it has solidified a little further.
The constraint code I'm referring to is available in the constraints module of the tdda repository for the tdda user on GitHub. So if you issue the command

git clone https://github.com/tdda/tdda.git

in a directory somewhere on your PYTHONPATH, this should enable you to use it.
The TDDA library:

- is under an MIT licence;
- runs under Python2 and Python3;
- other than Pandas itself, has no dependencies outside the standard library unless you want to use feather files (see below);
- includes a base layer to help with building constraint verification and discovery libraries for various systems;
- includes Pandas implementations of constraint discovery and verification through a (Python) API;
- uses a new JSON format (normally stored in .tdda files) for saving constraints;
- also includes a prototype command-line tool for verifying a dataframe stored in feather format against a .tdda file of constraints.

Feather is a file format developed by Wes McKinney (the original creator of Pandas) and Hadley Wickham (of ggplot and tidyverse fame) for dataframes. It allows interchange between R and Pandas while preserving type information and exact values, and is based on the Apache Arrow project. It can be used directly or through our ugly-but-useful extension library pmmif, which allows extra metadata (including extended type information) to be saved alongside a .feather file, in a companion .pmm file.
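As a minimal sketch of what the feather round trip looks like, assuming the feather-format package is installed (pip install feather-format):

import pandas as pd
import feather

df = pd.DataFrame({'a': [1, 2, 9], 'b': ['one', 'two', None]})

# Write the DataFrame to a feather file; types and values are preserved.
feather.write_dataframe(df, '/tmp/example.feather')

# Read it back (this could equally well be done from R).
df2 = feather.read_dataframe('/tmp/example.feather')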
Testing
All the constraint-handling code is in the constraints module within the TDDA repository.
After you've cloned the repository, it's probably a good idea to run the tests. There are two sets, and both should run under Python2 or Python3.
$ cd tdda/constraints
$ python testbase.py
.....
----------------------------------------------------------------------
Ran 5 tests in 0.003s
OK
$ python testpdconstraints.py
.....................
----------------------------------------------------------------------
Ran 21 tests in 0.123s
OK
There is example code (which we'll walk through below) in the examples subdirectory of constraints.
Basic Use
Constraint Discovery
Here is some minimal code for getting the software to discover constraints satisfied by a Pandas DataFrame:
import pandas as pd
import numpy as np
from tdda.constraints.pdconstraints import discover_constraints

df = pd.DataFrame({'a': [1, 2, 9], 'b': ['one', 'two', np.nan]})
constraints = discover_constraints(df)
with open('/tmp/example_constraints.tdda', 'w') as f:
    f.write(constraints.to_json())
(This is the core of the example code in tdda/constraints/examples/simple_discovery.py, and is included in the docstring for the discover_constraints function.)
Hopefully the code is fairly self-explanatory, but walking through the lines after the imports:

- We first generate a trivial 3-row DataFrame with an integer column a and a string column b. The string column includes a Pandas null (NaN).
- We then pass that DataFrame to the discover_constraints function from tdda.constraints.pdconstraints, and it returns a DatasetConstraints object, which is defined in tdda.constraints.base.
- The resulting constraints object has a to_json() method which converts the structured constraints into a JSON string.
- In the example, we write that to /tmp/example_constraints.tdda; we encourage everyone to use the .tdda extension for these JSON constraint files.
This is what happens if we run the example file:
$ cd tdda/constraints/examples
$ python simple_discovery.py
Written /tmp/example_constraints.tdda successfully.
$ cat /tmp/example_constraints.tdda
{
"fields": {
"a": {
"type": "int",
"min": 1,
"max": 9,
"sign": "positive",
"max_nulls": 0,
"no_duplicates": true
},
"b": {
"type": "string",
"min_length": 3,
"max_length": 3,
"max_nulls": 1,
"no_duplicates": true,
"allowed_values": [
"one",
"two"
]
}
}
}
As you can see, in this case the system has 'discovered' six constraints for each field, and it's rather easy to read what they are and at least roughly what they mean. We'll do a separate post describing the .tdda JSON file format, but it's documented in tdda/constraints/tdda_json_file_format.md in the repository (which, through the almost unfathomable power of GitHub, means you can see it formatted here).
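Because the constraints are stored as plain JSON, you don't need any special tooling to inspect them. As a small sketch, assuming the file written by the discovery example above, you can load a .tdda file with nothing more than the standard library's json module:

import json

# Load the constraints written by the discovery example.
with open('/tmp/example_constraints.tdda') as f:
    constraints = json.load(f)

# Print each field with the names of its discovered constraints.
for field, cs in sorted(constraints['fields'].items()):
    print(field, sorted(cs))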
Constraint Verification
Now that we have a .tdda file, we can use it to verify a DataFrame. First, let's look at code that should lead to a successful verification (this code is in tdda/constraints/examples/simple_verify_pass.py).
import pandas as pd
import numpy as np
from tdda.constraints.pdconstraints import verify_df

df = pd.DataFrame({'a': [2, 4], 'b': ['one', np.nan]})
v = verify_df(df, 'example_constraints.tdda')
print('Passes: %d' % v.passes)
print('Failures: %d\n\n\n' % v.failures)
print(str(v))
print('\n\n')
print(v.to_frame())
Again, hopefully the code is fairly self-explanatory, but:

- df is a DataFrame that is different from the one we used to generate the constraints, but is nevertheless consistent with the constraints.
- The verify_df function from tdda.constraints.pdconstraints takes a DataFrame and the location of a .tdda file and verifies the DataFrame against the constraints in the file. The function returns a PandasVerification object. The PandasVerification class is a subclass of Verification from tdda.constraints.base, adding the ability to turn the verification object into a DataFrame.
- All verification objects include passes and failures attributes, which respectively indicate the number of constraints that passed and the number that failed. So the simplest complete verification is simply to check that v.failures == 0 (see the test sketch after this list).
- Verification objects also include a __str__ method. Its output currently includes a section for fields and a summary.
- The to_frame() method converts a PandasVerification into a Pandas DataFrame.
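Since the whole point of TDDA is testing, the natural home for such a check is a unit test. Here is a minimal sketch; the DataFrame is a stand-in for the real output of your analytical process, and the path assumes the constraints file written by the discovery example above:

import unittest

import pandas as pd
from tdda.constraints.pdconstraints import verify_df

class TestDataConstraints(unittest.TestCase):
    def test_output_satisfies_constraints(self):
        # Stand-in for the real output of an analytical process.
        df = pd.DataFrame({'a': [2, 4], 'b': ['one', 'two']})
        # Verify against the constraints file written earlier.
        v = verify_df(df, '/tmp/example_constraints.tdda')
        # The simplest complete check: no constraint failed.
        self.assertEqual(v.failures, 0)

if __name__ == '__main__':
    unittest.main()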
If we run simple_verify_pass.py, the result is as follows:
$ python simple_verify_pass.py
Passes: 12
Failures: 0
FIELDS:
a: 0 failures 6 passes type ✓ min ✓ max ✓ sign ✓ max_nulls ✓ no_duplicates ✓
b: 0 failures 6 passes type ✓ min_length ✓ max_length ✓ max_nulls ✓ no_duplicates ✓ allowed_values ✓
SUMMARY:
Passes: 12
Failures: 0
field failures passes type min min_length max max_length sign \
0 a 0 6 True True NaN True NaN True
1 b 0 6 True NaN True NaN True NaN
max_nulls no_duplicates allowed_values
0 True True NaN
1 True True True
In the DataFrame produced from the verification:

- True indicates a constraint that was satisfied in the dataset;
- False indicates a constraint that was not satisfied in the dataset;
- NaN (null) indicates a constraint that was not present for that field.

As you would expect, the field, failures and passes columns are, respectively, the name of the field, the number of failures and the number of passes for that field.
If we now change the DataFrame definition to:

df = pd.DataFrame({'a': [0, 1, 2, 10, np.nan],
                   'b': ['one', 'one', 'two', 'three', np.nan]})

(as is the case in tdda/constraints/examples/simple_verify_fail.py), we now expect some constraint failures. If we run this, we see:
$ python simple_verify_fail.py
Passes: 5
Failures: 7
FIELDS:
a: 4 failures 2 passes type ✓ min ✗ max ✗ sign ✗ max_nulls ✗ no_duplicates ✓
b: 3 failures 3 passes type ✓ min_length ✓ max_length ✗ max_nulls ✓ no_duplicates ✗ allowed_values ✗
SUMMARY:
Passes: 5
Failures: 7
field failures passes type min min_length max max_length sign \
0 a 4 2 True False NaN False NaN False
1 b 3 3 True NaN True NaN False NaN
max_nulls no_duplicates allowed_values
0 False True NaN
1 True False False
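When there are failures, the DataFrame form of the verification makes it easy to home in on them. As a small sketch, continuing from the verification object v in simple_verify_fail.py:

# Convert the verification to a DataFrame and keep only fields with failures.
vdf = v.to_frame()
print(vdf[vdf.failures > 0])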
Final Notes
There are more options and there's more to say about the Pandas implementation, but that's probably enough for one post. We'll have follow-ups on the file format, more options, and the foibles of Pandas.
If you want to hear more, follow us on Twitter at @tdda0.