The TDDA Constraints File Format

Posted on Fri 04 November 2016 in TDDA • Tagged with tdda, constraints, pandas

Background

We recently extended the tdda library to include support for automatic discovery of constraints from datasets, and for verification of datasets against constraints. Yesterday's post—Constraint Discovery and Verification for Pandas DataFrames—describes these developments and the API.

The library we published is intended to be a base for producing various implementations of the constraint discovery and verification process, and uses a JSON file format (extension .tdda) to save constraints in a form that should be interchangeable between implementations. We currently have two compatible implementations—the open-source Pandas code in the library and the implementation in our own analytical software, Miró.

This post describes the .tdda JSON file format. The bulk of it is merely a snapshot of the documentation shipped with the library in the Github repository (visible on Github). We intend to keep that file up to date as we expand the format.

The TDDA JSON File Format

The TDDA constraints library (Repository https://github.com/tdda/tdda, module constraints) uses a JSON file to store constraints.

This document describes that file format.

Purpose

TDDA files describe constraints on a dataset, with a view to verifying the dataset to check whether any or all of the specified constraints are satisfied.

A dataset is assumed to consist of one or more fields (also known as columns), each of which has a (different) name[1] and a well-defined type.[2] Each field has a value for each of a number of records (also known as rows). In some cases, values may be null (or missing).[3] Even a field consisting entirely of nulls can be considered to have a type.

Familiar examples of datasets include:

  • tables in relational databases
  • DataFrames (in Pandas and R)
  • flat ("CSV") files (subject to type inference or assignment)
  • sheets in spreadsheets, or areas within sheets, if the columns have names, are not merged, and have values with consistent meanings and types over an entire column
  • more generally, many forms of tabular data.

In principle, TDDA files are intended to be capable of supporting any kind of constraint regarding datasets. Today, we are primarily concerned with:

  • field types
  • minimum and maximum values (or in the case of string fields, minimum and maximum string lengths)
  • whether nulls are allowed
  • whether duplicate values are allowed within a field
  • the allowed values for a field.

The format also has support for describing relations between fields.

Future extensions we can already foresee include:

  • dataset-level constraints (e.g. numbers of records; required or disallowed fields)
  • sortedness of fields, of field values or both
  • regular expressions to which string fields should conform
  • constraints on subsets of the data (e.g. records dated after July 2016 should not have null values for the ID field)
  • constraints on substructure within fields (e.g. constraints on tagged subexpressions from regular expressions to which string fields are expected to conform)
  • potentially checksums (though this is more suitable for checking the integrity of transfer of a specific dataset than for use across multiple related datasets)
  • constraints between datasets, most obviously key relations (e.g. every value in field KA in dataset A should also occur in field KB in dataset B), as illustrated in the sketch below.
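
As a foretaste of the last of these, here is a small sketch of what checking such a key relation might look like in Pandas (the datasets and field names are hypothetical; this is not yet part of the library):

import pandas as pd

# Hypothetical datasets: every non-null value in field KA of dataset A
# should also occur in field KB of dataset B.
a = pd.DataFrame({'KA': [1, 2, 3]})
b = pd.DataFrame({'KB': [1, 2, 3, 4]})

unmatched = set(a['KA'].dropna()) - set(b['KB'].dropna())
print('Key relation %s' % ('satisfied' if not unmatched
                           else 'violated: %r' % sorted(unmatched)))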

The motivation for generating and storing such sets of constraints, and verifying datasets against them, is that they can provide a powerful way of detecting bad or unexpected inputs to, or outputs from, a data analysis process. They can also be valuable as checks on intermediate results. While manually generating constraints can be burdensome, automatic discovery of constraints from example datasets, potentially followed by manual removal of over-specific constraints, provides a good cost/benefit ratio in many situations.

Filename and encoding

  • The preferred extension for TDDA Constraints files is .tdda.

  • TDDA constraints files must be encoded as UTF-8.

  • TDDA files must be valid JSON.

Example

This is an extremely simple example TDDA file:

{
    "fields": {
        "a": {
            "type": "int",
            "min": 1,
            "max": 9,
            "sign": "positive",
            "max_nulls": 0,
            "no_duplicates": true
        },
        "b": {
            "type": "string",
            "min_length": 3,
            "max_length": 3,
            "max_nulls": 1,
            "no_duplicates": true,
            "allowed_values": [
                "one",
                "two"
            ]
        }
    }
}

General Structure

A TDDA file is a JSON dictionary. There are currently two supported top-level keys:

  • fields: constraints for individual fields, keyed on the field name. (In TDDA, we generally refer to dataset columns as fields.)

  • field_groups: constraints specifying relations between multiple fields (two, for now). field_groups constraints are keyed on a comma-separated list of the names of the fields to which they relate, and order is significant.

Both top-level keys are optional.

In future, we expect to add further top-level keys (e.g. for possible constraints on the number of rows, required or disallowed fields etc.)

The order of constraints in the file is immaterial (of course; this is JSON), though writers may choose to present fields in a particular order, e.g. dataset order or sorted on fieldname.
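
Because the format is plain JSON, constraints files can be read without any special tooling. Here is a minimal sketch, using only the standard library, that loads the example file above (assuming it has been saved as /tmp/example_constraints.tdda) and lists the constraint kinds present for each field:

import json

with open('/tmp/example_constraints.tdda') as f:
    constraints = json.load(f)

for field, field_constraints in sorted(constraints.get('fields', {}).items()):
    print('%s: %s' % (field, ', '.join(sorted(field_constraints))))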

Field Constraints

The value of a field constraints entry (in the fields section) is a dictionary keyed on constraint kind. For example, the constraints on field a in the example above are specified as:

"a": {
    "type": "int",
    "min": 1,
    "max": 9,
    "sign": "positive",
    "max_nulls": 0,
    "no_duplicates": true
}

The TDDA library currently recognizes the following kinds of constraints:

  • type
  • min
  • max
  • min_length
  • max_length
  • sign
  • max_nulls
  • no_duplicates
  • allowed_values

Other constraint libraries are free to define their own, custom kinds of constraints. We will probably recommend that non-standard constraints have names beginning with a colon-terminated prefix. For example, if we wanted to support more specific Pandas type constraints, we would probably use a key such as pandas:type for this.

The value of a constraint is often simply a scalar value, but can be a list or a dictionary; when it is a dictionary, it should include a value key, holding the principal value associated with the constraint (true, if there is no specific value beyond the name of the constraint).

If the value of a constraint (the scalar value, or the value key if the value is a dictionary) is null, this is taken to indicate the absence of a constraint. A constraint with value null should be completely ignored, so that a constraints file including null-valued constraints should produce identical results to one omitting those constraints. (This obviously means that we are discouraging using null as a meaningful constraint value, though a string "null" is fine, and in fact we use this for sign constraints.)
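
For example, a reader might normalize a field's constraints by dropping null-valued ones along these lines (a sketch, assuming the constraints dictionary has already been parsed from JSON):

def effective_constraints(field_constraints):
    # Drop constraints whose value (or whose 'value' key, for
    # dictionary-valued constraints) is None, since a null constraint
    # means "no constraint" and should be ignored entirely.
    result = {}
    for kind, value in field_constraints.items():
        if isinstance(value, dict):
            if value.get('value') is None:
                continue
        elif value is None:
            continue
        result[kind] = value
    return result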

The semantics and values of the standard field constraint types are as follows:

  • type: the allowed (standard, TDDA) type of the field. This can be a single value from bool (boolean), int (integer; whole-numbered), real (floating point values), string (unicode in Python3; byte string in Python2) or date (any kind of date or date time, with or without timezone information). It can also be a list of such allowed values (in which case, order is not significant).

    It is up to the generation and verification libraries to map between the actual types in whatever dataset/dataframe/table/... object is used and these TDDA constraint types, though over time we may provide further guidance.

    Examples:

    • {"type": "int"}
    • {"type": ["int", "real"]}
  • min: the minimum allowed value for a field. This is often a simple value, but in the case of real fields, it can be convenient to specify a level of precision. In particular, a minimum value can have precision (default: fuzzy):

    • closed: all non-null values in the field must be greater than or equal to the value specified.
    • open: all non-null values in the field must be strictly greater than the value specified.
    • fuzzy: when the precision is specified as fuzzy, the verifier should allow a small degree of violation of the constraint without generating a failure. Verifiers take a parameter, epsilon, which specifies how fuzzy the fuzzy constraints should be taken to be: epsilon is a fraction of the constraint value by which field values are allowed to go beyond the constraint without being considered to fail it. This defaults to 0.01 (i.e. 1%). Notice that this means that constraint values of zero are never fuzzy. (See the sketch after this list for a concrete rendering of these rules.)

    Examples are:

    • {"min": 1},
    • {"min": 1.2},
    • {"min": {"value": 3.4}, {"precision": "fuzzy"}}.

    JSON does not—of course—have a date type. TDDA files specifying dates should use string representations in one of the following formats:

    • YYYY-MM-DD for dates without times
    • YYYY-MM-DD hh:mm:ss for date-times without timezone
    • YYYY-MM-DD hh:mm:ss [+-]ZZZZ for date-times with timezone.

    We recommend that writers use precisely these formats, but that readers offer some flexibility in reading, e.g. accepting / as well as - to separate date components, and T as well as space to separate the time component from the date.

  • max: the maximum allowed value for a field. Much like min, but for maximum values. Examples are:

    • {"max": 1},
    • {"max": 1.2},
    • {"max": {"value": 3.4}, {"precision": "closed"}}.

    Dates should be formatted as for min values.

  • min_length: the minimum allowed length of strings in a string field. How unicode strings are counted is up to the implementation. Example:

    • {"min_length": 2}
  • max_length: the maximum allowed length of strings in a string field. How unicode strings are counted is up to the implementation. Example:

    • {"max_length": 22}
  • sign: For numeric fields, the allowed sign of (non-null) values. Although this overlaps with minimum and maximum values, it is often useful to have a separate sign constraint, which carries semantically different information. Allowed values are:

    • positive: All values must be greater than zero
    • non-negative: No value may be less than zero
    • zero: All values must be zero
    • non-positive: No value may be greater than zero
    • negative: All values must be negative
    • null: No signed values are allowed, i.e. the field must be entirely null.

    Example:

    • {"sign": "non-negative"}
  • max_nulls: The maximum number of nulls allowed in the field. This can be any non-negative value. We recommend only writing values of zero (no null values are allowed) or 1 (at most a single null is allowed) into this constraint, but checking against any value found.

    Example:

    • {"max_nulls": 0}
  • no_duplicates: When this constraint is set on a field (with value true), it means that each non-null value must occur only once in the field. The current implementation only uses this constraint for string fields.

    Example:

    • {"no_duplicates": true}
  • allowed_values: The value of this constraint is a list of allowed values for the field. The order is not significant.

    Example:

    • {"allowed_values": ["red", "green", "blue"]}

MultiField Constraints

Multifield constraints are not yet being generated by this implementation, though our (proprietary) Miró implementation does produce them. The currently planned constraint types for field relations cover field equality and inequality for pairs of fields, with options to specify null relations too.

A simple example would be:

"field_groups": {
    "StartDate,EndDate": {"lt": true}
}

This is a less-than constraint, to be interpreted as

  • StartDate < EndDate wherever StartDate and EndDate are both non-null.

The plan is to support the obvious five equality and inequality relations:

  • lt: first field value is strictly less than the second field value for each record
  • lte: first field value is less than or equal to the second field value for each record
  • eq: first field value is equal to the second field value for each record
  • gte: first field value is greater than or equal to the second field value for each record
  • gt: first field value is strictly greater than the second field value for each record.

In the case of equality (only), we will probably also support a precision parameter with values fuzzy or precise.

There should probably also be an option to specify relations between null values in pairs of columns, either as a separate constraint or as a qualification on each of the above.
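
By way of illustration, the StartDate,EndDate lt constraint above could be checked in Pandas roughly as follows (a sketch only, with made-up data; multifield verification is not yet in the open-source library):

import pandas as pd

df = pd.DataFrame({
    'StartDate': pd.to_datetime(['2016-01-01', '2016-03-01', None]),
    'EndDate': pd.to_datetime(['2016-02-01', '2016-04-01', '2016-05-01']),
})

# Only records where both fields are non-null are subject to the constraint.
both = df['StartDate'].notnull() & df['EndDate'].notnull()
failures = (df['StartDate'][both] >= df['EndDate'][both]).sum()
print('lt failures: %d' % failures)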


  [1] Pandas, of course, allows multiple columns to have the same name. This format makes no concessions to such madness, though there is nothing to stop a verifier or generator sharing constraints across all columns with the same name. The Pandas generators and verifiers in this library do not currently attempt to do this.

  [2] Pandas also allows columns of mixed type. Again, this file format does not recognize such columns, and it would probably be sensible not to use type constraints for columns of mixed type.

  [3] Pandas uses not-a-number (pandas.np.NaN) to represent null values for numeric, string and boolean fields; it uses a special not-a-time (pandas.NaT) value to represent null date (Timestamp) values.


Constraint Discovery and Verification for Pandas DataFrames

Posted on Thu 03 November 2016 in TDDA • Tagged with tdda, constraints, pandas

Background

In a previous post, Constraints and Assertions, we introduced the idea of using constraints to verify input, output and intermediate datasets for an analytical process. We also demonstrated that candidate constraints can be automatically generated from example datasets. We prototyped this in our own software (Miró), expressing constraints as lisp S-expressions.

Improving and Extending the Approach: Open-Source Pandas Code

We have now taken the core ideas, polished them a little and made them available through an open-source library, currently on Github. We will push it to PyPI when it has solidified a little further.

The constraint code I'm referring to is available in the constraints module of the tdda repository for the tdda user on github. So if you issue the command

git clone https://github.com/tdda/tdda.git

in a directory somewhere on your PYTHONPATH, this should enable you to use it.

The TDDA library:

  • is under an MIT licence;
  • runs under Python2 and Python3;
  • other than Pandas itself, has no dependencies outside the standard library unless you want to use feather files (see below);
  • includes a base layer to help with building constraint verification and discovery libraries for various systems;
  • includes Pandas implementations of constraint discovery and verification through a (Python) API;
  • uses a new JSON format (normally stored in .tdda files) for saving constraints;
  • also includes a prototype command-line tool for verifying a dataframe stored in feather format against a .tdda file of constraints. Feather is a file format developed by Wes McKinney (the original creator of Pandas) and Hadley Wickham (of ggplot and tidyverse fame) for dataframes that allows interchange between R and Pandas while preserving type information and exact values. It is based on the Apache Arrow project. It can be used directly or using our ugly-but-useful extension library pmmif, which allows extra metadata (including extended type information) to be saved alongside a .feather file, in a companion .pmm file.

Testing

All the constraint-handling code is in the constraints module within the TDDA repository.

After you've cloned the repository, it's probably a good idea to run the tests. There are two sets, and both should run under Python2 or Python3.

$ cd tdda/constraints
$ python testbase.py
.....
----------------------------------------------------------------------
Ran 5 tests in 0.003s

OK

$ python testpdconstraints.py
.....................
----------------------------------------------------------------------
Ran 21 tests in 0.123s

OK

There is example code (which we'll walk through below) in the examples subdirectory of constraints.

Basic Use

Constraint Discovery

Here is some minimal code for getting the software to discover constraints satisfied by a Pandas DataFrame:

import pandas as pd
from tdda.constraints.pdconstraints import discover_constraints

df = pd.DataFrame({'a': [1,2,3], 'b': ['one', 'two', pd.np.NaN]})
constraints = discover_constraints(df)
with open('/tmp/example_constraints.tdda', 'w') as f:
    f.write(constraints.to_json())

(This is the core of the example code in tdda/constraints/examples/simple_discovery.py, and is included in the docstring for the discover_constraints function.)

Hopefully the code is fairly self-explanatory, but walking through the lines after the imports:

  • We first generate a trivial 3-row DataFrame with an integer column a and a string column b. The string column includes a Pandas null (NaN).
  • We then pass that DataFrame to the discover_constraints function from tdda.constraints.pdconstraints, and it returns a DatasetConstraints object, which is defined in tdda.constraints.base.
  • The resulting constraints object has a to_json() method which converts the structured constraints into a JSON string.
  • In the example, we write that to /tmp/example_constraints.tdda; we encourage everyone to use the .tdda extension for these JSON constraint files.

This is what happens if we run the example file:

$ cd tdda/constraints/examples

$ python simple_discovery.py
Written /tmp/example_constraints.tdda successfully.

$ cat /tmp/example_constraints.tdda
{
    "fields": {
        "a": {
            "type": "int",
            "min": 1,
            "max": 9,
            "sign": "positive",
            "max_nulls": 0,
            "no_duplicates": true
        },
        "b": {
            "type": "string",
            "min_length": 3,
            "max_length": 3,
            "max_nulls": 1,
            "no_duplicates": true,
            "allowed_values": [
                "one",
                "two"
            ]
        }
    }
}

As you can see, in this case, the system has 'discovered' six constraints for each field, and it's rather easy to read what they are and at least roughly what they mean. We'll do a separate post describing the .tdda JSON file format, but it's documented in tdda/constraints/tdda_json_file_format.md in the repository (which—through the almost unfathomable power of Github—means you can see it formatted here).

Constraint Verification

Now that we have a .tdda file, we can use it to verify a DataFrame.

First, let's look at code that should lead to a successful verification (this code is in tdda/constraints/examples/simple_verify_pass.py).

import pandas as pd
from tdda.constraints.pdconstraints import verify_df

df = pd.DataFrame({'a': [2, 4], 'b': ['one', pd.np.NaN]})
v = verify_df(df, 'example_constraints.tdda')

print('Passes: %d' % v.passes)
print('Failures: %d\n\n\n' % v.failures)
print(str(v))
print('\n\n')
print(v.to_frame())

Again, hopefully the code is fairly self-explanatory, but:

  • df is a DataFrame that is different from the one we used to generate the constraints, but is nevertheless consistent with the constraints.
  • The verify_df function from tdda.constraints.pdconstraints takes a DataFrame and the location of a .tdda file and verifies the DataFrame against the constraints in the file. The function returns a PandasVerification object. The PandasVerification class is a subclass of Verification from tdda.constraints.base, adding the ability to turn the verification object into a DataFrame.
  • All verification objects include passes and failures attributes, which respectively indicate the number of constraints that passed and the number that failed. So the simplest complete verification is simply to check that v.failures == 0.
  • Verification objects also include a __str__ method. Its output currently includes a section for fields and a summary.
  • the .to_frame() method converts a PandasVerification into a Pandas DataFrame.

If we run this, the result is as follows:

$ python simple_verify_pass.py
Passes: 12
Failures: 0



FIELDS:

a: 0 failures  6 passes  type   min   max   sign   max_nulls   no_duplicates ✓

b: 0 failures  6 passes  type   min_length   max_length   max_nulls   no_duplicates   allowed_values ✓

SUMMARY:

Passes: 12
Failures: 0



  field  failures  passes  type   min min_length   max max_length  sign  \
0     a         0       6  True  True        NaN  True        NaN  True
1     b         0       6  True   NaN       True   NaN       True   NaN

  max_nulls no_duplicates allowed_values
0      True          True            NaN
1      True          True           True

In the DataFrame produced from the Verification

  • True indicates a constraint that was satisfied in the dataset;
  • False indicates a constraint that was not satisfied in the dataset;
  • NaN (null) indicates a constraint that was not present for that field.

As you would expect, the field, failures and passes columns are, respectively, the name of the field, the number of failures and the number of passes for that field.
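
One convenient consequence is that failing constraints are easy to isolate. For example (using the verification object v from above):

vdf = v.to_frame()
print(vdf[vdf.failures > 0])   # show only fields with at least one failure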

If we now change the DataFrame definition to:

df = pd.DataFrame({'a': [0, 1, 2, 10, pd.np.NaN],
                   'b': ['one', 'one', 'two', 'three', pd.np.NaN]})

(as is the case in tdda/constraints/examples/simple_verify_fail.py), we now expect some constraint failures. If we run this, we see:

$ python simple_verify_fail.py
Passes: 5
Failures: 7



FIELDS:

a: 4 failures  2 passes  type   min   max   sign   max_nulls   no_duplicates ✓

b: 3 failures  3 passes  type   min_length   max_length   max_nulls   no_duplicates   allowed_values ✗

SUMMARY:

Passes: 5
Failures: 7



  field  failures  passes  type    min min_length    max max_length   sign  \
0     a         4       2  True  False        NaN  False        NaN  False
1     b         3       3  True    NaN       True    NaN      False    NaN

  max_nulls no_duplicates allowed_values
0     False          True            NaN
1      True         False          False

Final Notes

There are more options and there's more to say about the Pandas implementation, but that's probably enough for one post. We'll have follow-ups on the file format, more options, and the foibles of Pandas.

If you want to hear more, follow us on twitter at @tdda0.


WritableTestCase: Example Use

Posted on Sun 18 September 2016 in TDDA • Tagged with tdda

In my PyCon UK talk yesterday I promised to update and document the copy of writabletestcase.WritableTestCase on GitHub.

The version I've put up is not quite as powerful as the example I showed in the talk—that will follow—but has the basic functionality.

I've now added examples to the repository and, below, show how these work.

The library is available with

git clone https://github.com/tdda/tdda.git

WritableTestCase extends unittest.TestCase, from the Python's standard library, in three main ways:

  • It provides methods for testing strings produced in memory or files written to disk against reference results in files. When a test fails, rather than just showing a hard-to-read difference, it writes the actual result to file (if necessary) and then shows the diff command needed to compare it—something like this:

    Compare with "diff /path/to/actual-output /path/to/expected-output"
    

    Obviously, the diff command can be replaced with a graphical diff tool, an open command or whatever.

    Although this shouldn't be necessary (see below), you also have the option, after verification, of replacing diff with cp to copy the actual output as the new reference output.

  • Secondly, the code supports excluding lines of the output that contain nominated strings. This is often handy for excluding things like date stamps, version numbers, copyright notices etc. These often appear in output, and vary, without affecting the semantics.

    (The version of the library I showed at PyCon had more powerful variants of this, which I'll add later.)

  • Thirdly, if you verify that the new output is correct, the library supports re-running with the -w flag to overwrite the expected ("reference") results with the ones generated by the code.

    Obviously, if this feature is abused, the value of the tests will be lost, but provided you check the output carefully before re-writing, this is a significant convenience.

The example code is in the examples subdirectory, called test_using_writabletestcase.py. It has two test functions, one of which generates HTML output as a string, and the other of which produces some slightly different HTML output as a file. In each case, the output produced by the function is not identical to the expected "reference" output (in examples/reference), but differs only on lines containing "Copyright" and "Version". Since these are passed into the test as exclusions, the tests should pass.

Here is the example code:

# -*- coding: utf-8 -*-
"""
test_using_writabletestcase.py: A simple example of how to use
tdda.writabletestcase.WritableTestCase.

Source repository: https://github.com/tdda/tdda

License: MIT

Copyright (c) Stochastic Solutions Limited 2016
"""
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals

import os
import tempfile

from tdda import writabletestcase
from tdda.examples.generators import generate_string, generate_file


class TestExample(writabletestcase.WritableTestCase):
    def testExampleStringGeneration(self):
        """
        This test uses generate_string() from tdda.examples.generators
        to generate some HTML as a string.

        It is similar to the reference HTML in
        tdda/examples/reference/string_result.html except that the
        Copyright and version lines are slightly different.

        As shipped, the test should pass, because the ignore_patterns
        tell it to ignore those lines.

        Make a change to the generation code in the generate_string
        function in generators.py to change the HTML output.

        The test should then fail and suggest a diff command to run
        to see the difference.

        Rerun with

            python test_using_writabletestcase.py -w

        and it should re-write the reference output to match your
        modified results.
        """
        actual = generate_string()
        this_dir = os.path.abspath(os.path.dirname(__file__))
        expected_file = os.path.join(this_dir, 'reference',
                                     'string_result.html')
        self.check_string_against_file(actual, expected_file,
                                       ignore_patterns=['Copyright',
                                                        'Version'])


    def testExampleFileGeneration(self):
        """
        This test uses generate_file() from tdda.examples.generators
        to generate some HTML as a file.

        It is similar to the reference HTML in
        tdda/examples/reference/file_result.html except that the
        Copyright and version lines are slightly different.

        As shipped, the test should pass, because the ignore_patterns
        tell it to ignore those lines.

        Make a change to the generation code in the generate_file function
        in generators.py to change the HTML output.

        The test should then fail and suggest a diff command to run
        to see the difference.

        Rerun with

            python test_using_writabletestcase.py -w

        and it should re-write the reference output to match your
        modified results.
        """
        outdir = tempfile.gettempdir()
        outpath = os.path.join(outdir, 'file_result.html')
        generate_file(outpath)
        this_dir = os.path.abspath(os.path.dirname(__file__))
        expected_file = os.path.join(this_dir, 'reference',
                                     'file_result.html')
        self.check_file(outpath, expected_file,
                        ignore_patterns=['Copyright', 'Version'])


if __name__ == '__main__':
    writabletestcase.main(argv=writabletestcase.set_write_from_argv())

If you download it, and try running it, you should see output similar to the following:

$ python test_using_writabletestcase.py
..
----------------------------------------------------------------------
Ran 2 tests in 0.004s

OK

The reference output files it compares against are:

  • examples/reference/string_result.html
  • examples/reference/file_result.html

To see what happens when there's a difference, try editing one or both of the main functions that generate the HTML in generators.py. They're mostly just using explicit strings, so the simplest thing is just to change a word or something in the output.

If I change "It's" to "It is" in the generate_string() function and rerun, I get this output:

$ python test_using_writabletestcase.py
.
File check failed.
Compare with "diff /var/folders/w7/lhtph66x7h33t9pns0616qk00000gn/T/string_result.html /Users/njr/python/tdda/examples/reference/string_result.html".

Note exclusions:
Copyright
Version
F
======================================================================
FAIL: testExampleStringGeneration (__main__.TestExample)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_using_writabletestcase.py", line 55, in testExampleStringGeneration
    'Version'])
  File "/Users/njr/python/tdda/writabletestcase.py", line 294, in check_string_against_file
    self.assertEqual(failures, 0)
AssertionError: 1 != 0

----------------------------------------------------------------------
Ran 2 tests in 0.005s

FAILED (failures=1)
1 godel:$

If I then run the diff command it suggests, the output is:

$ diff /var/folders/w7/lhtph66x7h33t9pns0616qk00000gn/T/string_result.html /Users/njr/python/tdda/examples/reference/string_result.html
5,6c5,6
<     Copyright (c) Stochastic Solutions, 2016
<     Version 1.0.0
---
>     Copyright (c) Stochastic Solutions Limited, 2016
>     Version 0.0.0
29c29
<     It is not terribly exciting.
---
>     It's not terribly exciting.

Here you can see the differences that are excluded, and the change I actually made.

(The version I showed at PyCon has an option to see only the non-excluded differences, but this version doesn't; that will come!)

If I now run again using -w, to re-write the reference output, it shows:

$ python test_using_writabletestcase.py -w
.Expected file /Users/njr/python/tdda/examples/reference/string_result.html written.
.
----------------------------------------------------------------------
Ran 2 tests in 0.003s

OK

And, of course, if I run a third time, without -w, the test now passes:

$ python test_using_writabletestcase.py
..
----------------------------------------------------------------------
Ran 2 tests in 0.003s

OK

So that's a quick overview of how it works.


Slides and Rough Transcript of TDDA talk from PyCon UK 2016

Posted on Sat 17 September 2016 in TDDA • Tagged with tdda

Python UK 2016, Cardiff.

I gave a talk on test-driven data analysis at PyCon UK 2016, Cardiff, today.

The slides (which are kind-of useless without the words) are available here.

More usefully, a rough transcript, with thumbnail slides, is available here.


Extracting More Apple Health Data

Posted on Wed 20 April 2016 in TDDA • Tagged with xml, apple, health

The first version of the Python code for extracting data from the XML export from the Apple Health app on iOS neglected to extract Activity Summaries and Workout data. We will now fix that.

As usual, I'll remind you how to get the code, if you want, then discuss the changes to the code, the reference test and the unit tests. Then in the next post, we'll actually start looking at the data.

The Updated Code

As before, you can get the code from Github with

$ git clone https://github.com/tdda/applehealthdata.git

or if you have pulled it before, with

$ git pull --tags

This version of the code is tagged with v1.3, so if it has been updated by the time you read this, get that version with

$ git checkout v1.3

I'm not going to list all the code here, but will pull out a few key changes as we discuss them.

Changes

Change 1: Change FIELDS to handle three different field structures.

The first version of the extraction code wrote only Records, which contain the granular activity data (which is the vast bulk of it, by volume).

Now I want to extend the code to handle the other two main kinds of data it writes—ActivitySummary and Workout elements in the XML.

The three different element types contain different XML attributes, which correspond to different fields in the CSV, and although they overlap, I think the best approach is to have separate record structures for each, and then to create a dictionary mapping the element kind to its field information.

Accordingly, the code that sets FIELDS changes to become:

RECORD_FIELDS = OrderedDict((
    ('sourceName', 's'),
    ('sourceVersion', 's'),
    ('device', 's'),
    ('type', 's'),
    ('unit', 's'),
    ('creationDate', 'd'),
    ('startDate', 'd'),
    ('endDate', 'd'),
    ('value', 'n'),
))

ACTIVITY_SUMMARY_FIELDS = OrderedDict((
    ('dateComponents', 'd'),
    ('activeEnergyBurned', 'n'),
    ('activeEnergyBurnedGoal', 'n'),
    ('activeEnergyBurnedUnit', 's'),
    ('appleExerciseTime', 's'),
    ('appleExerciseTimeGoal', 's'),
    ('appleStandHours', 'n'),
    ('appleStandHoursGoal', 'n'),
))

WORKOUT_FIELDS = OrderedDict((
    ('sourceName', 's'),
    ('sourceVersion', 's'),
    ('device', 's'),
    ('creationDate', 'd'),
    ('startDate', 'd'),
    ('endDate', 'd'),
    ('workoutActivityType', 's'),
    ('duration', 'n'),
    ('durationUnit', 's'),
    ('totalDistance', 'n'),
    ('totalDistanceUnit', 's'),
    ('totalEnergyBurned', 'n'),
    ('totalEnergyBurnedUnit', 's'),
))

FIELDS = {
    'Record': RECORD_FIELDS,
    'ActivitySummary': ACTIVITY_SUMMARY_FIELDS,
    'Workout': WORKOUT_FIELDS,
}

and we have to change references (in both the main code and the test code) to refer to RECORD_FIELDS where previously there were references to FIELDS.

Change 2: Add a Workout to the test data

There was a single workout in the data I exported from the phone (a token one I performed primarily to generate a record of this type). I simply used grep to extract that line from export.xml and poked it into the test data testdata/export6s3sample.xml.

Change 3: Update the tag and field counters

The code for counting record types previously considered only nodes of type Record. Now we also want to handle Workout and ActivitySummary elements. Workouts do come in different types (they have a workoutActivityType field), so it may be that we will want to write out different workout types into different CSV files, but since I have so far seen only a single workout, I don't really want to do this. So instead, we'll write all Workout elements to a corresponding Workout.csv file, and all ActivitySummary elements to an ActivitySummary.csv file.

Accordingly, the count_record_types method now uses an extra Counter attribute, other_types, to count the number of each of these elements, keyed on their tag (i.e. Workout or ActivitySummary).
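
In outline, the counting works something like this (a simplified, self-contained sketch of the idea, not the exact code from the repository):

from collections import Counter
import xml.etree.ElementTree as ET

xml = ('<HealthData>'
       '<Record type="HKQuantityTypeIdentifierStepCount"/>'
       '<Record type="HKQuantityTypeIdentifierStepCount"/>'
       '<Workout workoutActivityType="HKWorkoutActivityTypeWalking"/>'
       '<ActivitySummary dateComponents="2016-04-01"/>'
       '</HealthData>')

record_types = Counter()   # granular Record types, keyed on their type attribute
other_types = Counter()    # Workout and ActivitySummary elements, keyed on tag

for node in ET.fromstring(xml):
    if node.tag == 'Record':
        record_types[node.attrib['type']] += 1
    elif node.tag in ('Workout', 'ActivitySummary'):
        other_types[node.tag] += 1

print(record_types)
print(other_types)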

Change 4: Update the test results to reflect the new behaviour

Two of the unit tests introduced last time need to be updated to reflect Change 3. First, the field counts change, and secondly we need reference values for the other_types counts. Hence the new section in test_extracted_reference_stats:

    expectedOtherCounts = [
       ('ActivitySummary', 2),
       ('Workout', 1),
    ]
    self.assertEqual(sorted(data.other_types.items()),
                     expectedOtherCounts)

Change 5: Open (and close) files for Workouts and ActivitySummaries

We need to open new files for Workout.csv and ActivitySummary.csv if we have any such records. This is handled in the open_for_writing method.
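
In outline, this amounts to opening one CSV per kind and writing the appropriate header row. Here is a sketch of the idea (the real code is a method on HealthDataExtractor, and this function signature is hypothetical):

import os

def open_for_writing(directory, kinds_used):
    # kinds_used might be e.g. {'StepCount', 'Workout', 'ActivitySummary'}.
    # FIELDS (from Change 1) maps Workout and ActivitySummary to their field
    # definitions; Record subtypes such as StepCount share RECORD_FIELDS.
    handles = {}
    for kind in kinds_used:
        fields = FIELDS.get(kind, RECORD_FIELDS)
        path = os.path.join(directory, '%s.csv' % kind)
        handles[kind] = open(path, 'w')
        handles[kind].write(','.join(fields.keys()) + '\n')  # header row
    return handles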

Change 6: Write records for Workouts and ActivitySummaries

There are minor changes to the write_records method to allow it to handle writing Workout and ActivitySummary records. The only real difference is that the different CSV files have different fields, so we need to look up the right values, in the order specified by the header for each kind. The new code does that:

def write_records(self):
    kinds = FIELDS.keys()
    for node in self.nodes:
        if node.tag in kinds:
            attributes = node.attrib
            kind = attributes['type'] if node.tag == 'Record' else node.tag
            values = [format_value(attributes.get(field), datatype)
                      for (field, datatype) in FIELDS[node.tag].items()]
            line = encode(','.join(values) + '\n')
            self.handles[kind].write(line)

Change 7: Update the reference test

Finally, the reference test itself now generates two more files, so I've added reference copies of those to the testdata subdirectory and changed the test to loop over all four files:

def test_tiny_reference_extraction(self):
    path = copy_test_data()
    data = HealthDataExtractor(path, verbose=VERBOSE)
    data.extract()
    for kind in ('StepCount', 'DistanceWalkingRunning',
                 'Workout', 'ActivitySummary'):
        self.check_file('%s.csv' % kind)

Mission Accomplished

We've now extracted essentially all the data from the export.xml file from the Apple Health app, and created various tests for that extraction process. We'll start to look at the data in future posts. There is one more component in my extract—another XML file called export_cda.xml. This contains a ClinicalDocument, apparently conforming to a standard from (or possibly administered by) Health Level Seven International. It contains heart-rate data from my Apple Watch. I probably will extract it and publish the code for doing so, but later.