Constraint Discovery and Verification for Pandas DataFrames

Posted on Thu 03 November 2016 in TDDA • Tagged with tdda, constraints, pandas

Background

In a previous post, Constraints and Assertions, we introduced the idea of using constraints to verify input, output and intermediate datasets for an analytical process. We also demonstrated that candidate constraints can be automatically generated from example datasets. We prototyped this in our own software (Miró) expressing constraints as lisp S-expressions.

Improving and Extending the Approach: Open-Source Pandas Code

We have now taken the core ideas, polished them a little and made them available through an open-source library, currently on GitHub. We will push it to PyPI when it has solidified a little further.

The constraint code I'm referring to is available in the constraints module of the tdda repository for the tdda user on GitHub. So if you issue the command

git clone https://github.com/tdda/tdda.git

in a directory somewhere on your PYTHONPATH, this should enable you to use it.

The TDDA library:

  • is under an MIT licence;
  • runs under Python2 and Python3;
  • other than Pandas itself, has no dependencies outside the standard library unless you want to use feather files (see below);
  • includes a base layer to help with building constraint verification and discovery libraries for various systems;
  • includes Pandas implementations of constraint discovery and verification through a (Python) API;
  • uses a new JSON format (normally stored in .tdda files) for saving constraints;
  • also includes a prototype command-line tool for verifying a dataframe stored in feather format against a .tdda file of constraints. Feather is a file format developed by Wes McKinney (the original creator of Pandas) and Hadley Wickham (of ggplot and tidyverse fame) for dataframes that allows interchange between R and Pandas while preserving type information and exact values. It is based on the Apache Arrow project. It can be used directly or using our ugly-but-useful extension library pmmif, which allows extra metadata (including extended type information) to be saved alongside a .feather file, in a companion .pmm file.

Testing

All the constraint-handling code is in the constraints module within the TDDA repository.

After you've cloned the repository, it's probably a good idea to run the tests. There are two sets, and both should run under Python2 or Python3.

$ cd tdda/constraints
$ python testbase.py
.....
----------------------------------------------------------------------
Ran 5 tests in 0.003s

OK

$ python testpdconstraints.py
.....................
----------------------------------------------------------------------
Ran 21 tests in 0.123s

OK

There is example code (which we'll walk through below) in the examples subdirectory of constraints.

Basic Use

Constraint Discovery

Here is some minimal code for getting the software to discover constraints satisfied by a Pandas DataFrame:

import pandas as pd
from tdda.constraints.pdconstraints import discover_constraints

df = pd.DataFrame({'a': [1, 2, 9], 'b': ['one', 'two', pd.np.NaN]})
constraints = discover_constraints(df)
with open('/tmp/example_constraints.tdda', 'w') as f:
    f.write(constraints.to_json())

(This is the core of the example code in tdda/constraints/examples/simple_discovery.py, and is included in the docstring for the discover_constraints function.)

Hopefully the code is fairly self-explanatory, but walking through the lines after the imports:

  • We first generate a trivial 3-row DataFrame with an integer column a and a string column b. The string column includes a Pandas null (NaN).
  • We then pass that DataFrame to the discover_constraints function from tdda.constraints.pdconstraints, and it returns a DatasetConstraints object, which is defined in tdda.constraints.base.
  • The resulting constraints object has a to_json() method which converts the structured constraints into a JSON string.
  • In the example, we write that to /tmp/example_constraints.tdda; we encourage everyone to use the .tdda extension for these JSON constraint files.

This is what happens if we run the example file:

$ cd tdda/constraints/examples

$ python simple_discovery.py
Written /tmp/example_constraints.tdda successfully.

$ cat /tmp/example_constraints.tdda
{
    "fields": {
        "a": {
            "type": "int",
            "min": 1,
            "max": 9,
            "sign": "positive",
            "max_nulls": 0,
            "no_duplicates": true
        },
        "b": {
            "type": "string",
            "min_length": 3,
            "max_length": 3,
            "max_nulls": 1,
            "no_duplicates": true,
            "allowed_values": [
                "one",
                "two"
            ]
        }
    }
}

As you can see, in this case, the system has 'discovered' six constraints for each field, and it's rather easy to read what they are and at least roughly what they mean. We'll do a separate post describing the .tdda JSON file format, but it's documented in tdda/constraints/tdda_json_file_format.md in the repository (which, through the almost unfathomable power of GitHub, means you can see it formatted here).
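
Because a .tdda file is plain JSON, the standard library is enough to inspect it. The following is a minimal sketch and not part of the tdda library: it embeds the JSON shown above as a string (rather than reading /tmp/example_constraints.tdda) so that it runs anywhere, and the helper name constraint_kinds is made up for illustration.

```python
import json

# The constraints shown above, embedded as a string so the sketch is
# self-contained; in practice you would read the .tdda file instead.
TDDA_TEXT = '''
{
    "fields": {
        "a": {"type": "int", "min": 1, "max": 9, "sign": "positive",
              "max_nulls": 0, "no_duplicates": true},
        "b": {"type": "string", "min_length": 3, "max_length": 3,
              "max_nulls": 1, "no_duplicates": true,
              "allowed_values": ["one", "two"]}
    }
}
'''


def constraint_kinds(tdda_text):
    """Map each field name to the sorted list of its constraint kinds.

    Hypothetical helper for illustration; not a tdda function.
    """
    fields = json.loads(tdda_text)['fields']
    return {name: sorted(constraints)
            for name, constraints in fields.items()}


kinds = constraint_kinds(TDDA_TEXT)
for name in sorted(kinds):
    print('%s: %d constraints: %s' % (name, len(kinds[name]),
                                      ', '.join(kinds[name])))
```

Each field carries six constraint kinds, which is the count quoted above.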

Constraint Verification

Now that we have a .tdda file, we can use it to verify a DataFrame.

First, let's look at code that should lead to a successful verification (this code is in tdda/constraints/examples/simple_verify_pass.py).

import pandas as pd
from tdda.constraints.pdconstraints import verify_df

df = pd.DataFrame({'a': [2, 4], 'b': ['one', pd.np.NaN]})
v = verify_df(df, 'example_constraints.tdda')

print('Passes: %d' % v.passes)
print('Failures: %d\n\n\n' % v.failures)
print(str(v))
print('\n\n')
print(v.to_frame())

Again, hopefully the code is fairly self-explanatory, but:

  • df is a DataFrame that is different from the one we used to generate the constraints, but is nevertheless consistent with the constraints.
  • The verify_df function from tdda.constraints.pdconstraints takes a DataFrame and the location of a .tdda file and verifies the DataFrame against the constraints in the file. The function returns a PandasVerification object. The PandasVerification class is a subclass of Verification from tdda.constraints.base, adding the ability to turn the verification object into a DataFrame.
  • All verification objects include passes and failures attributes, which respectively indicate the number of constraints that passed and the number that failed. So the simplest complete verification is simply to check that v.failures == 0.
  • Verification objects also include a __str__ method. Its output currently includes a section for fields and a summary.
  • The .to_frame() method converts a PandasVerification into a Pandas DataFrame.

If we run this, the result is as follows:

$ python simple_verify_pass.py
Passes: 12
Failures: 0



FIELDS:

a: 0 failures  6 passes  type ✓  min ✓  max ✓  sign ✓  max_nulls ✓  no_duplicates ✓

b: 0 failures  6 passes  type ✓  min_length ✓  max_length ✓  max_nulls ✓  no_duplicates ✓  allowed_values ✓

SUMMARY:

Passes: 12
Failures: 0



  field  failures  passes  type   min min_length   max max_length  sign  \
0     a         0       6  True  True        NaN  True        NaN  True
1     b         0       6  True   NaN       True   NaN       True   NaN

  max_nulls no_duplicates allowed_values
0      True          True            NaN
1      True          True           True

In the DataFrame produced from the Verification

  • True indicates a constraint that was satisfied in the dataset;
  • False indicates a constraint that was not satisfied in the dataset;
  • NaN (null) indicates a constraint that was not present for that field.

As you would expect, the field, failures and passes columns are, respectively, the name of the field, the number of failures and the number of passes for that field.

If we now change the DataFrame definition to:

df = pd.DataFrame({'a': [0, 1, 2, 10, pd.np.NaN],
                   'b': ['one', 'one', 'two', 'three', pd.np.NaN]})

(as is the case in tdda/constraints/examples/simple_verify_fail.py), we now expect some constraint failures. If we run this, we see:

$ python simple_verify_fail.py
Passes: 5
Failures: 7



FIELDS:

a: 4 failures  2 passes  type ✓  min ✗  max ✗  sign ✗  max_nulls ✗  no_duplicates ✓

b: 3 failures  3 passes  type ✓  min_length ✓  max_length ✗  max_nulls ✓  no_duplicates ✗  allowed_values ✗

SUMMARY:

Passes: 5
Failures: 7



  field  failures  passes  type    min min_length    max max_length   sign  \
0     a         4       2  True  False        NaN  False        NaN  False
1     b         3       3  True    NaN       True    NaN      False    NaN

  max_nulls no_duplicates allowed_values
0     False          True            NaN
1      True         False          False
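
The failures reported for field a can be reproduced by hand. The sketch below is plain Python rather than tdda's own implementation: the function check_field and its handling of each constraint are assumptions made for illustration (for instance, it treats None as a null and checks the type only on non-null values), but it arrives at the same pattern of passes and failures as the output above.

```python
def check_field(values, constraints):
    """Return {constraint_name: bool} for one field.

    Illustrative only: not tdda's implementation. None stands in for a
    null, and only the constraint kinds seen for field 'a' are handled.
    """
    non_null = [v for v in values if v is not None]
    n_nulls = len(values) - len(non_null)
    checks = {
        'type': lambda c: c == 'int' and all(isinstance(v, int)
                                             for v in non_null),
        'min': lambda c: all(v >= c for v in non_null),
        'max': lambda c: all(v <= c for v in non_null),
        'sign': lambda c: c == 'positive' and all(v > 0 for v in non_null),
        'max_nulls': lambda c: n_nulls <= c,
        'no_duplicates': lambda c: len(set(non_null)) == len(non_null),
    }
    return {name: checks[name](value)
            for name, value in constraints.items()}


# The constraints discovered for field 'a' in the .tdda file above.
constraints_a = {'type': 'int', 'min': 1, 'max': 9, 'sign': 'positive',
                 'max_nulls': 0, 'no_duplicates': True}

results = check_field([0, 1, 2, 10, None], constraints_a)
passes = sum(results.values())
failures = len(results) - passes
print('a: %d failures  %d passes' % (failures, passes))
# prints: a: 4 failures  2 passes
```

The value 0 breaks both min and sign, 10 breaks max, and the single null breaks max_nulls, which leaves only type and no_duplicates satisfied.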

Final Notes

There are more options and there's more to say about the Pandas implementation, but that's probably enough for one post. We'll have follow-ups on the file format, more options, and the foibles of Pandas.

If you want to hear more, follow us on twitter at @tdda0.


WritableTestCase: Example Use

Posted on Sun 18 September 2016 in TDDA • Tagged with tdda

In my PyCon UK talk yesterday I promised to update and document the copy of writabletestcase.WritableTestCase on GitHub.

The version I've put up is not quite as powerful as the example I showed in the talk—that will follow—but has the basic functionality.

I've now added examples to the repository and, below, show how these work.

The library is available with

git clone https://github.com/tdda/tdda.git

WritableTestCase extends unittest.TestCase, from Python's standard library, in three main ways:

  • It provides methods for testing strings produced in memory or files written to disk against reference results in files. When a test fails, rather than just showing a hard-to-read difference, it writes the actual result to file (if necessary) and then shows the diff command needed to compare it—something like this:

    Compare with "diff /path/to/actual-output /path/to/expected-output"
    

    Obviously, the diff command can be replaced with a graphical diff tool, an open command or whatever.

    Although this shouldn't be necessary (see below), you also have the option, after verification, of replacing diff with cp to copy the actual output as the new reference output.

  • Secondly, the code supports excluding lines of the output that contain nominated strings. This is often handy for excluding things like date stamps, version numbers, copyright notices etc. These often appear in output, and vary, without affecting the semantics.

    (The version of the library I showed at PyCon had more powerful variants of this, which I'll add later.)

  • Thirdly, if you verify that the new output is correct, the library supports re-running with the -w flag to overwrite the expected ("reference") results with the ones generated by the code.

    Obviously, if this feature is abused, the value of the tests will be lost, but provided you check the output carefully before re-writing, this is a significant convenience.
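
The exclusion idea in the second point can be sketched in a few lines. This is not the library's code: the function names strip_ignored and same_ignoring are hypothetical, and the sketch assumes that a line is ignored whenever it contains any of the nominated strings, in either the actual or the expected text.

```python
def strip_ignored(text, ignore_patterns):
    """Return the lines of text that contain none of the ignore_patterns."""
    return [line for line in text.splitlines()
            if not any(pattern in line for pattern in ignore_patterns)]


def same_ignoring(actual, expected, ignore_patterns=()):
    """True if actual and expected match once ignored lines are dropped.

    Hypothetical helper: a sketch of the idea, not WritableTestCase itself.
    """
    return (strip_ignored(actual, ignore_patterns)
            == strip_ignored(expected, ignore_patterns))


expected = ('Copyright (c) Stochastic Solutions Limited, 2016\n'
            'Version 0.0.0\n'
            'Hello\n')
actual = ('Copyright (c) Stochastic Solutions, 2016\n'
          'Version 1.0.0\n'
          'Hello\n')

print(same_ignoring(actual, expected))                            # False
print(same_ignoring(actual, expected, ['Copyright', 'Version']))  # True
```

Dropping the excluded lines before comparing is the simplest approach; a stricter variant would also insist that excluded lines appear at the same positions in both texts.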

The example code is in the examples subdirectory, called test_using_writabletestcase.py. It has two test functions, one of which generates HTML output as a string, and the other of which produces some slightly different HTML output as a file. In each case, the output produced by the function is not identical to the expected "reference" output (in examples/reference), but differs only on lines containing "Copyright" and "Version". Since these are passed into the test as exclusions, the tests should pass.

Here is the example code:

# -*- coding: utf-8 -*-
"""
test_using_writabletestcase.py: A simple example of how to use
tdda.writabletestcase.WritableTestCase.

Source repository: https://github.com/tdda/tdda

License: MIT

Copyright (c) Stochastic Solutions Limited 2016
"""
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals

import os
import tempfile

from tdda import writabletestcase
from tdda.examples.generators import generate_string, generate_file


class TestExample(writabletestcase.WritableTestCase):
    def testExampleStringGeneration(self):
        """
        This test uses generate_string() from tdda.examples.generators
        to generate some HTML as a string.

        It is similar to the reference HTML in
        tdda/examples/reference/string_result.html except that the
        Copyright and version lines are slightly different.

        As shipped, the test should pass, because the ignore_patterns
        tell it to ignore those lines.

        Make a change to the generation code in the generate_string
        function in generators.py to change the HTML output.

        The test should then fail and suggest a diff command to run
        to see the difference.

        Rerun with

            python test_using_writabletestcase.py -w

        and it should re-write the reference output to match your
        modified results.
        """
        actual = generate_string()
        this_dir = os.path.abspath(os.path.dirname(__file__))
        expected_file = os.path.join(this_dir, 'reference',
                                     'string_result.html')
        self.check_string_against_file(actual, expected_file,
                                       ignore_patterns=['Copyright',
                                                        'Version'])


    def testExampleFileGeneration(self):
        """
        This test uses generate_file() from tdda.examples.generators
        to generate some HTML as a file.

        It is similar to the reference HTML in
        tdda/examples/reference/file_result.html except that the
        Copyright and version lines are slightly different.

        As shipped, the test should pass, because the ignore_patterns
        tell it to ignore those lines.

        Make a change to the generation code in the generate_file function
        in generators.py to change the HTML output.

        The test should then fail and suggest a diff command to run
        to see the difference.

        Rerun with

            python test_using_writabletestcase.py -w

        and it should re-write the reference output to match your
        modified results.
        """
        outdir = tempfile.gettempdir()
        outpath = os.path.join(outdir, 'file_result.html')
        generate_file(outpath)
        this_dir = os.path.abspath(os.path.dirname(__file__))
        expected_file = os.path.join(this_dir, 'reference',
                                     'file_result.html')
        self.check_file(outpath, expected_file,
                        ignore_patterns=['Copyright', 'Version'])


if __name__ == '__main__':
    writabletestcase.main(argv=writabletestcase.set_write_from_argv())

If you download it, and try running it, you should see output similar to the following:

$ python test_using_writabletestcase.py
..
----------------------------------------------------------------------
Ran 2 tests in 0.004s

OK

The reference output files it compares against are:

  • examples/reference/string_result.html
  • examples/reference/file_result.html

To see what happens when there's a difference, try editing one or both of the main functions that generate the HTML in generators.py. They're mostly just using explicit strings, so the simplest thing is just to change a word or something in the output.

If I change It's to It is in the generate_string() function and rerun, I get this output:

$ python test_using_writabletestcase.py
.
File check failed.
Compare with "diff /var/folders/w7/lhtph66x7h33t9pns0616qk00000gn/T/string_result.html /Users/njr/python/tdda/examples/reference/string_result.html".

Note exclusions:
Copyright
Version
F
======================================================================
FAIL: testExampleStringGeneration (__main__.TestExample)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_using_writabletestcase.py", line 55, in testExampleStringGeneration
    'Version'])
  File "/Users/njr/python/tdda/writabletestcase.py", line 294, in check_string_against_file
    self.assertEqual(failures, 0)
AssertionError: 1 != 0

----------------------------------------------------------------------
Ran 2 tests in 0.005s

FAILED (failures=1)

If I then run the diff command it suggests, the output is:

$ diff /var/folders/w7/lhtph66x7h33t9pns0616qk00000gn/T/string_result.html /Users/njr/python/tdda/examples/reference/string_result.html
5,6c5,6
<     Copyright (c) Stochastic Solutions, 2016
<     Version 1.0.0
---
>     Copyright (c) Stochastic Solutions Limited, 2016
>     Version 0.0.0
29c29
<     It is not terribly exciting.
---
>     It's not terribly exciting.

Here you can see the differences that are excluded, and the change I actually made.

(The version I showed at PyCon has an option to see only the non-excluded differences, but this version doesn't; that will come!)

If I now run again using -w, to re-write the reference output, it shows:

$ python test_using_writabletestcase.py -w
.Expected file /Users/njr/python/tdda/examples/reference/string_result.html written.
.
----------------------------------------------------------------------
Ran 2 tests in 0.003s

OK

And, of course, if I run a third time, without -w, the test now passes:

$ python test_using_writabletestcase.py
..
----------------------------------------------------------------------
Ran 2 tests in 0.003s

OK

So that's a quick overview of how it works.


Slides and Rough Transcript of TDDA talk from PyCon UK 2016

Posted on Sat 17 September 2016 in TDDA • Tagged with tdda

PyCon UK 2016, Cardiff.

I gave a talk on test-driven data analysis at PyCon UK 2016, Cardiff, today.

The slides (which are kind-of useless without the words) are available here.

More usefully, a rough transcript, with thumbnail slides, is available here.


Extracting More Apple Health Data

Posted on Wed 20 April 2016 in TDDA • Tagged with xml, apple, health

The first version of the Python code for extracting data from the XML export from the Apple Health app on iOS neglected to extract Activity Summaries and Workout data. We will now fix that.

As usual, I'll remind you how to get the code, if you want, then discuss the changes to the code, the reference test and the unit tests. Then in the next post, we'll actually start looking at the data.

The Updated Code

As before, you can get the code from Github with

$ git clone https://github.com/tdda/applehealthdata.git

or if you have pulled it before, with

$ git pull --tags

This version of the code is tagged with v1.3, so if it has been updated by the time you read this, get that version with

$ git checkout v1.3

I'm not going to list all the code here, but will pull out a few key changes as we discuss them.

Changes

Change 1: Change FIELDS to handle three different field structures.

The first version of the extraction code wrote only Records, which contain the granular activity data (which is the vast bulk of it, by volume).

Now I want to extend the code to handle the other two main kinds of data it writes—ActivitySummary and Workout elements in the XML.

The three different element types contain different XML attributes, which correspond to different fields in the CSV, and although they overlap, I think the best approach is to have separate record structures for each, and then to create a dictionary mapping the element kind to its field information.

Accordingly, the code that sets FIELDS changes to become:

RECORD_FIELDS = OrderedDict((
    ('sourceName', 's'),
    ('sourceVersion', 's'),
    ('device', 's'),
    ('type', 's'),
    ('unit', 's'),
    ('creationDate', 'd'),
    ('startDate', 'd'),
    ('endDate', 'd'),
    ('value', 'n'),
))

ACTIVITY_SUMMARY_FIELDS = OrderedDict((
    ('dateComponents', 'd'),
    ('activeEnergyBurned', 'n'),
    ('activeEnergyBurnedGoal', 'n'),
    ('activeEnergyBurnedUnit', 's'),
    ('appleExerciseTime', 's'),
    ('appleExerciseTimeGoal', 's'),
    ('appleStandHours', 'n'),
    ('appleStandHoursGoal', 'n'),
))

WORKOUT_FIELDS = OrderedDict((
    ('sourceName', 's'),
    ('sourceVersion', 's'),
    ('device', 's'),
    ('creationDate', 'd'),
    ('startDate', 'd'),
    ('endDate', 'd'),
    ('workoutActivityType', 's'),
    ('duration', 'n'),
    ('durationUnit', 's'),
    ('totalDistance', 'n'),
    ('totalDistanceUnit', 's'),
    ('totalEnergyBurned', 'n'),
    ('totalEnergyBurnedUnit', 's'),
))

FIELDS = {
    'Record': RECORD_FIELDS,
    'ActivitySummary': ACTIVITY_SUMMARY_FIELDS,
    'Workout': WORKOUT_FIELDS,
}

and we have to change references (in both the main code and the test code) to refer to RECORD_FIELDS where previously there were references to FIELDS.

Change 2: Add a Workout to the test data

There was a single workout in the data I exported from the phone (a token one I performed primarily to generate a record of this type). I simply used grep to extract that line from export.xml and poked it into the test data file, testdata/export6s3sample.xml.

Change 3: Update the tag and field counters

The code for counting record types previously considered only nodes of type Record. Now we also want to handle Workout and ActivitySummary elements. Workouts do come in different types (they have a workoutActivityType field), so it may be that we will want to write out different workout types into different CSV files, but since I have only, so far, seen a single workout, I don't really want to do this. So instead, we'll write all Workout elements to a corresponding Workout.csv file, and all ActivitySummary elements to an ActivitySummary.csv file.

Accordingly, the count_record_types method now uses an extra Counter attribute, other_types to count the number of each of these elements, keyed on their tag (i.e. Workout or ActivitySummary).
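
The counting scheme just described can be sketched as follows. This is not the repository's count_record_types method: the function name count_types, the inline XML and its attribute values are made up for illustration, and only the idea (Records keyed on their type attribute, the other elements of interest keyed on their tag) is taken from the description above.

```python
from collections import Counter
from xml.etree import ElementTree

# A tiny, invented stand-in for export.xml.
XML = '''<HealthData>
  <Record type="StepCount" value="10"/>
  <Record type="StepCount" value="20"/>
  <Record type="DistanceWalkingRunning" value="0.1"/>
  <Workout workoutActivityType="Walking"/>
  <ActivitySummary dateComponents="2016-04-01"/>
  <ActivitySummary dateComponents="2016-04-02"/>
</HealthData>'''


def count_types(root):
    """Count Records by type attribute and other known elements by tag."""
    record_types = Counter()
    other_types = Counter()
    for node in root:
        if node.tag == 'Record':
            record_types[node.attrib['type']] += 1
        elif node.tag in ('Workout', 'ActivitySummary'):
            other_types[node.tag] += 1
    return record_types, other_types


record_types, other_types = count_types(ElementTree.fromstring(XML))
print(sorted(record_types.items()))
# prints: [('DistanceWalkingRunning', 1), ('StepCount', 2)]
print(sorted(other_types.items()))
# prints: [('ActivitySummary', 2), ('Workout', 1)]
```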

Change 4: Update the test results to reflect the new behaviour

Two of the unit tests introduced last time need to be updated to reflect Change 3. First, the field counts change, and secondly we need reference values for the other_types counts. Hence the new section in test_extracted_reference_stats:

    expectedOtherCounts = [
       ('ActivitySummary', 2),
       ('Workout', 1),
    ]
    self.assertEqual(sorted(data.other_types.items()),
                     expectedOtherCounts)

Change 5: Open (and close) files for Workouts and ActivitySummaries

We need to open new files for Workout.csv and ActivitySummary.csv if we have any such records. This is handled in the open_for_writing method.

Change 6: Write records for Workouts and ActivitySummaries

There are minor changes to the write_records method to allow it to handle writing Workout and ActivitySummary records. The only real difference is that the different CSV files have different fields, so we need to look up the right values, in the order specified by the header for each kind. The new code does that:

def write_records(self):
    kinds = FIELDS.keys()
    for node in self.nodes:
        if node.tag in kinds:
            attributes = node.attrib
            kind = attributes['type'] if node.tag == 'Record' else node.tag
            values = [format_value(attributes.get(field), datatype)
                      for (field, datatype) in FIELDS[node.tag].items()]
            line = encode(','.join(values) + '\n')
            self.handles[kind].write(line)
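
write_records relies on format_value, which is not listed in this post. The sketch below is a reconstruction that is merely consistent with the unit tests from the previous post (nulls become empty strings, numbers and dates pass through unchanged, strings are quoted with backslashes and double quotes escaped). The real function in applehealthdata.py may differ, and raising ValueError for an unknown datatype code is purely an assumption.

```python
def format_value(value, datatype):
    """Format one attribute value for a CSV line.

    A sketch consistent with the unit tests, not necessarily the
    repository's code. datatype is 's' (string), 'n' (number) or
    'd' (date).
    """
    if value is None:
        return ''                      # nulls are written as empty fields
    if datatype == 's':
        # Escape backslashes first, then double quotes, then add quotes.
        escaped = value.replace('\\', '\\\\').replace('"', '\\"')
        return '"%s"' % escaped
    if datatype in ('n', 'd'):
        return value                   # numbers and dates pass through
    # Assumption: reject unknown type codes for non-null values.
    raise ValueError('Unexpected datatype: %r' % datatype)


values = [format_value(v, t) for (v, t) in [('iPhone', 's'), (None, 's'),
                                            ('2.5', 'n'), ('one "2"', 's')]]
print(','.join(values))
# prints: "iPhone",,2.5,"one \"2\""
```

Escaping backslashes before quotes matters: done the other way round, the backslashes added while escaping the quotes would themselves be doubled.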

Change 7: Update the reference test

Finally, the reference test itself now generates two more files, so I've added reference copies of those to the testdata subdirectory and changed the test to loop over all four files:

def test_tiny_reference_extraction(self):
    path = copy_test_data()
    data = HealthDataExtractor(path, verbose=VERBOSE)
    data.extract()
    for kind in ('StepCount', 'DistanceWalkingRunning',
                 'Workout', 'ActivitySummary'):
        self.check_file('%s.csv' % kind)

Mission Accomplished

We've now extracted essentially all the data from the export.xml file from the Apple Health app, and created various tests for that extraction process. We'll start to look at the data in future posts. There is one more component in my extract—another XML file called export_cda.xml. This contains a ClinicalDocument, apparently conforming to a standard from (or possibly administered by) Health Level Seven International. It contains heart-rate data from my Apple Watch. I probably will extract it and publish the code for doing so, but later.


Unit Tests

Posted on Tue 19 April 2016 in TDDA • Tagged with xml, apple, health

In the last post, we presented some code for implementing a "reference" test for the code for extracting CSV files from the XML dump that the Apple Health app on iOS can produce.

We will now expand that test with a few other, smaller and more conventional unit tests. Each unit test focuses on a smaller block of functionality.

The Test Code

As before, you can get the code from Github with

$ git clone https://github.com/tdda/applehealthdata.git

or if you have pulled it previously, with

$ git pull

This version of the code is tagged with v1.2, so if it has been updated by the time you read this, get that version with

$ git checkout v1.2

Here is the updated test code.

# -*- coding: utf-8 -*-
"""
testapplehealthdata.py: tests for the applehealthdata.py

Copyright (c) 2016 Nicholas J. Radcliffe
Licence: MIT
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals

import os
import re
import shutil
import sys
import unittest

from collections import Counter


from applehealthdata import (HealthDataExtractor,
                             format_freqs, format_value,
                             abbreviate, encode)

CLEAN_UP = True
VERBOSE = False


def get_base_dir():
    """
    Return the directory containing this test file,
    which will (normally) be the applehealthdata directory
    also containing the testdata dir.
    """
    return os.path.split(os.path.abspath(__file__))[0]


def get_testdata_dir():
    """Return the full path to the testdata directory"""
    return os.path.join(get_base_dir(), 'testdata')


def get_tmp_dir():
    """Return the full path to the tmp directory"""
    return os.path.join(get_base_dir(), 'tmp')


def remove_any_tmp_dir():
    """
    Remove the temporary directory if it exists.
    Returns its location either way.
    """
    tmp_dir = get_tmp_dir()
    if os.path.exists(tmp_dir):
        shutil.rmtree(tmp_dir)
    return tmp_dir


def make_tmp_dir():
    """
    Remove any existing tmp directory.
    Create empty tmp directory.
    Return the location of the tmp dir.
    """
    tmp_dir = remove_any_tmp_dir()
    os.mkdir(tmp_dir)
    return tmp_dir


def copy_test_data():
    """
    Copy the test data export6s3sample.xml from testdata directory
    to tmp directory.
    """
    tmp_dir = make_tmp_dir()
    name = 'export6s3sample.xml'
    in_xml_file = os.path.join(get_testdata_dir(), name)
    out_xml_file = os.path.join(get_tmp_dir(), name)
    shutil.copyfile(in_xml_file, out_xml_file)
    return out_xml_file


class TestAppleHealthDataExtractor(unittest.TestCase):
    @classmethod
    def tearDownClass(cls):
        """Clean up by removing the tmp directory, if it exists."""
        if CLEAN_UP:
            remove_any_tmp_dir()

    def check_file(self, filename):
        expected_output = os.path.join(get_testdata_dir(), filename)
        actual_output = os.path.join(get_tmp_dir(), filename)
        with open(expected_output) as f:
            expected = f.read()
        with open(actual_output) as f:
            actual = f.read()
        self.assertEqual(expected, actual)

    def test_tiny_reference_extraction(self):
        path = copy_test_data()
        data = HealthDataExtractor(path, verbose=VERBOSE)
        data.extract()
        self.check_file('StepCount.csv')
        self.check_file('DistanceWalkingRunning.csv')

    def test_format_freqs(self):
        counts = Counter()
        self.assertEqual(format_freqs(counts), '')
        counts['one'] += 1
        self.assertEqual(format_freqs(counts), 'one: 1')
        counts['one'] += 1
        self.assertEqual(format_freqs(counts), 'one: 2')
        counts['two'] += 1
        counts['three'] += 1
        self.assertEqual(format_freqs(counts),
                         '''one: 2
three: 1
two: 1''')

    def test_format_null_values(self):
        for dt in ('s', 'n', 'd', 'z'):
            # Note: even an illegal type, z, produces correct output for
            # null values.
            # Questionable, but we'll leave as a feature
            self.assertEqual(format_value(None, dt), '')

    def test_format_numeric_values(self):
        cases = {
            '0': '0',
            '3': '3',
            '-1': '-1',
            '2.5': '2.5',
        }
        for (k, v) in cases.items():
            self.assertEqual((k, format_value(k, 'n')), (k, v))

    def test_format_date_values(self):
        hearts = 'any string not need escaping or quoting; even this: ♥♥'
        cases = {
            '01/02/2000 12:34:56': '01/02/2000 12:34:56',
            hearts: hearts,
        }
        for (k, v) in cases.items():
            self.assertEqual((k, format_value(k, 'd')), (k, v))

    def test_format_string_values(self):
        cases = {
            'a': '"a"',
            '': '""',
            'one "2" three': r'"one \"2\" three"',
            r'1\2\3': r'"1\\2\\3"',
        }
        for (k, v) in cases.items():
            self.assertEqual((k, format_value(k, 's')), (k, v))

    def test_abbreviate(self):
        changed = {
            'HKQuantityTypeIdentifierHeight': 'Height',
            'HKQuantityTypeIdentifierStepCount': 'StepCount',
            'HK*TypeIdentifierStepCount': 'StepCount',
            'HKCharacteristicTypeIdentifierDateOfBirth': 'DateOfBirth',
            'HKCharacteristicTypeIdentifierBiologicalSex': 'BiologicalSex',
            'HKCharacteristicTypeIdentifierBloodType': 'BloodType',
            'HKCharacteristicTypeIdentifierFitzpatrickSkinType':
                                                    'FitzpatrickSkinType',
        }
        unchanged = [
            '',
            'a',
            'aHKQuantityTypeIdentifierHeight',
            'HKQuantityTypeIdentityHeight',
        ]
        for (k, v) in changed.items():
            self.assertEqual((k, abbreviate(k)), (k, v))
            self.assertEqual((k, abbreviate(k, False)), (k, k))
        for k in unchanged:
            self.assertEqual((k, abbreviate(k)), (k, k))

    def test_encode(self):
        # This test looks strange, but because of the import statements
        #     from __future__ import unicode_literals
        # in Python 2, type('a') is unicode, and the point of the encode
        # function is to ensure that it has been converted to a UTF-8 string
        # before writing to file.
        self.assertEqual(type(encode('a')), str)

    def test_extracted_reference_stats(self):
        path = copy_test_data()
        data = HealthDataExtractor(path, verbose=VERBOSE)

        self.assertEqual(data.n_nodes, 19)
        expectedRecordCounts = [
           ('DistanceWalkingRunning', 5),
           ('StepCount', 10),
        ]
        self.assertEqual(sorted(data.record_types.items()),
                         expectedRecordCounts)

        expectedTagCounts = [
           ('ActivitySummary', 2),
           ('ExportDate', 1),
           ('Me', 1),
           ('Record', 15),
        ]
        self.assertEqual(sorted(data.tags.items()),
                         expectedTagCounts)
        expectedFieldCounts = [
            ('HKCharacteristicTypeIdentifierBiologicalSex', 1),
            ('HKCharacteristicTypeIdentifierBloodType', 1),
            ('HKCharacteristicTypeIdentifierDateOfBirth', 1),
            ('HKCharacteristicTypeIdentifierFitzpatrickSkinType', 1),
            ('activeEnergyBurned', 2),
            ('activeEnergyBurnedGoal', 2),
            ('activeEnergyBurnedUnit', 2),
            ('appleExerciseTime', 2),
            ('appleExerciseTimeGoal', 2),
            ('appleStandHours', 2),
            ('appleStandHoursGoal', 2),
            ('creationDate', 15),
            ('dateComponents', 2),
            ('endDate', 15),
            ('sourceName', 15),
            ('startDate', 15),
            ('type', 15),
            ('unit', 15),
            ('value', 16),
        ]
        self.assertEqual(sorted(data.fields.items()),
                         expectedFieldCounts)


if __name__ == '__main__':
    unittest.main()

Notes

We're not going to discuss every part of the code, but will point out a few salient features.

  • I've added a coding line at the top of both the test script and the main applehealthdata.py script. This tells Python (and my editor, Emacs) the encoding of the file on disk (UTF-8). This is now relevant because one of the new tests (test_format_date_values) features a non-ASCII character in a string literal.

  • The previous test method test_tiny_fixed_extraction has been renamed test_tiny_reference_extraction, but is otherwise unchanged.

  • Several of the tests loop over dictionaries or lists of input-output pairs, with an assertion of some kind in the main body. Some people don't like this, and prefer one assertion per test. I don't really agree with that, but do think it's important to be able to see easily which assertion fails. An idiom I often use to assist this is to include the input on both sides of the test. For example, in test_abbreviate, when checking the abbreviation of items that should change, the code reads:

    for (k, v) in changed.items():
        self.assertEqual((k, abbreviate(k)), (k, v))
    

    rather than

    for (k, v) in changed.items():
        self.assertEqual(abbreviate(k), v)
    

    This makes it easy to tell which input fails, if one does, even in cases in which the main values being compared (abbreviate(k) and v, in this case) are long, complex or repeated across different inputs. It doesn't actually make much difference in these examples, but in general I find it helpful.

  • The test test_extracted_reference_stats checks that three counters used by the code work as expected. Some people would definitely advocate splitting this into three tests but, even though it's quick, to me it seems more natural to test these together. This also means we don't have to process the XML file three times. There are other ways of achieving the same end, and this approach has the potential disadvantage that the later cases won't be run if the first one fails.

    The other point to note here is that Counter objects are unordered, so I've listed the expected results sorted on their keys, and then applied Python's sorted function, which returns a new list containing the items of an iterable in sorted order, to the actual values. We could avoid the sort by constructing sets or dictionaries from the Counter objects and checking those instead, but the sort here is not expensive, and this approach is probably simpler.

  • I haven't bothered to write a separate test for the extraction phase (checking that it writes the right CSV files) because that seems to me to add almost nothing over the existing reference test (test_tiny_reference_extraction).
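
The idiom of including the input on both sides of an assertion is easy to try in isolation. The sketch below uses a deliberately wrong stand-in function (shorten, which is not the repository's abbreviate) so that the comparisons fail, and captures the failure message produced by each style of assertion. The names shorten, CASES and failure_message are all invented for this illustration.

```python
import unittest


def shorten(name):
    """Deliberately buggy stand-in: keeps the text after 'Identifier'."""
    return name.split('Identifier')[-1].lower()   # bug: lower-cases result


CASES = {
    'HKQuantityTypeIdentifierHeight': 'Height',
    'HKQuantityTypeIdentifierStepCount': 'StepCount',
}


def failure_message(with_input):
    """Return the first assertion failure message for CASES, or ''."""
    tc = unittest.TestCase()
    for (k, v) in sorted(CASES.items()):
        try:
            if with_input:
                # Input on both sides: the message names the failing key.
                tc.assertEqual((k, shorten(k)), (k, v))
            else:
                # Plain comparison: the message shows only the outputs.
                tc.assertEqual(shorten(k), v)
        except AssertionError as e:
            return str(e)
    return ''


print(failure_message(with_input=False))
print(failure_message(with_input=True))
```

Only the second message mentions HKQuantityTypeIdentifierHeight, so only that style tells you which input went wrong.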

Closing

That's it for this post. The unit tests are not terribly exciting, but they will prove useful as we extend the extraction code, which we'll start to do in the next post.

In a few posts' time, we will start analysing the data extracted from the app; it will be interesting to see whether, at that stage, we discover any more serious problems with the extraction code. Experience teaches that we probably will.