Unit Tests

Posted on Tue 19 April 2016 in TDDA • Tagged with xml, apple, health

In the last post, we presented a "reference" test for the code that extracts CSV files from the XML dump that the Apple Health app on iOS can produce.

We will now supplement that test with a few other, smaller and more conventional unit tests. Each unit test focuses on a narrower block of functionality.

The Test Code

As before, you can get the code from Github with

$ git clone https://github.com/tdda/applehealthdata.git

or if you have pulled it previously, with

$ git pull

This version of the code is tagged with v1.2, so if it has been updated by the time you read this, get that version with

$ git checkout v1.2

Here is the updated test code.

# -*- coding: utf-8 -*-
"""
testapplehealthdata.py: tests for the applehealthdata.py

Copyright (c) 2016 Nicholas J. Radcliffe
Licence: MIT
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals

import os
import re
import shutil
import sys
import unittest

from collections import Counter


from applehealthdata import (HealthDataExtractor,
                             format_freqs, format_value,
                             abbreviate, encode)

CLEAN_UP = True
VERBOSE = False


def get_base_dir():
    """
    Return the directory containing this test file,
    which will (normally) be the applehealthdata directory
    also containing the testdata dir.
    """
    return os.path.split(os.path.abspath(__file__))[0]


def get_testdata_dir():
    """Return the full path to the testdata directory"""
    return os.path.join(get_base_dir(), 'testdata')


def get_tmp_dir():
    """Return the full path to the tmp directory"""
    return os.path.join(get_base_dir(), 'tmp')


def remove_any_tmp_dir():
    """
    Remove the temporary directory if it exists.
    Returns its location either way.
    """
    tmp_dir = get_tmp_dir()
    if os.path.exists(tmp_dir):
        shutil.rmtree(tmp_dir)
    return tmp_dir


def make_tmp_dir():
    """
    Remove any existing tmp directory.
    Create an empty tmp directory.
    Return the location of the tmp dir.
    """
    tmp_dir = remove_any_tmp_dir()
    os.mkdir(tmp_dir)
    return tmp_dir


def copy_test_data():
    """
    Copy the test data export6s3sample.xml from testdata directory
    to tmp directory.
    """
    tmp_dir = make_tmp_dir()
    name = 'export6s3sample.xml'
    in_xml_file = os.path.join(get_testdata_dir(), name)
    out_xml_file = os.path.join(get_tmp_dir(), name)
    shutil.copyfile(in_xml_file, out_xml_file)
    return out_xml_file


class TestAppleHealthDataExtractor(unittest.TestCase):
    @classmethod
    def tearDownClass(cls):
        """Clean up by removing the tmp directory, if it exists."""
        if CLEAN_UP:
            remove_any_tmp_dir()

    def check_file(self, filename):
        expected_output = os.path.join(get_testdata_dir(), filename)
        actual_output = os.path.join(get_tmp_dir(), filename)
        with open(expected_output) as f:
            expected = f.read()
        with open(actual_output) as f:
            actual = f.read()
        self.assertEqual(expected, actual)

    def test_tiny_reference_extraction(self):
        path = copy_test_data()
        data = HealthDataExtractor(path, verbose=VERBOSE)
        data.extract()
        self.check_file('StepCount.csv')
        self.check_file('DistanceWalkingRunning.csv')

    def test_format_freqs(self):
        counts = Counter()
        self.assertEqual(format_freqs(counts), '')
        counts['one'] += 1
        self.assertEqual(format_freqs(counts), 'one: 1')
        counts['one'] += 1
        self.assertEqual(format_freqs(counts), 'one: 2')
        counts['two'] += 1
        counts['three'] += 1
        self.assertEqual(format_freqs(counts),
                         '''one: 2
three: 1
two: 1''')

    def test_format_null_values(self):
        for dt in ('s', 'n', 'd', 'z'):
            # Note: even an illegal type, z, produces correct output for
            # null values.
            # Questionable, but we'll leave as a feature
            self.assertEqual(format_value(None, dt), '')

    def test_format_numeric_values(self):
        cases = {
            '0': '0',
            '3': '3',
            '-1': '-1',
            '2.5': '2.5',
        }
        for (k, v) in cases.items():
            self.assertEqual((k, format_value(k, 'n')), (k, v))

    def test_format_date_values(self):
        hearts = 'any string not needing escaping or quoting; even this: ♥♥'
        cases = {
            '01/02/2000 12:34:56': '01/02/2000 12:34:56',
            hearts: hearts,
        }
        for (k, v) in cases.items():
            self.assertEqual((k, format_value(k, 'd')), (k, v))

    def test_format_string_values(self):
        cases = {
            'a': '"a"',
            '': '""',
            'one "2" three': r'"one \"2\" three"',
            r'1\2\3': r'"1\\2\\3"',
        }
        for (k, v) in cases.items():
            self.assertEqual((k, format_value(k, 's')), (k, v))

    def test_abbreviate(self):
        changed = {
            'HKQuantityTypeIdentifierHeight': 'Height',
            'HKQuantityTypeIdentifierStepCount': 'StepCount',
            'HK*TypeIdentifierStepCount': 'StepCount',
            'HKCharacteristicTypeIdentifierDateOfBirth': 'DateOfBirth',
            'HKCharacteristicTypeIdentifierBiologicalSex': 'BiologicalSex',
            'HKCharacteristicTypeIdentifierBloodType': 'BloodType',
            'HKCharacteristicTypeIdentifierFitzpatrickSkinType':
                                                    'FitzpatrickSkinType',
        }
        unchanged = [
            '',
            'a',
            'aHKQuantityTypeIdentifierHeight',
            'HKQuantityTypeIdentityHeight',
        ]
        for (k, v) in changed.items():
            self.assertEqual((k, abbreviate(k)), (k, v))
            self.assertEqual((k, abbreviate(k, False)), (k, k))
        for k in unchanged:
            self.assertEqual((k, abbreviate(k)), (k, k))

    def test_encode(self):
        # This test looks strange, but because of the import statement
        #     from __future__ import unicode_literals
        # in Python 2, type('a') is unicode, and the point of the encode
        # function is to ensure that it has been converted to a UTF-8 string
        # before writing to file.
        self.assertEqual(type(encode('a')), str)

    def test_extracted_reference_stats(self):
        path = copy_test_data()
        data = HealthDataExtractor(path, verbose=VERBOSE)

        self.assertEqual(data.n_nodes, 19)
        expectedRecordCounts = [
           ('DistanceWalkingRunning', 5),
           ('StepCount', 10),
        ]
        self.assertEqual(sorted(data.record_types.items()),
                         expectedRecordCounts)

        expectedTagCounts = [
           ('ActivitySummary', 2),
           ('ExportDate', 1),
           ('Me', 1),
           ('Record', 15),
        ]
        self.assertEqual(sorted(data.tags.items()),
                         expectedTagCounts)
        expectedFieldCounts = [
            ('HKCharacteristicTypeIdentifierBiologicalSex', 1),
            ('HKCharacteristicTypeIdentifierBloodType', 1),
            ('HKCharacteristicTypeIdentifierDateOfBirth', 1),
            ('HKCharacteristicTypeIdentifierFitzpatrickSkinType', 1),
            ('activeEnergyBurned', 2),
            ('activeEnergyBurnedGoal', 2),
            ('activeEnergyBurnedUnit', 2),
            ('appleExerciseTime', 2),
            ('appleExerciseTimeGoal', 2),
            ('appleStandHours', 2),
            ('appleStandHoursGoal', 2),
            ('creationDate', 15),
            ('dateComponents', 2),
            ('endDate', 15),
            ('sourceName', 15),
            ('startDate', 15),
            ('type', 15),
            ('unit', 15),
            ('value', 16),
        ]
        self.assertEqual(sorted(data.fields.items()),
                         expectedFieldCounts)


if __name__ == '__main__':
    unittest.main()

Notes

We're not going to discuss every part of the code, but will point out a few salient features.

  • I've added a coding line at the top of both the test script and the main applehealthdata.py script. This tells Python (and my editor, Emacs) the encoding of the file on disk (UTF-8). This is now relevant because one of the new tests (test_format_date_values) features a non-ASCII character in a string literal.

  • The previous test method test_tiny_fixed_extraction has been renamed test_tiny_reference_extraction, but is otherwise unchanged.

  • Several of the tests loop over dictionaries or lists of input-output pairs, with an assertion of some kind in the main body. Some people don't like this, and prefer one assertion per test. I don't really agree with that, but do think it's important to be able to see easily which assertion fails. An idiom I often use to assist this is to include the input on both sides of the test. For example, in test_abbreviate, when checking the abbreviation of items that should change, the code reads:

    for (k, v) in changed.items():
        self.assertEqual((k, abbreviate(k)), (k, v))
    

    rather than

    for (k, v) in changed.items():
        self.assertEqual(abbreviate(k), v)
    

    This makes it easy to tell which input fails, if one does, even in cases in which the main values being compared (abbreviate(k) and v, in this case) are long, complex or repeated across different inputs. It doesn't actually make much difference in these examples, but in general I find it helpful.

  • The test test_extracted_reference_stats checks that three counters used by the code work as expected. Some people would definitely advocate splitting this into three tests, but, even though the processing is quick, it seems more natural to me to test these together. This also means we don't have to process the XML file three times. There are other ways of achieving the same end, and this approach has the potential disadvantage that the later cases won't be run if the first one fails.

    The other point to note here is that Counter objects are unordered, so I've written the expected results as lists of (key, count) pairs sorted on their keys, and applied Python's sorted function, which returns a new list containing the values of an iterable in sorted order, to the actual results. We could avoid the sort by constructing sets or dictionaries from the Counter objects and checking those instead, but the sort here is not expensive, and this approach is probably simpler.

  • I haven't bothered to write a separate test for the extraction phase (checking that it writes the right CSV files) because that seems to me to add almost nothing over the existing reference test (test_tiny_reference_extraction).
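The sorting point above can be seen in a tiny illustration (the tag counts here are made up for the example, not taken from the test data):

```python
# A Counter has no ordering that is useful for comparison, but
# sorted(counter.items()) yields a deterministic list of (key, count)
# pairs that can be compared directly against an expected list.
from collections import Counter

tags = Counter(['Record', 'Record', 'Record', 'Me', 'ExportDate'])
assert sorted(tags.items()) == [('ExportDate', 1), ('Me', 1), ('Record', 3)]
```

This is exactly the shape used by expectedTagCounts and friends: a list of pairs, pre-sorted on the keys.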

Closing

That's it for this post. The unit tests are not terribly exciting, but they will prove useful as we extend the extraction code, which we'll start to do in the next post.

In a few posts' time, we will start analysing the data extracted from the app; it will be interesting to see whether, at that stage, we discover any more serious problems with the extraction code. Experience teaches that we probably will.


First Test

Posted on Mon 18 April 2016 in TDDA • Tagged with xml, apple, health

In the last post, I presented some code for extracting (some of) the data from the XML file exported by the Apple Health app on iOS, but—almost comically, given this blog's theme—omitted to include any tests. This post and the next couple (in quick succession) will aim to fix that.

This post begins to remedy that by writing a single "reference" test. To recap: a reference test is a test that tests a whole analytical process, checking that the known inputs produce the expected outputs. So far, our analytical process is quite small, consisting only of data extraction, but this will still prove very worthwhile.

Dogma

While the mainstream TDD dogma states that tests should be written before the code, it is far from uncommon to write them afterwards, and in the context of test-driven data analysis I maintain that this is usually preferable. Regardless, when you find yourself in a situation in which you have written some code and possess any reasonable level of belief that it might be right,1 an excellent starting point is simply to capture the input(s) that you have already used, together with the output that it generates, and write a test that checks that the input you provided produces the expected output. That's exactly the procedure I advocated for TDDA, and that's how we shall start here.

Test Data

The only flies in the ointment in this case are

  1. the input data I used initially was quite large (5.5MB compressed; 109MB uncompressed), leading to quite a slow test;

  2. the data is somewhat personal.

For both these reasons, I have decided to reduce it so that it will be more manageable, run more quickly, and be more suitable for public sharing.

So I cut down the data to contain only the DTD header, the Me record, ten StepCount records, and five DistanceWalkingRunning records. That results in a small, valid XML file (under 7K) containing exactly 100 lines. It's in the testdata subdirectory of the repository, and if I run the extractor on it (which you probably don't want to do, at least in situ, as that will trample over the reference output), the following output is produced:

$ python applehealthdata/applehealthdata.py testdata/export6s3sample.xml
Reading data from testdata/export6s3sample.xml . . . done

Tags:
ActivitySummary: 2
ExportDate: 1
Me: 1
Record: 15

Fields:
HKCharacteristicTypeIdentifierBiologicalSex: 1
HKCharacteristicTypeIdentifierBloodType: 1
HKCharacteristicTypeIdentifierDateOfBirth: 1
HKCharacteristicTypeIdentifierFitzpatrickSkinType: 1
activeEnergyBurned: 2
activeEnergyBurnedGoal: 2
activeEnergyBurnedUnit: 2
appleExerciseTime: 2
appleExerciseTimeGoal: 2
appleStandHours: 2
appleStandHoursGoal: 2
creationDate: 15
dateComponents: 2
endDate: 15
sourceName: 15
startDate: 15
type: 15
unit: 15
value: 16

Record types:
DistanceWalkingRunning: 5
StepCount: 10

Opening /Users/njr/qs/testdata/StepCount.csv for writing
Opening /Users/njr/qs/testdata/DistanceWalkingRunning.csv for writing
Written StepCount data.
Written DistanceWalkingRunning data.

The two CSV files it writes, which are also in the testdata subdirectory in the repository, are as follows:

$ cat testdata/StepCount.csv
sourceName,sourceVersion,device,type,unit,creationDate,startDate,endDate,value
"Health",,,"StepCount","count",2014-09-21 07:08:47 +0100,2014-09-13 10:27:54 +0100,2014-09-13 10:27:59 +0100,329
"Health",,,"StepCount","count",2014-09-21 07:08:47 +0100,2014-09-13 10:34:09 +0100,2014-09-13 10:34:14 +0100,283
"Health",,,"StepCount","count",2014-09-21 07:08:47 +0100,2014-09-13 10:39:29 +0100,2014-09-13 10:39:34 +0100,426
"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 10:45:36 +0100,2014-09-13 10:45:41 +0100,61
"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 10:51:16 +0100,2014-09-13 10:51:21 +0100,10
"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 10:57:40 +0100,2014-09-13 10:57:45 +0100,200
"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 11:03:00 +0100,2014-09-13 11:03:05 +0100,390
"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 11:08:10 +0100,2014-09-13 11:08:15 +0100,320
"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 11:27:22 +0100,2014-09-13 11:27:27 +0100,216
"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 11:33:24 +0100,2014-09-13 11:33:29 +0100,282

and

$ cat testdata/DistanceWalkingRunning.csv
sourceName,sourceVersion,device,type,unit,creationDate,startDate,endDate,value
"Health",,,"DistanceWalkingRunning","km",2014-09-21 07:08:49 +0100,2014-09-20 10:41:28 +0100,2014-09-20 10:41:30 +0100,0.00288
"Health",,,"DistanceWalkingRunning","km",2014-09-21 07:08:49 +0100,2014-09-20 10:41:30 +0100,2014-09-20 10:41:33 +0100,0.00284
"Health",,,"DistanceWalkingRunning","km",2014-09-21 07:08:49 +0100,2014-09-20 10:41:33 +0100,2014-09-20 10:41:36 +0100,0.00142
"Health",,,"DistanceWalkingRunning","km",2014-09-21 07:08:49 +0100,2014-09-20 10:43:54 +0100,2014-09-20 10:43:56 +0100,0.00639
"Health",,,"DistanceWalkingRunning","km",2014-09-21 07:08:49 +0100,2014-09-20 10:43:59 +0100,2014-09-20 10:44:01 +0100,0.0059

Reference Test

The code for a single reference test is below. It's slightly verbose, because it tries to use sensible locations for everything, but not complex.

As before, you can get the code from Github with

$ git clone https://github.com/tdda/applehealthdata.git

or if you have pulled it previously, you can update it with

$ git pull

This version of the code is tagged with v1.1, so if it has been updated by the time you read this, get that version with

$ git checkout v1.1

Here is the code:

"""
testapplehealthdata.py: tests for the applehealthdata.py

Copyright (c) 2016 Nicholas J. Radcliffe
Licence: MIT
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals

import os
import re
import shutil
import sys
import unittest

from applehealthdata import HealthDataExtractor

CLEAN_UP = True
VERBOSE = False


def get_base_dir():
    """
    Return the directory containing this test file,
    which will (normally) be the applehealthdata directory
    also containing the testdata dir.
    """
    return os.path.split(os.path.abspath(__file__))[0]


def get_testdata_dir():
    """Return the full path to the testdata directory"""
    return os.path.join(get_base_dir(), 'testdata')


def get_tmp_dir():
    """Return the full path to the tmp directory"""
    return os.path.join(get_base_dir(), 'tmp')


def remove_any_tmp_dir():
    """
    Remove the temporary directory if it exists.
    Returns its location either way.
    """
    tmp_dir = get_tmp_dir()
    if os.path.exists(tmp_dir):
        shutil.rmtree(tmp_dir)
    return tmp_dir


def make_tmp_dir():
    """
    Remove any existing tmp directory.
    Create an empty tmp directory.
    Return the location of the tmp dir.
    """
    tmp_dir = remove_any_tmp_dir()
    os.mkdir(tmp_dir)
    return tmp_dir


def copy_test_data():
    """
    Copy the test data export6s3sample.xml from testdata directory
    to tmp directory.
    """
    tmp_dir = make_tmp_dir()
    name = 'export6s3sample.xml'
    in_xml_file = os.path.join(get_testdata_dir(), name)
    out_xml_file = os.path.join(get_tmp_dir(), name)
    shutil.copyfile(in_xml_file, out_xml_file)
    return out_xml_file


class TestAppleHealthDataExtractor(unittest.TestCase):
    @classmethod
    def tearDownClass(cls):
        """Clean up by removing the tmp directory, if it exists."""
        if CLEAN_UP:
            remove_any_tmp_dir()

    def check_file(self, filename):
        expected_output = os.path.join(get_testdata_dir(), filename)
        actual_output = os.path.join(get_tmp_dir(), filename)
        with open(expected_output) as f:
            expected = f.read()
        with open(actual_output) as f:
            actual = f.read()
        self.assertEqual(expected, actual)

    def test_tiny_fixed_extraction(self):
        path = copy_test_data()
        data = HealthDataExtractor(path, verbose=VERBOSE)
        data.extract()
        self.check_file('StepCount.csv')
        self.check_file('DistanceWalkingRunning.csv')


if __name__ == '__main__':
    unittest.main()

Running the Test

This is what I get if I run it:

$ python testapplehealthdata.py
.
----------------------------------------------------------------------
Ran 1 test in 0.007s

OK
$

That's encouraging, but not particularly informative. If we change the value of VERBOSE at the top of the test file to True, we see slightly more reassuring output:

$ python testapplehealthdata.py
Reading data from /Users/njr/qs/applehealthdata/tmp/export6s3sample.xml . . . done
Opening /Users/njr/qs/applehealthdata/tmp/StepCount.csv for writing
Opening /Users/njr/qs/applehealthdata/tmp/DistanceWalkingRunning.csv for writing
Written StepCount data.
Written DistanceWalkingRunning data.
.
----------------------------------------------------------------------
Ran 1 test in 0.006s

NOTE: The tearDownClass method is a special Python class method that the unit testing framework runs after executing all the tests in the class, regardless of whether they pass, fail or produce errors. I use it to remove the tmp directory containing any test output, which is normally good practice. In a later post, we'll either modify this to leave the output around if any tests fail, or make some other change to make it easier to diagnose what's gone wrong. In the meantime, if you change the value of CLEAN_UP, towards the top of the code, to False, it will leave the tmp directory around, allowing you to examine the files it has produced.
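One hypothetical way to "leave the output around if any tests fail", sketched here purely for illustration (it is not necessarily the change a later post will make): record failures in a class-level flag, and have tearDownClass consult it. The class name, flag and helper below are all invented for this sketch.

```python
import unittest

class ExtractorTests(unittest.TestCase):
    # Hypothetical flag (not in the real test file): set to True by any
    # failing check, and consulted during class-level tear-down.
    any_failures = False

    @classmethod
    def tearDownClass(cls):
        # Clean up only when every test passed, so failing output stays
        # around for inspection; remove_any_tmp_dir() would go here.
        if not cls.any_failures:
            pass

    def checked_equal(self, expected, actual):
        # Wrapper around assertEqual that records the failure before
        # re-raising it, so the flag reflects the run's outcome.
        try:
            self.assertEqual(expected, actual)
        except AssertionError:
            type(self).any_failures = True
            raise

    def test_passing(self):
        self.checked_equal(1 + 1, 2)
```

The cost of this approach is that every assertion has to go through the wrapper; its benefit is that it needs no changes to how the tests are run.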

Overview

The test itself is in the 5-line method test_tiny_fixed_extraction. Here's what the five lines do:

  1. Copy the input XML file from the testdata directory to the tmp directory. The Github repository contains the 100-line input XML file together with the expected output in the testdata subdirectory. Because the data extractor writes the CSV files next to the input data, the cleanest thing for us to do is to take a copy of the input data, write it into a new directory (applehealthdata/tmp) and also to use that directory as the location for the output CSV files. The copy_test_data function removes any existing tmp directory it finds, creates a fresh one, copies the input test data into it and returns the path to the test data file. The only "magic" here is that the get_base_dir function figures out where to locate everything by using __file__, which is the location of the source file being executed by Python.

  2. Create a HealthDataExtractor object, using the location of the copy of the input data returned by copy_test_data(). Note that it sets verbose to False, making the test silent, and allowing the line of dots from a successful test run (in this case, a single dot) to be presented without interruption.

  3. Extract the data. This writes two output files to the applehealthdata/tmp directory.

  4. Check that the contents of tmp/StepCount.csv match the reference output in testdata/StepCount.csv.

  5. Check that the contents of tmp/DistanceWalkingRunning.csv match the reference output in testdata/DistanceWalkingRunning.csv.

Write-Test-Break-Run-Repair-Rerun

In cases in which the tests are written after the code, it's important to check that they really are running correctly. My usual approach to that is to write the test, and if it appears to pass first time,2 to break it deliberately to verify that it fails when it should, before repairing it. In this case, the simplest way to break the test is to change the reference data temporarily. This will also reveal a weakness in the current check_file function.

We'll try three variants of this:

Variant 1: Break the StepCount.csv reference data.

First, I add a Z to the end of testdata/StepCount.csv and re-run the tests:

$ python testapplehealthdata.py
F
======================================================================
FAIL: test_tiny_fixed_extraction (__main__.TestAppleHealthDataExtractor)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "testapplehealthdata.py", line 98, in test_tiny_fixed_extraction
    self.check_file('StepCount.csv')
  File "testapplehealthdata.py", line 92, in check_file
    self.assertEqual(expected, actual)
AssertionError: 'sourceName,sourceVersion,device,type,unit,creationDate,startDate,endDate,value\n"Health",,,"StepCount","count",2014-09-21 07:08:47 +0100,2014-09-13 10:27:54 +0100,2014-09-13 10:27:59 +0100,329\n"Health",,,"StepCount","count",2014-09-21 07:08:47 +0100,2014-09-13 10:34:09 +0100,2014-09-13 10:34:14 +0100,283\n"Health",,,"StepCount","count",2014-09-21 07:08:47 +0100,2014-09-13 10:39:29 +0100,2014-09-13 10:39:34 +0100,426\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 10:45:36 +0100,2014-09-13 10:45:41 +0100,61\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 10:51:16 +0100,2014-09-13 10:51:21 +0100,10\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 10:57:40 +0100,2014-09-13 10:57:45 +0100,200\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 11:03:00 +0100,2014-09-13 11:03:05 +0100,390\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 11:08:10 +0100,2014-09-13 11:08:15 +0100,320\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 11:27:22 +0100,2014-09-13 11:27:27 +0100,216\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 11:33:24 +0100,2014-09-13 11:33:29 +0100,282\nZ' != 'sourceName,sourceVersion,device,type,unit,creationDate,startDate,endDate,value\n"Health",,,"StepCount","count",2014-09-21 07:08:47 +0100,2014-09-13 10:27:54 +0100,2014-09-13 10:27:59 +0100,329\n"Health",,,"StepCount","count",2014-09-21 07:08:47 +0100,2014-09-13 10:34:09 +0100,2014-09-13 10:34:14 +0100,283\n"Health",,,"StepCount","count",2014-09-21 07:08:47 +0100,2014-09-13 10:39:29 +0100,2014-09-13 10:39:34 +0100,426\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 10:45:36 +0100,2014-09-13 10:45:41 +0100,61\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 10:51:16 +0100,2014-09-13 10:51:21 +0100,10\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 10:57:40 +0100,2014-09-13 10:57:45 
+0100,200\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 11:03:00 +0100,2014-09-13 11:03:05 +0100,390\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 11:08:10 +0100,2014-09-13 11:08:15 +0100,320\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 11:27:22 +0100,2014-09-13 11:27:27 +0100,216\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 11:33:24 +0100,2014-09-13 11:33:29 +0100,282\n'

----------------------------------------------------------------------
Ran 1 test in 0.005s

FAILED (failures=1)
$

That causes the expected failure. Because, however, we've compared the entire contents of the two CSV files, it's hard to see what's actually gone wrong. We'll address this by improving the check_file method in a later post.
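To illustrate one possible direction for that improvement (a sketch only; not necessarily what the later post will do), a unified diff shows just the lines that differ instead of dumping both files in full. The diff_strings helper below is invented for this sketch:

```python
import difflib

def diff_strings(expected, actual):
    """Return a unified diff of two multi-line strings, or '' if equal."""
    if expected == actual:
        return ''
    return ''.join(difflib.unified_diff(
        expected.splitlines(True), actual.splitlines(True),
        fromfile='expected', tofile='actual'))
```

Inside check_file, the final assertion could then become self.assertEqual(expected, actual, diff_strings(expected, actual)), so the failure message ends with only the differing lines rather than two full copies of the file.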

Variant 2: Break the DistanceWalkingRunning.csv reference data.

After restoring the StepCount.csv data, I modify the reference testdata/DistanceWalkingRunning.csv data. This time, I'll change Health to Wealth throughout.

$ python testapplehealthdata.py
F
======================================================================
FAIL: test_tiny_fixed_extraction (__main__.TestAppleHealthDataExtractor)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "testapplehealthdata.py", line 99, in test_tiny_fixed_extraction
    self.check_file('DistanceWalkingRunning.csv')
  File "testapplehealthdata.py", line 92, in check_file
    self.assertEqual(expected, actual)
AssertionError: 'sourceName,sourceVersion,device,type,unit,creationDate,startDate,endDate,value\n"Wealth",,,"DistanceWalkingRunning","km",2014-09-21 07:08:49 +0100,2014-09-20 10:41:28 +0100,2014-09-20 10:41:30 +0100,0.00288\n"Wealth",,,"DistanceWalkingRunning","km",2014-09-21 07:08:49 +0100,2014-09-20 10:41:30 +0100,2014-09-20 10:41:33 +0100,0.00284\n"Wealth",,,"DistanceWalkingRunning","km",2014-09-21 07:08:49 +0100,2014-09-20 10:41:33 +0100,2014-09-20 10:41:36 +0100,0.00142\n"Wealth",,,"DistanceWalkingRunning","km",2014-09-21 07:08:49 +0100,2014-09-20 10:43:54 +0100,2014-09-20 10:43:56 +0100,0.00639\n"Wealth",,,"DistanceWalkingRunning","km",2014-09-21 07:08:49 +0100,2014-09-20 10:43:59 +0100,2014-09-20 10:44:01 +0100,0.0059\n' != 'sourceName,sourceVersion,device,type,unit,creationDate,startDate,endDate,value\n"Health",,,"DistanceWalkingRunning","km",2014-09-21 07:08:49 +0100,2014-09-20 10:41:28 +0100,2014-09-20 10:41:30 +0100,0.00288\n"Health",,,"DistanceWalkingRunning","km",2014-09-21 07:08:49 +0100,2014-09-20 10:41:30 +0100,2014-09-20 10:41:33 +0100,0.00284\n"Health",,,"DistanceWalkingRunning","km",2014-09-21 07:08:49 +0100,2014-09-20 10:41:33 +0100,2014-09-20 10:41:36 +0100,0.00142\n"Health",,,"DistanceWalkingRunning","km",2014-09-21 07:08:49 +0100,2014-09-20 10:43:54 +0100,2014-09-20 10:43:56 +0100,0.00639\n"Health",,,"DistanceWalkingRunning","km",2014-09-21 07:08:49 +0100,2014-09-20 10:43:59 +0100,2014-09-20 10:44:01 +0100,0.0059\n'

----------------------------------------------------------------------
Ran 1 test in 0.005s

FAILED (failures=1)
$

The story is very much the same: the test has failed, which is good, but again the source of difference is hard to discern.

Variant 3: Break the input XML Data.

After restoring DistanceWalkingRunning.csv, I modify the input XML file. In this case, I'll just change the first step count to be 330 instead of 329:

$ python testapplehealthdata.py
F
======================================================================
FAIL: test_tiny_fixed_extraction (__main__.TestAppleHealthDataExtractor)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "testapplehealthdata.py", line 98, in test_tiny_fixed_extraction
    self.check_file('StepCount.csv')
  File "testapplehealthdata.py", line 92, in check_file
    self.assertEqual(expected, actual)
AssertionError: 'sourceName,sourceVersion,device,type,unit,creationDate,startDate,endDate,value\n"Health",,,"StepCount","count",2014-09-21 07:08:47 +0100,2014-09-13 10:27:54 +0100,2014-09-13 10:27:59 +0100,329\n"Health",,,"StepCount","count",2014-09-21 07:08:47 +0100,2014-09-13 10:34:09 +0100,2014-09-13 10:34:14 +0100,283\n"Health",,,"StepCount","count",2014-09-21 07:08:47 +0100,2014-09-13 10:39:29 +0100,2014-09-13 10:39:34 +0100,426\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 10:45:36 +0100,2014-09-13 10:45:41 +0100,61\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 10:51:16 +0100,2014-09-13 10:51:21 +0100,10\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 10:57:40 +0100,2014-09-13 10:57:45 +0100,200\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 11:03:00 +0100,2014-09-13 11:03:05 +0100,390\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 11:08:10 +0100,2014-09-13 11:08:15 +0100,320\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 11:27:22 +0100,2014-09-13 11:27:27 +0100,216\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 11:33:24 +0100,2014-09-13 11:33:29 +0100,282\n' != 'sourceName,sourceVersion,device,type,unit,creationDate,startDate,endDate,value\n"Health",,,"StepCount","count",2014-09-21 07:08:47 +0100,2014-09-13 10:27:54 +0100,2014-09-13 10:27:59 +0100,330\n"Health",,,"StepCount","count",2014-09-21 07:08:47 +0100,2014-09-13 10:34:09 +0100,2014-09-13 10:34:14 +0100,283\n"Health",,,"StepCount","count",2014-09-21 07:08:47 +0100,2014-09-13 10:39:29 +0100,2014-09-13 10:39:34 +0100,426\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 10:45:36 +0100,2014-09-13 10:45:41 +0100,61\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 10:51:16 +0100,2014-09-13 10:51:21 +0100,10\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 10:57:40 +0100,2014-09-13 10:57:45 
+0100,200\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 11:03:00 +0100,2014-09-13 11:03:05 +0100,390\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 11:08:10 +0100,2014-09-13 11:08:15 +0100,320\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 11:27:22 +0100,2014-09-13 11:27:27 +0100,216\n"Health",,,"StepCount","count",2014-09-21 07:08:48 +0100,2014-09-13 11:33:24 +0100,2014-09-13 11:33:29 +0100,282\n'

----------------------------------------------------------------------
Ran 1 test in 0.005s

FAILED (failures=1)
$

Again, we get the expected failure, and again it's hard to see what it is. (We really will need to improve check_file.)
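One way check_file's failure reporting might be improved (a sketch, not the project's actual implementation; the helper name diff_report is invented for illustration) is to show a line-by-line unified diff instead of two near-identical walls of text:

```python
import difflib

def diff_report(actual, expected, name='output'):
    """
    Return a unified diff between expected and actual text,
    which is far easier to scan than two long repr() strings.
    """
    return ''.join(difflib.unified_diff(expected.splitlines(True),
                                        actual.splitlines(True),
                                        fromfile='expected %s' % name,
                                        tofile='actual %s' % name))
```

With something like this, a mismatch in a single StepCount value would show up as one removed line and one added line, rather than the unreadable failure above.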

Enough

That's enough for this post. We've successfully added a single "reference" test to the code, which should at least make sure that if we break it during further enhancements, we will notice. It will also check that it is working correctly on other platforms (e.g., yours).

We haven't done anything to check that the CSV files produced are genuinely right, beyond the initial eye-balling I did when first extracting the data. But if we see problems when we start doing proper analysis, it will be easy to correct the expected output to keep the test running. And in the meantime, we'll notice if we make changes to the code that result in different output when it wasn't meant to do so. This is one part of the pragmatic essence of basic TDDA.

We also haven't written any unit tests at all for the extraction code; we'll do that in a later post.


  1. For example, you might have already blogged about it and pushed it to a public repository on Github 

  2. Which is not always the case 


In Defence of XML: Exporting and Analysing Apple Health Data

Posted on Fri 15 April 2016 in TDDA • Tagged with xml, apple, health

I'm going to present a series of posts based around the sort of health and fitness data that can now be collected by some phones and dedicated fitness trackers. Not all of these will be centrally on topic for test-driven data analysis, but I think they'll provide an interesting set of data for discussing many issues of relevance, so I hope readers will forgive me to the extent that these stray from the central theme.

The particular focus for this series will be the data available from an iPhone and the Apple Health app, over a couple of different phones, and with a couple of different devices paired to them.

In particular, the setup will be:

  • Apple iPhone 6s (November 2015 to present)
  • Apple iPhone 5s (with fitness data from Sept 2014 to Nov 2015)
  • Several Misfit Shine activity trackers (until early March 2016)
  • An Apple Watch (about a month of data, to date)

Getting data out of Apple Health (The Exploratory Version)

I hadn't initially spotted a way to get the data out of Apple's Health app, but a quick web search1 turned up this very helpful article: https://www.idownloadblog.com/2015/06/10/how-to-export-import-health-data. It turns out there is a properly supported way to export granular data from Apple Health, described in detail in the post. Essentially:

  • Open the Apple Health App.
  • Navigate to the Health Data section (left icon at the bottom)
  • Select All from the list of categories
  • There is a share icon at the top right (a vertical arrow sticking up from a square)
  • Tap that to export all data
  • It thinks for a while (quite a while, in fact) and then offers you various export options, which for me included Airdrop, email and handing the data to other apps. I used Airdrop to dump it onto a Mac.

The result is a compressed XML file called export.zip. For me, this was about 5.5MB, which expanded to 109MB when unzipped. (Interestingly, I started this with an earlier export a couple of weeks ago, when the zipped file was about 5MB and the expanded version was 90MB, so it is growing fairly quickly, thanks to the Watch.)

As helpful as the iDownloadBlog article is, I have to comment on its introduction to exporting data, which reads

There are actually two ways to export the data from your Health app. The first way, is one provided by Apple, but it is virtually useless.

To be fair to iDownloadBlog, an XML file like this probably is useless to the general reader, but it builds on a meme fashionable among developers and data scientists to the effect of "XML is painful to process, verbose and always worse than JSON", and I think this is somewhat unfair.

Let's explore export.xml using Python and the ElementTree library. Although the decompressed file is quite large (109MB), it's certainly not problematically large to read into memory on a modern machine, so I'm not going to worry about reading it in bits: I'm just going to find out as quickly as possible what's in it.
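(As an aside: if the export ever did outgrow memory, ElementTree's iterparse would let us process it record by record without loading the whole tree. A minimal sketch, here using a tiny inline stand-in rather than the real export.xml:)

```python
import io
from xml.etree.ElementTree import iterparse

# Tiny stand-in for export.xml; with the real file you would pass its path.
doc = io.StringIO('<HealthData>'
                  '<Record type="StepCount"/>'
                  '<Record type="Height"/>'
                  '</HealthData>')

n_records = 0
for event, elem in iterparse(doc):   # fires as each element is closed
    if elem.tag == 'Record':
        n_records += 1
        elem.clear()                 # discard processed elements to bound memory
```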

The first thing to do, of course, is simply to look at the file, probably using either the more or less command, assuming you are on some flavour of Unix or Linux. Let's look at the top of my export.xml:

$ head -79 export6s3/export.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE HealthData [
<!-- HealthKit Export Version: 3 -->
<!ELEMENT HealthData (ExportDate,Me,(Record|Correlation|Workout|ActivitySummary)*)>
<!ATTLIST HealthData
  locale CDATA #REQUIRED
>
<!ELEMENT ExportDate EMPTY>
<!ATTLIST ExportDate
  value CDATA #REQUIRED
>
<!ELEMENT Me EMPTY>
<!ATTLIST Me
  HKCharacteristicTypeIdentifierDateOfBirth         CDATA #REQUIRED
  HKCharacteristicTypeIdentifierBiologicalSex       CDATA #REQUIRED
  HKCharacteristicTypeIdentifierBloodType           CDATA #REQUIRED
  HKCharacteristicTypeIdentifierFitzpatrickSkinType CDATA #REQUIRED
>
<!ELEMENT Record (MetadataEntry*)>
<!ATTLIST Record
  type          CDATA #REQUIRED
  unit          CDATA #IMPLIED
  value         CDATA #IMPLIED
  sourceName    CDATA #REQUIRED
  sourceVersion CDATA #IMPLIED
  device        CDATA #IMPLIED
  creationDate  CDATA #IMPLIED
  startDate     CDATA #REQUIRED
  endDate       CDATA #REQUIRED
>
<!-- Note: Any Records that appear as children of a correlation also appear as top-level records in this document. -->
<!ELEMENT Correlation ((MetadataEntry|Record)*)>
<!ATTLIST Correlation
  type          CDATA #REQUIRED
  sourceName    CDATA #REQUIRED
  sourceVersion CDATA #IMPLIED
  device        CDATA #IMPLIED
  creationDate  CDATA #IMPLIED
  startDate     CDATA #REQUIRED
  endDate       CDATA #REQUIRED
>
<!ELEMENT Workout ((MetadataEntry|WorkoutEvent)*)>
<!ATTLIST Workout
  workoutActivityType   CDATA #REQUIRED
  duration              CDATA #IMPLIED
  durationUnit          CDATA #IMPLIED
  totalDistance         CDATA #IMPLIED
  totalDistanceUnit     CDATA #IMPLIED
  totalEnergyBurned     CDATA #IMPLIED
  totalEnergyBurnedUnit CDATA #IMPLIED
  sourceName            CDATA #REQUIRED
  sourceVersion         CDATA #IMPLIED
  device                CDATA #IMPLIED
  creationDate          CDATA #IMPLIED
  startDate             CDATA #REQUIRED
  endDate               CDATA #REQUIRED
>
<!ELEMENT WorkoutEvent EMPTY>
<!ATTLIST WorkoutEvent
  type CDATA #REQUIRED
  date CDATA #REQUIRED
>
<!ELEMENT ActivitySummary EMPTY>
<!ATTLIST ActivitySummary
  dateComponents           CDATA #IMPLIED
  activeEnergyBurned       CDATA #IMPLIED
  activeEnergyBurnedGoal   CDATA #IMPLIED
  activeEnergyBurnedUnit   CDATA #IMPLIED
  appleExerciseTime        CDATA #IMPLIED
  appleExerciseTimeGoal    CDATA #IMPLIED
  appleStandHours          CDATA #IMPLIED
  appleStandHoursGoal      CDATA #IMPLIED
>
<!ELEMENT MetadataEntry EMPTY>
<!ATTLIST MetadataEntry
  key   CDATA #REQUIRED
  value CDATA #REQUIRED
>
]>

This is immediately encouraging: Apple has provided DOCTYPE (DTD) information, which, even though slightly old-fashioned, tells us what we should expect to find in the file. DTDs are awkward to use, and when coming from untrusted sources can leave the user potentially vulnerable to malicious attacks, but despite this, they are quite expressive and helpful, even just as plain-text documentation.

Roughly speaking, the lines:

<!ELEMENT HealthData (ExportDate,Me,(Record|Correlation|Workout|ActivitySummary)*)>
<!ATTLIST HealthData
  locale CDATA #REQUIRED
>

say

  • that the top element will be a HealthData element

  • that this HealthData element will contain

    • an ExportDate element
    • a Me element
    • zero or more elements of type Record, Correlation, Workout or ActivitySummary
  • and that the HealthData element will have an attribute locale (which is mandatory).

The rest of this DTD section describes each kind of record in more detail.

The next 6 lines in my XML file are as follows (spread out for readability):

<HealthData locale="en_GB">
 <ExportDate value="2016-04-15 07:27:26 +0100"/>
 <Me HKCharacteristicTypeIdentifierDateOfBirth="1965-07-31"
     HKCharacteristicTypeIdentifierBiologicalSex="HKBiologicalSexMale"
     HKCharacteristicTypeIdentifierBloodType="HKBloodTypeNotSet"
     HKCharacteristicTypeIdentifierFitzpatrickSkinType="HKFitzpatrickSkinTypeNotSet"/>
 <Record type="HKQuantityTypeIdentifierHeight"
         sourceName="Health"
         sourceVersion="9.2"
         unit="cm"
         creationDate="2016-01-02 09:45:10 +0100"
         startDate="2016-01-02 09:44:00 +0100"
         endDate="2016-01-02 09:44:00 +0100"
         value="194">
  <MetadataEntry key="HKWasUserEntered" value="1"/>
 </Record>

As you can see, the export format is verbose, but extremely comprehensible and comprehensive. It's also very easy to read into Python and explore.

Let's do that in an interactive Python session:

>>> from xml.etree import ElementTree as ET
>>> with open('export.xml') as f:
...     data = ET.parse(f)
... 
>>> data
<xml.etree.ElementTree.ElementTree object at 0x107347a50>

The ElementTree module turns each XML element into an Element object, described by its tag, with a few standard attributes.

Inspecting the data object, we find:

>>> data.__dict__
{'_root': <Element 'HealthData' at 0x1073c2050>}

i.e., we have a single entry in data—a root element called HealthData.

Like all Element objects, it has the four standard attributes:2

>>> root = data._root
>>> root.__dict__.keys()
['text', 'attrib', 'tag', '_children']

These are:

>>> root.attrib
{'locale': 'en_GB'}

>>> root.text
'\n '

>>> root.tag
'HealthData'

>>> len(root._children)
446702

So nothing much apart from a locale and a whole lot of child nodes. Let's inspect the first few of them:

>>> nodes = root._children
>>> nodes[0]
<Element 'ExportDate' at 0x1073c2090>

>>> ET.dump(nodes[0])
<ExportDate value="2016-04-15 07:27:26 +0100" />

>>> nodes[1]
<Element 'Me' at 0x1073c2190>
>>> ET.dump(nodes[1])
<Me HKCharacteristicTypeIdentifierBiologicalSex="HKBiologicalSexMale"
    HKCharacteristicTypeIdentifierBloodType="HKBloodTypeNotSet"
    HKCharacteristicTypeIdentifierDateOfBirth="1965-07-31"
    HKCharacteristicTypeIdentifierFitzpatrickSkinType="HKFitzpatrickSkinTypeNotSet" />

>>> nodes[2]
<Element 'Record' at 0x1073c2410>
>>> ET.dump(nodes[2])
<Record creationDate="2016-01-02 09:45:10 +0100"
        endDate="2016-01-02 09:44:00 +0100"
        sourceName="Health"
        sourceVersion="9.2"
        startDate="2016-01-02 09:44:00 +0100"
        type="HKQuantityTypeIdentifierHeight"
        unit="cm"
        value="194">
  <MetadataEntry key="HKWasUserEntered" value="1" />
 </Record>

>>> nodes[3]
<Element 'Record' at 0x1073c2550>
>>> nodes[4]
<Element 'Record' at 0x1073c2650>

So, exactly as the DTD indicated, we have an ExportDate node, a Me node and then what looks like a great number of records. Let's confirm that:

>>> set(node.tag for node in nodes[2:])
set(['Record', 'Workout', 'ActivitySummary'])

So in fact, there are three kinds of nodes after the ExportDate and Me records. Let's count them:

>>> records = [node for node in nodes if node.tag == 'Record']
>>> len(records)
446670

These records are ones like the Height record we saw above, though in fact most of them are not Height but StepCount, DistanceWalkingRunning or BasalEnergyBurned records, e.g.:

>>> ET.dump(nodes[100000])
<Record creationDate="2015-01-11 07:40:15 +0000"
        endDate="2015-01-10 13:39:35 +0000"
        sourceName="njr iPhone 6s"
        startDate="2015-01-10 13:39:32 +0000"
        type="HKQuantityTypeIdentifierStepCount"
        unit="count"
        value="4" />

There is also one activity summary per day (since I got the watch).

>>> acts = [node for node in nodes if node.tag == 'ActivitySummary']
>>> len(acts)
29

The first one isn't very exciting:

>>> ET.dump(acts[0])
<ActivitySummary activeEnergyBurned="0"
                 activeEnergyBurnedGoal="0"
                 activeEnergyBurnedUnit="kcal"
                 appleExerciseTime="0"
                 appleExerciseTimeGoal="30"
                 appleStandHours="0"
                 appleStandHoursGoal="12"
                 dateComponents="2016-03-18" />

but they get better:

>>> ET.dump(acts[2])
<ActivitySummary activeEnergyBurned="652.014"
                 activeEnergyBurnedGoal="500"
                 activeEnergyBurnedUnit="kcal"
                 appleExerciseTime="77"
                 appleExerciseTimeGoal="30"
                 appleStandHours="17"
                 appleStandHoursGoal="12"
                 dateComponents="2016-03-20" />

Finally, there is a solitary Workout record.

>>> workouts = [node for node in nodes if node.tag == 'Workout']
>>> ET.dump(workouts[0])
<Workout creationDate="2016-04-02 11:12:57 +0100"
         duration="31.73680251737436"
         durationUnit="min"
         endDate="2016-04-02 11:12:22 +0100"
         sourceName="NJR Apple&#160;Watch"
         sourceVersion="2.2"
         startDate="2016-04-02 10:40:38 +0100"
         totalDistance="0"
         totalDistanceUnit="km"
         totalEnergyBurned="139.3170000000021"
         totalEnergyBurnedUnit="kcal"
         workoutActivityType="HKWorkoutActivityTypeOther" />

So there we have it.
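A small caveat on the session above: it reaches into private attributes (data._root, root._children) for brevity. The public API gives the same information: getroot() on a parsed tree returns the root element, and iterating over an element yields its children. A version avoiding the underscores might look like this (with a tiny inline stand-in for export.xml):

```python
from collections import Counter
from xml.etree import ElementTree as ET

# Tiny stand-in document with the same top-level shape as export.xml.
xml = ('<HealthData locale="en_GB">'
       '<ExportDate value="2016-04-15 07:27:26 +0100"/>'
       '<Me/>'
       '<Record type="StepCount"/>'
       '<Record type="Height"/>'
       '</HealthData>')

root = ET.fromstring(xml)    # for a real file: ET.parse(f).getroot()
tag_counts = Counter(child.tag for child in root)   # iteration yields children
```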

Getting data out of Apple Health (The Code)

Given this exploration, we can take a first shot at writing an exporter for Apple Health Data. I'm going to ignore the activity summaries and workout(s) for now, and concentrate on the main records. (We'll get to the others in a later post.)

Here is the code:

"""
applehealthdata.py: Extract data from Apple Health App's export.xml.

Copyright (c) 2016 Nicholas J. Radcliffe
Licence: MIT
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals

import os
import re
import sys

from xml.etree import ElementTree
from collections import Counter, OrderedDict

__version__ = '1.0'

FIELDS = OrderedDict((
    ('sourceName', 's'),
    ('sourceVersion', 's'),
    ('device', 's'),
    ('type', 's'),
    ('unit', 's'),
    ('creationDate', 'd'),
    ('startDate', 'd'),
    ('endDate', 'd'),
    ('value', 'n'),
))

PREFIX_RE = re.compile('^HK.*TypeIdentifier(.+)$')
ABBREVIATE = True
VERBOSE = True

def format_freqs(counter):
    """
    Format a counter object for display.
    """
    return '\n'.join('%s: %d' % (tag, counter[tag])
                     for tag in sorted(counter.keys()))


def format_value(value, datatype):
    """
    Format a value for a CSV file, escaping double quotes and backslashes.

    None maps to empty.

    datatype should be
        's' for string (escaped)
        'n' for number
        'd' for datetime
    """
    if value is None:
        return ''
    elif datatype == 's':  # string
        return '"%s"' % value.replace('\\', '\\\\').replace('"', '\\"')
    elif datatype in ('n', 'd'):  # number or date
        return value
    else:
        raise KeyError('Unexpected format value: %s' % datatype)


def abbreviate(s):
    """
    Abbreviate particularly verbose strings based on a regular expression
    """
    m = re.match(PREFIX_RE, s)
    return m.group(1) if ABBREVIATE and m else s


def encode(s):
    """
    Encode string for writing to file.
    In Python 2, this encodes as UTF-8, whereas in Python 3,
    it does nothing
    """
    return s.encode('UTF-8') if sys.version_info.major < 3 else s



class HealthDataExtractor(object):
    """
    Extract health data from Apple Health App's XML export, export.xml.

    Inputs:
        path:      Relative or absolute path to export.xml
        verbose:   Set to False for less verbose output

    Outputs:
        Writes a CSV file for each record type found, in the same
        directory as the input export.xml. Reports each file written
        unless verbose has been set to False.
    """
    def __init__(self, path, verbose=VERBOSE):
        self.in_path = path
        self.verbose = verbose
        self.directory = os.path.abspath(os.path.split(path)[0])
        with open(path) as f:
            self.report('Reading data from %s . . . ' % path, end='')
            self.data = ElementTree.parse(f)
            self.report('done')
        self.root = self.data._root
        self.nodes = self.root.getchildren()
        self.n_nodes = len(self.nodes)
        self.abbreviate_types()
        self.collect_stats()

    def report(self, msg, end='\n'):
        if self.verbose:
            print(msg, end=end)
            sys.stdout.flush()

    def count_tags_and_fields(self):
        self.tags = Counter()
        self.fields = Counter()
        for record in self.nodes:
            self.tags[record.tag] += 1
            for k in record.keys():
                self.fields[k] += 1

    def count_record_types(self):
        self.record_types = Counter()
        for record in self.nodes:
            if record.tag == 'Record':
                self.record_types[record.attrib['type']] += 1

    def collect_stats(self):
        self.count_record_types()
        self.count_tags_and_fields()

    def open_for_writing(self):
        self.handles = {}
        self.paths = []
        for kind in self.record_types:
            path = os.path.join(self.directory, '%s.csv' % abbreviate(kind))
            f = open(path, 'w')
            f.write(','.join(FIELDS) + '\n')
            self.handles[kind] = f
            self.report('Opening %s for writing' % path)

    def abbreviate_types(self):
        """
        Shorten types by removing common boilerplate text.
        """
        for node in self.nodes:
            if node.tag == 'Record':
                if 'type' in node.attrib:
                    node.attrib['type'] = abbreviate(node.attrib['type'])


    def write_records(self):
        for node in self.nodes:
            if node.tag == 'Record':
                attributes = node.attrib
                kind = attributes['type']
                values = [format_value(attributes.get(field), datatype)
                          for (field, datatype) in FIELDS.items()]
                line = encode(','.join(values) + '\n')
                self.handles[kind].write(line)

    def close_files(self):
        for (kind, f) in self.handles.items():
            f.close()
            self.report('Written %s data.' % abbreviate(kind))

    def extract(self):
        self.open_for_writing()
        self.write_records()
        self.close_files()

    def report_stats(self):
        print('\nTags:\n%s\n' % format_freqs(self.tags))
        print('Fields:\n%s\n' % format_freqs(self.fields))
        print('Record types:\n%s\n' % format_freqs(self.record_types))


if __name__ == '__main__':
    if len(sys.argv) != 2:
        print('USAGE: python applehealthdata.py /path/to/export.xml',
              file=sys.stderr)
        sys.exit(1)
    data = HealthDataExtractor(sys.argv[1])
    data.report_stats()
    data.extract()

To run this code, clone the repo from github.com/tdda/applehealthdata with:

$ git clone https://github.com/tdda/applehealthdata.git

or save the text from this post as healthdata.py. At the time of posting, the repository matches the code above, and the commit is tagged with the version number v1.0, so if the code has moved on by the time you read this, you can retrieve this version with:

$ git checkout v1.0

If your data is in the same directory as the code, then simply run:

$ python healthdata.py export.xml

and, depending on size, wait a few minutes while it runs. The code runs under both Python 2 and Python 3.

When I do this, the output is as follows:

$ python applehealthdata/applehealthdata.py export6s3/export.xml
Reading data from export6s3/export.xml . . . done

Tags:
ActivitySummary: 29
ExportDate: 1
Me: 1
Record: 446670
Workout: 1

Fields:
HKCharacteristicTypeIdentifierBiologicalSex: 1
HKCharacteristicTypeIdentifierBloodType: 1
HKCharacteristicTypeIdentifierDateOfBirth: 1
HKCharacteristicTypeIdentifierFitzpatrickSkinType: 1
activeEnergyBurned: 29
activeEnergyBurnedGoal: 29
activeEnergyBurnedUnit: 29
appleExerciseTime: 29
appleExerciseTimeGoal: 29
appleStandHours: 29
appleStandHoursGoal: 29
creationDate: 446671
dateComponents: 29
device: 84303
duration: 1
durationUnit: 1
endDate: 446671
sourceName: 446671
sourceVersion: 86786
startDate: 446671
totalDistance: 1
totalDistanceUnit: 1
totalEnergyBurned: 1
totalEnergyBurnedUnit: 1
type: 446670
unit: 446191
value: 446671
workoutActivityType: 1

Record types:
ActiveEnergyBurned: 19640
AppleExerciseTime: 2573
AppleStandHour: 479
BasalEnergyBurned: 26414
BodyMass: 155
DistanceWalkingRunning: 196262
FlightsClimbed: 2476
HeartRate: 3013
Height: 4
StepCount: 195654

Opening /Users/njr/qs/export6s3/BasalEnergyBurned.csv for writing
Opening /Users/njr/qs/export6s3/HeartRate.csv for writing
Opening /Users/njr/qs/export6s3/BodyMass.csv for writing
Opening /Users/njr/qs/export6s3/DistanceWalkingRunning.csv for writing
Opening /Users/njr/qs/export6s3/AppleStandHour.csv for writing
Opening /Users/njr/qs/export6s3/StepCount.csv for writing
Opening /Users/njr/qs/export6s3/Height.csv for writing
Opening /Users/njr/qs/export6s3/AppleExerciseTime.csv for writing
Opening /Users/njr/qs/export6s3/ActiveEnergyBurned.csv for writing
Opening /Users/njr/qs/export6s3/FlightsClimbed.csv for writing
Written BasalEnergyBurned data.
Written HeartRate data.
Written BodyMass data.
Written DistanceWalkingRunning data.
Written ActiveEnergyBurned data.
Written StepCount data.
Written Height data.
Written AppleExerciseTime data.
Written AppleStandHour data.
Written FlightsClimbed data.
$

As a quick preview of one of the files, here is the top of the second biggest output file, StepCount.csv:

$ head -5 StepCount.csv
sourceName,sourceVersion,device,type,unit,creationDate,startDate,endDate,value
"Health",,,"HKQuantityTypeIdentifierStepCount","count",2014-09-21 06:08:47 +0000,2014-09-13 09:27:54 +0000,2014-09-13 09:27:59 +0000,329
"Health",,,"HKQuantityTypeIdentifierStepCount","count",2014-09-21 06:08:47 +0000,2014-09-13 09:34:09 +0000,2014-09-13 09:34:14 +0000,283
"Health",,,"HKQuantityTypeIdentifierStepCount","count",2014-09-21 06:08:47 +0000,2014-09-13 09:39:29 +0000,2014-09-13 09:39:34 +0000,426
"Health",,,"HKQuantityTypeIdentifierStepCount","count",2014-09-21 06:08:48 +0000,2014-09-13 09:45:36 +0000,2014-09-13 09:45:41 +0000,61

You may need to scroll right to see all of it, or expand your browser window.

This blog post is long enough already, so I'll discuss (and plot) the contents of the various output files in later posts.

Notes on the Output

Format: The code writes CSV files including a header record with field names. Since the fields are XML attributes, which get read into a dictionary, they have no inherent order, so the code writes them in the fixed order defined by the FIELDS OrderedDict, which is at least consistent. Nulls are written as empty fields, strings are quoted with double quotes, double quotes in strings are escaped with backslash and backslash is itself escaped with backslash. The output encoding is UTF-8.
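Note that backslash-escaping quotes differs from the RFC 4180 convention of doubling them, so off-the-shelf CSV readers need to be told about it. With Python's csv module, for instance, setting doublequote=False and an escapechar should do the trick (a sketch, using a fragment of the StepCount output above):

```python
import csv
import io

sample = (
    'sourceName,sourceVersion,device,type,unit,'
    'creationDate,startDate,endDate,value\n'
    '"Health",,,"StepCount","count",'
    '2014-09-21 07:08:47 +0100,2014-09-13 10:27:54 +0100,'
    '2014-09-13 10:27:59 +0100,329\n'
)

reader = csv.reader(io.StringIO(sample),
                    doublequote=False,   # quotes are not doubled...
                    escapechar='\\')     # ...they are backslash-escaped
header, row = list(reader)
```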

Filenames: One file is written per record type, and the name is just the record type with extension .csv, except that any HK...TypeIdentifier prefix is excised from record types that include one.

Summary Stats: Summary statistics about the tags, fields and record types found in the XML file are printed before the main extraction occurs.

Overwriting: Any existing CSV files are silently overwritten, so if you have multiple health data export files in the same directory, take care.

Data Sanitization: The code is almost completely opinionless, and with one exception simply flattens the data in the XML file into a collection of CSV files. The exception concerns file names and the type field. Apple uses extraordinarily verbose and ugly names like HKQuantityTypeIdentifierStepCount and HKQuantityTypeIdentifierHeight to describe the contents of each record: the abbreviate function in the code uses a regular expression to strip off the nonsense, resulting in nicer, shorter, more comprehensible file names and record types. However, if you prefer to get your data verbatim, simply change the value of ABBREVIATE to False near the top of the file and all your HealthKit prefixes will be preserved, at the cost of a non-trivial expansion of the output file sizes.
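To make the effect concrete, here is the abbreviation in action (this mirrors the PREFIX_RE/abbreviate pair from the listing above, with ABBREVIATE effectively True):

```python
import re

# Same pattern as in applehealthdata.py: capture whatever follows
# the HK...TypeIdentifier boilerplate.
PREFIX_RE = re.compile('^HK.*TypeIdentifier(.+)$')

def abbreviate(s):
    """Strip the HK...TypeIdentifier boilerplate, if present."""
    m = re.match(PREFIX_RE, s)
    return m.group(1) if m else s
```

So HKQuantityTypeIdentifierStepCount becomes StepCount, HKCharacteristicTypeIdentifierBiologicalSex becomes BiologicalSex, and anything without the prefix passes through unchanged.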

Notes on the code: Wot, no tests?

The first thing to say about the code is that there are no tests provided with it, which is—cough—slightly ironic, given the theme of this blog. This isn't because I've written them but am holding them back for pedagogical reasons, or as an ironical meta-commentary on the whole test-driven movement, but merely because I haven't written any yet. Happily, writing tests is a good way of documenting and explaining code, so another post will follow, in which I will present some tests, possibly correct myriad bugs, and explain more about what the code is doing.


  1. I almost said 'I googled "Apple Health export"', but the more accurate statement would be that 'I DuckDuckGoed "Apple Health export"', but there are so many problems with DuckDuckGo as a verb, even in the present tense, let alone in the past as DuckDuckGod. Maybe I should propose the neologism "to DDGoogle". Or as Greg Wilson suggested, "to Duckle". Or maybe not . . . 

  2. The ElementTree structure in Python 3 is slightly different in this respect: this exploration was carried out with Python 2. However, the main code presented later in the post works under Python 2 and 3. 


Lessons Learned: Bad Data and other SNAFUs

Posted on Mon 15 February 2016 in TDDA • Tagged with tdda, bad data

My first paid programming job was working for my local education authority during the summer. The Advisory Unit for Computer-Based Education (AUCBE), run by a fantastic visionary and literal "greybeard" called Bill Tagg, produced software for schools in Hertfordshire and environs, and one of their products was a simple database called Quest. At this time (the early 1980s), two computers dominated UK schools—the Research Machines 380Z, a Zilog Z-80-based machine running CP/M, and the fantastic new BBC Micro, a 6502-based machine produced by Acorn to a specification agreed with the British Broadcasting Corporation. I was familiar with both, as my school had a solitary 380Z, and I had harangued my parents into getting me a BBC Model B,1 which was the joy of my life.

Figure: BBC Micro

The Quest database existed in two data-compatible forms. Peter Andrews had written a machine code implementation for the 380Z, and Bill Tagg himself had written an implementation in BBC Basic for the BBC Micro. They shared an interface and a manual, and my job was to produce a 6502 version that would also share that manual. Every deviation from the documented and actual behaviour of the BBC Basic implementation had to be personally signed off by Bill Tagg.

Writing Quest was a fantastic project for me, and the most highly constrained I have ever done: every aspect of it was pinned down by a combination of manuals, existing data files, specified interfaces, existing users and reference implementations. Peter Andrews was very generous in writing out, in fountain pen, on four A4 pages, a suggested implementation plan, which I followed scrupulously. That plan probably made the difference between my successfully completing the project and flailing endlessly, and the project was a success.

I learned an enormous amount writing Quest, but the path to success was not devoid of bumps in the road.

Once I had implemented enough of Quest for it to be worth testing, I took to delivering versions to Bill periodically. This was the early 1980s, so he didn't get them by pulling from Github, nor even by FTP or email; rather, I handed him floppy disks,2 in the early days, and later on, EPROMs—Erasable, Programmable Read-Only Memory chips that he could plug into the Zero-Insertion Force ("ZIF") socket3 on the side of his machine. (Did I mention how cool the BBC Micro was?)

Figure: ZIF Socket

Towards the end of my development of the 6502 implementation of Quest, I proudly handed over a version to Bill, and was slightly disappointed when he complained that it didn't work with one of his database files. In fact, his database file caused it to hang. He gave me a copy of his data and I set about finding the problem. It goes without saying that a bug that caused the software to hang was pretty bad, so it was clearly important to find it.

It was hard to track down. As I recall, it took me the best part of two solid days to find the problem. When I eventually did find it, it turned out to be a "bad data" problem. If I remember correctly, Quest saved data as flat files using the pipe character "|" to separate fields. The dataset Bill had given me had an extra pipe separator on one line, and was therefore not compliant with the data format. My reaction to this discovery was to curse Bill for sending me on a 2-day wild goose chase, and the following day I marched into AUCBE and told him—with the righteousness that only an arrogant teenager can muster—that it was his data that was at fault, not my beautiful code.

. . . to which Bill, of course, countered:

"And why didn't your beautiful code detect the bad data and report it, rather than hanging?"

Oops.

Introducing SNAFU of the Week

Needless to say, Bill was right. Even if my software was perfect and would never write invalid data (which might not have been the case), and even if data could never become corrupt through disk errors (which was demonstrably not the case), that didn't mean it would never encounter bad data. So the software had to deal with invalid inputs rather better than going into an infinite loop (which is exactly what it did—nothing a hard reset wouldn't cure!)
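Quest is long gone, but the fix it needed amounts to a few lines in any language: check the shape of every record before trusting it, and report the ones that don't fit. A sketch (the helper name and format details are illustrative, not Quest's actual code):

```python
def find_bad_lines(lines, n_fields):
    """
    Return the 1-based numbers of lines whose pipe-separated
    field count differs from the expected schema.
    """
    return [i for i, line in enumerate(lines, 1)
            if line.rstrip('\n').count('|') + 1 != n_fields]
```

Rejecting, or at least flagging, such lines up front turns a two-day hang-hunt into an immediate, self-explanatory error message.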

And so it is with data analysis.

Obviously, there is such a thing as good data—perfectly formatted, every value present and correct; it's just that it is almost never safe to assume that data your software will receive will be good. Rather, we almost always need to perform checks to validate it, and to give various levels of warnings when things are not as they should be. Hanging or crashing on bad data is obviously bad, but in some ways, it is less bad than reading it without generating a warning or error. The hierarchy of evils for analytical software runs something like this:

  1. (Worst) Producing plausible but materially incorrect results from bad inputs.

  2. Producing implausible, materially incorrect results from bad inputs (generally less bad, because these are much less likely to go unnoticed, though obviously they can be even more serious if they do).

  3. (Least serious) Hanging or crashing (embarrassing and inconvenient, but not actively misleading).

In this spirit, we are going to introduce "SNAFU of the Week", which will be a (not-necessarily weekly) series of examples of the kinds of things that can go wrong with data (especially data feeds), analysis, and analytical software, together with a discussion of whether and how each was, or could have been, detected and what lessons we might learn from them.


  1. BBC Micro Image: Dave Briggs, https://www.flickr.com/photos/theclosedcircle/3349126651/ under CC-BY-2.0

  2. Floppy disks were like 3D-printed versions of the save icon still used in much software, and in some cases could store over half a megabyte of data. Of course, the 6502 was an 8-bit processor with a 16-bit address bus, so it could address a maximum of 64K of RAM. In the case of the BBC Micro, a single program could occupy at most 16K, so a massive floppy disk could store many versions of Quest together with enormous database files. 

  3. Zero-Insertion Force Socket: Windell Oskay, https://www.flickr.com/photos/oskay/2226425940 under CC-BY-2.0


How far in advance are flights cheapest? An error of interpretation

Posted on Wed 06 January 2016 in TDDA • Tagged with tdda, errors, interpretation

Guest Post by Patrick Surry, Chief Data Scientist, Hopper

Every year, Expedia and ARC collaborate to publish some annual statistics about domestic airfare, including their treatment of the perennial question "How far in advance should you book your flight?" Here's what they presented in their report last year:

Figure: Average Ticket Price vs. Advance Purchase Days for Domestic Flights (Source: Expedia/ARC)

Although there are a lot of things wrong with this picture (including the callout not being at the right spot on the x-axis, and the $496 average appearing above $500 . . .), the most egregious is a more subtle error of interpretation. The accompanying commentary reads:

Still, the question remains: How early should travelers book? . . . Data collected by ARC indicates that the lowest average ticket price, about US$401, can be found 57 days in advance.

While that statement is presumably mathematically correct, it's completely misleading. The chart is drawn by calculating the average price of all domestic roundtrip tickets sold at each advance. That answers the question "how far in advance is the average price of the tickets sold that day lowest?" but is mistakenly interpreted as answering "how far in advance is a typical ticket cheapest?". Those are completely different questions, because the mix of tickets changes with advance. Indeed, travelers tend to book more expensive trips earlier, and cheaper trips later. In fact, for most markets, prices are fairly flat at long advances, and then rise more or less steeply at some point before departure. As a simplification, assume there are only two domestic markets: a short, cheap trip, and a long, expensive one. Both have prices that are flat at long advances, and which start rising about 60 days before departure:

Figure: Price as a function of booking window, for short-haul and long-haul flights (Simulated Data)

Now let's assume that the relative demand is directly proportional to advance, i.e. at 300 days ahead, all tickets sold are for FarFarAway, while at 0 days ahead, all tickets sold are for StonesThrow, and let's calculate the price of the average ticket sold as a function of advance:

Figure: Average price as a function of booking window across long- and short-haul flights, with time-varying proportionate demand (simulated data)
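This mixture effect is easy to reproduce. The sketch below uses made-up numbers (a $100 base fare for StonesThrow, $400 for FarFarAway, prices rising $2/day inside the final 60 days, a 300-day horizon, and a FarFarAway share of sales directly proportional to advance); with these assumptions the average-of-all-tickets-sold curve bottoms out at 60 days, even though neither market is actually cheapest there:

```python
def price(base, advance, slope=2.0):
    """Per-market price: flat beyond 60 days out, then rising
    linearly through the final 60 days before departure."""
    return base + slope * max(0, 60 - advance)


def mixed_average(advance, max_advance=300):
    """Average price of the tickets sold at a given advance, when
    the share of long-haul (FarFarAway) tickets grows with advance."""
    far_share = advance / max_advance   # 1.0 at 300 days out, 0.0 at departure
    return (far_share * price(400, advance)
            + (1 - far_share) * price(100, advance))


averages = {a: mixed_average(a) for a in range(0, 301)}
cheapest = min(averages, key=averages.get)
print(cheapest)   # → 60: the spurious "best time to book"
```

Per-market prices never fall as departure approaches, yet the averaged curve has a minimum: at long advances the expensive market dominates the mix, and at short advances the within-market price rise takes over.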

What do you know? The average price declines as demand switches from more expensive to cheaper tickets, with a minimum coincidentally just less than 60 days in advance. To get a more meaningful answer to the question "how far in advance is the typical ticket cheapest?", we should instead simply calculate separate advance curves for each market, and then combine them based on the total number (or value) of tickets sold in each market. In our simple example, if we assume the two markets have equal overall weight, we get a much more intuitive result, with prices flat up to 60 days, and then rising towards departure:

Figure: Weighted average advance-purchase price across long-haul and short-haul, with weighting by volume
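The corrected calculation, combining the two per-market advance curves with fixed (equal) weights rather than the advance-dependent mix, can be sketched with the same made-up numbers ($100 and $400 base fares, $2/day rise inside 60 days):

```python
def price(base, advance, slope=2.0):
    """Per-market price: flat beyond 60 days out, then rising
    linearly through the final 60 days before departure."""
    return base + slope * max(0, 60 - advance)


def weighted_average(advance, weights=(0.5, 0.5)):
    """Combine the StonesThrow and FarFarAway advance curves with
    fixed weights (equal overall ticket volume in each market)."""
    w_near, w_far = weights
    return w_near * price(100, advance) + w_far * price(400, advance)


curve = [weighted_average(a) for a in range(0, 301)]
print(curve[300], curve[0])   # → 250.0 370.0
```

Holding the market weights fixed removes the mixture artefact: the combined curve is flat at long advances and rises monotonically towards departure, matching the intuitive answer.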

All this goes to show how important it is that we frame our analytical questions (and answers!) carefully. When the traveller asks: "How far in advance should I book my flight?", it's our responsibility as analysts to recognize that they mean

How far in advance is any given ticket cheapest?

rather than

How far in advance is the average price of tickets sold that day lowest?

Even a correct answer to the latter is dangerously misleading, because the traveller is unlikely to recognize the distinction and will take it as the (wrong!) answer to their real question.