In Defence of XML: Exporting and Analysing Apple Health Data

Posted on Fri 15 April 2016 in TDDA • Tagged with xml, apple, health

I'm going to present a series of posts based around the sort of health and fitness data that can now be collected by some phones and dedicated fitness trackers. Not all of these will be centrally on topic for test-driven data analysis, but I think they'll provide an interesting set of data for discussing many issues of relevance, so I hope readers will forgive me to the extent that these stray from the central theme.

The particular focus for this series will be the data available from an iPhone and the Apple Health app, over a couple of different phones, and with a couple of different devices paired to them.

In particular, the setup will be:

  • Apple iPhone 6s (November 2015 to present)
  • Apple iPhone 5s (with fitness data from Sept 2014 to Nov 2015)
  • Several Misfit Shine activity trackers (until early March 2016)
  • An Apple Watch (about a month of data, to date)

Getting data out of Apple Health (The Exploratory Version)

I hadn't initially spotted a way to get the data out of Apple's Health app, but a quick web search1 turned up this very helpful article: https://www.idownloadblog.com/2015/06/10/how-to-export-import-health-data. It turns out there is a properly supported way to export granular data from Apple Health, described in detail in the post. Essentially:

  • Open the Apple Health App.
  • Navigate to the Health Data section (left icon at the bottom)
  • Select All from the list of categories
  • There is a share icon at the top right (a vertical arrow sticking up from a square)
  • Tap that to export all data
  • It thinks for a while (quite a while, in fact) and then offers you various export options, which for me included Airdrop, email and handing the data to other apps. I used Airdrop to dump it onto a Mac.

The result is a compressed XML file called export.zip. For me, this was about 5.5MB, which expanded to 109MB when unzipped. (Interestingly, I started this with an earlier export a couple of weeks ago, when the zipped file was about 5MB and the expanded version was 90MB, so it is growing fairly quickly, thanks to the Watch.)

As helpful as the iDownloadBlog article is, I have to comment on its introduction to exporting data, which reads

There are actually two ways to export the data from your Health app. The first way, is one provided by Apple, but it is virtually useless.

To be fair to iDownloadBlog, an XML file like this probably is useless to the general reader, but it builds on a meme fashionable among developers and data scientists to the effect of "XML is painful to process, verbose and always worse than JSON", and I think this is somewhat unfair.

Let's explore export.xml using Python and the ElementTree library. Although the decompressed file is quite large (109MB), it's certainly not problematically large to read into memory on a modern machine, so I'm not going to worry about reading it in bits: I'm just going to find out as quickly as possible what's in it.

The first thing to do, of course, is simply to look at the file, probably using either the more or less command, assuming you are on some flavour of Unix or Linux. Let's look at the top of my export.xml:

$ head -79 export6s3/export.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE HealthData [
<!-- HealthKit Export Version: 3 -->
<!ELEMENT HealthData (ExportDate,Me,(Record|Correlation|Workout|ActivitySummary)*)>
<!ATTLIST HealthData
  locale CDATA #REQUIRED
>
<!ELEMENT ExportDate EMPTY>
<!ATTLIST ExportDate
  value CDATA #REQUIRED
>
<!ELEMENT Me EMPTY>
<!ATTLIST Me
  HKCharacteristicTypeIdentifierDateOfBirth         CDATA #REQUIRED
  HKCharacteristicTypeIdentifierBiologicalSex       CDATA #REQUIRED
  HKCharacteristicTypeIdentifierBloodType           CDATA #REQUIRED
  HKCharacteristicTypeIdentifierFitzpatrickSkinType CDATA #REQUIRED
>
<!ELEMENT Record (MetadataEntry*)>
<!ATTLIST Record
  type          CDATA #REQUIRED
  unit          CDATA #IMPLIED
  value         CDATA #IMPLIED
  sourceName    CDATA #REQUIRED
  sourceVersion CDATA #IMPLIED
  device        CDATA #IMPLIED
  creationDate  CDATA #IMPLIED
  startDate     CDATA #REQUIRED
  endDate       CDATA #REQUIRED
>
<!-- Note: Any Records that appear as children of a correlation also appear as top-level records in this document. -->
<!ELEMENT Correlation ((MetadataEntry|Record)*)>
<!ATTLIST Correlation
  type          CDATA #REQUIRED
  sourceName    CDATA #REQUIRED
  sourceVersion CDATA #IMPLIED
  device        CDATA #IMPLIED
  creationDate  CDATA #IMPLIED
  startDate     CDATA #REQUIRED
  endDate       CDATA #REQUIRED
>
<!ELEMENT Workout ((MetadataEntry|WorkoutEvent)*)>
<!ATTLIST Workout
  workoutActivityType   CDATA #REQUIRED
  duration              CDATA #IMPLIED
  durationUnit          CDATA #IMPLIED
  totalDistance         CDATA #IMPLIED
  totalDistanceUnit     CDATA #IMPLIED
  totalEnergyBurned     CDATA #IMPLIED
  totalEnergyBurnedUnit CDATA #IMPLIED
  sourceName            CDATA #REQUIRED
  sourceVersion         CDATA #IMPLIED
  device                CDATA #IMPLIED
  creationDate          CDATA #IMPLIED
  startDate             CDATA #REQUIRED
  endDate               CDATA #REQUIRED
>
<!ELEMENT WorkoutEvent EMPTY>
<!ATTLIST WorkoutEvent
  type CDATA #REQUIRED
  date CDATA #REQUIRED
>
<!ELEMENT ActivitySummary EMPTY>
<!ATTLIST ActivitySummary
  dateComponents           CDATA #IMPLIED
  activeEnergyBurned       CDATA #IMPLIED
  activeEnergyBurnedGoal   CDATA #IMPLIED
  activeEnergyBurnedUnit   CDATA #IMPLIED
  appleExerciseTime        CDATA #IMPLIED
  appleExerciseTimeGoal    CDATA #IMPLIED
  appleStandHours          CDATA #IMPLIED
  appleStandHoursGoal      CDATA #IMPLIED
>
<!ELEMENT MetadataEntry EMPTY>
<!ATTLIST MetadataEntry
  key   CDATA #REQUIRED
  value CDATA #REQUIRED
>
]>

This is immediately encouraging: Apple has provided DOCTYPE (DTD) information, which, even though slightly old-fashioned, tells us what we should expect to find in the file. DTDs are awkward to use and, when coming from untrusted sources, can leave the user potentially vulnerable to malicious attacks, but despite this, they are quite expressive and helpful, even just as plain-text documentation.

Roughly speaking, the lines:

<!ELEMENT HealthData (ExportDate,Me,(Record|Correlation|Workout|ActivitySummary)*)>
<!ATTLIST HealthData
  locale CDATA #REQUIRED
>

say

  • that the top element will be a HealthData element

  • that this HealthData element will contain

    • an ExportDate element
    • a Me element
    • zero or more elements of type Record, Correlation, Workout or ActivitySummary
  • and that the HealthData element will have an attribute locale (which is mandatory).

The rest of this DTD section describes each kind of record in more detail.

The next 6 lines in my XML file are as follows (spread out for readability):

<HealthData locale="en_GB">
 <ExportDate value="2016-04-15 07:27:26 +0100"/>
 <Me HKCharacteristicTypeIdentifierDateOfBirth="1965-07-31"
     HKCharacteristicTypeIdentifierBiologicalSex="HKBiologicalSexMale"
     HKCharacteristicTypeIdentifierBloodType="HKBloodTypeNotSet"
     HKCharacteristicTypeIdentifierFitzpatrickSkinType="HKFitzpatrickSkinTypeNotSet"/>
 <Record type="HKQuantityTypeIdentifierHeight"
         sourceName="Health"
         sourceVersion="9.2"
         unit="cm"
         creationDate="2016-01-02 09:45:10 +0100"
         startDate="2016-01-02 09:44:00 +0100"
         endDate="2016-01-02 09:44:00 +0100"
         value="194">
  <MetadataEntry key="HKWasUserEntered" value="1"/>
 </Record>

As you can see, the export format is verbose, but extremely comprehensible and comprehensive. It's also very easy to read into Python and explore.

Let's do that, here in an interactive Python session:

>>> from xml.etree import ElementTree as ET
>>> with open('export.xml') as f:
...     data = ET.parse(f)
... 
>>> data
<xml.etree.ElementTree.ElementTree object at 0x107347a50>

The ElementTree module turns each XML element into an Element object, described by its tag, with a few standard attributes.

Inspecting the data object, we find:

>>> data.__dict__
{'_root': <Element 'HealthData' at 0x1073c2050>}

i.e., we have a single entry in data—a root element called HealthData.

Like all Element objects, it has the four standard attributes:2

>>> root = data._root
>>> root.__dict__.keys()
['text', 'attrib', 'tag', '_children']

These are:

>>> root.attrib
{'locale': 'en_GB'}

>>> root.text
'\n '

>>> root.tag
'HealthData'

>>> len(root._children)
446702
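
(Peeking at _root and _children relies on private attributes of the Python 2 ElementTree implementation. If you are following along in Python 3, here is a sketch of the same exploration using only the public API, which works in both versions:)

>>> root = data.getroot()      # public equivalent of data._root
>>> children = list(root)      # public equivalent of root._children
>>> len(children)
446702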

So nothing much apart from a locale and a whole lot of child nodes. Let's inspect the first few of them:

>>> nodes = root._children
>>> nodes[0]
<Element 'ExportDate' at 0x1073c2090>

>>> ET.dump(nodes[0])
<ExportDate value="2016-04-15 07:27:26 +0100" />

>>> nodes[1]
<Element 'Me' at 0x1073c2190>
>>> ET.dump(nodes[1])
<Me HKCharacteristicTypeIdentifierBiologicalSex="HKBiologicalSexMale"
    HKCharacteristicTypeIdentifierBloodType="HKBloodTypeNotSet"
    HKCharacteristicTypeIdentifierDateOfBirth="1965-07-31"
    HKCharacteristicTypeIdentifierFitzpatrickSkinType="HKFitzpatrickSkinTypeNotSet" />

>>> nodes[2]
<Element 'Record' at 0x1073c2410>
>>> ET.dump(nodes[2])
<Record creationDate="2016-01-02 09:45:10 +0100"
        endDate="2016-01-02 09:44:00 +0100"
        sourceName="Health"
        sourceVersion="9.2"
        startDate="2016-01-02 09:44:00 +0100"
        type="HKQuantityTypeIdentifierHeight"
        unit="cm"
        value="194">
  <MetadataEntry key="HKWasUserEntered" value="1" />
 </Record>

>>> nodes[3]
<Element 'Record' at 0x1073c2550>
>>> nodes[4]
<Element 'Record' at 0x1073c2650>

So, exactly as the DTD indicated, we have an ExportDate node, a Me node and then what looks like a great number of records. Let's confirm that:

>>> set(node.tag for node in nodes[2:])
set(['Record', 'Workout', 'ActivitySummary'])

So in fact, there are three kinds of nodes after the ExportDate and Me records. Let's count them:

>>> records = [node for node in nodes if node.tag == 'Record']
>>> len(records)
446670

These records are ones like the Height record we saw above, though in fact most of them are not Height but StepCount, DistanceWalkingRunning or energy-burned records (ActiveEnergyBurned and BasalEnergyBurned), e.g.:

>>> ET.dump(nodes[100000])
<Record creationDate="2015-01-11 07:40:15 +0000"
        endDate="2015-01-10 13:39:35 +0000"
        sourceName="njr iPhone 6s"
        startDate="2015-01-10 13:39:32 +0000"
        type="HKQuantityTypeIdentifierStepCount"
        unit="count"
        value="4" />

There is also one activity summary per day (since I got the watch).

>>> acts = [node for node in nodes if node.tag == 'ActivitySummary']
>>> len(acts)
29

The first one isn't very exciting:

>>> ET.dump(acts[0])
<ActivitySummary activeEnergyBurned="0"
                 activeEnergyBurnedGoal="0"
                 activeEnergyBurnedUnit="kcal"
                 appleExerciseTime="0"
                 appleExerciseTimeGoal="30"
                 appleStandHours="0"
                 appleStandHoursGoal="12"
                 dateComponents="2016-03-18" />

but they get better:

>>> ET.dump(acts[2])
<ActivitySummary activeEnergyBurned="652.014"
                 activeEnergyBurnedGoal="500"
                 activeEnergyBurnedUnit="kcal"
                 appleExerciseTime="77"
                 appleExerciseTimeGoal="30"
                 appleStandHours="17"
                 appleStandHoursGoal="12"
                 dateComponents="2016-03-20" />

Finally, there is a solitary Workout record.

>>> workouts = [node for node in nodes if node.tag == 'Workout']
>>> ET.dump(workouts[0])
<Workout creationDate="2016-04-02 11:12:57 +0100"
         duration="31.73680251737436"
         durationUnit="min"
         endDate="2016-04-02 11:12:22 +0100"
         sourceName="NJR Apple&#160;Watch"
         sourceVersion="2.2"
         startDate="2016-04-02 10:40:38 +0100"
         totalDistance="0"
         totalDistanceUnit="km"
         totalEnergyBurned="139.3170000000021"
         totalEnergyBurnedUnit="kcal"
         workoutActivityType="HKWorkoutActivityTypeOther" />

So there we have it.

Getting data out of Apple Health (The Code)

Given this exploration, we can take a first shot at writing an exporter for Apple Health Data. I'm going to ignore the activity summaries and workout(s) for now, and concentrate on the main records. (We'll get to the others in a later post.)

Here is the code:

"""
applehealthdata.py: Extract data from Apple Health App's export.xml.

Copyright (c) 2016 Nicholas J. Radcliffe
Licence: MIT
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals

import os
import re
import sys

from xml.etree import ElementTree
from collections import Counter, OrderedDict

__version__ = '1.0'

FIELDS = OrderedDict((
    ('sourceName', 's'),
    ('sourceVersion', 's'),
    ('device', 's'),
    ('type', 's'),
    ('unit', 's'),
    ('creationDate', 'd'),
    ('startDate', 'd'),
    ('endDate', 'd'),
    ('value', 'n'),
))

PREFIX_RE = re.compile('^HK.*TypeIdentifier(.+)$')
ABBREVIATE = True
VERBOSE = True

def format_freqs(counter):
    """
    Format a counter object for display.
    """
    return '\n'.join('%s: %d' % (tag, counter[tag])
                     for tag in sorted(counter.keys()))


def format_value(value, datatype):
    """
    Format a value for a CSV file, escaping double quotes and backslashes.

    None maps to empty.

    datatype should be
        's' for string (escaped)
        'n' for number
        'd' for datetime
    """
    if value is None:
        return ''
    elif datatype == 's':  # string
        return '"%s"' % value.replace('\\', '\\\\').replace('"', '\\"')
    elif datatype in ('n', 'd'):  # number or date
        return value
    else:
        raise KeyError('Unexpected format value: %s' % datatype)


def abbreviate(s):
    """
    Abbreviate particularly verbose strings based on a regular expression
    """
    m = re.match(PREFIX_RE, s)
    return m.group(1) if ABBREVIATE and m else s


def encode(s):
    """
    Encode string for writing to file.
    In Python 2, this encodes as UTF-8, whereas in Python 3,
    it does nothing
    """
    return s.encode('UTF-8') if sys.version_info.major < 3 else s



class HealthDataExtractor(object):
    """
    Extract health data from Apple Health App's XML export, export.xml.

    Inputs:
        path:      Relative or absolute path to export.xml
        verbose:   Set to False for less verbose output

    Outputs:
        Writes a CSV file for each record type found, in the same
        directory as the input export.xml. Reports each file written
        unless verbose has been set to False.
    """
    def __init__(self, path, verbose=VERBOSE):
        self.in_path = path
        self.verbose = verbose
        self.directory = os.path.abspath(os.path.split(path)[0])
        with open(path) as f:
            self.report('Reading data from %s . . . ' % path, end='')
            self.data = ElementTree.parse(f)
            self.report('done')
        self.root = self.data._root
        self.nodes = self.root.getchildren()
        self.n_nodes = len(self.nodes)
        self.abbreviate_types()
        self.collect_stats()

    def report(self, msg, end='\n'):
        if self.verbose:
            print(msg, end=end)
            sys.stdout.flush()

    def count_tags_and_fields(self):
        self.tags = Counter()
        self.fields = Counter()
        for record in self.nodes:
            self.tags[record.tag] += 1
            for k in record.keys():
                self.fields[k] += 1

    def count_record_types(self):
        self.record_types = Counter()
        for record in self.nodes:
            if record.tag == 'Record':
                self.record_types[record.attrib['type']] += 1

    def collect_stats(self):
        self.count_record_types()
        self.count_tags_and_fields()

    def open_for_writing(self):
        self.handles = {}
        self.paths = []
        for kind in self.record_types:
            path = os.path.join(self.directory, '%s.csv' % abbreviate(kind))
            f = open(path, 'w')
            f.write(','.join(FIELDS) + '\n')
            self.handles[kind] = f
            self.report('Opening %s for writing' % path)

    def abbreviate_types(self):
        """
        Shorten types by removing common boilerplate text.
        """
        for node in self.nodes:
            if node.tag == 'Record':
                if 'type' in node.attrib:
                    node.attrib['type'] = abbreviate(node.attrib['type'])


    def write_records(self):
        for node in self.nodes:
            if node.tag == 'Record':
                attributes = node.attrib
                kind = attributes['type']
                values = [format_value(attributes.get(field), datatype)
                          for (field, datatype) in FIELDS.items()]
                line = encode(','.join(values) + '\n')
                self.handles[kind].write(line)

    def close_files(self):
        for (kind, f) in self.handles.items():
            f.close()
            self.report('Written %s data.' % abbreviate(kind))

    def extract(self):
        self.open_for_writing()
        self.write_records()
        self.close_files()

    def report_stats(self):
        print('\nTags:\n%s\n' % format_freqs(self.tags))
        print('Fields:\n%s\n' % format_freqs(self.fields))
        print('Record types:\n%s\n' % format_freqs(self.record_types))


if __name__ == '__main__':
    if len(sys.argv) != 2:
        print('USAGE: python applehealthdata.py /path/to/export.xml',
              file=sys.stderr)
        sys.exit(1)
    data = HealthDataExtractor(sys.argv[1])
    data.report_stats()
    data.extract()

To run this code, clone the repo from github.com/tdda/applehealthdata with:

$ git clone https://github.com/tdda/applehealthdata.git

or save the code listing above as applehealthdata.py. At the time of posting, the code in the repository matches this listing, and the corresponding commit is tagged with the version number v1.0, so if you clone it later and want exactly this version, check it out by saying:

$ git checkout v1.0

If your data is in the same directory as the code, then simply run:

$ python applehealthdata.py export.xml

and, depending on size, wait a few minutes while it runs. The code runs under both Python 2 and Python 3.

When I do this, the output is as follows:

$ python applehealthdata/applehealthdata.py export6s3/export.xml
Reading data from export6s3/export.xml . . . done

Tags:
ActivitySummary: 29
ExportDate: 1
Me: 1
Record: 446670
Workout: 1

Fields:
HKCharacteristicTypeIdentifierBiologicalSex: 1
HKCharacteristicTypeIdentifierBloodType: 1
HKCharacteristicTypeIdentifierDateOfBirth: 1
HKCharacteristicTypeIdentifierFitzpatrickSkinType: 1
activeEnergyBurned: 29
activeEnergyBurnedGoal: 29
activeEnergyBurnedUnit: 29
appleExerciseTime: 29
appleExerciseTimeGoal: 29
appleStandHours: 29
appleStandHoursGoal: 29
creationDate: 446671
dateComponents: 29
device: 84303
duration: 1
durationUnit: 1
endDate: 446671
sourceName: 446671
sourceVersion: 86786
startDate: 446671
totalDistance: 1
totalDistanceUnit: 1
totalEnergyBurned: 1
totalEnergyBurnedUnit: 1
type: 446670
unit: 446191
value: 446671
workoutActivityType: 1

Record types:
ActiveEnergyBurned: 19640
AppleExerciseTime: 2573
AppleStandHour: 479
BasalEnergyBurned: 26414
BodyMass: 155
DistanceWalkingRunning: 196262
FlightsClimbed: 2476
HeartRate: 3013
Height: 4
StepCount: 195654

Opening /Users/njr/qs/export6s3/BasalEnergyBurned.csv for writing
Opening /Users/njr/qs/export6s3/HeartRate.csv for writing
Opening /Users/njr/qs/export6s3/BodyMass.csv for writing
Opening /Users/njr/qs/export6s3/DistanceWalkingRunning.csv for writing
Opening /Users/njr/qs/export6s3/AppleStandHour.csv for writing
Opening /Users/njr/qs/export6s3/StepCount.csv for writing
Opening /Users/njr/qs/export6s3/Height.csv for writing
Opening /Users/njr/qs/export6s3/AppleExerciseTime.csv for writing
Opening /Users/njr/qs/export6s3/ActiveEnergyBurned.csv for writing
Opening /Users/njr/qs/export6s3/FlightsClimbed.csv for writing
Written BasalEnergyBurned data.
Written HeartRate data.
Written BodyMass data.
Written DistanceWalkingRunning data.
Written ActiveEnergyBurned data.
Written StepCount data.
Written Height data.
Written AppleExerciseTime data.
Written AppleStandHour data.
Written FlightsClimbed data.
$

As a quick preview of one of the files, here is the top of the second biggest output file, StepCount.csv:

$ head -5 StepCount.csv
sourceName,sourceVersion,device,type,unit,creationDate,startDate,endDate,value
"Health",,,"HKQuantityTypeIdentifierStepCount","count",2014-09-21 06:08:47 +0000,2014-09-13 09:27:54 +0000,2014-09-13 09:27:59 +0000,329
"Health",,,"HKQuantityTypeIdentifierStepCount","count",2014-09-21 06:08:47 +0000,2014-09-13 09:34:09 +0000,2014-09-13 09:34:14 +0000,283
"Health",,,"HKQuantityTypeIdentifierStepCount","count",2014-09-21 06:08:47 +0000,2014-09-13 09:39:29 +0000,2014-09-13 09:39:34 +0000,426
"Health",,,"HKQuantityTypeIdentifierStepCount","count",2014-09-21 06:08:48 +0000,2014-09-13 09:45:36 +0000,2014-09-13 09:45:41 +0000,61

You may need to scroll right to see all of it, or expand your browser window.

This blog post is long enough already, so I'll discuss (and plot) the contents of the various output files in later posts.

Notes on the Output

Format: The code writes CSV files, including a header record with the field names. The fields are XML attributes, which get read into a dictionary and so have no inherent order; the code writes them in the fixed order specified by FIELDS (source information and type first, then unit, dates and value), which is at least consistent across files. Nulls are written as empty strings, strings are quoted with double quotes, double quotes within strings are escaped with a backslash, and backslash is itself escaped with a backslash. The output encoding is UTF-8.
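
For example, with applehealthdata.py importable (its directory on your Python path), the escaping behaves like this; the value here is just an invented illustration:

>>> from applehealthdata import format_value
>>> format_value('6" display', 's')
'"6\\" display"'
>>> format_value(None, 'n')
''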

Filenames: One file is written per record type, and the name is just the record type with a .csv extension, except that for record types including an HK...TypeIdentifier prefix, the prefix is excised.

Summary Stats: Summary statistics about the tags, fields and record types found in the export are printed before the main extraction occurs.

Overwriting: Any existing CSV files are silently overwritten, so if you have multiple health data export files in the same directory, take care.

Data Sanitization: The code is almost completely opinionless, and with one exception simply flattens the data in the XML file into a collection of CSV files. The exception concerns file names and the type field. Apple uses extraordinarily verbose and ugly names like HKQuantityTypeIdentifierStepCount and HKQuantityTypeIdentifierHeight to describe the contents of each record: the abbreviate function in the code uses a regular expression to strip off the nonsense, resulting in nicer, shorter, more comprehensible file names and record types. However, if you prefer to get your data verbatim, simply change the value of ABBREVIATE to False near the top of the file and all your HealthKit prefixes will be preserved, at the cost of a non-trivial expansion of the output file sizes.
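
For example (again assuming applehealthdata.py is importable):

>>> from applehealthdata import abbreviate
>>> abbreviate('HKQuantityTypeIdentifierStepCount')
'StepCount'
>>> abbreviate('StepCount')    # nothing to strip: returned unchanged
'StepCount'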

Notes on the code: Wot, no tests?

The first thing to say about the code is that there are no tests provided with it, which is—cough—slightly ironic, given the theme of this blog. This isn't because I've written them but am holding them back for pedagogical reasons, or as an ironical meta-commentary on the whole test-driven movement, but merely because I haven't written any yet. Happily, writing tests is a good way of documenting and explaining code, so another post will follow, in which I will present some tests, possibly correct myriad bugs, and explain more about what the code is doing.


  1. I almost said 'I googled "Apple Health export"', but the more accurate statement would be that 'I DuckDuckGoed "Apple Health export"', but there are so many problems with DuckDuckGo as a verb, even in the present tense, let alone in the past as DuckDuckGod. Maybe I should propose the neologism "to DDGoogle". Or as Greg Wilson suggested, "to Duckle". Or maybe not . . . 

  2. The ElementTree structure in Python 3 is slightly different in this respect: this exploration was carried out with Python 2. However, the main code presented later in the post works under Python 2 and 3. 


Lessons Learned: Bad Data and other SNAFUs

Posted on Mon 15 February 2016 in TDDA • Tagged with tdda, bad data

My first paid programming job was working for my local education authority during the summer. The Advisory Unit for Computer-Based Education (AUCBE), run by a fantastic visionary and literal "greybeard" called Bill Tagg, produced software for schools in Hertfordshire and environs, and one of their products was a simple database called Quest. At this time (the early 1980s), two computers dominated UK schools—the Research Machines 380Z, a Zilog Z-80-based machine running CP/M, and the fantastic new BBC Micro, a 6502-based machine produced by Acorn to a specification agreed with the British Broadcasting Corporation. I was familiar with both, as my school had a solitary 380Z, and I had harangued my parents into getting me a BBC Model B,1 which was the joy of my life.

Figure: BBC Micro

The Quest database existed in two data-compatible forms. Peter Andrews had written a machine code implementation for the 380Z, and Bill Tagg himself had written an implementation in BBC Basic for the BBC Micro. They shared an interface and a manual, and my job was to produce a 6502 version that would also share that manual. Every deviation from the documented and actual behaviour of the BBC Basic implementation had to be personally signed off by Bill Tagg.

Writing Quest was a fantastic project for me, and the most highly constrained project I have ever done: every aspect of it was pinned down by a combination of manuals, existing data files, specified interfaces, existing users and reference implementations. Peter Andrews was very generous in writing out, in fountain pen, on four A4 pages, a suggested implementation plan, which I followed scrupulously. That plan probably made the difference between my successfully completing the project and flailing endlessly, and the project was a success.

I learned an enormous amount writing Quest, but the path to success was not devoid of bumps in the road.

Once I had implemented enough of Quest for it to be worth testing, I took to delivering versions to Bill periodically. This was the early 1980s, so he didn't get them by pulling from Github, nor even by FTP or email; rather, I handed him floppy disks,2 in the early days, and later on, EPROMs—Erasable, Programmable Read-Only Memory chips that he could plug into the Zero-Insertion Force ("ZIF") socket3 on the side of his machine. (Did I mention how cool the BBC Micro was?)

Figure: ZIF Socket

Towards the end of my development of the 6502 implementation of Quest, I proudly handed over a version to Bill, and was slightly disappointed when he complained that it didn't work with one of his database files. In fact, his database file caused it to hang. He gave me a copy of his data and I set about finding the problem. It goes without saying that a bug that caused the software to hang was pretty bad, so it was clearly important to find it.

It was hard to track down. As I recall, it took me the best part of two solid days to find the problem. When I eventually did find it, it turned out to be a "bad data" problem. If I remember correctly, Quest saved data as flat files using the pipe character "|" to separate fields. The dataset Bill had given me had an extra pipe separator on one line, and was therefore not compliant with the data format. My reaction to this discovery was to curse Bill for sending me on a 2-day wild goose chase, and the following day I marched into AUCBE and told him—with the righteousness that only an arrogant teenager can muster—that it was his data that was at fault, not my beautiful code.

. . . to which Bill, of course, countered:

"And why didn't your beautiful code detect the bad data and report it, rather than hanging?"

Oops.
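
With a few decades of hindsight (and in Python rather than 6502 machine code), the check my code was missing takes only a few lines. This is just a sketch, with invented field values rather than real Quest data, but it makes the point: validate the structure of each record as it is read, and fail loudly rather than looping forever.

def read_quest_records(lines, nfields):
    """Parse pipe-separated records, insisting on exactly nfields per line."""
    for lineno, line in enumerate(lines, 1):
        fields = line.rstrip('\n').split('|')
        if len(fields) != nfields:
            raise ValueError('line %d: expected %d fields, got %d: %r'
                             % (lineno, nfields, len(fields), line))
        yield fields

records = ['Ada|Lovelace|1815\n', 'Alan|Turing|1912|oops\n']
try:
    print(list(read_quest_records(records, 3)))
except ValueError as e:
    print('Bad data detected: %s' % e)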

Introducing SNAFU of the Week

Needless to say, Bill was right. Even if my software was perfect and would never write invalid data (which might not have been the case), and even if data could never become corrupt through disk errors (which was demonstrably not the case), that didn't mean it would never encounter bad data. So the software had to deal with invalid inputs rather better than going into an infinite loop (which is exactly what it did—nothing a hard reset wouldn't cure!)

And so it is with data analysis.

Obviously, there is such a thing as good data—perfectly formatted, every value present and correct; it's just that it is almost never safe to assume that data your software will receive will be good. Rather, we almost always need to perform checks to validate it, and to give various levels of warnings when things are not as they should be. Hanging or crashing on bad data is obviously bad, but in some ways, it is less bad than reading it without generating a warning or error. The hierarchy of evils for analytical software runs something like this:

  1. (Worst) Producing plausible but materially incorrect results from bad inputs.

  2. Producing implausible, materially incorrect results from bad inputs (generally less bad, because these are much less likely to go unnoticed, though obviously they can be even more serious if they do).

  3. (Least serious) Hanging or crashing (embarrassing and inconvenient, but not actively misleading).

In this spirit, we are going to introduce "SNAFU of the Week", which will be a (not-necessarily weekly) series of examples of kinds of things that can go wrong with data (especially data feeds), analysis, and analytical software, together with a discussion of whether and how it was, or could have been detected and what lessons we might learn from them.


  1. BBC Micro Image: Dave Briggs, https://www.flickr.com/photos/theclosedcircle/3349126651/ under CC-BY-2.0

  2. Floppy disks were like 3D-printed versions of the save icon still used in much software, and in some cases could store over half a megabyte of data. Of course, the 6502 was an 8-bit processor with a 16-bit address bus, so it could address a maximum of 64K of RAM. In the case of the BBC Micro, a single program could occupy at most 16K, so a massive floppy disk could store many versions of Quest together with enormous database files. 

  3. Zero-Insertion Force Socket: Windell Oskay, https://www.flickr.com/photos/oskay/2226425940 under CC-BY-2.0


How far in advance are flights cheapest? An error of interpretation

Posted on Wed 06 January 2016 in TDDA • Tagged with tdda, errors, interpretation

Guest Post by Patrick Surry, Chief Data Scientist, Hopper

Every year, Expedia and ARC collaborate to publish some annual statistics about domestic airfare, including their treatment of the perennial question "How far in advance should you book your flight?" Here's what they presented in their report last year:

Figure: Average Ticket Price vs. Advance Purchase Days for Domestic Flights (Source: Expedia/ARC)

Although there are a lot of things wrong with this picture (including the callout not being at the right spot on the x-axis, and the $496 average appearing above $500 . . .), the most egregious is a more subtle error of interpretation. The accompanying commentary reads:

Still, the question remains: How early should travelers book? . . . Data collected by ARC indicates that the lowest average ticket price, about US$401, can be found 57 days in advance.

While that statement is presumably mathematically correct, it's completely misleading. The chart is drawn by calculating the average price of all domestic roundtrip tickets sold at each advance. That answers the question "how far in advance is the average price of tickets sold that day lowest?" but is mistakenly interpreted as answering "how far in advance is a typical ticket cheapest?". That's a completely different question, because the mix of tickets changes with advance. Indeed, travelers tend to book more expensive trips earlier, and cheaper trips later. In fact, for most markets, prices are fairly flat at long advances, and then rise more or less steeply at some point before departure. As a simplification, assume there are only two domestic markets, a short, cheap trip, and a long, expensive one. Both have prices that are flat at long advances, and which start rising about 60 days before departure:

Figure: Price as a function of booking window, for short-haul and long-haul flights (Simulated Data)

Now let's assume that the relative demand is directly proportional to advance, i.e. 300 days ahead, all tickets sold are for FarFarAway, and 0 days ahead, all tickets sold are for StonesThrow, and let's calculate the price of the average ticket sold as a function of advance:

Figure: Average price as a function of booking window across long- and short-haul flights, with time-varying proportionate demand (simulated data)
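
For readers who want to reproduce the effect, here is a minimal sketch of the simulation in Python. All of the numbers (base prices, slopes, the 300-day horizon) are invented for illustration; only the structure matters: neither market's price ever falls as departure approaches, yet the blended average bottoms out well before departure.

def price(base, slope, advance, rise_window=60):
    # Flat beyond rise_window days out, then rising towards departure.
    return base + slope * max(0, rise_window - advance)

def blended_average(advance, horizon=300):
    # Share of expensive (FarFarAway) tickets is proportional to advance.
    far = price(1000.0, 500.0 / 60, advance)    # long, expensive trip
    near = price(100.0, 200.0 / 60, advance)    # short, cheap trip
    w = advance / float(horizon)
    return w * far + (1 - w) * near

curve = {a: blended_average(a) for a in range(0, 301)}
cheapest = min(curve, key=curve.get)
# With these made-up curves the minimum lands at the 60-day kink; its exact
# position depends entirely on the shapes chosen, which is rather the point.
print('Blended average is lowest %d days out, at %.2f'
      % (cheapest, curve[cheapest]))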

What do you know? The average price declines as demand switches from more expensive to cheaper tickets, with a minimum coincidentally just less than 60 days in advance. To get a more meaningful answer to the question "how far in advance is the typical ticket cheapest?", we should instead simply calculate separate advance curves for each market, and then combine them based on the total number (or value) of tickets sold in each market. In our simple example, if we assume the two markets have equal overall weight, we get a much more intuitive result, with prices flat up to 60 days, and then rising towards departure:

Figure: Weighted average advance-purchase price across long-haul and short-haul, with weighting by volume

All this goes to show how important it is that we frame our analytical questions (and answers!) carefully. When the traveller asks: "How far in advance should I book my flight?", it's our responsibility as analysts to recognize that they mean

How far in advance is any given ticket cheapest?

rather than

How far in advance is the average price of tickets sold that day lowest?

Even a correct answer to the latter is dangerously misleading because the traveller is unlikely to recognize the distinction and take it as the (wrong!) answer to their real question.


Tools and Tooling

Posted on Wed 16 December 2015 in TDDA • Tagged with tdda, tools

Good tools for testing matter because the temptation to skimp on testing is real even for true believers: anything that reduces the friction and pain associated with actually adding tests therefore has a disproportionate effect on adoption and implementation rates.

I think there are several reasons the temptation to forego writing tests seems to be strong for most people.

The first is that code is sometimes implemented correctly without writing any tests. There's no real temptation to miss out mandatory boiler-plate code at the top of a Java program because we know it can never work without it. The fact that it is possible to produce correct programs (and correct analytical processes) without writing tests means that even if you are intellectually convinced, and conditioned by experience, to believe that it is ultimately faster to produce reliable results by following a test-driven methodology, there's always the lingering memory of those rare occasions when things did just work first time without the "extra" work of writing tests.

The second reason it's tempting to skimp on writing tests is that this is a support activity rather than the main task at hand, making the testing part seem less glamorous and more of a chore. (Everyone is familiar with Abraham Lincoln's "Give me six hours to chop down a tree and I will spend the first four sharpening the axe", but how many adopt his principle?)

Perhaps the final reason for skimping on test code is that large programs and processes typically start life as small programs that we may not intend to use repeatedly. This is especially true in data analysis. While no task is so simple it cannot be botched, the benefits of and need for systematic testing grow with project size. The dynamic of "knocking up a script to calculate something", then later finding yourself using and modifying that script regularly, gradually extending it to handle ever more cases, while very typical for some kinds of development and analysis, carries a high risk. The frequent result is that by the time anyone realises that it really needs a test suite, the code has grown complex, undocumented and hard to back-fill with tests.

The approaches we propose for test-driven data analysis are deliberately compatible with retrofitting, because in practice so much code develops in this organic way. If we ignore this reality, we will automatically exclude a large proportion—perhaps a majority—of analytical processes actually deployed. Given that we believe the test-driven approach to data analysis has much to offer, we are keen to bring its benefits as widely as possible, so we need to have regard to the case of analytical processes developed in the real world without TDDA as a primary focus.

Understanding that there is a natural temptation not to develop tests helps us to realise that anything we can do to make the process of testing less painful to implement and easier to retrofit when it has been neglected is likely to help adoption.

With these thoughts in mind, over the coming weeks and months, we will have further posts on specific aspects of tooling for TDDA. The broad plan is to discuss ideas we have already implemented in our own Miró data analysis suite, some as extensions to Python's built-in unittest framework, and to start to extract and publish core functionality from there in forms that are applicable to the broader Python (and perhaps, in some cases R) data analysis toolsets.


Generalized Overfitting: Errors of Applicability

Posted on Mon 14 December 2015 in TDDA • Tagged with tdda, errors, applicability

Everyone building predictive models or performing statistical fitting knows about overfitting. This arises when the function represented by the model includes components or aspects that are overly specific to the particularities of the sample data used for training the model, and that are not general features of datasets to which the model might reasonably be applied. The failure mode associated with overfitting is that the performance of the model on the data we used to train it is significantly better than the performance when we apply the model to other data.

Figure: Overfitting

Figure: Overfitting. Points drawn from sin(x) + Gaussian noise. Left: Polynomial fit, degree 3 (cubic; good fit). Right: Polynomial fit, degree 10 (overfit).

Statisticians use the term cross-validation to describe the process of splitting the training data into two (or more) parts, and using one part to fit the model, and the other to assess whether or not it exhibits overfitting. In machine learning, this is more often referred to as a "test-training" approach.

A special form of this approach is longitudinal validation, in which we build the model on data from one time period and then check its performance against data from a later time period, either by partitioning the data available at build time into older and newer data, or by using outcomes collected after the model was built for validation. With longitudinal validation, we seek to verify not only that we did not overfit the characteristics of a particular data sample, but also that the patterns we model are stable over time.

Validating against data for which the outcomes were not known when the model was developed has the additional benefit of eliminating a common class of errors that arises when secondary information about validation outcomes "leaks" during the model building process. Some degree of such leakage—sometimes known as contaminating the validation data—is quite common.
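
As a minimal sketch of the longitudinal discipline (the record structure and dates here are invented):

from datetime import datetime

records = [
    {'date': datetime(2015, 10, 1), 'x': 1.0, 'y': 2.1},
    {'date': datetime(2015, 11, 5), 'x': 2.0, 'y': 3.9},
    {'date': datetime(2016, 1, 12), 'x': 3.0, 'y': 6.2},
    {'date': datetime(2016, 2, 20), 'x': 4.0, 'y': 7.8},
]

cutoff = datetime(2016, 1, 1)
training = [r for r in records if r['date'] < cutoff]      # older data: build here
validation = [r for r in records if r['date'] >= cutoff]   # newer data: assess here

# Every modelling decision (features, thresholds, parameters) should be made
# using training alone; validation is consulted only to measure performance,
# so that no information about its outcomes leaks back into the build.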

Generalized Overfitting

As its name suggests, overfitting as normally conceived is a failure mode specific to model building, arising when we fit the training data "too well". Here, we are going to argue that overfitting is an example of a more general failure mode that can be present in any analytical process, especially if we use the process with data other than that used to build it. Our suggested name for this broader class of failures is errors of applicability.

Here are some of the failure modes we are thinking about:

Changes in Distributions of Inputs (and Outputs)

  1. New categories. When we develop the analytical process, we see only categories A, B and C in some (categorical) input or output. In operation, we also see category D. At this point our process may fail completely ("crash"), produce meaningless outputs or merely produce less good results. (A simple check for this and for some of the failure modes below is sketched after this list.)

  2. Missing categories. The converse can be a problem too: what if a category disappears? Most prosaically, this might lead to a divide-by-zero error if we've explicitly used each category frequency in a denominator. Subtler errors can also creep in.

  3. Extended ranges. For numeric and other ordered data, the equivalent of new categories is values outside the range we saw in the development data. Even if the analysis code runs without incident, the process will be being used in a way that may be quite outside that considered and tested during development, so this can be dangerous.

  4. Distributions. More generally, even if the range of the input data doesn't change, its distribution may, either slowly or abruptly. At the very least, this indicates the process is being used in unfamiliar territory.

  5. Nulls. Did nulls appear in any fields where there were none when we developed the process? Does the process cater for this appropriately? And are such nulls valid?

  6. Higher Dimensional Shifts. Even if the data ranges and distribution for individual fields don't change, their higher dimensional distributions (correlations) can change significantly. The pair of 2-dimensional distributions below illustrates this point in an extreme way. The distributions of both x and y values on the left and right are identical. But clearly, in 2 dimensions, we see that the space occupied by the two datasets is actually non-overlapping, and on the left x and y are negatively correlated, while on the right they are positively correlated.

    Figure: A shift in distribution (2D)

    Figure: The same x and y values are shared between these two plots (i.e. the distribution of x and y is identical in each case). However, the pairing of x and y coordinates is different. A model or other analytical process built with negatively correlated data like that on the left might not work well for positively correlated data like that on the right. Even if it does work well, you may want to detect and report a fundamental change like this.

  7. Time (always marching on). Times and dates are notoriously problematical. There are many issues around date and time formats, many specifically around timezones (and the difference between local times and times in a fixed time zone, such as GMT or UTC).

    For now, let's assume that we have an input that is a well-defined time, correctly read and analysed in a known timezone—say UTC.1 Obviously, new data will tend to have later times—sometimes non-overlapping later times. Most often, we need to change these to intervals measured with respect to a moving date (possibly today, or some variable event date, e.g. days since contact). But in other cases, absolute times, or times in a cycle matter. For example, season, time of month or time of day may matter—the last two, probably in local time rather than UTC.

    In handling time, we have to be careful about binnings, about absolute vs. relative measurement (2015-12-11T11:00:00 vs. 299 hours after the start of the current month), universal vs. local time, and appropriate bin boundaries that move or expand with the analytic time window being considered.

    Time is not unique in the way that its range and maximum naturally increase with each new data sample. Most obviously, other counters (such as customer number) and sum-like aggregates may have this same monotonically increasing character, meaning that it should be expected that new, higher (but perhaps not new lower) values will be present in newer data.
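
Here is the simple check promised in the first item above: a library-free sketch that captures a few reference statistics from the development data and then flags new categories, out-of-range numeric values and unexpected nulls in later data. The field names ('market', 'price') and the values are invented for illustration.

def reference_stats(records, cat_fields, num_fields):
    # Capture the categories seen and the numeric ranges in development data.
    categories = {f: set(r[f] for r in records) for f in cat_fields}
    ranges = {f: (min(r[f] for r in records if r[f] is not None),
                  max(r[f] for r in records if r[f] is not None))
              for f in num_fields}
    return categories, ranges

def applicability_warnings(records, categories, ranges):
    # Yield a warning for each new category, out-of-range value or new null.
    for i, r in enumerate(records):
        for f, seen in categories.items():
            if r[f] not in seen:
                yield 'record %d: unseen category %r in %s' % (i, r[f], f)
        for f, (lo, hi) in ranges.items():
            if r[f] is None:
                yield 'record %d: unexpected null in %s' % (i, f)
            elif not lo <= r[f] <= hi:
                yield 'record %d: %s=%r outside [%r, %r]' % (i, f, r[f], lo, hi)

development = [{'market': 'A', 'price': 100.0}, {'market': 'B', 'price': 250.0}]
operational = [{'market': 'D', 'price': 95.0}, {'market': 'A', 'price': None}]

categories, ranges = reference_stats(development, ['market'], ['price'])
for warning in applicability_warnings(operational, categories, ranges):
    print(warning)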

Concrete and Abstract Definitions

There's a general issue with choosing values based on data used during development. This concerns the difference between what we will term concrete and abstract values, and what it means to perform "the same" operation on different datasets.

Suppose we decide to handle outliers differently from the rest of the data in a dataset, at least for some part of the analysis. For example, suppose we're looking at flight prices in Sterling and we see the following distribution.

Figure: Ticket Prices

Figure: Ticket prices, in £100 bins to £1,000, then doubling widths to £256,000, with one final bin for prices above £256,000. (On the graph, the £100-width bins are red; the rest are blue.)

On the basis of this, we see that well over 99% of the data has prices under £4,000, and also that while there are a few thousand ticket prices in the £4,000–£32,000 range (most of which are probably real) the final few thousand probably contain bad data, perhaps as a result of currency conversion errors.

We may well want to choose one or more threshold values from the data—say £4,000 in this case—to specify some aspect of our analytical process. We might, for example, use this threshold in the analysis for filtering, outlier reporting, setting a final bin boundary or setting the range for the axes of a graph.

The crucial question here is: How do we specify and represent our threshold value?

  • Concrete Value: Our concrete value is £4,000. In the current dataset there are 60,995 ticket prices (0.55%) above this value and 10,807,905 (99.45%) below. (There are no prices of exactly £4,000.) Obviously, if we specify our threshold using this concrete value—£4,000—it will be the same for any dataset we use with the process.

  • Abstract Value: Alternatively, we might specify the value indirectly, as a function of the input data. One such abstract specification is the price P below which 99.45% of the ticket prices in the dataset lie. If we specify a threshold using this abstract definition, it will vary according to the content of the dataset.

    • In passing, 99.45% is not precise: if we select the bottom 99.45% of this dataset by price we get 10,808,225 records with a maximum price of £4,007.65. The more precise specification is that 99.447046% of the dataset has prices under £4,000.

    • Of course, being human, if we were specifying the value in this way, we would probably round the percentage to 99.5%, and if we did that we would find that we shifted the threshold so that the maximum price below it was £4,186.15, and the minimum price above was £4,186.22.

  • Alternative Abstract Specifications: Of course, if we want to specify this threshold abstractly, there are countless other ways we might do it, some fraught with danger.

    Two things we should definitely avoid when working with data like this are means and variances across the whole column, because they will be rendered largely meaningless by outliers. If we blindly calculate the mean, μ, and standard deviation, σ, in this dataset, we get μ=£2,009.85 and σ=£983,956.28. That's because, as we noted previously, there are a few highly questionable ticket prices in the data, including a maximum of £1,390,276,267.42.2 Within the main body of the data—the ~99.45% with prices below £4,000.00—the corresponding values are μ=£462.09 and σ=£504.82. This emphasizes how dangerous it would be to base a definition on full-field moments3 such as mean or variance.

    In contrast, the median is much less affected by outliers. In the full dataset, for example the median ticket price is £303.77, while the median of those under £4,000.00 is £301.23. So another reasonably stable abstract definition of a threshold around £4,000.00 would be something like 13 times the median.

The reason for labouring this point around abstract vs. concrete definitions is that it arises very commonly and it is not always obvious which is preferable. Concrete definitions have the advantage of (numeric) consistency between analyses, but may result in analyses that are not well suited to a later dataset, because different choices would have been made if that later data had been considered by the developer of the process. Conversely, abstract definitions often make it easier to ensure that analyses are suitable for a broader range of input datasets, but can make comparability more difficult; they also tend to make it harder to get "nice" human-centric scales, bin boundaries and thresholds (because you end up, as we saw above, with values like £4,186.22, rather than £4,000).

Making a poor choice between abstract and concrete specifications of any data-derived values can lead to large sections of the data being omitted (if filtering is used), or made invisible (if used for axis boundaries), or conversely can lead to non-comparability between results or miscomputations if values are associated with bins having different boundaries in different datasets.
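
To make the distinction concrete (so to speak), here is a minimal sketch contrasting the two styles of specification. The prices are randomly generated stand-ins, not the ticket-price data discussed above, and the particular fraction and multiple are borrowed from the discussion purely for illustration.

import random

random.seed(0)
prices = [random.lognormvariate(6, 0.8) for _ in range(100000)]  # stand-in data

CONCRETE_THRESHOLD = 4000.0            # the same number for every dataset

def quantile_threshold(values, fraction=0.9945):
    # Abstract: the value below which `fraction` of this dataset lies.
    return sorted(values)[int(fraction * len(values)) - 1]

def median_multiple_threshold(values, multiple=13):
    # Abstract: a multiple of the (outlier-resistant) median.
    ordered = sorted(values)
    return multiple * ordered[len(ordered) // 2]

print('concrete:         %.2f' % CONCRETE_THRESHOLD)
print('quantile-based:   %.2f' % quantile_threshold(prices))
print('median multiple:  %.2f' % median_multiple_threshold(prices))

The concrete value stays fixed whatever data arrives; the two abstract values move with the data, which may or may not be what you want.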

NOTE: A common source of the leakage of information from validation data into training data, as discussed above, is the use of the full dataset to make decisions about thresholds such as those discussed here. To get the full benefit of cross-validation, all modelling decisions need to be made solely on the basis of the training data; even feeding back performance information from the validation data begins to contaminate that data.

Data-derived thresholds and other values can occur almost anywhere in an analytical process, but specific dangers include:

  1. Selections (Filters). In designing analytical processes, we may choose to filter values, perhaps to remove outliers or nonsensical values. Over time, the distribution may shift, and these filters may become less appropriate and remove ever-increasing proportions of the data.

    A good example of this that we have seen recently involves negative charges. In early versions of ticket price information, almost all charges were positive, and those that were negative were clearly erroneous, so we added a filter to remove all negative charges from the dataset. Later, we started seeing data in which there were many more, and less obviously erroneous, negative charges. It turned out that a new data source generated valid negative charges, but we were misled in our initial analysis and the process we built was unsuitable for the new context.

  2. Binnings (Bandings, Buckets). Binning data is very common, and it is important to think carefully about when you want bin boundaries to be concrete (common across datasets) and when they should be abstract (computed, and therefore different for different datasets). Quantile binnings (such as deciles), of course, are intrinsically adaptive, though if those are used you have to be aware that any given bin in one dataset may have different boundaries from the "same" bin in another dataset.

  3. Statistics. As noted above, when any statistic of the dataset is used in the analysis, some care has to be taken to determine whether it should be recorded algorithmically (as an abstract value) or numerically (as a concrete value), and particular care should be taken with statistics that are sensitive to outliers.

Other Challenges to Applicability

In addition to the common sources of errors of applicability we have outlined above, we will briefly mention a few more.

  1. Non-uniqueness. Is a value that was different for each record in the input data now non-unique?

  2. Crazy outliers. Are there (crazy) outliers in fields where there were none before?

  3. Actually wrong. Are there detectable data errors in the operational data that were not seen during development?

  4. New data formats. Have formats changed, leading to misinterpretation of values?

  5. New outcomes. Even more problematical than new input categories or ranges are new outcome categories or a larger range of output values. When we see this, we should almost always re-evaluate our analytical processes.

Four Kinds of Analytical Errors

In the overview of TDDA we published in Predictive Analytic Times (available here), we made an attempt to summarize how the four main classes of errors arise with the following diagram:

Figure: Four Kinds of Analytical Error

While this was always intended to be a simplification, a particular problem is that it suggests there's no room for errors of interpretation in the operationalization phase, which is far from the case.4 Probably a better representation is as follows:

Figure: Four Kinds of Analytical Error (revisited)


  1. UTC is the curious abbreviation (malacronym?) used for coordinated universal time, which is the standardized version of Greenwich Mean Time now defined by the scientific community. It is the time at 0º longitude, with no "daylight saving" (British Summer Time) adjustment. 

  2. This is probably the result of a currency conversion error. 

  3. Statistical moments are the characterizations of distributions starting with mean and variance, and continuing with skewness and kurtosis. 

  4. It's never too late to misinterpret data or results.