In Defence of XML: Exporting and Analysing Apple Health Data
Posted on Fri 15 April 2016 in TDDA • Tagged with xml, apple, health
I'm going to present a series of posts based around the sort of health and fitness data that can now be collected by some phones and dedicated fitness trackers. Not all of these will be centrally on topic for test-driven data analysis, but I think they'll provide an interesting set of data for discussing many issues of relevance, so I hope readers will forgive me to the extent that these stray from the central theme.
The particular focus for this series will be the data available from an iPhone and the Apple Health app, over a couple of different phones, and with a couple of different devices paired to them.
In particular, the setup will be:
- Apple iPhone 6s (November 2015 to present)
- Apple iPhone 5s (with fitness data from Sept 2014 to Nov 2015)
- Several Misfit Shine activity trackers (until early March 2016)
- An Apple Watch (about a month of data, to date)
Getting data out of Apple Health (The Exploratory Version)
I hadn't initially spotted a way to get the data out of Apple's Health app, but a quick web search[1] turned up this very helpful article: https://www.idownloadblog.com/2015/06/10/how-to-export-import-health-data. It turns out there is a properly supported way to export granular data from Apple Health, described in detail in the post. Essentially:
- Open the Apple Health App.
- Navigate to the Health Data section (left icon at the bottom)
- Select All from the list of categories
- There is a share icon at the top right (a vertical arrow sticking up from a square)
- Tap that to export all data
- It thinks for a while (quite a while, in fact) and then offers you various export options, which for me included AirDrop, email and handing the data to other apps. I used AirDrop to dump it onto a Mac.
The result is a compressed XML file called export.zip.
For me, this was about 5.5MB, which expanded to 109MB when unzipped.
(Interestingly, I started this with an earlier export a couple of weeks ago,
when the zipped file was about 5MB and the expanded version was 90MB, so it
is growing fairly quickly, thanks to the Watch.)
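The archive can also be unpacked from Python rather than from the desktop. The following is a minimal sketch using the standard zipfile module; the function name unpack_export and the tiny stand-in archive are illustrative assumptions (a real export.zip is far larger), and the demo builds its own archive so that it runs without a real export.

```python
import os
import tempfile
import zipfile


def unpack_export(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir; return (name, size) pairs."""
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest_dir)
        return [(info.filename, info.file_size) for info in zf.infolist()]


# Build a tiny stand-in archive (hypothetical content, for illustration only).
workdir = tempfile.mkdtemp()
zip_path = os.path.join(workdir, 'export.zip')
with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as zf:
    zf.writestr('export.xml', '<HealthData locale="en_GB"></HealthData>')

members = unpack_export(zip_path, os.path.join(workdir, 'unzipped'))
for name, size in members:
    print('%s: %d bytes uncompressed' % (name, size))
```

Comparing the reported uncompressed size with the size of the zip file itself gives the same kind of compression ratio as quoted above.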
As helpful as the iDownloadBlog article is, I have to comment on its introduction to exporting data, which reads
There are actually two ways to export the data from your Health app. The first way, is one provided by Apple, but it is virtually useless.
To be fair to iDownloadBlog, an XML file like this probably is useless to the general reader, but it builds on a meme fashionable among developers and data scientists to the effect of "XML is painful to process, verbose and always worse than JSON", and I think this is somewhat unfair.
Let's explore export.xml using Python and the ElementTree library.
Although the decompressed file is quite large (109MB), it's certainly
not problematically large to read into memory on a modern machine, so
I'm not going to worry about reading it in bits: I'm just going to
find out as quickly as possible what's in it.
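For an export that really was too large to hold in memory, reading it "in bits" is possible with ElementTree's iterparse. The sketch below is not what this post does; it is an alternative shown under the assumption that only per-tag counts are wanted, and it runs on a small embedded sample rather than a real export. The function name count_tags_streaming is hypothetical.

```python
import io
from collections import Counter
from xml.etree import ElementTree as ET

# A cut-down, made-up sample in the same shape as export.xml.
SAMPLE = b"""<HealthData locale="en_GB">
 <ExportDate value="2016-04-15 07:27:26 +0100"/>
 <Record type="HKQuantityTypeIdentifierStepCount" sourceName="Health"
         unit="count" startDate="2014-09-13 09:27:54 +0000"
         endDate="2014-09-13 09:27:59 +0000" value="329"/>
 <Record type="HKQuantityTypeIdentifierHeight" sourceName="Health"
         unit="cm" startDate="2016-01-02 09:44:00 +0100"
         endDate="2016-01-02 09:44:00 +0100" value="194">
  <MetadataEntry key="HKWasUserEntered" value="1"/>
 </Record>
</HealthData>"""


def count_tags_streaming(source):
    """Count the root's direct children by tag, clearing each once seen."""
    counts = Counter()
    depth = 0
    for event, elem in ET.iterparse(source, events=('start', 'end')):
        if event == 'start':
            depth += 1
        else:
            depth -= 1
            if depth == 1:          # elem is a direct child of the root
                counts[elem.tag] += 1
                elem.clear()        # release its attributes and children
    return counts


print(count_tags_streaming(io.BytesIO(SAMPLE)))
```

iterparse also accepts a filename, so the same function could be pointed at an unzipped export.xml.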
The first thing to do, of course, is simply to look at the file, probably using either the more or less command, assuming you are on some flavour of Unix or Linux. Let's look at the top of my export.xml:
$ head -79 export6s3/export.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE HealthData [
<!-- HealthKit Export Version: 3 -->
<!ELEMENT HealthData (ExportDate,Me,(Record|Correlation|Workout|ActivitySummary)*)>
<!ATTLIST HealthData
locale CDATA #REQUIRED
>
<!ELEMENT ExportDate EMPTY>
<!ATTLIST ExportDate
value CDATA #REQUIRED
>
<!ELEMENT Me EMPTY>
<!ATTLIST Me
HKCharacteristicTypeIdentifierDateOfBirth CDATA #REQUIRED
HKCharacteristicTypeIdentifierBiologicalSex CDATA #REQUIRED
HKCharacteristicTypeIdentifierBloodType CDATA #REQUIRED
HKCharacteristicTypeIdentifierFitzpatrickSkinType CDATA #REQUIRED
>
<!ELEMENT Record (MetadataEntry*)>
<!ATTLIST Record
type CDATA #REQUIRED
unit CDATA #IMPLIED
value CDATA #IMPLIED
sourceName CDATA #REQUIRED
sourceVersion CDATA #IMPLIED
device CDATA #IMPLIED
creationDate CDATA #IMPLIED
startDate CDATA #REQUIRED
endDate CDATA #REQUIRED
>
<!-- Note: Any Records that appear as children of a correlation also appear as top-level records in this document. -->
<!ELEMENT Correlation ((MetadataEntry|Record)*)>
<!ATTLIST Correlation
type CDATA #REQUIRED
sourceName CDATA #REQUIRED
sourceVersion CDATA #IMPLIED
device CDATA #IMPLIED
creationDate CDATA #IMPLIED
startDate CDATA #REQUIRED
endDate CDATA #REQUIRED
>
<!ELEMENT Workout ((MetadataEntry|WorkoutEvent)*)>
<!ATTLIST Workout
workoutActivityType CDATA #REQUIRED
duration CDATA #IMPLIED
durationUnit CDATA #IMPLIED
totalDistance CDATA #IMPLIED
totalDistanceUnit CDATA #IMPLIED
totalEnergyBurned CDATA #IMPLIED
totalEnergyBurnedUnit CDATA #IMPLIED
sourceName CDATA #REQUIRED
sourceVersion CDATA #IMPLIED
device CDATA #IMPLIED
creationDate CDATA #IMPLIED
startDate CDATA #REQUIRED
endDate CDATA #REQUIRED
>
<!ELEMENT WorkoutEvent EMPTY>
<!ATTLIST WorkoutEvent
type CDATA #REQUIRED
date CDATA #REQUIRED
>
<!ELEMENT ActivitySummary EMPTY>
<!ATTLIST ActivitySummary
dateComponents CDATA #IMPLIED
activeEnergyBurned CDATA #IMPLIED
activeEnergyBurnedGoal CDATA #IMPLIED
activeEnergyBurnedUnit CDATA #IMPLIED
appleExerciseTime CDATA #IMPLIED
appleExerciseTimeGoal CDATA #IMPLIED
appleStandHours CDATA #IMPLIED
appleStandHoursGoal CDATA #IMPLIED
>
<!ELEMENT MetadataEntry EMPTY>
<!ATTLIST MetadataEntry
key CDATA #REQUIRED
value CDATA #REQUIRED
>
]>
This is immediately encouraging: Apple has provided DOCTYPE (DTD) information, which, even though slightly old-fashioned, tells us what we should expect to find in the file. DTDs are awkward to use and, when they come from untrusted sources, can leave the user potentially vulnerable to malicious attacks, but despite this, they are quite expressive and helpful, even just as plain-text documentation.
Roughly speaking, the lines:
<!ELEMENT HealthData (ExportDate,Me,(Record|Correlation|Workout|ActivitySummary)*)>
<!ATTLIST HealthData
locale CDATA #REQUIRED
>
say
- that the top element will be a HealthData element
- that this HealthData element will contain
  - an ExportDate element
  - a Me element
  - zero or more elements of type Record, Correlation, Workout or ActivitySummary
- and that the HealthData element will have an attribute locale (which is mandatory).
The rest of this DTD section describes each kind of record in more detail.
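ElementTree does not validate a document against its DTD, so the top-level rule has to be checked by hand if we want to rely on it. The following is a rough sketch of such a check, written directly from the HealthData declaration in the DTD output above; the function name check_health_data and the two tiny test documents are made up for illustration.

```python
from xml.etree import ElementTree as ET

# Element types the DTD allows to repeat after ExportDate and Me.
ALLOWED_REPEATING = {'Record', 'Correlation', 'Workout', 'ActivitySummary'}


def check_health_data(root):
    """Return a list of ways in which root departs from the top-level DTD rule."""
    problems = []
    if root.tag != 'HealthData':
        problems.append('root element is %s, not HealthData' % root.tag)
    if 'locale' not in root.attrib:
        problems.append('missing required attribute: locale')
    tags = [child.tag for child in root]
    if tags[:2] != ['ExportDate', 'Me']:
        problems.append('expected ExportDate then Me, found %s' % tags[:2])
    for tag in sorted(set(tags[2:]) - ALLOWED_REPEATING):
        problems.append('unexpected element: %s' % tag)
    return problems


good = ET.fromstring(
    '<HealthData locale="en_GB"><ExportDate value="x"/><Me/>'
    '<Record type="t" sourceName="s" startDate="a" endDate="b"/></HealthData>')
bad = ET.fromstring(
    '<HealthData><ExportDate value="x"/><Me/><Steps/></HealthData>')

print(check_health_data(good))
print(check_health_data(bad))
```

This only covers the HealthData rule; the attribute lists for Record, Workout and the rest could be checked in the same style.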
The next 6 lines in my XML file are as follows (spread out for readability):
<HealthData locale="en_GB">
 <ExportDate value="2016-04-15 07:27:26 +0100"/>
 <Me HKCharacteristicTypeIdentifierDateOfBirth="1965-07-31"
     HKCharacteristicTypeIdentifierBiologicalSex="HKBiologicalSexMale"
     HKCharacteristicTypeIdentifierBloodType="HKBloodTypeNotSet"
     HKCharacteristicTypeIdentifierFitzpatrickSkinType="HKFitzpatrickSkinTypeNotSet"/>
 <Record type="HKQuantityTypeIdentifierHeight"
         sourceName="Health"
         sourceVersion="9.2"
         unit="cm"
         creationDate="2016-01-02 09:45:10 +0100"
         startDate="2016-01-02 09:44:00 +0100"
         endDate="2016-01-02 09:44:00 +0100"
         value="194">
  <MetadataEntry key="HKWasUserEntered" value="1"/>
 </Record>
As you can see, the export format is verbose, but extremely comprehensible and comprehensive. It's also very easy to read into Python and explore.
Let's do that, here with an interactive Python session:
>>> from xml.etree import ElementTree as ET
>>> with open('export.xml') as f:
...     data = ET.parse(f)
...
>>> data
<xml.etree.ElementTree.ElementTree object at 0x107347a50>
The ElementTree module turns each XML element into an Element object, described by its tag, with a few standard attributes. Inspecting the data object, we find:
>>> data.__dict__
{'_root': <Element 'HealthData' at 0x1073c2050>}
i.e., we have a single entry in data: a root element called HealthData. Like all Element objects, it has the four standard attributes:[2]
>>> root = data._root
>>> root.__dict__.keys()
['text', 'attrib', 'tag', '_children']
These are:
>>> root.attrib
{'locale': 'en_GB'}
>>> root.text
'\n '
>>> root.tag
'HealthData'
>>> len(root._children)
446702
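The session above reaches into private attributes (_root, _children) as they exist under Python 2. The same exploration can be done with ElementTree's public interface, which behaves the same under Python 2 and 3. A minimal sketch follows, run on a cut-down sample rather than the real file:

```python
from xml.etree import ElementTree as ET

# A cut-down sample with the same top-level shape as export.xml.
SAMPLE = """<HealthData locale="en_GB">
 <ExportDate value="2016-04-15 07:27:26 +0100"/>
 <Me HKCharacteristicTypeIdentifierDateOfBirth="1965-07-31"/>
 <Record type="HKQuantityTypeIdentifierHeight" sourceName="Health"
         unit="cm" startDate="2016-01-02 09:44:00 +0100"
         endDate="2016-01-02 09:44:00 +0100" value="194"/>
</HealthData>"""

root = ET.fromstring(SAMPLE)   # for a file: ET.parse(path).getroot()
nodes = list(root)             # public equivalent of root._children

print(root.tag)                # HealthData
print(root.attrib)             # {'locale': 'en_GB'}
print(len(nodes))              # 3
print(nodes[2].get('value'))   # 194
```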
So nothing much apart from a locale and a whole lot of child nodes. Let's inspect the first few of them:
>>> nodes = root._children
>>> nodes[0]
<Element 'ExportDate' at 0x1073c2090>
>>> ET.dump(nodes[0])
<ExportDate value="2016-04-15 07:27:26 +0100" />
>>> nodes[1]
<Element 'Me' at 0x1073c2190>
>>> ET.dump(nodes[1])
<Me HKCharacteristicTypeIdentifierBiologicalSex="HKBiologicalSexMale"
HKCharacteristicTypeIdentifierBloodType="HKBloodTypeNotSet"
HKCharacteristicTypeIdentifierDateOfBirth="1965-07-31"
HKCharacteristicTypeIdentifierFitzpatrickSkinType="HKFitzpatrickSkinTypeNotSet" />
>>> nodes[2]
<Element 'Record' at 0x1073c2410>
>>> ET.dump(nodes[2])
<Record creationDate="2016-01-02 09:45:10 +0100"
endDate="2016-01-02 09:44:00 +0100"
sourceName="Health"
sourceVersion="9.2"
startDate="2016-01-02 09:44:00 +0100"
type="HKQuantityTypeIdentifierHeight"
unit="cm"
value="194">
<MetadataEntry key="HKWasUserEntered" value="1" />
</Record>
>>> nodes[3]
<Element 'Record' at 0x1073c2550>
>>> nodes[4]
<Element 'Record' at 0x1073c2650>
So, exactly as the DTD indicated, we have an ExportDate node, a Me node and then what looks like a great number of records. Let's confirm that:
>>> set(node.tag for node in nodes[2:])
set(['Record', 'Workout', 'ActivitySummary'])
So in fact, there are three kinds of nodes after the ExportDate and Me records. Let's count them:
>>> records = [node for node in nodes if node.tag == 'Record']
>>> len(records)
446670
These records are ones like the Height record we saw above, though in fact most of them are not Height but StepCount, DistanceWalkingRunning or one of the two EnergyBurned types (ActiveEnergyBurned and BasalEnergyBurned), e.g.:
>>> ET.dump(nodes[100000])
<Record creationDate="2015-01-11 07:40:15 +0000"
endDate="2015-01-10 13:39:35 +0000"
sourceName="njr iPhone 6s"
startDate="2015-01-10 13:39:32 +0000"
type="HKQuantityTypeIdentifierStepCount"
unit="count"
value="4" />
There is also one activity summary per day (since I got the watch).
>>> acts = [node for node in nodes if node.tag == 'ActivitySummary']
>>> len(acts)
29
The first one isn't very exciting:
>>> ET.dump(acts[0])
<ActivitySummary activeEnergyBurned="0"
activeEnergyBurnedGoal="0"
activeEnergyBurnedUnit="kcal"
appleExerciseTime="0"
appleExerciseTimeGoal="30"
appleStandHours="0"
appleStandHoursGoal="12"
dateComponents="2016-03-18" />
but they get better:
>>> ET.dump(acts[2])
<ActivitySummary activeEnergyBurned="652.014"
activeEnergyBurnedGoal="500"
activeEnergyBurnedUnit="kcal"
appleExerciseTime="77"
appleExerciseTimeGoal="30"
appleStandHours="17"
appleStandHoursGoal="12"
dateComponents="2016-03-20" />
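Since each ActivitySummary carries both a value and a goal for each measure, a natural first question is which goals were met on a given day. The sketch below is one way to ask it, using the attribute names from the DTD; the MEASURES table and the goals_met function are assumptions made for illustration, and a goal of zero (as on the first day above) is treated as not met.

```python
from xml.etree import ElementTree as ET

SUMMARY = ET.fromstring(
    '<ActivitySummary activeEnergyBurned="652.014" '
    'activeEnergyBurnedGoal="500" activeEnergyBurnedUnit="kcal" '
    'appleExerciseTime="77" appleExerciseTimeGoal="30" '
    'appleStandHours="17" appleStandHoursGoal="12" '
    'dateComponents="2016-03-20" />')

# (value attribute, goal attribute) pairs, as named in the DTD.
MEASURES = (
    ('activeEnergyBurned', 'activeEnergyBurnedGoal'),
    ('appleExerciseTime', 'appleExerciseTimeGoal'),
    ('appleStandHours', 'appleStandHoursGoal'),
)


def goals_met(summary):
    """Map each measure to True if its value reached a non-zero goal."""
    result = {}
    for value_key, goal_key in MEASURES:
        value = float(summary.get(value_key, 0))
        goal = float(summary.get(goal_key, 0))
        result[value_key] = goal > 0 and value >= goal
    return result


print(SUMMARY.get('dateComponents'), goals_met(SUMMARY))
```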
Finally, there is a solitary Workout record.
>>> workouts = [node for node in nodes if node.tag == 'Workout']
>>> ET.dump(workouts[0])
<Workout creationDate="2016-04-02 11:12:57 +0100"
duration="31.73680251737436"
durationUnit="min"
endDate="2016-04-02 11:12:22 +0100"
sourceName="NJR Apple Watch"
sourceVersion="2.2"
startDate="2016-04-02 10:40:38 +0100"
totalDistance="0"
totalDistanceUnit="km"
totalEnergyBurned="139.3170000000021"
totalEnergyBurnedUnit="kcal"
workoutActivityType="HKWorkoutActivityTypeOther" />
So there we have it.
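The timestamps in the export all follow one format, so they can be parsed and cross-checked against other attributes. As a small sanity check, the sketch below (Python 3 only, because it relies on strptime's %z directive) recomputes the workout's duration from its startDate and endDate and compares it with the reported duration attribute. The names DATE_FORMAT and parse_date are illustrative, not part of the exporter.

```python
from datetime import datetime

DATE_FORMAT = '%Y-%m-%d %H:%M:%S %z'   # e.g. 2016-04-02 10:40:38 +0100


def parse_date(s):
    """Parse an export timestamp into a timezone-aware datetime (Python 3)."""
    return datetime.strptime(s, DATE_FORMAT)


# Values copied from the Workout record shown above.
start = parse_date('2016-04-02 10:40:38 +0100')
end = parse_date('2016-04-02 11:12:22 +0100')
elapsed_min = (end - start).total_seconds() / 60
reported_min = 31.73680251737436       # the duration attribute, in minutes

print('elapsed: %.2f min, reported: %.2f min' % (elapsed_min, reported_min))
```

The two figures agree to within a second, which suggests that duration is measured at finer resolution than the whole-second timestamps.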
Getting data out of Apple Health (The Code)
Given this exploration, we can take a first shot at writing an exporter for Apple Health Data. I'm going to ignore the activity summaries and workout(s) for now, and concentrate on the main records. (We'll get to the others in a later post.)
Here is the code:
"""
applehealthdata.py: Extract data from Apple Health App's export.xml.

Copyright (c) 2016 Nicholas J. Radcliffe
Licence: MIT
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals

import os
import re
import sys

from xml.etree import ElementTree
from collections import Counter, OrderedDict

__version__ = '1.0'

# Record attributes to export, in output column order, with their datatypes
# ('s' = string, 'd' = datetime, 'n' = number).
FIELDS = OrderedDict((
    ('sourceName', 's'),
    ('sourceVersion', 's'),
    ('device', 's'),
    ('type', 's'),
    ('unit', 's'),
    ('creationDate', 'd'),
    ('startDate', 'd'),
    ('endDate', 'd'),
    ('value', 'n'),
))

PREFIX_RE = re.compile('^HK.*TypeIdentifier(.+)$')
ABBREVIATE = True
VERBOSE = True


def format_freqs(counter):
    """
    Format a counter object for display.
    """
    return '\n'.join('%s: %d' % (tag, counter[tag])
                     for tag in sorted(counter.keys()))


def format_value(value, datatype):
    """
    Format a value for a CSV file, escaping double quotes and backslashes.

    None maps to empty.

    datatype should be
        's' for string (escaped)
        'n' for number
        'd' for datetime
    """
    if value is None:
        return ''
    elif datatype == 's':  # string
        return '"%s"' % value.replace('\\', '\\\\').replace('"', '\\"')
    elif datatype in ('n', 'd'):  # number or date
        return value
    else:
        raise KeyError('Unexpected format value: %s' % datatype)


def abbreviate(s):
    """
    Abbreviate particularly verbose strings based on a regular expression
    """
    m = re.match(PREFIX_RE, s)
    return m.group(1) if ABBREVIATE and m else s


def encode(s):
    """
    Encode string for writing to file.
    In Python 2, this encodes as UTF-8, whereas in Python 3,
    it does nothing
    """
    return s.encode('UTF-8') if sys.version_info.major < 3 else s


class HealthDataExtractor(object):
    """
    Extract health data from Apple Health App's XML export, export.xml.

    Inputs:
        path:      Relative or absolute path to export.xml
        verbose:   Set to False for less verbose output

    Outputs:
        Writes a CSV file for each record type found, in the same
        directory as the input export.xml. Reports each file written
        unless verbose has been set to False.
    """
    def __init__(self, path, verbose=VERBOSE):
        self.in_path = path
        self.verbose = verbose
        self.directory = os.path.abspath(os.path.split(path)[0])
        with open(path) as f:
            self.report('Reading data from %s . . . ' % path, end='')
            self.data = ElementTree.parse(f)
            self.report('done')
        self.root = self.data.getroot()
        # list(root) replaces the original root.getchildren(), which was
        # removed in Python 3.9; it behaves the same under Python 2 and 3.
        self.nodes = list(self.root)
        self.n_nodes = len(self.nodes)
        self.abbreviate_types()
        self.collect_stats()

    def report(self, msg, end='\n'):
        if self.verbose:
            print(msg, end=end)
            sys.stdout.flush()

    def count_tags_and_fields(self):
        self.tags = Counter()
        self.fields = Counter()
        for record in self.nodes:
            self.tags[record.tag] += 1
            for k in record.keys():
                self.fields[k] += 1

    def count_record_types(self):
        self.record_types = Counter()
        for record in self.nodes:
            if record.tag == 'Record':
                self.record_types[record.attrib['type']] += 1

    def collect_stats(self):
        self.count_record_types()
        self.count_tags_and_fields()

    def open_for_writing(self):
        # One output file per record type, each starting with a header line.
        self.handles = {}
        self.paths = []
        for kind in self.record_types:
            path = os.path.join(self.directory, '%s.csv' % abbreviate(kind))
            f = open(path, 'w')
            f.write(','.join(FIELDS) + '\n')
            self.handles[kind] = f
            self.report('Opening %s for writing' % path)

    def abbreviate_types(self):
        """
        Shorten types by removing common boilerplate text.
        """
        for node in self.nodes:
            if node.tag == 'Record':
                if 'type' in node.attrib:
                    node.attrib['type'] = abbreviate(node.attrib['type'])

    def write_records(self):
        for node in self.nodes:
            if node.tag == 'Record':
                attributes = node.attrib
                kind = attributes['type']
                values = [format_value(attributes.get(field), datatype)
                          for (field, datatype) in FIELDS.items()]
                line = encode(','.join(values) + '\n')
                self.handles[kind].write(line)

    def close_files(self):
        for (kind, f) in self.handles.items():
            f.close()
            self.report('Written %s data.' % abbreviate(kind))

    def extract(self):
        self.open_for_writing()
        self.write_records()
        self.close_files()

    def report_stats(self):
        print('\nTags:\n%s\n' % format_freqs(self.tags))
        print('Fields:\n%s\n' % format_freqs(self.fields))
        print('Record types:\n%s\n' % format_freqs(self.record_types))


if __name__ == '__main__':
    if len(sys.argv) != 2:
        print('USAGE: python applehealthdata.py /path/to/export.xml',
              file=sys.stderr)
        sys.exit(1)
    data = HealthDataExtractor(sys.argv[1])
    data.report_stats()
    data.extract()
To run this code, clone the repo from github.com/tdda/applehealthdata with:
$ git clone https://github.com/tdda/applehealthdata.git
or save the text from this post as applehealthdata.py.
At the time of posting, the code is consistent with this, but this commit is also tagged with the version number, v1.0, so if you check it out later and want to use this version, check out that version by saying:
$ git checkout v1.0
If your data is in the same directory as the code, then simply run:
$ python applehealthdata.py export.xml
and, depending on size, wait a few minutes while it runs. The code runs under both Python 2 and Python 3.
When I do this, the output is as follows:
$ python applehealthdata/applehealthdata.py export6s3/export.xml
Reading data from export6s3/export.xml . . . done
Tags:
ActivitySummary: 29
ExportDate: 1
Me: 1
Record: 446670
Workout: 1
Fields:
HKCharacteristicTypeIdentifierBiologicalSex: 1
HKCharacteristicTypeIdentifierBloodType: 1
HKCharacteristicTypeIdentifierDateOfBirth: 1
HKCharacteristicTypeIdentifierFitzpatrickSkinType: 1
activeEnergyBurned: 29
activeEnergyBurnedGoal: 29
activeEnergyBurnedUnit: 29
appleExerciseTime: 29
appleExerciseTimeGoal: 29
appleStandHours: 29
appleStandHoursGoal: 29
creationDate: 446671
dateComponents: 29
device: 84303
duration: 1
durationUnit: 1
endDate: 446671
sourceName: 446671
sourceVersion: 86786
startDate: 446671
totalDistance: 1
totalDistanceUnit: 1
totalEnergyBurned: 1
totalEnergyBurnedUnit: 1
type: 446670
unit: 446191
value: 446671
workoutActivityType: 1
Record types:
ActiveEnergyBurned: 19640
AppleExerciseTime: 2573
AppleStandHour: 479
BasalEnergyBurned: 26414
BodyMass: 155
DistanceWalkingRunning: 196262
FlightsClimbed: 2476
HeartRate: 3013
Height: 4
StepCount: 195654
Opening /Users/njr/qs/export6s3/BasalEnergyBurned.csv for writing
Opening /Users/njr/qs/export6s3/HeartRate.csv for writing
Opening /Users/njr/qs/export6s3/BodyMass.csv for writing
Opening /Users/njr/qs/export6s3/DistanceWalkingRunning.csv for writing
Opening /Users/njr/qs/export6s3/AppleStandHour.csv for writing
Opening /Users/njr/qs/export6s3/StepCount.csv for writing
Opening /Users/njr/qs/export6s3/Height.csv for writing
Opening /Users/njr/qs/export6s3/AppleExerciseTime.csv for writing
Opening /Users/njr/qs/export6s3/ActiveEnergyBurned.csv for writing
Opening /Users/njr/qs/export6s3/FlightsClimbed.csv for writing
Written BasalEnergyBurned data.
Written HeartRate data.
Written BodyMass data.
Written DistanceWalkingRunning data.
Written ActiveEnergyBurned data.
Written StepCount data.
Written Height data.
Written AppleExerciseTime data.
Written AppleStandHour data.
Written FlightsClimbed data.
$
As a quick preview of one of the files, here is the top of the second biggest output file, StepCount.csv:
$ head -5 StepCount.csv
sourceName,sourceVersion,device,type,unit,creationDate,startDate,endDate,value
"Health",,,"HKQuantityTypeIdentifierStepCount","count",2014-09-21 06:08:47 +0000,2014-09-13 09:27:54 +0000,2014-09-13 09:27:59 +0000,329
"Health",,,"HKQuantityTypeIdentifierStepCount","count",2014-09-21 06:08:47 +0000,2014-09-13 09:34:09 +0000,2014-09-13 09:34:14 +0000,283
"Health",,,"HKQuantityTypeIdentifierStepCount","count",2014-09-21 06:08:47 +0000,2014-09-13 09:39:29 +0000,2014-09-13 09:39:34 +0000,426
"Health",,,"HKQuantityTypeIdentifierStepCount","count",2014-09-21 06:08:48 +0000,2014-09-13 09:45:36 +0000,2014-09-13 09:45:41 +0000,61
You may need to scroll right to see all of it, or expand your browser window.
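Files in this format can be read back with Python's csv module, provided it is told about the backslash escaping. The following sketch (Python 3) sums step counts per day for the four rows shown above; the function name daily_totals is hypothetical, and assigning each record to the calendar date at the start of its startDate string (ignoring time zones) is a simplifying assumption.

```python
import csv
import io
from collections import defaultdict

# The header and first four rows of StepCount.csv, as shown above.
SAMPLE = '''sourceName,sourceVersion,device,type,unit,creationDate,startDate,endDate,value
"Health",,,"HKQuantityTypeIdentifierStepCount","count",2014-09-21 06:08:47 +0000,2014-09-13 09:27:54 +0000,2014-09-13 09:27:59 +0000,329
"Health",,,"HKQuantityTypeIdentifierStepCount","count",2014-09-21 06:08:47 +0000,2014-09-13 09:34:09 +0000,2014-09-13 09:34:14 +0000,283
"Health",,,"HKQuantityTypeIdentifierStepCount","count",2014-09-21 06:08:47 +0000,2014-09-13 09:39:29 +0000,2014-09-13 09:39:34 +0000,426
"Health",,,"HKQuantityTypeIdentifierStepCount","count",2014-09-21 06:08:48 +0000,2014-09-13 09:45:36 +0000,2014-09-13 09:45:41 +0000,61
'''


def daily_totals(lines):
    """Sum the value column per day, keyed by the date part of startDate."""
    reader = csv.DictReader(lines, escapechar='\\', doublequote=False)
    totals = defaultdict(float)
    for row in reader:
        day = row['startDate'][:10]     # 'YYYY-MM-DD'
        totals[day] += float(row['value'])
    return dict(totals)


print(daily_totals(io.StringIO(SAMPLE)))   # {'2014-09-13': 1099.0}
```

For a real file, an open file handle can be passed in place of the StringIO object.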
This blog post is long enough already, so I'll discuss (and plot) the contents of the various output files in later posts.
Notes on the Output
Format: The code writes CSV files including a header record with field names. Since the fields are XML attributes, which get read into a dictionary, they are unordered, so the code writes them in the fixed order defined by FIELDS, which isn't necessarily optimal, but is at least consistent. Nulls are written as empty fields, strings are quoted with double quotes, double quotes in strings are escaped with backslash and backslash is itself escaped with backslash. The output encoding is UTF-8.
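To make the escaping rule concrete, here is a small round-trip sketch (Python 3). The quote function reproduces the string branch of format_value from the code above; the sample strings are made up, and reading back with csv.reader using escapechar and doublequote=False is one way, not necessarily the only way, to undo the escaping.

```python
import csv
import io


def quote(value):
    """Quote a string as format_value does for 's' fields."""
    return '"%s"' % value.replace('\\', '\\\\').replace('"', '\\"')


# Hypothetical awkward values containing double quotes and a backslash.
original = ['njr "work" iPhone', 'C:\\health', 'plain']
line = ','.join(quote(v) for v in original) + '\n'

reader = csv.reader(io.StringIO(line), escapechar='\\', doublequote=False)
recovered = next(reader)

print(line, end='')
print(recovered == original)
```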
Filenames: One file is written per record type, and the name is just the record type with extension .csv, except for record types including HK...TypeIdentifier, which is excised.
Summary Stats: Summary stats about the various CSV files are printed before the main extraction occurs.
Overwriting: Any existing CSV files are silently overwritten, so if you have multiple health data export files in the same directory, take care.
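If silent overwriting is a concern, one possible guard (not something the code above does) is to open output files in exclusive-creation mode, which is available from Python 3.3. The helper name open_new below is hypothetical, and the demo works in a temporary directory.

```python
import os
import tempfile


def open_new(path):
    """Open path for writing, refusing to replace an existing file (Python 3.3+)."""
    return open(path, 'x')


workdir = tempfile.mkdtemp()
path = os.path.join(workdir, 'StepCount.csv')

with open_new(path) as f:               # first write succeeds
    f.write('sourceName,value\n')

try:
    open_new(path)                      # second attempt must fail
    clobbered = True
except FileExistsError:
    clobbered = False
    print('%s already exists; not overwriting' % os.path.basename(path))
```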
Data Sanitization: The code is almost completely opinionless, and with one exception simply flattens the data in the XML file into a collection of CSV files. The exception concerns file names and the type field. Apple uses extraordinarily verbose and ugly names like HKQuantityTypeIdentifierStepCount and HKQuantityTypeIdentifierHeight to describe the contents of each record: the abbreviate function in the code uses a regular expression to strip off the nonsense, resulting in nicer, shorter, more comprehensible file names and record types. However, if you prefer to get your data verbatim, simply change the value of ABBREVIATE to False near the top of the file and all your HealthKit prefixes will be preserved, at the cost of a non-trivial expansion of the output file sizes.
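The effect of the regular expression is easiest to see on a few names. This sketch reuses PREFIX_RE from the code above; the enabled parameter is an illustrative stand-in for the module-level ABBREVIATE flag, not part of the original function.

```python
import re

PREFIX_RE = re.compile('^HK.*TypeIdentifier(.+)$')


def abbreviate(s, enabled=True):
    """Strip a leading HK...TypeIdentifier prefix, if enabled and present."""
    m = PREFIX_RE.match(s)
    return m.group(1) if enabled and m else s


for name in ('HKQuantityTypeIdentifierStepCount',
             'HKCharacteristicTypeIdentifierDateOfBirth',
             'HKWorkoutActivityTypeOther'):
    print('%s -> %s' % (name, abbreviate(name)))
```

Names that lack the TypeIdentifier marker, such as the workout activity type, do not match and pass through unchanged.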
Notes on the code: Wot, no tests?
The first thing to say about the code is that there are no tests provided with it, which is—cough—slightly ironic, given the theme of this blog. This isn't because I've written them but am holding them back for pedagogical reasons, or as an ironical meta-commentary on the whole test-driven movement, but merely because I haven't written any yet. Happily, writing tests is a good way of documenting and explaining code, so another post will follow, in which I will present some tests, possibly correct myriad bugs, and explain more about what the code is doing.
1. I almost said 'I googled "Apple Health export"', but the more accurate statement would be that 'I DuckDuckGoed "Apple Health export"', but there are so many problems with DuckDuckGo as a verb, even in the present tense, let alone in the past as DuckDuckGod. Maybe I should propose the neologism "to DDGoogle". Or as Greg Wilson suggested, "to Duckle". Or maybe not . . .
2. The ElementTree structure in Python 3 is slightly different in this respect: this exploration was carried out with Python 2. However, the main code presented later in the post works under Python 2 and 3.