Extracting More Apple Health Data
Posted on Wed 20 April 2016 in TDDA
The first version of the Python code for extracting data from the XML export from the Apple Health app on iOS neglected to extract Activity Summaries and Workout data. We will now fix that.
As usual, I'll remind you how to get the code, if you want, then discuss the changes to the code, the reference test and the unit tests. Then in the next post, we'll actually start looking at the data.
The Updated Code
As before, you can get the code from Github with
$ git clone https://github.com/tdda/applehealthdata.git
or if you have pulled it before, with
$ git pull --tags
This version of the code is tagged with v1.3, so if it has been updated by the time you read this, get that version with
$ git checkout v1.3
I'm not going to list all the code here, but will pull out a few key changes as we discuss them.
Changes
Change 1: Change FIELDS to handle three different field structures.
The first version of the extraction code wrote only Records, which contain the granular activity data (which is the vast bulk of it, by volume).
Now I want to extend the code to handle the other two main kinds of data in the export: the ActivitySummary and Workout elements in the XML.
The three different element types contain different XML attributes, which correspond to different fields in the CSV, and although they overlap, I think the best approach is to have separate record structures for each, and then to create a dictionary mapping the element kind to its field information.
Accordingly, the code that sets FIELDS changes to become:
RECORD_FIELDS = OrderedDict((
    ('sourceName', 's'),
    ('sourceVersion', 's'),
    ('device', 's'),
    ('type', 's'),
    ('unit', 's'),
    ('creationDate', 'd'),
    ('startDate', 'd'),
    ('endDate', 'd'),
    ('value', 'n'),
))

ACTIVITY_SUMMARY_FIELDS = OrderedDict((
    ('dateComponents', 'd'),
    ('activeEnergyBurned', 'n'),
    ('activeEnergyBurnedGoal', 'n'),
    ('activeEnergyBurnedUnit', 's'),
    ('appleExerciseTime', 's'),
    ('appleExerciseTimeGoal', 's'),
    ('appleStandHours', 'n'),
    ('appleStandHoursGoal', 'n'),
))

WORKOUT_FIELDS = OrderedDict((
    ('sourceName', 's'),
    ('sourceVersion', 's'),
    ('device', 's'),
    ('creationDate', 'd'),
    ('startDate', 'd'),
    ('endDate', 'd'),
    ('workoutActivityType', 's'),
    ('duration', 'n'),
    ('durationUnit', 's'),
    ('totalDistance', 'n'),
    ('totalDistanceUnit', 's'),
    ('totalEnergyBurned', 'n'),
    ('totalEnergyBurnedUnit', 's'),
))

FIELDS = {
    'Record': RECORD_FIELDS,
    'ActivitySummary': ACTIVITY_SUMMARY_FIELDS,
    'Workout': WORKOUT_FIELDS,
}
and we have to change references (in both the main code and the test code) to refer to RECORD_FIELDS where previously there were references to FIELDS.
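To illustrate how a mapping like FIELDS can be used, here is a minimal sketch that builds a CSV header line for each element kind. The helper name header_line and the trimmed-down field lists are hypothetical, chosen only for illustration; the real code may construct its headers differently.

```python
from collections import OrderedDict

# Trimmed-down field lists, for illustration only (not the full lists above)
RECORD_FIELDS = OrderedDict((
    ('type', 's'),
    ('startDate', 'd'),
    ('value', 'n'),
))
WORKOUT_FIELDS = OrderedDict((
    ('workoutActivityType', 's'),
    ('duration', 'n'),
    ('durationUnit', 's'),
))
FIELDS = {
    'Record': RECORD_FIELDS,
    'Workout': WORKOUT_FIELDS,
}


def header_line(tag):
    """Hypothetical helper: CSV header for the given element tag."""
    # OrderedDict preserves insertion order, so the header columns
    # come out in the order the fields were declared.
    return ','.join(FIELDS[tag].keys())


print(header_line('Record'))
print(header_line('Workout'))
```

Because each kind has its own ordered field list, the same lookup by tag gives both the header and, later, the order in which to write values.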
Change 2: Add a Workout to the test data
There was a single workout in the data I exported from the phone (a token one I performed primarily to generate a record of this type). I simply used grep to extract that line from export.xml and poked it into the test data, testdata/export6s3sample.xml.
Change 3: Update the tag and field counters
The code for counting record types previously considered only nodes of type Record. Now we also want to handle Workout and ActivitySummary elements.
Workouts do come in different types (they have a workoutActivityType field), so it may be that we will want to write out different workout types into different CSV files, but since I have only, so far, seen a single workout, I don't really want to do this. So instead, we'll write all Workout elements to a corresponding Workout.csv file, and all ActivitySummary elements to an ActivitySummary.csv file.
Accordingly, the count_record_types method now uses an extra Counter attribute, other_types, to count the number of each of these elements, keyed on their tag (i.e. Workout or ActivitySummary).
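The counting scheme can be sketched roughly as follows. This is a standalone illustration, not the method from the repository: the function name count_types and the sample XML (including its attribute values) are invented for the example.

```python
from collections import Counter
from xml.etree import ElementTree

# Invented sample data, loosely shaped like export.xml
SAMPLE = '''<HealthData>
 <Record type="HKQuantityTypeIdentifierStepCount" value="10"/>
 <Record type="HKQuantityTypeIdentifierStepCount" value="20"/>
 <Workout workoutActivityType="HKWorkoutActivityTypeOther" duration="1.5"/>
 <ActivitySummary dateComponents="2016-04-01"/>
 <ActivitySummary dateComponents="2016-04-02"/>
</HealthData>'''


def count_types(root):
    """Count Record nodes by their type attribute, other nodes by tag."""
    record_types = Counter()
    other_types = Counter()
    for node in root:
        if node.tag == 'Record':
            record_types[node.attrib['type']] += 1
        elif node.tag in ('Workout', 'ActivitySummary'):
            other_types[node.tag] += 1
    return record_types, other_types


root = ElementTree.fromstring(SAMPLE)
record_types, other_types = count_types(root)
print(sorted(other_types.items()))
```

Keying other_types on the tag rather than on an attribute is what sends every Workout to one file and every ActivitySummary to another.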
Change 4: Update the test results to reflect the new behaviour
Two of the unit tests introduced last time need to be updated to reflect Change 3. First, the field counts change, and second, we need reference values for the other_types counts. Hence the new section in test_extracted_reference_stats:
        expectedOtherCounts = [
            ('ActivitySummary', 2),
            ('Workout', 1),
        ]
        self.assertEqual(sorted(data.other_types.items()),
                         expectedOtherCounts)
Change 5: Open (and close) files for Workouts and ActivitySummaries
We need to open new files for Workout.csv and ActivitySummary.csv if we have any such records. This is handled in the open_for_writing method.
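As a rough sketch of the idea (not the actual open_for_writing method), the following opens one CSV file per kind that actually occurs and writes a header to it. The function name open_other_files, the headers dictionary and the shortened header strings are all assumptions made for the example.

```python
import os
import tempfile
from collections import Counter


def open_other_files(directory, other_types, headers):
    """Open one CSV per kind that actually occurs and write its header."""
    handles = {}
    for kind in sorted(other_types):
        if other_types[kind] > 0:
            path = os.path.join(directory, '%s.csv' % kind)
            handle = open(path, 'w')
            handle.write(headers[kind] + '\n')
            handles[kind] = handle
    return handles


directory = tempfile.mkdtemp()
counts = Counter({'Workout': 1})  # pretend no ActivitySummary was seen
headers = {  # shortened, illustrative headers
    'Workout': 'startDate,duration',
    'ActivitySummary': 'dateComponents,appleStandHours',
}
handles = open_other_files(directory, counts, headers)
for handle in handles.values():
    handle.close()
created = sorted(os.listdir(directory))
print(created)
```

Driving the loop from the counts means no empty ActivitySummary.csv is created when the export contains no such elements.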
Change 6: Write records for Workouts and ActivitySummaries
There are minor changes to the write_records method to allow it to handle writing Workout and ActivitySummary records. The only real difference is that the different CSV files have different fields, so we need to look up the right values, in the order specified by the header for each kind. The new code does that:
    def write_records(self):
        kinds = FIELDS.keys()
        for node in self.nodes:
            if node.tag in kinds:
                attributes = node.attrib
                kind = attributes['type'] if node.tag == 'Record' else node.tag
                values = [format_value(attributes.get(field), datatype)
                          for (field, datatype) in FIELDS[node.tag].items()]
                line = encode(','.join(values) + '\n')
                self.handles[kind].write(line)
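To see the field lookup in isolation, here is a small standalone sketch that builds one CSV row from a single Workout element. The format_value shown here is an assumed stand-in (quote strings, pass numbers and dates through, map None to an empty field), not necessarily the function in the repository, and the XML fragment is invented.

```python
from collections import OrderedDict
from xml.etree import ElementTree

# Shortened field list, for illustration only
WORKOUT_FIELDS = OrderedDict((
    ('workoutActivityType', 's'),
    ('duration', 'n'),
    ('durationUnit', 's'),
    ('totalDistance', 'n'),
))


def format_value(value, datatype):
    """Assumed behaviour: quote strings, pass others through, None -> ''."""
    if value is None:
        return ''
    if datatype == 's':
        return '"%s"' % value.replace('\\', '\\\\').replace('"', '\\"')
    return value


# Invented element; note that totalDistance is deliberately absent
node = ElementTree.fromstring(
    '<Workout workoutActivityType="HKWorkoutActivityTypeOther" '
    'duration="1.5" durationUnit="min"/>')
values = [format_value(node.attrib.get(field), datatype)
          for (field, datatype) in WORKOUT_FIELDS.items()]
row = ','.join(values)
print(row)
```

Using attrib.get rather than indexing means an attribute missing from the XML produces an empty field instead of raising a KeyError, so every row keeps the same number of columns as the header.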
Change 7: Update the reference test
Finally, the reference test itself now generates two more files, so I've added reference copies of those to the testdata subdirectory and changed the test to loop over all four files:
    def test_tiny_reference_extraction(self):
        path = copy_test_data()
        data = HealthDataExtractor(path, verbose=VERBOSE)
        data.extract()
        for kind in ('StepCount', 'DistanceWalkingRunning',
                     'Workout', 'ActivitySummary'):
            self.check_file('%s.csv' % kind)
Mission Accomplished
We've now extracted essentially all the data from the export.xml file from the Apple Health app, and created various tests for that extraction process. We'll start to look at the data in future posts.
There is one more component in my extract: another XML file called export_cda.xml. This contains a ClinicalDocument, apparently conforming to a standard from (or possibly administered by) Health Level Seven International. It contains heart-rate data from my Apple Watch. I probably will extract it and publish the code for doing so, but later.