Test-Driven Data Analysis

Why Test-Driven Data Analysis?

Posted on Thu 05 November 2015 in TDDA • Tagged with questions, tdda, tdd

OK, everything you need to know about TeX has been explained—unless you happen to be fallible. If you don't plan to make any errors, don't bother to read this chapter.

— The TeXbook, Chapter 27, Recovery from Errors. Donald E. Knuth.¹

The concept of test-driven data analysis seeks to improve the answers to two sets of questions, which are defined with reference to an "analytical process".

Figure 1: A typical analytical process

The questions assume that you have used the analytical process at least once, with one or more specific collections of inputs, and that you are ready to use, share, deliver or simply believe the results.

The questions in the first group concern the implementation of your analytical process:

Implementation Questions

How confident are you that the outputs produced by the analytical process, with the input data you have used, are correct?
How confident are you that the outputs would be the same if the analytical process were repeated using the same input data?
Does your answer change if you repeat the process using different hardware, or after upgrading the operating system or other software?
Would the analytical process generate any warning or error if its results were different from when you first ran it and satisfied yourself with the results?
If the analytical process relies on any reference data, how confident are you that you would know if that reference data changed or became corrupted?
If the analytical process were run with different input data, how confident are you that the output would be correct on that data?
If corrupt or invalid input data were used, how confident are you that the process would detect this and raise an appropriate warning, error or failure?
Would someone else be able reliably to produce the same results as you from the same inputs, given detailed instructions and access?
Corollary: do such detailed instructions exist? If you were knocked down by the proverbial bus, how easily could someone else use the analytical process?
If someone developed an equivalent analytical process, and their results were different, how confident are you that yours would prove to be correct?

These questions are broadly similar to the questions addressed by test-driven development, set in the specific context of data analysis.

The questions in our second group are concerned with the meaning of the analysis, and a larger, more important sense of correctness:

** Interpretation Questions**

Is the input data² correct?³
Is your interpretation of the input data correct?
Are the algorithms you are applying to the data meaningful and appropriate?
Are the results plausible?
Is your interpretation of the results correct?
More generally, what are you missing?

These questions are less clear cut than the implementation questions, but are at least as important, and in some ways are more important. If the implementation questions are about producing the right answers, the interpretation questions are about asking the right questions, and understanding the answers.

Over the coming posts, we will seek to shape a coherent methodology and set of tools to help us provide better answers to both sets of questions—implementational and interpretational. If we succeed, the result should be something worthy of the name test-driven data analysis.

Donald E. Knuth, The TeXbook, Chapter 27, Recovery from Errors. Addison Wesley (Reading Mass) 1984. ↩
I am aware that, classically, data is the plural of datum, and that purists would prefer my question to be phrased as "Are the data correct?" If the use of 'data' in the singular offends your sensibilities, I apologise. ↩
When adding Error of Implementation and Error of Interpretation to the glossary, we decided that this first question really pertained to a third category of error, namely an Error of Process. ↩

Test-Driven Data Analysis

Posted on Thu 05 November 2015 in TDDA • Tagged with motivation

A dozen or so years ago I stumbled across the idea of test-driven development from reading various posts by Tim Bray on his Ongoing blog. It was obvious that this was a significant idea, and I adopted it immediately. It has since become an integral part of the software development processes at Stochastic Solutions, where we develop our own analytical software (Miró and the Artists Suite) and custom solutions for clients. But software development is only part of what we do at the company: the larger part of our work consists of actually doing data analysis for clients. This has a rather different dynamic.

Fast forward to 2012, and a conversation with my long-term collaborator and friend, Patrick Surry, during which he said something to the effect of:

So what about test-driven data analysis?

— Patrick Surry, c. 2012

The phrase resonated instantly, but neither of us entirely knew what it meant. It has lurked in my brain ever since, a kind of proto-meme, attempting to inspire and attach itself to a concept worthy of the name.

For the last fifteen months, my colleagues—Sam Rhynas and Simon Brown—and I have been feeling our way towards an answer to the question

What is test-driven data analysis?

We haven't yet pulled all the pieces together into coherent methodology, but we have assembled a set of useful practices, tools and processes that feel as if they are part of the answer.

A few weeks ago, my friend and ex-colleague Greg Wilson was in town for Edinburgh Parallel Computing Centre's twenty-fifth birthday bash. Greg is a computer scientist and former lecturer from University of Toronto. He now spends most of his time teaching scientists key ideas from software engineering through his Software Carpentry organization. He lamented that while he has no trouble persuading scientists of the benefits of adopting ideas such as version control, he finds them almost completely unreceptive when he champions software testing. I was initially rather shocked by this, since I routinely say that test-driven development is the most significant idea in software in the last thirty or forty years. Thinking about it more, however, I suspect the reasons for the resistance Greg encounters are similar to the reasons we have found it harder than we expected to take mainstream ideas from test-driven development and apply them in the rather specialized area of data analysis. Testing scientific code is more like testing analysis processes than it is like testing software per se.

As I reflected further on what Greg had said, I experienced a moment of clarity. The new insight it that while we have a lot useful components for test-driven data analysis, including some useful fragments of a methodology, we really don't have appropriate tools: the xUnit frameworks and their ilk are excellent for test-driven development, but don't provide specific support for the patterns we tend to need in analysis, and address only a subset of the issues we should want test-driven data analysis to cover.

The purpose of this new blog is to think out loud as we—in partnership with one of our key clients, Skyscanner—try to develop tools and methodologies to form coherent framework and support system for a more systematic approach to data science—a test-driven approach to data analysis.

So watch this space.

If you want to subscribe, this site has RSS and ATOM feeds, and also offers email subscriptions.¹ We'll be tweeting on @tdda0 whenever there are new posts. Twitter is also probably the best to send feedback, since we haven't plumbed in comments at this time: we'd love to hear what you think.

through MailChimp; thanks, MailChimp! ↩