Glossary

Analytical Process. A software system for transforming a specified set of inputs, including parameters, into a set of specified outputs for the purpose of performing a particular piece of analysis. See the post Why Test-Driven Data Analysis?

Assertion. An assertion is an expression that is supposed to evaluate to a true value. The assertion fails if it evaluates to a non-true value. Normally, each test in a test suite consists of one or more assertions. Outside the context of testing, assertions normally generate errors or exceptions when they fail.
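
For example, here is a minimal, illustrative Python test case in which the single test consists of two assertions; the mean function is just a stand-in for something worth testing:

```python
import unittest

def mean(values):
    return sum(values) / len(values)

class TestMean(unittest.TestCase):
    def test_mean_of_known_values(self):
        # Each assertion must evaluate to a true result for the test to pass.
        self.assertEqual(mean([1, 2, 3]), 2)
        self.assertAlmostEqual(mean([0.1, 0.2]), 0.15)

if __name__ == '__main__':
    unittest.main()
```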

Declaration. Declaration is the term we use for the kinds of assertions employed in TDDA. They are like any other assertions except that (1) we can choose at run-time whether they generate warnings or errors when violated, (2) we are experimenting with generating them automatically, and (3) our conception of them potentially extends to probabilistic assertions, perhaps describing expected distributions or expected performance levels.
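
As a sketch of the idea only (this is not the TDDA library's API; the declare function and STRICT switch are hypothetical), a declaration might be an assertion whose severity is chosen when the process runs:

```python
import warnings

STRICT = False   # hypothetical run-time switch: errors if True, warnings if False

def declare(condition, message):
    # Sketch of a declaration: an assertion whose severity is decided at run-time.
    if not condition:
        if STRICT:
            raise AssertionError(message)
        warnings.warn(message)

conversion_rate = 0.37   # illustrative value from some analysis step
declare(0.0 <= conversion_rate <= 1.0, 'conversion_rate outside [0, 1]')
```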

Defensive Programming. Defensive programming is a broad term used to cover safety checks and other precautionary measures that try to mitigate problems arising from unexpected situations encountered by software. These might include unexpected user input, unanticipated system issues and programming errors in the code itself. Wikipedia entry: Defensive programming
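
A small, hypothetical illustration of the style in Python (safe_ratio is invented for the example):

```python
def safe_ratio(numerator, denominator):
    # Defensive checks: validate the inputs before trusting them.
    for value in (numerator, denominator):
        if not isinstance(value, (int, float)):
            raise TypeError('safe_ratio expects numeric inputs')
    if denominator == 0:
        return None   # precautionary handling of an unexpected situation
    return numerator / denominator

print(safe_ratio(10, 4))   # 2.5
print(safe_ratio(10, 0))   # None
```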

Error of Implementation. Incorrect analysis resulting from failure to implement the intended algorithm correctly, so that the inputs are not manipulated as intended and the process may therefore produce different, incorrect outputs.

Error of Interpretation. Incorrect analysis resulting from incorrect formulation of the problem (e.g. inappropriate algorithm selection), misunderstanding of the outputs or misapplication of the outputs.

Error of Process. Errors of process arise when an analytical process is used in a way that was not intended, causing incorrect results to be produced. Examples include passing the wrong data into the process, reading the wrong results from the process, applying the wrong process to the data, failing to perform required steps before or after running the analytical process, naming a file incorrectly so that it doesn't get used, and passing the wrong parameters to the process. Errors of process will generally cause the results to be incorrect even if the algorithm chosen when developing the process was correct, and the implementation of the algorithm was correct.

Integration Test. In test-driven development, tests of larger parts of a software system, especially if they include interactions with other software systems (such as a database or a web service), are sometimes known as integration tests or system tests. These contrast with unit tests. The division between integration/system tests and unit tests is not absolute or unambiguous. Wikipedia entry: Integration testing
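
A minimal sketch of the shape of such a test, using an in-memory SQLite database so that the example is self-contained (a real integration test would usually exercise an external system):

```python
import sqlite3
import unittest

class TestDatabaseRoundTrip(unittest.TestCase):
    def test_insert_and_read_back(self):
        # Exercises the interaction with a database, not just an isolated function.
        conn = sqlite3.connect(':memory:')
        conn.execute('CREATE TABLE scores (name TEXT, score REAL)')
        conn.execute("INSERT INTO scores VALUES ('a', 0.5)")
        rows = conn.execute('SELECT name, score FROM scores').fetchall()
        self.assertEqual(rows, [('a', 0.5)])
        conn.close()

if __name__ == '__main__':
    unittest.main()
```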

Metadata. Metadata is data about data. In the context of tabular data, the simplest kinds of metadata are the field names and types. Any statistics we can compute are another form of metadata, e.g. minimum and maximum values, averages, null counts, values present etc. There is literally no limit to what metadata can be associated with an underlying dataset. Wikipedia entry: Metadata
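
For example, assuming a pandas DataFrame is to hand, some of the simplest kinds of metadata can be computed like this:

```python
import pandas as pd   # assumes pandas is available

df = pd.DataFrame({'age': [31, 45, None], 'name': ['a', 'b', 'c']})

# The simplest metadata: field names and types.
print(list(df.columns))       # ['age', 'name']
print(df.dtypes.to_dict())    # {'age': dtype('float64'), 'name': dtype('O')}

# Computed metadata: extremes and null counts.
print(df['age'].min(), df['age'].max())   # 31.0 45.0
print(df.isnull().sum().to_dict())        # {'age': 1, 'name': 0}
```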

Missing Value. See null value.

Mocking. Mocking is the process of modifying some aspect of a system under test to replace some functionality—often an external dependency, such as a database, a network service or a function call—with some alternative, mock functionality. The mock typically produces a deterministic result with very little computation. Mocking is used to allow tests to focus on specific functionality in unit tests, to reduce or eliminate dependencies on external systems, and to reduce the time tests take to run. Wikipedia entry: Mock Object.
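
For example, using Python's standard unittest.mock module to replace a database dependency with a deterministic stand-in (fetch_row_count is invented for the illustration):

```python
import unittest
from unittest import mock

def fetch_row_count(db):
    # In production, db would be a real database connection.
    return db.execute('SELECT COUNT(*) FROM t').fetchone()[0]

class TestWithMockedDatabase(unittest.TestCase):
    def test_row_count_uses_query_result(self):
        db = mock.Mock()
        # The mock returns a deterministic result with no real database involved.
        db.execute.return_value.fetchone.return_value = (42,)
        self.assertEqual(fetch_row_count(db), 42)
        db.execute.assert_called_once_with('SELECT COUNT(*) FROM t')

if __name__ == '__main__':
    unittest.main()
```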

Overfitting. A statistical model is said to exhibit overfitting when it models the specific patterns in the sample data used to fit (or train) the model, improving its measured accuracy on that training data, generally to the detriment of its accuracy on other data. We use the term generalization to express the inverse condition, i.e. the less a model overfits its training data, the better it is at generalization.
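
A small illustration of the idea, assuming numpy: a high-degree polynomial fitted to a few noisy points reproduces the training sample almost exactly, but typically does worse on fresh data from the same process:

```python
import numpy as np   # assumes numpy is available

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 8)
y_train = x_train + rng.normal(0, 0.1, 8)     # noisy linear relationship

# A degree-7 polynomial can fit all 8 training points almost exactly...
coeffs = np.polyfit(x_train, y_train, 7)
train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)

# ...but typically generalizes poorly to new data from the same process.
x_test = np.linspace(0.05, 0.95, 8)
y_test = x_test + rng.normal(0, 0.1, 8)
test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)

print(train_mse, test_mse)   # expect train_mse near zero, test_mse larger
```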

Refactoring. Refactoring is the term generally used in test-driven development to refer to restructuring code with a view to improving it without changing its functionality. There is often a specific focus on simplifying and decoupling the code, since complexity is widely recognized as a key problem in software. In the context of TDD, refactoring refers to modifying code while keeping all of the tests passing.
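
A tiny, contrived illustration: the two functions below behave identically, so any tests of the first should continue to pass after the refactoring that produces the second:

```python
# Before refactoring: duplicated looping logic obscures the intent.
def report_before(values):
    total = 0
    for v in values:
        total = total + v
    count = 0
    for v in values:
        count = count + 1
    return 'mean=%s' % (total / count)

# After refactoring: same behaviour, simpler structure.
def report_after(values):
    return 'mean=%s' % (sum(values) / len(values))

assert report_before([1, 2, 3]) == report_after([1, 2, 3])
```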

Reference Test. Reference test is the name we use in test-driven data analysis for a particular kind of system test: one that checks a complete analytical process against one or more collections of input datasets, input parameters and expected outputs. A reference test is thus both a system test and a regression test. See Infinite Gain: the First Test.
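
The shape of a reference test, sketched with Python's unittest (this is only an outline, not the TDDA library's interface; run_analysis and the reference output are hypothetical):

```python
import unittest

def run_analysis(input_rows):
    # Hypothetical stand-in for a complete analytical process.
    return '\n'.join(str(sum(row)) for row in input_rows)

class ReferenceTestExample(unittest.TestCase):
    def test_against_reference_output(self):
        actual = run_analysis([(1, 2), (3, 4)])
        # In practice, the expected output would be read from a verified
        # reference file kept under revision control.
        expected = '3\n7'
        self.assertEqual(actual, expected)

if __name__ == '__main__':
    unittest.main()
```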

Resultoid. A spurious result generated by software, its credibility enhanced by its computational provenance or impressive formatting.

Revision Control System. A revision control system is a software package that allows a complete history of check-pointed changes to a collection of files to be tracked, allowing any previous check-pointed version of a file, or a collection of files, to be reconstructed. Typically, multiple contributors can read from and write to a revision control system (perhaps asymmetrically), and the system has facilities for highlighting and helping users to resolve conflicts between incompatible file changes. Git, Mercurial, Subversion, cvs, rcs and sccs are examples of well-known revision control systems. Sites such as Github and Bitbucket offer hosted versions of such systems. Wikipedia entry: Version control.

System Test. See Integration Test.

Test-Driven Data Analysis. That which we are trying to define in this blog; an embryonic methodology for improving the reliability, re-usability and malleability of analytical processes, with a focus on avoiding errors of implementation, errors of interpretation and errors of process.

Test-Driven Development. The practice of creating, maintaining and repeatedly running a suite of software tests with a view to increasing software reliability, development speed and software malleability. Also sometimes referred to as test-driven design. Wikipedia entry: Test-driven development.

TDD. See test-driven development.

TDDA. See test-driven data analysis.

Unit Test. In test-driven development, tests of isolated functions, methods or other subcomponents of a software system are often known as unit tests. These contrast with integration tests, which test larger components or whole systems, possibly including external systems. The division between unit tests and integration/system tests is not absolute or unambiguous. Wikipedia entry: Unit testing.