Gentest Talk at 2022 Toronto Workshop on Reproducibility

Posted on Fri 25 February 2022 in TDDA • Tagged with tests, reference tests, gentest

We released version 2.0 of the Python TDDA library this week. The radical new feature of the 2.0 release is Gentest, a command-line tool for automatically generating tests for more-or-less any code that you can run from a command line.

Gentest was introduced at the 2022 Toronto Workshop on Reproducibility yesterday (24th February), where demonstrations included using it to write tests for three increasingly complex R programs. This was to emphasize that Gentest is useful for much more than just testing Python code. Our (only slightly facetious) strapline is

Gentest writes tests, so you don't have to.

Slides from the talk are available here:

And here's the video:

We'll be posting more detail about Gentest here over the coming weeks.

Another major upgrade in the 2.0 TDDA release is the documentation. We've made much more effort to separate out the command-line uses of the TDDA library

  • constraint generation
  • data verification
  • data validation
  • inference of regular expressions, with Rexpy
  • (now) automatic test generation, with Gentest

from the API documentation, which is only really relevant to Python users.

The documentation is available at Read The Docs.


Unix & Linux Survival Guide for Data Science etc.

Posted on Mon 21 February 2022 in TDDA • Tagged with tests, cartoon

Cheat-sheet for unix and linux

PDF Version (A4)

PDF Version (Letter)


One Tiny Bug Fix etc.

Posted on Wed 16 February 2022 in TDDA • Tagged with tests, cartoon

White Cat: The tests have failed again. Black Cat: Did you change the code? White Cat: No! Black Cat: Really? White Cat: I just fixed one TINY BUG in a COMPLETELY DIFFERENT part of the code. There's NO WAY that could cause this!


Why Code Rusts

Posted on Mon 07 February 2022 in TDDA • Tagged with tests, reference tests, rust

or Why Tests Spontaneously Fail

You might think that if you write a program, and don't change anything, then come back a day later (or a decade later) and run it with the same inputs, it would produce the same output. At their core, reference tests exist because this isn't true, and it's useful to find out if code you wrote in the past no longer does the same thing it used to. This post collects together some of the reasons the behaviour of code changes over time.1
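
To make the idea concrete, here is roughly what a minimal reference test looks like using the tdda library. This is only a sketch: generate_output() and expected_output.txt are illustrative names, and the reference-file handling is simplified (see the tdda.referencetest documentation for the full API).

# A minimal sketch of a reference test with the tdda library.
# generate_output() and expected_output.txt are illustrative names.
from tdda.referencetest import ReferenceTestCase


def generate_output():
    # Stand-in for the code whose behaviour we want to pin down.
    return 'result: 42\n'


class TestAnalysis(ReferenceTestCase):
    def test_output_unchanged(self):
        # Compares the generated string against a previously saved reference
        # file; the test fails if the output drifts from what it used to be.
        self.assertStringCorrect(generate_output(), 'expected_output.txt')


if __name__ == '__main__':
    ReferenceTestCase.main()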

The Environment Has Changed

E1 You updated your compiler/interpreter (Python/R etc.)
E2 You updated libraries used in your code (e.g. from PyPI/CRAN).
E3 You updated the operating system of the machine you're running on.
E4 Someone else updated the operating system or library/compiler etc.
E5 Your code uses some other software on your machine (or another machine) that has been updated (e.g. a database).
E6 Your code uses an external service whose behaviour has changed (e.g. calling a web service to get/do something).
E7 You have updated/replaced your hardware.
E8 You run it on different hardware (another machine or OS or OS version or under a different compiler or...)
E9 You move the code to a different location in the file system.
E10 You have changed something in the file system that messes up the code e.g.

  • deleting a file the code uses
  • renaming a file the code uses
  • editing a file the code uses
  • removing or renaming a directory the code uses
  • changing permissions on a file or directory the code uses
  • creating a file or directory that the code expects to create and is now unable to, e.g. because of permissions.

E11 You run as a different user.
E12 You run from a different directory while leaving the code in the same place.
E13 You run the code in a different way (e.g. from a script instead of interactively, or in a scheduler).
E14 A disk fills or some other resource becomes full or depleted.
E15 The load on the machine is higher, and the code runs out of memory or disk or some other resource; or has a subtle timing dependency or assumption that fails under load.
E15a The load on the machine is lower, meaning part of your code runs faster, causing a race condition to behave differently. [Added 2022-02-17]
E16 The hardware has developed a fault.
E17 A systems manager has changed some limits e.g. disk quotas, allowed nice levels, a directory service, some permissions or groups...
E18 A shell variable changed, or was created or destroyed.
E19 The locale in which the machine is running changed.
E20 You changed your PYTHONPATH or equivalent.
E21 A new library that you don't use (or didn't think you used) has appeared in a site-packages or similar location, and was picked up by your code or something else your code uses.
E22 You updated your editor/IDE and now whenever you load a file it gets changed in some subtle way that matters (e.g. line endings, blank lines at the end of files, encoding, tabs vs. spaces).
E23 The physical environment has changed in some way that affects the machine you are running on (e.g. causing it to slow down).
E24 A file has been touched2 and the software determines order of processing by last update date.
E25 The code uses a password or key that is changed, expires or is revoked.
E26 The code requires network access and the network is unavailable, slow, or unreliable at the time the test is run.
E27 Almost any of the above (or below), but for a dependency of your code rather than your code itself, e.g. something in a data centre or library.
E28 Your PATH (the list of locations checked for executables) has changed, or an alias has changed so that the executable you run is different from before. [Added 2022-02-11]
E29 A different disk or share is mounted, so that even though you specify the same path, some file that you are using is different from before. [Added 2022-02-11]
E30 You run the code under a different shell or changed something in a shell startup file. [Added 2022-02-17]
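
Many of these environmental causes are much easier to diagnose after the fact if each run records what it was actually running against. A minimal sketch (what to record is necessarily incomplete and will vary by project):

# Record enough of the environment with each run to make E1-E30-style
# changes visible when a previously passing test starts failing.
import os
import platform
import sys


def environment_snapshot():
    return {
        'python': sys.version,
        'executable': sys.executable,                # which interpreter actually ran
        'platform': platform.platform(),             # OS and version (E3, E8)
        'cwd': os.getcwd(),                          # E12: run from a different directory
        'path': os.environ.get('PATH'),              # E28: a different executable found first
        'pythonpath': os.environ.get('PYTHONPATH'),  # E20
        'lang': os.environ.get('LANG'),              # E19: one indicator of the locale
    }


if __name__ == '__main__':
    for key, value in environment_snapshot().items():
        print(f'{key}: {value}')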

Many of these are illuminated by one of my favourite quotes from Beth Andres-Beck:

Mocking in unit tests makes the tests more stable because they don’t break when your code breaks.
— @bethcodes, 2020-12-29T01:26:00Z https://twitter.com/bethcodes/status/1343730015851069440

The Code Has, in Fact, Changed

C1 You think you didn't change the code, but actually you did.
C2 You did change the code, but only in a way that couldn't possibly change the behaviour in the case you're testing.
C3 You didn't change the code, you fixed a bug.
C4 You didn't change the code, but someone else did.
C5 You didn't change the code, but disk corruption did.
C6 You didn't change the code, but you did update some data it uses.
C7 You pulled the code again from a source-code repository but

  • someone else had pushed a change
  • you checked out a different branch
  • you pulled from the wrong repository.

C8 You're on the wrong branch.
C9 The system was restored from backup and you lost changes.
C10 You used a hard link to a file and didn't change the file here but did change it in one of the other linked locations.
C11 You used symbolic links and though your symbolic link didn't change, the code (or other file or files) it linked to did.
C12 You used a diff tool to compare files, but a difference that does matter to your code was not detected by the diff tool (e.g. line endings or capitalization or whitespace).
C13 You are in fact running more tests than previously, or different tests from the ones you ran previously, without realising it.
C14 You reformatted your code thinking that you were only making changes to appearance.
C15 You ran a code formatter/beautifier/coding standard enforcement tool that had a bug in it and changed the meaning.
C16 You believe nothing has changed because git status tells you nothing has changed, but you are using files that aren't tracked or are ignored.
C17 You think a file hasn't changed because of its timestamp, but the timestamp is wrong or doesn't mean what you think it means.
C18 A hidden file changed (e.g. a dotfile).
C19 A file that doesn't match a glob pattern you use changed.

Also from Beth Andres-Beck:

If you have 100% test coverage and your tests use mocks, no you don’t.
— @bethcodes, 2020-12-29T01:51:00Z https://twitter.com/bethcodes/status/1343736477839020032

You Aren't Running the Code You Think You Are

There is another set of problems that aren't strictly causes of code rusting, but which help to explain a set of related situations every developer has probably experienced, which all fall under the general heading of you aren't running the code you think you are.

M1 The code you're running is not the version you think it is (e.g. you're in the wrong directory).
M2 You are running the code on a different server from the one you think you are (e.g. you haven't realised you're ssh'd in to a different machine or editing a file over a network).
M3 You're editing the code in one place but running it in another.
M4 You have cross-mounted a file system and it's the wrong file system or you think you are/aren't using it when you actually aren't/are (respectively).
M5 Something (e.g. a browser) is caching your code (or some CSS or an image or something).
M6 The code has in fact run correctly (tests have passed) but you're looking at the wrong output (wrong directory, wrong tab, wrong URL, wrong window, wrong machine...)
M7 Your compiled code is out-of-sync with your source code, so you're not running what you think you are.
M8 You're running (or not running) a virtual environment when you think you are not (or are), respectively.
M9 You're running a virtual environment and not understanding how it's doing its magic, with the result that you're not using the libraries/code you think you are.
M10 You use a package manager that's installed the right libraries into a different Python (or whatever) from the one you think it has.3
M11 You think you haven't changed the code/libraries/Python you're using, but in fact you did when you updated (what you thought was) a different virtual (or non-virtual) environment.
M12 You have a conflict between different import directories (e.g. a local site-packages and a system site-packages), with different versions of the same library, and aren't importing the one you think you are.
M13 You think the code hasn't changed because you recorded the version number, but there was a code change that didn't cause the version number to be changed, or the code has multiple version numbers, or the code is reporting its version number wrongly, or the version number actually refers to a number of slightly different builds that are supposed to have the same behaviour, but don't.
M14 You have defined the same class or method or function or variable more than once in a language that doesn't mind such things, and are looking at (and possibly editing) a copy of the relevant function/callable/object that is masked by the later definition. [Added 2022-09-14]
M15 A web server or application server has your code in memory and changing or recompiling your code won't have any effect until you restart that web server or application server. This is really a variation of M5, but is subtly different because you wouldn't normally think of this as caching. [Added 2024-03-30]

These are the ones that make you question your sanity.

TIP If what's happening can't be happening, try introducing a clear syntax error or debug statement or some other change you should be able to see. Then check that it shows up as expected when you're running your code.

Almost every time I think I'm losing my mind when coding, it's because I'm editing and running different code (or viewing results from different code).
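
In Python, the quickest version of that check is often to print exactly which interpreter and which copy of the code are being used (mymodule here is a placeholder for whatever you're actually editing):

# Confirm which interpreter and which copy of the code are really running.
# 'mymodule' is a placeholder for the module you're actually debugging.
import sys

import mymodule

print(sys.executable)     # the Python that's really running (M8-M11)
print(mymodule.__file__)  # the file that was really imported (M1, M3, M12)
print(sys.path)           # where imports are being resolved from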

Time Has Moved On

T1 Your code has a (usually implicit) date/time dependence in it, e.g.

  • it uses 2-digit dates
  • it assumes it's running in 2022
  • it assumes it's not 29th February, or 1st January, or isn't a weekend...
  • it assumes something else that's not true about (computer) time (no leap seconds, no daylight savings times, no time-zones, no half-hour-aligned timezones...)
  • it uses 2-digit dates with a pivot year, and time (or some computed time the code uses) moves past the pivot year (see the sketch after this list).

T2 Time is 'bigger' in some material way that causes a problem, e.g.

  • Y2K
  • Unix 2038 (when the number of seconds from 1 Jan 1970 overflows 32-bit integers)
  • Number of days since the code was written needs more digits (10, 100, 1000).

T3 While the code is running, daylight savings time starts or stops, and a measured (local) time interval goes negative.
T4 Your code uses Excel to interpret data and today's a special date that Excel doesn't (or more likely does) recognize.
T5 The system clock is wrong (perhaps badly wrong); or the system clock was wrong when you ran it before and is now right.
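
As a small illustration of the pivot-year point in T1: Python's %y follows the POSIX convention, mapping two-digit years 69–99 to 1969–1999 and 00–68 to 2000–2068, so the same parsing code silently changes century depending on the values in the data (or on when the computed dates fall):

# The pivot-year behaviour behind many 2-digit-date surprises (T1).
from datetime import datetime

print(datetime.strptime('01/01/68', '%d/%m/%y').year)  # 2068
print(datetime.strptime('01/01/69', '%d/%m/%y').year)  # 1969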

Resources Used by the Code Have Changed

R1 A resource your code uses (a database, a reference file, a page on the internet, a web service) returns different data from the data it always previously returned.
R2 A resource your code uses returns data in a different format e.g. a different text encoding, different precision, different line endings (Unix vs. PC vs. Mac), presence or absence of a byte-order marker (BOM) in UTF-8, presence of new characters in Unicode, different normalization of unicode, indented or unindented JSON/XML, different sort order etc.
R3 A resource you depend on returns “the same” data as expected but something about the interaction is different, e.g. a different status code or some extra data you can ignore, or some redundant data you use has been removed.

Stochastic and Indeterminate Effects

S1 Your code uses random numbers and doesn't fix the seed.
S2 Your code uses random numbers and does fix the main seed but not other seeds that get used (e.g. the seed for NumPy is separate from Python's main seed; see the sketch after this list).
S3 A cosmic ray hits the machine and causes a bit flip.
S4 The code is running on a GPU (or even CPU) that does not, in fact, always produce the same answer.
S5 The code is running on a parallel, distributed, or multi-threaded system and there is indeterminacy, a race condition, possible deadlock or livelock, or any number of other things that might cause indeterminate behaviour.
S6 Your code assumes something is deterministic or has specified behaviour that is in fact not deterministic or specified, especially if that result is the same most but not all of the time, e.g. tie-breaking in sorts, order of extraction from sets or (unordered) dictionaries, or the order in which results arrive from asynchronous calls.4
S7 Your code relies on something likely but not certain, e.g. that two randomly-generated, fairly long IDs will be different from each other.
S8 Your code uses random numbers and does fix the main seed, but the sequence of random numbers has changed. This has happened with NumPy, where they realised that one of the sampling functions was drawing unnecessary samples from the PRNG. In making the sampler more efficient, they changed the samples that were returned for the same PRNG seed. [Contributed by Rob Moss (@rob_models and @rob_models@mas.to), who "had a quick search for the relevant issue/changelog item, but it was a long time ago (~NumPy 1.7, maybe)." He "couldn't find the original NumPy issue, but here's a similar one: https://github.com/numpy/numpy/issues/14522". Thanks, Rob!]
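
A small illustration of S2: fixing the seed for Python's own random module does nothing for NumPy's generator (and vice versa), so a test can look seeded while still being non-deterministic:

# S2: seeding Python's random module does not seed NumPy's generator.
import random

import numpy as np

random.seed(42)
# np.random.seed(42)  # without this, the NumPy draw below is unseeded

print(random.random())     # the same on every run
print(np.random.random())  # different on each run unless NumPy is seeded too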

It Never Worked (or didn't work when you thought it did)

[Added 2024-07-19]

I realised there's another whole class of errors of process/errors of interpretation that could lead us to think that code has “rusted” despite not having been changed. These are all broadly the same as one of the explanations offered before, but now for the original run when you thought it worked, rather than for the current or new run, when it fails.

N1 You thought you ran the code before, and that it worked correctly, but you are mistaken: you didn't run it at all, or it in fact failed but you did not notice.
N2 You did run the code before, but picked up the output from a previous state, before you broke it, when it did work.
N3 You did run the code before, and it did produce the wrong output then as now, but you used a defective procedure or tool to examine the output then, and failed to realise it was wrong/failing.
N4 You did run the code before, and it did pass, but you passed the wrong parameters/inputs/whatever and are now passing the correct (or different) parameters/inputs/whatever so it now fails as it would have done then if you had done the same.


  1. If you can think of other reasons code rusts, do let me know and I'll be happy to expand this list (and attribute it, of course).

  2. Touching a file (the unix touch command) updates the last update date on a file without changing its contents. 

  3. For this reason, a lot of people prefer to run python -m pip rather than pip, because this way you can have greater confidence that the module is getting installed in the site-packages for the version of python you're actually running. 

  4. Most of these kinds of indeterminacy will, in fact, usually be stable given identical inputs on the same machine running the same software, but it can take very little to change that, and should not be relied upon. 


Flat Files (a.k.a. CSV files)

Posted on Fri 16 July 2021 in TDDA • Tagged with data

This week, a client I'm working for received a large volume of data, and as usual the data was sent as "flat" files—or CSV (comma-separated values1) files, as they are more often called. Everyone hates CSV files, because they are badly specified, contain little metadata and are generally an unreliable way to transfer information accurately. They continue to be used, of course, because they are the lowest-common-denominator format and just about everything can read and write them in some fashion.

Some of the problems with CSV files are well captured in a pithy blog post by Jesse Donat entitled Falsehoods Programmers Believe about CSVs.

Among other things, the data we received this week featured:

  • unescaped commas in unquoted (comma-separated) values;
  • an unspecified non-UTF-8 encoding that also did not appear to be iso-8859-1 ("latin-1" to its friends), nor indeed iso-8859-15 ("latin-9");
  • different null markers in different fields and, in some cases, different null markers in a single field;2
  • field names (column headers) that included spaces, apostrophes, dashes and (in at least one case) a non-ASCII non-alphanumeric character;
  • multiple date formats, even within a single field, including some dates with three-digit years.

All of this is a bit frustrating, but far from unusual, and only one of these problems was actually fatal—the use of unquoted, unescaped separators in values, which makes the file inherently ambiguous. I'm almost sure this data was written but not read or validated, because I don't believe the supplier would have been able to read it reliably either.

Metadata

In an ideal world, we'd move away from CSV files, but we also need to recognise not only that this probably won't happen, but that the universality, plain-text nature, grokkability and simplicity of CSV files are all strengths; for all that we might gain using fancier, better-specified formats, we would lose quite a lot too, not least the utility of awk, split, grep and friends in many cases.

So if we can't get away from CSV files, how can we increase reliability when using them? Standardizing might be good, but again, this is going to be hard to achieve. What we might be able to do, however, is to work towards a way of specifying flat files that at least allows a receiver of them to know what to expect, or a generator to know what to write. I've been involved with a few such ideas over the years, and the software my company produces (Miró) uses its own non-standard, XML-based way of describing flat files.

What I'm thinking about is trying to produce something more general, less opinionated, and more modern (think JSON, rather than XML, for starters) that addresses more issues. The initial goal would be simply descriptive—to allow a metadata file to be created that accurately describes the specific features of a given flat file so that a reader (human or machine) knows how to interpret it. Over time, this might grow into something bigger. I think obvious things to do after the format is created include:

  1. In my case, getting Miró to accept these in place of3 its current XML-based files when reading (or writing) flat files. (Initially, at least, Miró would not be able to read or write all files that could be specified in this way, but could at least warn the user when it couldn't.)
  2. Also getting the Python tdda library to be able to use this when using CSV files for input (and perhaps also for output).
  3. Writing an "argument generator" for some of the standard (Python) CSV readers and writers to set the read/write options to be consistent with a given metadata description, and then probably to provide wrapped versions of those readers/writers that can accept a path for a CSV file and a path to a metadata file and use the underlying CSV reader or writer to read or write the file using that specification. (A sketch of this idea follows the list.)
  4. Writing (yet another) "smart" reader to try to read any old CSV files (using heuristics) and write out a metadata file that appears to match the data provided. This could not possibly work completely reliably because of all the inherent ambiguity in flat files already alluded to, but an "80%" solution for real-world files should certainly be achievable, as many programs make a reasonable job of handling arbitrary CSV files already.
  5. Writing a validator to confirm whether a given CSV file is consistent with the specification in the metadata file.
  6. Incorporating such a flat-file validator into TDDA so that it can check not only the (semantic) content of a dataset, but also the syntactic/formatting validity of data, confirming that it has been or can be read correctly.4
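
As a sketch of item 3: the "argument generator" is essentially just a mapping from the metadata description to the options that Python's csv module (or pandas) already accepts. The metadata keys used here (separator, quote-char, escape-char, quote-stuttering, encoding) and the file names are purely illustrative, not a proposed specification:

# A sketch of an "argument generator" mapping a (hypothetical) flat-file
# metadata description onto options for Python's csv.reader.
import csv
import json


def csv_reader_args(metadata):
    args = {
        'delimiter': metadata.get('separator', ','),
        'quotechar': metadata.get('quote-char', '"'),
    }
    escape = metadata.get('escape-char')
    if escape:
        args['escapechar'] = escape
        args['doublequote'] = False   # escape character rather than quote stuttering
    else:
        args['doublequote'] = metadata.get('quote-stuttering', True)
    return args


# Usage: read a CSV file with options derived from its metadata description.
with open('example-metadata.json') as f:
    metadata = json.load(f)
with open('example.csv', newline='', encoding=metadata.get('encoding', 'utf-8')) as f:
    for row in csv.reader(f, **csv_reader_args(metadata)):
        print(row)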

Together, a smart reader that generates a metadata file for a CSV file (item 4 above) and a validator that validates a CSV file against such a metadata specification (item 5) are very analogous to the current constraint discovery and data verification, respectively, but in the space of CSV files—roughly, "syntactic" conformance—rather than data (or "semantic") correctness.

Miró's Flat File Description format (XMD Files)

Here is an example, from its documentation, of the XMD data files that Miró uses.

<?xml version="1.0" encoding="UTF-8"?>
<dataformat>
    <sep>,</sep>                     <!-- field separator -->
    <null></null>                    <!-- NULL marker -->
    <quoteChar>"</quoteChar>         <!-- Quotation mark -->
    <encoding>UTF-8</encoding>       <!-- any python coding name -->
    <allowApos>True</allowApos>      <!-- allow apostrophes in strings -->
    <skipHeader>False</skipHeader>   <!-- ignore the first line of file -->
    <pc>False</pc>                   <!-- Convert 1.2% to 0.012 etc. -->
    <excel>False</excel>             <!-- pad short lines with NULLs -->
    <dateFormat>eurodt</dateFormat>  <!-- Miró date format name -->
    <fields>
        <field extname="ext1" name="int1" type="string"/>
        <field extname="ext2" name="int2" type="int"/>
        <field name="int3" type="real"/>
        <field extname="ext4" type="date"/>
        <field extname="ext5" type="date" format="rdt"/>
        <field extname="ext6" type="date" format="rd"/>
        <field extname="ext7" type="bool"/>
    </fields>
    <requireAllFields>False</requireAllFields>
    <banExtraFields>False</banExtraFields>
</dataformat>

Three things to note immediately about this:

  • I'm not presenting this as the solution: the XMD format is now rather out of vogue and there are a number of things I would definitely do differently fifteen years on (such as using more standard names for types and more standard date format specifiers).
  • The XMD format is slightly more than just a flat file description, in that it contains a couple of things that are more about how to interpret and handle the data after reading, rather than simply describing the data.
  • The XMD file supports the notion of two different names for a field. The extname is the name in the CSV file (the external name), while the name is the name for Miró to use for the field. The semantics of this are slightly complicated, but allow for renaming of fields on import, and for naming of fields where there is no external name, or external names are repeated, or the external name is otherwise unusable by Miró. If the CSV file has a header and each field has a different name in the header, the order of the fields in the XMD file does not matter, but if there are missing or repeated field names, Miró will use the field order in the XMD file.

Notwithstanding the amazing variety seen in CSV files, as illuminated by Jesse Donat's aforementioned blogpost, most CSV files from mature systems vary only in the ways covered by a few of the items described in the XMD file. The most important things to know about a flat file overall are normally:

  • Encoding. The file encoding—these days, most commonly UTF-8.
  • Separator. The separator character—most commonly a comma (,), but pipe (|), tab and semicolon (;) are also frequently used.
  • Quoting. What character is used to quote strings (if any). There are quite a number of subtleties here (not all capable of being expressed in the XMD file) including:
    • Are all strings quoted or just some (e.g. ones containing the field separator)?
    • Are non-string values (e.g. numbers) quoted too?5
    • Are missing values (NULL) quoted?6
  • Missing Values. How are missing values (NULLs) denoted in the file, should there be any?
  • Escaping. How are characters "escaped"? This really covers a set of different issues, and the XMD file is not rich enough to cover all possibilities. One aspect is, when strings are quoted, how are quotes in the string handled? The most common answers are either by preceding them with an escape character, usually backslash (\), e.g.

    "This is an escaped \" character in a string"
    

    or by stuttering:

    "This is a stuttered "" character in a string"
    

    Escaping is also a way of including the separator in non-quoted values, like these display prices:

    Price,DisplayPrice
    100.0,£100.00
    1000.0,£1\,000.00
    1000000,£1\,000\,000.00
    

    Escaping is also a way of specifying some special characters, e.g. \n for a newline, \t for a tab etc., and as a result when an actual backslash is required it is self-escaped (as \\).

  • Row Truncation after the last non-null value. Are rows in which the last value is missing truncated? Like many CSV writers, Excel writes missing values as blanks so that 1,,3 is read as 1 for the first field, a missing value for the second field and 3 for the third field. More quirkily, when Excel writes out CSV files, if there are n columns and the last m of them on a row are missing, Excel will write out only the non-missing values, and no further separators, so that there will be only n – m values on that line and only n – m – 1 separators. This behaviour is hard to describe and (as far as I know) unique to Excel, so in the XMD file this is simply marked as <excel>True</excel>.7 (A minimal padding sketch follows this list.)

  • Header handling. Although the common case is for CSV files to have a single line at the start with the field names, sometimes there is no such line, and sometimes there are multiple lines before the data (one or more of which may specify the field names). As a minimum, a metadata description needs to be able to specify whether there is a header line, and ideally how many such lines there are and how headers should be extracted from them. If there are no headers, the specification should probably specify the field names. (Miró imaginatively calls the fields Field1 to FieldN if no fieldnames are available in the flat file or any XMD file.)
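
For the Excel-style truncation described above, a reader's job is simply to pad short rows back out to the full field count. A minimal sketch (the field count and null marker would come from the metadata description, and real files need the other dialect options handled too):

# Pad rows that Excel truncated after the last non-missing value back out
# to the full field count (the <excel>True</excel> behaviour).
import csv


def read_excel_style(path, nfields, null=''):
    with open(path, newline='') as f:
        for row in csv.reader(f):
            if len(row) < nfields:
                row = row + [null] * (nfields - len(row))
            yield row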

Per-Field Information

It's always useful and sometimes necessary to specify field types, and as discussed above, sometimes field names. Typing is almost always ambiguous, and such ambiguity is increased if there are any bad values in the data. Moreover, in some cases (especially dates and timestamps), it is useful to specify the date format. Although good flat-file readers generally make a reasonable job of inferring types, and often date formats too, it is clearly helpful for a metadata specification to include these.

Just as date formats can vary between fields, other things can vary too, most obviously null indicators (missing value information), quoting and escaping. Moreover, if numeric data is formatted (e.g. including currency indicators, thousand separators etc.) these can all usefully be specified.

Required/Allowed Fields

The final pair of settings in the XMD file look slightly different from the others, partly because they are phrased as directives rather than descriptions. requireAllFields, when set, is a directive to Miró to raise a warning or an error if any of the fields in the XMD file are not present in the CSV file. Similarly, banExtraFields is a directive to raise such a warning or error if any fields are found in the CSV file that are not listed in the XMD file. Miró has several ways to specify whether infringements result in warnings or errors.

These directives can, however, be recast as declarations. The banExtraFields directive, when true, can equally be thought of as a declaration that the field list is complete. Similarly, the requireAllFields directive, when true, can be thought of as a declaration that the field list is not just describing types and formats for fields that might be in the CSV file, but rather that all fields listed are actually in the file.8

In principle, I think it would probably be better if these descriptions were more obviously descriptive or declarative, but I am struggling to find a pair of words/phrases that would capture that elegantly. At this point I am tempted to retain their imperative nature but make them slightly more symmetrical, perhaps with:

"require-all-fields": true,
"allow-extra-fields": false

Alternatively a more declarative syntax might be something like:

"csv-file-might-omit-fields": false,
"csv-file-might-include-extra-fields": false

The reader might wonder why the fields in the metadata file would ever not correspond exactly to those in the file. In practice, it is not uncommon when dealing with relatively "good" CSV files to write an XMD file that specifies types and formats only for fields that trip up the flat-file reader. Conversely, it can be useful to have XMD files that describe a variety of possible files that share field names and types; in those cases, the extra ones do no harm.

What Might a Metadata File Look Like?

The XMD file gets quite a lot of things right:

  • As XML, it's a standard format that's easy to read, though today JSON is clearly more popular for this sort of use. (It would be fairly easy to allow a common format to be expressed in JSON, XML or YAML, but there's something to be said for a single format, probably JSON.)
  • All of the most fundamental overall properties are represented—encoding, separator, null marker, escape characters, and date format.
  • There's a separation between the overall file properties and the per-field properties, with the ability to specify the actual fieldname in the file, the field type and, in the case of date fields, custom formats on a per-field basis, if necessary.
  • It gives enough information to allow Excel-style truncated lines to be read successfully.

There are also a few major shortcomings:

  • The single escape character specification covers multiple things.
  • There is no explicit support for quote stuttering (which is fairly common).
  • The format does not recognise multiple headers.
  • The format does not provide any way to specify non-date field formats such as boolean specifiers, possible thousand separators and decimal point markers.
  • The format assumes a single NULL indicator for all fields, and that there is only one kind of missing value.
  • The date formats supported are not comprehensive and are not expressed in a standard way.
  • Type specifiers are also somewhat non-standard.
  • XMD files fail to recognize the possibility that null markers are quoted, and implicitly assume that any empty string is distinct from a missing string value. This is probably too opinionated.

Some of these shortcomings reflect the fact that the XMD format was conceived less as a general-purpose flat-file descriptor than a specification as to how Miró should read or write a given flat file, and also a way for Miró to specify how it has written a flat file.

Essentially, I think a good flat-file description format would preserve the good aspects and remedy the faults identified, as well as providing a mechanism for specifying some more esoteric possibilities not mentioned so far.

I'll propose something concrete in subsequent posts.


  1. Sometimes the separator in a flat file is a character other than a comma, and you occasionally see .tsv used as an extension when the separator is a tab character, or .psv when the separator is a pipe character (|). Often, however, a csv extension is still used, and as a result the acronym CSV is sometimes restyled as character-separated values. I had always heard this extension attributed to Microsoft, but have been unable to verify this.

  2. To be fair, the notion of different kinds of missing values is reasonable—missing because it wasn't recorded, missing because it was unreadable, missing because it's an undefined result (e.g. mean of no values) etc. But this wasn't that: it was just multiple ways of denoting generic missing values. 

  3. by which, of course, I mean as well as ... 

  4. There's an interesting question as to whether the CSV format specification should be incorporated as an optional part of a TDDA file, and if so, whether it should simply be a nested section or whether the field-specific components should be merged with TDDA's field sections. There are pros and cons. 

  5. Yes, some systems do this. 

  6. I know, madness! But such practices occur! 

  7. Maybe it should have been called quirks mode.

  8. Miró's slightly extended version of TDDA files includes lists of required and allowed fields, which serve a similar purpose to these settings.