TOMLParams: TOML-based parameter files made better
Posted on Sun 16 July 2023 in TDDA • Tagged with TDDA, reproducibility
TOMLParams is a new open-source library that helps Python developers to externalize parameters in TOML files. This post will explain why storing parameters in non-code files is beneficial (including for reproducibility), why TOML was chosen, and some of the useful features of the library, which include structured sets of parameters using TOML tables, hierarchical inclusion with overriding, default values, parameter (key) checking, optional type checking, and features to help use across programs, including built-in support for setting parameters using environment variables.
The Benefits of Externalizing Parameters
Almost all software can do more than one thing, and has various parameters that are used to control exactly what it does. Some of these parameters are set once and never changed (typically configuration parameters, that each user or installation chooses), while others may be changed more often, perhaps from run to run. Command-line tools often accept some parameters on the command line itself, most obviously input and output files and core parameters such as search terms for search commands. On Unix and Linux systems, it's also common to use command line "switches" (also called flags or options) to refine behaviour. So for example, the Unix/Linux grep tool might be used in any of the following ways:
grep time # find all lines including 'time' on stdin
grep time p*.txt # ... on .txt files starting with 'p'
grep -i time p*.txt # ... ignoring capitalization (-i switch)
All of time, p*.txt and -i are examples of command-line parameters.
Many tools also use configuration files to control the behaviour of the software. On Unix and Linux, these are typically stored in the user's home directory, often in 'dot' files, such as ~/.bashrc for the Bash shell, ~/.emacs for the Emacs editor and ~/.zshrc for the Z Shell.
These files are sometimes in proprietary formats, but increasingly often are in "standard" formats such as .ini (MS-DOS initialization files), JSON (JavaScript Object Notation), YAML (YAML Ain't Markup Language), or TOML (Tom's Obvious Minimal Language).
When writing code, it's often tempting just to set parameters in code.
- The quickest, dirtiest practice is just to have parameter values "hard-wired" as literals in code where they are used, possibly with the same values being repeated in different places. This is generally frowned upon for a few reasons:
  - Hard-wired parameters are hard to find, and therefore hard to update and inspect;
  - There may be repetition, violating the don't repeat yourself (DRY) principle, and leading to possible inconsistency (though see also the countervailing rule of three (ROT) and the write everything in threes (WET) principle);
  - If you want to change parameters, you have to go to many places;
  - There may be no name associated with the parameter, making it hard to know what it means.
So implementing an energy calculation using Einstein's formula E = mc², we would probably prefer either:
def energy(mass):
    """Return energy in joules from mass in kilograms"""
    return mass * constants.speed_of_light * constants.speed_of_light
or
def energy(m):
    """Return energy in joules from mass m in kilograms"""
    c = constants.c  # the speed of light
    return m * c ** 2
(for a physicist) to
def energy(m):
    return m * 3e8 * 3e8
or (worse!)
def energy(m):
    return m * 9e16
- For something like the speed of light (which is, after all, a universal constant), setting the value in code is probably appropriate, so we might have a file called constants.py containing something like:
# constants.py: physical constants
speed_of_light = 299_792_458 # metres per second, in vacuo. (c. 300 000 km/s)
h_cross = 1.054_571_817e-34 # reduced Planck constant in joule seconds (= h/2π)
- For parameters that change in different situations, or for different users, it's usually much better practice to store the parameters in a separate file, ideally in a common format for which there is good library support.
Some of the advantages of this include:
- Changing parameters then does not require changing code. Code may be unavailable to the user (e.g. write-locked, compiled, or running on a remote server through an API). Moreover, editing code often requires more expertise and confidence than editing a parameter file.
- The same code can be run multiple times using different parameters at the same time; this can be more challenging if the parameters are "baked into" the code.
- It becomes easy to maintain multiple sets of parameters for different situations, allowing a user simply to choose a named set on each occasion (usually through the parameter file's name). Of course, the code itself can have different named sets of parameters, but this is less common, and usually less transparent to the user, and harder to maintain.
Why TOML?
There is nothing particularly special about TOML, but it is well suited as a format for parameter files, with a sane and pragmatic set of features, good library support in most modern programming languages and rapidly increasing adoption.
Here's a simple example of a TOML file:
# A TOML file (This is a comment)
start_date = 2024-01-01 # This will produce a datetime.date in Python
run_days = 366 # an integer (int)
tolerance = 0.0001 # a floating-point value (float)
locale = "en_GB" # a string value
title = "2024" # also a string value
critical_event_time = 2024-07-31T03:22:22+03:00 # a datetime.datetime
fermi_approximate = true # a boolean value (bool)
[logging] # A TOML "table", which acts like a section, or group
format = ".csv" # Yet another string
events = [ # An array (list)
"financial",
"telecoms",
]
title = "2024 events" # The same key as before, but this one is
# in the logging table, so does not conflict
# with the previous title
Strengths of TOML as a parameter/configuration format
- Ease of reading, writing and editing ("obviousness").
- Lack of surprises: there are few if any weirdnesses in TOML (which is less true of the other main competing formats).
- Good coverage of the main expected types: TOML supports (and differentiates between) booleans, integers, floating-point values, strings, ISO8601-formatted dates and timestamps (with and without timezones, and mapping to datetime.date and datetime.datetime in Python), arrays (lists), key-value pairs (dictionaries), and hierarchical sections.
- It supports comments (which begin with a hash mark, #).
- Support in most modern languages, including C#, C++, Clojure, Dart, Elixir, Erlang, Go, Haskell, Java, Javascript, Lua, Objective-C, Perl, PHP, Python, Ruby, Rust, Swift, Scala.
- Flexible quoting with support for multiline strings, and no use of bare (unquoted) strings (avoiding ambiguity)
- Well-defined semantics without becoming fussy and awkward to edit (e.g., being unfussy about trailing commas in arrays).
- TOML is specifically designed to be a configuration file format. Quoting the front page of the website: "TOML aims to be a minimal configuration file format that: is easy to read due to obvious semantics; maps unambiguously to a hash table; is easy to parse into data structures in a wide variety of languages."
- In the context of Python, TOML is already being adopted widely, most notably in the increasingly ubiquitous pyproject.toml files. There are good libraries available for reading (tomli) and writing (tomli_w) TOML files, although these are not part of the standard library. (From Python 3.11 onwards, the standard library does include tomllib, for reading TOML files.)
Weaknesses of TOML as a parameter/configuration format
We don't think TOML has any major weaknesses for this purpose, but points that might count against it for some include:
- TOML does not have a null (NULL/nil/None/undefined) value. (TOMLParams could address this, but has no current plans to do so.)
- Hierarchical sections ('tables') in TOML are not nested. So if you want a section/table called behaviours and subsections/subtables called personal and business, in TOML this might be represented by something like the excerpt below (possibly with a [behaviours] table with its own parameters as well).
[behaviours.personal]
frequency = 3
[behaviours.business]
frequency = 3
- Some people, however, really don't like TOML.
Why not JSON, YAML, .ini...
We chose TOML because we think, overall, it is better than each of the obvious competing formats, but the purpose of this post isn't to do down those formats, which all have their established places. But we will comment briefly on the most obvious alternatives.
JSON, while good as a transfer format and very popular in web services, is tricky for humans to write correctly, even with editor support, because it requires correct nesting and refuses to accept trailing commas in lists and dictionaries. The lack of support for dates and timestamps is also a frequent source of frustration, with quoted strings typically being used instead, with all the concomitant problems of that approach.
At first glance, YAML appears more suitable as a configuration/parameter-file format, but it is the opposite of obvious and, in practice, often produces unexpected results. Sources of frustration include no requirement to quote strings, "magic values" like yes (which maps to true) and no (which maps to false, much to the annoyance of Norwegians and anyone using the NO country code), inadvertent coercion of numbers with leading zeroes to octal (in YAML 1.1), whitespace sensitivity, and issues around multi-line strings.
.ini files look a lot like TOML (I would guess they were a major inspiration for it) but are much simpler and less rich, have less well-defined syntax, have fewer types and don't require quoting of strings, leading to ambiguity.
What are the Key Features of TOMLParams?
Simple externalization of parameters in one or more TOML files
You write your parameters in a TOML file and pass the name of the TOML file to an instance of the TOMLParams class. It reads the parameters from the file and makes them available both as attributes on the object and through dictionary-style look-up, i.e. if your TOMLParams instance is p and you have a parameter startdate, you can access it as p.startdate or p['startdate'] (and, more pertinently, also as p[k] if k is set to 'startdate').
If you use tables, you can "dot" your way through to the parameters (p.behaviours.personal.frequency) or use repeated dictionary lookups (p['behaviours']['personal']['frequency']).
Loading, saving, default values, parameter name checking and optional parameter type checking
With TOMLParams, the set of parameters that exist is defined by their default values. You can either store your defaults in a TOML file (e.g. defaults.toml) or pass them to the TOMLParams initializer as a dictionary.
If you choose a different TOML file, all the parameter values are first set to their default values, and then any parameters set in the file you specify override those. New parameters (i.e. any not listed in defaults) raise an error.
If you turn on type checking, the library will check that the types of all parameter values provided match those of the defaults, and you choose whether mismatches result in a warning (which doesn't raise an exception) or an error (which does).
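The defaults-then-override logic, with key checking and optional type checking, can be sketched in a few lines. This is an illustration of the idea only: apply_overrides and its type_errors flag are hypothetical names, and the exact rules TOMLParams applies (for example, whether an int is acceptable where the default is a float) are not assumed here; this sketch simply requires identical types.

```python
import warnings


def apply_overrides(defaults, overrides, type_errors=True):
    """Return a copy of defaults updated with overrides (hypothetical helper).

    Unknown keys always raise; type mismatches raise if type_errors is
    true and otherwise only produce a warning.
    """
    result = dict(defaults)
    for key, value in overrides.items():
        if key not in defaults:
            raise KeyError(f"Unknown parameter: {key}")
        default = defaults[key]
        if isinstance(default, dict) and isinstance(value, dict):
            # Tables are merged key by key, recursively
            result[key] = apply_overrides(default, value, type_errors)
            continue
        if type(value) is not type(default):
            msg = (f"{key}: expected {type(default).__name__}, "
                   f"got {type(value).__name__}")
            if type_errors:
                raise TypeError(msg)
            warnings.warn(msg)
        result[key] = value
    return result


defaults = {'run_days': 365, 'tolerance': 0.001, 'logging': {'format': '.csv'}}
params = apply_overrides(defaults,
                         {'run_days': 366, 'logging': {'format': '.json'}})
print(params)
```

A misspelt key such as rundays raises a KeyError here, which is the kind of mistake that parameter name checking is there to catch.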
Hierarchical file inclusion with overriding
Perhaps the most powerful feature of TOMLParams is the ability for one parameters file to 'include' one or more others. If you use TOMLParams specifying the parameters file name as base, and the first line of base.toml is
include = 'uk'
then parameters from uk.toml will be read and processed before those from base.toml. So all parameters will first be set to default values, then anything in uk.toml will override those values, and finally any values in base.toml will override those.
The include statement can also be a list:
include = ['uk', 'metric', '2023']
Inclusions are processed left to right (i.e. uk.toml is processed before metric.toml, followed by 2023.toml), followed by the parameters in the including file itself. So in this case, if defaults are in defaults.toml and the TOML file specified is base.toml, the full sequence is:
defaults.toml
uk.toml
metric.toml
2023.toml
base.toml
Of course, the included files can themselves include others, but the library keeps track of this and each file is only processed once (the first time it is encountered), which prevents infinite recursion and repeated setting.
This makes it very easy to maintain different kinds and groups of parameters in different files, and to create a variation of a set of parameters simply by making the first line include whatever TOML file specifies the parameters you want to start from, and then override the specific parameter or parameters you want to be different in your new file.
Support for writing consolidated parameters as TOML after hierarchical inclusion and resolution
The library can also write the final parameters used out as a single, consolidated TOML file, which is useful when hierarchical inclusion and overriding are used, and keeps a record of the final values of all parameters. This helps with reproducibility and logging.
Support for using environment variables to select parameter set (as well as API)
You can choose how to specify the name of the parameters file to be used, and the name to which it should default. If you create a TOMLParams instance with:
params = TOMLParams(
    defaults='defaults',
    name='newparams',
    standard_params_dir='.'
)
default parameters will be read from ./defaults.toml and then ./newparams.toml will be processed, overriding default values.
If you want to specify the name of the parameters file to use on the command line, the usual pattern would be something like:
import sys
from tomlparams import TOMLParams

class Simulate:
    def __init__(self, params, ...):
        ...

if __name__ == '__main__':
    # Use the first command-line argument as the parameters name, else 'base'
    name = sys.argv[1] if len(sys.argv) > 1 else 'base'
    params = TOMLParams('defaults.toml', name)
    sim = Simulate(params, ...)
Sometimes, however, it's convenient to use an environment variable to set the name of the parameters file, particularly if you want to use the same parameters in multiple programs, or run from a shell script or a Makefile. You can specify an environment variable to use for this and TOMLParams will inspect that environment variable if no name is passed. If you choose SIMPARAMS for this and say
params = TOMLParams('defaults.toml', env_var='SIMPARAMS')
it will look for a name in SIMPARAMS in the environment, which you can set with
SIMPARAMS="foo" python pythonfile.py
or
export SIMPARAMS="foo"
python pythonfile.py
If it's not set, it will use base.toml as the file name, or something else you choose with the base_params_stem argument to TOMLParams.
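The order of precedence described above (explicit name, then environment variable, then the base name) can be sketched as follows. The choose_params_name function is hypothetical, written only to show the fallback logic; it borrows the env_var and base_params_stem argument names from the text but is not part of the TOMLParams API.

```python
import os


def choose_params_name(name=None, env_var=None, base_params_stem='base'):
    """Pick a parameters-file stem: explicit name, then env var, then base.

    Hypothetical helper illustrating the fallback order only.
    """
    if name:
        return name
    if env_var and os.environ.get(env_var):
        return os.environ[env_var]
    return base_params_stem


os.environ['SIMPARAMS'] = 'foo'
print(choose_params_name(env_var='SIMPARAMS'))         # → foo
print(choose_params_name('bar', env_var='SIMPARAMS'))  # → bar
del os.environ['SIMPARAMS']
print(choose_params_name(env_var='SIMPARAMS'))         # → base
```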
Check it out
You can install TOMLParams from PyPI in the usual way, e.g.
python -m pip install -U tomlparams
The source is available from GitHub (under an MIT license), at github.com/smartdatafoundry/tomlparams.
There's a README.md and documentation is available on Read the Docs at tomlparams.readthedocs.io.
You can get help within Python on the TOMLParams class with
>>> import tomlparams
>>> help(tomlparams)
After installing tomlparams, you will find you have a tomlparams command, which you can use to copy example code from the README:
$ tomlparams examples
Examples copied to ./tomlparams_examples.
You can also get help with tomlparams help:
$ tomlparams help
TOMLParams
USAGE:
tomlparams help — show this message
tomlparams version — report version number
tomlparams examples — copy the examples to ./tomlparams_examples
tomlparams test — run the tomlparams tests
Documentation: https://tomlparams.readthedocs.io/
Source code: https://github.com/smartdatafoundry.com/tomlparams
Website: https://tomlparams.com
Installation:
python -m pip install -U tomlparams