Learning the Hard Way: Regression to the Mean

Posted on Thu 20 June 2024 in TDDA • Tagged with TDDA, reproducibility, errors, interpretation

I was at the tenth PyData London Conference last weekend, which was excellent, as always. One of the keynote speakers was Rebecca Bilbro who gave a rather brilliant (and cleverly titled) talk called Mistakes Were Made: Data Science 10 Years In.

The title is, of course, a reference to the tendency many of us have to be more willing to admit that mistakes were made, than to say "I made mistakes". So I thought I'd share a mistake I made fairly early in my data science career, probably around 1996 or 1997. This is not one of those interview-style "I-sometimes-worry-I'm-a-bit-too-much-of-a-perfectionist"-style admissions that we have all heard; this one was bad.

My company at the time, Quadstone, was under contract to analyse a large retailer's customer base for relationship marketing, using loyalty-card data. We had done all sorts of work in the area with this retailer, and one day the relationship manager we were working with decided that it would be good to incentivise more spending among the retailer's less active, lower spending customers. This is fairly standard. The idea was to set a reasonably high, but achievable, target spend level for each of these customers over a period of a few weeks. Those customers who hit their individual target would receive a large number of loyalty points worth a reasonable amount of money.

We had been tracking spend carefully, placing customers on a behavioural segmentation, and had enough data that we felt relatively confident we knew what be good indivualized stretch goals for customers (wrongly, as events would prove). We set the targets at levels such that retailer should break even if people just met them (foregoing profit, but not losing much money), and estimated how many people we thought would hit the target if the campaign did not have much effect, and then estimated volumes and costs for various higher levels of campaign success.

I'm sure many of you can already see how this will go, and even more of you will have been attuned to the problem by the title of this post. We, however, we had not seen the title of this post, and although I knew about the phenomenon of regression to the mean, I had not really internalized it. I didn't know it in my bones. I had not been bitten by it. I did not see the trap we were walking into.

As Confucious apparently did not say:

I hear and I forget. I see and I remember. I do and I understand.

— probably not Confucious; possibly Xunzi.

Well, I certainly now understand.

On the positive side, our treated group increased its level of spend by a decent amount, and a large number of the group earned many extra loyalty points. I don't believe we had developed uplift modelling at the time of this work, we were very aware that we needed a randomized control group in order to understand the behaviour change we had driven, and we had kept one. To our dismay, the level of spend in the control group, though lower than that in the treated group, also increased quite considerably. In fact, it increased enough that the return on investment for the activity was negative, rather than positive. It was at this point (just before admitting to the client what had happened, and negotiating with them about exactly who should shoulder the loss1) a little voice in my head started saying regression to the mean, regression to the mean, regression to the mean, almost like a more analytical version of Long John Silver's parrot.

So (for those of you who don't know), what is regression to the mean? And why did it occur in this case? And why should we, in fact, have predicted that?

Allow me to lead you through the gory details.

Background: Control Groups

We all know that marketers can't honestly claim the credit for all the sales from people included in a direct marketing campaign, because (in almost all circumstances) some of them would have bought anyway. As with randomized control trials in medicine, in order to understand the true effect of our campaign, we need to divide our target population, uniformly at random,2 into a treatment group, who receive the marketing treatment in question, and a control group, who remain untreated. The two groups do not need to be the same size, but both need to be big enough to allow us to measure the outcome accurately, and indeed to measure the difference between the behaviour of the two groups. This is slightly problematical, because we don't know the effect size before we take action. Happily, however, if the effect is too small to measure, it is pretty much guaranteed to be uninteresting and not to achieve a meaningful return on investment, so we can size the two groups by calculating the minimum effect we need to be able to detect in order to achieve a sufficiently positive ROI.

The effect size is the difference between the outcome in the treated group and the control group—usually a difference in response rate, for a binary outcome, or a difference in a continuous variable such as revenue. Things become more interesting when there are negative effects in play, which is sometimes the case with intrusive marketing or when retention activity is being undertaken. There can be negative effects for a subpopulation or, in the worst cases, for the population as a whole. When these happen, a company is literally spending money to drive customers away, which is usually undesirable.

Let's suppose, for simplicity, that we have selected an ideal target population of 2 million and we mail half of them (chosen on the toss of a fair coin) and keep the other 1 million as controls. If we then send a motivational mailing to the 1 million encouraging them to spend more, with or without an incentive to do so, we can measure their average weekly spend in a pre-period (say six weeks) and their average weekly send in a post-period, which for simplicity we will also take to be six weeks. In this case, we will assume that there was no financial incentive: it was simply a motivation mail along the lines of "we're really great: come and give us more of your money". (Good creatives would craft the message more attractively than this.) Let's suppose we do this and that the results for the treated group of one million are as follows:


BeforeAfter
£50 £60

This is not enough information to say whether the campaign worked, and the reason has nothing to with statistical errors: at the scale of 1 million, you can guarantee the errors will be insignificant. It's also not primarily because we are measuring revenue rather than profit, nor because we haven't taken into account the cost of action (though those are things we should do). We can see that our 1 million customers spent, on average £10 per week more in the post-period than in the pre-period (a cool £60m in increased revenue over six weeks), but we don't know about causality: did our marketing campaign cause the effect?

To answer this, we need to look at what happened in the control group.


BeforeAfter
Mailed (Treated) £50 £60
Unmailed (Control) £50 £55

We immediately see that the spend in the pre-period was the same in both groups, as must be the case for a proper treatment-control split. We also seee that the spend in the control group rose to £55.

We now have enough evidence to say for with very high degree of confidence that the treatment caused a £5 per week increase in spend, but that some other factor—perhaps seasonality, TV ads, or a mis-step by our competitors—caused the other £5 increase per week across the larger population of 2 million customers. I should emphasize that this is a valid conclusion regardless of how the population of 2 million was chosen, as long as the treatment-control split was unbiased.

Behavioural Segmentation

Now let us take a similar treatment group drawn uniformly, at random, from a larger population of 10 million, We segment the treatment population by average weekly spend in the pre-period and plit the increase or decrease in spend, between the pre- and post-periods in each segment for our treatment group. The graph below shows a possible outcome.

A bar graph showing a split of the treated population
          using average spend bands for the pre-period
          of £0, and £0.01 to £10, and ten-pound intervals
          up to £90, and finally a bar for over £90.
          The vertical scale is the change in spend between
          the pre- and post periods, quantified by the difference
          between them (post-spend minus pre-spend).
          The bars decrease monotonically, with the £0 group
          increasing spend by about £15, and these increases
          dropping to zero for the £60--70 per week group,
          and being negative to the tune of about £10 a week
          for those spending over £90 in the pre-period.

For people who are not steeped in regression to mean, this graph may appear somewhat alarming. Depending on the distribution of the population, this might well represent an overall increase in spending (since probably more of the people are on the left of the graph, where the change in spend is positive). But I can almost guarantee that any marketing director would declare this to be disaster, saying (possibly more colourful language) "Look at the damage you have wreaked on my best customers!"

But would this be a reasonable reaction? Would it, in fact, be accurate? At this point, we have no idea whether the campaign caused the higher-spending customers' spend to decline, or whether something else did. To assess that, we need once more to look at the same information for the control group (9 million people, in this case). That's shown below.

The same graph as above, but now showing the change in
          spend for the control group as well as for the treated group.
          The same general pattern is seen in the control group,
          but the increases are smaller in the control group
          (starting at £8 for the group that had spent £0 in the
          six weeks before the mailing, and going down to --£15
          for the group spending over £90 in the pre-period.
          So change in spend is more positive, or less negative,
          in the treated group than in the control group in every
          behavioural segment.

What we clearly see is that in every segment the change in spend was either more positive or less negative in the treated group than in the control group. So the campaign did have a positive effect in every segment. Not so embarrassing. (If only this had been our case!)

Regression to the mean

To understand more clearly what's going on here, it's helpful to look at the same data but focus only the control group.

The same graph as above the last, but now with the treated
          group removed.

Remember, this is the control group: we have not done anything to this population. This is a classic case of regression to the mean. I would confidently predict that for almost any customer base, if we allocate people to segments on the basis of a behavioural characteristic over some period, then measure that same characteristic for the same people, using the same segment allocations, at a later period, we would see a pattern like this: the segments with low rates of the behaviour in question in the first place would increase that behaviour (at least, relative to the population as a whole), and the people in the segments that exhibited the behaviour more would fall back, on average.

Why?

Mixing Effects

When you segment a population on the basis of a behaviour, many of the people are captured exhibiting their typical behaviours. But inevitably, you capture some of the people exhibiting behaviour that is for them atypical.

Consider the first bar in the example above—people who spent nothing in the six weeks before the mailing. Ignoring the possibility of people returning goods, it is impossible for the average spend of this group to decline. In fact, if even a single person from this group buys something in the post-campaign period, the average spend for that segment will increase. In terms of the mixing I am talking about, some of the people will have completely lapsed, and will never spend again, while others were in an atypically low spending period for them: maybe they were on holiday, or trying out a competitor or didn't use their loyalty card and so their spending was not tracked. The thing that's special about this first group is that they literally cannot be in an atypically high spending period when they were assigned to segments, because they weren't spending anything.

It's less clear-cut, but a similar argument pertains to the group on the far right of the graph. Some of those are people who routinely spend over £90 a week at this retailer. But others will have had atypically high spend when we assigned them to segments: maybe they had a huge party and bought lots of alcohol for it, or maybe they shopped for someone else over that period. With the highest-spending group, there will probably be a small number of people whose spend was atypically low during the period we assigned them to segments, but there are likely to be far more people for whom this spend was atypically high at the right side of the distribution. So in this case, we can see it's likely that the average spend of these higher-spending segments will decline (relative to the population as a whole) if we measure them at a later time period.3

For people in the middle of the distribution, the story is similar but more balanced. Some people will have their typical spend where we measured it, and there will be others whom we captured at atypically high or atypically low spending periods, but those will tend to cancel out.

These mixing effects give the best explanation I know of the phenomenon of regression to the mean. It is always something to look for when you assign people to segments based a behaviour and then look for changes in the people in those segments a later time.

So how did we lose so much money?

The reason our campaign worked out so poorly was that we did not take into account regression to the mean when we set the targets, because we didn't think of it. Because we targeted more people with below-median spend than above-median spend, regression to the mean meant that although spend increased quite strongly among our treated customers, it also increased quite strongly for the control group in each segment. In that regard, the uplifts were less similar across the spend segments that I have shown here; something, in fact, that I now know to be characteristic of retail unike in many other areas. The most active shoppers are often also the most responsive to marketing campaigns.

The campaign produced an uplift in all segments, but much more of the increase in spend than we expected was due to regression to the mean in the population we targeted, with the result that value of the loyalty points given out significantly exceeded the incremental profit contribution from the people awarded those points.

This was a tough one. But at least I will remember it forever.


  1. Without getting too specific, the loss was a six-figure sterling sum, back when that was an more significant amount of money than it is today. It was not really a material amount for the retailer, which had a significant fraction of the UK population as regular customers; but it was a highly material amount for Quadstone: more, in fact, than the four founders had invested in the company, the probably less than our entire first-round funding. And retailers don't get to be big and dominant by treating six-figure losses with equinimity. 

  2. Uniformly at random means that, conceptually, a coin is tossed or a die is rolled to determine whether someone is allocated to the control group or the treated group. The coin or die does not need to be fair: it's fine to allocate all the 1's on the die to control and all the 2-6's to treated, or to use a weighted coin, as long as the procedure does use any other information to determine the allocation. For example, choosing to put two thirds of the men in control (chosen randomly) and only one third of the women in control is no good, because now, if there's a difference, it is hard to easily separate out the effect of the treatment from the effect of sex. (If the volume suffices, you could assess the uplift independently for men and women, in this particular case, but that quickly gets complicated, and there is more than enough scope for errors without such complications.) 

  3. One possible confusion here is that I'm describing a static, rather than a dynamic, segmentation here: people are allocated to a segment on the basis of the spend in the pre-period and remain in that segment when assessed in the post-period. If we reassigned people on the basis of their later behaviour, we would not see this effect if the spend distribution were static. 


Name Styles

Posted on Mon 04 March 2024 in TDDA • Tagged with TDDA, names

This is just a bit of fun, but I've always been interested in the different kinds of names allowed, encouraged, and used in different areas of computing and data.

A few years ago, I tweeted some well-known naming styles and a collection of lesser-known naming styles. I was playing about with the same idea while thinking about metadata standards today and came up with this. Just as I often think one of the boxes on the uniqitous 2x2 "Boston-Box"-style matrices makes no sense, I think some of the boxes on the evil-good-lawful-chaotic breakdown (which I gather comes from Dungeons and Dragons make little sense, so forgive me if some of this looks slightly forced. But I think it's fun.

Evil-Good-Lawful-Chaotic 3x3 matrix classification of name styles. LAWFUL GOOD: CamelCase, dromedaryCase, snake_case.  NEUTRAL GOOD: kebab-case, SCREAMING_SNAKE_CASE.  CHAOTIC GOOD: Train-Case, SCREAMING-KEBAB-CASE.  LAWFUL NEUTRAL: Pascal_Snake_Case, camel Snake Case, flatcase, UPPERFLATCASE.  NEUTRAL: reservedcase_, private ish case.  CHAOTIC NEUTRAL: space case.  LAWFUL EVIL: double quoted case, single quoted case, __dunder_case__.  NEUTRAL EVIL: path/case.extended, colon:caseflatcase, path/case, endash-kebab-case, quoted embedded newline case.  CHAOTIC EVIL: teRRorIsT nOTe CAse, alternating_separator-case, curly double quoted case, curly single quoted case, unquoted embedded, newline case.


TOMLParams: TOML-based parameter files made better

Posted on Sun 16 July 2023 in TDDA • Tagged with TDDA, reproducibility

TOMLParams is a new open-source library that helps Python developers to externalize parameters in TOML files. This post will explain why storing parameters in non-code files is beneficial (including for reproducibility), why TOML was chosen, and some of the useful features of the library, which include structured sets of parameters using TOML tables, hierarchical inclusion with overriding, default values, parameter (key) checking, optional type checking, and features to help use across programs, including built-in support for setting parameters using environment variables.

The Benefits of Externalizing Parameters

Almost all software can do more than one thing, and has various parameters that are used to control exactly what it does. Some of these parameters are set once and never changed (typically configuration parameters, that each user or installation chooses), while others may be changed more often, perhaps from run to run. Command-line tools often accept some parameters on the command line itself, most obviously input and output files and core parameters such as search terms for search commands. On Unix and Linux systems, it's also common to use command line "switches" (also called flags or options) to refine behaviour. So for example, the Unix/Linux grep tool might be used in any of the following ways:

grep time            # find all lines including 'time' on stdin
grep time p*.txt     # ... on .txt files starting with 'p'
grep -i time p*.txt  # ... ignoring capitalization (-i switch)

All of time, p*.txt and -i are examples of command-line parameters.

Many tools also use configuration files to control the behaviour of the software. On Unix and Linux, these are typically stored in the user's home directory, often in 'dot' files, such as ~/.bashrc for the Bash shell, ~/.emacs for the Emacs editor and ~/.zshrc for the Z Shell. These files are sometimes in propriery formats, but increasly often are in "standard" formats such as .ini (MS-DOS initialization files), JSON (JavaScript Object Notation), YAML (YAML Ain't Markup Language), or TOML (Tom's Obvious Minimal Language).

When writing code, it's often tempting just to set parameters in code.

  • The quickest, dirtiest practice is just to have parameter values "hard-wired" as literals in code where they are used, possibly with the same values being repeated in different places. This is generally frowned upon for a few reasons:

    • Hard-wired parameters are hard to find, and therefore hard to update and inspect;
    • There may be repetition, violating the don't repeat yourself (DRY) principle, and leading to possible inconsistency (though see also the counterveiling rule of three (ROT) and the write everything in threes (WET) principle.
    • If you want to change parameters, you have to go to many places;
    • There may be no name associated with the parameter, making it hard to know what it means;

    So implementing an energy calculation using Einstein's formula E = mc², we would probably prefer either:

def energy(mass):
    """Return energy in joules from mass in kilograms"""
    return mass * constants.speed_of_light * constants.speed_of_light

or

def energy(m):
    """Return energy in joules from mass m in kilograms"""
    c = constants.c  # the speed of light
    return m * c ** 2

(for a physicist) to

def energy(m):
    return m * 3e8 * 3e8

or (worse!)

def energy(m):
    return m * 9e16
  • For something like the speed of light (which is after all, a universal constant) in code is probably appropriate, so we might have a file called constants.py containing something like:
# constants.py: physical constants

speed_of_light = 299_792_458   # metres per second, in vacuuo. (c. 300 000 km/s)
h_cross = 1.054_571_817e-34    # reduced Planck constant in joule seconds (= h/2π)
  • For parameters that change in different situations, or for different users, it's usually much better practice to store the parameters in a separate file, ideally in a common format for which is good library support. Some of the advantages of this include:
    • Changing parameters then does not require changing code. Code may be unavilable to the user—e.g. write locked, compiled or running on a remote server through an API. Moreover, editing code often requires more expertise and confidence than editing a parameter file.
    • The same code can be run multiple times using different parameters at the same time; this can be more challenging if the parameters are "baked into" the code.
    • It becomes easy to maintain multiple sets of parameters for different situations, allowing a user simply to choose a named set on each occasion (usually through the parameter file's name). Of course, the code itself can have different named sets of parameters, but this is less common, and usually less transparent to the user, and harder to maintain.

Why TOML?

There is nothing particularly special about TOML, but it is well suited as a format for parameter files, with a sane and pragmatic set of features, good library support in most modern programming languages and rapidly increasing adoption.

Here's a simple example of a TOML file:

# A TOML file (This is a comment)

start_date = 2024-01-01   # This will produce a datetime.date in Python
run_days = 366            # an integer (int)
tolerance = 0.0001        # a floating-point vlaue (float)
locale = "en_GB"          # a string value
title = "2024"            # also a string valuue
critical_event_time = 2024-07-31T03:22:22+03:00  # a datetime.datetime
fermi_approximate = true  # a boolean value (bool)


[logging]   # A TOML "table", which acts like a section, or group

format = ".csv"           # Yet another string
events = [                # An array (list)
    "financial",
    "telecoms",
]
title = "2024 events"     # The same key as before, but this one is
                          # in the logging table, so does not conflict
                          # with the previous title

Strengths of TOML as a parameter/configuration format

  • Ease of reading, writing and editing ("obviousness").
  • Lack of surprises: there are few if any weirdnesses in TOML (which is less true of the other main competing formats).
  • Good coverage of the main expected types: TOML supports (and differentiates between) booleans, integers, floating point values, strings, ISO8601-formatted dates and timestamps (with and without timezones, and mapping to datetime.date and datetime.datetime in Python), arrays (lists), key-value pairs (dictionaries), and hierarchical sections.
  • It supports comments (which begin with a hash mark #)
  • Support in most modern langauges, including C#, C++, Clojure, Dart, Elixir, Erlang, Go, Haskell, Java, Javascript, Lua, Objective-C, Perl, PHP, Python, Ruby, Rust, Swift, Scala
  • Flexible quoting with support for multiline strings, and no use of bare (unquoted) strings (avoiding ambiguity)
  • Well-defined semantics without becoming fussy and awkward to edit (e.g., being unfussy about trailing commas in arrays).
  • TOML is specifically designed to be a configuration file format. Quoting the front page of the website

TOML aims to be a minimal configuration file format that: is easy to read due to obvious semantics maps unambiguously to a hash table is easy to parse into data structures in a wide variety of languages"

  • In the context of Python, TOML is already being adopted widely, most notably in the increasingly ubiquitous pyproject.toml files. There are good libraries available for reading (tomli) and writing (tomli_w) TOML files, although these are not part of the standard library. (It appears that Python 3.11 does include tomllib, for reading TOML files, in the standard library.)

Weaknesses of TOML as a parameter/configuration format

We don't think TOML has any major weaknesses for this purpose, but points that might count against it for some include:

  • TOML does not have a null (NULL/nil/None/undefined) value. (TOMLParams could address this, but has no current plans to do so.)
  • Hierarchical sections ('tables') in TOML are not nested. So if you want So if you want a section/table called behaviours and subsections/subtables called personal and business, in TOML this might be represented by something like the excerpt below (possibly with a [behaviours] table with its own parameters as well).

Some people, however, really don't like TOML.

[behaviours.personal]

frequency = 3

[behaviours.business]

frequency = 3

Why not JSON, YAML, .ini...

We chose TOML because we think, overall, it is better than each of the obvious competing formats, but the purpose of this post isn't to do down those formats, which all have their established places. But we will comment briefly on the most obvious alternatives.

JSON, while good as a transfer format and very popular in web services, is tricky for humans to write correctly, even with editor support, because it requires correct nesting and refuses to accept trailing commas in lists and dictionaries. The lack of support for dates and timestamps is also a frequent source of frustration, with quoted strings typically being used instead, with all the concomitant problems of that approach.

At first glance, YAML appears more suitable as a configuration/parameter-file format, but is the opposite of obvious and often produces unexpected results (in practice). Sources of frustration include no requirement to quote strings, "magic values" like yes (which maps to true) and no which maps to false (much to the annoyance of Norwegians and anyone using the NO country code), inadvent coercison of numbers with leading zeroes to octal (in YAML 1.1), whitespace sensitivity, and issues around multi-line strings.1 2 3 4

.ini files look a lot like TOML (I would guess they a major inspiration for it) but are much simpler and less rich, have less well-defined syntax, have fewer types and don't require quoting of strings, leading to ambiguity.

What are the Key Features of TOMLParams?

Simple externalization of parameters in one or more TOML files

You write your parameters in a TOML file, pass the name of the TOML file an instance of the TOMLParams class and it reads the parameters from the file and makes them available as attributes on the object, but also makes them available using dictionary-style look-up, i.e. if you TOMLParams instance is p and you have a parameter startdate you can access it as p.startdate or p['startdate'] (and, more pertinently, also as p[k] if k is set to 'startdate').

If you use tables, you can "dot" you way throught to the parameters (p.behaviours.personal.frequency) or use repeated dictionary lookups (p['behaviours']['personal']['frequency']).

Loading, saving, default values, parameter name checking and optional parameter type checking

The parameters that exist are defined by default value with TOMLParams. You can either store you defaults in a TOML file (e.g. defaults.toml) or pass them to the TOMParams initializer as a dictionary.

If you choose a different TOML file, all the parameter values are first set to their default values, and then any parameters set in the file you specify override those. New parameters (i.e. any not listed in defaults) raise an error.

If you wish turn on type checking, the library will check that the all parameter values provided match those of the defaults, and you choose whether these cases result in a warning (which doesn't raise an exception) or an error (which does).

Hierarchical file inclusion with overriding

Perhaps the most powerful feature of TOMLParams is the ability for one parameters file to 'include' one or more others. If you use TOMLParams specifying the parameters file name as base, and the first line of base.toml is

include = 'uk'

then parameters from uk.toml will be read and processed before those from base.toml. So all parameters will first be set to default values, then anything in uk.toml will override those values, and finally any values in base.toml will override those.

The include statement can also be a list

include = ['uk', 'metric', '2023']

Inclusions are processed left to right (i.e. uk.toml is processed, before metric.toml, followed by 2023.toml), followed by the parameters in the including file itself. So in this case, if defaults are in defaults.toml and the TOML file specified is base.toml, the full sequence is

  • defaults.toml
  • uk.toml
  • metric.toml
  • 2023.toml
  • base.toml

Of course, the included files can themselves include others, but the library keeps track of this and each file is only processed once (the first time it is encountered), which prevents infinite recursion and repeated setting.

This makes it very easy to maintain different kinds and groups of parameters in different files, and to create a variation of a set of parameters simply by making the first line include whatever TOML file specifies the parameters you want to start from, and then override the specific parameter or parameters you want to be different in your new file.

** Support for writing consolidated parameters as TOML after hierarchical inclusion and resolution**

The library can also write the final parameters used out as a single, consolidated TOML file, which is useful when hierarchical inclusion and overriding are used, and keeps a record of the final values of all parameters. This helps with reproducibility and logging.

Support for using environment variables to select parameter set (as well as API)

You can choose how to specify the name of the parameters file to be used, and the name to which it should default. If you create a TOMLParams instance with:

params = TOMLParams(
    defaults='defaults',
    name='newparams',
    standard_params_dir='.'
)

default parameters will be read from ./defaults.toml and then ./newparams.toml will be processed, overriding default values.

If you want to specify the name of the parameters file to use on the command line, the usual pattern would be something like:

import sys
from tomlparams import TOMLParams

class Simulate:
    def __init__(params, ...)


if __name__ == '__main__:
    name = sys.argv[1] if len(sys.argv) > 1 else 'base'
    params = TOMLParams('defaults.toml', name)
    sim = Simulate(params, ...)

Sometimes, however, it's convenient to use an environment variable to set the name of the parameters file, particularly if you want to use the same parameters in multiple programs, or run from a shell script or a Makefile. You can specify an environment variable to use for this and TOMLParams will inspect that environment variable if no name is passed. If you choose SIMPARAMS for this and say

params = TOMLParams('defaults.toml', env_var='SIMPARAMS')

it will look for a name in SIMPARAMS in the environment which you can set with

SIMPARAMS="foo" python pythonfile.py

or

export SIMPARAMS="foo"
python pythonfile.py

If it's not set, it will use base.toml as the file name, or something else you choose with the base_params_stem argument to TOMLParams.

Check it out

You can install TOMLParams from PyPI in the usual way, e.g.

python -m pip -U tomlparams

The source is available from Github (under an MIT license), at github.com/smartdatafoundry/tomlparams

There's a README.md and documentation is available on ReAd the Docs at tomlparams.readthedocs.io.

You can get help within Python on the TOMLParams class with

>>> import tomlparams
>>> help(tomlparams)

After installing tomlparams, you will find you have a tomlparams command, which you can use to copy example code from the README

$ tomlparams examples
Examples copied to ./tomlparams_examples.

You can also get help from with tomlparams help:

$ tomlparams help
TOMLParams

USAGE:
    tomlparams help       show this message
    tomlparams version    report version number
    tomlparams examples   copy the examples to ./tomlparams_examples
    tomlparams test       run the tomlparams tests

Documentation: https://tomlparams.readthedocs.io/
Source code:   https://github.com/smartdatafoundry.com/tomlparams
Website:       https://tomlparams.com

Installation:

    python -m pip install -U tomlparams

TDDA on the Coding for Thought Podcast

Posted on Tue 11 July 2023 in TDDA • Tagged with podcast, TDDA

I had the pleasure of discussing TDDA with Peter Schmidt on his Coding for Thought podcast.

I think it came out really well, so this might be a nice way for people to learn about the ideas and motivations for the ideas and the library, which Simon Brown, Sam Rhynas and I developed over some years at Stochastic Solutions.

The podcast should be available by search in any podcast player for 'Code for Thought'.

Direct links are:


Overcast Logged-in iCloud Users: Self-Selection Bias and Customer Stickiness

Posted on Sun 08 January 2023 in TDDA • Tagged with stats, bias, interpretation

On Episode 258 of Marco Arment and “Underscore” David Smith’s podcast Under the Radar, and then on Episode 516 of Marco & co’s Accidental Tech Podcast, Marco describes the fact that his data suggests that about 12% of his users don’t have logged-in iCloud accounts with iCloud Drive enabled, which was a significant obstacle to moving his sync system for Overcast to use Apple’s CloudKit, which requires both.

This was a surprise to Marco, who had expected that the figure would be closer to 1%, and led Casey Liss to worry aloud about producing an app that depended on CloudKit.

Marco, entirely correctly, suggested that his user base may be non-representative, and pointed out that

  1. Many people listen to podcasts at work and when commuting, so may be more that usually likely to use work-issued phones; such devices are often locked out of use of iCloud in general, and iCloud Drive, most specifically.
  2. His users are almost certainly biased towards geeks.

It may not be immediately obvious, but there is also a strong “statistical” explanation for why Marco is almost certainly right, and Casey’s fears are likely to be somewhere between exaggerated and unfounded.

TL;DR: Self-Selection Bias and Customer Stickiness

Overcast is unusual (possibly unique) in offering a non-CloudKit-based sync system. Users who need non-CloudKit-based podcast syncing have a limited choice of options, possibly Hobson’s Choice. So it is extremely likely that Overcast has a disproportionately high number of users who can’t/don’t use iCloud Drive. Interestingly these people might also be exceptionally loyal, because they (perhaps) have nowhere else to go.

The Long Version

Let’s do some Fermi Estimation.

  • Let’s suppose Apple has 1 billion iOS users.
  • Let’s suppose 10% of them use a podcast player. That’s 100 million people.
  • Let’s suppose (consistent with Marco’s assumption) that 1% of those don’t have a logged-in iCloud account with iCloud Drive (forthwith to be refered to as “iCloudless”). That’s 1 million people.
  • Let’s suppose Overcast has 1% market share (1 million active users).

Understandably, Marco doesn’t release user stats per se, but did say:

“So, to give you some idea of what I mean by ’hardly anybody uses it’, I’m looking at a few hundred people who use the website, and that is not a large portion of the user base. And this is ... per day. ... It’s under 1,000 people. ... snd that’s ... well under 1% ... [I]t’s a very small portion of the user base.”

— Marco Arment, Accidential Tech Podcast #516, One of My Fits of Outrage, 3rd January 2023, from 31:16 (listen here).

That sounds consist with a million users, and certainly says the number significantly exceeds 100,000. Further, BuzzSprout’s Global Player stats for December 2022 put Overcast’s Market Share at 1%,1 with that number pegged at 1,134,0262 (users, presumably).

Based on these estimates, there are the same number of iCloudless podcast listeners—one million—as there are Overcast users, and their only choices are * to use Overcast, * to find another podcast player that offers non-iCloud-Drive-based sync (if there is one), or, * to forego sync. This is the self-selection bias. Just as restaurants with good vegan food probably have a disportionately high number of vegan customers (and much higher than ones that don't cater for vegans), and buildings with proper disabled access probably have disportionately high numbers of disabled customers (and much higher than those that don't offer disabled access), a podcast player that supports iCloudless sync will almost inevitably have a disproportionately high number of iCloudless users. In principle, if the estimates are reasonable, Marco could see up to 100% of his users in this iCloudless category without running out of iCloudless users (though this would plainly be ridiculous).

If these estimates are the right order of magnitude, 12% seems like a very plausible figure for what Marco would see in his stats. It means only 12% of people who might benefit from the sync service are using Overcast, but that’s an order of magnitude more than its estimated market share, which is pretty good. It also means that 88% of podcast listeners who might benefit from iCloudless sync are not currently using Overcast.

As I said, I don’t know whether any other podcast players do have an non-iCloud sync service, but either way, it also suggests that now that Marco has put his plan to discontinue it on ice, it might not be a terrible idea for him to lean into it. Maybe his marketing should specifically try to target users who want to sync podcasts across devices (and the web!) but are “iCloudless". After all, it’s very plausible that there are a million of them out there; and it would be really hard for Marco’s competitors to respond. As noted above, these also might be exceptionally loyal/sticky customers, because there may be nowhere else for them to go. It’s one of Marco's stronger competitive moats.


  1. amazingly, given that I made the estimate before trying to look up any stats, and as you can see, my numbers really are Fermi Estimates. 

  2. even more amazingly, this number of users is also within 2% of my Fermi Estimate. It’s a just a coincidence, but a very happy one!