Controlled experiments on the web: survey and practical guide

Controlled experiments on the web: survey and practical guide. Ron Kohavi · Roger Longbotham · Dan Sommerfield · Randal M. Henne (2009) Data Min Knowl Disc 18:140–181.


Controlled experiments neutralize confounding variables by distributing them equally over all values through random assignment, thus establishing a causal relationship between the changes made in the different variants and the measure(s) of interest, including the Overall Evaluation Criterion (OEC).

We agree and believe that companies can accelerate innovation through experimentation because it is the customers’ experience that ultimately matters, and we should listen to them all the time by running experiments.

Our experience indicates that significant learning and return-on-investment (ROI) are seen when development teams listen to their customers, not to the Highest Paid Person’s Opinion (HiPPO). Many organizations have strong managers who have strong opinions, but lack data, so we started to use the term HiPPO, which stands for Highest Paid Person’s Opinion, as a way to remind everyone that success really depends on the users’ perceptions.

The web provides an unprecedented opportunity to evaluate ideas quickly using controlled experiments, also called randomized experiments (single-factor or factorial designs), A/B tests (and their generalizations), and split tests.

When a company builds a system for experimentation, the cost of testing and experimental failure becomes small, thus encouraging innovation through experimentation. Failing fast and knowing that an idea is not as great as was previously thought helps provide necessary course adjustments so that other more successful ideas can be proposed and implemented.

Overall Evaluation Criterion (OEC)

A good OEC should not be short-term focused (e.g., clicks); on the contrary, it should include factors that predict long-term goals, such as predicted lifetime value and repeat visits. Ulwick describes some ways to measure what customers want (although not specifically for the web). [Book by Ulwick – What customers want, pdf].

When running experiments, it is important to decide in advance on the OEC (a planned comparison); otherwise, there is an increased risk of finding what appear to be significant results by chance (familywise type I error).
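The multiple-comparisons risk can be made concrete with a quick back-of-the-envelope calculation (our own illustration, not from the paper): with k independent metrics each tested at α = 0.05, the familywise type I error grows quickly.

```python
# Familywise type I error for k independent tests at alpha = 0.05:
# P(at least one false positive) = 1 - (1 - alpha)^k
alpha = 0.05
for k in (1, 5, 10, 20):
    familywise = 1 - (1 - alpha) ** k
    print(f"{k:2d} metrics -> P(>=1 false positive) = {familywise:.2f}")
```

With 20 uncorrected metrics, the odds of a spurious "significant" result are close to two in three, which is why the OEC should be fixed as a planned comparison before the experiment starts.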

If the experiment was designed and executed properly, the only thing consistently different between the two variants is the change between the Control and Treatment, so any differences in the OEC are inevitably the result of this assignment, establishing causality.

We need to aggressively filter out robots that do not delete cookies and have a large number of actions.


An experiment can be initiated with a small percentage of users assigned to the Treatment(s), and then that percentage can be gradually increased. For example, if you plan to run an A/B test at 50%/50%, you might start with a 99.9%/0.1% split, then ramp up the Treatment from 0.1% to 0.5% to 2.5% to 10% to 50%. At each step, which could run for, say, a couple of hours, you can analyze the data to make sure there are no egregious problems with the Treatment before exposing it to more users. The square factor in the power formula implies that such errors could be caught quickly on small populations and the experiment can be aborted before many users are exposed to the bad Treatment.
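The "square factor" can be sketched as follows (our own illustration, using the common rule of thumb that roughly n = 16σ²/Δ² users per variant give ~80% power at α = 0.05):

```python
import math

# Rule-of-thumb sketch (our own illustration): with n = 16 * sigma^2 / delta^2
# users per variant, the minimum detectable effect is delta = 4 * sigma / sqrt(n),
# i.e. it shrinks only as 1 / sqrt(n).
def min_detectable_effect(n_per_variant, sigma=1.0):
    return 4 * sigma / math.sqrt(n_per_variant)

for n in (1_000, 10_000, 100_000):
    print(f"n = {n:>7,} -> smallest detectable delta ~ {min_detectable_effect(n):.3f}")
```

Because the relationship is quadratic, a Treatment population 100x smaller can still detect effects only 10x larger, so egregious degradations surface quickly even at a 0.1% ramp-up.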


    1. Quantitative metrics, but no explanations (no “why”)
    2. Short term versus long term effects (Long-term goals should be part of the OEC).
    3. Primacy and newness effects
    4. Features must be implemented (paper prototyping can be used for qualitative feedback, can complement controlled experiments).
    5. Consistency
    6. Parallel experiments (strong interactions are rare in practice; see below)
    7. Launch Events and Media Announcements. If there is a big announcement made about a new feature, such that the feature is announced to the media, all users need to see it.

Parallel or sequential?

There are two primary benefits of a single MVT (MultiVariable Test) versus multiple sequential A/B tests to test the same factors:
  1. You can test many factors in a short period of time
  2. You can estimate interactions between factors

Three common limitations are:

  1. Some combinations of factors may give a poor user experience
  2. Analysis and interpretation are more difficult
  3. It can take longer to begin the test


It is commonly thought that the power of the experiment decreases with the number of treatment combinations (cells). This may be true if the analysis is conducted by comparing each individual cell to the Control cell. However, if the analysis is the more traditional one of calculating main effects and interactions using all the data for each effect, little or no power is lost.
There are two things that will decrease your power, though. One is increasing the number of levels (variants) for a factor. This will effectively decrease the sample size for any comparison you want to make, whether the test is an MVT or an A/B test. The other is to assign less than 50% of the test population to the treatment (if there are two levels). It is especially important for treatments in an MVT to have the same percentage of the population as the Control.
If you want to test ideas as quickly as possible and aren’t concerned about interactions, use the overlapping experiments approach. (With overlapping experiments you test the factors more quickly and, if there is sufficient overlap in any two factors, you can estimate the interaction between those factors.)

If it is important to estimate interactions, run the experiments concurrently, with users being independently randomized into each test, effectively giving you a full factorial experiment.

Lessons learned

6.1 Analysis

  • 6.1.1 Mine the data (a population of users with a specific browser version was significantly worse for the Treatment)
  • 6.1.2 Speed matters
  • 6.1.3 Test one factor at a time (or not)
    • Conduct single-factor experiments for gaining insights and when you make incremental changes that could be decoupled
    • Try some bold bets and very different designs
    • Use full or fractional factorial designs suitable for estimating interactions when several factors are suspected to interact strongly. Limit the number of values per factor and assign the same percentage to the treatments as to the control. This gives your experiment maximum power to detect effects.

6.2 Trust and execution

  • 6.2.1 Run continuous A/A tests
  • 6.2.2 Automate ramp-up and abort
  • 6.2.3 Determine the minimum sample size [online power calculator]
  • 6.2.4 Assign 50% of users to treatment (For example, if an experiment is run at 99%/1%, then it will have to run about 25 times longer than if it ran at 50%/50%.)
  • 6.2.5 Beware of day of week effects
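The factor of 25 in 6.2.4 follows from the variance of the difference in means, which is proportional to 1/n_T + 1/n_C; a quick sketch (our own illustration):

```python
# Relative run time for a (1-p)/p Control/Treatment split to reach the same
# precision as a 50%/50% split, since Var(mean_T - mean_C) ~ 1/n_T + 1/n_C:
def duration_multiplier(p_treatment: float) -> float:
    p = p_treatment
    return (1 / p + 1 / (1 - p)) / 4  # denominator 4 = the 50%/50% value

print(round(duration_multiplier(0.01), 1))  # 99%/1% split: ~25x longer
print(round(duration_multiplier(0.10), 1))  # 90%/10% split: ~2.8x longer
```

The precision of the comparison is dominated by the smaller group, so shrinking the Treatment share far below 50% buys safety only at a steep cost in run time.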

6.3 Culture and business

  • 6.3.1 Agree on the OEC upfront, so that the interested parties have agreed on how an experiment is to be evaluated before the experiment is run.
  • 6.3.2 Beware of launching features that “do not hurt” users. (It is possible that the experiment is negative but underpowered.)
  • 6.3.3 Weigh the feature maintenance costs (A small increase in the OEC may not outweigh the cost of maintaining the feature)
  • 6.3.4 Change to a data-driven culture


The paper has some nice quotes:

  • The fewer the facts, the stronger the opinion – Arnold Glasow
  • The difference between theory and practice is larger in practice than the difference between theory and practice in theory – Jan L.A. van de Snepscheut
  • The road to hell is paved with good intentions and littered with sloppy analysis – Anonymous
  • It is difficult to get a man to understand something when his salary depends upon his not understanding it. – Upton Sinclair
  • Almost any question can be answered cheaply, quickly and finally, by a test campaign. And that’s the way to answer them – not by arguments around a table. Go to the court of last resort – buyers of your products. – Claude Hopkins, Scientific Advertising, 1923
  • …the ability to experiment easily is a critical factor for Web-based applications. The online world is never static. There is a constant flow of new users, new products and new technologies. Being able to figure out quickly what works and what doesn’t can mean the difference between survival and extinction. – Hal Varian, 2007

Michael Wolf – Life in Cities

The exhibition Life in Cities by Michael Wolf is on at the Fotomuseum Den Haag until 22 April. A nice introduction is this 6-minute video.

The series Architecture of Density is magnificent: skyscrapers in Hong Kong.


Transparent City (Chicago) also contains very beautiful photos.


And Paris Roof Tops (beautifully lit in a dark basement):


Other series on display included Tokyo Compression:


and work from his student days:


Definitely recommended!

On the way back we also passed Ringen aan Zee.


The psychology of investing and Bitcoins

Bitcoins are hot and much in the news. The price seems to go only up, with the occasional hiccup. But cryptocurrencies are certainly not without risk. Psychological factors also play a role in this hype, which may lead consumers to make the wrong (risk) trade-offs.

Investor psychology

AFM research from 2015 among self-directed investors “shows that investor behavior often deviates from the way supervisors, financial firms, and legislators would like investors to behave.” The fact that real people sometimes behave differently from what perfectly rational models predict is one reason the AFM has a dedicated Consumer Behavior team. We use insight into actual decision-making behavior to map risks and to supervise more effectively.

The psychological pitfalls of investing may apply even more strongly to buying and selling Bitcoins and other cryptocurrencies, because their prices swing up and down far more extremely than those of more conventional investments such as stocks or bonds. Comparisons with the tulip mania, the campsite boom, or the internet bubble suggest themselves (although, see: Do you know the story of the tulip mania? It’s not true (and lessons for the Bitcoin)).


In Google Trends, based on search queries, we see a large increase in interest in Bitcoins, rising almost as fast as the price itself. Search volume can have predictive value, as a study by the Centraal Planbureau (CPB) shows. “[There] turns out to be a strong correlation between the number of Google searches [for “hypotheek” (mortgage)] and the number of actual transactions in the housing market,” the CPB researchers write in Een voorlopende huizenmarktindicator (A leading housing-market indicator).



The same holds for the “regular” stock market: when share prices pick up, more people start investing. According to research agency Kantar TNS, the Netherlands counted almost 1.4 million investing households in 2017, 14% more than in 2016. The researchers also speak of a bandwagon effect. That is a well-known phenomenon from psychology, underlining that humans are social beings, strongly influenced by what the people around them do. We wrote about this in 2015 (page 16 of the report Belangrijke inzichten over zelfstandige beleggers, “Important insights about self-directed investors”):

[social comparison and herd behavior] refer to the fact that our decisions are influenced by the decisions others make. People compare their own situation with that of others in their social environment. If many people around us already have a certain product, we unconsciously assume it is a good product (Cialdini, 1993). So, if many people around us start investing, we assume it is a good moment to do so as well. Evidence for this follows from the strong correlation between the level of the AEX index and the number of investors.

And of course FOMO, Fear Of Missing Out, plays a role too. If the neighbor can buy a new boat from his Bitcoin profits (whether only on paper or not), we don’t want to fall behind. Incidentally, Kantar’s research also shows that “barely 2% of households” own cryptocurrencies such as the bitcoin. In a February 2018 update, Reg van Steen of Kantar writes: the number of Dutch crypto investors has exploded, but not even a quarter of them are in profit. They calculate that the Netherlands has 580,000 crypto investors.

There is also academic literature arguing that retail investors often time the market badly: they get in when the price is high and sell when it is low. To make a profit, you have to do exactly the opposite. This satirical piece from De Speld contains a kernel of truth: Bitcoin only stops rising once you get in.

A sound risk assessment

The AFM is a supervisor; it does not give investment advice. But the AFM does find it important that consumers make a sound risk assessment. That is why the supervisor wrote Be aware of the risks of Bitcoins in 2013, and more recently Real risks with cryptocurrencies. More information about the risks of investing in virtual currencies can be found on the AFM website.

From Proof of Concept to Scalable Policies: Challenges and Solutions, with an Application

Abhijit Banerjee, Rukmini Banerji, James Berry, Esther Duflo, Harini Kannan, Shobhini Mukerji, Marc Shotland, and Michael Walton, From Proof of Concept to Scalable Policies: Challenges and Solutions, with an Application. Journal of Economic Perspectives, vol. 31, no. 4, Fall 2017 (pp. 73–102). Suggested by my colleague Alexandra van Geen.


The promise of randomized controlled trials is that evidence gathered through the evaluation of a specific program helps us—possibly after several rounds of fine-tuning and multiple replications in different contexts—to inform policy. However, critics have pointed out that a potential constraint in this agenda is that results from small “proof-of-concept” studies run by nongovernment organizations may not apply to policies that can be implemented by governments on a large scale. After discussing the potential issues, this paper describes the journey from the original concept to the design and evaluation of scalable policy. (…) We use this example to draw general lessons about using randomized control trials to design scalable policies.

In terms of establishing causal claims, it is generally accepted within the discipline that randomized controlled trials are particularly credible from the point of view of internal validity. This credibility applies to the interventions studied—at that time, on that population, implemented by the organization that was studied—but does not necessarily extend beyond. It is not at all clear that results from small “proof-of-concept” studies run by nongovernment organizations can or should be directly turned into recommendations for policies for implementation by governments on a large scale. While external validity of a randomized controlled trial cannot be taken for granted, it is far from unattainable.

6 obstacles

Six main challenges in drawing conclusions from a localized randomized controlled trial about a policy implemented at scale:

[1] Market equilibrium effects. When an intervention is implemented at scale, it could change the nature of the market.

To assess the equilibrium impact of an intervention (…) The typical design is a two-stage randomization procedure in which the treatment is randomly assigned at the market level in addition to the random assignment within a market. For example, the experiment of Crepon, Duflo, Gurgand, Rathelot, and Zamora (2013) varied the treatment density of a job placement assistance program in France within labor markets, in addition to random assignment of individuals within each market.
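A minimal sketch of such a two-stage design (our own illustration; the market count, densities, and seed are hypothetical):

```python
import random

random.seed(0)
densities = [0.0, 0.25, 0.5, 0.75, 1.0]

# Stage 1: randomly assign each market a treatment density.
market_density = {m: random.choice(densities) for m in range(20)}

# Stage 2: randomize individuals within each market at that density.
assignment = {(m, i): ("T" if random.random() < d else "C")
              for m, d in market_density.items()
              for i in range(100)}

# Markets with density 0 serve as pure controls, so comparing outcomes
# across markets reveals equilibrium effects of treatment density itself.
```

Comparing treated and untreated individuals within a market identifies the direct effect, while comparing markets with different densities identifies how the effect changes as the program saturates the market.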

One potential challenge with the experimental identification of equilibrium effects is that it is not always obvious what the “market” is

When a particular intervention is scaled up, more people will be needed to implement it. This may lead to an increase in their wages or in difficulties hiring them.

[2] Spillover Effects. Many treatments have spillovers on neighboring units, which implies that those units are not ideal control groups. Not all spillovers are easy to detect in pilot experiments: in some cases, they may be highly nonlinear.

[3] Political reactions. including either resistance to or support for a program, may vary as programs scale up.

Potential political backlash?

  • Worth exploring whether some changes in potentially inessential program details are available.
  • It is also important to try to anticipate the backlash and create a constituency for the reform from the start
  • Finally, the potential for political backlash may provide an argument for not doing too many pilots, since large-scale programs are less likely to be scotched.

[4] Context Dependence. Would results extend in a different setting (even within the same country)? Would the results depend on some observed or unobserved characteristics of the location where the intervention was carried out?

[5] Randomization or Site-selection bias. Organizations or individuals who agree to participate in an early experiment may be different from the rest of the population; randomization bias.

  • Organizations (and even individuals within governments) who agree to participate in randomized controlled trials are often exceptional
  • A well-understood problem arises when individuals select into treatment
  • Site-selection bias arises because an organization chooses a location or a subgroup where effects are particularly large

Blair, Iyengar, and Shapiro (2013): randomized controlled trials are disproportionately conducted in countries with democratic governments.

[6] Piloting Bias/Implementation Challenges. A number of studies have found differences between implementation by nongovernment organizations and governments. Banerjee, Hanna, Kyle, Olken, and Sumarto (2016): the [Indonesian] government was less effective at running a pilot program and more effective with full implementation.

As the discussion in this section has emphasized, the issue of how to travel from evidence at proof-of-concept level to a scaled-up version cannot be settled in the abstract. The issue of [4] context-dependence needs to be addressed through replications, ideally guided by theory. [1] General equilibrium and [2] spillover effects can be addressed by incorporating estimation of these effects into study designs, or by conducting large-scale experiments where the equilibrium plays out. [5] Randomization bias and [6] piloting bias can be addressed by trying out the programs on a sufficient scale with the government that will eventually implement it, documenting success and failure, and moving from there.

[I skipped the Teaching at the Right Level example]

General Lessons

Perhaps the key point is to remember what small pilot experiments are good for and what they are not good for.

If the objective is to design or test a model [i.e. no policy implications], the researcher can ignore most of the concerns that we talked about in this paper. Something valuable will be learnt anyway.

For researchers, a strong temptation in a stage-two trial will be to do what it takes “to make it work,” but the risk of implementation challenges means that it is important to think about how far to go in that direction. On the one hand, trial and error will be needed to embed any new intervention within an existing bureaucracy. Anything new is challenging, and at the beginning of a stage-two trial, considerable time needs to be spent to give the program a fair shot. On the other hand, if the research team embeds too much of its own staff and effort and ends up substituting for the organization, not enough will be learnt about where implementation problems might emerge.

The Consumer Financial Protection Bureau and the Quest for Consumer Comprehension – Lauren Willis

The Consumer Financial Protection Bureau and the Quest for Consumer Comprehension (book chapter, April 2017) by Lauren Willis.

I found out about this new strand of work via ASIC. I really liked Willis’ debunking paper The Financial Education Fallacy (2011). Related to the paper I summarize below: Performance-Based Consumer Law (2015) and Performance-Based Remedies: Ordering Firms to Eradicate Their Own Fraud (2017). Perhaps I will dive deeper into one of those in another blog post.


To ensure that consumers understand financial products’ “costs, benefits, and risks,” the Consumer Financial Protection Bureau has been redesigning mandated disclosures, primarily through iterative lab testing. But no matter how well these disclosures perform in experiments, firms will run circles around the disclosures when studies end and marketing begins. To meet the challenge of the dynamic twenty-first-century consumer financial marketplace, the bureau should require firms to demonstrate that a good proportion of their customers understand key pertinent facts about the financial products they buy. Comprehension rules would induce firms to inform consumers and simplify products, tasks that firms are better equipped than the bureau to perform.
[unless otherwise stated, all text below is quoted from the paper]
The bureau [CFPB] must induce firms themselves to promote consumer comprehension:
Demonstrating sufficient customer comprehension could be a precondition firms must meet before enforcing a term or charging a fee, or firms could be sanctioned (or rewarded) for low (or high) demonstrated comprehension levels. In effect, rather than prescriptively regulating the marketing and sales process with mandated disclosures or pursuing firms on an ad hoc ex post basis for unfair, deceptive, and abusive marketing and sales practices, the bureau would monitor firms and incentivize them to minimize customer confusion as the marketing and sales process unfolds over time.
Comprehension rules are a form of performance-based regulation, in that they regulate outputs not inputs.
By moving testing of disclosure from the lab to the field, and trying to stimulate firms to develop creative disclosure methods, the CFPB implicitly acknowledges that:
  1. disclosures that do well in experimental conditions may not work in real-world conditions,
  2. firms are better situated than regulators to innovate to achieve consumer comprehension,
  3. valid, reliable consumer confusion audits are possible.

How might this form of regulation operate in practice?

  1. Measuring the quality of a valued outcome (comprehension) rather than of an input that is often pointless (mandated or pre-approved disclosure);
  2. Assessing actual customer comprehension in the field as conditions change over time, rather than imagining what the “reasonable consumer” would understand or testing consumers in the lab or in single-shot field experiments;
  3. Requiring firms to affirmatively and routinely demonstrate customer understanding, rather than relying on the bureau’s limited resources to examine firm performance ad hoc when problems arise;
  4. Giving firms the flexibility and responsibility to effectively inform their customers about key relevant costs, benefits and risks through whatever means the firms see fit, whether that be education or product simplification, rather than asking regulators to dictate how disclosures and products should be designed.
Certainly comprehension is often neither necessary nor sufficient for good decisions (…) Even knowledgeable consumers make bad decisions, whether as a result of inadequate willpower or decisionmaking biases. (…) many decisions require basic financial knowledge that consumers lack; the effective annual percentage rate (APR) for a credit card account “defies plain language efforts”.
It might well be more cost effective for society to engage in substantive regulation of product design or performance-based regulation of consumer welfare outcomes (e.g. a lender that does not follow the bureau’s underwriting rules can instead demonstrate annually that no more than five percent of its loan portfolio defaulted).


Even without any intent to deceive, firms not only will but must leverage consumer confusion to compete with other firms that deceive customers.
Firms have a bevy of means at their disposal to undermine mandated disclosures’ effectiveness:
  1. By altering the design of the transaction (e.g. banks are adept at sabotaging overdraft disclosures, see When Nudges Fail: Slippery Defaults, Willis, 2013)
  2. Frame consumers’ thought processes long before consumers see a disclosure. Consumers may think they are unaffected, but advertising works (Wood and Poltrack 2015; Lewis and Reiley 2014).
  3. Physically divert attention from disclosures. AT&T designed the envelope, cover letter, and amended contract after extensive “antimarketing” market testing to ensure that most consumers would not open the envelope, or if they did open it, would not read beyond the cover letter (Ting v. AT&T, 319 F.3d 1126, 9th Cir. 2003)
  4. Take proactive steps to ferret out easy marks, vulnerable customers. Savvy firms might use inferred cognitive load, mood, or stress levels to sell consumers products at the very moment when mandated disclosures will be misinterpreted or ignored. Firms can even engage in real-time marketing through Internet and mobile devices to reach consumers at vulnerable moments (Digital Market Manipulation, Calo, 2014).
Like sausage-makers, marketers do not want the public to know how their product is made.

Comprehension rules & customer confusion audits

Comprehension rules would align firms’ goals with the CFPB’s mandate to ensure consumer understanding of financial product costs, benefits, and risks. The effect of successful regulation through comprehension rules would be to bring transactions into closer alignment with consumer expectations.
Firms know a lot about their customers, as they already collect this information for marketing and product development purposes.
The very capacities that modern firms use to market products and defeat mandated disclosures enable them to attain better consumer comprehension more quickly and at a lower cost than regulators. The bureau can try to educate consumers, but nothing beats professional marketers when it comes to sending consumers a message.
Firms are in a better position than regulators to decide when it is worth the cost of educating consumers about complex or unintuitive features and when simplifying products is more cost-effective. Firms might find that educating their customers is so costly that it would be cheaper for firms to directly channel consumers to suitable products.
The bureau would need to remain mindful of firm agility at circumventing disclosure, and guard against firms’ manipulation of customer confusion audit results.


The benchmarks against which firm performance in customer confusion audits ought to be judged depend on which of the bureau’s statutory purposes it is pursuing: transparency, competition, or fairness.
Benchmarks if the goal is:
  • Fairness: the benchmarks would need to be high, perhaps as high as the approximately 85 percent benchmark implicitly used in false advertising cases
  • Competition: the benchmarks might be lower, depending on the firm’s ability to differentiate informed from uninformed consumers.
  • Prevent firms from undermining mandated disclosures: the benchmarks might be set at the comprehension levels the bureau can obtain in its disclosure testing.
  • Increase consumer comprehension from where consumers stand now:
    the benchmark might be set based on industrywide performance.

Benefits of Comprehension Rules

The effect of successful regulation through comprehension rules would be to bring transactions into closer alignment with consumer expectations.

The ultimate direct benefit of comprehension rules is increased consumer decisional autonomy; consumers would get what they think they are getting, not whatever hidden features firms can slip into the transaction.

Empowered choices free of confusion are only possible, and the market is only driven to efficiency, when consumers comprehend the transactions in which they engage.

Today we pretend that individual consumers use disclosures to drive market competition and make welfare-enhancing decisions, but we do not spend the resources needed to realize actual consumer understanding. As a result, consumers neither discipline the market nor consistently enhance their own welfare.

Report: Netspar International Pension Workshop 2018

From 17 to 19 January 2018, Netspar held its International Pension Workshop ’18. My report/summary, mostly in Twitter threads.

Day 1

Monika Böhnke – Choice in Pensions: Insights from the Swedish Premium Pension System

Striking: the default fund is the riskiest one. The idea behind this setup was that it would actually encourage people to make their own choice. It did not: only about 1,000 Swedes per year choose for themselves.

Hazel Bateman – Regulation of information provision for pension choices: Australia and the Netherlands compared

Mostly about research in Australia. Conclusions: “People are not using pension information as expected” and “Testing should be on real, actual behavior.”

Ward Romp – What drives pension reform measures in the OECD?

Not very useful for my work at the AFM, but an interesting kind of research. A handy overview of pension reforms in the Netherlands.

Johannes Hagen – A nudge to quit? The effect of fixed-term pensions on labor supply and retirement choices among older workers

Did not see this talk (received the slides later), but interesting how strong the effect of a nudge on choice is: when the “lump sum in 5 years” option is made more salient and clearly presented, the share choosing it rises by 30 percentage points, with effects on work choices; people stop working earlier (because of the higher monthly payout during those 5 years).

Day 2

Jesper Rangvid – Comparison of pension systems in The Netherlands and Denmark: Shifts from “safe” to “less safe” pension products

“Given a choice, young males with economic background in cities are more likely to give up guarantee”

Good point from the discussant: only 18% make a choice in Denmark, so why introduce choice at all?

Raymond Montizaan – Pension reform: Disentangling the impact on Retirement Behavior and Private Savings

A fast, hard pension adjustment leads to multiple problems; policy implication: phase in pension changes gradually.


Day 3

Arthur van Soest – Pension Communication in the Netherlands and other countries.

DNB research uses a measure of “objective” pension knowledge. Pension literacy: 3 questions that respondents think they know the answer to (researchers don’t know the right answers). Only “don’t know” is scored as a wrong answer; any other answer counts as “correct.”

The researchers do find relationships between pension communication, pension knowledge, and active pension decisions (I forget how they defined those), but the relationships are weak and often not causal.


Henriëtte Prast – The Power of Percentage: Quantitative Framing of Pension Income.

This is research that we at the AFM have also replicated. See this nice summary on DasKapital:


Paul Smeets – Financial Incentives Beat Social Norms: A Field Experiment on Retirement Information Search.

A nice experiment with 250,000 letters to participants of the retail-sector pension fund.

Appealing to a social norm does not work to get more logins; a chance to win a VVV gift voucher does.


Gregor Becker – Can Digital Information Treatments Intensify the Search for Household Spending Data and Improve Liquidity?

An experiment with 100,000 bank customers to get them to use an online household budget tool; among households with payment problems it led to an average of €453 more in their current account.


Tabea Bucher-Koenen – FinTech as a solution for rational inattention in individual pension planning?

Still in a rough/start-up phase; they want to test the effect of a pension dashboard. However, there is little digital data in Germany, so information for 1,000 people was entered manually by students (about 24 minutes per person).

The discussant had a nice link to a 2016 UK study I did not know yet: Understanding consumer experience of pension scams a year on from pension freedoms.


Around the talks I also briefly spoke with Marike Knoef. Research of hers had just appeared in the FD: Nederlander weet nog altijd weinig van pensioen.

That article also contains the 5 knowledge questions. Two of them come directly from the AFM knowledge test weetwatjeweet. A thread on Twitter:




Day 1 Marketing Insights Analytics (#MIE18)

Like last year, I attended the Marketing Insights Evenement (#MIE18) again. A recap of day 1, Wednesday 31 January 2018.

The official title of MIE18 now also includes Analytics: after the earlier broadening from pure market research to marketing, a further stretch. Full name: MIE, The Snowball Effect of Marketing, Insights & Analytics.

This was edition 16; from 600 participants in the Sparta stadium at the first MIE, the event has now grown to 2,500 visitors. MIE is also becoming more international: a quarter of the talks are in English.

AI or die
The morning keynote by Alix Rübsaam was about Artificial Intelligence: AI or Die: Redefining what it means to be human in the digital age?

The story starts out scary (“AI poses an existential threat to humanity. We are superf***ed“). With flair and smooth anecdotes, Rübsaam sketches a historical perspective on how people have viewed technology (golems, hydraulic pumps), only to end hopefully: “AI is only a threat if we define humanity as computational”.


For my first parallel session I went to a talk about ANWB and car insurance. ANWB would like to sell car insurance to its members as well.

Robert van Ossenbrugge opens with basic, fundamental questions:

Hence the mapping exercise (the research for ANWB focuses on mental availability):

  • Physical availability (findable on Google, on the shelf)
  • Mental availability
    • Are we saying the right things? (category entry points)
    • Is the sender clear? (distinctive brand assets)

On category entry points: bigger brands are linked to more category entry points, but ANWB has a lower actual market share than expected based on its mental market share (= number of associated category entry points).

On distinctive brand assets: the ANWB logo scores in the right quadrant, with high familiarity and high uniqueness (distinctiveness). Other, newer brand assets still lag somewhat behind (scoring lower on fame and uniqueness).

The challenge for ANWB is now to act on this: the brand is associated with “something with cars”, but that does not necessarily translate into more car insurance sales. I found these fascinating studies, presented refreshingly simply, even though the techniques and nuances under the hood are of course far more complicated. That takes skill.

Sustainable banking for young people
ASN bank and Motivaction spoke at the next session I attended: How do we rejuvenate sustainable banking? How can ASN attract new, young customers without alienating older ones (“progressive savers”)?

I found this case less appealing myself; it mainly seemed to revolve around targeting young people more cleverly via Facebook (among other things by linking to Motivaction’s Mentality model) to get them to come to an ASN sustainability festival.

There was a topical Fokke & Sukke, though (I believe it was in yesterday’s NRC):

Moral compass

For round 3 I faced a difficult choice. In the end I did not go to the dataviz presentation by @PerfecteGrafiek [visualisation follow-tips 1, 2, 3, 4], but to the Staat van Marketing, where I could follow the highly interactive discussions with the audience.

“Does online marketing knock our moral compass exponentially further out of balance?” Max Kohnstamm asked. That is exactly one of the questions a colleague and I also raise in blogs about digital marketing on the website of my employer, the Autoriteit Financiële Markten.

Van Welzen presented consumer research on (the dominance of certain) hotel, restaurant and food-delivery platforms. I liked the link with the financial sector and its remuneration transparency.

One solution Van Welzen and Kohnstamm suggest is more transparency, even though that is not a silver bullet that will solve everything (“The platform always wins”, Kohnstamm quoted somewhere). A session with a good substantive debate and provocative propositions.


Experiments at Wehkamp
For round 4 I took a seat in a very well-filled large hall to listen to Data-driven experimentation at Wehkamp by Kevin Swelsen. [full presentation]

A great starting point: wanting to solve all problems hypothesis-driven and with scientific experiments, “innovation as a series of scientific experiments“. I fully agree.

One example experiment aimed to reduce the number of returns. For an online retailer that is, if anything, an even more important (more profitable) goal than increasing conversion.
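Before running such an experiment you want to know how many orders each variant needs in order to detect a meaningful drop in the return rate. A standard normal-approximation power calculation, sketched below with invented numbers (this is a generic formula, not Wehkamp’s own method):

```python
from math import ceil, sqrt

def sample_size_per_arm(p_base, min_effect, z_alpha=1.96, z_power=0.84):
    """Rough per-variant sample size to detect an absolute drop of
    `min_effect` in a baseline rate `p_base`, at two-sided alpha = 0.05
    (z = 1.96) and power = 0.80 (z = 0.84)."""
    p_alt = p_base - min_effect
    p_avg = (p_base + p_alt) / 2
    n = ((z_alpha * sqrt(2 * p_avg * (1 - p_avg))
          + z_power * sqrt(p_base * (1 - p_base) + p_alt * (1 - p_alt))) ** 2
         / min_effect ** 2)
    return ceil(n)

# Say returns run at 40% of orders and we want to detect a 2-point drop:
print(sample_size_per_arm(0.40, 0.02))  # on the order of 10,000 orders per arm
```

The small absolute effect sizes that matter commercially are exactly why a retailer needs an experimentation platform that can route thousands of orders per variant.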

The last round of my #MIE18 brought me to Analytics at ING’s mobile banking app by Timothy “ING has the best banking app in the Netherlands” Dieduksman.

Not entirely surprisingly, it is mainly young people who account for the more than 1 billion times the app was used in 2017. Zooming in on feature adoption yielded interesting insights: it takes about a year of app use before someone reaches the average level and knows/uses about 60% of the features.

After (still) a piece on dataviz (the work of Nadieh Bremer, though not about baby names but about words in repayment requests), Dieduksman ended with Factbites, a clever way to share results and insights from data analyses with colleagues, including handy tips for a good factbite.

I skipped the closing keynote and will also give day 2 a miss. I’ll be back for #MIE19.

[update 2/2/2018] The ratings of all presentations are out, and for some presentations the slides as well. A few that look interesting to me:

Schiedam – S’dam

Just back from a weekend away with Taco to Schiedam. Or actually: S’dam.


We stayed at Jeneverlogies and it was fine. Excellent breakfast, too.


And right next to the Jenevermuseum.


As a non-jenever drinker I still found the Nationaal Jenevermuseum very interesting, especially the explanation by the Stoker.


Also a nice exhibition with 1,001 labels:

And a tasting


Malt for the millions of litres of jenever for the 400 distilleries was ground in mills #nofilter (a grey day)


The jenever theme recurs everywhere, such as in the public library, housed in the former Korenbeurs (corn exchange)


But also in the logo of a distillery


Street scene; an interesting combination…


On Sunday to the Stedelijk Museum Schiedam.

How to prove, how to interpret and what to do? Uncertainty experiences of street-level tax officials

On January 18th, 2018, the VIDE publicatieprijs 2017 will be awarded. My own paper Werkt de wildwestwaarschuwing wel? is one of the nominees. The other nominee is How to prove, how to interpret and what to do? Uncertainty experiences of street-level tax officials by Nadine Raaphorst, published in Public Management Review in 2017 (2016 impact factor: 2.293).

Obviously, I can’t really summarize this paper objectively. The fact that it is qualitative research based on a storytelling method is also the complete opposite of my quantitative bias. That said, transcribing 37 stories “about situations they experienced as difficult or complicated” told by 17 tax officials is no easy feat and quite some work. And just like my paper, Raaphorst did not study actual behaviour.

This study examines the kind of uncertainties frontline tax officials working with a trust-based inspection approach experience in interacting with citizen-clients. The classical literature on bureaucracy and the street-level bureaucracy literature suggest frontline officials face two kinds of uncertainties: information and interpretation problems. Analysing stories of Dutch frontline tax officials collected through in-depth interviews, this article shows that these two kinds of uncertainty only explain a part of the uncertainties experienced. Respondents also face action problems requiring improvisational judgements. The study furthermore finds that different sources underlie these uncertainties, pointing to possible explanations.

Raaphorst studied Dutch tax officials (Belastingdienst) who deal with citizen-clients/entrepreneurs and who have to implement a trust-based inspection approach (“horizontaal toezicht”, aimed at “collaboration and trust” and “rules and legislation that are vaguer“).

A trade-off is: “such policies may yield more responsive law enforcement and service provision, [but] they could also compromise consistent and fair decision-making, especially when certain types of citizen-clients have better negotiation and communication skills to take control in bureaucratic interactions.”

The paper seeks to address “the lack of understanding of the kinds, conditions, and consequences of uncertainty at play in frontline work.” This is all the more important now that “bureaucrats’ actions are increasingly made dependent on their perceptions of citizens in interactions, and to a lesser extent prescribed by formal rules”; this leads to a more uncertain bureaucratic process.

Three types of uncertainty

Apparently, there are two types of uncertainty in the existing literature (information and interpretation), and this study adds a new type: action uncertainty.

These findings underline the importance of social interactions to bureaucratic work and hence to understanding the role of uncertainty in bureaucracy. Whereas public administration literature has pointed to the existence of information uncertainties and interpretation uncertainties this study adds a third kind: action uncertainties.

Because “objective rationality (…) did not reflect organizational reality”, as described below, there is an information problem with ‘unknowns’.

[In] the traditional model of bureaucracy (…) bureaucracies are seen as rational organizations that should limit individual bureaucrats’ discretionary powers by setting strict rules and procedures. Technocratic knowledge, embodied in rules, procedures, and policies, is put at the heart of bureaucratic organizations.

On uncertainty as an interpretation problem: “bureaucrats’ discretionary practices are not only informed by organizational classification systems and rules but also by personal judgements regarding clients’ worthiness or deservingness, based on cultural schemes, moral beliefs and values, or certain stereotypes.” So ‘instances’ need to be interpreted, to see “what ‘is really happening’“.

A paragraph on uncertainty of social interactions rightly states: “Discretion at the frontlines ‘is necessary to respond to the unexpected and to ensure that services are responsive to individual need’”. And in the public administration literature, apparently, “The uncertainty that is inherent to discretion is treated as given.” I don’t know the PA literature, but this strikes me as strange (see this related discussion: Toezichthouders moeten zelf initiatief nemen in discussie over buitenwettelijk toezicht).

Summarizing table

The paper has three tables (one in the appendix) that I tried to integrate into one. I felt they overlapped a lot, and the differences were more in layout than in content; that didn’t help me understand the structure of the paper. The different order of the action-uncertainty items in the text versus the tables also confused me a bit.

Table 2 Description of the kinds of uncertainty at play in frontline tax officials’ work, slightly adapted and enriched:

Problem of proof (information uncertainty)
  • Contexts in which it occurs: Lack of evidence to support one’s interpretation [4]
  • Difficulties experienced: Vague stories of citizen-clients | Conflicting informational cues | Comprehensibility of an account is not a clear-cut affair | Finding proof requires effort and time

Problem of standards (interpretation uncertainty)
  • Contexts in which it occurs: Vague rules and legislation [8] | Conflicting norms, values, feelings [4]
  • Difficulties experienced: Law insufficient as backing | Potential inconsistent decision-making | Far-reaching consequences for citizen-clients

Problem of control (action uncertainty)
  • Contexts in which it occurs: Impact of citizen-clients’ private lives and emotions [10] | Negotiations with citizen-clients [3] | Deviations from normality [8]
  • Difficulties experienced: On-the-spot reaction | Consequentiality of official’s immediate reaction | Change of inspection approach | Dependence on citizen-client

Numbers in brackets: number of stories (total N=37).

As I understand it, the problems and contexts are nearly identical to Table 1 and Table A1 from the appendix.

For Interpretation uncertainty, what is called “Vague rules and legislation” in Table 2 is “Determining right decision” in Table 1 (and sometimes “grey area interpretation” or “absence of clear standards about what is right in these instances” in the text).

And “Conflicting norms, values, feelings” in Table 2 is called “Experiencing dilemmas” in Table 1 (or “tension between what one ought to do as a tax official and one’s personal values or ideas about what is appropriate, or one’s feelings of empathy.” in the text). Another nice description of this construct is “this leeway or ‘freedom to struggle’ involves dilemmas between following the law on the one hand and feelings of empathy on the other hand.”

The “Impact of citizen-clients’ private lives and emotion” under Action uncertainty is described in the text as “emotional labour” and “when ‘private life’ leaks into the encounter“.

The story illustrating “Negotiations with citizen-clients”, in which one tax official felt “he has been too open and has given away too much already early in the negotiation”, was for me the most salient and the best at describing a construct.


Are We Smart Enough to Know How Smart Animals Are?

Humans are not as special as we like to think, argues Frans de Waal convincingly in this book. “Less anthropocentric orientation (…) animals should be given a chance to express their natural behavior” (p275) and “proponents of human uniqueness (…) can’t stand the notion of humans as modified apes” (p268).

De Waal favors the cognition-for-animals camp over the behaviorists (e.g., Pavlovian training, Skinner boxes).

Many experiments comparing humans to apes are flawed. Young children sit in their mother’s lap, while apes are separated in cages. Social interactions are especially hard to test correctly.

Also, gibbons were initially considered the least intelligent apes because they performed badly on certain tests. It turned out they lack a fully opposable thumb and had a hard time picking things up from a flat surface: “Only when their hand morphology was taken into account did gibbons pass certain intelligence tests” (p14). Another experimental mistake: “premature denials of mirror self-recognition in elephants based on their reaction to an undersize mirror” (p157). With elephant-sized mirrors, elephants do recognize themselves.

A good blend of anecdotes and hard research: “Subjective feelings won’t get us there. Science goes by hard evidence.” (p234). The painstaking field experiments by biologists/ethologists observing animal behavior in the wild are a testament that this is possible, but very hard. The humanities should take note, especially Gloria Wekker-like, subjective approaches to “science” (check @RealPeerReview on Twitter for some examples).

p208: Nice story on hiding something for chimps. A similar experiment five years (!) later with another chimp made Socko look at exactly the hiding place from five years before.

p228: Work by Sarah Boysen; the chimp Sheba gets to choose between two cups with different amounts of candy. The cup Sheba points to is given to another chimp. “Yet unable to overcome her desire for the fuller cup, she never learned to do so [point at the smaller cup]”. If the cups were replaced by numerals, she did choose correctly, consistently pointing to the lower number.

p231: Both macaques (Robert Hampton 2004) and rats (Foote and Crystal 2007) volunteer for tests only when they feel confident, suggesting that they are aware of their own knowledge.

Other cool references:

Sorge 2014 Olfactory exposure to males, including men, causes stress and related analgesia in rodents

Capuchin monkeys reject unequal pay

Redouan Bshary on cooperation in fish