What does Behavioural Economics mean for Competition Policy?

What does Behavioural Economics mean for Competition Policy? Office of Fair Trading, March 2010. Matthew Bennett, John Fingleton, Amelia Fletcher, Liz Hurley & David Ruck


This paper looks at whether behavioural economics fundamentally changes our understanding of competition policy. We argue that behavioural economics is an important incremental advance in our understanding, just as informational economics was before it. But this does not mean that all previous economic models of competition and markets are now irrelevant. For the most part, they still provide valid and valuable insights. Importantly, behavioural economics does not question our belief in competition policy as a tool for making markets work well for consumers.

Nevertheless, the existence of behavioural biases does have a number of implications for the way in which markets work. Behavioural biases on the consumer side emphasize the importance of the demand side in making markets work well, and the important synergies between consumer policy and competition policy. Behavioural biases may also have implications for anti-competitive behaviour. In spite of this, behavioural economics does not necessarily imply more intervention. Markets can often solve their own problems and even where they can’t, there are dangers inherent in over-paternalism limiting consumer choice. Behavioural economics also emphasizes the difficulties that authorities can have in trying to correct for such biases.

Homo sapiens exhibits systematic biases in the way he views both the world and markets. (…) are there ways in which behavioural biases might lead to systematic biases in the models of markets and competition on which we have been relying?

Behavioural economics implies no fundamental shift, because/but:

  1. Behavioural economics does not mean that all previous economic models are negated
  2. Both competition policy and demand-side intervention are crucial tools for making markets work well for consumers
  3. The market may find its own solutions to any problems, but we cannot blindly assume the market will solve everything
  4. Competition (or consumer) authorities can face difficulties in trying to correct for such biases. E.g.: behavioural economics tells us that simply providing more information may not be a good solution when consumers have problems assessing such information

Ad #4: It is well documented that consumers do not always read and understand the information provided to them. For example, see Warning: Too much information can harm (November 2007), a final report by the Better Regulation Executive and National Consumer Council on maximising the positive impact of regulated information for consumers and markets.

Market failures

  1. Market power
  2. Asymmetries in information between consumers and firms
  3. Externalities not captured within consumers’ preferences
  4. (?) Behavioural biases

Behavioural economics (& biases):

  • highlights that consumers may find it hard to assess information and compare across products
  • allows us to better understand the underlying causes of search costs (which affect access) and switching (which limits ability to act)
  • makes clear that existing problems within the consumer decision-making process are more entrenched and prevalent than we had believed

Virtuous circle

Markets work well when there are efficient interactions on both the demand (consumer) side and the supply (firm) side. On the demand side, confident consumers activate competition by making well-informed and well-reasoned decisions which reward those firms which best satisfy their needs. On the supply side, vigorous competition provides firms with incentives to deliver what consumers want as efficiently and innovatively as possible. When both sides function well, a virtuous circle is created between consumers and competition.


Failure of either side of the circle can harm the effectiveness of markets

Dynamic competition may also be affected by consumer biases within the market. Over time this evolutionary role of competition implies that the average efficiency of the market increases for all consumers. This role is diminished when consumers no longer reward those firms that provide them with what they really want but, instead, reward those that best play on their biases.

Consumers drive

In order for consumers to drive competition by playing their active, effective, and rational part in this virtuous circle, they ideally need to:

  • Access information about the various offers available in the market. Affected by biases, e.g. consumers tend to look at relative rather than absolute search costs.
  • Assess these offers in a well-reasoned way. Affected by biases, e.g.:
    • incorrectly anticipating risk, underestimating or overestimating future use, or overweighting the present
    • using rules of thumb
    • being distracted by the way in which information is framed and presented
  • Act on this information and analysis by purchasing the good or service that offers the best value to the customer. Affected by biases, e.g. overconfidence, which can create inertia

Firms’ Reactions to Consumer Biases

  • Accessing information. Firms can make it more difficult for consumers to
    perform optimal search. E.g. add-on services, adding clauses, drip-pricing
  • Assessing offers. E.g. obfuscating prices or increasing choice or complexity
  • Acting on information and analysis. E.g. increase switching costs (play on inertia), use defaults and automatic enrolments, or use time limited offers to inhibit switching

But: there is also a growing empirical literature providing evidence of non-rational behaviour by firms themselves; see Armstrong & Huck (2010), Behavioral Economics as Applied to Firms: A Primer.

Problems in Markets can be Self-correcting

Market Solutions: the market may require a catalyst in order to change from an equilibrium in which all firms want to exploit consumer biases to an equilibrium in which all firms want to help consumers by revealing their prices. Potential catalysts: media, or advisors & intermediaries (e.g. consumer organizations).

Power of Learning: Even if firms have an incentive to mislead consumers this may not be possible (for long) if consumers learn from their mistakes. There are clearly limits to learning.

Self-regulation occurs where firms opt to join schemes that require them to
behave in particular ways.

Intervention Can Potentially Do More Harm than Good

All errors which [man] is likely to commit against advice and warning, are far outweighed by the evil of allowing others to constrain him to what they deem his good.

John Stuart Mill, (1859), On Liberty.

  1. We want solutions that solve the problem, but we do not want to remove consumer choice
  2. There is no guarantee that authorities will improve the market or avoid creating unforeseen consequences elsewhere. It may be that authorities simply do not have the level of expertise required to make delicate interventions
  3. Authorities might have behavioural biases as well

This cautions us against being too paternalistic even when behavioural biases point to problems within the market.

Lessons for Design of Remedies

  • There will always be times – just as there always have been – when intervention is necessary.
  • Other tools include consumer enforcement, consumer education, and (in
    the UK at least) market studies and investigations. There is also potential for
    authorities to advocate legislation in a particular market
  • An example of a positive intervention may be an obligation on firms requiring them to help consumers make decisions

A further concern that can arise around interventions to solve problems associated with consumer biases is that such interventions can be inherently redistributive. In many markets, the gains that firms make from exploiting consumer biases will be to some extent passed back, through the competitive process, to customers who do not exhibit those biases. In this case, there is effectively a form of cross-subsidy between customers, and this may be unwound with intervention. This does not imply that such interventions should not be made, but it is important to be aware that there can be losers as well as winners in such situations.


Where behavioural biases appear to be creating problems, some may advocate abandoning competition for regulation. We discussed above the dangers of over-paternalism and limiting choice. Competition authorities have a key role in reminding government of the benefits that competition and choice bring. In doing so, however, they need to be cognisant of the available evidence on behavioural economics and its implications.


Hypotheek aflossen met rentevoordeel

Together with Stefanie de Beer I wrote a short piece for ESB's Statistiek column: Hypotheek aflossen met rentevoordeel [paywall].

It is an elaboration of what is described on page 36 of this earlier AFM report: Rapport Experimenteren: samen leren activeren [accompanying press release: AFM, ING en Florius innoveren met gedragswetenschappelijke experimenten].

Some practitioners (mortgage advisors and financial planners) shared some of their experiences via Twitter:


Because the idea is based on Save More Tomorrow: Using Behavioral Economics to Increase Employee Saving, I also posted an English tweet mentioning the two creators of SMaRT:

pdf: ESB_Hypotheek aflossen met rentevoordeel


In the first months of this year I visited a few museums and took some snapshots, which I collect here in this blog post.

The visit to the Haags Fotomuseum (March) has its own post: Michael Wolf – Life in Cities, as does the Jenevermuseum in Schiedam (January).

February: “Photo-phylles” at the Jardin Botanique Bordeaux:


February: Bernd, Hilla en de anderen / Fotografie uit Düsseldorf at Huis Marseille:


March: NEMO Amsterdam (the kids had a study day; together with Mees):


March: André Volten – Utopia at Beelden aan Zee:


April: Centraal Museum Utrecht:


April: Fashion Cities Africa at the Tropenmuseum Amsterdam:


April: Hollandse Meesters uit de Hermitage at the Hermitage Amsterdam:



Central Bank Communication and the General Public

Central Bank Communication and the General Public. Andy Haldane (of Dog and Frisbee fame) and Michael McMahon. 2018. Forthcoming, AEA Papers and Proceedings.

Blinder (2009) wrote that “It may be time for both central banks and researchers to pay more attention to communication with a very different audience: the general public.” Communication can aid expectations management, and hence economic management; central bank communication is now itself a powerful lever of monetary policy.

Haldane (2017) stresses a deficit of public understanding as well as public trust in central banks – a twin deficits problem. Facing these twin deficits, a number of central banks have recently acknowledged the need to adapt their communications strategies to improve their reach to the general public, including through more accessible language and more direct engagement (Haldane, 2017). Because such efforts are not costless, however, two important considerations arise: feasibility and desirability.


Four reasons why it may be desirable to speak directly to a wider audience.

  1. A better understanding of the factors driving the economy, and economic policy, could help to reduce the incidence of such self-reinforcing expectational swings in sentiment and behaviour.
    To become convincing and credible, communications may need to be simple, relevant and story-based. Typical central bank communications tend to fail on all three fronts.
    Households who report greater knowledge of and greater satisfaction with monetary policy are also likely to have one-year, two-year and five-year inflation expectations that are closer to the inflation target.
  2. Building public understanding may be important as a means of establishing trust and credibility about central banks and their policies.
    It is also important for reasons of political accountability.
    Satisfaction in central banks’ actions is positively correlated with institutional understanding. It is also positively correlated with measures of central bank credibility.
  3. Traditional information intermediaries, such as the mainstream media and financial markets, may benefit from new, simpler narrative communication.
  4. To engage in more listening to messages from the general public, given that aggregating information is one of a monetary policy committee’s key roles.


We examine a recent communication initiative by the Bank of England. In November 2017 the Bank of England launched a new, broader-interest version of its quarterly Inflation Report (IR), augmented with new layers of content aimed explicitly at speaking to a less-specialist audience.

Overall, the analysis delivers a nuanced good-news message.

  1. Website activity over the course of the 24 hours after the announcement increased markedly in November 2017, almost doubling compared with earlier IRs.
  2. Numbers of tweets and retweets associated with the IR were materially higher than in August 2017, but slightly lower than in August 2016. Monetary policy news itself, rather than the means by which it is communicated, is the largest single factor determining the reach of Twitter activity.
  3. More than 70% of respondents [in a survey of BoE business contacts] felt the new layered summary helped them to better understand the IR’s messages. And around 60% of respondents felt the new communication improved their perceptions of the Bank.


We now assess the impact of the new Bank of England communications more directly through a controlled experiment. N=285 UK general public, plus a sample of first-year Oxford economics graduate students.

Participants were then randomly assigned to read either the traditional Monetary Policy Summary that accompanies the IR or the new, simplified layered content.

Three questions:

(1) understand the content and messages?

The results confirm that the new layered content is easier to read and understand, even for technically-advanced MPhil students.

(2) IR summary changed your views or expectations?

In the case of the general public survey, we find that more straightforward communication boosts the chances that the participant’s beliefs move more closely into alignment with the Bank’s forecasts. For MPhil students, the coefficient is also positive but not statistically significant.

(3) How has the IR summary affected your perceptions of the Bank of England?

Those that read the new layered content tended to develop an improved perception of the institution (BoE).

Policy implications

On a practical level, central banks aiming to reach a broader audience will need to continue to innovate and experiment with different methods and media for engaging the general public. This will, inevitably, require a degree of trial and error.

Success should be measured, not by the ability to reach everyone, but rather to influence beyond the small minority of technical specialists and information intermediaries who currently form the core of central banks’ audience.


Summarized in The Telegraph: There are good reasons why the Bank of England is trying to speak to ‘ordinary people’.

For central banks, communication is a powerful policy tool. The way central bankers talk about their thinking and decisions influences even long-term interest rates, as investors price credit according to their expectations of the central bank’s behaviour.


At the same time, the Bank is changing the way it communicates, and more specifically changing the people it communicates to. As well as the traders, economists and strategists in the markets, the Bank wants to talk to the wider public. This approach has led to a more accessible version of the quarterly Inflation Report and Governor Mark Carney going on ITV’s Peston on Sunday show. There are several reasons why the Bank might try to broaden the audience.

For one, clearer, simpler messaging may help media and markets to understand policy. Experience suggests the cryptic code of an earlier generation of central bankers can be misunderstood even by sophisticated market participants.

Perhaps more important, talking directly to “ordinary people” confirms that household actions matter to the economy and to Bank policy. Just as much as City investors, consumers need to form sensible expectations of the future economy when they make decisions on borrowing and spending.

A more accessible approach may build public confidence in the Bank at a time when trust in public institutions is weak. It could also open a dialogue that facilitates the flow of information from households to the central bank. The Bank surveys businesses extensively, but households less so. MPC members knowing more about what households think and feel about the economy can only be a good thing.

So there are good reasons for the Bank to try harder to talk to the wider public, and I think that approach will continue. Market participants should get used to the fact that they’re not the only people the Bank wants to talk to – and learn to read its utterances in that context.


Rethinking Traditional Methods of Survey Validation – Maul (2017)

Andrew Maul (2017) Rethinking Traditional Methods of Survey Validation, Measurement: Interdisciplinary Research and Perspectives, 15:2, 51-6 (Found via Tweet that is now protected)


It is commonly believed that self-report, survey-based instruments can be used to measure a wide range of psychological attributes, such as self-control, growth mindsets, and grit. Increasingly, such instruments are being used not only for basic research but also for supporting decisions regarding educational policy and accountability. The validity of such instruments is typically investigated using a classic set of methods, including the examination of reliability coefficients, factor or principal components analyses, and correlations between scores on the instrument and other variables. However, these techniques may fall short of providing the kinds of rigorous, potentially falsifying tests of relevant hypotheses commonly expected in scientific research. This point is illustrated via a series of studies in which respondents were presented with survey items deliberately constructed to be uninterpretable, but the application of the aforementioned validation procedures nonetheless returned favorable-appearing results. In part, this disconnect may be traceable to the way in which operationalist modes of thinking in the social sciences have reinforced the perception that attributes do not need to be defined independently of particular sets of testing operations. It is argued that affairs might be improved via greater attention to the manner in which definitions of psychological attributes are articulated and greater openness to treating beliefs about the existence and measurability of psychological attributes as hypotheses rather than assumptions—in other words, as beliefs potentially subject to revision.

Procedures of analysis and quality control of measurement instruments are often grouped under the heading of “validation” in the social sciences. In the case of self-report, survey-based instruments, such validation activities commonly consist of essentially three steps:

  1. Estimation of overall reliability or measurement precision, via estimation of Cronbach’s alpha
  2. Some form of latent variable modeling, via exploratory factor analysis (or sometimes principal components analysis), possibly followed by confirmatory factor analysis; and, more rarely, other latent variable models
  3. Estimation of associations between the measured variable and external variables, by inspection and interpretation of correlation matrices of scores on the new instrument and scores from existing instruments designed to measure similar or theoretically related attributes or outcomes of interest.
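To make step 1 of this trinity concrete, here is a minimal sketch of a Cronbach's alpha computation. All data here are invented for illustration: each respondent answers from a personal "response style" plus noise, i.e. consistently but without any content, in the spirit of Maul's nonsense items.

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for an (n_respondents, n_items) score matrix."""
    k = scores.shape[1]
    item_var_sum = scores.var(axis=0, ddof=1).sum()   # sum of item variances
    total_var = scores.sum(axis=1).var(ddof=1)        # variance of total scores
    return k / (k - 1) * (1 - item_var_sum / total_var)

# Invented data: 200 respondents, 8 items on a 1-6 agreement scale.
# Answers are a per-person response style plus noise -- consistent
# responding with no item content at all.
rng = np.random.default_rng(0)
style = rng.normal(3.5, 1.0, size=(200, 1))
items = np.clip(np.round(style + rng.normal(0, 0.7, size=(200, 8))), 1, 6)

print(round(cronbach_alpha(items), 2))  # high alpha despite contentless items
```

Even this deliberately contentless data yields an alpha well above conventional cut-offs, which previews the paper's central point.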


Why is this trinity successful?

An optimistic explanation for the longevity and popularity of these techniques could be that they are, in fact, reliably successful in achieving their intended scientific and quality-control aims.

A less optimistic observer might note various extra-scientific factors that might contribute to the popularity of these techniques, such as the fact that they have a clear track record of success in facilitating the publication of manuscripts in academic journals and providing a socially accepted warrant for claims of validity; additionally, these techniques are relatively easy to understand and implement, especially by comparison to many other psychometric models (which are not as easily accessible via common software programs such as SPSS).

Three studies with items without theory

In the three studies described above, items were written in the complete absence of a theory concerning what they measured and how they worked.

  1. In the first study, the items closely resembled items from a widely used survey instrument intended to measure growth mindsets, with the notable exception that the key noun in the sentence (“intelligence”) had been replaced with a nonsense word (“gavagai”). To help ensure that any results were not driven by peculiarities of the word “gavagai,” two additional versions of the survey were also used, where the word “gavagai” was replaced with “kanin” or “quintessence” [result: wording did not matter].
  2. In the second study, items consisted only of meaningless gibberish. Study 2 items were constructed so as to entirely lack even the semblance of semantics: eight items of approximately equal length, consisting of stock lorem ipsum text (e.g., “sale mollis qualisque eum id, molestie constituto ei ius”).
  3. In the third, items were simply absent. The “items” (if they could even be called that) consisted only of an item number (e.g., “1.”) followed by the six response options described in the previous studies, ranging from strongly disagree to strongly agree.

Prima facie, it would seem difficult to take seriously the claim that any of these sets of items constituted a valid measure of a psychological attribute, and if such a claim were made, one might reasonably expect any quality-control procedure worthy of the name to provide an unequivocal rejection. To state this in Popperian language: If ever there were a time when a theory deserved to be falsified, this would appear to be it.

Yet, this is not what occurred. In all three studies above, reliability estimates for the deliberately-poorly-designed item blocks were quite high by nearly any standard found in the social sciences.

These validation procedures returned results roughly in line with what is commonly provided as positive evidence of validity throughout the social sciences. This would appear to cast doubt on the adequacy of these methods for providing the kind of rigorous test of beliefs usually expected of scientific studies. Indeed, if response data from nonsensical and blank items can meet classically accepted criteria for validity, one might wonder under what conditions such procedures would not return encouraging results.

[I]t was argued and shown that traditional validation approaches may commonly fail to provide rigorous, potentially falsifying tests of key hypotheses involved in the construction of measures; it was demonstrated that it is not only possible but also apparently fairly easy to obtain favorable-seeming values of common statistical criteria for validity even in the absence of a theory concerning what an instrument measures and how it operates and, in fact, even in the absence of actual items.

The validation activities themselves (in particular, the aforementioned trinity of reliability estimates, factor analyses, and inspection of correlations with other variables) are essentially unreactive to theory.

Favorable-looking results as a default expectation

The results of this study suggest that, at least in the context of responding to survey questions, respondents often choose to behave consistently unless there is a clear reason not to do so. As such, it may be that favorable-looking results of covariance-based statistical procedures should be regarded more as a default expectation for survey response data than as positive evidence for the validity of an instrument as a measure of a psychological attribute.

Ad hoc explanations

[A] number of interesting correlations surfaced, including the correlation between scores on the “Theory of Gavagai” items and scores on the original Theory of Intelligence items, and the correlation between the nonsense items and Big Five Agreeableness. If one were inclined to do so, one might be able to provide ad hoc explanations regarding how these correlations constitute evidence of validity.

Misconceptions regarding the nature of scientific inquiry in general and measurement in particular

The process of “validating” a measure seems to be thought of by many as separate from the process of defining the attribute to be measured and articulating hypotheses concerning the nature of the connection between variation in the attribute and variation in the outcomes of the proposed testing procedures; that is, the classic trinity of analytic methods used in traditional survey validation applications seem to be fixed a priori and independently of the substantive area of application, background psychological theory, and motivating goals for the creation of the instrument.

In many applications of psychological measurement, the definition of the attribute of interest is vague at best and incoherent or entirely absent at worst.

operationalism and other strong forms of empiricism may have encouraged the perception that psychological attributes do not need to be rigorously defined independently of a particular set of testing operations.

There may be good reason to be suspicious of strong claims regarding the accuracy, precision, and coherence of many survey-based instruments at least, to the extent to which such claims are justified with reference to traditional validation strategies and

especially in the presence of unclear or poorly formulated definitions of target attributes and theories regarding their connection to the outcomes of proposed measurement procedures.

Michell (e.g.,1999; Measurement in psychology: A critical history of a methodological concept) refers to this belief [measurement is a universally necessary component of scientific inquiry] as the quantitative imperative—the conviction that measurement is necessary for scientific inquiry— and gives a thorough historical account of its origins and development and of the ways in which it has shaped methodological reasoning in the psychological sciences since their inception.

Factoid from footnote 5: Lorem ipsum text, which is commonly used as placeholder text in publishing and graphic design applications, is itself derived from sections 1.10.32 and 1.10.33 of “de Finibus Bonorum et Malorum” (The Extremes of Good and Evil) by Cicero, written in 45 BC.

Salience and Switching: Hungarians do switch car insurance due to advertising campaign

Working paper Salience and Switching by András Kiss (UvA) [mirror SalienceSwitchingKiss].


I estimate the effect of a concentrated advertising period on contract switching decisions in auto liability insurance and show that consumers’ inattention is a major obstacle to switching service providers. For identification, I exploit a recent change in Hungarian regulation, which creates exogenous variation in the salience of the switching opportunity for a subset of drivers. Using a micro-level dataset, I find that the media campaign increases switching rates from 20% to 36%. I also jointly estimate switching costs and consumer inattention in a two-stage demand model, showing that 30% of insurees only consider switching because of the campaign.


[I]nertia can be due to the time or the effort cost of switching, but psychological factors, such as inattention, procrastination, or fear of new situations, can also create or heighten barriers to switching.

[T]here are a number of important retail markets with low switching rates and high consumer inertia (e.g. gas, electricity, banking) in which people could benefit from an endogenously arising campaign effect if switching opportunities were restricted to specific times of the year. (also see my Dutch blog post Waarom klanten onbewuste blijvers zijn & hoe je ze kunt overtuigen).

Natural experiment

The main point of this paper, however, is the use of a natural experiment to measure how much an actually observed policy change can influence consumer decisions, and to show that the mechanism works primarily by decreasing the share of inattentive people.

[I]dentifying the effect of salience on consumer switching.

Campaign: +16%-points switchers; financial incentive: +3.5%-points

In this paper, I exploit a change in auto liability insurance regulation in Hungary to identify the causal effect of a concentrated advertising period that provides no decision-relevant information to consumers, but increases the salience of the switching opportunity for a well-defined time period.

My main result is that the campaign almost doubles switching rates from a baseline of 20 percent to 36 percent.

In comparison, the estimated reduced-form relationship between financial incentives and switching decisions is much weaker: an additional saving of $35 per year (the median in the sample) is associated with only 3.5 percentage points higher switching rates.

Without the campaign, over two-thirds of consumers ignore the decision problem altogether, whereas during the campaign the implied share of inattentive people is 40 percent. The estimated mean switching cost is $57.
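To get a feel for how large the 20% to 36% jump is statistically, here is a sketch of a pooled two-proportion z-test. The counts are invented to reproduce the headline rates; Kiss's actual micro-level sample is far larger, so the true test statistic would differ.

```python
from math import erf, sqrt

def two_prop_ztest(x1: int, n1: int, x2: int, n2: int):
    """Two-sided pooled z-test for equality of two proportions."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # 2 * (1 - Phi(|z|))
    return z, p_value

# Invented counts matching the paper's baseline (20%) and campaign (36%) rates.
z, p = two_prop_ztest(x1=200, n1=1000, x2=360, n2=1000)
print(round(z, 2), p < 0.001)
```

Even with these modest hypothetical sample sizes, a 16-percentage-point difference is overwhelmingly significant.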

Hungarian auto-insurance market changed

Before January 1, 2010 (in the synchronized regime), contracts were required to coincide with the calendar year from the second year onwards. [P]eople [only] had the month of November to consider changing insurance contracts. Market players spent over 90 percent of their yearly marketing budgets in November.

Following January 1, 2010 (in the dispersed regime), all new insurance periods -including the first one – have become one year long.


People who can switch in November (campaign periods, grey areas) do so more often than in non-campaign periods.


Financial savings do play a role (the fraction of switchers increases with fee savings from switching), but the percentage switching is much higher in campaign season (November; circles) than in months without major advertising campaigns (squares)



Discussion & Conclusions

Using a natural experiment created by a change in auto insurance regulation in Hungary, I show that merely increasing the salience of a decision problem without transmitting relevant information has a large effect on people’s actions. Therefore, their “choice” to ignore the problem when it was not salient must have been suboptimal.

The coordination of all switching activity into a single month in the synchronized regime is also an effective idea, as the campaign effect estimates confirm.

Finally, a requirement to send a regulator-designed information leaflet on contract switching along with the insurers’ announcement of next year’s continuation prices could make a difference as well. In a field experiment, Adams et al. (2016) vary the amount of information and the cost of switching to alternative savings accounts in the U.K. and also look at the effect of reminders. They find a 3-9 pp increase in switching from a baseline of 3%. [Adams et al., 2016, is FCA Occasional Paper No. 19: Attention, Search and Switching: Evidence on Mandated Disclosure from the Savings Market]. (I don’t think Kiss’s paper adds new evidence for leaflets; I think it is an additional policy recommendation.)

My main result is that the media campaign has a large causal effect, increasing switching rates by around 16 percentage points from a baseline of 20 percent. 70 percent of people routinely ignore the decision problem, while the campaign decreases the share of inattentive people to 40 percent. The campaign’s effects are largely homogeneous across drivers and robust to a variety of specifications.
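The headline estimate is essentially a difference in switching shares between campaign and non-campaign months. A minimal sketch of the underlying two-proportion comparison; the counts below are hypothetical, chosen only to mirror the paper's 20% baseline and roughly 16 pp effect (the paper's actual estimate comes from a richer regression design):

```python
from math import sqrt

def two_prop_diff(x1, n1, x2, n2):
    """Difference in proportions p2 - p1 with a 95% CI
    (normal approximation)."""
    p1, p2 = x1 / n1, x2 / n2
    diff = p2 - p1
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return diff, (diff - 1.96 * se, diff + 1.96 * se)

# Hypothetical counts mirroring the paper's headline numbers:
# 20% switching in non-campaign months vs 36% in campaign months.
diff, ci = two_prop_diff(2000, 10000, 3600, 10000)
```

With samples of this size the confidence interval is tight around the 16 pp point estimate and comfortably excludes zero.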

Controlled experiments on the web: survey and practical guide

Controlled experiments on the web: survey and practical guide. Ron Kohavi · Roger Longbotham · Dan Sommerfield · Randal M. Henne (2009) Data Min Knowl Disc 18:140–181.


Controlled experiments neutralize confounding variables by distributing them equally over all values through random assignment, thus establishing a causal relationship between the changes made in the different variants and the measure(s) of interest, including the Overall Evaluation Criterion (OEC).

We agree and believe that companies can accelerate innovation through experimentation because it is the customers’ experience that ultimately matters, and we should listen to them all the time by running experiments.

Our experience indicates that significant learning and return-on-investment (ROI) are seen when development teams listen to their customers, not to the HiPPO (Highest Paid Person’s Opinion). Many organizations have strong managers who have strong opinions but lack data, so the authors coined the term HiPPO as a way to remind everyone that success really depends on the users’ perceptions.

The web provides an unprecedented opportunity to evaluate ideas quickly using controlled experiments, also called randomized experiments (single-factor or factorial designs), A/B tests (and their generalizations), or split tests.

When a company builds a system for experimentation, the cost of testing and experimental failure becomes small, thus encouraging innovation through experimentation. Failing fast and knowing that an idea is not as great as was previously thought helps provide necessary course adjustments so that other more successful ideas can be proposed and implemented.

Overall Evaluation Criterion (OEC)

A good OEC should not be short-term focused (e.g., clicks); to the contrary, it should include factors that predict long-term goals, such as predicted lifetime value and repeat visits. Ulwick describes some ways to measure what customers want (although not specifically for the web). [Book by Ulwick – What customers want, pdf].

When running experiments, it is important to decide in advance on the OEC (a planned comparison); otherwise, there is an increased risk of finding what appear to be significant results by chance (familywise type I error).
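The familywise risk is easy to quantify: with m independent metrics each tested at level alpha, the chance of at least one false positive is 1 − (1 − alpha)^m. A quick sketch (the function name is mine, not from the paper):

```python
def familywise_error(alpha: float, m: int) -> float:
    """Probability of at least one false positive when m independent
    metrics are each tested at level alpha.  This is why the OEC should
    be fixed in advance rather than picked from many metrics afterwards."""
    return 1.0 - (1.0 - alpha) ** m
```

With ten metrics at the usual 5% level, the chance of at least one spurious "significant" result is already about 40%.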

If the experiment was designed and executed properly, the only thing consistently different between the two variants is the change between the Control and Treatment, so any differences in the OEC are inevitably the result of this assignment, establishing causality.
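In practice, web systems often implement this random assignment with deterministic hash-based bucketing, so a user sees the same variant on every visit without any stored state. A sketch of the common pattern, not necessarily the authors' exact implementation:

```python
import hashlib

def assign_variant(user_id: str, experiment: str, treatment_pct: float) -> str:
    """Deterministically bucket a user by hashing (experiment, user_id).

    Salting the hash with the experiment name keeps assignments
    independent across concurrent experiments."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10_000          # uniform bucket in [0, 10000)
    return "treatment" if bucket < treatment_pct * 10_000 else "control"
```

For example, assign_variant("user42", "new-checkout", 0.5) returns the same variant on every call, and across many users the split converges to 50/50.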

we need to aggressively filter out robots that do not delete cookies and have a large number of actions


An experiment can be initiated with a small percentage of users assigned to the Treatment(s), and then that percentage can be gradually increased. For example, if you plan to run an A/B test at 50%/50%, you might start with a 99.9%/0.1% split, then ramp up the Treatment from 0.1% to 0.5% to 2.5% to 10% to 50%. At each step, which could run for, say, a couple of hours, you can analyze the data to make sure there are no egregious problems with the Treatment before exposing it to more users. The square factor in the power formula implies that such errors can be caught quickly on small populations, and the experiment can be aborted before many users are exposed to the bad Treatment.


    1. Quantitative metrics, but no explanations (no “why”)
    2. Short term versus long term effects (Long-term goals should be part of the OEC).
    3. Primacy and newness effects
    4. Features must be implemented (paper prototyping can be used for qualitative feedback, can complement controlled experiments).
    5. Consistency
    6. Parallel experiments (strong interactions are rare in practice; see below)
    7. Launch Events and Media Announcements. If there is a big announcement made about a new feature, such that the feature is announced to the media, all users need to see it.

Parallel or sequential?

There are two primary benefits of a single MVT (multivariable test) versus multiple sequential A/B tests of the same factors:
  1. You can test many factors in a short period of time
  2. You can estimate interactions between factors

Three common limitations are:

  1. Some combinations of factors may give a poor user experience
  2. Analysis and interpretation are more difficult
  3. It can take longer to begin the test


It is commonly thought that the power of the experiment decreases with the number of treatment combinations (cells). This may be true if the analysis is conducted by comparing each individual cell to the Control cell. However, if the analysis is the more traditional one of calculating main effects and interactions using all the data for each effect, little or no power is lost.
There are two things that will decrease your power, though. One is increasing the number of levels (variants) for a factor. This will effectively decrease the sample size for any comparison you want to make, whether the test is an MVT or an A/B test. The other is to assign less than 50% of the test population to the treatment (if there are two levels). It is especially important for treatments in an MVT to have the same percentage of the population as the Control.
If you want to test ideas as quickly as possible and aren’t concerned about interactions, use the overlapping experiments approach. (With overlapping experiments you test the factors more quickly and, if there is sufficient overlap in any two factors, you can estimate the interaction between those factors.)

If it is important to estimate interactions, run the experiments concurrently, with users independently randomized into each test, effectively giving you a full factorial experiment.

Lessons learned

6.1 Analysis

  • 6.1.1 Mine the data (a population of users with a specific browser version performed significantly worse in the Treatment)
  • 6.1.2 Speed matters
  • 6.1.3 Test one factor at a time (or not)
    • Conduct single-factor experiments for gaining insights and when you make incremental changes that could be decoupled
    • Try some bold bets and very different designs
    • Use full or fractional factorial designs suitable for estimating interactions when several factors are suspected to interact strongly. Limit the number of values per factor and assign the same percentage to the treatments as to the control. This gives your experiment maximum power to detect effects.

6.2 Trust and execution

  • 6.2.1 Run continuous A/A tests
  • 6.2.2 Automate ramp-up and abort
  • 6.2.3 Determine the minimum sample size [online power calculator]
  • 6.2.4 Assign 50% of users to treatment (For example, if an experiment is run at 99%/1%, then it will have to run about 25 times longer than if it ran at 50%/50%.)
  • 6.2.5 Beware of day of week effects

6.3 Culture and business

  • 6.3.1 Agree on the OEC upfront: the interested parties should agree on how an experiment is to be evaluated before the experiment is run.
  • 6.3.2 Beware of launching features that “do not hurt” users. (It is possible that the effect is negative but the experiment is underpowered.)
  • 6.3.3 Weigh the feature maintenance costs (A small increase in the OEC may not outweigh the cost of maintaining the feature)
  • 6.3.4 Change to a data-driven culture


The paper has some nice quotes:

  • The fewer the facts, the stronger the opinion – Arnold Glasow
  • The difference between theory and practice is larger in practice than the difference between theory and practice in theory – Jan L.A. van de Snepscheut
  • The road to hell is paved with good intentions and littered with sloppy analysis. – Anonymous
  • It is difficult to get a man to understand something when his salary depends upon his not understanding it. – Upton Sinclair
  • Almost any question can be answered cheaply, quickly and finally, by a test campaign. And that’s the way to answer them – not by arguments around a table. Go to the court of last resort – buyers of your products. – Claude Hopkins, Scientific Advertising, 1923
  • …the ability to experiment easily is a critical factor for Web-based applications. The online world is never static. There is a constant flow of new users, new products and new technologies. Being able to figure out quickly what works and what doesn’t can mean the difference between survival and extinction. – Hal Varian, 2007

Michael Wolf – Life in Cities

Until 22 April, the exhibition Life in Cities by Michael Wolf is on show at the Fotomuseum Den Haag. A nice introduction is this six-minute video.

The series Architecture of Density is gorgeous: skyscrapers in Hong Kong.


Transparent City (Chicago) also contains very beautiful photos.


And Paris Roof Tops (which were beautifully lit in a dark basement):


Other series on display were Tokyo Compression:


and work from his student days:


Definitely recommended!

On the way back we also passed Ringen aan Zee.


Psychology of investing and Bitcoins

Bitcoins are hot and much in the news. The price seems to go only up, with the occasional stumble. But cryptocurrencies are certainly not without risk. Psychological factors also play a role in this hype, which may keep consumers from making the right (risk) trade-offs.

Psychology of investors

AFM research in 2015 among self-directed investors "shows that investor behaviour often deviates from the way supervisors, financial firms and legislators would like investors to behave." The fact that real people sometimes behave differently from what perfectly rational models predict is one reason why the AFM has a dedicated Consumer Behaviour team. We use insight into actual decision behaviour to map risks and to supervise more effectively.

The psychological pitfalls of investing may apply even more strongly to buying and selling Bitcoins and other cryptocurrencies, because their price swings, both up and down, are far more extreme than those of more conventional investments such as stocks or bonds. Comparisons with the tulip mania, the Dutch campsite-share boom or the internet bubble come to mind (although: Do you know the story of the tulip mania? It isn't accurate (and holds lessons for the Bitcoin)).


In Google Trends, based on search queries, we see a large increase in interest in Bitcoins, rising almost as fast as the price. Search volume can have predictive value, according to a study by the CPB Netherlands Bureau for Economic Policy Analysis. "[There] turns out to be a strong correlation between the number of Google searches [for 'mortgage'] and the number of actual transactions in the housing market," the CPB researchers write in Een voorlopende huizenmarktindicator [A leading housing-market indicator].



The same holds for the "regular" stock market: when share prices rise, more people start investing. According to research agency Kantar TNS, the Netherlands had almost 1.4 million investing households in 2017, 14% more than in 2016. The researchers also speak of a bandwagon effect. This is a well-known phenomenon from psychology and underlines that humans are social beings, strongly influenced by what the people around them do. We wrote about this in 2015 (page 16 of the report Belangrijke inzichten over zelfstandige beleggers [Key insights into self-directed investors]):

[Social comparison and herd behaviour] refer to the fact that our decisions are influenced by the decisions others make. People compare their own situation with that of others in their social environment. If many people around us already own a certain product, we unconsciously assume it is a good product (Cialdini, 1993). So if many people around us start investing, we assume it is a good moment to do the same. Evidence for this follows from the strong correlation between the level of the AEX index and the number of investors.

And of course FOMO, Fear Of Missing Out, also plays a role. If the neighbour can buy a new boat from his Bitcoin profits (even if only on paper), we don't want to be left behind. Kantar's research also shows, incidentally, that "barely 2% of households" own cryptocurrencies such as the bitcoin. In an update from February 2018, Reg van Steen of Kantar writes that the number of Dutch crypto investors has exploded, but that less than a quarter of them are in profit. They estimate that the Netherlands has 580,000 crypto investors.

There is also academic literature arguing that retail investors often time the market badly: they buy when prices are high and sell when prices are low. To make a profit, you need to do exactly the opposite. This satirical piece from De Speld contains a kernel of truth: Bitcoin only stops rising once you get in.

Making a sound risk assessment

The AFM is a supervisor; it does not give investment advice. But the AFM does consider it important that consumers make a sound risk assessment. That is why the supervisor wrote Wees je bewust van de risico's van Bitcoins [Be aware of the risks of Bitcoins] in 2013, and recently Reële risico's bij cryptocurrencies [Real risks with cryptocurrencies]. More information about the risks of investing in virtual currencies can be found on the AFM website.

From Proof of Concept to Scalable Policies: Challenges and Solutions, with an Application

From Proof of Concept to Scalable Policies: Challenges and Solutions, with an Application. Abhijit Banerjee, Rukmini Banerji, James Berry, Esther Duflo, Harini Kannan, Shobhini Mukerji, Marc Shotland & Michael Walton. Journal of Economic Perspectives vol. 31, no. 4, Fall 2017 (pp. 73-102). Suggested by my colleague Alexandra van Geen.


The promise of randomized controlled trials is that evidence gathered through the evaluation of a specific program helps us—possibly after several rounds of fine-tuning and multiple replications in different contexts—to inform policy. However, critics have pointed out that a potential constraint in this agenda is that results from small “proof-of-concept” studies run by nongovernment organizations may not apply to policies that can be implemented by governments on a large scale. After discussing the potential issues, this paper describes the journey from the original concept to the design and evaluation of scalable policy. (…) We use this example to draw general lessons about using randomized control trials to design scalable policies.

In terms of establishing causal claims, it is generally accepted within the discipline that randomized controlled trials are particularly credible from the point of view of internal validity. This credibility applies to the interventions studied – at that time, on that population, implemented by the organization that was studied – but does not necessarily extend beyond. It is not at all clear that results from small “proof-of-concept” studies run by nongovernment organizations can or should be directly turned into recommendations for policies implemented by governments on a large scale. While the external validity of a randomized controlled trial cannot be taken for granted, it is far from unattainable.

6 obstacles

Six main challenges in drawing conclusions from a localized randomized controlled trial about a policy implemented at scale:

[1] Market equilibrium effects.  When an intervention is implemented at scale, it could change the nature of the market.

To assess the equilibrium impact of an intervention (…) The typical design is a two-stage randomization procedure in which the treatment is randomly assigned at the market level in addition to the random assignment within a market. For example, the experiment of Crepon, Duflo, Gurgand, Rathelot, and Zamora (2013) varied the treatment density of a job placement assistance program in France within labor markets, in addition to random assignment of individuals within each market.
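The two-stage design can be sketched in a few lines (the names, density levels, and return structure here are illustrative, not taken from the 2013 experiment):

```python
import random

def two_stage_assign(markets, units_per_market, density_levels, seed=0):
    """Two-stage randomization sketch: first each market draws a
    treatment density at random, then individuals within the market are
    randomized into treatment at that density.  Comparing untreated
    individuals across markets with different densities identifies
    equilibrium (displacement) effects."""
    rng = random.Random(seed)
    assignment = {}
    for market in markets:
        density = rng.choice(density_levels)
        treated = {u: rng.random() < density for u in range(units_per_market)}
        assignment[market] = {"density": density, "treated": treated}
    return assignment
```

A density level of 0 gives pure control markets, so market-level comparisons are possible alongside within-market ones.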

One potential challenge with the experimental identification of equilibrium effects is that it is not always obvious what the “market” is

When a particular intervention is scaled up, more people will be needed to implement it. This may lead to an increase in their wages or in difficulties hiring them.

[2] Spillover Effects. Many treatments have spillovers on neighboring units, which implies that those units are not ideal control groups. Not all spillovers are easy to detect in pilot experiments: in some cases, they may be highly nonlinear.

[3] Political Reactions. Political reactions, including either resistance to or support for a program, may vary as programs scale up.

Potential political backlash?

  • Worth exploring whether some changes in potentially inessential program details are available.
  • It is also important to try to anticipate the backlash and create a constituency for the reform from the start
  • Finally, the potential for political backlash may provide an argument for not doing too many pilots, since large-scale programs are less likely to be scotched.

[4] Context Dependence. Would results extend in a different setting (even within the same country)? Would the results depend on some observed or unobserved characteristics of the location where the intervention was carried out?

[5] Randomization or Site-selection bias. Organizations or individuals who agree to participate in an early experiment may be different from the rest of the population; randomization bias.

  • Organizations (and even individuals within governments) who agree to participate in randomized controlled trials are often exceptional
  • A well-understood problem arises when individuals select into treatment
  • Site-selection bias arises because an organization chooses a location or a subgroup where effects are particularly large

Blair, Iyengar, and Shapiro (2013): randomized controlled trials are disproportionately conducted in countries with democratic governments.

[6] Piloting Bias/Implementation Challenges. A number of studies have found differences between implementation by nongovernment organizations and governments. Banerjee, Hanna, Kyle, Olken, and Sumarto (2016): the [Indonesian] government was less effective at running a pilot program and more effective with full implementation.

As the discussion in this section has emphasized, the issue of how to travel from evidence at proof-of-concept level to a scaled-up version cannot be settled in the abstract. The issue of [4] context-dependence needs to be addressed through replications, ideally guided by theory. [1] General equilibrium and [2] spillover effects can be addressed by incorporating estimation of these effects into study designs, or by conducting large-scale experiments where the equilibrium plays out. [5] Randomization bias and [6] piloting bias can be addressed by trying out the programs on a sufficient scale with the government that will eventually implement it, documenting success and failure, and moving from there.

[I skipped the Teaching at the Right Level example]

General Lessons

Perhaps the key point is to remember what small pilot experiments are good for and what they are not good for.

If the objective is to design or test a model [i.e. no policy implications], the researcher can ignore most of the concerns that we talked about in this paper. Something valuable will be learnt anyway.

For researchers, a strong temptation in a stage-two trial will be to do what it takes “to make it work,” but the risk of implementation challenges means that it is important to think about how far to go in that direction. On the one hand, trial and error will be needed to embed any new intervention within an existing bureaucracy. Anything new is challenging, and at the beginning of a stage-two trial, considerable time needs to be spent to give the program a fair shot. On the other hand, if the research team embeds too much of its own staff and effort and ends up substituting for the organization, not enough will be learnt about where implementation problems might emerge.