From Proof of Concept to Scalable Policies: Challenges and Solutions, with an Application. Abhijit Banerjee, Rukmini Banerji, James Berry, Esther Duflo, Harini Kannan, Shobhini Mukerji, Marc Shotland, Michael Walton. Journal of Economic Perspectives vol. 31, no. 4, Fall 2017 (pp. 73-102). Suggested by my colleague Alexandra van Geen.
The promise of randomized controlled trials is that evidence gathered through the evaluation of a specific program helps us—possibly after several rounds of fine-tuning and multiple replications in different contexts—to inform policy. However, critics have pointed out that a potential constraint in this agenda is that results from small “proof-of-concept” studies run by nongovernment organizations may not apply to policies that can be implemented by governments on a large scale. After discussing the potential issues, this paper describes the journey from the original concept to the design and evaluation of scalable policy. (…) We use this example to draw general lessons about using randomized control trials to design scalable policies.
In terms of establishing causal claims, it is generally accepted within the discipline that randomized controlled trials are particularly credible from the point of view of internal validity. This credibility applies to the interventions studied—at that time, on that population, implemented by the organization that was studied—but does not necessarily extend beyond. It is not at all clear that results from small “proof-of-concept” studies run by nongovernment organizations can or should be directly turned into recommendations for policies to be implemented by governments on a large scale. While the external validity of a randomized controlled trial cannot be taken for granted, it is far from unattainable.
Six main challenges in drawing conclusions from a localized randomized controlled trial about a policy implemented at scale:
- Market equilibrium effects. When an intervention is implemented at scale, it could change the nature of the market.
To assess the equilibrium impact of an intervention (…) The typical design is a two-stage randomization procedure in which the treatment is randomly assigned at the market level in addition to the random assignment within a market. For example, the experiment of Crepon, Duflo, Gurgand, Rathelot, and Zamora (2013) varied the treatment density of a job placement assistance program in France within labor markets, in addition to random assignment of individuals within each market.
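The two-stage design described above can be sketched in a few lines of code. This is a minimal illustration, not the authors' actual procedure: the market names and sample sizes are hypothetical, though the candidate density values (0 to 100 percent) follow the kind of variation used in Crepon et al. (2013). Stage one draws a treatment density for each market; stage two randomizes individuals within each market at that density.

```python
import random

def two_stage_assignment(markets, densities, seed=0):
    """Hypothetical sketch of a two-stage randomized design:
    stage 1 assigns each market a treatment density at random,
    stage 2 randomizes individuals within the market at that density."""
    rng = random.Random(seed)
    plan = {}
    for market, individuals in markets.items():
        density = rng.choice(densities)      # stage 1: market-level draw
        people = list(individuals)
        rng.shuffle(people)                  # stage 2: within-market draw
        n_treated = round(density * len(people))
        plan[market] = {
            "density": density,
            "treated": set(people[:n_treated]),
            "control": set(people[n_treated:]),
        }
    return plan

# Example: two hypothetical labor markets, densities from 0% to 100%
markets = {"A": range(100), "B": range(80)}
plan = two_stage_assignment(markets, [0.0, 0.25, 0.5, 0.75, 1.0])
```

Comparing outcomes for control individuals across markets with different densities is what identifies the equilibrium (displacement) effect, on top of the usual within-market treatment-control comparison.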
One potential challenge with the experimental identification of equilibrium effects is that it is not always obvious what the “market” is.
When a particular intervention is scaled up, more people will be needed to implement it. This may lead to an increase in their wages or in difficulties hiring them.
- Spillover effects. Many treatments have spillovers on neighboring units, which implies that those units are not ideal control groups. Not all spillovers are easy to detect in pilot experiments: in some cases, they may be highly nonlinear.
- Political reactions. These reactions, including either resistance to or support for a program, may vary as programs scale up.
Potential political backlash?
- Worth exploring whether some changes in potentially inessential program details are available.
- It is also important to try to anticipate the backlash and create a constituency for the reform from the start.
- Finally, the potential for political backlash may provide an argument for not doing too many pilots, since large-scale programs are less likely to be scotched.
- Context dependence. Would the results extend to a different setting (even within the same country)? Would the results depend on some observed or unobserved characteristics of the location where the intervention was carried out?
- Randomization or site-selection bias. Organizations or individuals who agree to participate in an early experiment may be different from the rest of the population; this is randomization bias.
- Organizations (and even individuals within governments) who agree to participate in randomized controlled trials are often exceptional.
- A well-understood problem arises when individuals select into treatment.
- Site-selection bias arises because an organization chooses a location or a subgroup where effects are particularly large.
Blair, Iyengar, and Shapiro (2013): randomized controlled trials are disproportionately conducted in countries with democratic governments.
- Piloting bias/implementation challenges. A number of studies have found differences between implementation by nongovernment organizations and governments. Banerjee, Hanna, Kyle, Olken, and Sumarto (2016): the [Indonesian] government was less effective at running a pilot program and more effective with full implementation.
As the discussion in this section has emphasized, the issue of how to travel from evidence at proof-of-concept level to a scaled-up version cannot be settled in the abstract. The issue of context dependence needs to be addressed through replications, ideally guided by theory. General equilibrium and spillover effects can be addressed by incorporating estimation of these effects into study designs, or by conducting large-scale experiments where the equilibrium plays out. Randomization bias and piloting bias can be addressed by trying out the programs on a sufficient scale with the government that will eventually implement them, documenting success and failure, and moving from there.
[I skipped the Teaching at the Right Level example]
Perhaps the key point is to remember what small pilot experiments are good for and what they are not good for.
If the objective is to design or test a model [i.e. no policy implications], the researcher can ignore most of the concerns that we talked about in this paper. Something valuable will be learnt anyway.
For researchers, a strong temptation in a stage-two trial will be to do what it takes “to make it work,” but the risk of implementation challenges means that it is important to think about how far to go in that direction. On the one hand, trial and error will be needed to embed any new intervention within an existing bureaucracy. Anything new is challenging, and at the beginning of a stage-two trial, considerable time needs to be spent to give the program a fair shot. On the other hand, if the research team embeds too much of its own staff and effort and ends up substituting for the organization, not enough will be learnt about where implementation problems might emerge.