Who’s the Guinea Pig? Investigating Online A/B/n Tests in-the-Wild

This Fast Company article, “We’re all being manipulated by A/B testing all the time,” led me to read:

Jiang, S., Martin, J., & Wilson, C. (2019, January). Who’s the Guinea Pig? Investigating Online A/B/n Tests in-the-Wild. In Proceedings of the Conference on Fairness, Accountability, and Transparency (pp. 201-210). ACM.


Some quotes from the Fast Company piece:

The problem is that, when it’s not done in a transparent, responsible way, A/B testing can leverage the worst impulses of human psychology to convince you to click on something.

“I’m just concerned that you have more untrained people running experiments on large audiences”

“But we have to acknowledge that these are human experiments. You’re trying to change people’s behavior. If I was doing that as a scientist at a university, there’s protocol. You can’t just experiment on people.”

I like this response from Kaiser Fung, “Is A/B testing that scary?”:

In short, human supervision of A/B testing from design to interpretation is definitely needed. A/B tests provide a wealth of data to support decision-making. The biases highlighted by the Fast Company article are merely revealed by the testing – they are not caused by it.

Paper abstract

A/B/n testing has been adopted by many technology companies as a data-driven approach to product design and optimization. These tests are often run on their websites without explicit consent from users. In this paper, we investigate such online A/B/n tests by using Optimizely as a lens.

First, we provide measurement results of 575 websites that use Optimizely drawn from the Alexa Top-1M, and analyze the distributions of their audiences and experiments.

Then, we use three case studies to discuss potential ethical pitfalls of such experiments, including involvement of political content, price discrimination, and advertising campaigns. We conclude with a suggestion for greater awareness of ethical concerns inherent in human experimentation and a call for increased transparency among A/B/n test operators.

By 2011 Google engineers claimed to be running over 7,000 A/B tests per year.

We crawled, with PhantomJS, 10,584 sites. Of these sites, 575 were running experiments with Optimizely. In total, our analysis considers 1,143 audiences (i.e., treatment groups) and 2,001 experiments (i.e., treatments).

Segmenting users

We observe that 221 of 575 (38.4%) websites in our sample defined audiences for their experiments. Note that this does not necessarily mean the remaining 354 websites were not running experiments; recall that Optimizely allows simple A/B/n experiments to be run with random assignment of users, i.e., no manually defined audiences are required.
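The “random assignment” mentioned here is typically implemented with deterministic hash-based bucketing, so that a returning visitor always sees the same treatment without the site storing any per-user state. A minimal sketch of that common technique (the function name, IDs, and hash choice are my own illustration, not Optimizely’s actual implementation):

```python
import hashlib

def assign_variation(user_id: str, experiment_id: str, variations: list[str]) -> str:
    """Deterministically map a visitor to a variation by hashing the
    (user, experiment) pair, so repeat visits get the same treatment."""
    digest = hashlib.md5(f"{user_id}:{experiment_id}".encode()).hexdigest()
    bucket = int(digest, 16) % len(variations)
    return variations[bucket]

# The same visitor always lands in the same bucket for a given experiment:
first = assign_variation("visitor-42", "exp-homepage", ["A", "B"])
repeat = assign_variation("visitor-42", "exp-homepage", ["A", "B"])
assert first == repeat
```

Because assignment depends only on the hash, no manually defined audience is needed, which is why the remaining 354 sites may still be experimenting.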

Segmentation by device: operators are primarily interested in separating desktop and mobile users. We hypothesize that iPhone and iPad users are frequently segmented because it is easier to identify them as classes (the hardware is more homogeneous). Alternatively, website operators may view Apple users as “high value” (i.e., as a proxy for affluence).

Segmentation by browser: Internet Explorer is the most frequently targeted browser. We suspect that operators are using Optimizely to selectively apply compatibility patches. (…) Safari may be highly targeted for the same reason as iPhone and iPad users, i.e., as a means to segment the valuable Apple-user population.

Segmentation by location: operators focus on localized audiences from technologically developed countries (US, UK, Australia) and on such states within the USA as CA, TX, NY, and IL; there is a significant, positive correlation between technology index and audience distribution.


We observe that the majority of websites in our sample (206 of 297, 69.4%) are running five or fewer experiments.

As shown in Figure 14, 897 of 2358 (38.0%) of experiments in our sample have a single variation, which corresponds to ‘dummy’ experiments where all site visitors experience the same treatment. These ‘dummies’ may represent the final stage of successful experiments, e.g., the operator determined that a specific treatment in an A/B/n experiment was most effective, and they now apply it to all site visitors. 1214 of 2358 (51.5%) have two variations, which may correspond to A/B tests. The remaining experiments (247 of 2358, 10.4%) have more variations, which may correspond to multivariate tests.



Case study: Dynamic Pricing

Similarly, prior work has uncovered numerous instances of online price discrimination, including A/B testing on Expedia [slides, paper]; OCEs could also be used to manipulate consumer behavior by altering prices, changing the presentation of products to emphasize more or less expensive options, or tailoring discounts to specific users.

And “Several studies have identified price discrimination across a broad swath of e-commerce websites”.

I did miss a reference to: Vissers, T., Nikiforakis, N., Bielova, N., & Joosen, W. (2014, July). Crying wolf? on the price discrimination of online airline tickets. In 7th Workshop on Hot Topics in Privacy Enhancing Technologies (HotPETs 2014), who write: “we do not find any evidence for systematic price discrimination.”

PolicyGenius is an online insurance policy company that we observe running 13 experiments with 13 defined audiences in our dataset. A group of its experiments, called ‘Marketing Life Landing Page Test,’ targets audiences based on Urchin Tracking Module (UTM) parameters, which are parameters passed by Google’s analytics and advertising services.

Audiences with different UTMs are shown different insurance advertisements, including several of the form ‘Term Life Insurance As Low As X Per Month,’ where X can be $9.99, $10, or $29. Other treatments use text of the form ‘Compare term life insurance quotes and save up to 40%’ and ‘Compare and apply for term life insurance online,’ i.e., with and without discount offers.
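UTM parameters are just query-string fields on the landing-page URL, so audience targeting of this kind reduces to parsing the URL and keying a headline off the traffic source. A hypothetical sketch (the source-to-headline mapping is my own illustration, not PolicyGenius’s actual configuration):

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical mapping from a UTM source to an ad headline, mirroring
# the kind of audience-dependent copy described above.
HEADLINES = {
    "google": "Term Life Insurance As Low As $X Per Month",
    "facebook": "Compare term life insurance quotes and save up to 40%",
}
DEFAULT = "Compare and apply for term life insurance online"

def headline_for(landing_url: str) -> str:
    """Pick the headline variant based on the utm_source query parameter."""
    params = parse_qs(urlparse(landing_url).query)
    source = params.get("utm_source", [""])[0]
    return HEADLINES.get(source, DEFAULT)

# Visitors arriving from different campaigns see different copy:
print(headline_for("https://example.com/life?utm_source=facebook"))
print(headline_for("https://example.com/life"))
```

The point is how little machinery this takes: the segmentation signal arrives for free with every ad click.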

For example, the higher-priced policies may be targeted to senior citizens. If the audience segments used by PolicyGenius are localized, or include other sensitive demographic characteristics, this price discrimination would be similarly troubling.

Case study: Advertising Campaigns

(on Optimizely’s own site) Each treatment places a different banner picture on the homepage that is specifically designed to attract users from each company, e.g., the banner picture for Petco is a dog, while for Centers for Medicare & Medicaid Services it is a picture of doctor/patient communication, etc.

These examples exemplify just how highly targeted modern online advertising can be. It is easy to envision contexts where such micro-targeting could be creepy (e.g., on a health care website) or ethically problematic (e.g., on a mortgage company website).


We do not observe any websites engaging in overtly unethical behavior in our dataset.

An explicit goal of our work is to raise awareness among the public about the existence of OCEs [Online Controlled Experiments], and push website operators towards greater transparency. We argue that transparency is one of the best defenses against harm to users. Additionally, companies should follow the same norms as academia and obtain informed consent before running substantive experiments.

