A/B testing

Example of A/B testing on a website. By randomly serving visitors two versions of a website that differ only in the design of a single button element, the relative efficacy of the two designs can be measured.

A/B testing (also known as bucket testing, split-run testing, or split testing) is a user experience research methodology. [1] A/B tests consist of a randomized experiment that usually involves two variants (A and B), [2] [3] [4] although the concept can also be extended to multiple variants of the same variable. It includes application of statistical hypothesis testing or "two-sample hypothesis testing" as used in the field of statistics. A/B testing is a way to compare multiple versions of a single variable, for example by testing a subject's response to variant A against variant B and determining which of the variants is more effective. [5]

Overview

"A/B testing" is a shorthand for a simple randomized controlled experiment, in which a number of samples (e.g. A and B) of a single vector-variable are compared. [1] These values are similar except for one variation which might affect a user's behavior. A/B tests are widely considered the simplest form of controlled experiment, especially when they only involve two variants. However, by adding more variants to the test, its complexity grows. [6]

A/B tests are useful for understanding user engagement and satisfaction with online features, such as a new feature or product. [7] Large social media sites like LinkedIn, Facebook, and Instagram use A/B testing to make user experiences more successful and as a way to streamline their services. [7]

Today, A/B tests are also used to conduct complex experiments on subjects such as network effects when users are offline, how online services affect user actions, and how users influence one another. [7] A/B testing is used by data engineers, marketers, designers, software engineers, and entrepreneurs, among others. [8] Many positions rely on data from A/B tests, as they allow companies to understand growth, increase revenue, and optimize customer satisfaction. [8]

Version A might be the version used at present (thus forming the control group), while version B is modified in some respect relative to A (thus forming the treatment group). For instance, on an e-commerce website the purchase funnel is typically a good candidate for A/B testing, since even marginal decreases in drop-off rates can represent a significant gain in sales. Significant improvements can sometimes be seen through testing elements like copy text, layouts, images, and colors, [9] but not always. In these tests, users only see one of the two versions, since the goal is to discover which of the two is preferable. [10]

Multivariate testing or multinomial testing is similar to A/B testing, but may test more than two versions at the same time or use more controls. Simple A/B tests are not valid for observational, quasi-experimental or other non-experimental situations—commonplace with survey data, offline data, and other, more complex phenomena.

A/B testing is claimed by some to be a change in philosophy and business strategy in certain niches, though the approach is identical to a between-subjects design, which is commonly used in a variety of research traditions. [11] [12] [13] A/B testing as a philosophy of web development brings the field into line with a broader movement toward evidence-based practice. A frequently cited benefit of A/B testing is that it can be performed continuously on almost anything, especially since most marketing automation software now typically comes with the ability to run A/B tests on an ongoing basis.

Common test statistics

"Two-sample hypothesis tests" are appropriate for comparing the two samples where the samples are divided by the two control cases in the experiment. Z-tests are appropriate for comparing means under stringent conditions regarding normality and a known standard deviation. Student's t-tests are appropriate for comparing means under relaxed conditions when less is assumed. Welch's t test assumes the least and is therefore the most commonly used test in a two-sample hypothesis test where the mean of a metric is to be optimized. While the mean of the variable to be optimized is the most common choice of estimator, others are regularly used.

For a comparison of two binomial distributions, such as a click-through rate, one would use Fisher's exact test.

Assumed distribution | Example case                      | Standard test                     | Alternative test
Gaussian             | Average revenue per user          | Welch's t-test (unpaired t-test)  | Student's t-test
Binomial             | Click-through rate                | Fisher's exact test               | Barnard's test
Poisson              | Transactions per paying user      | E-test [14]                       | C-test
Multinomial          | Number of each product purchased  | Chi-squared test                  | G-test
Unknown              |                                   | Mann–Whitney U test               | Gibbs sampling
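
As an illustration, the two most common cases in the table can be computed directly in Python with SciPy. This is a minimal sketch; the revenue figures and click counts below are invented for the example.

```python
# Minimal sketch of the two most common cases above, using SciPy.
# The revenue figures and click counts are invented for illustration.
from scipy import stats

# Gaussian-like metric (e.g., average revenue per user):
# Welch's t-test does not assume equal variances between groups.
revenue_a = [12.1, 9.8, 14.3, 11.0, 10.5, 13.2]
revenue_b = [13.4, 12.9, 15.1, 11.8, 14.0, 12.7]
t_stat, p_value = stats.ttest_ind(revenue_a, revenue_b, equal_var=False)
print(f"Welch's t-test: t = {t_stat:.3f}, p = {p_value:.3f}")

# Binomial metric (e.g., click-through rate):
# Fisher's exact test on a 2x2 table of clicks vs. non-clicks.
clicks_a, impressions_a = 120, 2400
clicks_b, impressions_b = 150, 2400
table = [[clicks_a, impressions_a - clicks_a],
         [clicks_b, impressions_b - clicks_b]]
odds_ratio, p_value = stats.fisher_exact(table)
print(f"Fisher's exact test: OR = {odds_ratio:.3f}, p = {p_value:.3f}")
```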

Challenges

When conducting A/B testing, the user should weigh its pros and cons to see whether it fits the kind of question they are trying to answer.

Pros: Through A/B testing, it is easy to get a clear idea of what users prefer, since it directly tests one option against the other. Because it is based on real user behavior, the data can be very helpful, especially when determining what works better between two options. In addition, it can provide answers to very specific design questions. One example is Google's A/B testing with hyperlink colors: to optimize revenue, the company tested dozens of hyperlink hues to see which color users clicked on most.

Cons: There are, however, some drawbacks to A/B testing. As mentioned above, A/B testing works well for specific design questions, but this can also be a downside, since it is mostly suited to specific design problems with very measurable outcomes. It can also be a costly and time-consuming process. Depending on the size of the company and/or team, there can be many meetings and discussions about what exactly to test and what the impact of the A/B test is. If the test finds no significant impact, it can end up a waste of time and resources.

In December 2018, representatives with experience in large-scale A/B testing from thirteen organizations (Airbnb, Amazon, Booking.com, Facebook, Google, LinkedIn, Lyft, Microsoft, Netflix, Twitter, Uber, Yandex, and Stanford University) attended a summit and summarized the top challenges in a SIGKDD Explorations paper. [15] The challenges can be grouped into four areas: Analysis, Engineering and Culture, Deviations from Traditional A/B tests, and Data quality.

History

It is difficult to definitively establish when A/B testing was first used. The first randomized double-blind trial, to assess the effectiveness of a homeopathic drug, occurred in 1835. [16] Experimentation with advertising campaigns, which has been compared to modern A/B testing, began in the early twentieth century. [17] The advertising pioneer Claude Hopkins used promotional coupons to test the effectiveness of his campaigns. However, this process, which Hopkins described in his Scientific Advertising, did not incorporate concepts such as statistical significance and the null hypothesis, which are used in statistical hypothesis testing. [18] Modern statistical methods for assessing the significance of sample data were developed separately in the same period. This work was done in 1908 by William Sealy Gosset when he altered the Z-test to create Student's t-test. [19] [20]

With the growth of the internet, new ways to sample populations have become available. Google engineers ran their first A/B test in 2000, in an attempt to determine the optimal number of results to display on the search engine results page. [5] The first test was unsuccessful due to glitches caused by slow loading times. Later A/B testing research would be more advanced, but the foundation and underlying principles generally remain the same; in 2011, 11 years after Google's first test, Google ran over 7,000 different A/B tests. [5]

In 2012, a Microsoft employee working on the search engine Microsoft Bing created an experiment to test different ways of displaying advertising headlines. Within hours, the alternative format produced a revenue increase of 12% with no impact on user-experience metrics. [4] Today, companies like Microsoft and Google each conduct over 10,000 A/B tests annually. [4]

Many companies now use the "designed experiment" approach to making marketing decisions, with the expectation that relevant sample results can improve positive conversion results. It is an increasingly common practice as the tools and expertise grow in this area. [21]

Examples

Email marketing

A company with a customer database of 2,000 people decides to create an email campaign with a discount code in order to generate sales through its website. It creates two versions of the email with different calls to action (the part of the copy which encourages customers to do something; in the case of a sales campaign, to make a purchase) and identifying promotional codes.

All other elements of the emails' copy and layout are identical. The company then monitors which campaign has the higher success rate by analyzing the use of the promotional codes. The email using the code A1 has a 5% response rate (50 of the 1,000 people emailed used the code to buy a product), and the email using the code B1 has a 3% response rate (30 of the recipients used the code to buy a product). The company therefore determines that in this instance, the first Call To Action is more effective and will use it in future sales. A more nuanced approach would involve applying statistical testing to determine if the differences in response rates between A1 and B1 were statistically significant (that is, highly likely that the differences are real, repeatable, and not due to random chance). [22]
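
A minimal sketch of such a significance check, applying a two-proportion z-test to the response counts above (the test choice is illustrative; Fisher's exact test from the table in "Common test statistics" would also apply):

```python
# Sketch: two-proportion z-test on the campaign results above
# (50/1,000 responses for A1 vs. 30/1,000 for B1).
from math import sqrt
from scipy.stats import norm

resp_a, n_a = 50, 1000
resp_b, n_b = 30, 1000
p_a, p_b = resp_a / n_a, resp_b / n_b
p_pool = (resp_a + resp_b) / (n_a + n_b)           # pooled response rate
se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_a - p_b) / se
p_value = 2 * norm.sf(abs(z))                      # two-sided p-value
print(f"z = {z:.2f}, p = {p_value:.3f}")           # z ~ 2.28, p ~ 0.02
```

Since the p-value here falls below the conventional 0.05 threshold, the 5% vs. 3% gap is unlikely to be due to chance alone.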

In the example above, the purpose of the test is to determine which is the more effective way to encourage customers to make a purchase. If, however, the aim of the test had been to see which email would generate the higher click-rate (that is, the number of people who actually click through to the website after receiving the email), then the results might have been different.

For example, even though more of the customers receiving the code B1 accessed the website, because the call to action did not state the end-date of the promotion, many of them may have felt no urgency to make an immediate purchase. Consequently, if the purpose of the test had been simply to see which email would bring more traffic to the website, then the email containing code B1 might well have been more successful. An A/B test should have a defined, measurable outcome, such as the number of sales made, click-rate conversion, or number of people signing up/registering. [23]

A/B testing for product pricing

A/B testing can be used to determine the right price for a product, which is perhaps one of the most difficult tasks when a new product or service is launched. A/B testing (especially valid for digital goods) is an excellent way to find out which price point and offering maximize total revenue.
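
As a sketch of the underlying arithmetic, total revenue per visitor is the price times the conversion rate for each arm, so a higher price can win even with fewer purchases. The figures below are invented for illustration:

```python
# Sketch: comparing revenue per visitor across two price points
# (hypothetical conversion counts from a pricing A/B test).
visitors = 1000
arms = {
    "A ($9.99)":  {"price": 9.99,  "purchases": 90},
    "B ($14.99)": {"price": 14.99, "purchases": 70},
}
for name, arm in arms.items():
    revenue_per_visitor = arm["price"] * arm["purchases"] / visitors
    print(f"{name}: ${revenue_per_visitor:.2f} per visitor")
# B wins here despite a lower conversion rate: 14.99*0.07 > 9.99*0.09.
```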

Political A/B testing

A/B tests have also been used by political campaigns. In 2007, Barack Obama's presidential campaign used A/B testing as a way to generate online engagement and understand what voters wanted to see from the candidate. [24] For example, Obama's team tested four distinct buttons on its website that led users to sign up for newsletters. Additionally, the team tested six different accompanying images to draw in users. Through A/B testing, staffers were able to determine how to effectively draw in voters and garner additional interest. [24]

HTTP Routing and API feature testing

HTTP router with A/B testing

A/B testing is very common when deploying a newer version of an API. [25] For real-time user experience testing, an HTTP Layer-7 reverse proxy is configured so that N% of HTTP traffic goes to the newer version of the backend instance, while the remaining (100−N)% of HTTP traffic hits the (stable) older version of the backend HTTP application service. [25] This is usually done to limit the exposure of customers to the newer backend instance, so that if there is a bug in the newer version, only N% of the total user agents or clients are affected while the others are routed to the stable backend; this is a common ingress control mechanism. [25]
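
A hypothetical sketch of this mechanism (not taken from the cited paper; the backend URLs and helper are invented) hashes a stable client identifier, so each user consistently lands on the same backend:

```python
# Hypothetical sketch of hash-based traffic splitting at a reverse proxy.
# Hashing a stable client ID keeps each user pinned to one backend.
import hashlib

NEW_BACKEND_PERCENT = 10  # N% of traffic goes to the newer version

def choose_backend(client_id: str) -> str:
    digest = hashlib.sha256(client_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100  # uniform bucket in [0, 100)
    if bucket < NEW_BACKEND_PERCENT:
        return "http://backend-v2.internal"  # newer version under test
    return "http://backend-v1.internal"      # stable version

print(choose_backend("user-12345"))
```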

Segmentation and targeting

A/B tests most commonly apply the same variant (e.g., user interface element) with equal probability to all users. However, in some circumstances, responses to variants may be heterogeneous. That is, while a variant A might have a higher response rate overall, variant B may have an even higher response rate within a specific segment of the customer base. [26]

For instance, in the above example, the breakdown of the response rates by gender could have been:

Gender          | Overall       | Men         | Women
Total sends     | 2,000         | 1,000       | 1,000
Total responses | 80            | 35          | 45
Variant A       | 50/1,000 (5%) | 10/500 (2%) | 40/500 (8%)
Variant B       | 30/1,000 (3%) | 25/500 (5%) | 5/500 (1%)

In this case, we can see that while variant A had a higher response rate overall, variant B actually had a higher response rate with men.

As a result, the company might select a segmented strategy based on the A/B test, sending variant B to men and variant A to women in the future. In this example, a segmented strategy would yield an increase in expected response rates from 5% to (8% + 5%)/2 = 6.5%, constituting a 30% increase.
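
The expected-rate arithmetic can be reproduced directly from the counts in the table above:

```python
# Sketch: expected response rate under a segmented strategy,
# using the counts from the table above.
rate_a_women   = 40 / 500    # 8%: variant A sent to women
rate_b_men     = 25 / 500    # 5%: variant B sent to men
rate_a_overall = 50 / 1000   # 5%: variant A sent to everyone

segmented = (rate_a_women + rate_b_men) / 2   # equal-sized segments
print(f"{rate_a_overall:.1%} -> {segmented:.1%}")  # 5.0% -> 6.5%
lift = segmented / rate_a_overall - 1
print(f"lift: {lift:.0%}")  # 30%
```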

If segmented results are expected from the A/B test, the test should be properly designed at the outset to be evenly distributed across key customer attributes, such as gender. That is, the test should both (a) contain a representative sample of men vs. women, and (b) assign men and women randomly to each variant (variant A vs. variant B); see the sketch below. Failure to do so could lead to experiment bias and inaccurate conclusions being drawn from the test. [27]
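
One simple way to satisfy both (a) and (b) is stratified random assignment. The sketch below is a hypothetical helper, not a method from the cited source: it shuffles users within each stratum before splitting them between variants.

```python
# Sketch: stratified random assignment so each variant receives a
# balanced, randomly chosen mix of each customer attribute (here, gender).
import random
from collections import defaultdict

def stratified_assign(users, key):
    """Split users into variants A and B, balanced within each stratum."""
    strata = defaultdict(list)
    for user in users:
        strata[key(user)].append(user)
    assignment = {}
    for group in strata.values():
        random.shuffle(group)                 # randomize within stratum
        half = len(group) // 2
        for user in group[:half]:
            assignment[user["id"]] = "A"
        for user in group[half:]:
            assignment[user["id"]] = "B"
    return assignment

users = [{"id": i, "gender": "M" if i % 2 else "F"} for i in range(8)]
print(stratified_assign(users, key=lambda u: u["gender"]))
```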

This segmentation and targeting approach can be further generalized to include multiple customer attributes rather than a single customer attribute – for example, customers' age and gender – to identify more nuanced patterns that may exist in the test results.
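
With multiple attributes, the same breakdown is just a grouped aggregation. A sketch with invented data, assuming pandas is available:

```python
# Sketch: breaking down variant response rates by multiple attributes
# (hypothetical data for illustration).
import pandas as pd

df = pd.DataFrame({
    "variant":   ["A", "A", "B", "B", "A", "B", "A", "B"],
    "gender":    ["M", "F", "M", "F", "M", "F", "F", "M"],
    "age_group": ["18-34", "35+", "18-34", "35+",
                  "35+", "18-34", "18-34", "35+"],
    "responded": [0, 1, 1, 0, 0, 0, 1, 1],
})

# Mean response rate per (variant, gender, age_group) cell.
rates = df.groupby(["variant", "gender", "age_group"])["responded"].mean()
print(rates)
```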

Related Research Articles

Analysis of variance (ANOVA) is a collection of statistical models and their associated estimation procedures used to analyze the differences among means. ANOVA was developed by the statistician Ronald Fisher. ANOVA is based on the law of total variance, where the observed variance in a particular variable is partitioned into components attributable to different sources of variation. In its simplest form, ANOVA provides a statistical test of whether two or more population means are equal, and therefore generalizes the t-test beyond two means.

Biostatistics is a branch of statistics that applies statistical methods to a wide range of topics in biology. It encompasses the design of biological experiments, the collection and analysis of data from those experiments and the interpretation of the results.

Statistics is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of data. In applying statistics to a scientific, industrial, or social problem, it is conventional to begin with a statistical population or a statistical model to be studied. Populations can be diverse groups of people or objects such as "all people living in a country" or "every atom composing a crystal". Statistics deals with every aspect of data, including the planning of data collection in terms of the design of surveys and experiments.

A statistical hypothesis test is a method of statistical inference used to decide whether the data sufficiently support a particular hypothesis. A statistical hypothesis test typically involves a calculation of a test statistic. Then a decision is made, either by comparing the test statistic to a critical value or equivalently by evaluating a p-value computed from the test statistic. Roughly 100 specialized statistical tests have been defined.

An experiment is a procedure carried out to support or refute a hypothesis, or determine the efficacy or likelihood of something previously untried. Experiments provide insight into cause-and-effect by demonstrating what outcome occurs when a particular factor is manipulated. Experiments vary greatly in goal and scale but always rely on repeatable procedure and logical analysis of the results. There also exist natural experimental studies.

In scientific research, the null hypothesis is the claim that the effect being studied does not exist. Note that the term "effect" here is not meant to imply a causative relationship.

In statistics, the power of a binary hypothesis test is the probability that the test correctly rejects the null hypothesis when a specific alternative hypothesis is true. It is commonly denoted by 1 − β, and represents the chance of a true positive detection conditional on the actual existence of an effect to detect. Statistical power ranges from 0 to 1, and as the power of a test increases, the probability of making a type II error by wrongly failing to reject the null hypothesis decreases.

Quantitative marketing research is the application of quantitative research techniques to the field of marketing research. It has roots in both the positivist view of the world, and the modern marketing viewpoint that marketing is an interactive process in which both the buyer and seller reach a satisfying agreement on the "four Ps" of marketing: Product, Price, Place (location) and Promotion.

Cross-validation, sometimes called rotation estimation or out-of-sample testing, is any of various similar model validation techniques for assessing how the results of a statistical analysis will generalize to an independent data set. Cross-validation includes resampling and sample splitting methods that use different portions of the data to test and train a model on different iterations. It is often used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice. It can also be used to assess the quality of a fitted model and the stability of its parameters.

Statistics, when used in a misleading fashion, can trick the casual observer into believing something other than what the data shows. That is, a misuse of statistics occurs when a statistical argument asserts a falsehood. In some cases, the misuse may be accidental. In others, it is purposeful and for the gain of the perpetrator. When the statistical reason involved is false or misapplied, this constitutes a statistical fallacy.

Sample size determination or estimation is the act of choosing the number of observations or replicates to include in a statistical sample. The sample size is an important feature of any empirical study in which the goal is to make inferences about a population from a sample. In practice, the sample size used in a study is usually determined based on the cost, time, or convenience of collecting the data, and the need for it to offer sufficient statistical power. In complex studies, different sample sizes may be allocated, such as in stratified surveys or experimental designs with multiple treatment groups. In a census, data is sought for an entire population, hence the intended sample size is equal to the population. In experimental design, where a study may be divided into different treatment groups, there may be different sample sizes for each group.

This glossary of statistics and probability is a list of definitions of terms and concepts used in the mathematical sciences of statistics and probability, their sub-disciplines, and related fields. For additional related terms, see Glossary of mathematics and Glossary of experimental design.

In online marketing, a landing page, sometimes known as a "lead capture page", "single property page", "static page", "squeeze page" or a "destination page", is a single web page that appears in response to clicking on a search engine optimized search result, marketing promotion, marketing email or an online advertisement. The landing page will usually display directed sales copy that is a logical extension of the advertisement, search result or link. Landing pages are used for lead generation. The actions that a visitor takes on a landing page are what determine an advertiser's conversion rate. A landing page may be part of a microsite or a single page within an organization's main web site.

Marketing experimentation is a research method which can be defined as "the act of conducting such an investigation or test". It is testing a market that is segmented to discover new opportunities for organisations. By controlling conditions in an experiment, organisations will record and make decisions based on consumer behaviour. Marketing experimentation is commonly used to find the best method for maximizing revenues through the acquisition of new customers. For example: two groups of customers are exposed to different advertising (test). How did consumers react to one advertisement compared to the other? (measurable). Did the advertising increase sales for each group? (result).

In statistics, resampling is the creation of new samples based on one observed sample. Resampling methods are:

  1. Permutation tests
  2. Bootstrapping
  3. Cross validation

In statistical hypothesis testing, a type I error, or a false positive, is the rejection of the null hypothesis when it is actually true. For example, an innocent person may be convicted. A type II error, or a false negative, is the failure to reject a null hypothesis that is actually false. For example: a guilty person may be not convicted.

In science, randomized experiments are the experiments that allow the greatest reliability and validity of statistical estimates of treatment effects. Randomization-based inference is especially important in experimental design and in survey sampling.

Discrimination testing is a technique employed in sensory analysis to determine whether there is a detectable difference among two or more products. The test uses a group of assessors (panellists) with a degree of training appropriate to the complexity of the test to discriminate from one product to another through one of a variety of experimental designs. Though useful, these tests typically do not quantify or describe any differences, requiring a more specifically trained panel under different study design to describe differences and assess significance of the difference.

In marketing, multivariate testing or multi-variable testing techniques apply statistical hypothesis testing on multi-variable systems, typically consumers on websites. Techniques of multivariate statistics are used.

In the design of experiments, a sample ratio mismatch (SRM) is a statistically significant difference between the expected and actual ratios of the sizes of treatment and control groups in an experiment. Sample ratio mismatches, also known as unbalanced sampling, often occur in online controlled experiments due to failures in randomization and instrumentation.

References

  1. Young, Scott W. H. (August 2014). "Improving Library User Experience with A/B Testing: Principles and Process". Weave: Journal of Library User Experience. 1 (1). doi:10.3998/weave.12535642.0001.101. hdl:2027/spo.12535642.0001.101.
  2. Kohavi, Ron; Tang, Diane; Xu, Ya (2020). Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. Cambridge University Press.
  3. Kohavi, Ron; Longbotham, Roger (2023). "Online Controlled Experiments and A/B Tests". In Phung, Dinh; Webb, Geoff; Sammut, Claude (eds.). Encyclopedia of Machine Learning and Data Science. Springer.
  4. Kohavi, Ron; Thomke, Stefan (September 2017). "The Surprising Power of Online Experiments". Harvard Business Review: 74–82.
  5. "The ABCs of A/B Testing". Pardot. 12 July 2012. Retrieved 2016-02-21.
  6. Kohavi, Ron; Longbotham, Roger (2017). "Online Controlled Experiments and A/B Testing". Encyclopedia of Machine Learning and Data Mining. pp. 922–929. doi:10.1007/978-1-4899-7687-1_891. ISBN 978-1-4899-7685-7.
  7. Xu, Ya; Chen, Nanyu; Fernandez, Addrian; Sinno, Omar; Bhasin, Anmol (10 August 2015). "From Infrastructure to Culture: A/B Testing Challenges in Large Scale Social Networks". Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 2227–2236. doi:10.1145/2783258.2788602. ISBN 9781450336642.
  8. Siroker, Dan; Koomen, Pete (2013). A/B Testing: The Most Powerful Way to Turn Clicks Into Customers. John Wiley & Sons. ISBN 978-1-118-65920-5.
  9. "Split Testing Guide for Online Stores". webics.com.au. August 27, 2012. Retrieved 2012-08-28.
  10. Kaufmann, Emilie (2014). "On the Complexity of A/B Testing" (PDF). JMLR: Workshop and Conference Proceedings. 35. arXiv:1405.3224.
  11. Christian, Brian (2012-04-25). "The A/B Test: Inside the Technology That's Changing the Rules of Business". Wired. Retrieved 2014-03-18.
  12. Christian, Brian. "Test Everything: Notes on the A/B Revolution". Wired. Retrieved 2014-03-18.
  13. Doctorow, Cory (2012-04-26). "A/B testing: the secret engine of creation and refinement for the 21st century". Boing Boing. Retrieved 2014-03-18.
  14. Krishnamoorthy, K.; Thomson, Jessica (2004). "A more powerful test for comparing two Poisson means". Journal of Statistical Planning and Inference. 119: 23–35. doi:10.1016/S0378-3758(02)00408-1.
  15. Gupta, Somit; Kohavi, Ronny; Tang, Diane; Xu, Ya; et al. (June 2019). "Top Challenges from the first Practical Online Controlled Experiments Summit". SIGKDD Explorations. 21 (1): 20–35. doi:10.1145/3331651.3331655.
  16. Stolberg, M (December 2006). "Inventing the randomized double-blind trial: the Nuremberg salt test of 1835". Journal of the Royal Society of Medicine. 99 (12): 642–643. doi:10.1177/014107680609901216. PMC 1676327. PMID 17139070.
  17. "What is A/B Testing?". Convertize. Retrieved 2020-01-28.
  18. "Claude Hopkins Turned Advertising Into A Science". Retrieved 2019-11-01.
  19. "Brief history and background for the one sample t-test". 20 June 2007.
  20. Box, Joan Fisher (1987). "Guinness, Gosset, Fisher, and Small Samples". Statistical Science. 2 (1): 45–52. doi:10.1214/ss/1177013437.
  21. "A/B Testing: The ABCs of Paid Social Media". Anyword. 2020-01-17. Retrieved 2022-04-08.
  22. Amazon.com. "The Math Behind A/B Testing". Archived from the original on 2015-09-21. Retrieved 2015-04-12.
  23. Kohavi, Ron; Longbotham, Roger; Sommerfield, Dan; Henne, Randal M. (February 2009). "Controlled experiments on the web: survey and practical guide". Data Mining and Knowledge Discovery. 18 (1): 140–181. doi:10.1007/s10618-008-0114-1.
  24. Siroker, Dan; Koomen, Pete (2013). A/B Testing: The Most Powerful Way to Turn Clicks Into Customers. John Wiley & Sons. ISBN 978-1-118-65920-5.
  25. Szucs, Sandor (2018). "Modern HTTP Routing" (PDF). Usenix.org.
  26. "Advanced A/B Testing Tactics That You Should Know". Online-behavior.com. Archived from the original on 2014-03-19. Retrieved 2014-03-18.
  27. "Eight Ways You've Misconfigured Your A/B Test". Dr. Jason Davis. 2013-09-12. Retrieved 2014-03-18.