A/B testing
   HOME

TheInfoList



OR:

A/B testing (also known as bucket testing, split-run testing, or split testing) is a user experience research
methodology In its most common sense, methodology is the study of research methods. However, the term can also refer to the methods themselves or to the philosophical discussion of associated background assumptions. A method is a structured procedure for br ...
. A/B tests consist of a
randomized experiment In science, randomized experiments are the experiments that allow the greatest reliability and validity of statistical estimates of treatment effects. Randomization-based inference is especially important in experimental design and in survey samp ...
that usually involves two variants (A and B), although the concept can be also extended to multiple variants of the same variable. It includes application of statistical hypothesis testing or " two-sample hypothesis testing" as used in the field of statistics. A/B testing is a way to compare multiple versions of a single
variable Variable may refer to: * Variable (computer science), a symbolic name associated with a value and whose associated value may be changed * Variable (mathematics), a symbol that represents a quantity in a mathematical expression, as used in many ...
, for example by testing a subject's response to variant A against variant B, and determining which of the variants is more effective.


Overview

"A/B testing" is a shorthand for a simple randomized controlled experiment, in which a number of samples (e.g. A and B) of a single vector-variable are compared. These values are similar except for one variation which might affect a user's behavior. A/B tests are widely considered the simplest form of controlled experiment, especially when they only involve two variants. However, by adding more variants to the test, its complexity grows. A/B tests are useful for understanding
user engagement Customer engagement is an interaction between an external consumer/customer (either B2C or B2B) and an organization (company or brand) through various online or offline channels. According to Hollebeek, Srivastava and Chen's (2019, p. 166) S-D l ...
and satisfaction of online features like a new feature or product. Large
social media Social media are interactive media technologies that facilitate the creation and sharing of information, ideas, interests, and other forms of expression through virtual communities and networks. While challenges to the definition of ''social medi ...
sites like
LinkedIn LinkedIn () is an American business and employment-oriented online service that operates via websites and mobile apps. Launched on May 5, 2003, the platform is primarily used for professional networking and career development, and allows job se ...
,
Facebook Facebook is an online social media and social networking service owned by American company Meta Platforms. Founded in 2004 by Mark Zuckerberg with fellow Harvard College students and roommates Eduardo Saverin, Andrew McCollum, Dustin Mosk ...
, and Instagram use A/B testing to make user experiences more successful and as a way to streamline their services. Today, A/B tests are being used also for conducting complex experiments on subjects such as
network effect In economics, a network effect (also called network externality or demand-side economies of scale) is the phenomenon by which the value or utility a user derives from a good or service depends on the number of users of compatible products. Net ...
s when users are offline, how online services affect user actions, and how users influence one another. A/B testing is used by data engineers, marketers, designers, software engineers, and entrepreneurs, among others. Many positions rely on the data from A/B tests, as they allow companies to understand growth, increase revenue, and optimize customer satisfaction. Version A might be a version used at present (thus forming the control group), while version B is modified in some respect vs. A (thus forming the treatment group). For instance, on an e-commerce website the
purchase funnel The purchase funnel, or purchasing funnel, is a consumer-focused marketing model that illustrates the theoretical customer journey toward the purchase of a good or service. In 1898, E. St. Elmo Lewis developed a model that mapped a theoretical ...
is typically a good candidate for A/B testing, since even marginal-decreases in drop-off rates can represent a significant gain in sales. Significant improvements can be sometimes seen through testing elements like copy text, layouts, images and colors, but not always. In these tests, users only see one of two versions, since the goal is to discover which of the two versions is preferable. Multivariate testing or multinomial testing is similar to A/B testing, but may test more than two versions at the same time or use more controls. Simple A/B tests are not valid for
observational Observation is the active acquisition of information from a primary source. In living beings, observation employs the senses. In science, observation can also involve the perception and recording of data (information), data via the use of scienti ...
,
quasi-experimental A quasi-experiment is an empirical interventional study used to estimate the causal impact of an intervention on target population without random assignment. Quasi-experimental research shares similarities with the traditional experimental design ...
or other
non-experimental In fields such as epidemiology, social sciences, psychology and statistics, an observational study draws inferences from a sample to a population where the independent variable is not under the control of the researcher because of ethical con ...
situations - commonplace with survey data, offline data, and other, more complex phenomena. A/B testing is claimed by some to be a change in philosophy and business-strategy in certain niches, though the approach is identical to a between-subjects design, which is commonly used in a variety of research traditions. A/B testing as a philosophy of web development brings the field into line with a broader movement toward
evidence-based practice Evidence-based practice (EBP) is the idea that occupational practices ought to be based on scientific evidence. While seemingly obviously desirable, the proposal has been controversial, with some arguing that results may not specialize to indiv ...
. The benefits of A/B testing are considered to be that it can be performed continuously on almost anything, especially since most marketing automation software now typically comes with the ability to run A/B tests on an ongoing basis.


Common test statistics

"Two-sample hypothesis tests" are appropriate for comparing the two samples where the samples are divided by the two control cases in the experiment.
Z-test A ''Z''-test is any statistical test for which the distribution of the test statistic under the null hypothesis can be approximated by a normal distribution. Z-tests test the mean of a distribution. For each significance level in the confide ...
s are appropriate for comparing means under stringent conditions regarding normality and a known standard deviation. Student's t-tests are appropriate for comparing means under relaxed conditions when less is assumed.
Welch's t test In statistics, Welch's ''t''-test, or unequal variances ''t''-test, is a two-sample location test which is used to test the hypothesis that two populations have equal means. It is named for its creator, Bernard Lewis Welch, is an adaptation of ...
assumes the least and is therefore the most commonly used test in a two-sample
hypothesis test A statistical hypothesis test is a method of statistical inference used to decide whether the data at hand sufficiently support a particular hypothesis. Hypothesis testing allows us to make probabilistic statements about population parameters. ...
where the mean of a metric is to be optimized. While the
mean There are several kinds of mean in mathematics, especially in statistics. Each mean serves to summarize a given group of data, often to better understand the overall value (magnitude and sign) of a given data set. For a data set, the '' ari ...
of the variable to be optimized is the most common choice of
estimator In statistics, an estimator is a rule for calculating an estimate of a given quantity based on observed data: thus the rule (the estimator), the quantity of interest (the estimand) and its result (the estimate) are distinguished. For example, the ...
, others are regularly used. For a comparison of two binomial distributions such as a
click-through rate Click-through rate (CTR) is the ratio of users who click on a specific link to the number of total users who view a page, email, or advertisement. It is commonly used to measure the success of an online advertising campaign for a particular we ...
one would use
Fisher's exact test Fisher's exact test is a statistical significance test used in the analysis of contingency tables. Although in practice it is employed when sample sizes are small, it is valid for all sample sizes. It is named after its inventor, Ronald Fisher, a ...
.


Challenges

When conducting A/B testing, the user should evaluate the pros and cons of it to see if it aligns best with the results that they’re hoping for. Pros: Through A/B testing, it’s easy to get a clear idea of what users prefer, since it’s directly testing one thing over the other. It’s based on real user behavior so the data can be very helpful especially when determining what works better between two options. In addition, it can also provide answers to very specific design questions. One example of this is Google's A/B testing with hyperlink colors. In order to optimize revenue, they tested dozens of different hyperlink hues to see which color the users tend to click more on. Cons: However, there are a couple of cons to A/B testing. Like mentioned above, A/B testing is good for specific design questions but it can also be a downside since it’s mostly only good for specific design problems with very measurable outcomes. It could also be a very costly and timely process. Depending on the size of the company and/or team, there could be a lot of meetings and discussions about what exactly to test and what the impact of the A/B test is. If there’s not a significant impact, it could end up as a waste of time and resources. In December 2018, representatives with experience in large-scale A/B testing from thirteen different organizations (
Airbnb Airbnb, Inc. ( ), based in San Francisco, California, operates an online marketplace focused on short-term homestays and experiences. The company acts as a broker and charges a commission from each booking. The company was founded in 2008 b ...
,
Amazon Amazon most often refers to: * Amazons, a tribe of female warriors in Greek mythology * Amazon rainforest, a rainforest covering most of the Amazon basin * Amazon River, in South America * Amazon (company), an American multinational technolog ...
,
Booking.com Booking.com, headquartered in Amsterdam, is one of the largest online travel agencies. It is a subsidiary of Booking Holdings. History In 1996, Geert-Jan Bruinsma, a student at Universiteit Twente, founded Bookings.nl. In 2000, Booking.com w ...
,
Facebook Facebook is an online social media and social networking service owned by American company Meta Platforms. Founded in 2004 by Mark Zuckerberg with fellow Harvard College students and roommates Eduardo Saverin, Andrew McCollum, Dustin Mosk ...
,
Google Google LLC () is an American Multinational corporation, multinational technology company focusing on Search Engine, search engine technology, online advertising, cloud computing, software, computer software, quantum computing, e-commerce, ar ...
,
LinkedIn LinkedIn () is an American business and employment-oriented online service that operates via websites and mobile apps. Launched on May 5, 2003, the platform is primarily used for professional networking and career development, and allows job se ...
,
Lyft Lyft, Inc. offers mobility as a service, ride-hailing, vehicles for hire, motorized scooters, a bicycle-sharing system, rental cars, and food delivery in the United States and select cities in Canada. Lyft sets fares, which vary using a dyn ...
,
Microsoft Microsoft Corporation is an American multinational technology corporation producing computer software, consumer electronics, personal computers, and related services headquartered at the Microsoft Redmond campus located in Redmond, Washin ...
,
Netflix Netflix, Inc. is an American subscription video on-demand over-the-top streaming service and production company based in Los Gatos, California. Founded in 1997 by Reed Hastings and Marc Randolph in Scotts Valley, California, it offers a fi ...
,
Twitter Twitter is an online social media and social networking service owned and operated by American company Twitter, Inc., on which users post and interact with 280-character-long messages known as "tweets". Registered users can post, like, and ...
,
Uber Uber Technologies, Inc. (Uber), based in San Francisco, provides mobility as a service, ride-hailing (allowing users to book a car and driver to transport them in a way similar to a taxi), food delivery (Uber Eats and Postmates), packa ...
, and Stanford University) attended a summit and summarized the top challenges in a SIGKDD Explorations paper. The challenges can be grouped into four areas: Analysis, Engineering and Culture, Deviations from Traditional A/B tests, and Data quality.


History

It is difficult to definitively establish when A/B testing was first used. The first randomized double-blind trial, to assess the effectiveness of a homeopathic drug, occurred in 1835. Experimentation with advertising campaigns, which has been compared to modern A/B testing, began in the early twentieth century. The advertising pioneer
Claude Hopkins Claude Driskett Hopkins (August 24, 1903 – February 19, 1984) was an American jazz stride pianist and bandleader. Biography Claude Hopkins was born in Alexandria, Virginia, United States. Historians differ in respect of the actual date of his ...
used promotional coupons to test the effectiveness of his campaigns. However, this process, which Hopkins described in his Scientific Advertising, did not incorporate concepts such as statistical significance and the
null hypothesis In scientific research, the null hypothesis (often denoted ''H''0) is the claim that no difference or relationship exists between two sets of data or variables being analyzed. The null hypothesis is that any experimentally observed difference is d ...
, which are used in statistical hypothesis testing. Modern statistical methods for assessing the significance of sample data were developed separately in the same period. This work was done in 1908 by
William Sealy Gosset William Sealy Gosset (13 June 1876 – 16 October 1937) was an English statistician, chemist and brewer who served as Head Brewer of Guinness and Head Experimental Brewer of Guinness and was a pioneer of modern statistics. He pioneered small s ...
when he altered the
Z-test A ''Z''-test is any statistical test for which the distribution of the test statistic under the null hypothesis can be approximated by a normal distribution. Z-tests test the mean of a distribution. For each significance level in the confide ...
to create Student's t-test. With the growth of the internet, new ways to sample populations have become available. Google engineers ran their first A/B test in the year 2000 in an attempt to determine what the optimum number of results to display on its search engine results page would be. The first test was unsuccessful due to glitches that resulted from slow loading times. Later A/B testing research would be more advanced, but the foundation and underlying principles generally remain the same, and in 2011, 11 years after Google's first test, Google ran over 7,000 different A/B tests. In 2012, a
Microsoft Microsoft Corporation is an American multinational technology corporation producing computer software, consumer electronics, personal computers, and related services headquartered at the Microsoft Redmond campus located in Redmond, Washin ...
employee working on the search engine
Microsoft Bing Microsoft Bing (commonly known as Bing) is a web search engine owned and operated by Microsoft. The service has its origins in Microsoft's previous search engines: MSN Search, Windows Live Search and later Live Search. Bing provides a variety ...
created an experiment to test different ways of displaying advertising headlines. Within hours, the alternative format produced a revenue increase of 12% with no impact on user-experience metrics. Today, companies like Microsoft and Google each conduct over 10,000 A/B tests annually. Many companies now use the "designed experiment" approach to making marketing decisions, with the expectation that relevant sample results can improve positive conversion results. It is an increasingly common practice as the tools and expertise grow in this area.


Examples


Email marketing Email marketing is the act of sending a commercial message, typically to a group of people, using email. In its broadest sense, every email sent to a potential or current customer could be considered email marketing. It involves using email to ...

A company with a customer
database In computing, a database is an organized collection of data stored and accessed electronically. Small databases can be stored on a file system, while large databases are hosted on computer clusters or cloud storage. The design of databases s ...
of 2,000 people decides to create an email campaign with a discount code in order to generate sales through its website. It creates two versions of the email with different call to action (the part of the copy which encourages customers to do something — in the case of a sales campaign, make a purchase) and identifying promotional code. * To 1,000 people it sends the email with the call to action stating, "Offer ends this Saturday! Use code A1", * and to another 1,000 people it sends the email with the call to action stating, "Offer ends soon! Use code B1". All other elements of the emails' copy and layout are identical. The company then monitors which campaign has the higher success rate by analyzing the use of the promotional codes. The email using the code A1 has a 5% response rate (50 of the 1,000 people emailed used the code to buy a product), and the email using the code B1 has a 3% response rate (30 of the recipients used the code to buy a product). The company therefore determines that in this instance, the first Call To Action is more effective and will use it in future sales. A more nuanced approach would involve applying statistical testing to determine if the differences in response rates between A1 and B1 were statistically significant (that is, highly likely that the differences are real, repeatable, and not due to random chance). In the example above, the purpose of the test is to determine which is the more effective way to encourage customers to make a purchase. If, however, the aim of the test had been to see which email would generate the higher click-rate – that is, the number of people who actually click onto the website after receiving the email – then the results might have been different. For example, even though more of the customers receiving the code B1 accessed the website, because the Call To Action didn't state the end-date of the promotion many of them may feel no urgency to make an immediate purchase. Consequently, if the purpose of the test had been simply to see which email would bring more traffic to the website, then the email containing code B1 might well have been more successful. An A/B test should have a defined outcome that is measurable such as number of sales made, click-rate conversion, or number of people signing up/registering.


A/B testing for product pricing

A/B testing can be used to determine the right price for the product, as this is perhaps one of the most difficult tasks when a new product or service is launched. A/B testing (especially valid for digital goods) is an excellent way to find out which price-point and offering maximize the total revenue.


Political A/B testing

A/B tests have also been used by
political campaign A political campaign is an organized effort which seeks to influence the decision making progress within a specific group. In democracies, political campaigns often refer to electoral campaigns, by which representatives are chosen or referend ...
s. In 2007, Barack Obama's presidential campaign used A/B testing as a way to garner online attraction and understand what voters wanted to see from the presidential candidate. For example, Obama's team tested four distinct buttons on their website that led users to sign up for newsletters. Additionally, the team used six different accompanying images to draw in users. Through A/B testing, staffers were able to determine how to effectively draw in voters and garner additional interest.


HTTP Routing and API feature testing

A/B testing is very common when deploying a newer version of an API. For real-time user experience testing, an
HTTP The Hypertext Transfer Protocol (HTTP) is an application layer protocol in the Internet protocol suite model for distributed, collaborative, hypermedia information systems. HTTP is the foundation of data communication for the World Wide We ...
Layer-7
Reverse proxy In computer networks, a reverse proxy is the application that sits in front of back-end applications and forwards client (e.g. browser) requests to those applications. Reverse proxies help increase scalability, performance, resilience and securi ...
is configured in such a way that, ''N''% of the HTTP
traffic Traffic comprises pedestrians, vehicles, ridden or herded animals, trains, and other conveyances that use public ways (roads) for travel and transportation. Traffic laws govern and regulate traffic, while rules of the road include traffi ...
goes into the newer version of the backend instance, while the remaining ''100-N''% of HTTP traffic hits the (stable) older version of the backend HTTP application service. This is usually done for
limiting In electronics, a limiter is a circuit that allows signals below a specified input power or level to pass unaffected while attenuating (lowering) the peaks of stronger signals that exceed this threshold. Limiting is a type of dynamic range comp ...
the exposure of customers to a newer backend instance such that, if there is a bug on the newer version, only ''N''% of the total user agents or clients get affected while others get routed to a stable backend, which is a common ingress control mechanism.


Segmentation and targeting

A/B tests most commonly apply the same variant (e.g., user interface element) with equal probability to all users. However, in some circumstances, responses to variants may be heterogeneous. That is, while a variant A might have a higher response rate overall, variant B may have an even higher response rate within a specific segment of the customer base. For instance, in the above example, the breakdown of the response rates by gender could have been: In this case, we can see that while variant A had a higher response rate overall, variant B actually had a higher response rate with men. As a result, the company might select a segmented strategy as a result of the A/B test, sending variant B to men and variant A to women in the future. In this example, a segmented strategy would yield an increase in expected response rates from 5\% = \frac to 6.5\% = \frac – constituting a 30% increase. If segmented results are expected from the A/B test, the test should be properly designed at the outset to be evenly distributed across key customer attributes, such as gender. That is, the test should both (a) contain a representative sample of men vs. women, and (b) assign men and women randomly to each “variant” (variant A vs. variant B). Failure to do so could lead to experiment
bias Bias is a disproportionate weight ''in favor of'' or ''against'' an idea or thing, usually in a way that is closed-minded, prejudicial, or unfair. Biases can be innate or learned. People may develop biases for or against an individual, a group ...
and inaccurate conclusions to be drawn from the test. This segmentation and targeting approach can be further generalized to include multiple customer attributes rather than a single customer attribute – for example, customers' age ''and'' gender – to identify more nuanced patterns that may exist in the test results.


See also

*
Adaptive control Adaptive control is the control method used by a controller which must adapt to a controlled system with parameters which vary, or are initially uncertain. For example, as an aircraft flies, its mass will slowly decrease as a result of fuel consumpt ...
* Choice modelling *
Multi-armed bandit In probability theory and machine learning, the multi-armed bandit problem (sometimes called the ''K''- or ''N''-armed bandit problem) is a problem in which a fixed limited set of resources must be allocated between competing (alternative) choices ...
* Multivariate testing *
Randomized controlled trial A randomized controlled trial (or randomized control trial; RCT) is a form of scientific experiment used to control factors not under direct experimental control. Examples of RCTs are clinical trials that compare the effects of drugs, surgical te ...
* Scientific control *
Test statistic A test statistic is a statistic (a quantity derived from the sample) used in statistical hypothesis testing.Berger, R. L.; Casella, G. (2001). ''Statistical Inference'', Duxbury Press, Second Edition (p.374) A hypothesis test is typically specifi ...


References

{{DEFAULTSORT:A B testing Market research Experiments Software testing