7.1 A/B Testing

[Figure: two side-by-side lemonade stands labeled A and B, with a checkmark over the winner]

At a Glance

1–2 weeks of run time

~$100 in ad spend

In Brief

A/B testing is a controlled experiment that randomly splits your audience into two groups, shows each group a different version of a page, feature, or design element, and measures which version performs better against a predefined success metric such as signups, purchases, or click-through rate. Both versions run simultaneously under identical conditions, and the output is a statistically validated answer to the question “does this specific change improve the outcome?”

Common Use Case

Your landing page converts visitors at 2% and you believe a different headline could do better. You want to show half your visitors the current headline and the other half a new one, then measure which version gets more signups over two weeks.

Helps Answer

Which version of this page or feature converts more visitors?
What design changes increase sales or signups?
Does this change actually improve the metric we care about?
How much of a difference does a specific change make?

Description

A/B testing is similar to the experiments you did in Science 101. Remember the one where you tested various substances to see which ones support plant growth and which suppress it? You measured the plants at set intervals as they were subjected to different conditions, and in the end you tallied how much each plant had grown.

A/B testing allows you to show potential customers and users two versions of the same element and let them determine the winner. As the name implies, two versions (A and B) are compared that are identical except for one variation that might affect a user’s behavior. Version A might be the currently used version (control), while version B is modified in some respect (variation).

In online settings, such as web design (especially user-experience design), the goal is to identify changes to web pages that increase or maximize an outcome of interest. Constantly testing and optimizing your web page can increase revenue, donations, leads, registrations, downloads, and user-generated content, while providing teams with valuable insight about their visitors.

For instance, on an ecommerce website the purchase funnel is typically a good candidate for A/B testing, as even marginal improvements in drop-off rates can represent a significant gain in sales. Significant improvements can sometimes be seen through testing elements such as copy, layouts, images, and colors.

By measuring the impact that changes have on your metrics (such as signups, downloads, purchases, or whatever else your goals might be), you can check that each change you ship actually improves results instead of assuming that it does.

This differs from multivariate testing, which changes several elements of a page at the same time and tests their combinations.

The much broader family of methods referred to as “multivariate testing” or “multinomial testing” is similar to A/B testing, but may compare more than two versions at the same time, use more controls, or both. Simple A/B tests are not valid for observational, quasi-experimental, or other nonexperimental data, which is common with surveys, offline data, and other more complex phenomena.

AI-Managed Multi-Variant Testing

Modern experimentation platforms (Statsig, Eppo, Optimizely, Unbounce Smart Traffic) can run multi-variant tests where the platform itself generates variants, allocates traffic via multi-armed bandit or contextual-bandit algorithms, and auto-converges toward the variant with the highest measured outcome — without you watching the dashboard. AI can also generate the variants themselves: copy, layout permutations, even imagery. This collapses what used to be a sequence of A/B → analyze → ship → next-A/B into a single rolling experiment that tests dozens of combinations at once.

Use AI-managed multi-variant testing when you already know what you are optimizing for and the target audience is locked in. It is genuinely useful for downstream optimization — pricing-page variants for a known segment, checkout-flow micro-changes, ad-creative rotation against a defined conversion event. It is the wrong tool for discovery: see the local-optima and wrong-target pitfalls in Biases & Tips below.
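To make the bandit allocation mentioned above concrete, here is a minimal sketch of Thompson sampling, one common multi-armed bandit algorithm. It is not how any named platform is implemented. The function name, the three conversion rates, and the round count are all hypothetical and exist only to simulate traffic.

```python
import random


def thompson_sampling(true_rates, rounds, seed=0):
    """Simulate Beta-Bernoulli Thompson sampling over several variants.

    true_rates are hypothetical conversion rates used only to simulate
    visitors; a real platform never knows them in advance.
    """
    rng = random.Random(seed)
    wins = [0] * len(true_rates)    # conversions observed per variant
    losses = [0] * len(true_rates)  # non-conversions observed per variant
    pulls = [0] * len(true_rates)   # visitors sent to each variant

    for _ in range(rounds):
        # Draw a plausible conversion rate for each variant from its
        # Beta(1 + wins, 1 + losses) posterior, then serve the best draw.
        draws = [rng.betavariate(1 + w, 1 + l) for w, l in zip(wins, losses)]
        arm = draws.index(max(draws))
        pulls[arm] += 1

        # Simulate whether this visitor converts.
        if rng.random() < true_rates[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
    return pulls


pulls = thompson_sampling(true_rates=[0.02, 0.05, 0.20], rounds=5000)
print(pulls)  # most traffic should end up on the third variant
```

Notice that the algorithm only ever chases the metric it is given. That is why it suits downstream optimization and not discovery.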

For a basic A/B example, imagine a company, Acme Cables, that operates a web store selling cables. The company’s ultimate goal is to sell more cables and increase its yearly revenue, so the checkout funnel is the first place Acme’s head of marketing will focus optimization efforts.

The “Buy” button on each product page is the first element visitors interact with at the start of the checkout process. The team hypothesizes that making the button more prominent on the page would lead to more clicks and therefore more purchases. The team then makes the button red in the variation and leaves the button grey in the original. They are able to quickly set up an A/B test using an A/B testing tool that pits the two variations against each other.

As the test runs, all visitors to the Acme Cables site are bucketed into a variation. They are equally divided between the red button page and the original page.
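One common way to bucket visitors is to hash a stable visitor ID, so the same person always sees the same version. The sketch below assumes that approach. The function name, the experiment key, and the variant labels are hypothetical, and real A/B testing tools handle this step for you.

```python
import hashlib


def assign_variant(user_id, experiment, variants=("control", "red_button")):
    """Deterministically bucket a visitor into one of the variants."""
    # Including the experiment name keeps assignments independent
    # across different experiments for the same visitor.
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    bucket = int(digest[:8], 16) % len(variants)
    return variants[bucket]


# Simulate 10,000 visitors and count how many land in each version.
counts = {"control": 0, "red_button": 0}
for i in range(10_000):
    counts[assign_variant(f"user-{i}", "buy-button-color")] += 1
print(counts)
```

Because the assignment depends only on the ID and the experiment name, a returning visitor stays in the same group without any stored state.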

Once enough visitors have run through the test, the Acme team ends the test and is able to declare a winner. The results show that 4.5 percent of visitors clicked on the red “Buy” button, and 1 percent clicked on the original version. The red “Buy” button led to a significant uplift in conversion rate, so Acme then redesigns their product pages accordingly. In subsequent A/B tests, Acme will apply the insight that red buttons convert better on their site than grey buttons.
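The gap in the Acme example can be stated as an absolute lift or as a relative lift, and the two are easy to confuse when reporting results. A minimal sketch, using the 1 percent and 4.5 percent click rates from the example (the function name is illustrative):

```python
def lift(control_rate, variant_rate):
    """Return (absolute lift, relative lift) of the variant over the control."""
    absolute = variant_rate - control_rate
    relative = absolute / control_rate
    return absolute, relative


absolute, relative = lift(control_rate=0.01, variant_rate=0.045)
print(f"absolute lift: {absolute * 100:.1f} percentage points")
print(f"relative lift: {relative:.0%}")
```

Here the absolute lift is 3.5 percentage points and the relative lift is 350 percent. State which one you mean whenever you share a result.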

You can also use A/B testing when you want to test a headline but have three possible variations. In this case, running a single test and splitting your visitors (or recipients in the case of an email) into three groups instead of two is reasonable and would likely still be considered an A/B test. This is more efficient than running three separate tests (A vs. B, B vs. C, and A vs. C). With traffic split three ways, each group fills more slowly, so plan for a longer run to reach the same sample size per group.

Testing more than one thing at a time, such as a headline and call to action, is a multivariate test, which is more complicated to run. There are plenty of resources out there for multivariate testing, but we won’t be covering that when talking about A/B testing.

Once you’ve concluded the test, you should update your product and/or site with the desired content variation(s) and remove all elements of the test as soon as possible.

How to

Prep

Prep is where most A/B tests are won or lost. Decide what you are testing, the metric that defines the win, and whether you have enough traffic before writing a single line of variant code. AI experimentation platforms (Statsig, Eppo, GrowthBook) can automate sample-size math and flag confounders during this phase, but the decisions are yours.

Confirm you have enough traffic. A/B testing needs statistical significance, which needs volume. Use an A/B test sample size calculator. Rough guide: under 1,000 unique visitors per week to the page you want to test, you likely cannot run a meaningful A/B test without waiting months. If that is you, switch to qualitative methods (usability testing, customer interviews) or test bolder changes that produce bigger effect sizes. B2B SaaS teams: your biggest opportunity is usually inside the product (onboarding flows, feature adoption, pricing page), not the marketing site — feature flag tools let you test against existing users where you control exposure.
Define one hypothesis. State what you are testing, what metric will change, and by how much. Example: “Changing the onboarding flow from 5 steps to 3 will increase activation rate by at least 5 percentage points.” One variable at a time. If you want to test multiple variables together, that is multivariate testing, not A/B.
Pick the primary metric and the minimum detectable effect. Choose the single metric that matters most for this experiment — signups, activation, purchases, retention. Define the minimum detectable effect (MDE) — the smallest improvement that would justify shipping — before you look at any data. The MDE drives the sample size and stops you from chasing 0.1% wins.
Calculate required sample size and duration. Plug your baseline conversion rate, MDE, and daily traffic into a sample size calculator. Output: how many visitors per arm and how many days. Do not end the test early because it “looks significant” — peeking inflates false-positive rates.
Design the control and variant. Version A is the current experience (control). Version B changes one thing only (variant). Identical conditions otherwise. Write down what counts as “shipping the change” so the engineering work is scoped before the test, not after.
Pilot the instrumentation. Run the test on a small slice of traffic (about 1%) for a day to confirm tracking fires correctly, the split is even, and the metric records as expected. A broken pixel discovered on day 14 is a wasted test.
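The sample-size step above can be sketched with the standard normal-approximation formula for comparing two proportions. This is an approximation under stated assumptions (two-sided test, 5% significance, 80% power, equal arms), not a replacement for your tool’s calculator. The function names and the 0.5-point MDE are illustrative.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline, mde_abs, alpha=0.05, power=0.80):
    """Visitors needed in each arm for a two-sided two-proportion z-test.

    baseline is the control conversion rate; mde_abs is the minimum
    detectable effect as an absolute difference (0.005 = 0.5 points).
    """
    p1 = baseline
    p2 = baseline + mde_abs
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)


def days_to_run(n_per_arm, daily_visitors, arms=2):
    """Days needed to fill every arm at the given daily traffic."""
    return math.ceil(n_per_arm * arms / daily_visitors)


# 2% baseline, hoping to detect a lift to 2.5%, at 1,000 visitors per week.
n = sample_size_per_arm(baseline=0.02, mde_abs=0.005)
print(n, "visitors per arm")
print(days_to_run(n, daily_visitors=1000 / 7), "days")
```

With these assumed inputs the test needs well over ten thousand visitors per arm. At 1,000 visitors a week that means more than six months of run time, which is why low-traffic pages call for bolder changes or qualitative methods.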

Execution

Set up the test in your experimentation tool. Use a feature flag or A/B testing platform to randomly split users into control and variant. Verify the split is even within the first hour of traffic. Confirm the control and variant run simultaneously under identical conditions — same time of day, same channels, same audience.
Run for the full pre-calculated duration. Do not peek and stop early when results “look significant.” Run until you hit the sample size you calculated in Prep, or use a tool with sequential testing that mathematically accounts for multiple looks.
Watch for breakage, not winners. While the test runs, check for instrumentation drift, pipeline outages, or external events (a viral post, an outage, a concurrent campaign) that could contaminate results. Pause and restart if contamination is severe — do not try to “correct” the data after the fact.
End the test cleanly. When the duration completes, freeze the data, snapshot the assignment logs, and remove the test infrastructure from the live experience as soon as the decision is made. Lingering test code is a future bug.
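Checking that the split is even can be done with a sample ratio mismatch check, which is a chi-square goodness-of-fit test on the assignment counts. The sketch below assumes a planned 50/50 split. The function name and the visitor counts are hypothetical, and the 0.001 alert threshold is a common convention, not a rule from this chapter.

```python
import math


def srm_p_value(control_n, variant_n, expected_share=0.5):
    """P-value for the observed split versus the planned split."""
    total = control_n + variant_n
    expected_control = total * expected_share
    expected_variant = total * (1 - expected_share)
    chi2 = (
        (control_n - expected_control) ** 2 / expected_control
        + (variant_n - expected_variant) ** 2 / expected_variant
    )
    # With one degree of freedom, the chi-square tail probability
    # reduces to erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


for control_n, variant_n in [(5000, 5050), (5000, 5400)]:
    p = srm_p_value(control_n, variant_n)
    status = "investigate assignment" if p < 0.001 else "split looks fine"
    print(control_n, variant_n, f"p={p:.4f}", status)
```

A very small p-value means the imbalance is unlikely to be chance, so look for a bug in assignment or tracking before trusting any result.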

Analysis

Check statistical significance. Most tools report a p-value or confidence interval. The standard threshold is p < 0.05 (95% confidence), but that is a convention, not a law. If your decision is reversible and cheap, a lower bar is fine. If you are betting the company on the result, demand more.
Check practical significance. Statistical significance only tells you the effect is probably not zero. Practical significance asks whether the effect is big enough to matter. A 0.1% lift that is statistically significant on millions of users may not be worth the engineering investment to ship and maintain.
Segment the results. Aggregate results can hide that one segment loved the variant and another hated it. Slice by acquisition channel, device, new versus returning, and any segment where you expect different behavior. Treat segment effects as hypotheses for the next test, not conclusions.
Decide: ship, kill, or iterate. If the variant wins on both bars (statistical and practical), ship it and document the learning. If it loses, kill it and document why you thought it would work. If results are inconclusive, decide whether the next move is a bolder change, more traffic, or a different method — not a longer run of the same test.
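The two bars above, statistical and practical, can be sketched as a two-proportion z-test followed by a comparison against the MDE. The visitor and conversion counts are hypothetical. The mapping to ship, kill, or iterate is one reading of the rule above, and a significant lift smaller than the MDE is treated here as “iterate”, which is an assumption.

```python
import math
from statistics import NormalDist


def two_proportion_test(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Two-sided z-test for the difference in conversion rate (B minus A)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a
    # Pooled standard error under the null hypothesis of no difference.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    p_value = 2 * (1 - NormalDist().cdf(abs(diff / se_pool)))
    # Unpooled standard error for the confidence interval on the difference.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    margin = NormalDist().inv_cdf(1 - alpha / 2) * se
    return {"diff": diff, "p_value": p_value, "ci": (diff - margin, diff + margin)}


def decide(result, mde_abs, alpha=0.05):
    """Map a test result to ship, kill, or iterate."""
    significant = result["p_value"] < alpha
    if significant and result["diff"] >= mde_abs:
        return "ship"      # clears both the statistical and practical bar
    if significant and result["diff"] < 0:
        return "kill"      # the variant is reliably worse
    return "iterate"       # inconclusive, or real but too small to matter


# Hypothetical counts: control converts 200 of 10,000, variant 260 of 10,000.
result = two_proportion_test(conv_a=200, n_a=10_000, conv_b=260, n_b=10_000)
print(result)
print(decide(result, mde_abs=0.005))
```

The confidence interval is often more useful than the p-value alone, because it shows the range of lifts the data supports.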

Biases & Tips

Peeking bias: Looking at results before the planned sample size is reached and stopping when they “look significant” inflates false-positive rates dramatically. Pre-commit to the sample size from Prep, or use a tool with sequential testing that accounts for multiple looks.
Novelty effect: A new variant often outperforms the control simply because it is new and users notice it. The lift often fades after 2–4 weeks. For changes that will live in production long-term, run the test long enough to see the novelty wear off, or follow up with a holdout cohort.
Segment imbalance: If the random split is uneven across important segments (gender, plan tier, device, geography), the comparison is contaminated. Verify the split is balanced on key attributes within the first hour, and rerandomize if it is not.
Multiple comparisons: With 20 independent tests each judged at p < 0.05, the chance of at least one false positive is about 64%, so celebrating the one that “wins” is little better than flipping coins. Use a Bonferroni or false-discovery-rate correction when running many tests, or focus on one test at a time.
Survivorship bias in results: The users who completed the funnel are not a random sample of users who entered it. Drop-off itself can vary between control and variant. Measure the full funnel, not just the final conversion.
Confirmation bias in analysis: If you want the variant to win, you will find a segment where it does. Pre-register the primary metric and the segments you will analyze before the test ends. Treat post-hoc segments as hypotheses, not findings.
Wrong-method bias: A/B testing fails for new features users have never seen (the novelty effect dominates), changes too large to isolate one variable, or “missing thing” tests where the control lacks something the variant has. Use it for incremental optimization, not for green-field design questions.
Local-optima trap (AI-managed multi-variant): Auto-convergent platforms optimize hard against the metric you give them. They do not propose bolder reframings or cross-segment alternatives. After a few rounds of converging variants, you can plateau on a winner that beats the control but sits well below what a different framing or a different segment would reach. Periodically rerun against a deliberately distant variant to check whether you have flatlined.
Wrong-target optimization (AI during discovery): AI-managed multi-variant tools faithfully optimize for whichever signal converts now. They do not know whether the signups, clicks, or conversions are coming from the audience you actually want to serve, or whether the pain point your variants happen to address is the one you intended to validate. Run them during discovery, when the value proposition or target customer is not yet locked in, and the platform will happily converge on a value proposition the founder is not actually interested in. Lock the segment and value proposition first (via Customer Discovery Interviews and Landing Page Test) before letting AI take the wheel.
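The multiple-comparisons pitfall above can be made concrete with a short sketch of the family-wise error rate and the two corrections it names. The p-values are hypothetical, the function names are illustrative, and the error-rate formula assumes the tests are independent.

```python
def family_wise_error_rate(num_tests, alpha=0.05):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** num_tests


def bonferroni(p_values, alpha=0.05):
    """Reject only p-values below alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Control the false discovery rate with the Benjamini-Hochberg step-up rule."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value is at most (k / m) * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected


print(f"{family_wise_error_rate(20):.0%} chance of a false win across 20 tests")

p_values = [0.001, 0.012, 0.025, 0.038, 0.20]  # hypothetical results
print(bonferroni(p_values))
print(benjamini_hochberg(p_values))
```

Bonferroni is the stricter of the two. Benjamini-Hochberg accepts more wins in exchange for tolerating a small share of false discoveries among them.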

Next Steps

If the variant wins, ship it and document the learning.
If results are inconclusive, check if your sample size was sufficient or if you need to test a bolder change.
Analyze results by segment — the overall result may mask that one segment loved it and another hated it.
Use the winning insight to generate the next experiment hypothesis.
Set up Dashboards to monitor the long-term impact of winning variants and catch any regression over time.
Use Scorecards to prioritize which elements to A/B test next when you have more hypotheses than traffic.

Learn more

Case Studies

Majestic Wine: Wedding-page redesign

A redesign of the UK wine retailer’s wedding-services category page — clearer headings, prominent CTAs, less clutter — lifted enquiry-form submissions by 201% at 99.99% statistical significance. A follow-up that removed a competing PDF download button added another 36.9%.

Six Pack Abs: Pricing test

Carl Juneau tested $19.95 against $29.95 on a workout product checkout page. Conversion rates were statistically equivalent (1.1% vs 1.0% across roughly 1,200–1,400 visitors per arm), and the higher price generated 61.67% more revenue.

Airbnb: Three experiment pitfalls

The engineering team warns against stopping early on a p-value that crosses 0.05 in week one, judging on a single aggregate metric (a search redesign looked neutral overall but was broken in Internet Explorer, a 2% loss), and trusting the assignment system without running A/A dummy tests.

Grab: Chat to reduce booking cancellations

Grab tested in-app chat between drivers and riders as a way to drive down booking cancellations, finding that prompting an early automated GrabChat message cut post-allocation cancellations by up to 2 percentage points.

Coinbase: Restoring trust in experimentation

Coinbase’s in-house platform (CIFER) lacked sequential confidence intervals and CUPED, and a recommendation system it had marked a success turned out to have no real impact when re-tested. After moving to Eppo, the team reported 40% faster experiment analysis.

Notion: Scaling experiments with Statsig

Notion moved from single-digit experiments per quarter on an in-house tool to 300+ per quarter on Statsig, released 600+ features behind flags, and reported a 6% lift in activation rate.

Statsig: Experimentation in the age of AI

Statsig argues that as AI cheapens the build phase, measurement matters more: offline evals feed production experiments, AI-generated code needs metric-linked logging, and every engineer becomes a growth engineer testing ~10x more concepts. Customers cited include OpenAI, Figma, Notion, Grammarly, Atlassian, and Brex.

Booking.com: Homepage-redesign experiment

In December 2017 Booking.com’s design director proposed testing an entirely new homepage layout that stripped the interface down to destination, dates, party size, and three options (accommodations, flights, rental cars) — discarding years of optimized content as a test of the company’s experimentation culture.

Booking.com: 25,000 tests per year

Booking.com runs more than 25,000 A/B tests annually (about 70 per day) and has grown roughly twice as fast as the S&P 500 over more than a decade, the canonical case for experimentation as a competitive moat.

The Real Startup Book
