7.1 A/B Testing

At a Glance

Time: ~1–4 weeks. Drafting the two versions, wiring up the tracking, and running the math are all quick. The time you actually wait is the run itself: the test has to stay live until enough participants pass through both versions to trust the result. At moderate traffic that is one to two weeks, longer for low-traffic pages, and shorter if you pay to drive more participants through faster.
Cost: $0–$1.4K. Testing tools are free or built into your platform, and you can draft the two versions and run the math yourself. The only out-of-pocket cost is traffic: if your unpaid visitor volume is thin, a modest ad spend pushes enough participants through the test to get a trustworthy result in a reasonable window. If you already have enough traffic, the test costs nothing.

Other names Split Test · Split Testing

In Brief

A/B testing randomly splits your participants into two groups, shows each group a different version of a product, feature, or design element, and measures which version performs better against a metric you choose in advance — signups, purchases, or the share of participants who use a feature. Both versions run at the same time under the same conditions, so the only difference between the two groups is the change you are testing.

Common Use Case

Your landing page converts visitors at 2% and you believe a different headline could do better. You want to show half your visitors the current headline and the other half a new one, then measure which version gets more signups over two weeks.

Helps Answer

  • Which version of this page or feature converts more participants?
  • What design changes increase sales or signups?
  • Does this change actually improve the metric we care about?
  • How much of a difference does a specific change make?

Description

A/B testing shows two versions of the same element to your participants and lets their behavior decide which one wins. As the name suggests, you compare two versions (A and B) that are identical except for one change that might affect what participants do. Version A is usually the current version — the control, the baseline you measure against. Version B changes one thing — the variant. You split incoming participants randomly between the two and compare a single number for each: the share who take the action you care about, such as the percentage who sign up or buy.

The method is most often used on web pages, where the goal is to find changes that increase an outcome you care about: revenue, signups, downloads, leads, donations. Testing copy, layout, images, or colors can move those numbers, and even a small improvement on a high-traffic page can add up to a meaningful gain.

A purchase flow — the sequence of pages a visitor moves through from product page to completed order — is a common place to test, because reducing the share of people who drop out at each step turns directly into more sales. Whatever the metric, A/B testing measures whether a specific change actually moved it rather than leaving you to guess.

A/B testing changes one thing at a time. Testing several changes together (for example a new headline and a new button at once) is multivariate testing, a related but more involved method that needs more traffic to untangle which change did what. A/B testing also assumes you control the conditions and the random split; it does not apply to data you merely observe after the fact, such as survey responses or historical records, where you cannot assign people to versions.

Imagine a company, Acme Cables, that operates a web store selling cables. Its ultimate goal is to sell more cables and increase yearly revenue, so the checkout funnel is the first place Acme’s head of marketing focuses optimization efforts.

The “Buy” button on each product page is the first element visitors interact with at the start of the checkout process. The team hypothesizes that making the button more prominent on the page would lead to more clicks and therefore more purchases, so it makes the button red in the variant and leaves it grey in the control. An A/B testing tool lets the team quickly set up a test that pits the two versions against each other.

As the test runs, every visitor to the Acme Cables site is randomly bucketed into one of the two versions, so traffic divides roughly evenly between the red button page and the original page.
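
A/B testing tools handle the bucketing for you, but the idea is simple enough to sketch. The Python below is a minimal illustration, not how any particular tool does it: the function name, the experiment name, and the visitor ID format are all made up for the example. It hashes each visitor ID so the same visitor always sees the same version and roughly half of all visitors land in each.

```python
import hashlib


def assign_variant(visitor_id: str, experiment: str = "buy-button-color") -> str:
    """Deterministically bucket a visitor into 'A' (control) or 'B' (variant).

    The experiment name and visitor ID format are hypothetical.
    """
    # Hash the experiment name together with the visitor ID, so the same
    # visitor can fall into different buckets in different experiments.
    key = f"{experiment}:{visitor_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Map the first 8 hex digits onto 0..99 and split at 50 for a 50/50 test.
    bucket = int(digest[:8], 16) % 100
    return "A" if bucket < 50 else "B"


# A returning visitor gets the same answer on every page load.
print(assign_variant("visitor-1001"), assign_variant("visitor-1001"))
```

Keying the split on a stable ID rather than flipping a fresh coin on each page view keeps one person from seeing both button colors.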

Once enough visitors have run through the test, the Acme team ends it and declares a winner. The results show that 4.5 percent of visitors clicked the red “Buy” button, against 1 percent for the original — a large enough gap, on enough visitors, to trust. Acme redesigns its product pages accordingly, and carries the lesson that red buttons convert better than grey on its site into later tests.
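
The Acme example gives the two click rates but not the visitor counts, so the sketch below assumes 2,000 visitors per version purely for illustration. It estimates the gap between the two rates and a 95% interval around it using a normal approximation; the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def difference_with_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Return (difference, lower bound, upper bound) for rate B minus rate A."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    diff = rate_b - rate_a
    # Normal-approximation standard error of the difference in two rates.
    se = sqrt(rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return diff, diff - z * se, diff + z * se


# Assumed counts: 20 of 2,000 clicked grey (1%), 90 of 2,000 clicked red (4.5%).
diff, low, high = difference_with_interval(20, 2000, 90, 2000)
print(f"difference: {diff:.3f}, interval: {low:.3f} to {high:.3f}")
```

With these assumed counts the whole interval sits above zero, which is what "a large enough gap, on enough visitors" looks like in numbers. With far fewer visitors the same two rates would give a much wider interval.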

You can also test more than two versions of the same one thing, say three possible headlines. Splitting your participants (or, for an email, your recipients) into three groups instead of two is still the same method, and it is more efficient than running three separate two-way tests. Plan for a longer run, though: each group now receives a third of the traffic instead of half, so gathering the same number of results per group takes about one and a half times as long.

Some AI experimentation platforms take this further, generating the versions themselves — copy, layout, even imagery — and automatically steering traffic toward whichever is winning instead of waiting for you to read a dashboard and pick. That suits high-volume optimization of a fixed goal and audience (pricing-page versions for a known segment, checkout tweaks, rotating ad creative), but it is the wrong tool for discovery: it optimizes hard against whatever metric you give it without questioning whether the metric or audience is right (see the local-optima and wrong-target pitfalls in Biases & Tips).
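
The passage above does not say how these platforms steer traffic. One well-known technique for this kind of automatic allocation is a multi-armed bandit, and the sketch below uses Thompson sampling as an illustrative stand-in, not a description of any vendor's algorithm. The conversion rates, function names, and visitor counts are invented for the simulation.

```python
import random


def thompson_pick(successes, failures, rng):
    """Pick the version whose sampled conversion rate is highest."""
    # Each version's rate is modeled as Beta(successes + 1, failures + 1).
    samples = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return samples.index(max(samples))


def simulate(true_rates, visitors, seed=0):
    """Return how many simulated visitors were shown each version."""
    rng = random.Random(seed)
    successes = [0] * len(true_rates)
    failures = [0] * len(true_rates)
    for _ in range(visitors):
        arm = thompson_pick(successes, failures, rng)
        # Simulate whether this visitor converts on the chosen version.
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return [s + f for s, f in zip(successes, failures)]


# Hypothetical true rates: version 0 converts at 2%, version 1 at 6%.
shown = simulate([0.02, 0.06], visitors=10_000)
print("visitors shown each version:", shown)
```

In a run like this most visitors end up on the higher-converting version, which is the appeal. The trade-off is that the split is no longer even, so the fixed-sample analysis used for a standard A/B test may not apply directly.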

How to

Prep

Decide what you are testing, the metric that defines the win, and whether you have enough traffic before writing any variant code. AI tools can do the sample-size math and flag confounders for you, but the decisions are yours.

  1. Confirm you have enough traffic. A/B testing only works once enough participants pass through each version that the result is unlikely to be chance — what statisticians call statistical significance. That takes volume. Use a sample-size calculator (see References) to check. As a rough guide, with under 1,000 unique visitors per week to the page, you likely cannot finish a meaningful test without waiting months. If that is you, switch to a method that learns from a handful of people — usability testing or customer discovery interviews — or test bolder changes that produce bigger, easier-to-detect differences. For B2B software teams, the best opportunity is usually inside the product (onboarding, feature adoption, the pricing page), not the marketing site: a feature flag (a switch that turns a change on for some users and off for others) lets you test against existing users where you control who sees what.
  2. Define one hypothesis. State what you are testing, what metric will change, and by how much. Example: “Changing the onboarding flow from 5 steps to 3 will increase the activation rate by at least 5 percentage points.” One change at a time — testing several at once is multivariate testing, not A/B.
  3. Pick the primary metric and the smallest difference worth detecting. Choose the single metric that matters most for this test — signups, activation, purchases, retention. Then decide the smallest improvement that would actually justify shipping the change, and write it down before you look at any data. This number (often called the minimum detectable effect) sets how big a sample you need and stops you from chasing a 0.1% win that isn’t worth the work.
  4. Calculate the required sample size and duration. Feed your current baseline rate, the smallest difference worth detecting, and your daily traffic into a sample-size calculator. It tells you how many participants each version needs and how many days that will take. Decide now to run that long — do not stop early because the numbers “look significant” partway through (see Peeking bias in Biases & Tips).
  5. Design the control and variant. Version A is the current experience (the control). Version B changes one thing only (the variant). Everything else stays identical. Write down what “shipping the change” means so the engineering work is scoped before the test, not after. Decide how many variants to run, too: two is the default, but a high-traffic site optimizing a fixed goal can use an AI-managed multi-variant platform to run many at once and auto-allocate traffic toward the winner.
  6. Pilot the setup. Run the test on 1% of traffic for a day to confirm the tracking fires correctly, the split is even, and the metric records as expected. A broken tracking tag discovered on day 14 is a wasted test.
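
The sample-size and duration math in steps 3 and 4 can be sketched in a few lines. This is a rough version of what a sample-size calculator does, using a standard normal-approximation formula for comparing two rates; the function names, the 5% significance level, and the 80% power are assumptions, and real calculators may give somewhat different answers.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_version(baseline, min_detectable_effect, alpha=0.05, power=0.80):
    """Participants needed in EACH version to detect an absolute lift.

    baseline: current conversion rate, e.g. 0.02 for 2%.
    min_detectable_effect: smallest absolute lift worth detecting,
        e.g. 0.01 for one percentage point.
    """
    p1 = baseline
    p2 = baseline + min_detectable_effect
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided cutoff
    z_beta = NormalDist().inv_cdf(power)
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


def days_to_run(per_version, daily_visitors, versions=2):
    """Days needed when daily traffic is split evenly across all versions."""
    return ceil(per_version * versions / daily_visitors)


# Hypothetical inputs: 2% baseline, aiming to detect a lift to 3%,
# with 500 visitors a day reaching the page.
n = sample_size_per_version(baseline=0.02, min_detectable_effect=0.01)
print(n, "participants per version,", days_to_run(n, daily_visitors=500), "days")
```

Raising the smallest difference worth detecting shrinks the required sample sharply, and adding a third version stretches the run because the same traffic is split three ways.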

Execution

  1. Set up the test in your tool. Use a feature flag or A/B testing platform to randomly split participants between the control and the variant. Verify the split is even within the first hour of traffic. Confirm both versions run at the same time under identical conditions — same time of day, same channels, same audience.
  2. Run for the full planned duration. Do not check partway through and stop the moment results “look significant.” Run until you reach the sample size from Prep step 4. Some tools offer always-valid testing, a built-in method that lets you look at results as they come in without inflating the odds of a false result — if yours does, follow its stopping rule rather than improvising.
  3. Watch for breakage, not winners. While the test runs, check for instrumentation drift, pipeline outages, or external events (a viral post, an outage, a concurrent campaign) that could contaminate results. Pause and restart if contamination is severe — do not try to “correct” the data after the fact.
  4. End the test cleanly. When the duration completes, freeze the data, snapshot the assignment logs, and remove the test infrastructure from the live experience as soon as the decision is made. Lingering test code is a future bug.
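
Checking that the split is even is easy to do by eye when the counts are close, but a quick test helps when they are not. The sketch below is one way to do it, assuming a planned 50/50 split: a chi-square goodness-of-fit check with one degree of freedom. The function name and the 0.001 alarm threshold are assumptions, not a rule from any specific tool.

```python
from math import erfc, sqrt


def split_looks_even(count_a, count_b, threshold=0.001):
    """Return (ok, p_value) for a planned 50/50 split.

    ok is False when the imbalance is very unlikely to be chance,
    which usually points to a bug in assignment or tracking.
    """
    expected = (count_a + count_b) / 2
    chi2 = ((count_a - expected) ** 2 + (count_b - expected) ** 2) / expected
    # Upper-tail probability of a chi-square value with 1 degree of freedom.
    p_value = erfc(sqrt(chi2 / 2))
    return p_value >= threshold, p_value


# Hypothetical counts from the first hours of a test.
print(split_looks_even(5000, 5050))  # a small gap that chance easily explains
print(split_looks_even(5000, 5600))  # a gap worth investigating before continuing
```

A failed check is a reason to pause and look for breakage, not something to adjust for afterwards.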

Analysis

  1. Decide how sure you need to be, and don’t over-demand it. Statistical significance measures how surprising your result would be if the two versions were truly identical; the more surprising, the less plausible it is that the gap is a fluke of chance. Tools usually report it as a p-value: 0.05 means roughly a 5% chance you’d see a gap this big even if there were no real difference (a cutoff often described as “95% confidence”). To check whether your numbers clear that bar, or how many more participants you’d need, run them through an experiment calculator, or hand the counts to a statistician or an AI to do the math. But 0.05 is a convention, not a requirement: an early-stage startup rarely has the traffic to reach it, and rarely needs to. Deciding whether to keep iterating on a headline is not the bet a public company makes when it changes checkout for millions of users. If the change is cheap and reversible and the early numbers point clearly one way, act on the weaker evidence and keep moving; holding out for a p-value your sample size will never reach is its own mistake. Match the bar to the cost of being wrong.
  2. Check whether the result is worth acting on. A small p-value only tells you the difference is probably not zero — not that it is big enough to matter. A 0.1% improvement that is statistically solid on millions of participants may not be worth the work to build and maintain. Weigh the size of the change against what shipping it costs.
  3. Segment the results. Aggregate results can hide that one segment loved the variant and another hated it. Slice by acquisition channel, device, new versus returning, and any segment where you expect different behavior. Treat segment effects as hypotheses for the next test, not conclusions.
  4. Decide: ship, kill, or iterate. If the variant wins on both bars (statistical and practical), ship it and document the learning. If it loses, kill it and document why you thought it would work. If results are inconclusive, decide whether the next move is a bolder change, more traffic, or a different method — not a longer run of the same test.
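
The two bars in the steps above, statistical and practical, can be checked together. The sketch below assumes a two-sided, two-proportion z-test, which is one common way tools compute a p-value for conversion rates; the function names, the counts, and the half-percentage-point threshold for a worthwhile lift are all hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no conversions (or all conversions): nothing to compare
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def decide(conv_a, n_a, conv_b, n_b, alpha=0.05, min_worthwhile_lift=0.005):
    """Return 'ship', 'kill', or 'iterate' for control A versus variant B."""
    lift = conv_b / n_b - conv_a / n_a
    p = two_proportion_p_value(conv_a, n_a, conv_b, n_b)
    if p < alpha and lift >= min_worthwhile_lift:
        return "ship"  # clears both the statistical and the practical bar
    if p < alpha and lift < 0:
        return "kill"  # the variant is reliably worse
    return "iterate"  # inconclusive, or real but too small to matter


# Hypothetical counts: control 80 of 4,000, variant 120 of 4,000.
print(decide(80, 4000, 120, 4000))
```

Set `alpha` and `min_worthwhile_lift` before the test starts, in line with the smallest difference you wrote down during Prep, and loosen `alpha` deliberately when the change is cheap and reversible.
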

Biases & Tips

  • Peeking bias Checking results before the planned sample size is reached and stopping the moment they “look significant” sharply raises the odds of a false win: the more often you look, the more likely you are to catch a random swing and mistake it for a real effect. Pre-commit to the sample size from Prep, or use a tool with always-valid testing built to handle repeated looks.
  • Novelty effect A new variant often outperforms the control simply because it is new and users notice it. The lift often fades after 2–4 weeks. For changes that will live in production long-term, run the test long enough to see the novelty wear off, or follow up with a holdout cohort.
  • Randomization imbalance If the random split is uneven across important segments (gender, plan tier, device, geography), the comparison is contaminated and the measured lift cannot be trusted.
  • Multiple comparisons Run 20 tests at once and one will likely “win” at the usual cutoff by luck alone, the same way that if 20 people each flip a coin four times, one of them will probably get four heads in a row. When you run many tests, ask your tool to tighten the cutoff to account for the number of tests (a standard correction it can apply), or focus on one test at a time.
  • Survivorship bias in results The participants who reach the end of a flow are not a random sample of everyone who entered it, and where people drop out can itself differ between the control and the variant. Measure every step of the flow, not just the final action, so you see where each version gains or loses people.
  • Confirmation bias in analysis If you want the variant to win, you will find a segment where it does. Pre-register the primary metric and the segments you will analyze before the test ends. Treat post-hoc segments as hypotheses, not findings.
  • Method-fit failure A/B testing tends to mislead on brand-new features no one has seen before (the novelty effect dominates), on changes too large to pin the result on a single variable, and on tests where the variant adds something the control simply lacks. In those cases the result is usually too tangled to interpret, so reach for a qualitative method instead.
  • Local-optima trap (AI-managed multi-variant) Auto-convergent platforms optimize hard against the metric you give them. They do not propose bolder reframings or cross-segment alternatives. After a few rounds of converging variants, you can plateau on a winner that beats the control but sits well below what a different framing or a different segment would reach. Periodically rerun against a deliberately distant variant to check whether you have flatlined.
  • Wrong-target optimization (AI during discovery) AI-managed multi-variant tools faithfully optimize for whichever signal converts now. They do not know whether the signups, clicks, or conversions are coming from the audience you actually want to serve, or whether the pain point your variants happen to address is the one you intended to validate. Run during discovery — when the value proposition or target customer is not yet locked in — and the platform will happily converge on a value proposition the founder is not actually interested in. Lock the segment and value proposition first (via Customer Discovery Interviews and Landing Page Test) before letting AI take the wheel.
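
Peeking bias is easier to believe once you see it happen. The simulation below is a sketch under invented settings: it runs many A/A tests, where both versions are identical and any "significant" result is a false alarm, and compares checking at ten evenly spaced points against checking once at the planned end. The function names, the 5% conversion rate, and the sample sizes are all assumptions.

```python
import random
from math import sqrt
from statistics import NormalDist


def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_alarm_rates(rate=0.05, per_arm=2000, looks=10, runs=500, alpha=0.05, seed=1):
    """Return (rate with peeking, rate with one final look) over many A/A tests."""
    rng = random.Random(seed)
    step = per_arm // looks
    peeking_alarms = fixed_alarms = 0
    for _ in range(runs):
        conv_a = conv_b = 0
        ever_significant = False
        significant = False
        for look in range(1, looks + 1):
            # Both arms share the same true rate, so there is no real effect.
            conv_a += sum(rng.random() < rate for _ in range(step))
            conv_b += sum(rng.random() < rate for _ in range(step))
            n = look * step
            significant = p_value(conv_a, n, conv_b, n) < alpha
            ever_significant = ever_significant or significant
        peeking_alarms += ever_significant  # a peeker would have stopped
        fixed_alarms += significant         # only the final look counts
    return peeking_alarms / runs, fixed_alarms / runs


peeking, fixed = false_alarm_rates()
print(f"false alarms with peeking: {peeking:.0%}, with one final look: {fixed:.0%}")
```

The single final look should land near the chosen cutoff, while stopping at the first significant peek produces false alarms noticeably more often, which is the reason to pre-commit to the sample size.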

Next Steps

  • If you cannot reach significance with the traffic you have, test a bolder change instead of a subtle one. A dramatic difference — a whole new page, not a new button color — produces a larger effect that shows up clearly in a far smaller sample, so an early-stage startup learns faster from big swings than from fractional tweaks.
  • If a result was inconclusive because traffic was thin, validate the idea with a smaller-sample method first — a Landing Page Test or Usability Testing tells you whether a change is worth a full A/B test before you spend the run time.
  • If thin traffic is the recurring blocker, drive more participants to the next test with an Online Ad Test so it can reach a trustworthy result faster.
  • Set up Dashboards to monitor the long-term impact of winning variants and catch any regression over time.

Learn more

Case Studies

Majestic Wine: Wedding-page redesign

A redesign of the UK wine retailer’s wedding-services category page — clearer headings, prominent CTAs, less clutter — lifted enquiry-form submissions by 201% at 99.99% statistical significance. A follow-up that removed a competing PDF download button added another 36.9%.

Six Pack Abs: Pricing test

Carl Juneau tested $19.95 against $29.95 on a workout product checkout page. Conversion rates were statistically equivalent (1.1% vs 1.0% across roughly 1,200–1,400 participants per arm), and the higher price generated 61.67% more revenue.

Airbnb: Three experiment pitfalls

The engineering team warns against stopping early on a p-value that crosses 0.05 in week one, judging on a single aggregate metric (a search redesign looked neutral overall but was broken in Internet Explorer, a 2% loss), and trusting the assignment system without running A/A dummy tests.

Grab: Chat to reduce booking cancellations

Grab tested in-app chat between drivers and riders as a way to drive down booking cancellations, finding that prompting an early automated GrabChat message cut post-allocation cancellations by up to 2 percentage points.

Coinbase: Restoring trust in experimentation

Coinbase’s in-house platform (CIFER) lacked sequential confidence intervals and CUPED, and a recommendation system it had marked a success turned out to have no real impact when re-tested. After moving to Eppo (now Datadog Experiments), the team reported 40% faster experiment analysis.

Notion: Scaling experiments with Statsig

Notion moved from single-digit experiments per quarter on an in-house tool to 300+ per quarter on Statsig, released 600+ features behind flags, and reported a 6% lift in activation rate.

Statsig: Experimentation in the age of AI

Statsig argues that as AI cheapens the build phase, measurement matters more: offline evals feed production experiments, AI-generated code needs metric-linked logging, and every engineer becomes a growth engineer testing ~10x more concepts. Customers cited include OpenAI, Figma, Notion, Grammarly, Atlassian, and Brex.

Booking.com: Homepage-redesign experiment

In December 2017 Booking.com’s design director proposed testing an entirely new homepage layout that stripped the interface down to destination, dates, party size, and three options (accommodations, flights, rental cars) — discarding years of optimized content as a test of the company’s experimentation culture.

Booking.com: 25,000 tests per year

Booking.com runs more than 25,000 A/B tests annually (about 70 per day) and has grown roughly twice as fast as the S&P 500 over more than a decade, the canonical case for experimentation as a competitive moat.
