
Common Mistakes

The most frequent ways A/B tests produce misleading results — peeking, stopping early, multi-variable variants, mid-experiment edits, ignoring segments — and how to avoid each.

Most A/B testing failures aren’t about the math — they’re about the human. This guide covers the patterns that consistently produce false wins, false losses, and unactionable results, so you can spot them in your own workflow before they cost you a quarter.

Peeking and Stopping Early

The single most common A/B testing mistake. The pattern:

  1. You launch an experiment.
  2. After 2 days, Variant B looks promising — 91% probability.
  3. You convince yourself it’s “going to” cross 95% any minute now.
  4. You stop the experiment.
  5. You ship the variant.
  6. Six weeks later, the change isn’t producing the expected lift.

What went wrong: 91% probability with thin data is much weaker than 95% probability with mature data. Stopping at 91% means accepting a 1-in-11 chance you’re wrong. Across many experiments, that adds up to a lot of false wins shipped to production.
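As a back-of-envelope illustration of how those false wins accumulate (the experiment count below is invented, not a product figure):

```python
# If every experiment is stopped the moment it touches a given probability
# threshold, roughly (1 - threshold) of those "wins" are false.
# The 40-experiments-per-year figure is purely illustrative.
experiments_per_year = 40
for threshold in (0.91, 0.95):
    expected_false_wins = experiments_per_year * (1 - threshold)
    print(f"stop at {threshold:.0%}: ~{expected_false_wins:.1f} false wins per year")
```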

The fix: wait for 95% probability AND at least 300–500 sessions per variant AND at least 7 days. All three. See Declaring a Winner.
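If it helps to make that rule mechanical, here is a minimal sketch of the three-condition check. The thresholds are the ones above; the function and field names are hypothetical, not Split Test Pro's API.

```python
from datetime import date

MIN_PROBABILITY = 0.95          # probability the variant beats control
MIN_SESSIONS_PER_VARIANT = 300  # lower end of the 300-500 guideline
MIN_DAYS = 7                    # at least one full week

def ready_to_declare(probability: float,
                     sessions_per_variant: int,
                     start_date: date,
                     today: date) -> bool:
    """Return True only when all three stopping conditions hold."""
    confident = probability >= MIN_PROBABILITY
    enough_data = sessions_per_variant >= MIN_SESSIONS_PER_VARIANT
    long_enough = (today - start_date).days >= MIN_DAYS
    return confident and enough_data and long_enough

# The scenario above: 91% probability after 2 days with thin data.
print(ready_to_declare(0.91, 180, date(2024, 6, 1), date(2024, 6, 3)))  # False
```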

Running Too Short

Even at high traffic, one full week is the minimum. Reasons:

  • Day-of-week effects. Conversion behavior on Tuesday afternoon is different from Saturday morning. A test that runs only Mon–Wed shows results from a non-representative slice.
  • Marketing campaign confounds. A 2-day test that overlaps an email campaign measures the variant + the campaign, not the variant alone.
  • Novelty effects. Visitors react to changes because they’re new. After a few days, that wears off and the “true” effect emerges.

The fix: always plan on 7+ days minimum, even if the math hits significance faster.

Multiple Changes Per Variant

You change three things in Variant B: button color, headline, trust badge position. Variant B wins.

Which change caused the lift? You don’t know. You can’t:

  • Apply the result generally — maybe only one of the three actually helped, and the other two were neutral or even negative.
  • Build a follow-up experiment — your starting point is “this combination won,” not “we know X helps.”
  • Trust the result — the three changes might interact in non-obvious ways.

The fix: change only one thing per variant. If you have multiple ideas, run them sequentially. See Testing Methodology.

Editing Mid-Experiment

You launch an experiment. Two days in, you notice a typo in the variant CSS. You fix it.

Now your data is split: visitors before the fix saw the old variant; visitors after saw the new one. You can’t cleanly attribute the result to either version. The experiment is contaminated.

The fix: end the experiment, fix the variant, start a fresh one. Don’t edit a running experiment’s variant content — even small changes invalidate the data. See Experiment Lifecycle.

Insufficient Sample Size

Even at 95% probability, with very small samples the credible intervals are wide enough that the result isn’t trustworthy. A 95% probability based on 50 sessions per variant is a fragile claim.

Rough thresholds:

  • Below 300 sessions per variant — provisional only, treat with skepticism.
  • 300–500 — minimum for any decision.
  • 1,000+ — comfortable for high-stakes decisions.

The fix: wait for sufficient samples in addition to high probability. The two conditions are independent — you need both.
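To see why both matter, here is a rough sketch of how the 95% credible interval around a 4% conversion rate narrows as sessions accumulate. It uses a generic Beta posterior with a flat prior; the product's exact model isn't assumed here.

```python
from scipy import stats

# Credible interval width for a ~4% conversion rate at different sample sizes.
for sessions in (50, 300, 1000, 5000):
    conversions = round(sessions * 0.04)
    posterior = stats.beta(1 + conversions, 1 + sessions - conversions)
    low, high = posterior.interval(0.95)
    print(f"{sessions:>5} sessions: {low:.1%} to {high:.1%}")
```

At 50 sessions the interval spans several percentage points; by 1,000 it has tightened enough to support a real decision.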

Wrong Primary Metric

You’re testing a button color. You set the primary metric to “newsletter signups” because it has the highest event volume.

The variant doesn’t touch the newsletter form. Whatever the result, it’s noise — the variant has no causal mechanism to influence newsletter signups.

The fix: the primary metric must be in the same causal chain as the change. Test a button → measure clicks (or downstream conversions). Test a hero image → measure scroll depth, time on page, or downstream actions. See Conversion Goals.

Ignoring Segments

The aggregate result is flat. You declare the experiment inconclusive.

But the device segments tell a different story:

  • Mobile: Variant B is up 20%.
  • Desktop: Variant B is down 15%.

The segments cancel out in aggregate. You missed a real opportunity (mobile win) and a real risk (desktop regression).
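The cancellation is plain arithmetic. A toy example with invented counts that reproduce the lifts above:

```python
# Invented session/conversion counts per segment: (sessions, conversions).
data = {
    "mobile":  {"control": (2000, 80),  "variant": (2000, 96)},
    "desktop": {"control": (2000, 100), "variant": (2000, 85)},
}

def lift(control, variant):
    control_rate = control[1] / control[0]
    variant_rate = variant[1] / variant[0]
    return (variant_rate - control_rate) / control_rate

for name, seg in data.items():
    print(f"{name}: {lift(seg['control'], seg['variant']):+.0%}")  # +20%, -15%

# Pool the segments and the effect nearly vanishes.
agg_control = tuple(map(sum, zip(*(seg["control"] for seg in data.values()))))
agg_variant = tuple(map(sum, zip(*(seg["variant"] for seg in data.values()))))
print(f"aggregate: {lift(agg_control, agg_variant):+.1%}")  # roughly +0.6%
```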

The fix: before declaring any experiment done, check the device segment breakdown. See Segmenting Results.

Running Conflicting Experiments

Two experiments target the same page area. One tests button color; one tests button size. Both run simultaneously.

Visitors are randomly assigned to both. Results for each show muddy patterns because the variants interact — a “small + orange” cell behaves differently from “large + orange” or “small + blue.”
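A toy illustration of that interaction, with invented conversion rates for the four cells: each experiment's own comparison looks nearly flat even though the cells differ sharply.

```python
# Invented conversion rates for the four (size, color) combinations.
cells = {
    ("small", "blue"): 0.040, ("small", "orange"): 0.055,
    ("large", "blue"): 0.050, ("large", "orange"): 0.042,
}

# The size experiment sees each of its variants averaged over both colors,
# because visitors are independently assigned a color by the other test.
for size in ("small", "large"):
    avg = sum(rate for (s, _), rate in cells.items() if s == size) / 2
    print(f"{size}: {avg:.2%}")  # 4.75% vs 4.60%, a muddy near-tie
```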

The fix: run experiments sequentially when they touch the same DOM region. Run them in parallel only when targeting non-overlapping pages or non-interacting elements. See Running Multiple Experiments.

Testing on Already-Optimized Pages

Your home page has been refined for years. Every element has been carefully tuned. You start an experiment with a small variant change.

Result after a month: inconclusive. The page is at a local optimum where small changes don’t move the needle.

The fix: for already-optimized surfaces, test bolder changes (a full hero swap, a layout reflow), not a button color that differs by 5%. Or focus your testing energy on under-optimized pages where the leverage is higher.

Not Applying the Winner

You ran an experiment. Variant B won. You declared the winner.

Six months later, you realize the change was never applied to the theme. The variant only ran during the experiment window; new visitors are seeing the original.

The fix: declaring a winner doesn’t ship the change. Add “apply winning CSS to theme” as the explicit next step in your experimentation workflow. See Declaring a Winner for the post-winner cleanup checklist.

Reading the Conversion Rate Without the Credible Interval

Variant B shows 5.0% conversion vs Control's 4.5%. That looks like an 11% relative lift.

But the credible intervals are 4.0%–6.0% for both variants. The two intervals overlap completely — the difference is well within noise.

The fix: the credible interval is what tells you whether a difference is real. Two non-overlapping intervals are signal; two overlapping intervals are noise, regardless of point estimates.
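A quick way to check is to compute both intervals and see whether they overlap. A sketch using a generic Beta posterior, with counts invented to roughly match the example above (the product's exact model may differ):

```python
from scipy import stats

def credible_interval(sessions, conversions, mass=0.95):
    # Beta posterior with a flat prior over the conversion rate.
    return stats.beta(1 + conversions, 1 + sessions - conversions).interval(mass)

c_low, c_high = credible_interval(2000, 90)   # control: 4.5% observed
v_low, v_high = credible_interval(2000, 100)  # variant: 5.0% observed

print(f"control: {c_low:.1%} to {c_high:.1%}")
print(f"variant: {v_low:.1%} to {v_high:.1%}")
print("intervals overlap:", not (c_high < v_low or v_high < c_low))  # True
```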

Treating Inconclusive as a Failure

You ran an experiment for three weeks. The probability never crossed 95%. You feel like the test “failed.”

It didn’t. An inconclusive result tells you the change has no detectable effect at this traffic level. That’s a finding:

  • The change isn’t worth shipping (no measurable lift to justify it).
  • Future tests in this area should test bolder changes (small lifts won’t show up).
  • The hypothesis behind the change may be wrong (worth questioning).

The fix: document inconclusive results the same way you document winners and losers. They’re as much a part of the program’s learning as the wins.

Using A/B Testing for Things It’s Bad At

A/B testing is great at:

  • Comparing two specific design choices.
  • Validating an incremental change.
  • Settling internal debates with data.

A/B testing is bad at:

  • Validating big strategic decisions. “Should we redesign the entire site?” doesn’t fit in an A/B test — and a redirect-based redesign comparison rarely produces clean results.
  • Catching long-term effects. A change that lifts conversion this week might damage retention three months from now. A/B tests measure short-term behavior; long-term outcomes need different methodology.
  • Testing in low-traffic environments. If you don’t have the volume, the test won’t converge no matter how rigorously you run it.

The fix: know when A/B testing is the right tool. For strategic questions, fall back on customer interviews, qualitative research, and judgment. For long-term effects, monitor cohort metrics on shipped changes, not just A/B results.

Confirmation Bias

You run an experiment. The result doesn’t match what you expected. You convince yourself the test was flawed and ignore the result.

Then you re-run it slightly differently until you get the answer you wanted.

This is the most insidious failure mode because you can talk yourself into it without noticing. Your test results become a report card on your hypotheses, not a reliable signal.

The fix: decide before launching what would change your mind. Write down the hypothesis and the expected outcome. If the data contradicts the expectation, the data is right and the hypothesis was wrong. Update your beliefs, don’t relitigate the test.

Next Steps

Ready to start testing?

Install Split Test Pro and run your first experiment today.

Install on Shopify