
A/B Testing Methodology

The principles that separate teams who get compounding gains from A/B testing from those who spin their wheels — hypothesis-driven testing, single-variable rigor, sequential discipline, and documenting losses.

Running an A/B test is straightforward. Running one that produces a result you can actually trust — and learn from — requires discipline. This guide covers the principles that separate teams who get compounding gains from those who spin their wheels.

Start With a Written Hypothesis

Every experiment should begin with a hypothesis in this form:

“We believe [change] will [increase/decrease] [metric] because [reason].”

For example:

“We believe making the add-to-cart button orange will increase add-to-cart rate because it creates stronger contrast against our white theme and draws the eye faster.”

A hypothesis forces you to articulate why you expect a change to work. That’s what makes the test valuable regardless of outcome:

  • If the variant wins, the hypothesis is confirmed and you can build on the underlying belief.
  • If it loses, you’ve learned something specific about your customers — they don’t think the way you assumed.
  • If it’s inconclusive, the hypothesis was untestable at this traffic level — also useful information for prioritizing future tests.

Tests without hypotheses are random guesses. Tests with hypotheses are learning opportunities regardless of outcome.
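If it helps to keep hypotheses structured rather than free-form, here is a minimal sketch (the `Hypothesis` class and its fields are our own convention, not a Split Test Pro feature) that captures the template's four parts and renders the standard sentence:

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    change: str      # what you will change
    direction: str   # "increase" or "decrease"
    metric: str      # the primary metric the change should move
    reason: str      # why you believe it will work

    def sentence(self) -> str:
        """Render the hypothesis in the standard template."""
        return (f"We believe {self.change} will {self.direction} "
                f"{self.metric} because {self.reason}.")

h = Hypothesis(
    change="making the add-to-cart button orange",
    direction="increase",
    metric="add-to-cart rate",
    reason="it creates stronger contrast against our white theme",
)
print(h.sentence())
```

Keeping the reason as a required field is the point: a test idea with an empty `reason` is a guess, not a hypothesis.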

Test One Thing at a Time

The single most important rule of A/B testing: change only one thing per variant.

If your variant has a new button color, a larger headline, and a moved trust badge — and it wins — you have no idea which of the three changes caused the lift. You can’t:

  • Build on the result with confidence.
  • Cleanly apply the winning change to other pages.
  • Know which change is worth keeping if you have to revert one.

Single-variable tests give you cleanly attributable results. Multi-variable tests give you a fuzzy “this combination beats that combination” answer that’s hard to translate into general lessons.

Prioritize Your Test Queue

You’ll quickly accumulate more ideas than you can test. A simple way to prioritize is the ICE framework:

| Factor | Question | Score (1–3) |
| --- | --- | --- |
| Impact | If this wins, how much could it move the needle? | 3 = major |
| Confidence | How strongly do we believe this will work? | 3 = very sure |
| Ease | How simple is this to implement and test? | 3 = trivial |

Add the three scores. Test the highest-scoring ideas first.
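As a sketch of the scoring step (the idea list and scores below are illustrative, not from a real backlog), sorting a backlog by ICE takes only a few lines:

```python
# Each idea gets 1-3 for Impact, Confidence, and Ease; higher total = test sooner.
ideas = [
    {"name": "Orange add-to-cart button", "impact": 3, "confidence": 2, "ease": 3},
    {"name": "Shorter checkout form",     "impact": 3, "confidence": 2, "ease": 1},
    {"name": "New footer color",          "impact": 1, "confidence": 1, "ease": 3},
]

for idea in ideas:
    idea["ice"] = idea["impact"] + idea["confidence"] + idea["ease"]

# Highest ICE score first: this is your test queue.
for idea in sorted(ideas, key=lambda i: i["ice"], reverse=True):
    print(f'{idea["ice"]:>2}  {idea["name"]}')
```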

High-impact areas to test first

For most stores and SaaS sites, these are the highest-leverage areas:

  • Primary CTA — color, size, label, position.
  • Headline copy — wording, font weight, size.
  • Hero image — product vs. lifestyle, person vs. object.
  • Trust elements — adding, removing, or repositioning badges and reviews.
  • Pricing presentation — size, color, compare-at price visibility.
  • Form length — removing optional fields.

Avoid starting with low-traffic or low-leverage changes (a footer color, a 404 page redesign). Even a winning result there won’t move overall numbers meaningfully.

Respect Statistical Significance

The math is covered in Bayesian Stats Explained, but as a methodological principle: never end an experiment early because the result looks good.

Early data is noisy. A lead that appears in the first 20% of your sessions means little; those visitors simply happened to arrive first. Stopping at 60% or even 85% probability to be best is a common source of false wins that hurt rather than help.

Minimum runtime guidelines:

  • Always run for at least 7 calendar days to capture full weekly traffic patterns.
  • For low-traffic sites (fewer than 50 conversions/month), plan on 3–4 weeks minimum.
  • For high-traffic sites, wait for 95%+ probability to be best AND 300+ sessions per variant.

See Declaring a Winner for the full decision rule.
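As a sketch of that decision rule (a Monte Carlo estimate of probability-to-be-best from Beta posteriors; this illustrates the principle and is not Split Test Pro's internal implementation), a winner is only declared when every threshold is met:

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=100_000):
    """Estimate P(variant B's true rate > A's) with Beta(1, 1) priors."""
    wins = 0
    for _ in range(draws):
        rate_a = random.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = random.betavariate(1 + conv_b, 1 + n_b - conv_b)
        if rate_b > rate_a:
            wins += 1
    return wins / draws

def can_declare_winner(conv_a, n_a, conv_b, n_b, days_running):
    p = prob_b_beats_a(conv_a, n_a, conv_b, n_b)
    return (p >= 0.95                  # 95%+ probability to be best
            and min(n_a, n_b) >= 300   # 300+ sessions per variant
            and days_running >= 7)     # full weekly traffic cycle

# Illustrative numbers: 8.0% vs 12.4% conversion after 9 days.
print(can_declare_winner(conv_a=40, n_a=500, conv_b=62, n_b=500, days_running=9))
```

Note that all three conditions are ANDed: a high probability with too few sessions, or too short a runtime, still means keep waiting.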

Run Experiments Sequentially Where They Interact

Two experiments running on the same page area at the same time create interaction effects — the results of each are influenced by the other. You can’t isolate what caused what.

The rule: one experiment per page area at a time.

If you have many pages and many ideas, run simultaneous experiments on different pages — a product-page experiment and a homepage experiment can run in parallel without interfering. See Running Multiple Experiments for the full breakdown of when parallel is safe and when it isn’t.
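One way to keep this honest is to tag every experiment with the page and page area it touches and flag collisions before launch. A minimal sketch (the tagging convention is our own, not an app feature):

```python
from collections import Counter

# (experiment name, page, page area); one live experiment per (page, area) pair.
running = [
    ("Orange CTA",        "product", "cta"),
    ("Lifestyle hero",    "home",    "hero"),
    ("Bigger price font", "product", "cta"),   # collides with "Orange CTA"
]

areas = Counter((page, area) for _, page, area in running)
conflicts = [key for key, count in areas.items() if count > 1]
if conflicts:
    print("Interaction risk, run these sequentially:", conflicts)
```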

Document Every Result — Including Losses

Most A/B tests don’t produce winners. That’s normal and expected — even the best-run testing programs see only 10–30% of tests produce statistically significant improvements.

The value is in the learning, not just the wins. For every experiment, document:

  • What was the hypothesis?
  • What did you change?
  • What was the result (with numbers)?
  • What did you conclude?

A test that shows your “obvious improvement” didn’t work is extremely valuable. It tells you your customers don’t think the way you assumed — and that’s more useful than confirming a bias.
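If you want the log to be machine-readable, a minimal sketch could mirror the four questions above (the record shown is hypothetical, with illustrative numbers):

```python
from dataclasses import dataclass

@dataclass
class ExperimentRecord:
    hypothesis: str   # the original "We believe..." statement
    change: str       # exactly what the variant changed
    result: str       # outcome with numbers, including losses
    conclusion: str   # what you now believe about your customers

log = [
    ExperimentRecord(
        hypothesis="We believe an orange add-to-cart button will increase "
                   "add-to-cart rate because it contrasts with our white theme.",
        change="Button color #1E73BE -> #FF7A00 on all product pages.",
        result="Control 7.9% vs. variant 8.0% after 14 days; inconclusive.",
        conclusion="Button color alone is not a lever at our traffic level.",
    ),
]
```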

Segment Beyond the Aggregate

Aggregate results sometimes hide important patterns. Before declaring a winner, check:

  • Did mobile and desktop users respond differently? See Segmenting Results.
  • Did the lift come from one funnel stage, or was it spread across all of them? (Shopify only — see Shopify Funnel Tracking.)
  • Was there a device-specific regression that the aggregate masked?

A flat aggregate result with a clear mobile win is still actionable — apply the change to mobile only.
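A sketch of the check itself (segment labels and numbers are illustrative): compute the per-segment rates and lift before trusting the aggregate.

```python
# Sessions and conversions per (variant, segment); numbers are illustrative.
data = {
    ("control", "mobile"):  (1_000, 50),
    ("variant", "mobile"):  (1_000, 65),
    ("control", "desktop"): (1_000, 80),
    ("variant", "desktop"): (1_000, 78),
}

for segment in ("mobile", "desktop"):
    n_c, c_c = data[("control", segment)]
    n_v, c_v = data[("variant", segment)]
    rate_c, rate_v = c_c / n_c, c_v / n_v
    lift = (rate_v - rate_c) / rate_c
    print(f"{segment}: control {rate_c:.1%}, variant {rate_v:.1%}, lift {lift:+.1%}")
```

With these numbers the aggregate looks roughly flat, while mobile shows a large win and desktop a slight dip; exactly the pattern a segment check is there to catch.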

Build a Testing Roadmap

Teams who get the most value treat A/B testing as an ongoing process, not a one-time project.

Build a backlog

Keep a running list of hypotheses. Add to it whenever you:

  • Notice something on a competitor’s site.
  • Get a customer support question about confusing UX.
  • Read a conversion optimization case study.
  • Have a debate that data could settle.

A populated backlog means you’re never stuck for what to test next.

Run one experiment at a time per page area

Keep experiments focused and sequential where they overlap. Complexity compounds your errors.

Apply winners immediately

When a variant wins, apply the change to your site permanently before starting the next test on that page area. Your baseline improves with every winner, so the lift each subsequent test measures is genuinely incremental.

Review quarterly

Every quarter, look at all experiments run. Ask:

  • What kinds of changes tend to win in your store?
  • What customer behavior keeps surprising you?
  • Are there patterns across categories (e.g., “removing trust badges always loses”)?

Use these insights to write better hypotheses going forward.

What Methodology Doesn’t Fix

A few things even rigorous testing can’t compensate for:

  • Insufficient traffic. No methodology turns a low-traffic site into a fast-feedback experimentation environment. Pick fewer, bigger bets.
  • Wrong primary metric. Testing a button color and judging by total revenue is a metric mismatch — see Conversion Goals.
  • Lying with averages. A strong mobile loss + flat desktop = “neutral” aggregate. Always check segments.
  • Confirmation bias. If you run the same kind of test ten times until you get a winner, you’ve gamed yourself. Set the test plan up front; don’t keep retrying until you like the answer.

Common Mistakes

  • Skipping the hypothesis. “Let’s try a red button” is not a hypothesis; it’s a guess. Reframe as “We believe red will outperform blue because…”
  • Stopping at 80%. It feels significant. It isn’t.
  • Testing trivial changes on low-traffic pages. A button color test on a page that gets 100 sessions/week will never converge.
  • Not applying winners. Many teams ship the test, win it, and never propagate the change to the theme. The win sits in the experiment, not in production.
  • Running too many experiments at once. More than 2–3 simultaneous tests on a small site means you’re starving each one of statistical power.

Next Steps

Ready to start testing?

Install Split Test Pro and run your first experiment today.

Install on Shopify