
A/B Testing Methodology

The principles that separate teams who get compounding gains from A/B testing from those who spin their wheels — hypothesis-driven testing, single-variable rigor, sequential discipline, and documenting losses.

Running an A/B test is straightforward. Running one that produces a result you can actually trust — and learn from — requires discipline. This guide covers the principles that separate teams who get compounding gains from those who spin their wheels.

Start With a Written Hypothesis

Every experiment should begin with a hypothesis in this form:

“We believe [change] will [increase/decrease] [metric] because [reason].”

For example:

“We believe making the add-to-cart button orange will increase add-to-cart rate because it creates stronger contrast against our white theme and draws the eye faster.”

A hypothesis forces you to articulate why you expect a change to work. That’s what makes the test valuable regardless of outcome:

  • If the variant wins, the hypothesis is confirmed and you can build on the underlying belief.
  • If it loses, you’ve learned something specific about your customers — they don’t think the way you assumed.
  • If it’s inconclusive, the hypothesis was untestable at this traffic level — also useful information for prioritizing future tests.

Tests without hypotheses are random guesses. Tests with hypotheses are learning opportunities regardless of outcome.
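If it helps to keep hypotheses structured rather than free-form, here is a minimal sketch (the `Hypothesis` class and its fields are our own convention, not a Split Test Pro feature) that captures the template's four parts and renders the standard sentence:

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    change: str      # what you will change
    direction: str   # "increase" or "decrease"
    metric: str      # the primary metric the change should move
    reason: str      # why you believe it will work

    def sentence(self) -> str:
        """Render the hypothesis in the standard template."""
        return (f"We believe {self.change} will {self.direction} "
                f"{self.metric} because {self.reason}.")

h = Hypothesis(
    change="making the add-to-cart button orange",
    direction="increase",
    metric="add-to-cart rate",
    reason="it creates stronger contrast against our white theme",
)
print(h.sentence())
```

Keeping the reason as a required field is the point: a test idea with an empty `reason` is a guess, not a hypothesis.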

Test One Thing at a Time

The single most important rule of A/B testing: change only one thing per variant.

If your variant has a new button color, a larger headline, and a moved trust badge — and it wins — you have no idea which of the three changes caused the lift. You can’t:

  • Build on the result with confidence.
  • Cleanly apply the winning change to other pages.
  • Know which change is worth keeping if you have to revert one.

Single-variable tests give you cleanly attributable results. Multi-variable tests give you a fuzzy “this combination beats that combination” answer that’s hard to translate into general lessons.

Prioritize Your Test Queue

You’ll quickly accumulate more ideas than you can test. A simple way to prioritize is the ICE framework:

| Factor | Question | Score (1–3) |
| --- | --- | --- |
| Impact | If this wins, how much could it move the needle? | 3 = major |
| Confidence | How strongly do we believe this will work? | 3 = very sure |
| Ease | How simple is this to implement and test? | 3 = trivial |

Add the three scores. Test the highest-scoring ideas first.
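As a sketch of the scoring step (the idea list and scores below are illustrative, not from a real backlog), sorting a backlog by ICE takes only a few lines:

```python
# Each idea gets 1-3 for Impact, Confidence, and Ease; higher total = test sooner.
ideas = [
    {"name": "Orange add-to-cart button", "impact": 3, "confidence": 2, "ease": 3},
    {"name": "Shorter checkout form",     "impact": 3, "confidence": 2, "ease": 1},
    {"name": "New footer color",          "impact": 1, "confidence": 1, "ease": 3},
]

for idea in ideas:
    idea["ice"] = idea["impact"] + idea["confidence"] + idea["ease"]

# Highest ICE score first: this is your test queue.
for idea in sorted(ideas, key=lambda i: i["ice"], reverse=True):
    print(f'{idea["ice"]:>2}  {idea["name"]}')
```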

High-impact areas to test first

For most stores and SaaS sites, these are the highest-leverage areas:

  • Primary CTA — color, size, label, position.
  • Headline copy — wording, font weight, size.
  • Hero image — product vs. lifestyle, person vs. object.
  • Trust elements — adding, removing, or repositioning badges and reviews.
  • Pricing presentation — size, color, compare-at price visibility.
  • Form length — removing optional fields.

Avoid starting with low-traffic or low-leverage changes (a footer color, a 404 page redesign). Even a winning result there won’t move overall numbers meaningfully.

Respect Statistical Significance

The math is covered in Bayesian Stats Explained, but as a methodological principle: never end an experiment early because the result looks good.

Early data is noisy. A lead that appears in the first 20% of your sessions means little; those visitors simply happened to arrive first. Stopping at 60% or even 85% probability to be best is a common source of false wins that hurt rather than help.

Minimum runtime guidelines:

  • Always run for at least 7 calendar days to capture full weekly traffic patterns.
  • For low-traffic sites (fewer than 50 conversions/month), plan on 3–4 weeks minimum.
  • For high-traffic sites, wait for 95%+ probability to be best AND 300+ sessions per variant.

See Declaring a Winner for the full decision rule.
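As a sketch of that decision rule (a Monte Carlo estimate of probability-to-be-best from Beta posteriors; this illustrates the principle and is not Split Test Pro's internal implementation), a winner is only declared when every threshold is met:

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=100_000):
    """Estimate P(variant B's true rate > A's) with Beta(1, 1) priors."""
    wins = 0
    for _ in range(draws):
        rate_a = random.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = random.betavariate(1 + conv_b, 1 + n_b - conv_b)
        if rate_b > rate_a:
            wins += 1
    return wins / draws

def can_declare_winner(conv_a, n_a, conv_b, n_b, days_running):
    p = prob_b_beats_a(conv_a, n_a, conv_b, n_b)
    return (p >= 0.95                  # 95%+ probability to be best
            and min(n_a, n_b) >= 300   # 300+ sessions per variant
            and days_running >= 7)     # full weekly traffic cycle

# Illustrative numbers: 8.0% vs 12.4% conversion after 9 days.
print(can_declare_winner(conv_a=40, n_a=500, conv_b=62, n_b=500, days_running=9))
```

Note that all three conditions are ANDed: a high probability with too few sessions, or too short a runtime, still means keep waiting.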

Run Experiments Sequentially Where They Interact

Two experiments running on the same page area at the same time create interaction effects — the results of each are influenced by the other. You can’t isolate what caused what.

The rule: one experiment per page area at a time.

If you have many pages and many ideas, run simultaneous experiments on different pages — a product-page experiment and a homepage experiment can run in parallel without interfering. See Running Multiple Experiments for the full breakdown of when parallel is safe and when it isn’t.
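One way to keep this honest is to tag every experiment with the page and page area it touches and flag collisions before launch. A minimal sketch (the tagging convention is our own, not an app feature):

```python
from collections import Counter

# (experiment name, page, page area); one live experiment per (page, area) pair.
running = [
    ("Orange CTA",        "product", "cta"),
    ("Lifestyle hero",    "home",    "hero"),
    ("Bigger price font", "product", "cta"),   # collides with "Orange CTA"
]

areas = Counter((page, area) for _, page, area in running)
conflicts = [key for key, count in areas.items() if count > 1]
if conflicts:
    print("Interaction risk, run these sequentially:", conflicts)
```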

Document Every Result — Including Losses

Most A/B tests don’t produce winners. That’s normal and expected — even the best-run testing programs see only 10–30% of tests produce statistically significant improvements.

The value is in the learning, not just the wins. For every experiment, document:

  • What was the hypothesis?
  • What did you change?
  • What was the result (with numbers)?
  • What did you conclude?

A test that shows your “obvious improvement” didn’t work is extremely valuable. It tells you your customers don’t think the way you assumed — and that’s more useful than confirming a bias.
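If you want the log to be machine-readable, a minimal sketch could mirror the four questions above (the record shown is hypothetical, with illustrative numbers):

```python
from dataclasses import dataclass

@dataclass
class ExperimentRecord:
    hypothesis: str   # the original "We believe..." statement
    change: str       # exactly what the variant changed
    result: str       # outcome with numbers, including losses
    conclusion: str   # what you now believe about your customers

log = [
    ExperimentRecord(
        hypothesis="We believe an orange add-to-cart button will increase "
                   "add-to-cart rate because it contrasts with our white theme.",
        change="Button color #1E73BE -> #FF7A00 on all product pages.",
        result="Control 7.9% vs. variant 8.0% after 14 days; inconclusive.",
        conclusion="Button color alone is not a lever at our traffic level.",
    ),
]
```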

Segment Beyond the Aggregate

Aggregate results sometimes hide important patterns. Before declaring a winner, check:

  • Did mobile and desktop users respond differently? See Segmenting Results.
  • Did the lift come from one funnel stage, or was it spread across all of them? (Shopify only — see Shopify Funnel Tracking.)
  • Was there a device-specific regression that the aggregate masked?

A flat aggregate result with a clear mobile win is still actionable — apply the change to mobile only.
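A sketch of the check itself (segment labels and numbers are illustrative): compute the per-segment rates and lift before trusting the aggregate.

```python
# Sessions and conversions per (variant, segment); numbers are illustrative.
data = {
    ("control", "mobile"):  (1_000, 50),
    ("variant", "mobile"):  (1_000, 65),
    ("control", "desktop"): (1_000, 80),
    ("variant", "desktop"): (1_000, 78),
}

for segment in ("mobile", "desktop"):
    n_c, c_c = data[("control", segment)]
    n_v, c_v = data[("variant", segment)]
    rate_c, rate_v = c_c / n_c, c_v / n_v
    lift = (rate_v - rate_c) / rate_c
    print(f"{segment}: control {rate_c:.1%}, variant {rate_v:.1%}, lift {lift:+.1%}")
```

With these numbers the aggregate looks roughly flat, while mobile shows a large win and desktop a slight dip; exactly the pattern a segment check is there to catch.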

Build a Testing Roadmap

Teams who get the most value treat A/B testing as an ongoing process, not a one-time project.

Build a backlog

Keep a running list of hypotheses. Add to it whenever you:

  • Notice something on a competitor’s site.
  • Get a customer support question about confusing UX.
  • Read a conversion optimization case study.
  • Have a debate that data could settle.

A populated backlog means you’re never stuck for what to test next.

Run one experiment at a time per page area

Keep experiments focused and sequential where they overlap. Complexity compounds your errors.

Apply winners immediately

When a variant wins, apply the change to your site permanently before starting the next test on that page area. Your baseline improves with every winner, so the lift each subsequent test measures is genuinely incremental.

Review quarterly

Every quarter, look at all experiments run. Ask:

  • What kinds of changes tend to win in your store?
  • What customer behavior keeps surprising you?
  • Are there patterns across categories (e.g., “removing trust badges always loses”)?

Use these insights to write better hypotheses going forward.

What Methodology Doesn’t Fix

A few things even rigorous testing can’t compensate for:

  • Insufficient traffic. No methodology turns a low-traffic site into a fast-feedback experimentation environment. Pick fewer, bigger bets.
  • Wrong primary metric. Testing a button color and judging by total revenue is a metric mismatch — see Conversion Goals.
  • Lying with averages. A strong mobile loss + flat desktop = “neutral” aggregate. Always check segments.
  • Confirmation bias. If you run the same kind of test ten times until you get a winner, you’ve gamed yourself. Set the test plan up front; don’t keep retrying until you like the answer.

Common Mistakes

  • Skipping the hypothesis. “Let’s try a red button” is not a hypothesis; it’s a guess. Reframe as “We believe red will outperform blue because…”
  • Stopping at 80%. It feels significant. It isn’t.
  • Testing trivial changes on low-traffic pages. A button color test on a page that gets 100 sessions/week will never converge.
  • Not applying winners. Many teams ship the test, win it, and never propagate the change to the theme. The win sits in the experiment, not in production.
  • Running too many experiments at once. More than 2–3 simultaneous tests on a small site means you’re starving each one of statistical power.

Next Steps

Ready to start testing?

Install Split Test Pro and run your first experiment today.

Install on Shopify