A/B Testing Methodology
The principles that separate teams who get compounding gains from A/B testing from those who spin their wheels — hypothesis-driven testing, single-variable rigor, sequential discipline, and documenting losses.
Running an A/B test is straightforward. Running one that produces a result you can actually trust — and learn from — requires discipline. This guide covers the principles that separate teams who get compounding gains from those who spin their wheels.
Start With a Written Hypothesis
Every experiment should begin with a hypothesis in this form:
“We believe [change] will [increase/decrease] [metric] because [reason].”
For example:
“We believe making the add-to-cart button orange will increase add-to-cart rate because it creates stronger contrast against our white theme and draws the eye faster.”
A hypothesis forces you to articulate why you expect a change to work. That’s what makes the test valuable regardless of outcome:
- If the variant wins, the hypothesis is supported and you can build on the underlying belief.
- If it loses, you’ve learned something specific about your customers — they don’t think the way you assumed.
- If it’s inconclusive, the effect, if any, was too small to detect at this traffic level, which is also useful information for prioritizing future tests.
Tests without hypotheses are random guesses. Tests with hypotheses are learning opportunities regardless of outcome.
Test One Thing at a Time
The single most important rule of A/B testing: change only one thing per variant.
If your variant has a new button color, a larger headline, and a moved trust badge — and it wins — you have no idea which of the three changes caused the lift. You can’t:
- Build on the result with confidence.
- Cleanly apply the winning change to other pages.
- Know which change is worth keeping if you have to revert one.
Single-variable tests give you cleanly attributable results. Multi-variable tests give you a fuzzy “this combination beats that combination” answer that’s hard to translate into general lessons.
Prioritize Your Test Queue
You’ll quickly accumulate more ideas than you can test. A simple way to prioritize is the ICE framework:
| Factor | Question | Score (1–3) |
|---|---|---|
| Impact | If this wins, how much could it move the needle? | 1 = minor, 3 = major |
| Confidence | How strongly do we believe this will work? | 1 = a hunch, 3 = very sure |
| Ease | How simple is this to implement and test? | 1 = hard, 3 = trivial |
Add the three scores. Test the highest-scoring ideas first.
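If your backlog lives in a spreadsheet or a script, the scoring is easy to automate. Here is a minimal Python sketch; the idea names and scores are made-up examples, not recommendations.

```python
# Minimal ICE-scoring sketch. Ideas and scores are hypothetical examples.
backlog = [
    {"idea": "Orange add-to-cart button", "impact": 3, "confidence": 2, "ease": 3},
    {"idea": "Shorter checkout form",     "impact": 3, "confidence": 3, "ease": 1},
    {"idea": "New footer color",          "impact": 1, "confidence": 1, "ease": 3},
]

# ICE score = Impact + Confidence + Ease. Test the highest scores first.
for item in backlog:
    item["ice"] = item["impact"] + item["confidence"] + item["ease"]

for item in sorted(backlog, key=lambda i: i["ice"], reverse=True):
    print(f"{item['ice']}  {item['idea']}")
```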
High-impact areas to test first
For most stores and SaaS sites, these are the highest-leverage areas:
- Primary CTA — color, size, label, position.
- Headline copy — wording, font weight, size.
- Hero image — product vs. lifestyle, person vs. object.
- Trust elements — adding, removing, or repositioning badges and reviews.
- Pricing presentation — size, color, compare-at price visibility.
- Form length — removing optional fields.
Avoid starting with low-traffic or low-leverage changes (a footer color, a 404 page redesign). Even a winning result there won’t move overall numbers meaningfully.
Respect Statistical Significance
The math is covered in Bayesian Stats Explained, but as a methodological principle: never end an experiment early because the result looks good.
Early data is noisy. A result built on the first 20% of your sessions mostly reflects who happened to arrive first, not your customer base as a whole. Stopping at 60% or even 85% probability is a common source of false wins that hurt rather than help.
Minimum runtime guidelines:
- Always run for at least 7 calendar days to capture full weekly traffic patterns.
- For low-traffic sites (fewer than 50 conversions/month), plan on 3–4 weeks minimum.
- For high-traffic sites, wait for 95%+ probability to be best AND 300+ sessions per variant.
See Declaring a Winner for the full decision rule.
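For intuition about what “probability to be best” means, the sketch below estimates it by Monte Carlo sampling from Beta posteriors. The conversion counts are hypothetical, and the actual calculation described in Bayesian Stats Explained may use different priors or methods; treat this as a rough illustration, not the product’s formula.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: (conversions, sessions) for each variant.
control = (40, 1000)
variant = (55, 1000)

# Posterior for each conversion rate, assuming a uniform Beta(1, 1) prior:
# Beta(1 + conversions, 1 + non-conversions).
draws_a = rng.beta(1 + control[0], 1 + control[1] - control[0], 100_000)
draws_b = rng.beta(1 + variant[0], 1 + variant[1] - variant[0], 100_000)

# "Probability to be best" = share of posterior draws in which the
# variant's conversion rate exceeds the control's.
p_best = (draws_b > draws_a).mean()
print(f"P(variant beats control) ~ {p_best:.1%}")
```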
Run Experiments Sequentially Where They Interact
Two experiments running on the same page area at the same time create interaction effects — the results of each are influenced by the other. You can’t isolate what caused what.
The rule: one experiment per page area at a time.
If you have many pages and many ideas, run simultaneous experiments on different pages — a product-page experiment and a homepage experiment can run in parallel without interfering. See Running Multiple Experiments for the full breakdown of when parallel is safe and when it isn’t.
Document Every Result — Including Losses
Most A/B tests don’t produce winners. That’s normal and expected — even the best-run testing programs see only 10–30% of tests produce statistically significant improvements.
The value is in the learning, not just the wins. For every experiment, document:
- What was the hypothesis?
- What did you change?
- What was the result (with numbers)?
- What did you conclude?
A test that shows your “obvious improvement” didn’t work is extremely valuable. It tells you your customers don’t think the way you assumed — and that’s more useful than confirming a bias.
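A spreadsheet is fine for this. If you prefer a structured log, here is a minimal sketch of one entry; the field names and example values are illustrative, not a required schema.

```python
from dataclasses import dataclass

@dataclass
class ExperimentLog:
    """One entry in the experiment journal. Fields are illustrative."""
    hypothesis: str  # "We believe [change] will [move] [metric] because [reason]."
    change: str      # What actually shipped in the variant.
    result: str      # The numbers, win or lose.
    conclusion: str  # What you now believe about your customers.

entry = ExperimentLog(
    hypothesis="We believe an orange add-to-cart button will increase "
               "add-to-cart rate because it contrasts with our white theme.",
    change="Button color changed to orange on all product pages.",
    result="Variant 3.1% vs. control 3.0% add-to-cart rate; inconclusive after 21 days.",
    conclusion="Button color alone is not a big lever for us; test copy next.",
)
```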
Segment Beyond the Aggregate
Aggregate results sometimes hide important patterns. Before declaring a winner, check:
- Did mobile and desktop users respond differently? See Segmenting Results.
- Did the lift come from one funnel stage or distribute across all of them? (Shopify only — see Shopify Funnel Tracking.)
- Was there a device-specific regression that the aggregate masked?
A flat aggregate result with a clear mobile win is still actionable — apply the change to mobile only.
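As a sketch of that check, assuming you can export one row per session with device, variant, and conversion columns (your tool’s export format may differ), the comparison is a single groupby. The data here is made up:

```python
import pandas as pd

# Hypothetical export: one row per session.
df = pd.DataFrame({
    "device":    ["mobile"] * 4 + ["desktop"] * 4,
    "variant":   ["A", "A", "B", "B"] * 2,
    "converted": [0, 1, 1, 1, 1, 0, 0, 1],
})

# Conversion rate and sample size per (device, variant) cell. A flat
# aggregate can hide a clear win in one cell and a regression in another.
print(df.groupby(["device", "variant"])["converted"].agg(["mean", "count"]))
```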
Build a Testing Roadmap
Teams who get the most value treat A/B testing as an ongoing process, not a one-time project.
Build a backlog
Keep a running list of hypotheses. Add to it whenever you:
- Notice something on a competitor’s site.
- Get a customer support question about confusing UX.
- Read a conversion optimization case study.
- Have a debate that data could settle.
A populated backlog means you’re never stuck for what to test next.
Run one experiment at a time per page area
Keep experiments focused, and run them sequentially where they overlap. Every extra simultaneous variable makes the results harder to attribute.
Apply winners immediately
When a variant wins, apply the change to your site permanently before starting the next test on that page area. Your baseline improves with every winner — meaning the next test starts from a better starting point, and the lift you measure is genuinely incremental.
Review quarterly
Every quarter, look at all experiments run. Ask:
- What kinds of changes tend to win in your store?
- What customer behavior keeps surprising you?
- Are there patterns across categories (e.g., “removing trust badges always loses”)?
Use these insights to write better hypotheses going forward.
What Methodology Doesn’t Fix
A few things even rigorous testing can’t compensate for:
- Insufficient traffic. No methodology turns a low-traffic site into a fast-feedback experimentation environment. Pick fewer, bigger bets.
- Wrong primary metric. Testing a button color and judging by total revenue is a metric mismatch — see Conversion Goals.
- Lying with averages. A strong mobile loss offset by a desktop win can net out to a “neutral” aggregate. Always check segments.
- Confirmation bias. If you run the same kind of test ten times until you get a winner, you’ve gamed yourself. Set the test plan up front; don’t keep retrying until you like the answer.
Common Mistakes
- Skipping the hypothesis. “Let’s try a red button” is not a hypothesis; it’s a guess. Reframe as “We believe red will outperform blue because…”
- Stopping at 80%. It feels significant. It isn’t.
- Testing trivial changes on low-traffic pages. A button color test on a page that gets 100 sessions/week can take years to converge; the back-of-envelope sketch after this list shows why.
- Not applying winners. Many teams ship the test, win it, and never propagate the change to the theme. The win sits in the experiment, not in production.
- Running too many experiments at once. More than 2–3 simultaneous tests on a small site means you’re starving each one of statistical power.
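To see why low-traffic tests stall, here is a back-of-envelope sample-size estimate using the standard two-proportion approximation. The baseline rate and target lift are assumptions, and a Bayesian test does not use a fixed sample-size rule, so treat this only as a feel for the scale involved.

```python
from scipy.stats import norm

# Assumptions (not from real data): 3% baseline conversion rate,
# 20% relative lift, 5% two-sided significance, 80% power.
p1 = 0.03
p2 = p1 * 1.20
z_alpha = norm.ppf(0.975)  # two-sided 5% significance
z_beta = norm.ppf(0.80)    # 80% power

# Standard two-proportion sample-size approximation, per variant.
n = (z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2
print(f"~{n:,.0f} sessions per variant")  # roughly 14,000

# At 100 sessions/week split across two variants, that is years of waiting.
print(f"~{2 * n / 100:,.0f} weeks to finish")
```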
Next Steps
- Avoid the most common foot-guns: Common Mistakes.
- Build the experimentation muscle into a long-term program: Building a Testing Program.
- Decide when 95% is enough to call it: Declaring a Winner.
Ready to start testing?
Install Split Test Pro and run your first experiment today.