
Building a Testing Program

Move from ad-hoc tests to a sustainable experimentation practice — backlog discipline, ICE prioritization, weekly cadences, quarterly reviews, and the compounding effect of a few well-applied winners.

A single A/B test is an event. A testing program is a practice. The teams that get the most value from experimentation treat it as an ongoing discipline — not a thing they do when something feels wrong, but a steady cadence that compounds over time.

This guide is for teams ready to move past one-off experiments into a sustainable testing practice.

The Compounding Argument

Why bother building a program at all? Because of compounding:

  • A team running 20 experiments a year, with 30% winners producing average 5% lifts, applies 6 winners per year.
  • 6 × 5% lifts compound to roughly 34% improvement on the primary metric in twelve months.
  • That’s a typical case. Aggressive programs see substantially more.
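To make the arithmetic concrete, here is a minimal sketch (in TypeScript, using the illustrative numbers from the scenario above) of why sequential lifts multiply to roughly 34% rather than adding to 30%:

```typescript
// Illustrative numbers only: 20 tests/year, 30% win rate, 5% average lift.
const winnersPerYear = 20 * 0.3; // 6 applied winners
const avgLift = 0.05;

// Each win applies on top of the last, so lifts multiply rather than add.
const combined = Math.pow(1 + avgLift, winnersPerYear) - 1;

console.log(`${(combined * 100).toFixed(1)}% combined lift`); // ≈ 34.0%
```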

Compare to the alternative: ad-hoc tests when something feels wrong, no documented learnings, no compounding. Same number of experiments, but each one starts from scratch and leaves nothing behind. The wins don’t accumulate because they never get systematized.

The first 5–10 tests of a program teach you more about your customers than two years of intuition would. The next 50 build a body of pattern recognition that makes every subsequent decision faster.

The Five Pieces

A working program needs five things, all of them mundane:

  1. A populated backlog of hypotheses.
  2. A prioritization framework for the backlog.
  3. A consistent cadence for launching tests.
  4. A discipline for documenting results.
  5. A periodic review to extract patterns.

None of these are technical — they’re operational. But teams skip them and wonder why their A/B testing program never coalesces into a practice.

1. Build a Backlog

Keep a running list of test hypotheses somewhere everyone can see. Notion, a spreadsheet, a project management board — the tool doesn’t matter; the visibility does.

Add to it whenever you:

  • Notice something on a competitor’s site that’s different from yours.
  • Get a customer support question that hints at confusing UX.
  • Read a conversion optimization case study with a transferable pattern.
  • Have an internal debate that data could settle.
  • Spot a funnel drop-off in your analytics tool.

The backlog should always be longer than your testing capacity. If it’s empty, your team isn’t generating ideas. If it’s the same length all year, you’re not testing fast enough.

A backlog entry should capture:

  • Hypothesis — “We believe X will Y because Z.”
  • Where — which page or area.
  • Why now — what triggered the idea.
  • ICE score — see next section.

2. Prioritize With ICE

The ICE framework — Impact, Confidence, Ease — is a quick way to rank a backlog. Score each idea 1–3 on each axis:

  • Impact: if this wins, how much could it move the primary metric? (3 = major)
  • Confidence: how strongly do we believe this will work? (3 = very sure)
  • Ease: how simple is this to implement and test? (3 = trivial)

Add the three. Test the highest scores first.

ICE isn’t a precise tool — two reasonable people can score the same idea differently. But it forces the discussion that surfaces the real disagreements. A team where everyone agrees a 7-point idea is high priority is more aligned than a team where one person wants to test it and another thinks it’s pointless.
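If your backlog lives in code or you want to script the ranking, a minimal sketch of an ICE-scored entry might look like the following. The field names and shape are illustrative assumptions, not part of Split Test Pro:

```typescript
// Illustrative backlog shape; field names are assumptions, not an API.
type Score = 1 | 2 | 3;

interface BacklogEntry {
  hypothesis: string;  // "We believe X will Y because Z."
  where: string;       // which page or area
  impact: Score;       // how much it could move the primary metric
  confidence: Score;   // how strongly we believe it will work
  ease: Score;         // how simple it is to implement and test
}

const ice = (e: BacklogEntry): number => e.impact + e.confidence + e.ease;

// "Test the highest scores first."
const prioritized = (backlog: BacklogEntry[]): BacklogEntry[] =>
  [...backlog].sort((a, b) => ice(b) - ice(a));
```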

3. Set a Cadence

A program needs predictable rhythm. Pick something realistic:

  • Weekly launches — for high-traffic sites, you can ship a new test every Monday.
  • Bi-weekly launches — for medium-traffic sites, every other Monday.
  • Monthly launches — for low-traffic sites where each test takes weeks to converge.

The cadence isn’t about velocity for its own sake. It’s about forcing decisions. If you have a calendar slot to launch a test on Monday, you have to decide what to test by Friday — which forces backlog grooming, ICE scoring, and pre-launch review.

Without a cadence, tests slip. The team gets busy with other things, the backlog stagnates, and the program quietly dies.

The “always running” rule

At any given time, you should have at least one experiment running. If your cadence is weekly and a test ends, the next one should start within days — not weeks.

Idle calendar slots are a sign the program is starving for ideas or the team is starving for time. Both are fixable, but only if you notice.

4. Document Every Result

Every experiment, when it ends, gets a writeup. Three sentences minimum:

  1. What did you test, and what was the hypothesis?
  2. What was the result (with numbers)?
  3. What did you learn — and what does this imply for the next test?

  • For winners: include the change you made, the numbers, and the date you applied it to the theme.
  • For losers: include why you think it lost. Your post-hoc theory matters because it informs the next test.
  • For inconclusives: include the duration, the sample size you reached, and your decision (drop the idea, retry with a bigger change, retry on a higher-traffic page).
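As a sketch, a fill-in-the-blank version of the three-sentence writeup might read: “We tested [change] on [page] because we believed [hypothesis]. The variant did [result] versus control at [sample size] over [duration]. We learned [insight], which suggests [next test].”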

Most teams keep this in a single shared doc — a spreadsheet, a Notion database, or an internal wiki. The format matters less than the consistency.

5. Review Quarterly

Every quarter, set aside time to look at the entire experiment log:

  • Pattern in winners. What kinds of changes consistently win in your store? If, say, every test that simplified a form has won, that’s a generalizable insight.
  • Pattern in losers. What kinds of changes consistently lose? If every test that added social proof has lost, that’s a counter-intuitive finding worth investigating.
  • Pattern in inconclusives. Are there pages where tests never converge? Maybe traffic is too low, or maybe the page is at a local optimum.
  • Hypothesis quality. Are you writing better hypotheses over time? The first ones tend to be vague; later ones tend to be more precise.
  • Decision quality. Are you stopping experiments at the right time? Reviewing past calls helps calibrate future ones.

The quarterly review isn’t a status update — it’s a learning meeting. The output is a small list of insights that change how you run the next quarter’s tests.

Anti-Patterns

A few patterns that consistently kill programs:

“We’ll start when we’re ready”

Some teams wait for “the right moment” to start a real testing program. That moment never arrives. Start with one test this week, then another next week. The infrastructure (a backlog, a cadence) builds itself once you’re shipping tests regularly.

“Test everything”

Trying to test every UI decision exhausts the team and dilutes the signal across too many small tests. Pick the high-leverage changes (ICE-scored) and let the rest get decided by judgment.

“Wait for a winner before applying anything”

Teams that wait for an enormous lift before changing the theme apply nothing. Most winners are 3–10% lifts; ship them. Ten 5% wins compound to 60%+.

“Run the same test on every page”

A button-color test on the home page that wins doesn’t mean it’ll win on the product page. Different audiences, different intents. Re-test in each context if you want to ship it everywhere.

“One person owns testing”

When the testing program belongs to one person, it dies when they leave or get pulled into other work. Distribute ownership: have backups, rotate the runbook, make the process visible.

Tools and Process

Recommended setup for most teams:

  • Backlog — Notion database or spreadsheet with columns for Hypothesis, ICE, Status, Owner.
  • Documentation — same Notion / sheet, with one row per experiment plus a writeup column.
  • Cadence — recurring calendar event for launch day, plus a Slack reminder the day before.
  • Quarterly review — 90-minute meeting, agenda is “scroll through every test from the past quarter.”

You don’t need a heavy process tool. The discipline is what matters; the tool just needs to make the discipline visible.

Maturity Curve

Programs mature in roughly this order:

  1. Ad-hoc — random tests when something feels wrong. Most teams’ starting point.
  2. Cadence-driven — predictable launch rhythm. Backlog populated.
  3. Pattern-recognizing — quarterly reviews surface generalizable insights. Hypotheses get sharper.
  4. Strategically integrated — testing program informs product roadmap, not just optimization. Big bets get tested before launching at scale.

Most teams plateau at level 2. Reaching level 3 takes a year of consistent execution. Level 4 is rare and impressive.

Common Failure Modes

  • The program dies after the founder stops running it. Distribute ownership early.
  • The program becomes the conversion-rate-optimization team’s domain. Make tests cross-functional — design, product, engineering, support all surface ideas.
  • Wins don’t get applied. Build “ship the winner” into the launch checklist.
  • Inconclusive results pile up without a decision. Review them at the quarterly meeting and close each one out.

Next Steps

Ready to start testing?

Install Split Test Pro and run your first experiment today.

Install on Shopify