
AI Review

Get an AI-generated read on an experiment — what it checks before launch, what it summarizes after, when it's worth running, and how the cache works.

Split Test Pro has two AI-powered review features, both Claude-backed and both opt-in: a pre-launch review that checks your experiment setup before you ship it, and a results review that interprets the running data. Neither is automatic; you click a button when you want one.

This guide covers what each does, when to use them, and the caveats that matter.

Pre-Launch AI Review

Available from the Pre-Launch Checklist, which appears when you click Start Experiment before launching. The button is labeled Get AI Review.

What it checks

The pre-launch review reads your experiment’s configuration and returns an assessment covering:

  1. Is this a reasonable test? — does the hypothesis match the change, is the metric appropriate, and are you testing one thing at a time.
  2. Estimated time to significance — based on a generic baseline rather than your workspace’s own traffic data; the prompt currently assumes 500 sessions/day and a 2% baseline conversion rate by default (a rough version of this calculation is sketched after this list).
  3. Obvious setup issues — empty variants, missing primary metric, unusual targeting.
  4. Suggested minimum detectable effect — the smallest lift you could realistically expect to detect given your traffic.
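
For a rough sense of where that time estimate comes from, here is a back-of-envelope version using a classical two-proportion sample-size approximation. The 500 sessions/day and 2% baseline figures are the documented defaults; the 95% confidence / 80% power thresholds and the formula itself are assumptions, and Split Test Pro’s own analysis is Bayesian, so its numbers will differ.

```typescript
// Back-of-envelope time-to-significance estimate (classical approximation,
// not Split Test Pro's actual Bayesian math).
const Z_ALPHA = 1.96; // two-sided 95% confidence
const Z_BETA = 0.84;  // 80% power

function daysToDetect(
  baselineRate: number,   // e.g. 0.02 (the documented default baseline)
  relativeLift: number,   // minimum detectable effect, e.g. 0.20 for a +20% lift
  sessionsPerDay: number, // e.g. 500 (the documented default traffic assumption)
  variants = 2
): number {
  const p1 = baselineRate;
  const p2 = baselineRate * (1 + relativeLift);
  const pBar = (p1 + p2) / 2;
  // Required sessions per variant for a two-proportion test
  const nPerVariant =
    ((Z_ALPHA * Math.sqrt(2 * pBar * (1 - pBar)) +
      Z_BETA * Math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2) /
    (p2 - p1) ** 2;
  // Traffic is split evenly across variants
  return Math.ceil((nPerVariant * variants) / sessionsPerDay);
}

console.log(daysToDetect(0.02, 0.2, 500)); // ≈ 85 days to detect a +20% relative lift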

When to use it

  • First experiment. Cheap sanity check that you’ve set things up reasonably.
  • High-stakes test. Anything touching checkout, pricing, or a major flow.
  • Unusual hypothesis. When you’re not sure the change is going to do what you think.

For routine tests where you’ve done the manual review (see Pre-Launch QA), the AI review is optional polish — not required.

What it returns

A short markdown response (up to ~1,000 tokens) shown inside the launch checklist banner. Read it, then either proceed via Start Test Anyway or click Cancel to fix something first. The review never blocks the launch.

Results AI Review

Available on any experiment’s Results tab. The button, also labeled Get AI Review, sits in the dashboard area.

What it returns

A structured markdown response with four sections:

  1. Summary — 2–3 sentences on what the data shows.
  2. Key Findings — 3–5 bullet points highlighting notable patterns (winning variants, funnel drop-offs, device discrepancies).
  3. Recommendation — one of three calls: Implement, Keep running, or Revert.
  4. Estimated Impact — a monthly revenue range if you applied the winning variant, based on current traffic and conversion rates (rough arithmetic sketched below).
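
The exact arithmetic behind that estimate isn’t published. As a minimal sketch of the kind of calculation involved, with entirely hypothetical inputs:

```typescript
// Rough monthly revenue impact estimate; all inputs are hypothetical examples,
// not Split Test Pro's published formula.
function estimatedMonthlyImpact(
  monthlySessions: number,   // e.g. 15,000 sessions/month
  baselineRate: number,      // control conversion rate, e.g. 0.02
  variantRate: number,       // winning variant's conversion rate, e.g. 0.024
  averageOrderValue: number  // e.g. 60 (dollars)
): number {
  const extraOrders = monthlySessions * (variantRate - baselineRate);
  return extraOrders * averageOrderValue;
}

// 15,000 sessions × (2.4% − 2.0%) × $60 ≈ $3,600/month if the lift holds
console.log(estimatedMonthlyImpact(15_000, 0.02, 0.024, 60));
```

Plugging in the low and high ends of the estimated lift rather than a single point value is what would turn this into a range.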

For Shopify experiments, the prompt includes the full funnel breakdown so the AI can flag where in the funnel a variant changed behavior. For HTML experiments, it includes the conversion goals you’ve defined.

When to use it

  • Mid-experiment sanity check — your gut says the result is meaningful, but is it?
  • End of experiment, before declaring a winner — independent read on whether you’re calling the right thing.
  • Communicating results to stakeholders — the structured Summary / Recommendation format is well-suited to dropping into a Slack message or PR description.

It’s most useful for non-statisticians. If you’re already comfortable reading credible intervals and probability distributions, you’ll often draw the same conclusions yourself.

How the Cache Works

The AI review is cached server-side per experiment. The cache is invalidated whenever new results are computed, so the analysis always reflects the most recent data snapshot. As a safety backstop, cached entries also expire after 6 hours, so even if results aren’t explicitly refreshed (for example, on a concluded experiment), you’ll never read an analysis older than six hours.
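
The implementation isn’t public, but the documented behavior maps onto a simple lookup rule. A minimal sketch, assuming a hypothetical getCachedReview helper and a resultsVersion counter that increments whenever results are recomputed:

```typescript
// Illustrative only; names are hypothetical, but the rules match the documented
// behavior: keyed per experiment, invalidated when results are recomputed,
// and expired after 6 hours regardless.
const SIX_HOURS_MS = 6 * 60 * 60 * 1000;

interface ReviewCacheEntry {
  markdown: string;       // the cached AI analysis
  resultsVersion: number; // version of the results snapshot it was generated from
  createdAt: number;      // epoch milliseconds
}

const cache = new Map<string, ReviewCacheEntry>(); // keyed by experiment ID

function getCachedReview(
  experimentId: string,
  currentResultsVersion: number,
  now: number = Date.now()
): string | null {
  const entry = cache.get(experimentId);
  if (!entry) return null;
  // New results since the review was generated: treat the cache as stale
  if (entry.resultsVersion !== currentResultsVersion) return null;
  // Six-hour backstop: never serve an analysis older than six hours
  if (now - entry.createdAt > SIX_HOURS_MS) return null;
  return entry.markdown;
}
```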

If results have updated since the last analysis, you’ll see a prompt to re-analyze above the existing review. Clicking Re-analyze immediately generates a fresh review against the current data.

A practical pattern: skip the AI review until you have meaningful data (a few hundred sessions per variant), then check it once a day during the experiment’s main run rather than on every page refresh.

What the AI Review Won’t Do

  • Won’t make the decision for you. It’s a recommendation, not an authority. The “Implement” call is based only on the data the prompt was given, not on any caveats you’ve already weighed yourself.
  • Won’t catch external context. It doesn’t know about your business goals, seasonal effects, or active marketing campaigns. The recommendation is “based on this data alone.”
  • Won’t flag long-tail risks. The Implement / Keep running / Revert framing is intentionally limited to three options. It doesn’t say “implement, but ramp slowly to 25% first” — that’s still your call.
  • Won’t replace your stats intuition. It’s a complement to Bayesian Stats Explained, not a substitute. If the underlying numbers don’t support the recommendation, the recommendation is wrong.

Cost and Usage

Each AI review consumes API tokens against the Split Test Pro account’s AI budget. There’s no per-merchant, per-day limit today, but heavy usage may hit shared rate limits; the practical ceiling is on the order of dozens of reviews per day per workspace. If you find you’re running AI reviews every few minutes, that’s a sign you’re peeking too aggressively at the results — see Common Mistakes.

Privacy

The prompt sent to Claude includes:

  • The experiment name, hypothesis, and primary metric
  • Variant counts, conversion totals, and conversion rates per variant
  • Funnel events (Shopify) or custom event totals (HTML)
  • The target URL pattern

It does not include:

  • Individual visitor IDs, sessions, or PII
  • Customer data, order details, or payment information
  • The actual variant CSS or JS code

If you have policies around what data can leave your domain, this is the surface to evaluate.
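
To make that surface concrete, here is a hypothetical sketch of the payload’s shape. The field names are invented; the categories mirror the two lists above.

```typescript
// Hypothetical shape of the data sent to Claude; field names are invented.
// Aggregate experiment data only: no visitor IDs, no customer or order data,
// and no variant CSS/JS.
interface AiReviewPayload {
  experimentName: string;
  hypothesis: string;
  primaryMetric: string;
  targetUrlPattern: string;
  variants: Array<{
    name: string;
    sessions: number;        // variant counts
    conversions: number;     // conversion totals
    conversionRate: number;  // conversion rate per variant
  }>;
  funnelEvents?: Record<string, number>; // Shopify experiments
  customEvents?: Record<string, number>; // HTML experiments
}
```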

Common Mistakes

  • Treating the AI review as final. It’s a summary. Read the actual numbers in the Results Dashboard.
  • Running it too early. A review at 100 sessions is going to be uncertain about everything. Wait until you have meaningful data.
  • Ignoring the staleness notice. If you see the “results have been updated” prompt, re-analyze before acting on the recommendation.
  • Skipping the human review. The AI is good at summary; it’s bad at “do I trust this enough to ship to all my customers.” That part is yours.
