Bayesian Stats Explained
How Split Test Pro's Bayesian engine computes "probability to be best" and credible intervals — the priors, the Monte Carlo sampling, and why this beats traditional p-values for product decisions.
Split Test Pro uses a Bayesian engine for its results — not the frequentist p-values you might have seen in older A/B testing tools. This guide explains why, what the engine actually does under the hood, and how to read the numbers it produces.
You don’t need a stats background to use Split Test Pro. But if you want to know why the dashboard says “73% probability to be best” and not “p < 0.05,” this is the doc.
Frequentist vs Bayesian, Briefly
Frequentist (the traditional approach):
- You compute a p-value — the probability of seeing this data if the null hypothesis (no difference between variants) were true.
- You stop the test when p < 0.05.
- You can’t peek at p-values during the test — looking inflates your false-positive rate.
- Predetermined sample size required.
Bayesian (Split Test Pro’s approach):
- You compute a posterior probability — the probability that Variant B is better than Control given the data.
- You stop the test when the probability crosses a threshold (typically 95%).
- You can check the result anytime — the math doesn’t care about peeking.
- No predetermined sample size. The data updates the model continuously.
The Bayesian framing matches how product teams actually think: “How confident are we that the new variant is better?” — not “How surprising would this data be if there were no difference?”
The Statistical Model
Split Test Pro uses two models depending on the metric type. Both are computed independently per variant.
Binary metrics — Beta-Binomial
For binary metrics (conversion happened / didn’t happen), each variant’s conversion rate is modeled as a Beta distribution:
posterior = Beta(α, β)
where α = conversions + 1
β = (sessions − conversions) + 1
The +1 on each parameter is the Beta(1, 1) prior — a uniform prior that says “before seeing any data, every conversion rate from 0% to 100% is equally plausible.” This is the standard “uninformative” prior for binary A/B testing. As data accumulates, the posterior tightens around the true rate.
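Here’s what that looks like in practice. This is a minimal sketch with made-up counts, using SciPy’s Beta distribution rather than Split Test Pro’s internal code:

```python
# Sketch: Beta-Binomial posterior for one variant's conversion rate.
# The counts are invented for illustration.
from scipy import stats

conversions = 48
sessions = 1_000

# Beta(1, 1) uniform prior + observed data, matching the formula above.
alpha = conversions + 1
beta = (sessions - conversions) + 1
posterior = stats.beta(alpha, beta)

print(posterior.mean())          # posterior mean conversion rate (~4.9%)
print(posterior.interval(0.95))  # 95% credible interval for the rate
```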
Continuous metrics — Normal
For continuous metrics (revenue per session, AOV, time spent), each variant’s mean value is modeled as a Normal distribution parameterized by:
- µ = sample mean of recorded values
- σ = standard error of the mean
There’s no informative prior on the continuous side — the data drives the parameters directly. As more events arrive, the standard error shrinks and the distribution tightens around the true mean.
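For illustration, a comparable sketch for a continuous metric, assuming the posterior for the mean is Normal(sample mean, standard error) as described above (the revenue values here are simulated stand-ins, not real data):

```python
# Sketch: Normal model for a continuous metric (e.g. revenue per session).
# The values are simulated stand-ins for recorded revenue events.
import numpy as np
from scipy import stats

values = np.random.default_rng(0).gamma(shape=2.0, scale=12.0, size=800)

mu = values.mean()                                # sample mean
sem = values.std(ddof=1) / np.sqrt(len(values))   # standard error of the mean

posterior = stats.norm(loc=mu, scale=sem)
print(posterior.interval(0.95))  # 95% credible interval for the true mean
```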
See Continuous Metrics for when continuous mode kicks in.
Probability to Be Best
This is the headline number on every Results dashboard. It’s computed via Monte Carlo sampling:
1. For each variant, draw a random sample from its posterior distribution.
2. Record which variant’s sample is the largest.
3. Repeat 5,000 times.
4. The probability that variant X is best = (count of times X had the largest draw) / 5,000.
For a two-variant test, “probability to be best” equals “probability to beat control” — they’re the same number. For three or more variants, “probability to be best” sums to 100% across all variants, while “probability to beat control” is computed per non-control variant independently.
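Here’s a minimal sketch of that sampling loop in NumPy. The conversion counts are invented, and the structure is illustrative rather than the production implementation:

```python
# Sketch: Monte Carlo "probability to be best" for a binary metric.
import numpy as np

rng = np.random.default_rng(42)
variants = {
    "control":   {"conversions": 48, "sessions": 1_000},
    "variant_b": {"conversions": 61, "sessions": 1_000},
}

draws = 5_000
samples = np.column_stack([
    rng.beta(v["conversions"] + 1,
             v["sessions"] - v["conversions"] + 1,
             size=draws)
    for v in variants.values()
])

# Count how often each variant produced the largest draw.
wins = np.bincount(samples.argmax(axis=1), minlength=len(variants))
for name, win_count in zip(variants, wins):
    print(f"{name}: {win_count / draws:.1%} probability to be best")
```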
Credible Intervals
A credible interval is the range of plausible values for a parameter, with a stated probability. Split Test Pro reports 95% credible intervals by default:
“We’re 95% confident the true conversion rate for Variant B is between 4.2% and 5.8%.”
This is not the same as a frequentist confidence interval. Confidence intervals make a procedural claim (“if you ran this experiment 100 times, 95 of the intervals would contain the true value”). Credible intervals make a direct claim (“there’s a 95% chance the true value is in this range, given the data”).
The upshot: credible intervals work the way most people intuitively think confidence intervals work.
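To make that concrete, here’s a small sketch (made-up counts, both near a 5% conversion rate) showing how the 95% credible interval narrows as data accumulates:

```python
# Sketch: the 95% credible interval tightening with more data.
from scipy import stats

for sessions, conversions in [(200, 10), (5_000, 250)]:
    posterior = stats.beta(conversions + 1, sessions - conversions + 1)
    low, high = posterior.interval(0.95)
    print(f"{sessions:>5} sessions: {low:.1%} to {high:.1%}")
```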
Reading the interval
- Wide interval (e.g., 3.0% to 7.0%) — not enough data yet. The true rate could be anywhere in there.
- Narrow interval (e.g., 4.7% to 5.1%) — data has converged. You have a high-confidence estimate.
- Two intervals don’t overlap — strong signal of a real difference.
- Two intervals overlap heavily — likely no real difference, even if the point estimates differ.
Modeled Improvement
The Statistical Analysis accordion shows a modeled improvement distribution — the probability distribution of how much Variant B beats (or loses to) Control. It’s computed by sampling pairs from each variant’s posterior and computing the lift:
lift = (variant_sample - control_sample) / control_sample × 100
The result is a distribution of plausible lifts. The median is the central estimate of the lift; the 25th and 75th percentiles show the typical range; the 95% bounds show how extreme the lift could plausibly be.
A modeled improvement plot mostly above zero with a tight spread is a strong winner. A plot straddling zero with a wide spread is uncertain — needs more data.
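A sketch of how a lift distribution like this can be produced from two Beta posteriors (invented counts; not the dashboard’s exact computation):

```python
# Sketch: modeled improvement (lift) distribution for Variant B vs Control.
import numpy as np

rng = np.random.default_rng(7)
draws = 5_000

control = rng.beta(48 + 1, 1_000 - 48 + 1, size=draws)
variant = rng.beta(61 + 1, 1_000 - 61 + 1, size=draws)

lift = (variant - control) / control * 100  # percent lift, per paired draw

p25, p75 = np.percentile(lift, [25, 75])
lo, hi = np.percentile(lift, [2.5, 97.5])
print(f"median lift:    {np.median(lift):.1f}%")
print(f"typical range:  {p25:.1f}% to {p75:.1f}%")
print(f"95% bounds:     {lo:.1f}% to {hi:.1f}%")
print(f"P(lift > 0):    {(lift > 0).mean():.1%}")
```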
The 95% Threshold
Split Test Pro uses 95% as the default threshold for declaring a winner. This is a convention, not a mathematical truth — you can apply different thresholds based on the stakes of the decision:
- 90% — fine for low-stakes tests. Higher false-positive rate, but for cosmetic changes that’s acceptable.
- 95% — the default. Balances false positives against waiting forever.
- 99% — for high-stakes tests (changes to checkout, pricing, or any revenue-critical flow). Lower false-positive rate, longer experiments.
The “Declare Winner” CTA in the UI uses 95% as its trigger. You can choose to wait longer or stop earlier based on context — the threshold is your decision, not the platform’s.
Why You Can Peek (Mostly)
A signature claim of Bayesian methods is that peeking doesn’t inflate false positives the way it does with frequentist methods. This is approximately true, but with caveats:
- The math doesn’t penalize you for looking. Probability-to-be-best is a posterior: at any point it summarizes all the data seen so far, so checking it mid-test doesn’t distort it the way repeated significance checks inflate a frequentist false-positive rate.
- But you, the human, can fool yourself. Watching the number bounce around in real time invites confirmation bias — you’ll convince yourself it’s “going to” cross 95% any minute now and stop early.
- And the prior matters. Beta(1, 1) is uninformative, but that means very small samples can produce surprisingly extreme posteriors. Always sanity-check sample sizes.
The pragmatic rule: don’t stop until 95% AND at least 300–500 sessions per variant AND at least 7 days elapsed. That belt-and-braces combination keeps you honest.
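That checklist is easy to encode as a pre-flight check before you act on a result. A sketch, using the guideline numbers above as defaults (they’re guidance, not platform settings):

```python
# Sketch: the "don't stop yet" checklist from this section as a function.
def ready_to_declare(prob_to_be_best: float,
                     sessions_per_variant: int,
                     days_elapsed: int,
                     threshold: float = 0.95,
                     min_sessions: int = 300,
                     min_days: int = 7) -> bool:
    """True only when probability, sample size, and runtime all clear the bar."""
    return (prob_to_be_best >= threshold
            and sessions_per_variant >= min_sessions
            and days_elapsed >= min_days)

print(ready_to_declare(0.97, sessions_per_variant=850, days_elapsed=9))  # True
print(ready_to_declare(0.97, sessions_per_variant=120, days_elapsed=3))  # False: too little data
```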
What the Engine Doesn’t Do
Worth being explicit about the limitations:
- No multiple-testing correction. Running 10 secondary metrics on the same experiment increases your odds that at least one looks significant by chance. Treat secondaries as directional, not decisional.
- No outlier trimming. A few huge revenue events can skew a continuous metric’s mean for a long time. The math accounts for this through increased variance, but in practice you may want to wait longer or manually inspect outliers.
- No auto-stop. The engine computes; you decide. Probability hitting 95% surfaces the “Declare Winner” button — clicking it is your call.
- No adaptive traffic allocation. Some platforms shift traffic toward the leading variant during the test. Split Test Pro doesn’t — it uses fixed splits per the configuration.
A Note on Sample Sizes
A common question: “How many sessions do I need?” There’s no single answer — it depends on:
- The baseline conversion rate of your primary metric (lower base = need more samples).
- The lift size you care about detecting (smaller lifts = need more samples).
- The threshold (99% needs more than 95%).
A rough starting point: at least 300–500 sessions per variant before the credible intervals tighten enough to matter, and 1,000+ sessions per variant for high-stakes decisions. Below 300, anything the dashboard says is provisional.
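If you want a rough feel for how long a test like yours might take, you can simulate one with assumed true rates and watch when probability-to-be-best first crosses 95%. A sketch (a single random run, not a formal power calculation; the 5.0% vs 5.5% rates are assumptions):

```python
# Sketch: simulate one experiment to see roughly when 95% is crossed.
import numpy as np

rng = np.random.default_rng(1)
true_rates = {"control": 0.050, "variant_b": 0.055}  # assumed 10% relative lift
draws = 5_000
batch = 500  # check the posterior after every 500 sessions per variant

sessions = 0
conv = {name: 0 for name in true_rates}
while sessions < 50_000:
    sessions += batch
    for name, rate in true_rates.items():
        conv[name] += rng.binomial(batch, rate)
    samples = {
        name: rng.beta(c + 1, sessions - c + 1, size=draws)
        for name, c in conv.items()
    }
    if (samples["variant_b"] > samples["control"]).mean() >= 0.95:
        print(f"crossed 95% at ~{sessions} sessions per variant")
        break
else:
    print("did not cross 95% within 50,000 sessions per variant")
```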
Next Steps
- Use the dashboard built on this engine: Results Dashboard.
- Decide when 95% is enough to act on: Declaring a Winner.
- Avoid the most common ways to fool yourself with Bayesian results: Common Mistakes.
Ready to start testing?
Install Split Test Pro and run your first experiment today.