Wingify Campaign Reports: Statistical Models and Inference Levels – Wingify


This article covers the following:

Overview
The Five Levels of Statistical Inference
How the Statistical Model Impacts Your Report
FAQs

Overview

Wingify campaign reports surface statistical evidence to support decisions about which variation to roll out or disable. The statistical outputs your report displays depend on the statistical model configured for the campaign: Bayesian or Frequentist. Both models work through the same logical hierarchy from raw data to a final decision, but they express uncertainty and significance differently.

Suppose a simple A/B test shows a conversion rate of 10% for the control group and 15% for the variation. At first glance, the variation obviously wins, so why are statistics needed in experimentation?

The answer is less obvious when the gap is small, for example 10% against 10.2%. In that case it is much harder to tell whether the difference reflects the variation's true impact or is simply due to chance. In practice, most experiments produce only a modest uplift, so the key question is whether the observed uplift is reliable enough to act on.

This is where statistical significance comes into play. Statistical significance helps determine whether an observed improvement in a metric reflects a real difference between variations or is just the result of random chance. Understanding this principle is crucial to making informed decisions in A/B testing.
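To make the contrast concrete, the sketch below runs a textbook two-proportion z-test on both scenarios. It is only an illustration, not Wingify's actual computation: the function name two_sided_p_value and the sample size of 1,000 visitors per variation are assumptions chosen for the example.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value of a two-proportion z-test with a pooled standard error."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Assumed sample size: 1,000 visitors per variation (hypothetical).
large_gap = two_sided_p_value(100, 1000, 150, 1000)  # 10% vs 15%
small_gap = two_sided_p_value(100, 1000, 102, 1000)  # 10% vs 10.2%

print(f"10% vs 15%:   p-value = {large_gap:.4f}")
print(f"10% vs 10.2%: p-value = {small_gap:.4f}")
```

With these assumed counts, the 5-point gap produces a very small p-value, while the 0.2-point gap is statistically indistinguishable from noise.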

The Five Levels of Statistical Inference

The statistical engine underlying Wingify reports breaks the path from raw data to a decision into five levels. Understanding these levels helps you read any report, regardless of which statistical model is active.

Level 0: Empirical Data

The base level is the raw data from your campaign: how many visitors were assigned to each variation, how many conversions occurred, and the total metric value for continuous metrics like revenue. This data contains no uncertainty; it is a direct count of observed events.

For binary metrics (for example, conversion rate or add-to-cart rate), the data is labeled Unique Conversions/Visitors. For continuous metrics (for example, revenue per visitor), it is labeled Total value (Unique Conversions/Visitors).

Level 1: Expected Average

Wingify uses the empirical data to project the likely range of true averages for each variation. These projections are expressed as distributions, not fixed numbers, because any finite sample carries measurement uncertainty. Wingify models these summarized metrics as normal distributions.

In Bayesian reports, this column is labeled Expected Conversion Rate (binary metrics) or Expected value per visitor (continuous metrics). In Frequentist reports, the equivalent is Conversion Rate (v) with a confidence interval shown below the point estimate.
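One way to picture this level is the textbook normal approximation to a conversion rate. The sketch below is an illustration under that assumption, not Wingify's actual computation; the names expected_rate and interval, the 95% interval level, and the sample counts are all hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def expected_rate(conversions, visitors):
    """Normal approximation of the true conversion rate (assumed model)."""
    rate = conversions / visitors
    std_err = sqrt(rate * (1 - rate) / visitors)
    return NormalDist(mu=rate, sigma=std_err)


def interval(dist, level=0.95):
    """Central interval that contains `level` of the distribution's mass."""
    tail = (1 - level) / 2
    return dist.inv_cdf(tail), dist.inv_cdf(1 - tail)


# Hypothetical Level 0 data: 150 conversions out of 1,000 visitors.
dist = expected_rate(150, 1000)
low, high = interval(dist)
print(f"point estimate: {dist.mean:.3f}")
print(f"95% interval:   {low:.3f} to {high:.3f}")
```

The same counts with ten times the visitors would give a narrower interval, which is why the range in a report tightens as a campaign collects data.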

Level 2: Expected Improvement

This level calculates the difference between the variation's and the baseline's projected average distributions. The result is an Improvement Distribution that captures both the direction and magnitude of the likely effect. It helps us understand how much better or worse the variation is performing compared to the baseline.

In Bayesian campaigns, the improvement is shown as a box plot of the improvement posterior in the detail view of the report table. In Frequentist campaigns, it appears as Improvement % (v) with confidence intervals.

Note: Box plot graphs are available for Bayesian campaigns using the Classic Statistical Engine only. They are not displayed for campaigns using the Enhanced Statistical Engine. For Frequentist campaigns, improvement is shown as a percentage with a range.
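As a rough sketch of this level, the difference of two independent normal distributions is itself normal, so an improvement distribution can be derived directly from the two Level 1 distributions. The example below assumes the normal approximation and measures improvement as an absolute difference in conversion rate; the helper name rate_distribution and the sample counts are hypothetical, and Wingify's own calculation may differ.

```python
from math import sqrt
from statistics import NormalDist


def rate_distribution(conversions, visitors):
    """Normal approximation of a variation's true conversion rate (assumed model)."""
    rate = conversions / visitors
    return NormalDist(mu=rate, sigma=sqrt(rate * (1 - rate) / visitors))


# Hypothetical data: baseline 100/1000, variation 150/1000.
baseline = rate_distribution(100, 1000)
variation = rate_distribution(150, 1000)

# NormalDist supports subtracting two independent normal variables:
# the means subtract and the variances add.
improvement = variation - baseline

print(f"expected improvement: {improvement.mean:.3f}")
print(f"standard deviation:   {improvement.stdev:.4f}")
```

The mean gives the direction and size of the likely effect, and the standard deviation describes how uncertain that estimate still is.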

Level 3: Probability of Improvement

At this level, Wingify infers from the Improvement Distribution the probability that the true improvement exceeds the Region of Practical Equivalence (ROPE). ROPE defines the range of improvements considered too small to be practically meaningful, so its boundary marks the minimum improvement that matters in practice.

In Bayesian reports, this column is labeled Decision Probabilities or Probability to be Better. In Frequentist reports, the equivalent is Significance Level, which represents the complement of the p-value interpreted as evidence against the null hypothesis.
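Under the same normal approximation, this probability is simply the share of the improvement distribution that lies above the ROPE boundary. The sketch below is an illustration only: the helper names, the sample counts, and the ROPE boundary of 0.01 (one percentage point, absolute) are hypothetical values, not Wingify defaults.

```python
from math import sqrt
from statistics import NormalDist


def rate_distribution(conversions, visitors):
    """Normal approximation of a variation's true conversion rate (assumed model)."""
    rate = conversions / visitors
    return NormalDist(mu=rate, sigma=sqrt(rate * (1 - rate) / visitors))


def probability_of_improvement(improvement, rope=0.0):
    """Probability that the true improvement exceeds the ROPE boundary."""
    return 1 - improvement.cdf(rope)


# Hypothetical data: baseline 100/1000, variation 150/1000.
improvement = rate_distribution(150, 1000) - rate_distribution(100, 1000)

without_rope = probability_of_improvement(improvement)
with_rope = probability_of_improvement(improvement, rope=0.01)

print(f"P(improvement > 0):    {without_rope:.4f}")
print(f"P(improvement > ROPE): {with_rope:.4f}")
```

A wider ROPE can only lower the probability, which is how it guards against declaring a winner on a trivially small uplift.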

Level 4: Decisions

Wingify applies a threshold to the Level 3 metric to determine a winner. The threshold is controlled by the False Positive Rate (FPR) in both Bayesian and Frequentist campaigns. The difference lies only in what the metric is called:

Bayesian: the metric is called Probability to be Better

Frequentist: the metric is called Significance Level

When a variation crosses the winner threshold (typically 95%), Wingify declares it better than the baseline and displays a recommendation banner. When a variation falls below the lower threshold (typically 5%), Wingify recommends disabling it.

The winner threshold is shown as a dotted line across the probability bars in the Bayesian Statistics view. In Frequentist campaigns, the threshold is derived from the configured FPR, typically set at 10%: a variation wins when its Significance Level exceeds 1 - FPR.
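The thresholding step can be sketched as a small function. This is a simplified illustration: the function name decide, the returned labels, and the treatment of a value exactly at a threshold are assumptions, and the defaults are just the typical 95% and 5% values mentioned above.

```python
def decide(probability, winner_threshold=0.95, disable_threshold=0.05):
    """Map a Level 3 metric (0 to 1) to a recommendation (hypothetical labels)."""
    if probability >= winner_threshold:
        return "declare winner"
    if probability <= disable_threshold:
        return "recommend disabling"
    return "keep collecting data"


# Hypothetical Level 3 values for three variations.
for name, prob in [("Variation 1", 0.97), ("Variation 2", 0.60), ("Variation 3", 0.03)]:
    print(f"{name}: {prob:.0%} -> {decide(prob)}")
```

Anything between the two thresholds is left undecided, so the campaign continues until the evidence is strong enough in one direction.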

How the Statistical Model Impacts Your Report

The report interface adapts to the configured statistical model. The table below summarizes the key differences in what you see.

Report Element	Bayesian Model	Frequentist Model
Significance column	Decision Probabilities (probability bar showing % chance variation is better)	Significance Level (% significance, compared against the FPR threshold)
Improvement column	Expected Improvement (distribution with ROPE reference)	Improvement % (v) with confidence interval range
Statistics view key metric	Probability to be Better	Significance Level
Winner declaration	Probability exceeds Winner Threshold (for example, 95%)	Significance Level exceeds 1 - FPR threshold
Recommendation banner language	"[Variation] is better than baseline ... can be expected with 95% probability of being better"	"Stick to [Variation] Baseline as no variation shows the potential to outperform the baseline"

Note: To customize which columns appear in each view, or to create your own custom views, click the pencil icon ✏ to the right of the view tabs. For more information and detailed steps, see Navigate and Customize Your Campaign Report.

The five-level statistical hierarchy, from raw data through expected averages, improvement distributions, probability estimates, and final decisions, applies to both statistical models. What changes between Bayesian and Frequentist campaigns is the language and visual form of the outputs at Levels 3 and 4. The Bayesian model gives you a continuously updated probability estimate; the Frequentist model gives you a significance level relative to a null hypothesis. Both are designed to help you make reliable, data-backed decisions while minimizing the risk of acting on random fluctuations.

To ensure your report reflects reliable data, see Achieve Accurate Campaign Results with SmartStats Configuration.

FAQs

What is the difference between "Probability to be Better" and "Significance Level"?
Probability to be Better is a Bayesian metric expressing the direct probability that the variation outperforms the baseline. Significance Level is a Frequentist metric expressing the statistical evidence against the null hypothesis that there is no difference. Both serve as the primary decision indicator in their respective frameworks.
Why do some variations show "Collecting Data" in the probability column?
Wingify requires at least 500 visitors and at least one conversion on the baseline before it can compute reliable statistics. Until that threshold is met, the probability columns display Collecting Data to indicate the result is not yet available. You can configure this limit using Observatory Mode. For more information, see Configure Observatory mode in Your Testing Campaigns.
What does "No data yet" mean in a variation row?
No data yet means the variation has not received any visitors or conversions during the selected date range. This typically affects variations added late to a campaign or those that received zero traffic allocation.
Can I switch between Bayesian and Frequentist on a running campaign?
No. You cannot switch the statistical model while a campaign is running.
What is ROPE and how does it affect the probability of improvement?
ROPE (Region of Practical Equivalence) defines the range of improvement values considered too small to be practically meaningful. Only improvement that exceeds the ROPE boundary counts toward the probability of improvement calculation. This prevents declaring winners based on trivially small uplifts.

Need more help?

For more information or further assistance, contact Wingify Support.