How A/B test scores are calculated
The main challenge of A/B Testing is ensuring that tests are reliable and that the results are significant to your business.
A/B Testing is derived from statistical analysis and relies on a sound methodology that is trustworthy and verifiable. We use proven mathematics to ensure that the underlying statistical calculations reflect (or reliably predict) real differences between index A and B.
The method and math at a glance
Randomness: The assignment of any one user to scenario A or B is purely random.
Statistical significance (Confidence): The statistical significance threshold is a p-value below 0.05. The p-value is, roughly, the probability of seeing a difference at least this large if there were no real difference between A and B, so a lower p-value means stronger evidence of a real difference. A test is conclusive (statistically significant) when the p-value is less than 0.05.
Mathematical formula: The scores are calculated with a two-tailed test. This means that we make no assumption about which variant is better, and in particular we don't assume that B is better than A. The calculation considers both directions: A may be better than B, or B may be better than A.
Relevance Improvement: That said, the goal of A/B Testing is usually improvement: every change to B is intended to make it better than the current main index (A). In other words, A/B Testing helps us find a better index configuration.
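The significance threshold and the two-tailed approach above can be illustrated with a short Python sketch. This page doesn't state which statistical test is used, so the example assumes a pooled two-proportion z-test, one common choice for comparing conversion or click-through rates. The function name, its parameters, and the counts are hypothetical.

```python
import math

SIGNIFICANCE_THRESHOLD = 0.05  # threshold stated on this page


def two_tailed_p_value(conv_a, n_a, conv_b, n_b):
    """Two-tailed p-value of a pooled two-proportion z-test (an assumed test)."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (rate_b - rate_a) / std_err
    # Two-tailed: the sign of z is ignored, so "A is better" and
    # "B is better" are treated the same way.
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical counts: conversions and users per variant.
p = two_tailed_p_value(conv_a=1000, n_a=10000, conv_b=1100, n_b=10000)
print(f"p-value = {p:.4f}, conclusive = {p < SIGNIFICANCE_THRESHOLD}")
```

Swapping the A and B arguments yields the same p-value, which is what "two-tailed" means in practice: the test measures how large the gap is, not which side it favors.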
Statistical significance or chance
When you run your tests, you may get results that show a 4% increase in one of the measured metrics. Statistical significance tells you whether that 4% increase is real or due to chance. The underlying question is whether your sample group truly represents the larger population: does the 4% hold only for that sample group, or does it reasonably predict the behavior of the larger population?
If the sample doesn’t represent the larger population, then your results may be due to chance. Statistical significance (the confidence indicator) distinguishes chance from a real change. When you reach confidence, the difference between the A and B variants is unlikely to be due to chance, and is something you can reasonably expect (or predict) for the larger population as well.
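To see how easily chance alone can produce a difference of this size, the sketch below simulates tests in which A and B are identical. The 10% true rate, the 1,000 users per variant, and the function name are hypothetical values chosen for illustration. The simulation counts how often the two groups still differ by 4% or more, in either direction.

```python
import random


def chance_lift_fraction(true_rate=0.10, users=1000, lift=0.04,
                         trials=300, seed=0):
    """Fraction of no-change trials whose relative difference is >= lift."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Both groups share the same true rate, so any gap is pure chance.
        conv_a = sum(rng.random() < true_rate for _ in range(users))
        conv_b = sum(rng.random() < true_rate for _ in range(users))
        if conv_a == 0:
            continue  # relative difference undefined; skip this trial
        relative_diff = abs(conv_b - conv_a) / conv_a
        if relative_diff >= lift:
            hits += 1
    return hits / trials


fraction = chance_lift_fraction()
print(f"{fraction:.0%} of no-change trials showed a swing of 4% or more")
```

With samples this small, most trials show such a swing even though nothing changed, which is why an observed difference on its own says little until the test reaches significance.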
Large, distributed samples
Large data samples are necessary to reach confidence. When flipping a coin 1,000 times, you can expect a ratio of heads to tails close to 50:50. If you flip it just a few times, the ratio can be heavily skewed (it is entirely possible to flip heads three times in a row, but extremely unlikely to do so 1,000 times in a row).
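The coin-flip comparison can be checked with a few lines of Python. This is a minimal sketch: the function names and the seed are arbitrary, and the simulated ratios depend on that seed.

```python
import random


def prob_all_heads(flips):
    """Exact probability that a fair coin lands heads on every flip."""
    return 0.5 ** flips


def heads_ratio(flips, seed=0):
    """Simulate `flips` fair coin flips and return the share of heads."""
    rng = random.Random(seed)
    heads = sum(rng.random() < 0.5 for _ in range(flips))
    return heads / flips


print(prob_all_heads(3))     # 0.125, so about one run in eight
print(prob_all_heads(1000))  # vanishingly small
for flips in (3, 1000, 100000):
    print(flips, heads_ratio(flips))
```

With only three flips, the share of heads can only be 0, 1/3, 2/3, or 1, so it can never land near 50%. With many flips it settles close to 0.5.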
Increasing the sample size stabilizes the results and increases confidence in them. Each new search event clarifies the underlying pattern and generally leads toward a reliable outcome.
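One way to quantify this stabilizing effect is the margin of error around an observed rate. The sketch below assumes a normal approximation and a hypothetical 10% rate. It illustrates the general statistical principle and is not a description of the calculation used by the product.

```python
import math


def margin_of_error(rate, sample_size, z=1.96):
    """Approximate 95% margin of error for an observed rate (normal approx.)."""
    return z * math.sqrt(rate * (1 - rate) / sample_size)


# Hypothetical 10% conversion rate measured on samples of growing size.
for n in (100, 1_000, 10_000, 100_000):
    print(f"n={n:>7}: 10% ± {margin_of_error(0.10, n):.2%}")
```

Because the margin shrinks with the square root of the sample size, a hundredfold increase in traffic narrows it only tenfold. This is why small differences between A and B need large samples to become conclusive.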
Be careful about when you test. Testing during a sales campaign, a major holiday, or some other exceptional event can undermine the reliability of your results.