
Create and run an A/B test

Some of the following instructions only apply to A/B testing with two different indices: ignore those parts if you’re A/B testing a single index.

You need to provide four pieces of information for an A/B test:

  • The test name. When creating your test, always use descriptive names. For example, if you’re comparing the “most-viewed” and “most-sold” sort orders, a good test name would be “Which is better: most-viewed or most-sold?”.
  • Your A and B indices. A is the control index, your main index. B is your comparison index. The A and B indices are referred to as scenarios. Each index should return different results. You can set a description for each index: make it readable so that you can understand, at a glance, the nature of the scenario.
  • The traffic split between A and B. A 50/50 split is a common starting point for A/B testing, but adjust the split if you’re uncertain about any changes you’ve made. For example, you might direct a smaller percentage of traffic, such as 30%, to B. This would result in a 70/30 split.
  • Test duration. Reliable results hinge on collecting enough data within a suitable period. You can choose to run your test for between 1 and 90 days. Generally, set a duration that allows enough time to reach statistical confidence. Low-usage sites need longer test durations than high-usage sites.

Don’t use a replica index for variant A. Instead, use your primary (live) index as variant A and the replica for variant B. If you set a replica as variant A, traffic won’t be routed through your A/B test, rendering the test invalid.
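
For illustration, here's a minimal sketch of how these four pieces of information map onto an A/B test definition. The field names (`name`, `variants`, `trafficPercentage`, `description`, `endAt`) follow the A/B Testing REST API, and the index names are hypothetical; check the API reference for your client version before relying on them.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical example: "products" is the primary (live) index used as variant A,
# "products_most_sold" is a replica with a different sort order used as variant B.
ab_test = {
    "name": "Which is better: most-viewed or most-sold?",
    "variants": [
        {
            "index": "products",            # A: control, your primary index
            "trafficPercentage": 70,        # 70/30 split: send less traffic to B
            "description": "Control: most-viewed sort order",
        },
        {
            "index": "products_most_sold",  # B: comparison index (a replica)
            "trafficPercentage": 30,
            "description": "Comparison: most-sold sort order",
        },
    ],
    # Test duration: end 30 days from now (must be between 1 and 90 days).
    "endAt": (datetime.now(timezone.utc) + timedelta(days=30))
             .strftime("%Y-%m-%dT%H:%M:%SZ"),
}
```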

Estimate sample size and test duration

Accurate sample size determination is essential for effective A/B testing and reliable results. The sample size estimation tool simplifies this process by estimating the number of searches needed to confidently detect meaningful differences between experiment variations.

[Image: Estimate sample sizes during test creation]

Sample size estimator

The sample size estimator uses your historical data to establish a baseline for performance metrics, such as conversion or click-through rates. Algolia uses these metrics to calculate the optimal sample size needed to determine statistically significant changes. With fixed statistical parameters—80% power and a 5% significance level—the tool ensures robust and reliable results.

Key features:

  • Historical baseline rate. The tool uses your existing data to establish the current value of the performance metric, such as the conversion or click-through rate, which serves as the baseline against which changes are measured.
  • Fixed statistical parameters. A statistical power of 80% and a significance level of 5% provide a balance between detecting true effects and minimizing false positives.
  • Customizable metrics and effect size. Select the metric you want to measure, such as conversions or clicks, and define the minimum detectable effect size that aligns with your objectives.

Choose an appropriate effect size

The effect size is the smallest relative change in a metric that you consider significant enough to act on. Choosing an appropriate effect size is essential for accurate and efficient A/B tests.

For example, if you would adopt a new feature only if it increases your conversion rate by 5%, set your effect size to 5%. With a baseline conversion rate of 10% for your main index (A), the new feature would need to raise the rate to 10.5% (a 5% relative increase) for the change to be worth adopting. The sample size estimation tool then calculates the number of searches needed to detect this change with 80% power and a 5% significance level. A rough sketch of this calculation follows the list below.

When choosing an effect size, consider:

  • Impact. Consider the smallest change that would make a meaningful improvement in your goals. For example, a 2% relative increase in conversion rate might be significant for you, while another organization might aim for a 5% relative change.
  • Historical data. Review past experiments to understand typical variations and set an effect size that’s realistic and achievable based on historical performance. For example, if other changes you have tested result in a 3% relative increase in conversion rate, you might set your effect size to 3%.
  • Balance between sensitivity and practicality. Smaller effect sizes, such as 1% to 2%, require larger sample sizes but let you detect subtle changes. Larger effect sizes (for example, 5% to 10%) require smaller samples and are easier to detect but may overlook smaller yet important changes.
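
Algolia's sample size estimator performs this calculation for you, but as a rough illustration of the statistics involved, the following sketch uses a standard two-proportion sample size formula (not necessarily Algolia's exact method) to estimate the searches needed per variant for the example above: a 10% baseline conversion rate, a 5% relative effect size, 80% power, and a 5% significance level. It also shows how quickly the requirement grows as the effect size shrinks.

```python
import math
from statistics import NormalDist

def searches_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Approximate two-proportion sample size per variant to detect a
    relative change of `relative_mde` from a `baseline` rate."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)             # e.g., 10% -> 10.5%
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided, 5% significance
    z_beta = NormalDist().inv_cdf(power)           # 80% statistical power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# The worked example: 10% baseline conversion rate, 5% relative effect size.
print(searches_per_variant(0.10, 0.05))

# Smaller effect sizes need far larger samples.
for mde in (0.01, 0.02, 0.05, 0.10):
    n = searches_per_variant(0.10, mde)
    print(f"{mde:.0%} relative effect -> ~{n:,} searches per variant")
```

In this sketch, halving the effect size roughly quadruples the required sample, which is why low-usage sites need longer test durations than high-usage sites.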

Start the A/B test

Create new tests from the Algolia dashboard:

[Image: Create an A/B test from the Algolia dashboard]

Press Create to start your A/B test. You can stop the test before its planned duration or wait until it finishes. However, you can’t stop and then restart a test because doing so would undermine the accuracy of the data: complete and uninterrupted testing is essential for an accurate A/B test.

View the results

View running or completed tests from the Algolia dashboard. The dashboard shows the scores for each variant, how many users and searches are involved in the test, and how statistically significant the observed results are (confidence). Each test has a row for each test variant (A and B). For example:

[Image: A/B testing in the Algolia dashboard]

Click the menu button (three vertical dots) in the test’s title block to stop a test or delete the results. This title block also displays test status: In progress - Early for a new test, Stopped, Completed, Failed, or In progress (along with how many days are left).

The rows for the two variants provide details of:

  • Tracked searches and users: indicates the sufficiency of the data and the fairness of the allocation.
  • Click-through (CTR), conversion (CVR), purchase, and add-to-cart (ATCR) rates: shows whether index A or B provides better results. Purchase and add-to-cart rates are calculated globally for the variant rather than per currency and only display if you track the relevant revenue events.
  • Revenue: the income generated by each variant in the selected currency. It’s only displayed if you track relevant revenue events.

The final row of the test displays a confidence status. This helps you determine if the test results are statistically significant. Hovering over a column’s confidence shows the test’s p-value.

If possible, wait until the A/B test has finished before interpreting the results.

A/B tests show results in the dashboard within the first hour after you create them, but metric comparisons won’t show until at least one week, or 20% of the test duration, has elapsed. This mitigates the risk of drawing inaccurate conclusions from insufficient data. Test results are updated daily.
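
If you'd rather monitor a test programmatically than in the dashboard, you can fetch its current state from the A/B Testing endpoints of the Analytics API. The following is a minimal sketch using Python's `requests` library: the application ID, API key, test ID, and host region are placeholders or assumptions, so verify the endpoint and response schema against the API reference.

```python
import requests

APP_ID = "YOUR_APP_ID"      # placeholder
API_KEY = "YOUR_API_KEY"    # placeholder: needs the analytics ACL
AB_TEST_ID = 42             # placeholder: the ID of the test to inspect

# The A/B Testing endpoints are served by the Analytics API host; a regional
# host (for example, analytics.us.algolia.com) may apply to your application.
response = requests.get(
    f"https://analytics.algolia.com/2/abtests/{AB_TEST_ID}",
    headers={
        "X-Algolia-Application-Id": APP_ID,
        "X-Algolia-API-Key": API_KEY,
    },
    timeout=10,
)
response.raise_for_status()

ab_test = response.json()
# The response includes the test's status, name, and per-variant metrics;
# see the API reference for the exact schema.
print(ab_test.get("status"), ab_test.get("name"))
for variant in ab_test.get("variants", []):
    print(variant)
```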

Test statuses

The potential test statuses are:

  • In progress - Early: the test has started, but there isn’t yet enough data to draw reliable conclusions. Metrics begin to show after at least one week, or 20% of the test duration, has elapsed. Hover over the badges to see data, but avoid drawing conclusions during this stage.
  • In progress: the test has been running for a while, and metrics are being collected and compared.
  • Failed: the test couldn’t be created. This is usually due to an issue with your index or with provisioning the test to the Search API. Try creating the test again, or contact the Algolia support team.
  • Stopped: the test was stopped manually. Your app is back to normal: index A performs as usual, receiving 100% of search requests. When you stop a test, all associated metadata and metrics are stored, and the test remains visible in the Algolia dashboard. However, the results may be inconclusive.
  • Completed: the test has finished. Your app is back to normal: index A performs as usual, receiving 100% of search requests.

If you delete a test, all its associated metadata and metrics are deleted, and the test is removed from the Algolia dashboard.

How to interpret results

What you consider good or bad is entirely dependent on your needs. Compare the effort and cost of an improvement with its benefits. A 4% improvement in click-through or conversion rate might not be convincing or profitable enough to warrant a change to the structure of your records.

Given the typically low effort required to adjust settings or modify data, it’s generally advisable to implement any potential improvement.

For more information, see How A/B test scores are calculated.

Confidence

The confidence for a test is based on the test status and the p-value. The lower the p-value, the higher the likelihood that the observed difference between the variants isn’t due to chance.

The different confidence levels are:

  • Too early to interpret: the test has started, but there isn’t yet enough data to draw reliable conclusions about the performance of the variants. Metrics begin to show after at least one week, or 20% of the test duration, has elapsed. Hover over the badges to see data, but avoid drawing conclusions during this stage.
  • No data: the test has been running for a while, but there is no data to compare. This typically occurs when no events are tracked.
  • Unconfident: the test has been running for a while, but it’s impossible to tell whether the observed change is representative of the true impact. This could change as more data is collected. Be careful when interpreting these results.
  • Trending confident: the test has been running for a while, and it currently looks like the observed change reflects the true impact. This could change as more data is collected. Be careful when you interpret these results.
  • Inconclusive: the test has finished, but the confidence is too low to determine whether the observed change is due to chance. This might be because of insufficient data or high similarity between the variants.
  • Confident: the test is complete, and the observed change probably reflects the true impact.

Confident or Trending confident doesn’t mean that the change is good. It just means that the observed change is unlikely to be due to chance. For example:

  • A Confident result with a large observed decrease in conversion rate suggests that the change will harm conversions. If your goal is to increase conversion rates, avoid this change.
  • A Confident result with a large observed increase in conversion rate suggests that the change will improve conversions. If your goal is to increase conversion rates, do implement the change.
  • A Trending confident result with a large observed increase in conversion rate suggests that, based on current data, the change will improve conversions. You could implement the change, but the confidence might change later.
  • An Inconclusive test means that the impact is uncertain. Ignore the results or interpret them with discretion. Try re-launching the test for a longer duration to collect more data. This increases, but doesn’t guarantee, the likelihood that the results will reach a confident state.
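
As a rough illustration of how these labels relate to the p-value (this is a sketch of the general statistical convention, not Algolia's exact decision logic), a result is usually treated as significant when its p-value falls below the 5% significance level:

```python
ALPHA = 0.05  # the 5% significance level used throughout this guide

# Hypothetical p-values per metric, for illustration only.
p_values = {"clickThroughRate": 0.03, "conversionRate": 0.22}

for metric, p in p_values.items():
    if p < ALPHA:
        print(f"{metric}: p={p:.2f} -> change unlikely to be due to chance")
    else:
        print(f"{metric}: p={p:.2f} -> could still be due to chance")
```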

Minimum number of searches

You can stop your test at any point and analyze the results with as little or as much data as you want. However, drawing conclusions from insufficient data and low-confidence results might lead to changes that lower your overall search performance and produce unexpected outcomes.

The confidence indicator helps you gauge the reliability of test results and protects you from jumping to conclusions too early based on skewed or insufficient data.

Recommendations

  • Test before going live. Be wary of breaking anything. For example, ensure both your test indices work with your UI. Small changes can break your interface or strongly affect the user experience. For example:

    • Making a change that affects facets can cause the facet’s UI logic to fail.
    • Changing the ranking on index B can degrade its search results so much that users in that variant have a poor experience. This isn’t the purpose of A/B testing: index B should, in theory, be better than, or at least as good as, index A.
  • Don’t change your A or B indices during a test. Adjusting settings or data while the test is running pollutes your results, making them unreliable, and can break your search experience. If you must update your data, do so synchronously for both indices and, preferably, restart your test.

  • Don’t use the same index for several A/B tests. You can’t use the same index in more than one test at the same time. You’ll get an error.

  • Make only small changes. The more features you test simultaneously, the harder it is to determine causality.

API clients

In most cases, use the Algolia dashboard to manage A/B tests. However, sometimes it’s beneficial to manage tests with an Algolia API client. For example:

  • You want to run the same test across many indices, for example, if you have several websites using the same kind of indices but with different data. The API simplifies the creation of multiple tests.
  • You want your backend to trigger tests based on data changes, or create feedback loops based on your analytics, as used in machine learning. This is an advanced method of managing product-line changes or industry trends and should be done carefully.

To use the API, your API key ACL must include:

  • A/B test creation and deletion - editSettings on all indices
  • A/B test analytics - analytics on all indices
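
For example, here's a minimal sketch that creates an A/B test through the A/B Testing REST endpoint of the Analytics API, using Python's `requests` library. The official API clients wrap this endpoint in an "add A/B test" method whose exact name varies by client and version, so check your client's reference. The application ID, API key, index names, and host region are placeholders or assumptions.

```python
import requests
from datetime import datetime, timedelta, timezone

APP_ID = "YOUR_APP_ID"    # placeholder
API_KEY = "YOUR_API_KEY"  # placeholder: needs the editSettings and analytics ACLs

ab_test = {
    "name": "Which is better: most-viewed or most-sold?",
    "variants": [
        {"index": "products", "trafficPercentage": 70,
         "description": "Control: most-viewed sort order"},
        {"index": "products_most_sold", "trafficPercentage": 30,
         "description": "Comparison: most-sold sort order"},
    ],
    # End the test 30 days from now (between 1 and 90 days).
    "endAt": (datetime.now(timezone.utc) + timedelta(days=30))
             .strftime("%Y-%m-%dT%H:%M:%SZ"),
}

# The A/B Testing endpoints are served by the Analytics API host; a regional
# host may apply to your application.
response = requests.post(
    "https://analytics.algolia.com/2/abtests",
    headers={"X-Algolia-Application-Id": APP_ID, "X-Algolia-API-Key": API_KEY},
    json=ab_test,
    timeout=10,
)
response.raise_for_status()
print(response.json())  # the response includes the ID of the new test
```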