
How Algolia trains language models

Estimated time to read: 2 minutes

Algolia’s language models now improve continuously through a training process that learns from anonymized patterns across our network of customers. As a result, NeuralSearch can deliver more relevant results out of the box, without requiring you to fine-tune models or maintain your own machine learning pipelines.

What data is used

NeuralSearch’s model training uses only information that helps improve search relevance:

  • Historic search queries – the words and phrases users type into the search experience.

  • Record attributes – searchable fields such as titles, descriptions, and categories.

  • Relevance signals – statistical indicators of whether a retrieved record was useful to the user.

  • Application IDs – used only to manage training consent preferences.

Sensitive or identifiable fields, such as company names, email addresses, user tokens, and IP addresses, are not used in the training process.

How the data is processed

Before training, all data is anonymized and aggregated. Identifiers are removed, and the data is blended across customers so the model learns from broad statistical patterns, not individual catalogs. For example, the model might learn that a “sofa” may also be referred to as a “couch,” but it does not know which user submitted the query or which customer provided the record.

The data aggregation process in the model training pipeline uses a 90-day data retention period. Before each new training run, the data used previously is automatically purged, so models stay up to date with evolving search behavior while meeting strict privacy and compliance requirements. All data processing happens within one of two defined regions, the EU or the US, based on where your Algolia infrastructure was originally provisioned. Your data never leaves that region.
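To make the retention window concrete, here is a minimal sketch of filtering events to a 90-day window before a training run. It is an illustration only, not Algolia's pipeline: the event structure, the `timestamp` field, and the `within_retention` helper are all hypothetical names chosen for this example.

```python
from datetime import datetime, timedelta, timezone

# The 90-day figure comes from the post; everything else here is assumed.
RETENTION = timedelta(days=90)


def within_retention(events, now):
    """Return only the events whose timestamp is inside the retention window.

    `events` is a list of dicts with a hypothetical `timestamp` key holding
    a timezone-aware datetime. Older events are dropped.
    """
    cutoff = now - RETENTION
    return [event for event in events if event["timestamp"] >= cutoff]


if __name__ == "__main__":
    now = datetime(2025, 1, 1, tzinfo=timezone.utc)
    events = [
        {"query": "sofa", "timestamp": now - timedelta(days=10)},
        {"query": "couch", "timestamp": now - timedelta(days=100)},
    ]
    kept = within_retention(events, now)
    print([event["query"] for event in kept])
```

Passing `now` in explicitly, instead of reading the clock inside the function, keeps the cutoff deterministic and easy to test.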

At this stage, model training includes anonymized data from customers on the Elevate pricing plan. Data from other plans isn’t included, and customers using Algolia’s HIPAA-enabled features are automatically excluded.

How encoder models work

NeuralSearch uses encoder models, which convert queries and records into vector embeddings: numerical representations that capture meaning and context. These embeddings allow the system to measure similarity between queries and records. For example, “sofa” and “couch” end up close together in vector space because they are often used interchangeably.
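As a rough illustration of "close together in vector space," the sketch below compares toy vectors using cosine similarity, one common way to measure embedding similarity. The three-dimensional vectors are hand-made for this example and are not real NeuralSearch embeddings. The post also does not say which similarity measure NeuralSearch uses, so cosine similarity is an assumption here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors invented for illustration; real embeddings have
# hundreds of dimensions and come from a trained encoder.
TOY_EMBEDDINGS = {
    "sofa": [0.9, 0.1, 0.0],
    "couch": [0.8, 0.2, 0.1],
    "laptop": [0.0, 0.1, 0.9],
}


def rank_by_similarity(query, candidates):
    """Sort candidate terms from most to least similar to the query term."""
    query_vector = TOY_EMBEDDINGS[query]
    return sorted(
        candidates,
        key=lambda term: cosine_similarity(query_vector, TOY_EMBEDDINGS[term]),
        reverse=True,
    )


if __name__ == "__main__":
    print(rank_by_similarity("sofa", ["laptop", "couch"]))
```

With these toy values, "couch" ranks above "laptop" for the query "sofa" because its vector points in nearly the same direction.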

This is fundamentally different from encoder–decoder (generative) models, which are designed to create new text or content. Generative models are valuable for tasks like answering questions or writing paragraphs, but they are not the most effective way to retrieve relevant results. NeuralSearch focuses on encoders only, as its sole purpose is to understand intent and match queries with the right records.

Testing and transparency

Every NeuralSearch model is rigorously tested before release to validate that the new version outperforms the last. We measure relevance metrics such as nDCG, which evaluates how well the results are ranked, and recall, which measures how many of the relevant results the model retrieves. This helps ensure that the most useful products appear first and that fewer relevant options are missed.
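For readers unfamiliar with these metrics, the following sketch shows one standard way to compute nDCG@k and recall@k. It is not Algolia's evaluation code. The relevance grades and record IDs are made up, and the ideal ordering is derived from the judged list itself, which is a common simplification.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: gains are discounted by log2 of the rank."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg_at_k(relevances, k):
    """DCG of the top-k results divided by the DCG of the ideal ordering.

    `relevances` holds graded relevance judgments in ranked order.
    Simplification: the ideal ordering is this same list sorted descending.
    """
    ideal_dcg = dcg(sorted(relevances, reverse=True)[:k])
    if ideal_dcg == 0:
        return 0.0
    return dcg(relevances[:k]) / ideal_dcg


def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of all relevant records that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & relevant)
    return hits / len(relevant)


if __name__ == "__main__":
    # Made-up judgments: 3 = highly relevant, 0 = not relevant.
    print(ndcg_at_k([3, 2, 1], k=3))  # best item ranked first
    print(ndcg_at_k([1, 2, 3], k=3))  # best item ranked last
    print(recall_at_k(["a", "b", "c"], {"a", "d"}, k=3))
```

A ranking that places the most relevant items first scores an nDCG of 1, and the score drops as relevant items move down the list. Recall ignores order and only counts how many relevant records were retrieved.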

We also create model cards to provide transparency into how each model performs and where it’s best suited. Each model card summarizes details such as the model version, intended use, evaluation metrics, and supported languages. This helps ensure that every model deployed has been independently validated for accuracy and quality before it’s rolled out to customers.

What this means for you

By training on anonymized, aggregated patterns, NeuralSearch continuously improves without requiring extra effort from you. This means your users benefit from increasingly relevant search results that adapt to real-world behavior.

You can opt out at any time through your dashboard settings. If you do, your search experience will continue to function normally, but it will not benefit from the broader model improvements that come from shared learnings across the Algolia network. If you do not see an opt-out option in your dashboard settings, Algolia is not using your data to train AI models for the benefit of all customers.

Customers operating under separate written agreements are excluded from model training by default but can choose to opt in. This gives you full control over how your data contributes to Algolia’s AI systems.
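The inclusion rules described in this post can be summarized as a small decision function. This is a reading of the text, not Algolia's implementation: the `Account` fields, the `"elevate"` plan string, and in particular the order in which the rules are checked are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Account:
    """Hypothetical account record; all field names are invented."""
    plan: str
    hipaa_enabled: bool = False
    separate_agreement: bool = False
    # None means no explicit choice; True is opt in, False is opt out.
    training_preference: Optional[bool] = None


def included_in_training(account):
    """Decide whether an account's anonymized data is used for training.

    Assumed precedence: HIPAA exclusion, then plan, then the explicit
    preference, then the default for the account type.
    """
    if account.hipaa_enabled:
        return False  # HIPAA-enabled customers are always excluded
    if account.plan != "elevate":
        return False  # only the Elevate plan is included at this stage
    if account.training_preference is not None:
        return account.training_preference  # explicit opt in or opt out
    # Default: included, unless a separate written agreement applies.
    return not account.separate_agreement


if __name__ == "__main__":
    print(included_in_training(Account(plan="elevate")))
    print(included_in_training(Account(plan="elevate", training_preference=False)))
    print(included_in_training(Account(plan="elevate", separate_agreement=True)))
```

The dashboard setting and the NeuralSearch documentation remain the authoritative source for how preferences are applied.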

For instructions on managing model training preferences, visit our NeuralSearch documentation.
