
Solving the cold start problem with synthetic data


These days, AI systems are at the center of nearly every industry, solving some of the world’s most complex problems. This didn’t happen by accident. Decades of research and development (or centuries, some may argue) brought us a world where large language models can analyze, understand, and reason with near human-level intelligence.

But there’s a catch. These models are only as good as the data they learn from, and their capabilities rest on the sheer volume of information available across the internet. Without that massive training corpus, the outputs we now take for granted simply wouldn’t be possible. To make accurate predictions, AI systems need rich signals. And to get those signals, they first need users engaging with them. That circular dependency is the cold start problem, and it remains one of the biggest challenges teams face when building or adopting AI features.

At Algolia, we faced it head-on while building NeuralSearch, our hybrid semantic + keyword engine. In this post, I’ll walk through the problem we encountered, how we approached it, and the tactics we used to solve it using synthetic data so customers could realize value from day one. If you prefer a deeper dive, you can also watch the presentation I gave at DevCon 2025 on this topic.

Why events matter so much

In the search space, events play a critical role in shaping how AI models learn and improve. Every interaction is a digital breadcrumb that reveals part of the user’s story. Views, clicks, and conversions help the system understand which results resonated, which ones fell short, and how user interests evolve over time.

Without these signals, the model has no ground truth to determine relevance. We may know what a user searched for and what results were shown, but we cannot confidently explain why they chose (or didn’t choose) a specific item. This gap is especially painful early in a customer’s journey, when event volume is low and the full benefits of Algolia’s AI capabilities can’t be unlocked.
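To make this concrete, here is a minimal sketch of how behavioral events can be turned into graded relevance signals. The field names, weights, and aggregation are illustrative assumptions, not Algolia’s actual Insights schema: conversions are treated as the strongest evidence of relevance, views as the weakest.

```python
# Illustrative sketch (field names and weights are hypothetical, not
# Algolia's actual event schema): mapping the three behavioral signals
# to a simple graded relevance score.
from dataclasses import dataclass

@dataclass
class SearchEvent:
    event_type: str   # "view", "click", or "conversion"
    query: str        # what the user searched for
    object_id: str    # which result they interacted with
    position: int     # rank of the result in the list

def relevance_signal(event: SearchEvent) -> float:
    """Conversions are the strongest evidence, views the weakest."""
    weights = {"view": 0.1, "click": 0.5, "conversion": 1.0}
    return weights[event.event_type]

events = [
    SearchEvent("view", "running shoes", "sku-123", 1),
    SearchEvent("click", "running shoes", "sku-456", 2),
    SearchEvent("conversion", "running shoes", "sku-456", 2),
]

# Aggregate per (query, object) pair: keep the strongest signal seen.
scores: dict[tuple[str, str], float] = {}
for e in events:
    key = (e.query, e.object_id)
    scores[key] = max(scores.get(key, 0.0), relevance_signal(e))

print(scores[("running shoes", "sku-456")])  # 1.0 (converted)
```

With no events at all, `scores` stays empty, and the model has nothing to anchor relevance to. That is the gap synthetic data fills.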

How NeuralSearch learns (and why it needs a kickstart)

NeuralSearch blends two complementary search systems: semantic relevance through vector embeddings and keyword relevance through traditional lexical matching. During activation, it evaluates a customer’s index by replaying queries, inspecting records, analyzing events, and computing relevance scores across both signals. From there, it identifies the attributes that carry the most semantic meaning, applies the optimal weighting, and learns how to balance vector and keyword ranking through reinforcement learning.

This workflow depends on three inputs: records, queries, and events. Records describe the product or content catalog, queries capture user intent, and events reveal which results were actually relevant. When any of these pieces are missing, especially behavioral events, NeuralSearch cannot confidently determine how to blend or weight its ranking components, which slows down its ability to deliver strong relevance out of the gate.
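The blending step can be pictured with a toy sketch. The scoring functions below are placeholders, and the fixed `alpha` weight stands in for what NeuralSearch actually learns through reinforcement learning:

```python
# Minimal sketch of hybrid ranking: blend a semantic (vector) score
# with a lexical (keyword) score via a weight. In NeuralSearch the
# weighting is learned; here alpha is a hard-coded stand-in.
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def keyword_score(query: str, text: str) -> float:
    """Toy lexical score: fraction of query terms present in the text."""
    terms = query.lower().split()
    return sum(t in text.lower() for t in terms) / len(terms)

def hybrid_score(vec_sim: float, kw_score: float, alpha: float = 0.6) -> float:
    """alpha balances semantic vs. keyword relevance (learned in practice)."""
    return alpha * vec_sim + (1 - alpha) * kw_score

q_emb, doc_emb = [0.1, 0.9], [0.2, 0.8]
print(hybrid_score(cosine_similarity(q_emb, doc_emb),
                   keyword_score("wireless headphones",
                                 "Wireless over-ear headphones")))
```

Learning a good `alpha` (and per-attribute weights) is exactly what requires relevance labels, which is why missing events stall the whole activation.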

Turning a constraint into innovation

When faced with the cold start problem, we asked a simple question:

“If customers can’t provide us with enough data, can we generate it responsibly?”

The turning point came when research, including a study from the University of Waterloo, showed that large language models could generate relevance judgments with human-level accuracy. Our own experiments confirmed it: LLMs consistently matched expert annotations with roughly 97% accuracy, proving they could serve as a suitable alternative for event-driven relevance scoring during onboarding. This opened a high-confidence path to fill in the gaps left by missing events.
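Validating an LLM judge against human experts can be as simple as an agreement rate over a shared set of judgments. The labels below are made up for illustration; this is not our evaluation harness:

```python
# Hypothetical illustration of validating LLM relevance judgments
# against expert annotations: plain agreement rate over the same
# (query, document) pairs. Labels here are invented toy data.
def agreement(llm_labels: list[int], expert_labels: list[int]) -> float:
    assert len(llm_labels) == len(expert_labels)
    matches = sum(a == b for a, b in zip(llm_labels, expert_labels))
    return matches / len(llm_labels)

# 1 = relevant, 0 = not relevant, judged on identical pairs.
llm    = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]
expert = [1, 0, 1, 1, 0, 1, 0, 1, 0, 1]
print(f"{agreement(llm, expert):.0%}")  # 90% on this toy sample
```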

Around the same time, we tested whether an LLM could generate realistic search queries directly from customer records. The results were encouraging. The model generated natural, long-tail queries that closely resembled real user searches and provided enough variation to support meaningful evaluation. Combined with the relevance judgments, these synthetic queries created the complete reference set needed to activate NeuralSearch without waiting for live traffic.
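The two synthetic-data steps can be sketched as a pair of prompts. The `call_llm` stub, the prompts, and the 0–3 relevance scale below are all assumptions for illustration, not our production prompts:

```python
# Sketch of the two synthetic-data steps: generate queries from a
# catalog record, then judge query-record relevance. `call_llm` is a
# hypothetical stub; swap in a real LLM client in practice.
def call_llm(prompt: str) -> str:
    """Deterministic stub for illustration only."""
    if "search queries" in prompt:
        return "red running shoes\nlightweight trail sneakers"
    return "3"

def generate_queries(record: dict, n: int = 5) -> list[str]:
    prompt = (
        f"Given this catalog record, write {n} realistic search queries "
        f"a user might type to find it, one per line:\n{record}"
    )
    return [q.strip() for q in call_llm(prompt).splitlines() if q.strip()]

def judge_relevance(query: str, record: dict) -> int:
    prompt = (
        "On a scale of 0 (irrelevant) to 3 (perfectly relevant), how "
        f"relevant is this record to the query '{query}'?\n{record}\n"
        "Answer with a single digit."
    )
    return int(call_llm(prompt).strip())

record = {"name": "Trail Runner X", "category": "Footwear"}
print(generate_queries(record))
print(judge_relevance("trail shoes", record))
```

Together, the generated queries and their judgments form the reference set that would otherwise come from months of live traffic.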

From research to product: a dashboard-driven experience

[Figure: examples of synthetic query generation]

Once the approach proved reliable, we integrated a fully automated synthetic data workflow directly into the NeuralSearch dashboard. The process is deterministic: if a customer has no queries, the system generates them from their records. If they have no events, it produces relevance labels for each query-record pair. Everything runs behind the scenes, requiring no additional setup from the customer.
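The deterministic fill-in-what’s-missing logic amounts to a short control flow. The function names below are illustrative stand-ins for the LLM-backed steps, not our internal APIs:

```python
# Sketch of the deterministic activation flow: generate whichever
# inputs are missing. synth_queries / synth_label are toy stand-ins
# for the LLM-backed generation and judgment steps.
def synth_queries(record: dict) -> list[str]:
    return [record["name"].lower()]  # stand-in for LLM query generation

def synth_label(query: str, record: dict) -> int:
    # Stand-in for an LLM relevance judgment on a 0-3 scale.
    return 3 if query in record["name"].lower() else 0

def activate(records, queries=None, events=None):
    if not queries:   # no real queries? generate them from records
        queries = [q for r in records for q in synth_queries(r)]
    if not events:    # no real events? label each query-record pair
        events = {(q, r["objectID"]): synth_label(q, r)
                  for q in queries for r in records}
    return queries, events

records = [{"objectID": "1", "name": "Blue Denim Jacket"}]
queries, labels = activate(records)
print(queries, labels)
```

Real queries and events, when present, pass through untouched; synthesis only backfills the gaps.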

This removes the need for fully instrumented event pipelines before experimenting with NeuralSearch, which has historically been an onboarding requirement. Customers start with a high-quality reference set and see meaningful relevance improvements right away. The same infrastructure now supports several other AI systems across the platform, laying the groundwork for future capabilities.

What we learned along the way

Building this system forced us to rethink our assumptions about how AI features should be activated. What started as a limitation quickly became a chance to approach the problem from a different angle, and those lessons have shaped how we think about future AI development at Algolia.

  • Constraints spark innovation. Missing events were initially a blocker, but that constraint pushed us to get creative with synthetic data. Treating the challenge as an opportunity made it easier to work through the uncertainty and uncover a better path forward.

  • Embrace experiments. This started as a research prototype and eventually evolved into a customer-facing feature. Not every experiment pays off, but the ones that do can meaningfully move the needle.

  • Time to value is the north star. Customers adopt products faster when they see impact on day one, even before they have all the systems wired up. If you accelerate the time to value for them, they generally stick around. 

  • LLMs are your friends. They are not a replacement for real production data, but they can be highly effective at filling early gaps. We are still only scratching the surface of what generative AI can enable.

  • Design for modularity from the start. We originally built these components for NeuralSearch, but the same building blocks can support other AI features across the platform. Reusability multiplies impact.

These principles didn’t just help us ship this feature; they now guide how we approach new AI capabilities across Algolia.

See it in action

This synthetic data workflow is now generally available directly in NeuralSearch for Algolia Elevate customers. Early adopters are already using it to activate AI features without waiting for event data or building complex pipelines upfront. 

If you’d like to try it for yourself, reach out to your Customer Success Manager or Algolia account team to get started.
