Agent Studio Benchmark

Agentic Search Leaderboard

We tested every major LLM on real shopping queries through Agent Studio, Algolia's platform for building search and discovery agents. Three dimensions of quality. Open methodology. The results speak for themselves.

3 metrics
Who leads agentic search?
Anthropic
OpenAI
Google
Open Source
xAI
DeepSeek
1. Model A. Overall leader: Relevance, Hallucinations, Language. 4.1s, $0.024
2. Model B. Strong on Relevance, but higher Hallucination rate. 3.2s, $0.018
3. Model C. Best speed-to-quality ratio in the benchmark. 1.8s, $0.006
Scroll to discover the real ranking

Same question, different answers

Each metric tests a specific quality dimension. Pick one to see how models performed on a real query.

Speed and cost analysis available in the full leaderboard below.

From your catalog to ranked results

Three steps, fully automated. Click any step to see how it works.

Generated

From your catalog

Real products from your index become realistic shopping queries across multiple difficulty levels.

Graded

By LLM judges

A calibrated LLM evaluates each response against clear pass/fail criteria. Agreement rate above 95%.
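One way to read the agreement figure is as the share of cases where the judge's pass/fail verdict matches a reference verdict. The sketch below assumes the reference is a set of human labels, which this page does not state; the function name and the sample verdicts are hypothetical.

```python
def agreement_rate(judge_verdicts, reference_verdicts):
    """Fraction of cases where the judge's pass/fail verdict matches the reference."""
    if len(judge_verdicts) != len(reference_verdicts):
        raise ValueError("verdict lists must be the same length")
    if not judge_verdicts:
        raise ValueError("need at least one case")
    matches = sum(j == r for j, r in zip(judge_verdicts, reference_verdicts))
    return matches / len(judge_verdicts)


# Hypothetical verdicts for 20 cases (True = pass). Assumption: the
# reference verdicts come from human reviewers.
reference = [True] * 12 + [False] * 8
judge = list(reference)
judge[3] = not judge[3]  # the judge disagrees on exactly one case

rate = agreement_rate(judge, reference)
print(f"agreement: {rate:.0%}")  # 19 of 20 cases match
```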

Ranked

With statistical confidence

We resample each score 10,000 times and test pairs head-to-head. Models share a tier when their difference on the same cases could be zero.

Bands show each model's bootstrapped 95% confidence interval (10,000 resamples). Tiers come from a stronger test: we resample the head-to-head difference between two models on the same cases and check whether zero is plausible. This paired comparison is sharper than checking whether two models' individual bands overlap.
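The two resampling steps can be sketched in a few lines of standard-library Python. This is a minimal percentile bootstrap written under assumptions, not the benchmark's actual code: the function names, the per-case 0/1 pass scores, and the sample data are all hypothetical.

```python
import random


def percentile(sorted_vals, q):
    """Value at quantile q (0 to 1) of an already sorted list, by nearest index."""
    idx = min(len(sorted_vals) - 1, max(0, round(q * (len(sorted_vals) - 1))))
    return sorted_vals[idx]


def bootstrap_ci(values, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_resamples))
    return percentile(means, alpha / 2), percentile(means, 1 - alpha / 2)


def paired_diff_ci(scores_a, scores_b, **kwargs):
    """CI for the mean per-case difference a - b on the same test cases."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both models must be scored on the same cases")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    return bootstrap_ci(diffs, **kwargs)


def zero_is_plausible(ci):
    """True when the interval contains zero, i.e. no clear winner."""
    low, high = ci
    return low <= 0.0 <= high


# Hypothetical per-case results (1 = pass, 0 = fail) on the same 200 cases.
model_a = [1] * 174 + [0] * 26
model_b = [0] * 10 + [1] * 174 + [0] * 16

print("Model A band:", bootstrap_ci(model_a))
print("A vs B could be zero:", zero_is_plausible(paired_diff_ci(model_a, model_b)))
```

Resampling the per-case differences keeps each case paired across the two models, so variation that both models share (easy and hard queries) cancels out instead of widening the interval.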

Model A: 87.2% (Tier 1)
Model B: 82.1% (Tier 1)
Model C: 74.4% (Tier 2)
Models A and B share Tier 1. Their head-to-head difference on the same cases could plausibly be zero, so we can't call a winner. Model C's gap versus A is well above zero, so it separates into Tier 2.
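The page does not spell out how pairwise results become tiers. One plausible reading, sketched here as an assumption, is a greedy pass over models sorted best-first: a model stays in the current tier while its difference against that tier's top model could be zero, and otherwise opens a new tier. The function name and the input format are hypothetical.

```python
def assign_tiers(ranked_models, indistinguishable):
    """Greedy tiering over models sorted best-first.

    `indistinguishable` is a set of frozenset pairs whose head-to-head
    difference could plausibly be zero.
    """
    tiers = {}
    tier = 0
    anchor = None  # top model of the current tier
    for model in ranked_models:
        if anchor is None or frozenset((anchor, model)) not in indistinguishable:
            tier += 1
            anchor = model
        tiers[model] = tier
    return tiers


# The three-model example from the text: only A and B cannot be separated.
ranked = ["Model A", "Model B", "Model C"]
ties = {frozenset(("Model A", "Model B"))}
tiers = assign_tiers(ranked, ties)
print(tiers)
```

On the example above this reproduces the stated outcome: Models A and B land in Tier 1 and Model C in Tier 2.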

Every model, every dimension

Each cell shows the model's score relative to the column leader, as a percentage below the best (the leader is labeled "Best" and gets the full 100% bar). Hover for the absolute score and 95% CI. Sort by any column; hover a metric for its definition.
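As a rough illustration of the cell values, the sketch below computes each model's shortfall relative to the column leader. It assumes "% below best" means a relative gap, (best - score) / best, rather than a difference in percentage points, which the page does not specify. The function name is hypothetical and the scores reuse the three-model tier example.

```python
def percent_below_best(scores):
    """Map each model to its relative shortfall (in %) against the top score."""
    best = max(scores.values())
    if best <= 0:
        raise ValueError("the best score must be positive")
    return {model: (best - score) / best * 100 for model, score in scores.items()}


# Absolute scores taken from the three-model example.
scores = {"Model A": 87.2, "Model B": 82.1, "Model C": 74.4}
cells = percent_below_best(scores)

for model, gap in cells.items():
    label = "Best" if gap == 0 else f"{gap:.1f}% below best"
    print(f"{model}: {label}")
```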

Build better agents. Measure what matters. Same benchmarks, your data.

Frequently asked questions

This leaderboard evaluates LLMs in real agent workflows, focusing on practical factors like cost, latency, tool use, and groundedness so you can make informed decisions about which model to use for search and discovery agents.
Use it as a guide to narrow down model choices based on your specific constraints. It is a starting point, not a final answer. You should validate results in your own environment and select the model that best fits your needs.
We are model agnostic. Agent Studio supports a bring-your-own-LLM approach, so you can use any model you prefer. This leaderboard exists to provide data and context, not to prescribe a single choice.
No single model is best for everyone. The right model depends on your use case, budget, latency targets, and quality requirements. This leaderboard highlights tradeoffs so you can decide what matters most.
These results are based on 1,500+ queries evaluated on internal ecommerce product catalogs, and are meant to be directional. Performance will vary depending on your data, setup, and implementation, and you are responsible for evaluating what works best in your environment.
We update the leaderboard as we run new evaluations across models and scenarios. As the ecosystem evolves, the data will continue to reflect the latest findings.
Try Agent Studio