Agentic Search Leaderboard

The agent has to earn every answer

Standard benchmarks score a model answering from memory. An agent has to run this loop, and any step along it can break the answer.

The loop repeats until the agent can answer.

Same loop, same harness, every model.

Full leaderboard

Every model, every dimension

Each cell shows a model's score relative to the column leader, expressed as a percentage below the best. The leader is labeled "Best" and fills the 100% bar. Hover over a cell for the absolute score and 95% CI. Sort by any column; hover over a metric for its definition.
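The "% below best" figure can be illustrated with a short sketch. This is one plausible reading of the cell values, not the leaderboard's confirmed formula. The function name `percent_below_best`, the `higher_is_better` flag, and the model names are hypothetical. The treatment of lower-is-better metrics such as cost or latency (leader divided by model) is an assumption.

```python
from typing import Dict


def percent_below_best(
    scores: Dict[str, float], higher_is_better: bool = True
) -> Dict[str, float]:
    """Return each model's gap to the column leader, in percent.

    The leader gets 0.0 (shown as "Best", the 100% bar).
    Assumes all scores are strictly positive.
    """
    if not scores:
        return {}
    if any(value <= 0 for value in scores.values()):
        raise ValueError("scores must be strictly positive")

    # The column leader is the max for quality-style metrics and
    # the min for cost- or latency-style metrics (assumption).
    best = max(scores.values()) if higher_is_better else min(scores.values())

    gaps = {}
    for model, value in scores.items():
        # Bar length relative to the leader, on a 0-100 scale.
        bar = (value / best if higher_is_better else best / value) * 100
        gaps[model] = 100 - bar
    return gaps


# Hypothetical columns: a quality score (higher is better)
# and a latency in seconds (lower is better).
quality = percent_below_best({"model_a": 80.0, "model_b": 60.0})
latency = percent_below_best(
    {"model_a": 2.0, "model_b": 4.0}, higher_is_better=False
)
```

Under these assumptions, a model at 60 in a column led by 80 sits 25% below best. A model at 4 seconds in a latency column led by 2 seconds sits 50% below best.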

Frequently asked questions

This leaderboard evaluates LLMs in real agent workflows, focusing on practical factors like cost, latency, tool use, and groundedness so you can make informed decisions about which model to use for search and discovery agents.

Use the leaderboard as a guide to narrow down model choices based on your specific constraints. It is a starting point, not a final answer. Validate the results in your own environment and select the model that best fits your needs.

We are model agnostic. Agent Studio supports a bring-your-own-LLM approach, so you can use any model you prefer. This leaderboard exists to provide data and context, not to prescribe a single choice.

No single model is best for everyone. The right model depends on your use case, budget, latency targets, and quality requirements. This leaderboard highlights tradeoffs so you can decide what matters most.

These results are based on 1,500+ queries evaluated on internal ecommerce product catalogs, and are meant to be directional. Performance will vary depending on your data, setup, and implementation, and you are responsible for evaluating what works best in your environment.

We update the leaderboard as we run new evaluations across models and scenarios. As the ecosystem evolves, the data will continue to reflect the latest findings.

There is no best model. Only tradeoffs.

Build your agent with confidence