We tested every major LLM on real shopping queries through Agent Studio, Algolia's platform for building search and discovery agents. Three dimensions of quality. Open methodology. The results speak for themselves.
Explore by metric
Each metric tests a specific quality dimension. Pick one to see how models performed on a real query.
Speed and cost analysis is available in the full leaderboard below.
Under the hood
Three steps, fully automated.
Real products from your index become realistic shopping queries across multiple difficulty levels.
A calibrated LLM evaluates each response against clear pass/fail criteria. Agreement rate above 95%.
We resample each score 10,000 times and test pairs head-to-head. Models share a tier when their difference on the same cases could be zero.
Bands show each model's bootstrapped 95% confidence interval (10,000 resamples). Tiers come from a stronger test: we resample the head-to-head difference between two models on the same cases and check whether zero is plausible. This is sharper than comparing the two models' individual bands.
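The banding and tiering described above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual code: it assumes per-case scores of 1 (pass) or 0 (fail), a percentile bootstrap, and that "zero is plausible" means the 95% interval of the paired difference contains zero. The function names `bootstrap_ci`, `paired_diff_ci`, and `same_tier` are hypothetical.

```python
import random


def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-case scores."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample the cases with replacement and record the mean each time.
    means = sorted(
        sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def paired_diff_ci(scores_a, scores_b, n_resamples=10_000, alpha=0.05, seed=0):
    """Bootstrap interval for mean(a - b), using the same cases for both models."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both models must be scored on the same cases")
    # Differencing per case first keeps the pairing intact under resampling.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    return bootstrap_ci(diffs, n_resamples, alpha, seed)


def same_tier(scores_a, scores_b, **kwargs):
    """Assumed tier rule: two models share a tier if the interval contains zero."""
    lo, hi = paired_diff_ci(scores_a, scores_b, **kwargs)
    return lo <= 0.0 <= hi
```

Resampling the per-case differences, rather than each model separately, removes the variance that comes from some cases being harder for every model. That is why two models whose individual bands overlap can still land in different tiers.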
Full leaderboard
Each cell shows the model relative to the column leader: % below best (the leader is "Best", the 100% bar). Each cell also carries the absolute score and its 95% CI.
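As a rough sketch of how a "% below best" cell could be derived from absolute scores, assuming the gap is measured relative to the leader's score (the page does not spell out the formula, and the helper name `percent_below_best` is hypothetical):

```python
def percent_below_best(scores):
    """Map each model to its gap from the column leader, as a % of the leader's score.

    Assumption: the gap is relative, (best - score) / best, so the leader is 0.0.
    """
    best = max(scores.values())
    if best <= 0:
        raise ValueError("the leading score must be positive")
    return {model: 100.0 * (best - s) / best for model, s in scores.items()}
```

With this convention, a model scoring 0.60 in a column led by 0.80 is shown as roughly 25% below best, and its bar is 75% as long as the leader's.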
Build better agents. Measure what matters. Same benchmarks, your data.