Analysis and results

Re-evaluating LLM encoders for semantic search

Summary

In this study, we explore how well performance on MTEB and on publicly available ecommerce benchmarks transfers to real-world retail search applications.

Analysis and results

Ecommerce benchmark (Algolia private + public datasets)

All three Algolia LLMs rank in the top 20 on the multilingual ecommerce datasets, which include English (Figure 2). Algolia v2412 tops the leaderboard on NDCG@10. Algolia Multilingual v2410, which is available on Hugging Face under a permissive license as of December 2024, ranks 6th, above the Cohere multilingual model. The performance difference between our v2410 and v2412 multilingual models is approximately 3% in NDCG@10.

Figure 2: Algolia LLM performance on ecommerce benchmark that includes Algolia private and public ecommerce datasets
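
For readers less familiar with the metric, NDCG@10 can be sketched as follows. This is a minimal illustration, not the evaluation code behind the leaderboard: it assumes linear gain (some implementations use 2^rel - 1 instead), and the function names and the optional judged_relevances argument are hypothetical.

```python
import math
from typing import Optional, Sequence


def dcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain over the first k results (linear gain)."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))


def ndcg_at_k(
    ranked_relevances: Sequence[float],
    judged_relevances: Optional[Sequence[float]] = None,
    k: int = 10,
) -> float:
    """NDCG@k for one query.

    ranked_relevances: relevance score of each retrieved document, in ranked order.
    judged_relevances: relevance scores of all judged documents for the query;
        if omitted, the ideal ordering is built from the retrieved list only.
    """
    pool = ranked_relevances if judged_relevances is None else judged_relevances
    ideal = dcg_at_k(sorted(pool, reverse=True), k)
    if ideal == 0:
        return 0.0  # no relevant documents: define the score as zero
    return dcg_at_k(ranked_relevances, k) / ideal
```

A list that is already in ideal order scores 1.0, and the benchmark-level figure would typically be the mean of this per-query score over all queries in a dataset.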

Ecommerce benchmark (Algolia private datasets)

The Algolia private ecommerce benchmark is curated from the datasets of Algolia customers who collaborated with us to evaluate model performance on their specific data distributions. Each private dataset consists of query and document sentence pairs with an associated relevance score, and the dominant languages in the benchmark are English, French, German, and Spanish. The relevance scores are derived from analytics on past user interactions with the products for the associated queries. Removing the public ecommerce datasets doesn’t change the leaderboard significantly, although Algolia v2412 moves to second place, ahead of the best Google embedding model on the leaderboard. All three Algolia LLMs still rank in the top 20 (Figure 3).

Figure 3: Algolia LLM performance on ecommerce benchmark that includes Algolia private ecommerce datasets
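
To make the idea of interaction-derived relevance scores concrete, here is one possible sketch. The source does not describe the actual derivation, so everything below is an assumption for illustration: the event types, the weights in EVENT_WEIGHTS, and the per-query normalization are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical weights: stronger purchase intent counts for more.
EVENT_WEIGHTS = {"click": 1.0, "add_to_cart": 3.0, "purchase": 5.0}

Event = Tuple[str, str, str]  # (query, product_id, event_type)


def relevance_from_events(events: Iterable[Event]) -> Dict[Tuple[str, str], float]:
    """Turn raw interaction events into a relevance score in [0, 1] per (query, product)."""
    totals: Dict[Tuple[str, str], float] = defaultdict(float)
    for query, product_id, event_type in events:
        # Unknown event types contribute nothing.
        totals[(query, product_id)] += EVENT_WEIGHTS.get(event_type, 0.0)

    # Normalize by the best-scoring product of each query so that
    # popular and rare queries end up on the same scale.
    best_per_query: Dict[str, float] = defaultdict(float)
    for (query, _), score in totals.items():
        best_per_query[query] = max(best_per_query[query], score)

    return {
        (query, product_id): (score / best_per_query[query] if best_per_query[query] > 0 else 0.0)
        for (query, product_id), score in totals.items()
    }
```

Scores produced this way can feed directly into a graded metric such as NDCG@10, which is why a benchmark built on them reflects observed shopper behavior rather than manual annotation.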

Ecommerce benchmark (Algolia private multilingual datasets)

This benchmark keeps only the non-English datasets, so its dominant languages are French, German, and Spanish. Some non-European languages are also present, but they make up a small minority of the benchmark. Removing the private English ecommerce datasets doesn’t change the leaderboard significantly; as expected, the Algolia English v2410 model drops, from 12th to 13th place (Figure 4). Some new models also climb into the top 20, including Alibaba gte-large-en-1.5.

Figure 4: Algolia LLM performance on ecommerce benchmark that includes Algolia multilingual private ecommerce datasets
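
The benchmark variants in this chapter differ only in which datasets they keep. A minimal sketch of that filtering step is shown below, assuming per-dataset NDCG@10 results are stored as rows and that a model's benchmark score is the unweighted mean over the kept datasets. The row fields, the model names, and the averaging choice are assumptions, not a description of the actual pipeline.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]  # keys: "model", "dataset", "language", "ndcg_at_10"


def leaderboard(results: List[Row], keep: Callable[[Row], bool]) -> List[Tuple[str, float]]:
    """Average NDCG@10 per model over the datasets selected by `keep`, best first."""
    per_model: Dict[str, List[float]] = {}
    for row in results:
        if keep(row):
            per_model.setdefault(str(row["model"]), []).append(float(row["ndcg_at_10"]))
    averaged = {model: mean(scores) for model, scores in per_model.items()}
    return sorted(averaged.items(), key=lambda item: item[1], reverse=True)


# Hypothetical results, for illustration only.
results = [
    {"model": "model_a", "dataset": "shop_1", "language": "en", "ndcg_at_10": 0.6},
    {"model": "model_a", "dataset": "shop_2", "language": "fr", "ndcg_at_10": 0.4},
    {"model": "model_b", "dataset": "shop_1", "language": "en", "ndcg_at_10": 0.5},
    {"model": "model_b", "dataset": "shop_2", "language": "fr", "ndcg_at_10": 0.5},
]

multilingual_board = leaderboard(results, keep=lambda row: row["language"] != "en")
english_board = leaderboard(results, keep=lambda row: row["language"] == "en")
```

With these made-up numbers, model_b leads the non-English board while model_a leads the English one, which is the kind of reordering the language-specific benchmarks are meant to expose.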

Ecommerce benchmark (Algolia private English datasets)

This benchmark includes only the English Algolia private ecommerce datasets. Since the majority of Algolia customer datasets are in English, it is essential to benchmark Algolia LLMs against their open-source and commercially available counterparts. The dominant verticals in this benchmark are Fashion and Technology. The Algolia v2412 model retains its place in the leaderboard (Figure 5), whereas the v2410 English model climbs to 9th place and the v2410 multilingual model falls from 9th to 13th place. The Snowflake arctic-embed-m-v1.5 model jumps from 24th to 5th place. The Lajavaness bilingual models end up in 32nd place (large) and 35th place (large-8k). Such drastic changes in the leaderboard show how sensitive some LLM embedding models are to English. It is essential to evaluate a model’s sensitivity to each language of interest to make sure it matches the expectations of the deployment data distribution.

Figure 5: Algolia LLM performance on ecommerce benchmark that includes Algolia English private ecommerce datasets

Ecommerce benchmark (public English datasets)

The public ecommerce benchmark includes datasets such as ESCI, WANDS, Home Depot, Crowdflower, and Marqo. All of these datasets come with relevance scores, are in English, and are publicly available. Given their availability, it is no surprise that many open-source models are trained on them. In the public English ecommerce benchmark, the Algolia v2412 model is in 3rd place, the Algolia v2410 English model in 6th, and the Algolia v2410 Multilingual model in 10th (Figure 6). Algolia models perform consistently well across all ecommerce benchmarks. Notably, Snowflake models cluster at the top of the leaderboard, which suggests that they are trained heavily on public ecommerce datasets. OpenAI and Google models are no longer within the top 20; they cluster around 30th place, which suggests that these commercial models are not trained heavily on public ecommerce datasets. The Cohere multilingual and English models are within the top 20 of the leaderboard.

Figure 6: Algolia LLM performance on ecommerce benchmark that includes public English ecommerce datasets

MTEB benchmark (English datasets)

This benchmark includes only the English retrieval datasets from MTEB. Algolia keeps MTEB in its evaluation to make sure Algolia models don’t diverge much from their foundational language understanding, given that Algolia LLMs are fine-tuned on top of state-of-the-art open-source LLM embedding models. Compared to the private ecommerce benchmark, seven models are displaced in MTEB, the most notable of which are Algolia v2410 Multilingual, Google gecko@003, and JinaAI embedding-v3 (Figure 7). Although these three models rank highly on the private ecommerce datasets, they aren’t within the top 20 on the MTEB leaderboard. This indicates that MTEB and other open-source benchmarks aren’t enough on their own to tell which LLM embedding model suits a specific context. It is essential to build an internal benchmark that includes datasets matching the deployment data distribution to ensure that the preferred model is indeed the right one.

Figure 7: Algolia LLM performance on the MTEB benchmark that includes English retrieval datasets
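
Comparing two leaderboards, as done above for the private ecommerce benchmark and MTEB, comes down to ranking the same models twice and looking at who moved. The sketch below shows one way to do that under simple assumptions: each benchmark is a mapping from model name to a single score, higher is better, and ties are broken by iteration order. The function names and the toy scores are hypothetical.

```python
from typing import Dict, List


def ranks(scores: Dict[str, float]) -> Dict[str, int]:
    """Map each model to its 1-based leaderboard position (higher score is better)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {model: position for position, model in enumerate(ordered, start=1)}


def rank_shifts(bench_a: Dict[str, float], bench_b: Dict[str, float]) -> Dict[str, int]:
    """Positions gained (negative) or lost (positive) when moving from A to B."""
    ranks_a, ranks_b = ranks(bench_a), ranks(bench_b)
    return {model: ranks_b[model] - ranks_a[model] for model in ranks_a if model in ranks_b}


def displaced_from_top_k(
    bench_a: Dict[str, float], bench_b: Dict[str, float], k: int = 20
) -> List[str]:
    """Models in the top k of A that are not in the top k of B.

    A model missing from B entirely is treated as displaced.
    """
    ranks_a, ranks_b = ranks(bench_a), ranks(bench_b)
    return sorted(
        model for model, rank in ranks_a.items()
        if rank <= k and ranks_b.get(model, k + 1) > k
    )


# Toy scores, for illustration only.
private_scores = {"m1": 0.9, "m2": 0.8, "m3": 0.7}
public_scores = {"m1": 0.5, "m2": 0.9, "m3": 0.7}

shifts = rank_shifts(private_scores, public_scores)
dropped = displaced_from_top_k(private_scores, public_scores, k=2)
```

In this toy example m1 leads the first benchmark but falls out of the top 2 on the second, mirroring how a model can rank highly on private data yet sit outside the top 20 on a public benchmark.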
