Re-evaluating LLM encoders for semantic search
In this study, we explore how well benchmark performance on MTEB and on publicly available ecommerce datasets transfers to real-world retail search applications.
Algolia builds embedding models of approximately 500M parameters to meet latency targets. Each LLM is architecturally optimized and quantized to lower latency (Table 2). The Algolia AI team continuously assesses state-of-the-art LLMs and selects those that combine top performance with permissive licenses. These models are then fine-tuned for ecommerce contexts so that performance is tailored to industry-specific needs. The fine-tuning methodology combines best practices from recent research with the AI team's expertise. Automated training and evaluation pipelines explore many hyperparameter configurations simultaneously on the same training dataset and keep the best-performing model. Some of the techniques inspired by the latest research are outlined, without exact details, in the table below:
| Stage | Technique | Comments |
|---|---|---|
| Fine-tune (InfoNCE loss) | Stratified public ecommerce datasets in batches | To get the most out of the InfoNCE loss, stratified datasets are curated within the same batch. |
| Hard-negative fine-tune | Synthetic hard negatives for further fine-tuning | GenAI labeling is combined with tuned hard negative mining so that the resulting model can better separate ambiguous samples near the decision boundary. |
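
To make the first stage in the table concrete, here is a minimal sketch of an InfoNCE loss with in-batch negatives, written in plain Python. It is illustrative only and is not Algolia's training code: the function name `info_nce_loss`, the temperature value of 0.05, and the toy vectors in the usage note are all assumptions.

```python
import math

def info_nce_loss(query_vecs, doc_vecs, temperature=0.05):
    """Mean InfoNCE loss over one batch, using in-batch negatives.

    query_vecs[i] and doc_vecs[i] form a positive pair; every other
    document in the batch serves as a negative for query i.
    The temperature default is an assumed, illustrative value.
    """
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v)) or 1.0  # guard against zero vectors
        return [x / norm for x in v]

    q = [normalize(v) for v in query_vecs]
    d = [normalize(v) for v in doc_vecs]

    total = 0.0
    for i, qi in enumerate(q):
        # Cosine similarity of query i to every document, scaled by temperature.
        logits = [sum(a * b for a, b in zip(qi, dj)) / temperature for dj in d]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching document as the target class.
        total += log_denominator - logits[i]
    return total / len(q)
```

With toy inputs, `info_nce_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])` is close to zero because each query matches its own document, while swapping the two documents yields a much larger loss. Since the other items in the batch act as negatives, the composition of each batch directly shapes the training signal, which is why batch curation matters for this loss.
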
Table 3 lists the Algolia embedding LLMs (as of December 2024) with their specifications. All Algolia LLMs are trained on publicly available ecommerce datasets, and no Algolia customer datasets are used for training. Synthetic datasets for further fine-tuning are created by combining GenAI labeling with hard negative mining. The Algolia v2410 models are open-sourced under the MIT license and are available on Hugging Face. Latency is measured on a local machine with an i9 CPU.
| Model | License | Base model | Training datasets | Embedding dimension | Latency (i9 CPU) |
|---|---|---|---|---|---|
| Algolia-Large-EN-Generic-v2410 | MIT | gte-large | Public ecommerce (+Syn.) | 1024 | 90 ms / 40 ms (opt.) |
| Algolia-Large-Multilang-Generic-v2410 | MIT | solon-embeddings-large-0.1 | Public ecommerce (+Syn.) | 1024 | 90 ms / 40 ms (opt.) |
| Algolia-Large-All-Generic-v2412 | MIT | snowflake-arctic-embed-l-v2.0 | Public ecommerce (+Syn.) | 1024 | 90 ms / 35 ms (opt.) |
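
The hard negative mining step mentioned above can be sketched roughly as follows. This is a simplified illustration and not Algolia's pipeline: the helper names `cosine` and `mine_hard_negatives`, the `max_sim` cutoff, and the use of a plain set of positive IDs as a stand-in for the GenAI labeling output are all assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mine_hard_negatives(query_vec, candidates, positive_ids, k=2, max_sim=0.95):
    """Return ids of the k non-positive candidates most similar to the query.

    candidates:   dict mapping candidate id -> embedding
    positive_ids: ids labeled relevant (hypothetical stand-in for the
                  labeling step)
    max_sim:      assumed cutoff; candidates scoring above it are skipped
                  because they are likely unlabeled positives
    """
    scored = [
        (cosine(query_vec, vec), cid)
        for cid, vec in candidates.items()
        if cid not in positive_ids
    ]
    # Drop near-duplicates of the query, then keep the highest-scoring rest.
    scored = [(score, cid) for score, cid in scored if score <= max_sim]
    scored.sort(reverse=True)
    return [cid for _, cid in scored[:k]]
```

The mined negatives are the items the current model finds hardest to tell apart from true positives, so adding them to the fine-tuning set targets exactly the ambiguous cases near the decision boundary.
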