ECIR 2026: The Evaluation Renaissance

What 8 Algolians learned in Delft about the state of retrieval.

We are just back from ECIR 2026 in Delft, with attendees from our Ranking, Retrieval, Understanding, and Agent Studio teams. Over five days of tutorials and presentations, we saw forty-plus talks and dozens of posters. We came away with one shared realization: IR in 2026 is bustling.

LLMs are diffusing across the stack. Synthetic data and LLM-as-a-judge are maturing fast.

We might be looking at an evaluation renaissance.

Last year in Lucca, we wrote about “search meeting conversation”. Well, this year the conversation grew up.

What's continuing

Three threads from 2025 deepened.

Hybrid retrieval is the baseline. Nobody at ECIR debates keyword vs. semantic anymore. Every production system presented used both. The question shifted to how efficiently you blend them, and at what cost. Perplexity presented pplx-embed, a diffusion-pretrained embedding family that achieves comparable quality with a 4x memory reduction through native INT8 quantization. The message: embeddings are a solved-enough problem that efficiency is the new frontier.
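To make the 4x figure concrete: storing each embedding dimension as one signed byte instead of a 32-bit float divides memory by four. The sketch below uses simple per-vector symmetric quantization applied after the fact, which is an assumption for illustration only. It does not reproduce how pplx-embed trains for INT8 natively, and the function names and toy values are hypothetical.

```python
import struct
from array import array


def quantize_int8(vector):
    """Symmetric per-vector quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(x) for x in vector) or 1.0  # fall back to 1.0 for an all-zero vector
    scale = max_abs / 127.0
    codes = array("b", (round(x / scale) for x in vector))  # "b" = signed 1-byte integers
    return codes, scale


def dequantize_int8(codes, scale):
    """Approximate reconstruction of the original floats."""
    return [c * scale for c in codes]


# Toy 8-dimensional embedding (made-up values).
embedding = [0.12, -0.87, 0.45, 0.03, -0.29, 0.66, -0.51, 0.08]
codes, scale = quantize_int8(embedding)

float32_bytes = len(struct.pack(f"{len(embedding)}f", *embedding))  # 4 bytes per dimension
int8_bytes = len(codes.tobytes())                                   # 1 byte per dimension
memory_ratio = float32_bytes / int8_bytes
max_error = max(abs(a - b) for a, b in zip(embedding, dequantize_int8(codes, scale)))
```

The memory ratio is fixed by the data types; what a real system has to verify is that retrieval quality survives the rounding error, which is bounded here by half a quantization step per dimension.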

Conversational search is no longer speculative. The CDR tutorial covered conversation-aware dense retrieval end to end: data bootstrapping from search logs, context denoising, and the CORAL benchmark for conversational RAG. Our colleague Pascal went deep on this material and came back connecting it directly to our own Agent Studio conversations and memory architecture. Meanwhile, the BrowseComp benchmark showed that deep research tasks are brutally hard. Even humans fail 70% of the time despite spending more than 2 hours on each task, and LLMs without reasoning steps failed catastrophically, with a 2% success rate.

RAG moved from "does it work" to "can we trust it." This was the biggest shift. Multiple papers tackled RAG trustworthiness from different angles: contradiction detection in healthcare contexts (Javadi et al.), hallucination detection via token-level entropy, and citation verification through mechanistic interpretability. Claire, from our Ranking team, went deep on the FACTUM paper and came back with a key insight: correct citations require simultaneous grounding and stable synthesis, not just parametric memory. Most of our attendees independently flagged RAG faithfulness as a top theme. The field is no longer asking "does RAG work?" but "how do we know it works, and can we trust it?"
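As a rough illustration of the token-level entropy idea mentioned above: if the model's next-token distribution is flat at a given step, the model was unsure what to write there, and that token is a candidate for closer inspection. This is a minimal sketch under our own assumptions, not the method of any specific paper presented. The function names, the 1-bit threshold, and the toy distributions are all hypothetical.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in bits) of one next-token probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def flag_uncertain_tokens(tokens, distributions, threshold_bits=1.0):
    """Return (token, entropy) pairs whose entropy exceeds the threshold."""
    flagged = []
    for token, probs in zip(tokens, distributions):
        entropy = token_entropy(probs)
        if entropy > threshold_bits:
            flagged.append((token, entropy))
    return flagged


# Toy top-4 distributions at each generation step (made-up numbers).
tokens = ["Aspirin", "treats", "migraine"]
distributions = [
    [0.97, 0.01, 0.01, 0.01],  # confident
    [0.90, 0.05, 0.03, 0.02],  # confident
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain over 4 options
]
flagged = flag_uncertain_tokens(tokens, distributions)
```

High entropy is only a signal of uncertainty, not proof of a hallucination, which is one reason the sessions paired it with other checks such as citation verification.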

What surprised us

Encoders strike back

LightOn's Antoine Chaffin presented the ModernBERT project and its late-interaction descendants: GTE-ModernColBERT, Reason-ModernColBERT, ColBERT-Zero. The headline: small encoder models outperform decoders up to 45x their size on retrieval tasks. The PyLate library makes ColBERT-style multi-vector retrieval accessible.

Our colleague Alphonse, a non-ML engineer on the Understanding team, called it "technical but with the right mix of vulgarisation to speak to less ML-focused people." Ange, from our Ranking team, dug deepest into the technical details and came back noting that LateOn-Code-edge, a 17M-parameter model, outperforms models with 150M+ parameters. The trend was impossible to miss across the conference: encoders are no longer the forgotten middle child of the transformer family.

At the LIR workshop, Omar Khattab (MIT) delivered a call to action: late interaction needs production infrastructure. Filtering, faceting, pagination, multi-tenancy. During the Q&A, someone asked whether late interaction could work model-agnostically, with different encoders and architectures. The answer was candid: it might work if trained for it, but you cannot piggyback on existing models. The research is ahead of the tooling.
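For readers new to late interaction: ColBERT-style models keep one vector per token rather than one per document, and score a document by summing, for each query token, its best similarity against any document token (the MaxSim operator). The sketch below shows that scoring rule in plain Python with made-up 2-dimensional vectors. It is not the PyLate API, and the function names are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vectors, doc_vectors):
    """Sum over query tokens of the best-matching document token similarity."""
    return sum(max(dot(q, d) for d in doc_vectors) for q in query_vectors)


def rank(query_vectors, docs):
    """Score every document with MaxSim and sort by descending score."""
    scored = [(doc_id, maxsim_score(query_vectors, vecs)) for doc_id, vecs in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy token embeddings (made-up values): one vector per token.
query = [[1.0, 0.0], [0.0, 1.0]]
docs = {
    "doc_a": [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],  # covers both query tokens
    "doc_b": [[1.0, 0.0], [1.0, 0.0]],              # covers only the first
}
ranking = rank(query, docs)
```

Because every document carries many vectors, features like filtering, faceting, and pagination have to work over a much larger index than in single-vector retrieval, which is part of why the tooling lags the research.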

The eBay effect

Samarth Agrawal from eBay presented what might be the week's most practical talk. The approach: encode product codes as 120-bit binary vectors using 6-bit character encoding, build a KNN index, and retrieve variations by Hamming distance. No model training. No LLM. CVR up 45% in A/B testing.

Michal, from our Retrieval team, captured the appeal: "I liked the simplicity and effectiveness; no LLM magic." Alphonse, on Understanding, was already planning: "I can't wait to try and adapt this to our own Query Suggestions." Claire, from Ranking, left with four follow-up questions for our QS pipeline. Sometimes the most impactful research skips the neural network entirely.
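The arithmetic behind the eBay approach is easy to sketch: 6 bits per character times 20 characters gives 120 bits, and two codes that differ in a few characters differ in only a few bits. The version below is our own minimal reading of the talk, not eBay's implementation. The alphabet, the space padding, the sample product codes, and the brute-force scan standing in for a KNN index are all assumptions.

```python
# Assumed alphabet; any set of at most 64 symbols fits in 6 bits per character.
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ-_./ "
CHAR_TO_CODE = {ch: i for i, ch in enumerate(ALPHABET)}
MAX_CHARS = 20  # 20 characters x 6 bits = 120 bits


def encode(product_code):
    """Pack a product code into a 120-bit integer, 6 bits per character."""
    padded = product_code.upper().ljust(MAX_CHARS)[:MAX_CHARS]  # pad or truncate to 20 chars
    bits = 0
    for ch in padded:
        bits = (bits << 6) | CHAR_TO_CODE[ch]
    return bits


def hamming(a, b):
    """Number of differing bits between two encoded codes."""
    return bin(a ^ b).count("1")


def nearest(query_code, catalog, k=2):
    """Brute-force k-nearest neighbours by Hamming distance."""
    q = encode(query_code)
    scored = sorted((hamming(q, encode(code)), code) for code in catalog)
    return scored[:k]


# Hypothetical product codes: three colour variations and one unrelated item.
catalog = ["MK-4821-BLK", "MK-4821-BLU", "MK-4821-RED", "ZX-9000-WHT"]
neighbours = nearest("MK-4821-BLK", catalog, k=3)
```

One design detail worth noting: the Hamming distance between two characters depends on how the alphabet is mapped to 6-bit codes, so the choice of mapping affects which variations look "close".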

Synthetic data is now production-ready

Two industry talks landed this message.

IKEA (Glisovic et al.) used Gemini Pro 1.5 to generate synthetic purchase interactions for cold-start recommendations. Key finding: random augmentation provides no benefit and can harm. LLM reasoning over context and style is what drives the gain. Selective augmentation of just 20% of users achieved production-viable cost.

Amazon Alexa built Comprehensive Synthetic Personas through a 1,800-node interest ontology. Models trained on synthetic data outperformed those on real de-identified data. The taxonomy grounding prevents overfitting to synthetic artifacts and improves long-tail coverage.

Fairness has a deadline

The EU AI Act, in force since August 2024, becomes generally applicable in August 2026. ECIR took this seriously.

Baltazar et al. presented bribery-resistant ranking systems designed for compliance. Maria, from our Agent Studio team, attended the fairness sessions that most of us missed. The finding that stuck: reasoning rerankers do not improve equity (Samuel et al.). Reasoning improves relevance but preserves existing bias, sometimes amplifying it through "confidently skewed rationales."

As Rabelais wrote five centuries ago, "science sans conscience n'est que ruine de l'âme" (science without conscience is but the ruin of the soul).

The IR community is taking this to heart, and so are we, as search and recommendation systems have a bigger impact than ever on the broader fabric of our society.

What we're building

While we were in Delft, we launched our LLM Leaderboard and shared our talk "Don't Trust Your Search Agent" at DevBit.

The leaderboard evaluates 24 top models across relevance, hallucinations, and language adherence. All scores come with 95% confidence intervals: no bare point estimates, no "trust us" rankings. The test cases are difficulty-tiered synthetic data grounded in real product catalog data.

Shipping this during ECIR week was not planned, but the alignment was unmistakable: the field is converging on synthetic data evals backed by statistical rigor, and so are we.
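To show why intervals matter more than point estimates, here is a minimal sketch using the Wilson score interval for a pass rate. This is a standard textbook method chosen for illustration; it is an assumption, not a description of how the leaderboard computes its intervals, and the model results below are made up.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for roughly 95%)."""
    if trials == 0:
        raise ValueError("need at least one trial")
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half_width = (z / denom) * math.sqrt(
        p * (1 - p) / trials + z * z / (4 * trials * trials)
    )
    return centre - half_width, centre + half_width


# Hypothetical results: two models judged on the same 200 test cases.
low_a, high_a = wilson_interval(172, 200)  # 86% observed
low_b, high_b = wilson_interval(166, 200)  # 83% observed
intervals_overlap = low_a <= high_b and low_b <= high_a
```

In this toy case the two intervals overlap, so a ranking that puts the 86% model above the 83% model on point estimates alone would be claiming more than 200 test cases can support.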

ECIR's themes map directly to what we shipped this year. Agent Studio launched with persistent memory (semantic and episodic), a four-type tool system, and built-in analytics. When Guillaume, from our Agent Studio team, had a deep conversation with Elias Lumer about MemTool's approach to tool memory management, the parallels were immediate: agents need to add tools and remove them, managing context growth explicitly.

We also assessed 30 techniques for trustworthy AI evaluation as part of an internal research PoC, then ran 5 on Agent Studio. The learnings are promising: we can verify agent claims, and LLM introspection techniques show real potential. Stay tuned.

Talks worth your time

A curated list, with paper links, for practitioners:

Contradictions in Context (Javadi et al.): High similarity between retrieved docs does not mean low contradiction. Essential reading for any RAG builder.
Less LLM, More Documents (Ning et al.): Mid-size models benefit most from larger corpora. Scaling the corpus can offset scaling the LLM.
OrLog (Hoveyda et al.): Neuro-symbolic framework for complex logical queries. 90% fewer tokens than standard LLM reasoning.
Evalugator (Koopman et al.): Rapid RAG evaluation without labels. Reduces the friction of adopting good evaluation habits.
Beyond the Click (Zerhoudi & Granitzer): Framework for inferring cognitive traces from search behavior. Alphonse, from our Understanding team: "exactly one of the problems we are trying to solve."
FACTUM (Dassen et al.): Mechanistic detection of citation hallucination. Logits are a bad proxy for truth; residual stream analysis works better. Claire's top pick from the trustworthiness sessions.
Iterative Reranking (Czinczoll et al.): Query difficulty taxonomy for targeted LLM reranking. Useful beyond ranking.
LightOn: Encoders and Late Interaction (Chaffin): ModernBERT, PyLate, ColBERT-Zero. The encoder renaissance, with open models and code.

The hallway track

Serendipitous encounters in the coffee and lunch queues are always insightful. We chatted with researchers from Huawei, a student from the Sorbonne, and industry practitioners comparing notes on agentic architectures. The LightOn team shared ColGREP, a semantic code search tool built on their multi-vector models. Ange had a separate conversation with Amélie Châtelain, LightOn's Head of Knowledge & Search, about synthetic data and evaluation at every step of the pipeline. The conference dinner at Het Arsenaal brought conversations that no paper session could replicate.

What struck us most was the ecosystem's maturity. Not any single talk, but the overall direction: retrieval is becoming infrastructure, evaluation is becoming rigorous, and the gap between academic innovation and production deployment is narrowing. Across our teams, each of us brought back different angles on the same zeitgeist. That is the real value of conferences like ECIR.

Looking ahead

The state of IR in 2026 is an evaluation renaissance. Synthetic data works. LLM-as-a-judge works: you can Just Do Things. Not in all cases, and the devil is in the details. But if you build proper frameworks, based on industry-standard metrics and techniques, you can evaluate much of the user experience at the agentic frontier.

We are building those frameworks. Our LLM Leaderboard brings the receipts. Our Agent Studio brings the platform. And we are hiring engineers who want to work at this intersection of retrieval, generation, and evaluation.

See you at SIGIR or at ECIR 2027!

This post reflects the collective notes of our team: Guillaume Belain, Alphonse Bouy, Ange Daumal, Claire Helme-Guizon, Maria Lungu, Paul-Louis Nech, Michal Szmaj, and Pascal Zaragoza.
