Engineering

Conference Recap: ECIR23 Take-aways

We’re just back from ECIR23, the leading European conference on Information Retrieval systems, which held its 45th edition in Dublin.

Across 4 days, this edition offered a packed agenda around Content Search & Recommendation, alongside world experts in the field.
You could learn about new methods for data preparation & pre-processing, and a variety of algorithms for Information Retrieval & Recommender systems. Other talks discussed evaluation metrics (both classic and new) aimed at content quality, diversity & balance, among other goals.

Overall, the 2023 edition of this venerable conference had both solid roots, grounded in decades of research in the field, and wide-reaching developments that ranged from applications in specific domains to experience sharing on deploying Search-as-a-Service or optimizing database queries for efficiency and cost.

 

Our team of Algolia engineers attending the conference took away many learnings: insights into how big players like Spotify or ShareChat power & evaluate their recommendations, more fundamental research like new metrics for improving content diversity, and bigger think pieces on the place of Large Language Models in our Question-Answering systems or the challenges and opportunities of semantic search.

Read on to learn more from the four days of conferences – you’ll find an overview of the talks we attended, with links to the many useful techniques, metrics, and frameworks presented at ECIR!

Day 1

Personalization at Spotify

In the opening keynote, Mounia Lalmas (Head of Tech Research in Personalization) presented Spotify’s approach to Personalized Recommendations.

  • She described Shared Models used across features: embeddings of tracks/albums/artists using word2vec with tracks/artists as words, affinities of users for these entities, and similarities (e.g. when one ‘single’ album is equivalent to its one song).
    They consider playlisting as explicit intent, used for algorithms from content-based to collaborative filtering.
  • She touched upon Experience considerations: they approach personalization as a UI-based modeling problem (the home feed is a set of “Layers”, fully-personalized carousels, to optimize for engagement), hence the signals are contextualized (skipping within “Discover Weekly” is fine, skipping in “Calm music” playlist is negative signal – contrast with e.g. Jazz for Sleep, where no interaction means success). This enables Journey attribution: “high quality linking of outcomes to initial content”.
    The personalized experience at Spotify includes Algotorial playlists: semi-personalized with a “mixture of Human experts & Machine Learning”, showing the value for content teams of Merchandising capabilities on top of raw powerful AI models. 
  • She shared some insights on Evaluation of recommendations, e.g. the tradeoff between Wants and Needs, more of the songs we know you like vs discovering the content you’ll love tomorrow:
“We optimize for long-term or delayed rewards”
  • Ultimately, the goal is to align optimization criteria with long-term KPIs & Business needs
  • Finally, we got some insights into Research at Spotify:
    • Answering “How do you know where the user is in their journey?” by detecting Sequential listening patterns from morning to evening
    • Studying Evolution of User Preferences: see the “Where to Next?” paper
    • Exploring & Understanding Personal Preferences with their TastePaths research and interface

      TastePaths helps a user explore a genre from three artists they most listen to. More insights and takeaways from this user research on the Spotify Research blog!
    • Optimizing for Diversity of Recommendations, promoting Tail Content while keeping high user engagement: “It’s possible to take actions that increase diversity, with little/no fall in relevance.”
    • How do we know we’re looking at the Right Metrics?
      • e.g. Sleep playlists have different rewards modeling where no interaction == success
      • Deriving User- and Content- specific Rewards explains how they approach this modeling. 
        • Some insights on user-specific variance: “Now focusing on user types, users who like Jazz music tend to listen to recommended playlists for longer than other users, on average 27.16 minutes (global average 20.43m)” so you need to account for that to avoid biasing more strongly towards some use-cases.
        • Ultimately Two kinds of metrics:
          • Personalized surfaces: immediate outcomes of their actions
          • KPIs and business goals: measure behaviors that align with long-term success
    • What’s our Long-term User Satisfaction?
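The “playlists as sentences, tracks as words” idea from the keynote can be illustrated with a minimal co-occurrence sketch. This is not Spotify’s actual pipeline (they use word2vec-style training at scale); the playlists and track names below are made up, and the whole-playlist context window is a simplification of the skip-gram setup:

```python
from collections import defaultdict
from math import sqrt

# Hypothetical playlists: each playlist is a "sentence", each track a "word".
playlists = [
    ["so_what", "blue_in_green", "take_five"],
    ["blue_in_green", "take_five", "my_favorite_things"],
    ["enter_sandman", "master_of_puppets", "paranoid"],
]

# Count co-occurrences within a playlist (the "context window" here is
# the whole playlist, a simplification of skip-gram training).
cooc = defaultdict(lambda: defaultdict(int))
for pl in playlists:
    for a in pl:
        for b in pl:
            if a != b:
                cooc[a][b] += 1

vocab = sorted({t for pl in playlists for t in pl})

def vector(track):
    """A track's embedding: its co-occurrence counts over the vocabulary."""
    return [cooc[track][t] for t in vocab]

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    nu, nv = sqrt(sum(x * x for x in u)), sqrt(sum(x * x for x in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Jazz tracks share playlist contexts, so they end up closer together
# than a jazz track and a metal track.
jazz_sim = cosine(vector("so_what"), vector("my_favorite_things"))
metal_sim = cosine(vector("so_what"), vector("enter_sandman"))
print(jazz_sim > metal_sim)
```

The same intuition scales up: tracks that appear in similar playlists get similar vectors, which then power affinity and similarity computations.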

Viewpoint diversity in Search Results

How can you optimize not only for engagement, but also diversity in results?

Auditing Fairness in Graph models

The authors evaluate CF models on Consumer and Producer fairness metrics, showing in the paper that their superior accuracy may come at the expense of user fairness and item exposure. They use the Elliot framework, a nice tool for studying your recommender systems.

The Elliot Framework: A toolbox for running & evaluating recommender models.

Graph-Based Recommendation for Sparse and Heterogeneous User Interactions

Using random walks and genetic algorithms, this paper proposes a graph recommender which learns from sparse interaction data.
Although less competitive at larger scale, their approach can bring new state-of-the-art accuracy to your small-scale datasets. 💡

Day 2

Large Language Models for Question Answering

Day two started with a keynote by William Yang Wang on the Challenges & Opportunities of using Large Language Models (LLMs) for QA applications.
Wang presented paradigm shifts in language processing, going through the several steps from classical train/test-based Machine Learning, to fine-tuning on top of pretrained LLMs, to the current era of in-context learning on top of pretrained models.

Emerging abilities of LLMs: as the model scales, so do its cross-domain capabilities (towards general intelligence?)

He presented a Latent Variable perspective: we can see LLMs as working with a latent, high-dimensional vector ‘theta’ which represents the task at hand. The ability to solve complex tasks thus becomes more and more a matter of phrasing the task in language that can be represented efficiently.

When touching on the topic of Numerical Reasoning and other Specialized Domains, he pointed at new datasets to test & improve current systems: WikiWhy, a dataset of “Why?” questions with Wikipedia extracts to support each answer’s rationale; HybridQA, a large-scale, multi-hop question answering dataset relying on both structured tabular and unstructured textual forms; or ConvFinQA, a dataset of tough financial questions that require chain reasoning (a very expensive dataset, according to him, considering the many financial experts hired to provide an answering baseline 💸).

The keynote ended by mentioning several other issues: model hallucinations, which can result in misinformation or factuality issues; bias & fairness considerations, or even privacy & copyright problems. On that last topic, an interesting direction is Machine UnLearning: can we honor Right to be Forgotten requests without retraining a huge language model?

Some research labs focus on these key issues, such as UC Santa Barbara’s Center for Responsible Machine Learning. Lots of interesting research, such as better understanding of Gender-Seniority bias, to follow so that we ML product builders ensure our systems are built with fairness to everyone! 

Evaluation

On the theme of evaluation & metrics, there was a lot to learn: from a method to rank results based on various explainer models, to new metrics that capture Innovation and Diversity in search results. This ties back to the opening keynote’s insistence on measuring not only accuracy, but also the other things your users value in the long term, like the ability to discover new content.

Interleaving: Theoretical foundations

In this talk, Kojiro Iizuka presented Interleaving, an alternative to A/B testing that can reach conclusions up to 100x faster. What is the secret behind this huge performance improvement?

When comparing two rankers A & B, instead of randomly showing a full result page from one ranker or the other, Interleaving blends results from both and simultaneously measures feedback (explicit or implicit) across the tested rankers.

This allows for very dynamic testing which collects results far faster than traditional A/B testing. Yet until now this was a practical trick with no theoretical basis, and the paper’s team did a wonderful job of providing a solid theoretical explanation for its known high value.
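To make the blending concrete, here is a minimal sketch of one common variant, team-draft interleaving (the paper covers interleaving in general; the document names and credit rule below are illustrative, not taken from the talk):

```python
import random

def team_draft_interleave(ranking_a, ranking_b, rng):
    """Blend two rankings, tracking which ranker contributed each slot."""
    interleaved, owners, used = [], [], set()
    total = len(set(ranking_a) | set(ranking_b))
    while len(used) < total:
        # Flip a coin for which "team" drafts the next slot.
        team = "A" if rng.random() < 0.5 else "B"
        source = ranking_a if team == "A" else ranking_b
        pick = next((d for d in source if d not in used), None)
        if pick is None:  # this team is exhausted; let the other one pick
            team = "B" if team == "A" else "A"
            source = ranking_a if team == "A" else ranking_b
            pick = next((d for d in source if d not in used), None)
            if pick is None:
                break
        interleaved.append(pick)
        owners.append(team)
        used.add(pick)
    return interleaved, owners

def credit(owners, clicked_positions):
    """Credit each click to the ranker that owns that slot."""
    wins = {"A": 0, "B": 0}
    for pos in clicked_positions:
        wins[owners[pos]] += 1
    return wins

rng = random.Random(42)
blended, owners = team_draft_interleave(
    ["d1", "d2", "d3"], ["d3", "d1", "d4"], rng)
print(blended, owners)
```

Every impression compares both rankers at once, which is why interleaving needs far fewer sessions than an A/B split to detect which ranker wins.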

User aspects 

Some talks that day focused on UX considerations when applying recommender models to real-life use-cases. In A Matter of Time: Detecting Depression with Time-Enriched Multimodal Transformers, Ana-Maria Bucur presented her team’s work using a Multimodal Transformer architecture to predict likelihood of depression from small signals, like uses of language which suggest social isolation, or the tone and content of pictures posted on social networks.

Great insights both on the potential of multimodal systems, and on the challenges of building tools to assist experts like psychologists: we need to work on making our systems explainable and highly reliable, before we can expect professionals to trust us with critical content like their patients’ data. 

Later, we saw a potential Recommender Meeting Assistant: this system listens to the same meeting as you, and suggests complementary content like infographics to help you understand the speaker.

Interestingly, the challenges to its adoption are less technical (although finding the right multimodal content in realtime is quite a feat!), but rather come from user experience considerations: how do you solve the interactivity, social implications, and personalization issues which stem from such a setup? Part of the answer is following good interaction design practices: they mentioned UX guidelines from Microsoft’s foundational paper Guidelines for Human AI interactions [2019], which seem even more relevant today.

Day 3

The third day brought several talks about Multimedia applications. 

There is a solid trend in MultiModal systems, able to make sense of both textual and visual data, such as HADA: A Graph-based Amalgamation Framework in Image-Text Retrieval. In this paper (with code <3), they demonstrate a system for Image-Text Retrieval which beats the best known options so far. After showing a first Fusion Encoder method, which is accurate but slow, they highlight their key idea: using two Transformers side-by-side, encoding images and text separately. The loss in accuracy is minor and can be mitigated with retraining; using pre-trained models like ALBEF and LightningDOT allows them to train quickly, as the overall graph-based system only has 10M trainable parameters to optimize.
Finally, they shared some tooling advice: using Salesforce’s LAVIS, a nice open-source framework to leverage various visual models in a standardized way and to test them on some multimodal tasks

Other papers were applying similar ideas to more specific domains: in Multimodal Geolocation Estimation of News Photos, a team created a dataset called MMG-NewsPhoto: half a million news texts with the corresponding image, with a set of 3000 photos manually labeled with geolocation information. They benchmark state-of-the-art methods, and demonstrate that most multimodal models are superior to unimodal approaches by more than 10%!

Beyond images, others were looking at multimodal encoding to solve different kinds of specialized problems: in Predicting the Listening Contexts of Music Playlists Using Knowledge Graphs, a team of two proposed four novel classifiers which yield approximately 10% higher performance than the state-of-the-art.
Beyond their nice results on the benchmarks, a counter-intuitive learning from their paper is that representing playlists as temporal sequences of songs (so a Prepare for sleep playlist going from Metal to Jazz would be seen as different from a Wake-up! playlist going from Jazz to Metal) prevents some learning, so in the end an approach using a bag-of-songs representation performs better at classifying each song’s listening context!
Here, as often, our knowledge of a use-case might give wrong intuitions – a good reminder to always start ML projects by thinking from our domain knowledge, and then to make evidence-based decisions by looking at quality metrics!

Day 4: Industry Day

On the final day, we attended what I considered the most insightful track: Industrial presentations applying all this algorithmic knowledge to concrete business use-cases. There were introductions to various applications from Domain-specific IR to Recommendations, presenting real-world applications and ideas for Next-generation Search.

The first session on Domain-Specific Retrieval started with an invited talk from Aylien, a news intelligence platform leveraging NLP techniques and knowledge bases to represent articles across many world news sources. They explained the value of Dense Vectors for various text applications, from recommendations to clustering and more.
This was followed by a presentation of Ortho Search, a domain-specific gateway to retrieve medical articles, where they could leverage the specificities of the content to provide domain-specific navigation features.

The session concluded with Jakub Zavrel, previously founder & CEO of TextKernel, presenting his new project Zeta Alpha: their platform combines a Neural-based search engine with Language Model Prompting, to rephrase the query before search and generate a natural language answer from retrieved documents. Its goal is to multiply the productivity of knowledge workers by helping them answer questions from large-scale enterprise data more efficiently. It’s interesting to see such recent efforts towards natural language enterprise search – looking forward to seeing how they solve the difficulties of language models generating confabulations, their potential biases around topics like gender, and the other challenges that come when integrating large language models after a search engine’s retrieval phase!

The second part of the Industry day focused on real-world applications. Here an invited talk brought the team from Siren, a platform for Investigative Intelligence which uses a data model to drive discovery of related nodes. They did a live demo showing how one can load e.g. investor data in their platform, and answer questions like “what investments connect these two accredited investors” by displaying a graph of relationships & common investments between two nodes of interest.
One key insight was how they keep the graph-based discovery experience efficient at scale: by leveraging semi-joins instead of inner-joins, they get both a faster user experience and some welcome cost optimization on frequent navigation actions.
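The semi-join vs. inner-join difference is easy to see on a toy dataset. This is an illustrative SQLite sketch, not Siren’s actual schema or engine; the table and column names are made up:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE investors(id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE investments(investor_id INTEGER, company TEXT);
INSERT INTO investors VALUES (1, 'Ada'), (2, 'Grace');
INSERT INTO investments VALUES (1, 'AcmeAI'), (1, 'AcmeAI'), (2, 'Initech');
""")

# Inner join: one output row per matching pair -- duplicates fan out.
inner = con.execute("""
    SELECT investors.name FROM investors
    JOIN investments ON investments.investor_id = investors.id
    WHERE investments.company = 'AcmeAI'
""").fetchall()

# Semi-join (EXISTS): only asks "is there at least one match?" --
# no fan-out, and the engine can stop scanning after the first hit.
semi = con.execute("""
    SELECT name FROM investors i
    WHERE EXISTS (SELECT 1 FROM investments v
                  WHERE v.investor_id = i.id AND v.company = 'AcmeAI')
""").fetchall()

print(inner)  # duplicated rows from the fan-out
print(semi)   # one row per qualifying investor
```

For “which nodes are connected to X?” navigation queries, only existence matters, so the semi-join avoids materializing every matching pair.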

Demo of the Siren platform, here displaying entities in analyzed emails

Among other talks of interest in this session, we also saw the Scalac team present Teamy.ai, a system which matches projects to employees based on their skills and interests using evolutionary algorithms; and RipJar, who showed how their risk intelligence use-case needs solid language processing tools: they presented an example with a criminal named David Cameron (not the one you may be thinking of!) and the challenges of handling ambiguous entities in text content.

In the afternoon, we saw a whole session focused on Recommendations. The invited talk here was Algolia’s: we presented Building a Business-Aware Image-Recommendation API – showing the power of vector search via image vectors, the challenges in scaling vectors via hashing techniques to reach the industrial scale, and how this raw power needs to be wrapped in an expressive API to unlock many of its potential use-cases.

Click here to find the slide deck (with full presenter notes!) of our invited talk at ECIR23

The session continued with Olivier Jeunen from ShareChat. As a leading social media platform in India with ~400 million users, ShareChat is in a unique position to study a known problem in recommenders: Position bias, or how we should treat explicit feedback (e.g. users liking a video) when we know that the first item in a list will get more attention than the 5th or 10th.

Their talk Probabilistic Position Bias Modeling brought a refreshing perspective on the old wisdom that we should use nDCG, the gold standard for discounting position bias. Contrasting this model with their user analytics, they noticed that past the 100th video in a feed, nDCG assigns a far too high probability of being viewed; when modeling different functions against known user behavior, they found that a power law was a better fit. It’s a good cautionary tale: don’t trust theory too much, but use it as a starting point before comparing with real user behavior. Ultimately, offline metrics are only as good as their correlation with the online metrics and business KPIs that, in the end, are the one thing we all want to optimize for!
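A quick numerical comparison shows why the log discount misbehaves deep in a feed. The power-law exponent below is illustrative, not ShareChat’s fitted value:

```python
import math

def ndcg_discount(rank):
    """Classic logarithmic discount used by nDCG."""
    return 1.0 / math.log2(rank + 1)

def power_law_discount(rank, alpha=1.5):
    """Power-law attention model (alpha here is a made-up exponent)."""
    return rank ** -alpha

# At position 100, the log discount still implies ~15% of the attention
# given to position 1, while the power law says it is nearly invisible.
print(ndcg_discount(100))       # ~0.15
print(power_law_discount(100))  # 0.001
```

Plotting both curves against real view-through data is exactly the kind of sanity check the talk advocated before committing to an offline metric.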

The last talk on Recommenders was from Lamya Moutiq at Expedia. They built their own recommender model to address the specific challenges of the travel industry: fast-changing environment, with unstable item properties (hotels changing their offering regularly) and user preferences (criteria for selecting a winter ski resort and a summer beach vacation can be very different, even for the same profile).
A notable insight from her presentation is the choice of different metrics at different stages of the recommendation pipeline: first they generate candidates with wide recall, optimizing for Recall@100 to produce a large list of good candidates; then a second step refines this for better precision, optimizing for nDCG@5, as the final UI displays a handful of recommendations at a time.
This two-pass approach, combined with feature selection, lets them cast a wide net across all the places a user could rent, before refining to optimize for the likelihood that the user ultimately books!

 

Finally, the industry day’s ultimate session was on Next Generation Search.

It opened with an invited talk from Stéphane Clinchant, presenting the Naver Labs team’s work on SPLADE: a Sparse, Lexical And Expansion Model for search, from its roots to future work.

This is a neural retrieval model, which learns how to expand queries and documents into sparse representations, which can then be retrieved via mixed methods: efficient inverted indices, explicit lexical match, or neural (semantic) retrieval.
They exposed a few core differences of SPLADEv2 compared to the initial release: benefiting from recent advances in training neural retrievers, the v2 leverages better pre-trained language models, distillation, and other techniques to increase effectiveness both in-domain (tested on the MS MARCO benchmark) and out-of-domain (evaluated on BEIR (Benchmarking-IR), a framework for evaluating zero-shot information retrieval models).
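Why do sparse learned representations like SPLADE’s fit so naturally into inverted indices? A toy sketch makes it concrete; the term weights below are invented for illustration (a real SPLADE model learns them), but the retrieval mechanics are the standard sparse dot product over posting lists:

```python
from collections import defaultdict

# Hypothetical sparse (term -> weight) vectors, as a SPLADE-style model
# might produce: note the query is *expanded* with related terms it
# never contained ("car" gains weight on "vehicle").
doc_vectors = {
    "d1": {"car": 1.2, "vehicle": 0.8, "engine": 0.5},
    "d2": {"banana": 1.5, "fruit": 0.9},
}
query_vector = {"car": 1.0, "vehicle": 0.4}  # expanded query

# Sparse vectors drop straight into an inverted index:
# one posting list of (doc, weight) per expanded term.
index = defaultdict(list)
for doc_id, vec in doc_vectors.items():
    for term, w in vec.items():
        index[term].append((doc_id, w))

# Retrieval = sparse dot product, accumulated over posting lists.
scores = defaultdict(float)
for term, qw in query_vector.items():
    for doc_id, dw in index[term]:
        scores[doc_id] += qw * dw

print(max(scores, key=scores.get))
```

Because scoring only touches the posting lists of the query’s (expanded) terms, this keeps the efficiency of classic lexical retrieval while capturing some semantic matches.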

Illustration of the SPLADE system: A query is processed and expanded, before various lookup methods are used and ranked, ultimately surfacing the most relevant document.

Next were two presentations on different aspects of search systems. First, Huawei presented an approach to efficient Clustering for various information retrieval tasks, focused exclusively on search implementation. Then came a talk on Dedicated Search & SaaS at Bloomberg, focused in contrast on search infrastructure: it was really insightful to hear their experience managing a dedicated search system for hundreds of internal teams, and moving to a search-as-a-service approach which brought many benefits but also some challenges. On one hand, it was extremely valuable to their internal ‘consumers’: they benefited from new features shipped regularly, less ongoing maintenance effort, and less specialized knowledge needed in each internal team, so they could focus on their value-add.

On the other hand, this came at an added cost for their team, which was now responsible for any troubles arising – they shared in particular the hassle of patching the log4j vulnerabilities across the few different versions of the stack in use by the different downstream teams.

Even then, the speaker reminded us that in the dedicated search alternative scenario, it would have been even more difficult to coordinate patches across ~200 different teams, so there is a clear economy of scale in the SaaS approach!

 

The closing talk was an extended presentation by Niels Reimers from Cohere: his keynote-level presentation outlined the Opportunities and Challenges of Multilingual Semantic Search.

In his talk, Niels built a great case for semantic search and its ability to close the lexical gap of keyword-only search. For example, letting a query for “United States” retrieve documents which only mention “U.S.” is something where keyword search systems would rely on synonyms, which can be tedious to maintain manually or require developing an advanced synonym generation pipeline. Likewise, being able to retrieve nearby terms when answering a query is key to understanding varied users: Niels gave the example of searching for “Capital of the United States” on Wikipedia, where their Elasticsearch-based keyword search engine returns a whole page of results without mentioning Washington D.C.; indeed, the top keyword match is “Capital Punishment in the United States”, which shares many keywords and yet is about a totally different topic.

However, semantic search is not just a land of opportunities. There are unique challenges to overcome before you can replace a successful traditional search interface with a fully semantic engine.

One big problem is generalizing to new language, such as words not seen in the training data: while keyword engines use features like lemmatization and stemming to provide results for unknown words, today’s purely semantic search engines fail to make sense of these kinds of queries.

Another challenge with semantic search is bias: Niels shared how multimodal engines trained on image & text can handle a query like शादी (wedding in Hindi), yet are likely to return images reflecting a very western-centric representation of a wedding. Niels concluded that, at least for the foreseeable future, hybrid search engines leveraging the strengths of both sides look like the most solid approach to have your [textually-relevant] cake and eat it [with semantic icing] too!

Overall, the ECIR23 conference was full of insight, offering us a perspective into fundamental advances in the field as well as welcome opportunities to reflect on our practices around user experience, evaluation metrics, or content diversity.
We hope this recap can inspire you and give useful ideas for your own work – as for us, you can be sure they will inform our next improvements to our search & discovery platform!

 

About the author

Paul-Louis Nech

Senior ML Engineer
