Haystack US 2024: From RAGs to Relevance

The week of Monday, April 22nd saw the Haystack US conference descend on Charlottesville, Virginia. OpenSource Connections hosts this annual conference for search engineers and adjacent enthusiasts within the industry, and we at Algolia were excited to both attend and present.

This year, RAG was all the rage. Retrieval-augmented generation is a clever way to avoid the hallucinations and inaccuracies of general-use LLMs by restricting content generation to your specific corpus of data. But even then, there are still challenges around testing those responses. We heard several novel approaches to testing RAGs and semantic search implementations.

That said, RAG wasn’t the only topic on the agenda. Relevance is always top of mind at Haystack, and we heard about clever ways people are improving ranking across data sources.

Algolians at Haystack US 2024


Here’s a quick recap and some takeaways from this year’s Haystack US presentations.

Keynote: People of Search and AI

In lieu of a keynote, OpenSource Connections’ Charlie Hull gave a truly heartfelt welcome and a reminder that your search system is more than algorithms: it’s mostly made by people.

From the people who made search history to your own search team, from the engineers building new methods to the PMs applying them to real user problems, Charlie reminded us to keep the human at the heart of what we do, so that we can delight the humans we build for: the customers & end-users looking to achieve more and live happier lives thanks to search systems.

“Take care of your people and they’ll take care of your search!”

This message resonated throughout the week as the open source ethos of collaboration touched every conversation, even among those of us outside the open source space. Our resident developer advocate appreciated Charlie’s jibe at developer relations “elves” and their abundant energy – a good reminder to listen and collaborate, and that not everyone has the same social battery.

📺 On YouTube

All vector search is Hybrid search

Bold claim! With this provocative title, John Solitario actually means: “All vector search applications that have good UX leverage hybrid approaches”. From hybrid vector+lexical search to “metadata filters on top of vectors”, John presented several hybrid approaches that enhance your user experience over the raw powers of vectors.

📺 On YouTube

Chat with Your Data

Jeff Capobianco of Moody’s Search Engineering team gave some insights on building RAG solutions and making sure your AI “sticks to the facts.” Jeff pointed out that LLMs may be the most expensive tools you run at your company – at least for now. He recommends balancing accuracy against cost. Depending on your use case, you may not need the newest, most expensive models to get what you need. Use guardrails to ensure your LLMs are returning answers based on the RAG data and not generic model responses.

Jeff also emphasized testing, and had some novel ideas around using LLMs as proxies for human judges when evaluating results. You compare the judgements (and explanations of those judgements) from the AI to those from your experts. Pro tip – always have your AI explain its reasoning before rendering a judgment, so the explanation isn’t just a post-facto justification!

LTR at Reddit

A lovely three-presenter talk on putting LTR (learning to rank) into production – or how, even at a big company like Reddit, turning state-of-the-art models into live production services takes more than you’d think (and may even take down a few systems as you learn what changes they imply). It was nice to see the nitty-gritty parts of AI engineering: handling cache issues, and testing hypotheses around data selection with good old A/B testing (“Offline nDCG goes up? Ship it to an A/B test. Online metrics go up? Ship it!”)

📺 On YouTube

Generative AI Search and Summarization Testing

Testing for generative AI tools was a huge topic this year. A/B testing and judgements by subject matter experts are still considered the gold standard (shout out to Quepid!), but several new and innovative approaches were also presented. In this talk, Douglas Rosenoff from LexisNexis walked us through how they test AI-generated document summaries across their 3.8 petabytes of global content. Their goal was to build trustworthy tools that simplify the work process.

Since summaries can vary drastically with small changes to their RAG pipeline, their focus is comparing the ratings of the summaries, rather than the summaries themselves. Their legal experts rank each summary across seven criteria, and they look for consistency across those rankings. I thought this was a great way to bring objectivity into a subjective process.

📺 On YouTube

Apache Lucene: From Text Indexing to Artificial Intelligence

Buckle up for a wild history ride! Lucian Precup shared 22 years of Lucene: how this venerable lexical search engine (which today powers giants like Elasticsearch/OpenSearch/Solr/MongoDB Atlas) introduced many concepts, from inverted indices to the [“Search syntax” we all know AND are “used to” OR “at least familiar with”].

Lucene being historical, however, doesn’t mean it’s ancient. Although it came a long way (support for actual numbers rather than pseudo-numbers is quite recent!), since 2022 Lucene has offered modern features like dense semantic vector embeddings, ANN search, and sparse vectors – and RAG support is coming soon. Lucian even showed us a glimpse of the future with a demo of a chat assistant using Lucene-powered RAG to ground its answers in our documents’ reality!

📺 On YouTube

Retro Relevance: Lessons Learned Balancing Keyword and Semantic Search

Principal Software Engineer Kathleen DeRusso of Elastic also took us on a trip down memory lane, reminding us that TF/IDF is as old as the movie The Godfather and even BM25 is 30 years old. New tools get balanced against old as we carefully incorporate semantic search into our existing keyword stacks.

Kathleen pointed out the benefits and challenges (cough, cough, pagination) of going hybrid, and some traps to avoid as you balance and merge results. Re-ranking is your friend here, but avoid tuning for “pet” queries. My favorite takeaway: it’s ok to keep manually crafting head queries and pinning results – focus your tuning on the tail.

📺 On YouTube

Vector Search A/B Test results for Ecommerce

Istvan Simon from PrefixBox asked: sure, vectors are much better in theory, but by how much are they better in practice?
A nice presentation walking us toward concrete answers to that reality check, looking at all angles from data collection to how vectors are generated, and even covering interesting edge cases.

Moody’s: How Round Robin Improved Search Relevancy

Sometimes the old ways are the best ways. Senior software engineer Joelle Robinson gave a great talk on how Moody’s improved their federated search experience for their users. They have seven unique data sources, all using different retrieval algorithms. How do you aggregate results when a user searches across everything without getting tangled in conflicting boosts? Their solution was to keep it simple by using a weighted round-robin. They grab records from the most popular sources first before layering in records from the other data sources. I found this solution interesting, although I tend to prefer multi-column approaches to federation.

CLAP with me

A great talk by AJ Wallace on multimodal search, but not images: how to search for audio clips based on text queries. He works at Splice, a marketplace for samples – but how do you handle users who’d like “something like this drum break sample, but not quite the same”?
The answer is multimodal search: using audio embeddings with CLAP to retrieve relevant sounds.

He showed a use case of Similar Sounds recommendations on Splice’s website (showing the value of item-to-item recommendations, such as what LookingSimilar provides, beyond textual data!), and a live demo of a search & indexing system where the audience made duck quacks (what a fun way to re-energize a post-lunch room!) that we could then retrieve by searching for “ducks quacking”!

📺 On YouTube

⚡️ Haystack Lightning talks

Such high-quality lightning talks, lots to unpack here! A few highlights:

  • Rajani Maski’s talk on assessing the cost of vectors at scale (we’re talking millions of items – she’s from Shutterstock) in an LTR setup, which packed many good tips into such a short time
  • Denzell Ford’s quick PSA that Text-Embeddings-Inference is the fastest embedding inference out there
  • Maya Ziv’s very engaging talk on building semantic search at a startup (could have been titled “or how I learned to stop worrying about state-of-the-art and love the hustle” 😛)
  • Doug Turnbull’s cool project adding search to Pandas Dataframes (🤯)
  • Eric Pugh’s Year in Review for the new Quepid tool for relevance testing (which now supports Algolia, too 💙).
Eric Pugh gives a lightning talk at Haystack US 2024 about Quepid (which now supports Algolia!)

We also found Lucian Precup’s collaborative search engine quite clever – it now offers GenAI to chat with your technical docs!

Evolution of Search (Moody’s)

This was actually extra fun for Chuck Meyer because he did some consulting at Moody’s about a decade ago. It was cool to see how far they’ve come embracing cloud and managed services. Even old school companies need state-of-the-art search! One interesting statistic was that 75% of their customers’ first action is a search – and 50% of their customers’ second action is another search. They found a balance between the reduced cost of using a managed service that still allowed lots of flexibility for configuration.

It’s interesting to see how a company like Moody’s, which cares so much about being correct, approaches search relevance. They emphasized the same compliance concerns as other talks: when the stakes are so high, you want to know when you don’t know, and not just make stuff up.

Personalizing search using multimodal latent behavioral embeddings

Trey Grainger, one of the authors of AI-Powered Search, presented on how to leverage embeddings (which he aptly calls Thought Vectors) for personalized search. How do you represent user affinities? Should you personalize everything, returning Hello Kitty microwave ovens to someone who liked a Hello Kitty sticker two weeks ago? Trey gave good guidance on how to implement much needed personalization guardrails!

📺 On YouTube

Zucchini or Cucumber? Benchmarking embeddings

Our own Paul-Louis Nech demonstrated the importance of establishing benchmarks for your models by describing his own journey trying to benchmark an image similarity model for food. Was your benchmarking data set already used to train your model? Is it a popular data set that the model has been exposed to for other demos? Is it the right shape?

He gave some great advice around synthesizing your metrics based on what you want to know. He also stressed the importance of consistency when benchmarking (lock your model version, use consistent seeding for random functions, etc.).

Paul-Louis Nech presents about benchmarking image embeddings at Haystack US 2024

📺 On YouTube

Women of Search present Comparing AI-Augmented Information Retrieval Strategies

An informative talk by Audrey Lorberfeld from Pinecone. She presented a state-of-the-art overview of AI-augmented information retrieval – that is, mostly RAG. She covered many approaches to retrieval-augmented generation, e.g. Corrective RAG or Agentic RAG – yet she didn’t stop there, speaking just as much about under-the-hood techniques like reranking or fine-tuning of language models, without forgetting the queen of development practices: evaluation!

Women of Search group photo at Haystack US 2024
Women of Search group photo at Haystack US 2024 (photo by Chuck Meyer)

📺 On YouTube

Your Search Engine Needs a Memory!

We now know many ways to optimize our search. But what are we optimizing for? Stavros and Eric presented a new OpenSearch feature, User Behavior Insights, paving the way for systematic collection of view/click/conversion events in open source search projects – something similar to what Algolia customers can find in our Events API, which powers many of our current and future AI features!

📺 On YouTube

Why RAG Projects Fail, and How to Make Yours Succeed

Nesh CTO Collin Harman shared learnings from several recent RAG-implementation projects. Collin emphasized the importance of properly framing RAG (or really all AI) projects. There is still a perception of magic around these projects that must be dispelled through very frank discussion of the system you’re building and its limitations. Make sure you are solving an actual user problem. If your users are comfortable with the current UX, replacing it with a chatbot might be like removing the keys from a calculator.

📺 On YouTube

Measuring and Improving the R in RAG

Scott Stults from OpenSource Connections focused on the “retrieval” aspect of RAG with a final talk on the theme of testing. To him, the retrieval portion of RAG is the biggest limiting factor. Are the retrieved results relevant? And have you retrieved every relevant record from your corpus?

Scott also described the use of LLMs as a relevance judge, covered below in Key Topic Summary: LLMs and Search.

Search Query Understanding with LLMs: From Ideation to Production

Covered below in Key Topic Summary: LLMs and Search.

📺 On YouTube

Key Topic Summary: LLMs and Search

To wrap up, we’ll attempt to connect some LLM-related themes that were covered by multiple talks. We’ll cover two techniques that can improve the cost and performance of LLMs: Multi-Agent Workflows and Fine-tuning for Scale. Finally, we’ll theorize on how two LLM-powered features could improve the quality of the overall search experience: LLMs for Query Understanding and LLMs for Reranking.

Multi-Agent Workflows

Multiple talks hinted at a powerful concept known as the multi-agent workflow. In the multi-agent workflow, each LLM ‘agent’ is given unique instructions that allow it to specialize in solving a narrow task. The concept is simple: breaking a single complicated problem into multiple narrow subtasks, each with a specialized prompt, results in a better solution than trying to solve the whole problem with a single prompt. This is similar to the single responsibility principle.

Credit: Andrew Ng – What’s next for AI Agentic Workflows

Multi-agent workflows allow agents to autonomously converse among themselves and perform iterative refinements before returning a final output. A common pattern is to combine an ‘actor’ agent that generates content with a ‘critic’ agent that evaluates the quality of the actor’s generation.
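As a minimal sketch of the actor/critic pattern – the `call_llm(role, prompt)` function and the toy stand-in below are hypothetical placeholders for any real chat-completion client, not code from the talks:

```python
def actor_critic(task, call_llm, max_rounds=3):
    """Actor drafts, critic reviews, actor revises until approval."""
    draft = call_llm("actor", task)
    for _ in range(max_rounds):
        critique = call_llm("critic", f"Task: {task}\nDraft: {draft}")
        if critique == "APPROVED":
            break  # the critic is satisfied with the draft
        draft = call_llm("actor", f"Task: {task}\nDraft: {draft}\nFix: {critique}")
    return draft

# Toy stand-in for a real LLM client, just to make the loop runnable:
# the critic approves once the draft mentions "tests".
def toy_llm(role, prompt):
    if role == "critic":
        return "APPROVED" if "tests" in prompt else "Please mention tests."
    return prompt.split("Task: ")[-1].split("\n")[0] + " with tests"
```

Swapping `toy_llm` for a real client turns this into the iterative-refinement loop described above.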

Credit: Andrew Ng – What’s next for AI Agentic Workflows

As seen in the image above, a multi-agent GPT-3.5 workflow can outperform a single-agent (i.e. zero-shot) GPT-4 on a complicated task like coding.

More Agents Is All You Need
What’s Next for AI Agentic Workflows

Fine-tuning for Scale

In Search Query Understanding with LLMs: From Ideation to Production, Yelp’s Ali Rokni describes how his team uses fine-tuned cheap models like GPT-3.5 when deploying LLM-powered features to maintain quality while minimizing cost and latency.

Credit: Ali Rokni – Search Query Understanding with LLMs: From Ideation to Production

Ali reaffirms the notion supported by OpenAI that a model like GPT-3.5-turbo can match or outperform the capabilities of a state-of-the-art model like GPT-4 when fine-tuned on narrow tasks. This provides a good callback to multi-agent workflows, which allows broad multi-step problems to be broken down into narrow tasks.

To minimize the cost of LLM-powered features while maintaining quality, it’s possible to combine fine-tuning with a multi-agent workflow:

  1. Break your complex task into multiple sub-tasks that are as narrow as possible.
  2. Implement a multi-agent workflow by creating a specialized LLM agent for each sub-task and a method to aggregate the sub-tasks’ outputs into the final output.
  3. Build a fine-tuning dataset for each agent.
    • Start with 2K-5K examples that are either labeled by a state-of-the-art model like GPT-4 or a more reliable source.
    • Examples should aim to be maximally informative and represent the full range of expected inputs and outputs.
    • Humans can be used to selectively evaluate the accuracy of the training examples and can relabel if necessary.
  4. Use the datasets to fine-tune cheaper models like GPT-3.5-turbo.
  5. Evaluate the performance of the workflow when agents use their fine-tuned model.
  6. If performance is unsatisfactory, consider further narrowing each agent’s task or improving the fine-tuning datasets.
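As an illustration of step 3, labeled (input, output) pairs can be turned into the chat-format JSONL that OpenAI’s fine-tuning API accepts; the query-tagging agent and its system prompt here are hypothetical examples:

```python
import json

# Hypothetical agent: tags a search query with a product category.
AGENT_SYSTEM_PROMPT = "Tag the search query with its product category."

def to_finetune_records(examples):
    """examples: (query, category) pairs labeled by GPT-4 or human reviewers."""
    return [
        {
            "messages": [
                {"role": "system", "content": AGENT_SYSTEM_PROMPT},
                {"role": "user", "content": query},
                {"role": "assistant", "content": category},
            ]
        }
        for query, category in examples
    ]

def write_jsonl(records, path):
    """One JSON object per line: the format fine-tuning endpoints expect."""
    with open(path, "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")
```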

Feature: LLMs for Query Understanding

Query understanding involves techniques that require a significant amount of time and machine learning expertise to implement properly. However, as described in both Women of Search present Comparing AI-Augmented IR Strategies and Search Query Understanding with LLMs: From Ideation to Production, LLMs are making these techniques far easier to implement (though not low-latency or cheap).

Credit: Audrey Lorberfeld – Women of Search present Comparing AI-Augmented IR Strategies

Yelp’s use of LLMs for query tagging and scoping is particularly interesting due to its potential to increase precision.

Credit: Ali Rokni – Search Query Understanding with LLMs: From Ideation to Production

There are imaginative ways fine-tuning and multi-agent workflows could be applied to query understanding. Imagine that an agent responsible for query understanding receives a query that contains an unrecognized word. The word is the name of a brand that was released after the base model was trained. The query understanding agent could be linked to an agent capable of searching the internet to determine the likely meaning of the word. If identifying unknown words can be done reliably, it can be linked to fine-tuning to allow models to stay up-to-date.
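A toy sketch of that idea – the vocabulary, tag structure, and `lookup_agent` below are all hypothetical illustrations, not anything presented at the talks:

```python
# Hypothetical known vocabulary for a query-understanding system.
KNOWN_TERMS = {"red", "running", "shoes", "nike", "adidas"}

def find_unknown_terms(query):
    """Tokens the system cannot map to its vocabulary."""
    return [t for t in query.lower().split() if t not in KNOWN_TERMS]

def understand_query(query, lookup_agent):
    """Route each unknown term to a lookup agent (e.g. one that can
    search the internet) and collect its guessed meaning."""
    return {
        "unknown": {term: lookup_agent(term) for term in find_unknown_terms(query)}
    }
```

The resolved meanings could then be fed back into a fine-tuning dataset so the model stays current.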

Feature: LLMs for Reranking

Consider the ranking problem: given a query and a list of documents, what is the optimal ordering of the documents?

The ranking problem is typically approached in two steps:
1. First-pass ranking: Compute an initial ranking on a potentially large number of retrieved documents. Optimize for speed.
2. Reranking: Rerank a small number of top documents from the first-pass ranking, this time willing to trade speed to optimize for relevance.
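The two-step pattern above can be sketched generically; `fast_score` and `slow_score` are placeholder scoring functions (think BM25 for the first pass and an LLM or cross-encoder for the rerank):

```python
def rank(query, documents, fast_score, slow_score, rerank_depth=10):
    # First pass: cheap scoring over the full candidate set, optimized for speed.
    first_pass = sorted(documents, key=lambda d: fast_score(query, d), reverse=True)
    head, tail = first_pass[:rerank_depth], first_pass[rerank_depth:]
    # Second pass: expensive scoring on the top documents only,
    # trading speed for relevance.
    reranked = sorted(head, key=lambda d: slow_score(query, d), reverse=True)
    return reranked + tail
```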

Credit: Audrey Lorberfeld – Women of Search present Comparing AI-Augmented IR Strategies

Traditional reranking algorithms must compress query-document pair information into ranking signals (e.g. the number of matching keywords). If relevance-determining information is not captured in a signal, or if the relative importance of each signal is incorrectly defined, the ranking will likely not be optimal.

Using an LLM as a reranker has a key advantage: the LLM can use its strong natural language understanding capabilities to determine relevance on the query-document pair by using all available text instead of signals. This ensures that relevance-determining information is not lost and avoids the need to define the relative importance of signals. This provides a higher opportunity for achieving the optimal ranking.

Multiple talks at Haystack covered techniques that can be applied to improve the performance of LLMs used for reranking.

In Improving the R in RAG, Scott Stults recommended balancing relevance judgements from human experts (which he called the “gold standard”) against generative judgements from LLMs (which he called the “aluminum standard”). The key is to effectively instruct the LLM to follow the same method of reasoning used by the human experts. His process focused on starting with human judgements from both experts and non-experts, then teaching your LLM to reason about why they might disagree. Cohen’s Kappa can be used to evaluate the agreement between LLM judgements and subject matter expert judgements. The LLM instructions can be refined until agreement is maximized.
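Cohen’s Kappa for two judges can be computed in a few lines; this sketch assumes binary relevance labels and two judges rating the same documents:

```python
from collections import Counter

def cohens_kappa(judge_a, judge_b):
    """Agreement between two judges (e.g. an LLM and a human expert),
    corrected for the agreement expected by chance."""
    assert len(judge_a) == len(judge_b) and judge_a
    n = len(judge_a)
    observed = sum(a == b for a, b in zip(judge_a, judge_b)) / n
    counts_a, counts_b = Counter(judge_a), Counter(judge_b)
    # Chance agreement: probability both judges pick the same label at random.
    expected = sum(
        (counts_a[label] / n) * (counts_b[label] / n)
        for label in counts_a.keys() | counts_b.keys()
    )
    return (observed - expected) / (1 - expected)  # undefined if expected == 1
```

A kappa near 1 means the LLM mirrors the experts; near 0 means it agrees no better than chance – a signal to refine the instructions.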

Individual human subject matter experts use unique reasoning methods and sometimes disagree on a document’s relevance. A multi-agent approach could be used in which each agent is instructed to judge relevance using a unique expert reasoning method, with the final relevance being assigned via a vote.

The Improving the R in RAG talk also noted that when LLMs are used for reranking, better context yields better generation. The raw query text is frequently not enough context to determine the optimal ranking. You must also have a full understanding of the intent behind the query. The advanced query understanding features described in the LLMs for Query Understanding section could be passed as context to a multi-agent LLM reranker which has been trained to act like a team of ranking subject matter experts.

There’s an open PR to add a trained LLM to the Quepid judgment tool.
Large Language Models can Accurately Predict Searcher Preferences

On to the next Haystack…

Two Algolians celebrating at the Haystack Reception

And that’s it from this year’s Haystack conference here in the US. We’ll see you this fall at Haystack EU!

About the authors

Chuck Meyer
Sr. Developer Advocate

Paul-Louis Nech
Senior ML Engineer

Tyler Butler
Software Engineer, Search Ranking

Steven Evans
Software Engineer, Search Ranking
