The week of Monday, April 22nd saw the Haystack US conference descend on Charlottesville, Virginia. OpenSource Connections hosts this annual conference for search engineers and adjacent enthusiasts within the industry, and we at Algolia were excited to both attend and present.
This year, RAG was all the rage. Retrieval-augmented generation is a clever way to avoid the hallucinations and inaccuracies of general-use LLMs by restricting content generation to your specific corpus of data. But even then, there are still challenges around testing those responses. We heard several novel approaches to testing RAGs and semantic search implementations.
That said, RAG wasn’t the only topic on the agenda. Relevance is always top of mind at Haystack, and we heard about clever ways people are improving ranking across data sources.
Here’s a quick recap and some takeaways from this year’s Haystack US presentations.
In lieu of a keynote, OpenSource Connections' Charlie Hull gave a truly heartfelt welcome and a reminder that your search system is more than algorithms: it's mostly made by people.
From the people who made search history to your own search team, from the engineers building new methods to the PMs applying them to real user problems, Charlie reminded us to keep the human at the heart of what we do, so that we can delight the humans we build for: the customers & end-users looking to achieve more and live happier lives thanks to search systems.
“Take care of your people and they’ll take care of your search!”
This message resonated throughout the week as the open source ethos of collaboration touched every conversation, even among those of us outside the open source space. Our resident developer advocate appreciated Charlie’s jibe at developer relations “elves” and their abundant energy – a good reminder to listen and collaborate, and that not everyone has the same social battery.
Bold claim! With this provocative title, John Solitario actually means: "All vector search applications that have good UX leverage hybrid approaches". From hybrid vector+lexical search to "metadata filters on top of vectors", John presented several hybrid approaches that enhance your user experience beyond the raw power of vectors.
Jeff Capobianco of Moody’s Search Engineering team gave some insights on building RAG solutions and making sure your AI “sticks to the facts.” Jeff pointed out that LLMs may be the most expensive tools you run at your company – at least for now. He recommends balancing accuracy against cost. Depending on your use case, you may not need the newest, most expensive models to get what you need. Use guardrails to ensure your LLMs are returning answers based on the RAG data and not generic model responses.
Jeff also emphasized testing, and had some novel ideas around using LLMs as proxies for judges when evaluating results. You compare the judgements (and explanations of those judgements) from the AI to those from your experts. Pro tip – always have your AI explain its judgment before it makes it, so it isn’t just a post-facto justification!
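As a rough illustration of that tip (not Jeff's actual code; `call_llm` here is a hypothetical helper that sends a prompt to your LLM of choice and returns the text completion), an explain-then-grade judge prompt might look like this:

```python
# Illustrative LLM-as-judge sketch. call_llm() is a hypothetical helper.
JUDGE_PROMPT = """You are judging search results like a subject matter expert.
Query: {query}
Document: {document}

First, explain step by step why this document is or is not relevant to the query.
Then, on the last line, output only a grade from 0 (irrelevant) to 3 (perfect).
"""

def judge(query: str, document: str) -> tuple[str, int]:
    # Ask for the explanation BEFORE the grade, so the reasoning drives the
    # judgment instead of justifying it after the fact.
    response = call_llm(JUDGE_PROMPT.format(query=query, document=document))
    *explanation, grade_line = response.strip().splitlines()
    return "\n".join(explanation), int(grade_line.strip())
```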
A lovely three-presenter talk on putting LTR (learning to rank) into production, or how, even at big companies like Reddit, turning state-of-the-art models into live production services takes more than you'd think (and may even take down a few systems while you learn what those changes imply). It was nice to see the nitty-gritty parts of AI engineering: handling cache issues, and testing hypotheses around data selection with good old A/B testing ("nDCG goes up offline? Ship it to an A/B test. Metrics go up there too? Ship it!").
Testing for generative AI tools was a huge topic this year. A/B testing and judgements by subject matter experts are still considered the gold standard (shout out to Quepid!), but several new and innovative approaches were also presented. In this talk, Douglas Rosenoff from Lexis Nexis walked us through how they test AI-generated document summaries across their 3.8 petabytes of global content. Their goal was to build trustworthy tools to simplify the work process.
Since summaries can vary drastically with small changes to their RAG pipeline, their focus is comparing the ratings of the summaries, rather than the summaries themselves. Their legal experts rank each summary across seven criteria, and they look for consistency across those rankings. I thought this was a great way to bring objectivity into a subjective process.
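As a toy illustration (not LexisNexis's actual methodology, and with made-up criterion names), comparing per-criterion ratings across two pipeline variants could look something like this:

```python
# Toy example: compare expert ratings of summaries from two RAG pipeline
# variants per criterion, rather than diffing the summary text itself.
from statistics import mean

CRITERIA = ["accuracy", "completeness", "clarity"]  # illustrative, not the real seven

def criterion_deltas(ratings_a: list[dict], ratings_b: list[dict]) -> dict:
    """Each ratings_* is a list of {criterion: score} dicts, one per expert.
    Returns the absolute difference in mean score for each criterion."""
    return {
        c: abs(mean(r[c] for r in ratings_a) - mean(r[c] for r in ratings_b))
        for c in CRITERIA
    }

# A large delta on any criterion flags a pipeline change worth investigating.
```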
Buckle up for a wild history ride! Lucian Precup from Adelean.com shared 22 years of Lucene: how this venerable lexical search engine (which today powers giants like Elastic/OpenSearch/Solr/MongoDB Atlas) introduced many concepts, from inverted indices to the ["Search syntax" we all know AND are "used to" OR "at least familiar with"].
Lucene's long history doesn't make it ancient, though. It has come a long way (native support for actual numeric types rather than pseudo-numbers is relatively recent!), and since 2022 Lucene has offered modern features like dense semantic vector embeddings, ANN search, and sparse vectors, with RAG on the way. Lucian even showed us a glimpse of that future with a demo of a chat assistant using Lucene-powered RAG to ground its answers in our documents' reality!
Principal Software Engineer Kathleen DeRusso of Elastic also took us on a trip down memory lane, reminding us that TF/IDF is as old as the movie The Godfather and even BM25 is 30 years old. New tools get balanced against old as we carefully incorporate semantic search into our existing keyword stacks.
Kathleen pointed out the benefits and challenges (cough, cough, pagination) of going hybrid, and some traps to avoid as you balance and merge results. Re-ranking is your friend here, but avoid tuning for "pet" queries. My favorite takeaway: it's OK to keep manually crafting head queries and pinning results; focus your tuning on the tail.
Istvan Simon from PrefixBox asks: sure, vectors are much better in theory, but how much better are they in practice?
It was a nice presentation walking us toward concrete answers to that reality check, covering everything from data collection to how the vectors are generated, along with some interesting edge cases.
Sometimes the old ways are the best ways. Senior software engineer Joelle Robinson gave a great talk on how Moody's improved their federated search experience for their users. They have seven unique data sources, all using different retrieval algorithms. How do you aggregate results when a user searches across everything without getting tangled in conflicting boosts? Their solution was to keep it simple with a weighted round-robin: grab records from the most popular sources first, then layer in records from the other data sources. I found this solution interesting, although I tend to prefer multi-column approaches to federation.
Great talk by AJ Wallace on multimodal search, but not with images: how to search for audio clips based on text queries. He works at Splice, a marketplace for audio samples. But how do you handle users who'd like "something like this drum break sample, but not quite the same"?
The answer is multimodal search: using audio embeddings from CLAP to retrieve relevant sounds.
He showed a Similar Sounds recommendation use case on Splice's website (demonstrating the value of item-to-item recommendations, such as what LookingSimilar provides, beyond textual data!), plus a live demo of a search and indexing system where the audience made duck quacks (what a fun way to re-energize a post-lunch room!) and we retrieved them by searching for "ducks quacking"!
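For the curious, here's a rough sketch of that kind of text-to-audio retrieval using the open CLAP checkpoints on Hugging Face; this is purely illustrative, not Splice's production stack:

```python
# Rough sketch of text-to-audio retrieval with CLAP embeddings, using an open
# Hugging Face checkpoint for illustration (not Splice's production stack).
import torch
from transformers import ClapModel, ClapProcessor

model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

def embed_audio(waveform, sampling_rate=48_000):
    # Index time: turn each audio sample into an embedding.
    inputs = processor(audios=waveform, sampling_rate=sampling_rate, return_tensors="pt")
    return model.get_audio_features(**inputs)

def embed_text(query: str):
    # Query time: embed the text query into the same space.
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    return model.get_text_features(**inputs)

def score(query_embedding, audio_embedding) -> float:
    # Rank samples by cosine similarity between query and audio embeddings.
    return torch.nn.functional.cosine_similarity(query_embedding, audio_embedding).item()
```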
Such high-quality lightning talks, lots to unpack here! A few highlights:
We also found Lucian Precup's collaborative search engine all.site quite clever; it now offers GenAI to chat with your technical docs!
This was actually extra fun for Chuck Meyer because he did some consulting at Moody's about a decade ago. It was cool to see how far they've come in embracing cloud and managed services. Even old school companies need state-of-the-art search! One interesting statistic: 75% of their customers' first action is a search, and 50% of their customers' second action is another search. They found a balance, getting the reduced cost of a managed service while keeping lots of flexibility for configuration.
It's interesting to see how a company like Moody's, which cares so much about being right and correct, approaches search relevance. They emphasized the same compliance concerns as other talks: when the stakes are this high, you want to know when you don't know, not just make stuff up.
Trey Grainger, one of the authors of AI-Powered Search, presented on how to leverage embeddings (which he aptly calls Thought Vectors) for personalized search. How do you represent user affinities? Should you personalize everything, returning Hello Kitty microwave ovens to someone who liked a Hello Kitty sticker two weeks ago? Trey gave good guidance on how to implement much needed personalization guardrails!
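To make the idea concrete, here's an illustrative sketch (not Trey's implementation) of building a user affinity vector from item embeddings with a recency decay, so that two-week-old Hello Kitty sticker doesn't dominate every query:

```python
# Illustrative sketch: a user affinity vector built by averaging embeddings of
# items the user engaged with, weighted by an exponential recency decay.
import numpy as np

def affinity_vector(item_embeddings: np.ndarray, ages_in_days: np.ndarray,
                    half_life_days: float = 7.0) -> np.ndarray:
    weights = 0.5 ** (ages_in_days / half_life_days)    # exponential recency decay
    vec = (weights[:, None] * item_embeddings).sum(axis=0) / weights.sum()
    return vec / np.linalg.norm(vec)                    # unit-normalize for cosine search

# Guardrail: only blend this vector into ranking when the query is ambiguous,
# rather than personalizing every single result.
```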
Our own Paul-Louis Nech demonstrated the importance of establishing benchmarks for your models by describing his own journey trying to benchmark an image similarity model for food. Was your benchmarking data set already used to train your model? Is it a popular data set that the model has been exposed to for other demos? Is it the right shape?
He gave some great advice around synthesizing your metrics based on what you actually want to know. He also stressed the importance of consistency when benchmarking (lock your model version, use consistent seeding for random functions, etc.).
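In code, that consistency advice boils down to something as simple as this (an illustrative setup, with a made-up model identifier):

```python
# Illustrative reproducibility setup: pin the model version and fix every
# source of randomness before running a benchmark.
import random
import numpy as np
import torch

MODEL_VERSION = "food-similarity-v1.2.0"  # hypothetical pinned identifier
SEED = 42

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
```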
Informative talk by Audrey Lorberfeld from Pinecone. She presented a state-of-the-art overview of AI-Augmented Information Retrieval, that is, mostly RAG. She covered many approaches to Retrieval-Augmented Generation, e.g. Corrective RAG and Agentic RAG, but she didn't stop there, talking just as much about under-the-hood techniques like reranking and fine-tuning of language models, without forgetting the queen of development practices: evaluation!
We now know many ways to optimize our search. But what are we optimizing for? Stavros and Eric presented a new OpenSearch feature, User Behavior Insights, paving the way to enable systematic collection of view/click/conversion events in OpenSource search projects – something similar to what Algolia customers can find in our Events API, which powers many of our current and future AI features!
Nesh CTO Collin Harman shared learnings from several recent RAG-implementation projects. Collin emphasized the importance of properly framing RAG (or really all AI) projects. There is still a perception of magic around these projects that must be dispelled through very frank discussion of the system you’re building and its limitations. Make sure you are solving an actual user problem. If your users are comfortable with the current UX, replacing it with a chatbot might be like removing the keys from a calculator.
Scott Stults from OpenSource Connections focused on the “retrieval” aspect of RAG with a final talk on the theme of testing. To him, the retrieval portion of RAG is the biggest limiting factor. Are the retrieved results relevant? And have you retrieved every relevant record from your corpus?
Scott also described the use of LLMs as a relevance judge, covered below in Key Topic Summary: LLMs and Search.
To wrap up, we’ll attempt to connect some LLM-related common themes that were covered by multiple talks. Two techniques which can improve the cost and performance of LLMs will be covered: Multi-Agent Workflows and Fine-tuning for Scale. Finally, we’ll theorize on how two LLM-powered features could improve the quality of the overall search experience: LLMs for Query Understanding and LLMs for Reranking.
Multiple talks hinted at a powerful concept known as the multi-agent workflow. In the multi-agent workflow, each LLM ‘agent’ is given unique instructions that allow it to specialize in solving a narrow task. The concept is simple: breaking a single complicated problem into multiple narrow subtasks, each with a specialized prompt, results in a better solution than trying to solve the whole problem with a single prompt. This is similar to the single responsibility principle.
Multi-agent workflows allow agents to autonomously converse among themselves and perform iterative refinements before returning a final output. A common pattern is to combine an ‘actor’ agent that generates content with a ‘critic’ agent that evaluates the quality of the actor’s generation.
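Here's a minimal sketch of that actor/critic loop; `call_llm` is a hypothetical helper and the prompts are ours for illustration, not from any specific talk or framework:

```python
# Minimal actor/critic multi-agent loop. call_llm(system, prompt) is a
# hypothetical helper that returns the model's text completion.
ACTOR_SYSTEM = "You write Python functions that satisfy the given specification."
CRITIC_SYSTEM = ("You review code against the specification. Reply APPROVED if "
                 "it is correct; otherwise list the concrete problems.")

def actor_critic(spec: str, max_rounds: int = 3) -> str:
    draft = call_llm(ACTOR_SYSTEM, spec)
    for _ in range(max_rounds):
        review = call_llm(CRITIC_SYSTEM, f"Specification:\n{spec}\n\nCode:\n{draft}")
        if review.strip().startswith("APPROVED"):
            break
        # Feed the critique back to the actor for another refinement pass.
        draft = call_llm(ACTOR_SYSTEM,
                         f"{spec}\n\nRevise the code below. Feedback:\n{review}\n\n{draft}")
    return draft
```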
As seen in the image above, a multi-agent GPT-3.5 workflow can outperform a single-agent (i.e. zero-shot) GPT-4 on a complicated task like coding.
References:
More Agents Is All You Need
What’s Next for AI Agentic Workflows
In Search Query Understanding with LLMs: From Ideation to Production, Yelp’s Ali Rokni describes how his team uses fine-tuned cheap models like GPT-3.5 when deploying LLM-powered features to maintain quality while minimizing cost and latency.
Ali reaffirms the notion, supported by OpenAI, that a model like GPT-3.5-turbo can match or outperform the capabilities of a state-of-the-art model like GPT-4 when fine-tuned on narrow tasks. This is a good callback to multi-agent workflows, which allow broad, multi-step problems to be broken down into narrow tasks.
To minimize the cost of LLM-powered features while maintaining quality, it's possible to combine fine-tuning with a multi-agent workflow: give each narrow subtask its own small, fine-tuned model instead of sending every step to a single large, general-purpose model.
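In practice, that combination might look like a small dispatcher that routes each narrow task to its own fine-tuned model; the model identifiers below are placeholders, not real deployments:

```python
# Hypothetical dispatcher: each narrow agent task is served by a small model
# fine-tuned for that task, instead of sending every step to one large model.
# The model identifiers are placeholders, not real deployments.
AGENT_MODELS = {
    "query_tagging": "gpt-3.5-turbo-ft-query-tagging",
    "summarization": "gpt-3.5-turbo-ft-summaries",
    "judging":       "gpt-3.5-turbo-ft-relevance-judge",
}

def run_agent(task: str, prompt: str) -> str:
    model = AGENT_MODELS.get(task, "gpt-3.5-turbo")  # fall back to the base model
    return call_llm(prompt, model=model)             # hypothetical helper
```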
Query understanding involves techniques that require a significant amount of time and machine learning expertise to implement properly. However, as described in both Women of Search present Comparing AI-Augmented IR Strategies and Search Query Understanding with LLMs: From Ideation to Production, LLMs are making these techniques far easier to implement (though not low-latency or cheap).
Yelp’s use of LLMs for query tagging and scoping is particularly interesting due to its potential to increase precision.
There are imaginative ways fine-tuning and multi-agent workflows could be applied to query understanding. Imagine that an agent responsible for query understanding receives a query that contains an unrecognized word. The word is the name of a brand that was released after the base model was trained. The query understanding agent could be linked to an agent capable of searching the internet to determine the likely meaning of the word. If identifying unknown words can be done reliably, it can be linked to fine-tuning to allow models to stay up-to-date.
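A speculative sketch of that scenario, where `call_llm`, `web_search`, and `load_known_terms` are hypothetical helpers, not anyone's actual API:

```python
# Speculative sketch of a query understanding agent that delegates unknown
# terms to a search-capable agent. All helpers here are hypothetical.
KNOWN_VOCAB = load_known_terms()  # e.g. brands/entities seen at training time

def understand_query(query: str) -> str:
    unknown = [tok for tok in query.split() if tok.lower() not in KNOWN_VOCAB]
    context = ""
    if unknown:
        # Delegate unfamiliar terms to a search-capable agent before tagging.
        snippets = web_search(" ".join(unknown))
        context = f"Background on unfamiliar terms:\n{snippets}\n\n"
    prompt = (f"{context}Tag this search query with its intent, category, and "
              f"brand as JSON: {query}")
    return call_llm(prompt)
```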
Consider the ranking problem: given a query and a list of documents, what is the optimal ordering of the documents?
The ranking problem is typically approached in two steps:
1. First-pass ranking: Compute an initial ranking on a potentially large number of retrieved documents. Optimize for speed.
2. Reranking: Rerank a small number of top documents from the first-pass ranking, this time trading some speed to optimize for relevance.
Traditional reranking algorithms must compress query-document pair information into ranking signals (e.g. the number of matching keywords). If relevance-determining information isn't captured in a signal, or if the relative importance of each signal is defined incorrectly, the ranking will likely not be optimal.
Using an LLM as a reranker has a key advantage: the LLM can use its strong natural language understanding to judge the relevance of each query-document pair from all of the available text instead of from signals. This ensures that relevance-determining information isn't lost, avoids the need to define the relative importance of signals, and gives a better shot at reaching the optimal ranking.
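A bare-bones sketch of what prompt-based LLM reranking over the top-k first-pass results could look like (again using a hypothetical `call_llm` helper):

```python
# Bare-bones prompt-based LLM reranker over the top-k first-pass results.
# call_llm() is the same hypothetical helper as in the earlier sketches.
RERANK_PROMPT = """Query: {query}

Document:
{document}

On a scale of 0 to 3, how relevant is this document to the query?
Answer with only the number."""

def llm_rerank(query: str, top_docs: list[str]) -> list[str]:
    scores = [
        int(call_llm(RERANK_PROMPT.format(query=query, document=doc)))
        for doc in top_docs
    ]
    # Stable sort by descending score: ties keep their first-pass order.
    return [doc for _, doc in sorted(zip(scores, top_docs), key=lambda pair: -pair[0])]
```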
Multiple talks at Haystack covered techniques that can be applied to improve the performance of LLMs used for reranking.
In Improving the R in RAG, Scott Stults recommended balancing relevance judgements from human experts (which he called the “gold standard”) against generative judgements from LLMs (which he called the “aluminum standard”). The key is to effectively instruct the LLM to follow the same method of reasoning used by the human experts. His process focused on starting with human judgements from both experts and non-experts, then teaching your LLM to reason about why they might disagree. Cohen’s Kappa can be used to evaluate the agreement between LLM judgements and subject matter expert judgements. The LLM instructions can be refined until agreement is maximized.
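Measuring that agreement is straightforward with scikit-learn's Cohen's Kappa implementation; the grades below are made up for illustration:

```python
# Measuring LLM vs. expert agreement with Cohen's Kappa (scikit-learn).
from sklearn.metrics import cohen_kappa_score

expert_grades = [3, 2, 0, 1, 3, 2]   # subject matter expert judgements
llm_grades    = [3, 2, 1, 1, 3, 2]   # LLM judgements on the same query/doc pairs

kappa = cohen_kappa_score(expert_grades, llm_grades)
print(f"Agreement (Cohen's kappa): {kappa:.2f}")
# Iterate on the judge prompt until the agreement stops improving.
```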
Individual human subject matter experts use unique reasoning methods and sometimes disagree on a document’s relevance. A multi-agent approach could be used in which each agent is instructed to judge relevance using a unique expert reasoning method, with the final relevance being assigned via a vote.
The Improving the R in RAG talk also noted that when LLMs are used for reranking, better context yields better generation. The raw query text is frequently not enough context to determine the optimal ranking. You must also have a full understanding of the intent behind the query. The advanced query understanding features described in the LLMs for Query Understanding section could be passed as context to a multi-agent LLM reranker which has been trained to act like a team of ranking subject matter experts.
References:
There’s an open PR to add a trained LLM to the Quepid judgment tool.
Large Language Models can Accurately Predict Searcher Preferences
And that’s it from this year’s Haystack conference here in the US. We’ll see you this fall at Haystack EU!
Chuck Meyer, Sr. Developer Advocate
Paul-Louis Nech, Senior ML Engineer
Tyler Butler, Software Engineer, Search Ranking
Steven Evans, Software Engineer, Search Ranking