We’re back in sunny Paris and Nice after a week in the German capital focused on the Berlin Buzzwords and MICES conferences. From think pieces discussing the politics of open technologies or arguing that LLMs are not a paradigm shift, to technical talks on semantic search, RAG, and image understanding, there was a lot of content to unpack from these conferences!
We had two Algolians attending: Paul-Louis Nech and Raed Chammam, who together presented a talk at MICES on Use Cases for Visual Recommendations.
Here is what we learned from the talks we attended, and our thoughts on where the Ecommerce & Search ecosystems are heading in 2024.
Our trip began with Berlin Buzzwords, “Germany’s most exciting conference on storing, processing, streaming and searching large amounts of digital data”.
With roots in a Berliner hackerspace and a focus on open source software projects, Buzzwords is a space where innovation and digital freedom cohabit to bring you perspectives on how new technologies can serve real human needs.
The conference started with a keynote from Zuzanna Warso on The Paradox of Open, asking: “Can Digital Commons Offer a Way Forward?”
Zuzanna walked us through the history of the internet: how it emerged as digital commons, with collectively created & managed resources open to the public. In the last decade, she argues the digital landscape has changed in nature: from a world where technology serves democracy and public investment democratizes new technologies, we’re entering an era of extractive business practices where many organizations take more than they give to the commons.
Like many digital activists, Zuzanna Warso believes that the freedom to interact with tech & projects without seeking anyone’s permission is the cornerstone of our ecosystem – to her that freedom was the foundation of the technologies we can all freely enjoy today, from the open web to emails and much more.
This is the Paradox of Open according to Zuzanna: the openness of technology is both a challenge to and an enabler of the concentration of power in the digital space.
Within this new reign of deep learning & Generative AI, she argues we must look at the power structures around our technologies, and develop tech while thinking about its societal impact.
The example of Exposing.ai, a project showing how your personal pictures may have been used to develop face recognition technology, is a good cautionary tale on reckless use of personal data.
On that, Zuzanna argues that we need new licenses and new laws to protect and foster privacy & innovation, to match the stakes of our unique time where more and more people can build powerful AI applications. She highlighted how Creative Commons licenses are not enough to protect individual privacy and address the research risks of AI development, and mentioned the AI Open Letter which states that if we collectively want human creativity to carry on, artists need new protections to ensure that they can still live decently from their creative work.
Her keynote ended on a positive note: although she describes today’s internet as an “online shopping mall”, with surveillance capitalism companies using any personal data they can grab to surveil your every step online, there is a growing movement working to “rewild the internet”: we see new regulation (GDPR, Digital Markets Act, AI Act…) defending civil liberties and the right to competition, and there’s a rising awareness in the public sphere that “business as usual” would lead us astray.
Historically the public sector played a huge role in building our digital world, which then got fragmented and left to market forces; Zuzanna believes the recent awareness of the dangers of this power concentration (corrosive effects on the economy, mental health crisis, etc) allows a new paradigm to start emerging, and investments in our public digital infrastructure to rise again!
After such a big-picture talk on the challenges and opportunities of AI in our exciting times, we were ready to dive into more technical content.
Some talks brought a historical perspective on the new era of LLM-based applications.
We saw a new approach to Query Expansion using LLMs, presented by Ilaria Petreti & Anna Ruggero – they developed a system that turns natural language queries into Solr filters matching a JSON record representation.
This was a very cool talk, both for the goal they set out to reach and for the methods used: leveraging an LLM to turn messy natural language inputs into structured data is a very promising application of those models. Their approach overcomes the limitations of lexical matching, significantly improving the recall of the resulting filters. The icing on the cake? The LLM provides explainability for the selected filters, with a reasoning on why each one is used!
There are of course limitations to this approach: the higher recall comes at a cost in precision, identifying the relevant field to filter on can be challenging when several facets have shared values, and you can get inconsistencies due to hallucinations. Some of these can be solved by refining the input search metadata (e.g. using human-readable field names), or with more advanced prompt engineering: they showed how using the DSPy framework lets you structure your prompts and test them with actual metrics and checks, opening the way to few-shot prompting, fine-tuning for disambiguation, and many more improvements to their prompts.
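To illustrate the general pattern (this is a minimal sketch, not the speakers’ actual implementation), the core of such a system is a prompt asking the model for structured JSON, plus a parsing step that drops any hallucinated fields — here `fake_response` stands in for a real LLM call:

```python
import json

def build_prompt(query: str, fields: list[str]) -> str:
    """Ask the model to map a natural language query onto known filter fields."""
    return (
        "You are a search assistant. Given the user query below, return a JSON "
        "object mapping filter fields to values, plus a short reasoning.\n"
        f"Available fields: {', '.join(fields)}\n"
        f'Query: "{query}"\n'
        'Answer with JSON like {"filters": {...}, "reasoning": "..."}'
    )

def parse_filters(llm_output: str, allowed_fields: set[str]) -> dict:
    """Parse the model's JSON answer, keeping only fields that really exist."""
    data = json.loads(llm_output)
    return {k: v for k, v in data.get("filters", {}).items() if k in allowed_fields}

# Canned model answer for demonstration (a real system would send build_prompt(...) to an LLM):
fake_response = '{"filters": {"color": "red", "brand": "Acme", "made_up": 1}, "reasoning": "color and brand mentioned"}'
print(parse_filters(fake_response, {"color", "brand", "price"}))
# {'color': 'red', 'brand': 'Acme'}
```

Note how the whitelist in `parse_filters` is a cheap guardrail against the hallucinated-field problem mentioned above.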
They concluded the talk with User Experience considerations: it’s cool to build an LLM application, but if you want users to make the most of it, designing the UX of your system is paramount!
As a counterpoint, William Benton gave a provocative presentation titled “LLMs are not a paradigm shift”. He presented a few historical examples where scientific paradigms truly changed, like the Copernican revolution or how germ theory replaced the theory of miasma.
That last example shows how we can be right and wrong at the same time: when experts recommended fresh-air cures to eliminate diseases they thought were caused by miasmas, they were confusing symptoms for causes: “when things are correlated you might do the right thing for the wrong reason”.
Here, the paradigm shift to germ theory meant revisiting all prior knowledge: things we thought we knew might no longer be useful.
Are there examples of historical paradigm shifts in computer science? GUIs were one, where most programming interfaces moved from console to graphical user interfaces.
Another one is statistical NLP: today most approaches to natural language processing have some statistical aspects, and “purely symbolic NLP” is dead.
Contrast this with our new era of LLM applications. William argues that text generators are nothing new (see Markov chains) and that most of the old knowledge still applies when building systems with these new tools. He takes three supposedly novel aspects of LLMs – “having potential for misuse”, “needing additional context beyond the query to be useful”, and “not needing feature engineering anymore” – and argues that none of them is new. All ML systems have always needed guardrails (recommender systems can also “hallucinate” and suggest, say, items not in stock); most ML systems are more useful when connected to extra context (fraud detection algorithms also need context beyond the raw financial data points!); and most ML models are likewise affected by how you represent features (with one-hot encoding, for instance, different features are optimal than with other approaches).
William concludes his talk by arguing that we should see LLMs as Search/Recommender systems: an end-to-end engineering solution which will require good old engineering methods.
He recommends focusing on data preparation and feature engineering over fancy models; when working with Language Models, choosing prompting over fine-tuning and fine-tuning over pre-training from scratch; and, in general, keeping things simple and iterating pragmatically!
We can see Retrieval Augmented Generation, or RAG, becoming more and more popular: a year ago most people were experimenting with RAG as a solution to avoid hallucinations with LLMs; today many are putting it in production and sharing tips – such as Zain Hasan, who presented a talk on the different techniques you can use in an advanced RAG setup. We won’t try to summarize his talk here: with the many techniques and in-context tips Zain mentioned, it’s worth watching the full video if you want to learn from his experience!
To conclude on Buzzwords, let’s remember that developing AI-based products is not only a technical task: building successful AI projects requires solid collaboration between engineers, devops, designers, and product managers.
It was thus lovely to see the human & collaborative aspects mentioned in several talks.
Shout-out to The C in CI is not for “Closed”, where Josh Reed argues that for all the talk about being open, our CI/CD practices are still very secretive and would benefit from more knowledge-sharing and standardized tools.
Last but not least, it was refreshing to listen to Saahil Ognawala present his Open-Source GenAI Product Manager blueprint: sharing his high-level perspective on the ecosystem and its evolution over recent years, Saahil gave several tips on navigating the competitive space of GenAI products, being aware of the impact of different monetization strategies, and generally setting your products up for success.
What a nice conclusion to this tech conference, reminding us of the key role PMs have working with their engineering teams 🤝
On Wednesday we headed to MICES: Mixed-Camp Ecommerce Search, a conference bringing together participants from a variety of backgrounds, all sharing a common interest in e-commerce search.
Organized by René Kriegler from OpenSource Connections, this conference is more intimate than Buzzwords: with around a hundred participants in-person and more attending online, MICES allowed for deeper connections with the people of search, and offered space for in-depth conversations on the tools and methods that experts use to build great ecommerce experiences.
René opened the floor with a provocative intro: Not everyone will be taken into the future.
This is the name of an artwork by Ilya and Emilia Kabakov presented at the Tate Modern Gallery, showing a train departing to the future and various artworks left behind.
The piece asks: “What will happen to artists and their works in the very near, and not so near, future? How will they be accepted and understood?”
Likewise, Charlie Hull invites us to pause and ask: GenAI and shiny new technologies are a train everybody’s rushing to jump on. “This is the future – but who’s not getting in?”
We know that every technological revolution can result either in a more level playing field or in new concentration of power, sometimes resulting in new empires or even monopolies.
How do we distribute the future more evenly? How do we make sure people are not left behind by our technological progress?
Charlie Hull offers a potential answer, the open-source tech community’s approach: share knowledge and learn from each other as openly and as much as we can!
You could say the MICES crowd took this message to heart: for a one-day conference, the agenda was packed with insightful talks, experience reports, and solid think pieces.
First we saw Doug Turnbull, a.k.a Software Doug, give his take on how you should be Planning your E-commerce development work.
Every quarter, we face the same question: so many options (PM ideas, a new research paper, a hot topic to implement), but which one should we build?
Doug suggests a few mental models to answer this question. At Reddit, gap analysis lets them understand weakly performing queries and find solutions that address several at once. It’s a nice method, as long as one remembers that correlation is not causation; keep in mind the limits of what you can learn with this approach.
A more traditional tool is offline judgment lists: define a metric to evaluate quality (via clicks, human labelers, LLMs-as-a-judge, you name it) and ultimately compute a relevance grade which can then be used for statistical checks (computing nDCG, etc).
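For instance, given the relevance grades of a judgment list for one query’s result page, nDCG@k can be computed in a few lines (a generic sketch, not tied to any particular engine or talk):

```python
import math

def dcg(grades: list[float], k: int) -> float:
    """Discounted cumulative gain over the top-k results (log2 position discount)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(grades[:k]))

def ndcg(grades: list[float], k: int = 10) -> float:
    """DCG normalized by the ideal (sorted-descending) ordering, in [0, 1]."""
    ideal = dcg(sorted(grades, reverse=True), k)
    return dcg(grades, k) / ideal if ideal > 0 else 0.0

# Grades for the top results of one query, as produced by clicks, labelers, or an LLM judge:
print(round(ndcg([3, 2, 3, 0, 1], k=5), 3))
# 0.972
```

A perfectly ordered result page scores 1.0; ranking mistakes near the top cost more than mistakes further down, which is exactly why this metric is popular for search evaluation.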
Measure as soon as you can! As Doug says: “Smarter teams test earlier in the process. Often we do offline tests: check an idea before it launches”.
His conclusion is a call to action for developers, so that your code is easy to test, measure, and iterate: “Make your code prototypable!”
Next was a talk by two speakers: Ruchi Juneja (MediaMarktSaturn) & Johannes Peter (Principal Search Consultant), on Vectorizing Consumer Electronics Goods. They set out to answer a big question: What can you do to analyze and understand your many zero-results queries?
They start from a manual analysis of a sample of those queries: roughly 32% involve a Series (model number misspelled, specific number not available, …), 24% are due to Semantics (generic terms not in the product data, e.g. “small”/“comfy”, different namings, synonyms, …), 11% are spelling mistakes, and the rest are different languages or assortment issues (not in stock).
In theory, Vector Search could solve 4 out of 5 of these problems!
The talk describes their several iterations: from a first vector search implementation and its offline evaluation, to a second iteration with Entity Recognition and a dynamic number of features per category (e.g. drones have more specific features than groceries), to a third iteration building their final MVP with a model trained from scratch (Masked Language Modeling + dynamic masking during training + query-product tuples). A nice, dynamic presentation of an iterative development process, which is crucial when you build custom AI models from scratch and want to create value with them!
Continuing on the theme of vector implementations which may or may not work, Roman Grebennikov shared a thought-provoking talk: “How semantic search projects fail.”
You see many talks from vendors these days saying vector search “Just Works ™️” and simply is the future. But Roman says: “I blame survivorship bias!” You simply don’t hear about all the failed projects.
Indeed, relevance is subjective: it depends on the user’s intent. Embedding models have no idea about your audience! This means optimizing for theoretical IR metrics might not bring the customer value you are striving to build.
Roman argues that relevance tuning, even with vectors, is the same old loop: define relevance labels, tinker with the retrieval setup, iterate until the first page looks good to you. He even coined a new metric for it: LGTM@10 🙈
There’s also the case of non-English search: the MTEB leaderboards are in English, and if you fine-tune on implicit data with confidence-based labels, you bias your system further towards English. A hacky solution is to up-sample non-English training data (but you also get more noise); proper hope might come with machine-translation-assisted fine-tuning.
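The up-sampling trick Roman mentions can be as simple as repeating non-English training samples (a toy sketch under our own assumptions about the data layout; real pipelines would more likely use per-sample weights):

```python
def upsample_non_english(samples: list[dict], factor: int = 3) -> list[dict]:
    """Repeat non-English training samples `factor` times to rebalance the data.
    Note: this also duplicates their noise, which is the drawback mentioned above."""
    out = []
    for s in samples:
        repeats = factor if s["lang"] != "en" else 1
        out.extend([s] * repeats)
    return out

data = [
    {"query": "red shoes", "lang": "en", "label": 1},
    {"query": "rote schuhe", "lang": "de", "label": 1},
]
print(len(upsample_non_english(data)))  # 4
```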
As for asking if your final system/feature/relevance tuning works for your users in practice… Even with fancy vector search, Roman recommends good old A/B tests as your ultimate guideline!
The next talk presented a technique we were keen to see used in practice: LLM-as-a-judge, in the talk on Shopify’s Offline Evaluation with Model-Based Judgements, presented by Alberto Castelo Becerra. How can a platform at global scale like Shopify assess relevance across hundreds of millions of products? A/B tests are great, but slow and costly. So they needed a way to run offline evaluations to gain confidence in releases and faster iteration cycles.
Implicit judgements, like clicks, are not enough: they have many limitations like being scarce and noisy, or not aligned with the desired UX.
Hence, the team at Shopify tried Model-Based judgements: asking an LLM to evaluate product/query relevance at scale.
Building a training loop (Train/Model/Analyze) repeated until they reached satisfaction, Alberto and his colleagues used a golden eval dataset plus synthetic data to generate artificial judgements for their search system.
Distilling knowledge from GPT-4 labels and using a mix of real and synthetic data, they used a cross-encoder architecture with CLIP for image embeddings, with a final classification head on top.
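The LLM-as-a-judge pattern itself can be sketched very simply (a generic illustration, not Shopify’s implementation; the canned answer stands in for whatever LLM API you use):

```python
GRADES = {"irrelevant": 0, "partial": 1, "relevant": 2}

def judgment_prompt(query: str, title: str, description: str) -> str:
    """Prompt asking the model to grade query/product relevance on a small scale."""
    return (
        "Rate how relevant this product is to the search query.\n"
        f"Query: {query}\nProduct: {title} - {description}\n"
        f"Answer with exactly one word among: {', '.join(GRADES)}."
    )

def parse_grade(model_answer: str) -> int:
    """Map the model's one-word answer to a numeric label; default to 0 if unparseable."""
    return GRADES.get(model_answer.strip().lower(), 0)

# With a canned answer (a real loop would send judgment_prompt(...) to the model
# for every query/product pair in the evaluation set):
print(parse_grade("Relevant"))  # 2
```

Grades collected this way can then feed the same offline metrics (nDCG and friends) as human judgment lists, which is what makes the approach attractive at scale.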
The results are really promising! Alberto concluded that Model-Based judgements work great, using a specialized language model works best (for now…), and that ultimately good data is key to good judgements!
The following talk was from Algolia, with Raed and Paul-Louis presenting The Good, Bad, and Ugly use cases of Image Search for Recommendations. We shared various examples of image-based recommendations: where they work great out of the box, where your use case requires a careful implementation to succeed, and a few fun examples which are honestly better served by other recommender models… including a surprise appearance by Nicolas Cage!
We left the stage to welcome another duo of presenters, whom you might recognize: Stavros Macrakis from OpenSearch @AWS and Charlie Hull, discussing User-Behavior Insights and how to collect them.
Presenting the common needs of all search implementations, they argue that this is a key part of relevance engineering that needs to be standardized further: how we collect and aggregate insights about how our users interact with our search engines.
Stavros is frequently astounded at how even large, sophisticated organizations fail to collect the data needed to evaluate and tune their search systems. So they closed with a call to action for our e-commerce search community: we need to make it simpler to track the steps of a user’s search journey, in an ethical and safe manner. Streamlining this process will be crucial to building the experiences of the future.
The last two talks closed the conference by showing promising applications of search & language technologies, put at the service of real user needs:
First we saw the Empathy.co and SpaceHeater.ai team (Angel Maldonado, Alex Barrett and Ben Cooper) present Search & Privacy as One, their vision of an assistant ecosystem where your data does not live in centralized AI Search systems, but rather in private enclaves, putting privacy, control, and agency back in the hands of the users – a project to follow with interest for sure!
Finally, we saw Lucian Precup from Adelean demonstrate An AI Assistant in the life of a Search Engine Administrator: how integrating AI assistants into dashboards like Kibana can significantly improve the life of a search engine administrator.
He demonstrated an experience where you can interact with your data in natural language, and then go a lot deeper: for example Lucian’s assistant will proactively recommend ideas to improve the user experience (e.g. automatically suggesting redirections based on user interactions, or offering synonyms to address zero results queries). Great talk showing that there’s a lot of real-world utility to be found applying LLMs to build internal tools for technical teams!
The conference ended on a grassroots note: all MICES attendees could propose topics for self-organized sessions. We saw discussions ranging from very technical topics like “how to leverage DSPy for prompt programming” or “how to standardize analytics”, to more soft-skills concerns like “how can we break silos in our companies”, and various other bottom-up topics: many thanks to the organizers for fostering such a friendly space for sharing knowledge among ecommerce experts, from Search builders to ecom product teams – this is how we can work together so that everybody gets taken into the future!
We hope this recap gave you some ideas on where Search & Ecommerce experiences are going in 2024: more semantic understanding of your users; generative features that bring value through RAG, with guardrails and evaluation frameworks to ensure the generated content is relevant and trustworthy; and investments in better data collection, the cornerstone of so many AI features that today’s ecommerce search users expect!
Have fun diving into the above videos of Buzzwords and MICES – see you there next year!
This blog post, Berlin Buzzwords & MICES24: Paradoxes and Paradigm shifts, © 2024 by Paul-Louis Nech & Raed Chammam at Algolia, is licensed under CC BY-SA 4.0.
Paul-Louis Nech, Senior ML Engineer
Raed Chammam, Senior Software Engineer