Search used to mean "text search" and search terms used to be called "keywords." Those limitations feel passé when we use our computers and phones for so many other modalities. We Shazam songs to discover what that catchy tune is. We leave ourselves voice notes. We take pics to remember places and things – and find them again in other contexts.
Our latest challenge has been to move beyond text and keyword search to make your customer experience multimodal. We've spent the last twelve months building a high-performing image retrieval API.
This eBook describes that development process.
Our priority from the start was to understand user needs and design for what users want.
We used a trio of technologies at the core of the image recommendation API service – image vectorization, vector hashing, and vector retrieval – to perform image retrieval at scale.
The result is more than an image API. We've created a powerful image retrieval and recommendation service that is applicable to an endless number of business use cases.
We are surrounded by images. The business world survives and thrives by visualizing products. To map information. To report. To convey color, texture, style, place, and mood. To excite the imagination, to dream, and to tantalize.
All that with pictures. And yet, our digital devices have only recently started to manage images really well. The better our apps and interfaces get at seeing and sorting images, the faster new use cases for image recommendations rush to mind, and the faster developers work to build the technology that serves these possibilities.
| Use Case | Solution |
| --- | --- |
| Do you have a marketplace with a wide range of different, short-lived items? | Provide image recommendations that are speedy, relevant, and personalized to specific user journeys. |
| Want to offer customers an image-based "Find-Me-This" shopping experience? | Empower shoppers to compare images (taken in-store or in daily life) with product catalogs to easily locate desired items online and enhance the online shopping experience. |
| Want a system that learns even better as the user searches? | Include recent search history in search results to make better visual recommendations. Bias results towards the last category searched, or away from products that were returned. |
What applications would you build if you could match any user-provided visual cue to your company's content?
When we sat down to build an image retrieval API, we knew we'd be dealing with new questions and different parameters.
One thing would remain constant, however: users and their expectations.
In 2023, users expect real-time, instant responses and seamless interactions from their apps and services. The internet giants paved the way on speed and have the data to show what users want and expect.
Google showed that a half-second delay was enough to drive 20% of traffic away from sites. Netflix research indicated that viewers pick movies in under 2 seconds. Amazon discovered that a lag of only 100 milliseconds lost them 1% of their overall global sales.
No matter how fast you are, returning wrong results is equally detrimental. Users want, need, and expect relevant results. This expectation is borne out by Google statistics for desktop search. The top result gets one third of total clicks. The tenth result, near the bottom of the first page, gets ten times fewer clicks. Two thirds of users grow frustrated if irrelevant results force them to dig and will never advance to the second page.
On mobile, where screen space is even scarcer, the first result is often all that matters.
The same goes for newer search modalities, such as speech queries through conversational interfaces. When a voice assistant returns a single result, irrelevance amounts to failure. These days, the world is in thrall to ChatGPT.
Users have exceedingly high expectations for the natural language capabilities of their apps and interfaces. They expect to be clearly understood the first time around.
A 2016 study by Ewa Luger and Abigail Sellen of Microsoft Research showed that users had higher expectations for systems than what was realistically deliverable. Rather than chalking errors up to the state of the art in machine intelligence, users thought badly of the systems when they provided wrong or unintelligible responses.
"We find user expectations dramatically out of step with the operation of the systems, particularly in terms of known machine intelligence, system capability and goals."
– EWA LUGER AND ABIGAIL SELLEN (2016),
"Like Having a Really Bad PA": The Gulf between User Expectation and Experience of Conversational Agents
Today's users expect the systems they use to understand text and images combined.
Pinterest's PinSage recommender system, for instance, helps users find "Pins" (image bookmarks) related to topics they're interested in. It uses a neural network that handles both visual and textual annotations to increase the relevance of results. By "borrowing" embeddings from nearby Pins, PinSage can disambiguate Pins that share visual characteristics (e.g. fence post and bed post) but are semantically and contextually different (garden, but not bedroom).
Users are starting to notice these text-image interactions popping up on their favorite services. Netflix personalizes thumbnail artwork to optimize those fleeting two seconds that viewers spend selecting a show.
People in one city or country are seeing entirely different "box" artwork for the same show than their friends and relatives living elsewhere.
Users don't expect multimodality for its own sake.
The objective is to deliver the most relevant response for the user's specific context. That means taking both text and image into account, and personalizing results with other factors, such as search history and location.
The core technique that makes these recommendations possible is image vectorization. This process converts an image's pixel values into numerical values, called vectors.
To vectorize an image, we run it through a convolutional neural network (CNN), a deep learning architecture that extracts the image's relevant features. In this way, the model learns to discriminate between images for classification, sorting, retrieval, and other image-related tasks.
The "deep" part of deep learning is crucial. The more layers a network has, the more features it can extract, and the better it should perform on image tasks.
As work on CNNs progressed, networks got extremely deep. However, it turned out that adding layers didn't just slow down training. Beyond a certain point, it degraded the model and produced more errors.
Thankfully, in 2015, a groundbreaking paper from Microsoft Research helped image researchers get over the hump. The authors of "Deep Residual Learning for Image Recognition" designed a special CNN, the ResNet (Residual Neural Network), that could be trained at far greater depths (up to 152 layers) without degrading. The innovation was adding shortcut ("skip") connections between layers, which let each block learn a residual function and made very deep networks much easier to optimize. After exploring various depths, we found the sweet spot for our purposes at 18 or 34 layers, which both achieved similar accuracy.
ResNet-18 converged faster and performed adequately for Algolia's purposes without adding training time. Our image API uses ResNet-18 to convert images into fixed-length, 512-dimensional vectors. While this detail might not be salient for end users, it's valuable for marketplace owners. It means they can compare images of widely varying sizes without worrying about size differences – a common challenge with user-generated images.
Those vectors – fixed-size arrays of floating-point numbers, essentially coordinates – are then plotted into a multi-dimensional space. As float vectors accumulate, another problem arises. How can you scale this process? If you have millions of float vectors in the same space, how can you ensure efficient classification, matching, comparison, and retrieval?
We relied on another important advance: a clever technique called hashing. The concept was introduced as "semantic hashing" by Ruslan Salakhutdinov and Geoffrey Hinton in their 2009 paper of the same name.
The method groups similar items closer together so retrieving them is efficient at scale.
"We show how to learn a deep graphical model of the word-count vectors obtained from a large set of documents. The values of the latent variables in the deepest layer are easy to infer and give a much better representation of each document than Latent Semantic Analysis. When the deepest layer is forced to use a small number of binary variables (e.g. 32), the graphical model performs ‘semantic hashing’: Documents are mapped to memory addresses in such a way that semantically similar documents are located at nearby addresses."
- Ruslan Salakhutdinov and Geoffrey Hinton, Semantic Hashing
The real insight that Salakhutdinov and Hinton get at with this research is that visual deep learning models can be taught the meaning of what they look at and will learn to bring related items closer together. They’ll understand, for instance, the relationship between a cap and a hat.
Our image API incorporates this learning and uses a specific kind of hashing proposed in 1999 by Aristides Gionis, Piotr Indyk, and Rajeev Motwani.
The hashing, called "Locality Sensitive Hashing" (LSH), converts float vectors into binary hashes.
Nearly every fast information retrieval system relies at some level on a binary partition of the dataset. Even an activity like faceting operates like a binary filter. Is an item in stock or not in stock? Keyword matching involves a binary filter. Yes, this matches or no, it doesn't.
Advanced catalog intelligence, like predicting what items might be out of stock before a sale ends, acts like a true/false binary filter as well. Is this item above or below expected sales for the current period?
With LSH, the binary function boils down to proximity. Items in the dataset are, based on their vector values, either similar to each other or dissimilar.
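A minimal, self-contained sketch of the random-hyperplane flavor of LSH illustrates the idea (the 64-bit hash size and the perturbation below are illustrative choices, not our API's actual parameters): each random hyperplane cuts the space in two, and a vector's hash records which side of each hyperplane it falls on.

```python
import numpy as np

rng = np.random.default_rng(42)

# 64 random hyperplanes in 512-dimensional space; each contributes one hash bit.
DIM, BITS = 512, 64
hyperplanes = rng.standard_normal((BITS, DIM))

def lsh_hash(vec: np.ndarray) -> np.ndarray:
    # Bit is 1 if the vector lies on the positive side of the hyperplane.
    return (hyperplanes @ vec > 0).astype(np.uint8)

def hamming(a: np.ndarray, b: np.ndarray) -> int:
    # Number of hash bits on which the two vectors disagree.
    return int(np.count_nonzero(a != b))

v = rng.standard_normal(DIM)
near = v + 0.05 * rng.standard_normal(DIM)  # small perturbation of v
far = rng.standard_normal(DIM)              # unrelated vector

# Nearby vectors collide on most bits; unrelated ones disagree on roughly half.
print(hamming(lsh_hash(v), lsh_hash(near)))
print(hamming(lsh_hash(v), lsh_hash(far)))
```

This is exactly the binary-partition property described above: similar items hash to nearby binary codes, so comparing compact hashes stands in for comparing full float vectors.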
“The basic idea is to hash the points from the database so as to ensure that the probability of collision is much higher for objects that are close to each other than for those that are far apart.”
Aristides Gionis, Piotr Indyk, Rajeev Motwani,
"Similarity Search in High Dimensions via Hashing"
With LSH, instead of assigning binary features that separate similar documents from different ones, we're training a deep learning model to discover them. In the process, we're creating an efficient semantic binary hashing function.
Once the model knows which vectors provide a good representation of those different partitions of our data, we can retrieve them efficiently using nearest neighbor search (NNS) or, for accelerated results, approximate nearest neighbor search (ANN).
For this, our image retrieval API uses a state-of-the-art ANN method called Hierarchical Navigable Small World (HNSW) graphs. First, we build a proximity graph. Then we place our vectors in the space. Finally, we connect them to their neighbors. The graph can be traversed quickly to retrieve the top recommendations with incredible efficiency. While this would be expected for most systems over small catalogs, our new system is a game changer at scale, maintaining the same efficiency as catalogs grow to many thousands of items.
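HNSW itself maintains a hierarchy of such graphs; the single-layer toy below (a simplification for illustration, not Algolia's implementation) shows the core traversal idea: greedily hop to whichever neighbor is closest to the query until no neighbor improves on the current position.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy catalog: 200 vectors in 16 dimensions (production vectors are 512-dim).
vectors = rng.standard_normal((200, 16))

def dist(i: int, query: np.ndarray) -> float:
    return float(np.linalg.norm(vectors[i] - query))

# Build a proximity graph by linking each vector to its 8 nearest neighbors.
# (HNSW stacks several such layers; one layer is enough to show the idea.)
K = 8
all_d = np.linalg.norm(vectors[:, None, :] - vectors[None, :, :], axis=-1)
neighbors = np.argsort(all_d, axis=1)[:, 1:K + 1]

def greedy_search(query: np.ndarray, start: int = 0) -> int:
    """Walk the graph, always moving to the neighbor closest to the query."""
    current = start
    while True:
        best = min(neighbors[current], key=lambda i: dist(i, query))
        if dist(best, query) >= dist(current, query):
            return current  # local minimum: no neighbor is closer
        current = best

query = vectors[123] + 0.01 * rng.standard_normal(16)
result = greedy_search(query)  # index of a near neighbor of the query
```

Each hop strictly decreases the distance to the query, so the walk terminates quickly; real HNSW adds the layer hierarchy and candidate lists to avoid poor local minima.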
As a final step, we link the relevant vectors to object metadata – image, description, price, and so on – and deliver instant results to the end user.
Image vectorization, binary hashing, and vector retrieval using HNSW enable fast, relevant recommendations.
They combine to create a powerful, versatile image recommendation API that can be adapted to endless use cases:
After a customer searches for well-known brand “A” in your catalog, leverage the strength of the vector search to deliver similar-looking alternatives from brand “B” (perhaps an in-store brand with better margins for you, or a lower-cost option for your customer).
The API has enormous value from a usability standpoint as well. Because it’s designed with efficient querying as a priority, retrieval is fast. Put to the test, indexing and latency performance delivers strong results, too.
- Indexing ~50k images in <1 hour
- Handling up to half a million images in a few hours
With distributed indexing, those results can scale to hundreds of images per second to meet user needs over vast image catalogs.
To keep things simple, the API exposes only a limited number of variable parameters.
The challenge with image recommendations is that the image is not all that matters. For businesses to derive value, we need to blend the image signals with textual and business signals, such as textual relevance of the description, price range, or stock availability. The "right" combination can be very different from one use case to another, from one customer to another, from one query to another. Those are critical business decisions.
Our priority is to enable businesses to leverage contextual-image search in the broadest way possible. To support this, our API runs the raw image model. At that point, the business takes over. You can use object-level properties to filter, re-rank, and post-process these results as necessary. You know your use case and your customers better than anyone.
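To make that hand-off concrete, here is a hypothetical post-processing step over ANN candidates. Every field name, weight, and item below is invented for illustration; this is not the Algolia API, just the shape of the filter-then-re-rank logic a business might apply:

```python
# Hypothetical ANN candidates with business metadata (illustrative values).
candidates = [
    {"id": "shoe-1", "similarity": 0.94, "in_stock": True,  "margin": 0.30},
    {"id": "shoe-2", "similarity": 0.91, "in_stock": False, "margin": 0.45},
    {"id": "shoe-3", "similarity": 0.88, "in_stock": True,  "margin": 0.55},
]

def rerank(items, w_sim=0.8, w_margin=0.2, require_stock=True):
    """Filter by a business rule, then blend image similarity with margin."""
    kept = [i for i in items if i["in_stock"] or not require_stock]
    return sorted(
        kept,
        key=lambda i: w_sim * i["similarity"] + w_margin * i["margin"],
        reverse=True,
    )

ranked = [i["id"] for i in rerank(candidates)]
```

With these invented weights, the high-margin "shoe-3" edges out the more visually similar "shoe-1", and the out-of-stock "shoe-2" is filtered away entirely; tuning the weights and filters per page is the kind of business decision the raw image model deliberately leaves to you.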
That’s what makes the API expressive. The combination of raw image model, filtering capabilities, and fallback capabilities allows you to serve unlimited needs with the same tool.
For instance, an online shoe store could display “top sellers and similar looking items from top brands” on its landing page. At the same time, the product page for brand “A” could show “similar looking items from lesser known brands,” using filters in opposing ways to finely craft the user experience.
Another store might prefer its landing page to display “similar items from brand X, but only if they’re in stock,” while an “upcoming collection” inspiration page displays “similar items from brand X” matching both the “not-in-stock” and “Winter 2024 Collection” categories.
Gone are the days when “searching” meant “keywords” and “search results” meant pages of text links.
Customers want fast, relevant results from interfaces that understand them in modalities that make sense for the request and in the moment.
Get in touch to discover more about Algolia's Image Recommendation API. In a nutshell, it's a powerful, efficient image vectorization, hashing, and retrieval system, wrapped in a flexible API and deployed as a managed service.
Our developers can help you build out a use case for your image catalog that makes sense for your organization and delivers relevant results fast. Request a demo today and tap into a world of possibility for connecting your customers to products and services.