You looked at this scarf twice; need matching mittens? How about an expensive down vest?
You watched this goofy flick four times? Try something else. I know what will hook you next.
That search term you’ve entered appears in a zillion spots in the company’s data silos. Based on my bag-of-words model, check out these similar documents I found that contain common words.
If a recommendation-engine algorithm could “think” as a human does, you might catch it making these kinds of private observations. Of course, these aren’t the ways observations about document similarity and related content or product suggestions are phrased on websites. All you get is a “You might also like” or a list of items with no indication as to why they were selected. Regardless, you’re served great recommendations for similar items or content that could very easily pique your interest, as though the algorithm has been taking notes.
Similarity is a key differentiator in ranking search engine results and recommending content. If we as humans like something, it’s probably not enough; we want more, more, more.
Ecommerce retailers are especially happy to indulge us with more-specialized information retrieval; they’re understandably passionate about meeting our similarity-seeking needs. User-based recommendations for similar items are everywhere on the Web, from Amazon to Netflix to large retailers’ sites. Most people are intrigued by their personalized “You loved that, you could love this” ideas, their “Buy it with” add-on suggestions, their ability to check out “Customers also considered” products.
Thanks to data scientists’ creation of reliable similar-content functionality, this experience is an everyday occurrence. But how do companies use data science to so uncannily figure out what else we might like? What’s involved in a website identifying the right similar items in a sea of options? How does a recommender system use artificial intelligence and tap a dataset to figure out what similar movie title, product, or blog post a user would want to see next?
The secret boils down to a time-tested measure of similarity between two number sequences: cosine similarity (Wikipedia definition).
In terms of language, cosine similarity measures how close in meaning two words are, based on their vector representations. It’s the way a search or recommendation engine knows, for example, that the word math is similar to statistics, is similar to machine learning models, is similar to cosines, all of which are not similar to scarf and mittens.
This metric, also known in recommendation contexts as item-to-item similarity, scores the similarity between two items that are represented as vectors in a multidimensional inner product space. This is made possible by vectorization, which converts words into vectors (numbers), allowing their meaning to be encoded and processed mathematically. Then the cosine of the angle between the two vectors in that multidimensional space can be determined.
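For readers who want the math, here is the standard textbook definition written out. The symbols A and B (the two item vectors), n (the number of dimensions), and θ (the angle between the vectors) are notation chosen for this sketch, not taken from the article.

```latex
\cos(\theta)
  = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
  = \frac{\sum_{i=1}^{n} A_i B_i}
         {\sqrt{\sum_{i=1}^{n} A_i^{2}} \; \sqrt{\sum_{i=1}^{n} B_i^{2}}}
```

The numerator is the dot product of the two vectors, and the denominator is the product of their lengths, which is why the result depends only on the angle between them.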
Here’s an example:
This diagram shows that woman and man are somewhat similar (as Mars and Venus would be), yet king and queen aren’t related, but king is related to man.
How does this measurement go about revealing the similarity between items? It works based on the principles of cosines: when cosine distance increases, the similarity of the data points decreases.
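To make the relationship between similarity and distance concrete, here is a minimal sketch in plain Python. The function names and the sample vectors are illustrative assumptions, and cosine distance is taken to be 1 minus cosine similarity, which is the common convention.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """Common convention: distance grows as similarity shrinks."""
    return 1.0 - cosine_similarity(a, b)


# Made-up vectors: the first pair points in the same direction,
# the second pair shares no attributes at all.
same_direction = cosine_similarity([1, 2, 3], [2, 4, 6])
unrelated = cosine_similarity([1, 0], [0, 1])

print("same direction:", round(same_direction, 6))
print("unrelated:", unrelated)
print("distance between unrelated vectors:", cosine_distance([1, 0], [0, 1]))
```

Vectors that point the same way score close to 1, and vectors with nothing in common score 0, so their distance is 1.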
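To measure the similarity of two items based on their attributes, cosine similarity is computed on a matrix like this. In general, the output value ranges from -1 to 1; when every attribute value is non-negative, as with counts or ratings, it ranges from 0 to 1.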
The cosine computation across all of these values will produce the following possible outputs:
But the most telling values are the decimals in between the extremes, which indicate varying degrees of similarity. For example, if item 1 is at a distance of .8 from item 2 but only .2 from item 3, then item 1 is far more similar to item 3 than it is to item 2.
Here’s a mini tutorial with more details on how to compute cosine similarity.
The upshot: if two item vectors have many common attributes, the items are very similar.
In data analysis for recommendation systems, various similarity metrics, including Euclidean distance, Jaccard similarity, and Manhattan distance, are used for evaluating data points. But among the options, cosine similarity is one of the most widely used, particularly for text and other sparse, high-dimensional data.
Cosine similarity is a trusted form of measurement for a variety of reasons. For instance, even if two similar data objects are far apart in terms of Euclidean distance because of their size, they could still have a relatively small angle between them. And the smaller the angle, the stronger the similarity.
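Here is a small sketch of that size effect, assuming two hypothetical documents described by made-up term counts, one ten times longer than the other but with the same proportions. The helper names are illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical term counts: same mix of terms, very different lengths.
short_doc = [1, 2, 0]
long_doc = [10, 20, 0]

euclid = euclidean_distance(short_doc, long_doc)
cosine = cosine_similarity(short_doc, long_doc)

print("Euclidean distance:", round(euclid, 2))
print("Cosine similarity:", round(cosine, 6))
```

The Euclidean distance is large because one vector is much longer, while the cosine similarity is essentially 1 because both vectors point in the same direction.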
In addition, the cosine similarity formula is a winner because it depends only on the direction of the vectors, not their magnitude. That lets it compare variable-length data, such as sentences, not just words.
Attesting to its popularity, cosine similarity is available in many libraries and tools, such as TensorFlow and scikit-learn (imported as sklearn) for Python.
Machine-learning algorithms are commonly applied to datasets in order to offer website users and shoppers the most on-point customized recommendations. This practice has taken off: deep-learning-generated recommendations for shoppers and media-site subscribers have become an integral part of the website search and discovery experience.
With similarity assessment, getting the semantics right is key, so natural language processing (NLP) plays a substantial role.
Consider the types of terms in the diagram — king, queen, ruler, monarchy, royalty. With vectors, computers can make sense of them by clustering them together in n-dimensional space. They can each be located with coordinates (x, y, z), and similarity can be calculated using distance and angles.
Machine learning models can then surmise that words that are near each other in vector space — such as king and queen — are related, and words that are even closer, such as queen and ruler, could be synonyms.
Vectors can also be added, subtracted, and multiplied to establish meaning and relationships, and thereby provide more-accurate recommendations. One often-cited example of such addition and subtraction: king – man + woman ≈ queen. The result is not exactly the vector for queen, but queen is typically its nearest neighbor. Machines can use this type of vector arithmetic to capture relationships such as gender.
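The king and queen arithmetic can be sketched with toy numbers. The three-dimensional vectors below are invented for illustration (loosely "royalty", "masculine", "feminine"); real word embeddings are learned from text and have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made toy embeddings, NOT learned from data.
vectors = {
    "king": [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "scarf": [0.0, 0.2, 0.3],
}

# king - man + woman, computed dimension by dimension.
target = [
    k - m + w
    for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])
]

# As is customary, the words used in the arithmetic are excluded.
candidates = {
    word: vec
    for word, vec in vectors.items()
    if word not in {"king", "man", "woman"}
}
best = max(candidates, key=lambda w: cosine_similarity(target, candidates[w]))

print("Nearest word to king - man + woman:", best)
```

With these toy values, the nearest remaining word by cosine similarity is queen.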
At Algolia, our recommendations rely in part on supervised machine-learning models. Data is collected for a similarity matrix in which columns are userTokens and rows are objectIDs. Each cell represents the number of interactions (click and/or conversion) between a userToken and an objectID.
Then we apply a collaborative filtering algorithm that, for each item, finds other items that share similar buying patterns across customers. Items are similar if the same user set has interacted with them.
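As a rough illustration of the idea, and not Algolia's actual implementation, the sketch below builds a tiny interaction matrix with one row per objectID and one column per userToken, then scores every pair of items with cosine similarity. The item names and interaction counts are made up.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Rows are objectIDs; each column is one (hypothetical) userToken.
# Each cell counts that user's clicks and/or conversions on the item.
interactions = {
    "scarf": [3, 1, 0, 2],
    "mittens": [2, 1, 0, 3],
    "vest": [0, 0, 4, 1],
}

# Items are similar when the same users interacted with them.
item_similarity = {
    (a, b): cosine_similarity(interactions[a], interactions[b])
    for a, b in combinations(interactions, 2)
}

for pair, score in item_similarity.items():
    print(pair, round(score, 3))
```

In this toy data, scarf and mittens share most of their users and score high, while vest was mostly picked by a different user and scores low against both.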
One challenge: the similarity matrix is computationally heavy (dense), and the similarity values are small, introducing noise to the data that can negatively impact the quality of the recommendations provided.
To get around this roadblock, the k-nearest neighbors algorithm (KNN) comes in handy. Cosine similarity determines the nearest neighbors: data points with higher similarity are considered nearest, and those with lower similarity are dropped. For each item, you retain only the k most-similar pairs of items. The result: high-quality suggestions.
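The pruning step can be sketched as follows. The similarity scores are hypothetical, the value k=2 is arbitrary, and `top_k_neighbors` is an illustrative helper name rather than part of any library.

```python
import heapq

# Hypothetical precomputed cosine similarities for each item.
similarities = {
    "scarf": {"mittens": 0.93, "vest": 0.13, "hat": 0.71, "socks": 0.05},
    "vest": {"scarf": 0.13, "mittens": 0.19, "hat": 0.22, "socks": 0.64},
}


def top_k_neighbors(scores, k):
    """Keep only the k highest-scoring neighbors, best first."""
    return heapq.nlargest(k, scores.items(), key=lambda pair: pair[1])


neighbors = {
    item: top_k_neighbors(scores, k=2)
    for item, scores in similarities.items()
}

for item, kept in neighbors.items():
    print(item, "->", kept)
```

Dropping the low-scoring pairs keeps the stored data small and removes the weakest, noisiest relationships before recommendations are served.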
With movie recommendation systems, among other types of content-based recommendation systems, it’s all about the algorithms.
What do similar users watch (or read or listen to)? Cosine similarity measures the similarity between two viewers — that is, one user profile vs. all the others.
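Here is a minimal sketch of comparing one user profile against all the others. The user names and ratings are invented, and `most_similar_users` is an illustrative helper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical ratings for the same four movies, one list per viewer
# (0 means the viewer has not watched that movie).
profiles = {
    "alice": [5, 4, 0, 0],
    "bob": [4, 5, 0, 1],
    "carol": [0, 0, 5, 4],
}


def most_similar_users(target, all_profiles):
    """Rank every other user by similarity to the target user."""
    target_vector = all_profiles[target]
    scores = [
        (other, cosine_similarity(target_vector, vector))
        for other, vector in all_profiles.items()
        if other != target
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


ranked = most_similar_users("alice", profiles)
print(ranked)
```

In this example bob ranks first for alice because they rated the same movies highly, so titles bob liked that alice has not yet seen would be natural candidates to recommend.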
What else do people who view or buy this item buy? In the recommendation-generating process, item descriptions and attributes are leveraged in order to calculate item similarity. Using cosine similarity, the degree of sameness between what the person has selected or viewed compared with other items in the catalog is assessed. The other items with the highest similarity values are presented as the most promising recommendations.
Cosine similarity is instrumental in recommending the right text documents, too. For instance, it can help answer questions like:
For text similarity, frequently occurring terms are key. The terms are vectorized, and for recommendations, those with the higher frequencies are considered the strongest.
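A bag-of-words version of this can be sketched in a few lines. The query, the document texts, and the whitespace tokenizer are simplifying assumptions; production systems typically add steps such as stop-word removal or term weighting.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Vectorize text as a bag of words: term -> number of occurrences."""
    return Counter(text.lower().split())


def cosine_similarity(tf_a, tf_b):
    """Cosine similarity between two sparse term-frequency vectors."""
    shared_terms = set(tf_a) & set(tf_b)
    dot = sum(tf_a[term] * tf_b[term] for term in shared_terms)
    norm_a = math.sqrt(sum(count * count for count in tf_a.values()))
    norm_b = math.sqrt(sum(count * count for count in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat empty text as similar to nothing
    return dot / (norm_a * norm_b)


# Made-up query and documents for illustration.
query = "cosine similarity for recommendations"
documents = {
    "doc_math": "cosine similarity measures the angle between vectors",
    "doc_shop": "wool scarf and matching mittens on sale",
}

query_tf = term_frequencies(query)
ranked = sorted(
    documents,
    key=lambda name: cosine_similarity(query_tf, term_frequencies(documents[name])),
    reverse=True,
)
print(ranked)
```

The document that shares terms with the query ranks first, and the one with no terms in common scores 0.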
Want to offer your search-engine users or customers the best algorithmically calculated personalized suggestions for similar items? Check out Algolia Recommend.
Regardless of your use case, your developers can take advantage of our API to build the recommendations experiences best suited to your needs. Our recommendation algorithm applies content-based filtering to enhance your user engagement and inspire visitors to come back. That’s good news for conversion and your bottom line.
Get a customized demo, try us free, or chat with us soon about high-quality similar-content suggestions that are bound to resonate with your customer base. We’re looking forward to hearing from you!
Vincent Caruana
Senior Digital Marketing Manager, SEO