
You looked at this scarf twice; need matching mittens? How about an expensive down vest?

You watched this goofy flick four times? Try something else. I know what will hook you next.

That search term you’ve entered appears in a zillion spots in the company’s data silos. Based on my bag-of-words model, check out these similar documents I found that contain common words.

If a recommendation-engine algorithm could “think” as a human does, you might catch it making these kinds of private observations. Of course, these aren’t the ways observations about document similarity and related content or product suggestions are phrased on websites. All you get is a “You might also like” or a list of items with no indication as to why they were selected. Regardless, you’re served great recommendations for similar items or content that could very easily pique your interest, as though the algorithm has been taking notes. 

Big Data is listening

Similarity is a key differentiator in ranking search engine results and recommending content. If we humans like something, one taste of it probably isn't enough; we want more, more, more.

Ecommerce retailers are especially happy to indulge us with more-specialized information retrieval; they’re understandably passionate about meeting our similarity-seeking needs. User-based recommendations for similar items are everywhere on the Web, from Amazon to Netflix to large retailers’ sites. Most people are intrigued by their personalized “You loved that, you could love this” ideas, their “Buy it with” add-on suggestions, their ability to check out “Customers also considered” products.

Thanks to data scientists’ creation of reliable similar-content functionality, this experience is an everyday occurrence. But how do companies use data science to so uncannily figure out what else we might like? What’s involved in a website identifying the right similar items in a sea of options? How does a recommender system use artificial intelligence and tap a dataset to figure out what similar movie title, product, or blog post a user would want to see next?

The secret boils down to a time-tested measure of similarity between two number sequences: cosine similarity.

What is cosine similarity?

In terms of language, cosine similarity determines the closeness in meaning between two or more words. It's how a search or recommendation engine knows, for example, that the word math is similar to statistics, that statistics is similar to machine learning, and that machine learning is similar to cosines, while none of them is similar to scarf or mittens.

This distance-evaluation metric, also used for item-to-item similarity, calculates a similarity score between two items represented as vectors in a multidimensional inner product space. This is made possible by vectorization, which converts words into vectors of numbers, allowing their meaning to be encoded and processed mathematically. The cosine of the angle between the two item vectors, as projected in the multidimensional space, can then be determined.

Here’s an example:

[Diagram: word vectors for man, woman, king, and queen plotted in vector space]

This diagram shows that woman and man are somewhat similar (as Mars and Venus would be), that king and queen aren't closely related, and that king is related to man.

How does this measurement go about revealing the similarity between items? It works based on the principles of cosines: when cosine distance increases, the similarity of the data points decreases.

To measure the similarity of two items based on their attributes, cosine similarity is computed on their attribute vectors. The output value ranges from -1 to 1.

The cosine computation across all of these values will produce the following possible outputs:

  • -1 (an opposite)
  •  0 (no relation)
  • 1 (100% related)

But the most telling values are the decimals between the extremes, which indicate varying degrees of similarity. For example, if items 1 and 2 have a similarity score of 0.8 with each other, they're far more similar to each other than either is to item 3, which scores only 0.2 against both.

Here’s a mini tutorial with more details on how to compute cosine similarity.

The upshot: if two item vectors have many common attributes, the items are very similar.
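As a quick sketch of the computation in plain Python, with made-up attribute vectors (a 1 marks an attribute the item has, a 0 one it lacks):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length attribute vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

item_1 = [1, 1, 0, 1]  # shares two attributes with item_2
item_2 = [1, 1, 0, 0]
item_3 = [0, 0, 1, 0]  # shares none with item_1

print(cosine_similarity(item_1, item_2))  # ≈ 0.82: strongly similar
print(cosine_similarity(item_1, item_3))  # 0.0: no relation
```

The more attributes two item vectors share, the smaller the angle between them and the closer the score gets to 1.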

Why cosine similarity?

In data analysis for recommendation systems, various similarity metrics, including Euclidean distance, Jaccard similarity, and Manhattan distance, are used for evaluating data points. But among the options, cosine similarity is one of the most common and effective choices.

Cosine similarity is a trusted form of measurement for a variety of reasons. For instance, even if two similar data objects are far apart in terms of Euclidean distance because of their size, they could still have a relatively small angle between them. And the smaller the angle, the stronger the similarity.

In addition, the cosine similarity formula is a winner because it can handle variable-length data, such as sentences, not just words.
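A toy illustration of that magnitude point, using invented term-count vectors: doubling a document's length pushes it far away in Euclidean terms, but its direction, and therefore its cosine similarity, is unchanged.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

short_doc = [2, 1, 0]  # term counts in a short document
long_doc = [4, 2, 0]   # same proportions of terms, twice as long

print(euclidean(short_doc, long_doc))  # ≈ 2.24: looks different by distance
print(cosine(short_doc, long_doc))     # ≈ 1.0: identical direction, perfectly similar
```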

Attesting to its popularity, cosine similarity is utilized in many online libraries and tools, such as TensorFlow and Python's scikit-learn (imported as sklearn).

Cosine similarity and machine learning

Machine-learning algorithms are commonly applied to datasets in order to offer website users and shoppers the most on-point customized recommendations. This practice has taken off: deep-learning-generated recommendations for shoppers and media-site subscribers have become an integral part of the website search and discovery experience.

With similarity assessment, getting the semantics right is key, so natural language processing (NLP) plays a substantial role.

Consider the types of terms in the diagram — king, queen, ruler, monarchy, royalty. With vectors, computers can make sense of them by clustering them together in n-dimensional space. They can each be located with coordinates (x, y, z), and similarity can be calculated using distance and angles.

Machine learning models can then surmise that words that are near each other in vector space — such as king and queen — are related, and words that are even closer, such as queen and ruler, could be synonyms. 

Vectors can also be added, subtracted, and multiplied to establish meaning and relationships, and thereby provide more-accurate recommendations. One often-cited example of such addition and subtraction: king – man + woman = queen. Machines can use this type of formula to capture relationships such as gender.
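With made-up three-dimensional vectors (real embeddings have hundreds of dimensions, and these toy dimensions are purely illustrative), that arithmetic looks like this:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy embeddings with invented dimensions: [maleness, femaleness, royalty]
vectors = {
    "man":   [1.0, 0.0, 0.0],
    "woman": [0.0, 1.0, 0.0],
    "king":  [1.0, 0.0, 1.0],
    "queen": [0.0, 1.0, 1.0],
}

# king - man + woman, element by element
result = [k - m + w for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])]

# The vocabulary word whose vector points closest to the result
best = max(vectors, key=lambda word: cosine(vectors[word], result))
print(best)  # queen
```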

Applying the algorithm

At Algolia, our recommendations rely in part on supervised machine-learning models. Data is collected for a similarity matrix in which columns are userTokens and rows are objectIDs. Each cell represents the number of interactions (click and/or conversion) between a userToken and an objectID.

Then we apply a collaborative filtering algorithm that, for each item, finds other items that share similar buying patterns across customers. Items are similar if the same user set has interacted with them.

One challenge: the similarity matrix is computationally heavy (dense), and the similarity values are small, introducing noise to the data that can negatively impact the quality of the recommendations provided.

To get around this roadblock, the k-nearest neighbors algorithm (KNN) comes in handy. Cosine similarity determines the nearest neighbors: data points with higher similarity are considered nearest, those with lower similarity are discarded, and only the k most-similar pairs of items are retained. The result: high-quality suggestions.
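A minimal sketch of that pruning step, assuming a small invented interaction matrix (rows are items, columns are userTokens, values are interaction counts; the item names are hypothetical):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical interactions: item -> click/conversion counts per userToken
interactions = {
    "scarf":   [3, 0, 1, 2],
    "mittens": [2, 0, 1, 3],
    "vest":    [0, 4, 0, 1],
    "camera":  [0, 3, 0, 0],
}

K = 1  # keep only the k most-similar neighbors per item

neighbors = {}
for item, vec in interactions.items():
    scored = [(cosine(vec, other_vec), other)
              for other, other_vec in interactions.items() if other != item]
    scored.sort(reverse=True)
    neighbors[item] = [other for _, other in scored[:K]]

print(neighbors["scarf"])  # ['mittens']: interacted with by the same users
```

Items bought by the same set of users end up with nearly parallel vectors, so the scarf's nearest neighbor is the mittens, not the camera.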

Cosine similarity in a recommendation system

With movie recommendation systems, among other types of content-based recommendation systems, it’s all about the algorithms. 

What do similar users watch (or read or listen to)? Cosine similarity measures the similarity between two viewers — that is, one user profile vs. all the others.

What else do people who view or buy this item buy? In the recommendation-generating process, item descriptions and attributes are leveraged to calculate item similarity. Using cosine similarity, the engine assesses the degree of sameness between the item a person has selected or viewed and the other items in the catalog. The items with the highest similarity values are presented as the most promising recommendations.

Cosine similarity is instrumental in recommending the right text documents, too.

For text similarity, frequently occurring terms are key: the terms are vectorized, and for recommendations, the documents whose shared terms appear with the highest frequencies are considered the strongest matches.
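That bag-of-words approach can be sketched with invented example sentences and raw term counts (production systems typically weight terms, e.g. with TF-IDF, rather than using raw counts):

```python
import math
from collections import Counter

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def term_vector(text, vocabulary):
    """Raw term-frequency vector over a fixed shared vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

doc_a = "the cat sat on the mat"
doc_b = "the cat lay on the rug"
doc_c = "stock prices fell sharply today"

vocab = sorted(set((doc_a + " " + doc_b + " " + doc_c).split()))
va, vb, vc = (term_vector(d, vocab) for d in (doc_a, doc_b, doc_c))

print(cosine(va, vb))  # ≈ 0.75: shared frequent terms ("the", "cat", "on")
print(cosine(va, vc))  # 0.0: no terms in common
```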

If you like this post, you may like Algolia

Want to offer your search-engine users or customers the best algorithmically calculated personalized suggestions for similar items? Check out Algolia Recommend.

Regardless of your use case, your developers can take advantage of our API to build the recommendation experiences best suited to your needs. Our recommendation algorithm applies content-based filtering to enhance your user engagement and inspire visitors to come back. That's good news for conversion and your bottom line.

Get a customized demo, try us free, or chat with us soon about high-quality similar-content suggestions that are bound to resonate with your customer base. We’re looking forward to hearing from you!

About the author
Vincent Caruana

Sr. SEO Web Digital Marketing Manager

Recommended Articles

Powered by Algolia Recommend

Semantic textual similarity: a game changer for search results and recommendations
product

Vincent Caruana

Sr. SEO Web Digital Marketing Manager

The anatomy of high-performance recommender systems – Part IV
ai

Ciprian Borodescu

AI Product Manager | On a mission to help people succeed through the use of AI

The anatomy of high-performance recommender systems - Part 1
ai

Ciprian Borodescu

AI Product Manager | On a mission to help people succeed through the use of AI