Cosine similarity: what is it and how does it enable effective (and profitable) recommendations?

Cosine similarity is the Internet’s silent sentinel — the most powerful concept you’ve never heard of. It’s the math behind how Spotify suggests your next favorite artist, how ecommerce sites recommend similar products, and how AI models understand the meaning of your words. But how does it work? It sure seems like magic that this rock we tricked into thinking can do anything at all, let alone understand the meaning of natural language.

In reality, it’s not magic — it’s just elegant math. In this article, we’ll explain some of the concepts behind how cosine similarity works, such as how to:

turn raw data into vectors using models like Word2Vec,
use cosine similarity to compare them semantically and meaningfully, and
scale that comparison efficiently with algorithms like HNSW or our own proprietary hashing models, which turn dense vectors into hashed binary embeddings. Our hashing algorithm, inspired by this paper, is called fly hashing, and it’s the foundation of our approach.

Just a heads up: you don’t actually need to know any of this to implement app search or any similar high-level features. Algolia hires the best engineers in the industry to build out these algorithms for you and package them up in tidy APIs and SDKs, so you can add these complex features to your application without thinking at all about the underlying mechanics. We use smart caching, hashing, model optimization techniques, and CDN tricks to make it blazing fast, so even keyword search customers do not feel the latency. But knowing how it works under the hood is how you gain confidence in your tools, and that knowledge empowers you to build custom, unique solutions for other problems yourself. We’ll touch on how Algolia uses these cutting-edge algorithms to bring you lightning-fast search through our API.

Generating vectors from data

To make a computer do anything, we have to represent data quantitatively so we can do math to it. After all, that’s how computers work — they compute things. For simple storage on your device, a text file might be encoded into a numerical format that represents the original text (like ASCII or Unicode). But in this case, we’ll encode that text with a process called vectorization: the process of representing words, sentences, or whole documents as arrays of floating point numbers called vectors. This isn’t as character-level precise as converting back and forth to Unicode, but it captures more semantic information, as we’ll see.

Visualizing vectors

Consider this arrow in a 3D space:

How could we mathematically describe this arrow? If you remember back to your high school algebra and geometry classes (this will be a running theme in this article), you can describe a line with two points, its start and end. However, if we assume that the line starts at the origin (where the x, y, and z axes meet, or [0, 0, 0]), then we only need the coordinates of the end point to describe the line. Those end point coordinates ([3, 3, 3] in the example above) form the array of numbers we’re calling a vector. In reverse, then, we can visualize any vector as an arrow like this in a space with as many dimensions as there are numbers in the array (so a length-3 vector can be visualized in 3D space, a length-4 vector can be visualized in a 4D space, and so on).
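To make that concrete, here is a minimal Python sketch of the arrow from the example. The function names are made up for illustration. Note the difference between how many numbers the vector holds (its dimensions) and how long the arrow is (its magnitude).

```python
import math

# The arrow from the example: it starts at the origin and ends at [3, 3, 3].
arrow = [3.0, 3.0, 3.0]


def dimensions(vector):
    """How many numbers the vector holds, i.e. how many axes we need to draw it."""
    return len(vector)


def magnitude(vector):
    """Length of the arrow from the origin to the end point (Pythagorean theorem)."""
    return math.sqrt(sum(component * component for component in vector))


print(dimensions(arrow))  # 3
print(magnitude(arrow))   # about 5.196 (the square root of 27)
```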

Vector generation models

So now that we can visualize vectors, how do we generate them in the first place?

Many folks in this space have heard of Word2Vec, a model that maps words to very dense numerical vectors. At Algolia, we actually use more modern sentence-transformers, but models like Word2Vec were some of the first to validate these concepts. We’ll go over Word2Vec here, but know that the logic we follow can be extended and optimized further and further, all the way to what industry leaders like Algolia use today.

Here’s how it works: Instead of just a list of 3 numbers, these vectors represent so many possible words and concepts that these arrays must be hundreds, even thousands of numbers long. If we use the same vectorization model every time, we can be confident that words appearing in similar contexts (like “king” and “queen”) will end up near each other in this very high-dimensional space. Every combination of these numbers encodes a specific concept of a word, and each dimension encodes some general concept of a category or spectrum of words. The model uses cosine similarity (which we’ll explain in a moment) to notice words that are used in related ways and represent those linguistic patterns as directions in this high-dimensional space.

If that sounds confusing, you’re not alone. But to help you visualize it, I’ll show you an oversimplified 3D example. As you read the following paragraph, try watching the GIF below to follow along.

Imagine we have two vectors: one that represents the word “dog” and another that represents the word “puppy”. Something is similar about these words, and the model was trained on a corpus that often saw these words used together or used in interchangeable contexts, so it understands the connection. It has created vectors for these words that generally point in the same direction, shown in blue. Let’s draw a new orange vector to get from “dog” to “puppy”. Likely in this direction, the model has encoded the idea of youth. Theoretically then, we should be able to get the younger counterpart of any word in its corpus by moving that word’s vector in the “youth” direction. Sure enough, if we take the word “cat” and look some length away in the “youth” direction, we find the vector for the word “kitten”.

The previous paragraph explained visually

In theory, this should work for any word. The word “woman” slid in that direction should result in the vector for the word “girl”. Depending on how the model learned these words, you might be able to slide it in the opposite direction and get something like “grandmother” or “elderly woman”. But of course, neither of those fits exactly. “Grandmother” describes a familial relationship and doesn’t strictly mean an old person, and “elderly woman” is two words that might not be given a vector together in the first place. But with a model like Word2Vec encoding concepts like this, there must be some vector representing the idea of an elderly woman, even though we don’t have a perfect English word for it. Another natural language that developed differently might have a single word for that concept (like the Portuguese “idosa”), but that’s not always going to happen. There are just more available vector combinations than words in the English dictionary, so by the pigeonhole principle, there must be some vectors that don’t have corresponding natural language words. The point here is that the Word2Vec encodings can represent an idea or concept as a vector, regardless of whether that concept has an exact equivalent word in a natural language or not.
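Here is a rough Python sketch of that “youth” direction trick. The 3D vectors are hand-made for illustration and are not real Word2Vec output (a real model would give hundreds of dimensions whose meanings are not this tidy). The helper names are also made up.

```python
import math

# Hypothetical toy embeddings: [dog-ness, cat-ness, youth]. Not real model output.
vectors = {
    "dog": [1.0, 0.0, 0.2],
    "puppy": [1.0, 0.0, 0.9],
    "cat": [0.0, 1.0, 0.2],
    "kitten": [0.0, 1.0, 0.9],
}


def subtract(a, b):
    return [x - y for x, y in zip(a, b)]


def add(a, b):
    return [x + y for x, y in zip(a, b)]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_word(target, vectors, exclude=()):
    """Return the known word whose vector points most nearly the same way as target."""
    candidates = (word for word in vectors if word not in exclude)
    return max(candidates, key=lambda word: cosine_similarity(vectors[word], target))


# The orange arrow: how to get from "dog" to "puppy".
youth = subtract(vectors["puppy"], vectors["dog"])

# Slide "cat" in the same direction and see which word we land closest to.
guess = add(vectors["cat"], youth)
print(nearest_word(guess, vectors, exclude={"cat"}))  # kitten
```

The starting word is excluded from the lookup because a short slide often lands closer to where you started than to anything else.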

Angles and cosines

Let’s go back to the cat and dog example. Those vectors are illustrated as coming from the same point (the origin, [0,0,0]) and going to two different points. Those three points are enough to define a plane, like this:

The plane of the pet-riarchy

That plane makes it easier to see the angle measurement we’re taking between the two vectors, shown in red. That angle could be anywhere between 0 degrees and 360 degrees though, so how do we measure how close they are to pointing the same direction?

This is where cosines come in. Take a look at this interactive unit circle demo from Maths Is Fun and ignore the green and grey functions for today. Focus on the blue cosine data. When your mouse is fully lined up with the line going right from the center point, the cosine of the degree value will be 1. When your mouse is lined up with the vertical axis, the cosine value will be 0. And when your mouse is directly left of the center point, the cosine value will be -1. In other words, if you trace the circle with your mouse, the cosine function takes in the angle (in degrees) of your mouse’s position around the circle and spits out just the horizontal component of its location.

This is exactly what we need, since it’s completely continuous. The cosine value for being 3 degrees around the circle is exactly the same as being 357 degrees around the circle (which is just 3 degrees the other way). It’s telling us how similarly angled two lines are — the baseline going right from the origin and the line from the origin to your mouse.

And here’s the jump to the Word2Vec vectors: The cosine of the angle pictured above in red just tells us how similar the “dog” and “cat” vectors are, on a scale from 1 (pointing in the same direction) to 0 (unrelated) to -1 (pointing in opposite directions). It boils all this weird math down into one easy, digestible number we can compute really quickly to tell us how semantically similar two words are.

The meaning in our vectors lives here, in the angle between them, not in their magnitude or other attributes. And as mentioned earlier, Word2Vec and other similar models were actually trained to optimize for this. If it noticed the words “dog” and “cat” being used a lot together or being used in similar ways in the training corpus, it specifically tried to make their cosine value go up toward 1 so as to imply that they’re similar words, intending for us to use that cosine similarity to reconstruct these patterns later.
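The cosine calculation from this section fits in a few lines of Python. This is a minimal sketch using the standard formula (the dot product of the two vectors divided by the product of their magnitudes). The helper `unit_vector_at` is a made-up name that mimics tracing the unit circle with your mouse.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b, from -1 to 1."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero-length vector")
    return dot / (norm_a * norm_b)


def unit_vector_at(degrees):
    """Point on the unit circle at the given angle, like the mouse in the demo."""
    radians = math.radians(degrees)
    return [math.cos(radians), math.sin(radians)]


baseline = [1.0, 0.0]  # the line going right from the center point

print(cosine_similarity(baseline, [2.0, 0.0]))   # 1.0  (same direction, different magnitude)
print(cosine_similarity(baseline, [0.0, 1.0]))   # 0.0  (perpendicular)
print(cosine_similarity(baseline, [-1.0, 0.0]))  # -1.0 (opposite)

# 3 degrees one way and 3 degrees the other way score the same (both about 0.9986).
print(cosine_similarity(baseline, unit_vector_at(3)))
print(cosine_similarity(baseline, unit_vector_at(357)))
```

Notice that the first example scores a perfect 1 even though one arrow is twice as long as the other: only the direction counts.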

The scalability problem

Once you have vectors for everything (say, every product in a catalog), cosine similarity can be used to recommend related items. Let’s say a user clicks on product A, and you want to find the top 10 most similar items based on cosine similarity. The obvious solution to start would be to compare vector A to every other vector in your catalog, and for each comparison, calculate the cosine similarity.

This brute-force method is fine for 1,000 items. But at 1 million items, it’s painfully slow, since for each product record, we would have to do some math with each number in the vector (which could number into the thousands). Billions of calculations for every query is unacceptable and completely impractical for real-time search.
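For reference, the brute-force approach looks something like this in Python. The product IDs and 3-number vectors are invented for illustration. The key point is the loop: every query touches every vector in the catalog.

```python
import heapq
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(product_id, catalog, k=10):
    """Compare one product against every other product and keep the top k."""
    query = catalog[product_id]
    scored = (
        (cosine_similarity(query, vector), other_id)
        for other_id, vector in catalog.items()
        if other_id != product_id
    )
    return heapq.nlargest(k, scored)


# Hypothetical catalog: product ID -> embedding vector.
catalog = {
    "cleanser-a": [0.9, 0.1, 0.0],
    "cleanser-b": [0.8, 0.2, 0.1],
    "sunscreen": [0.1, 0.9, 0.2],
    "lip-balm": [0.0, 0.2, 0.9],
}

top = most_similar("cleanser-a", catalog, k=2)
print([other_id for _, other_id in top])  # ['cleanser-b', 'sunscreen']
```

The work grows with the number of products multiplied by the number of dimensions, which is exactly why this stops being practical at catalog scale.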

One potential solution would be to divide the entire space into chunks (you could visualize this like a Minecraft world) and assign each chunk the vectors that fall into it. Then when we try to search for similar products, we know to start in our target product vector’s chunk. If we don’t find any right there, we just expand the search to the neighboring chunks. This approach, called IVF (Inverted File Index), is much faster than brute force and easy to use. However, it can be tough to find a Goldilocks chunk size that’s not too big or too small to be useful. That ideal chunk size will change with the amount of indexed data and where all of our data points are clustered inside the space. Some chunks will be painfully slow to process if they contain clusters of similar vectors, which will happen often in settings like ecommerce where a great many of our product records are semantically similar to each other. Think of a store that only sells skincare products: every product is going to land very close to the others in the vector space. This approach also takes up a ton of storage space, as you might have guessed.
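Here is a minimal Python sketch of the chunking idea. It makes a few assumptions that are not from this article: each chunk is defined by a center point, the two centers are hand-picked for the toy data (real IVF implementations typically learn them from the data with a clustering step such as k-means), and `n_probe` is the name used here for how many chunks to search.

```python
import math
from collections import defaultdict


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_chunks(vector, centers, count):
    """Indexes of the `count` chunk centers most similar to the vector."""
    ranked = sorted(
        range(len(centers)),
        key=lambda i: cosine_similarity(vector, centers[i]),
        reverse=True,
    )
    return ranked[:count]


def build_index(catalog, centers):
    """Assign every vector to its closest chunk (the 'inverted file')."""
    chunks = defaultdict(list)
    for item_id, vector in catalog.items():
        chunks[nearest_chunks(vector, centers, 1)[0]].append(item_id)
    return chunks


def search(query, catalog, centers, chunks, k=2, n_probe=1):
    """Only score the vectors stored in the n_probe chunks closest to the query."""
    candidates = []
    for chunk in nearest_chunks(query, centers, n_probe):
        candidates.extend(chunks.get(chunk, []))
    scored = [(cosine_similarity(query, catalog[i]), i) for i in candidates]
    return sorted(scored, reverse=True)[:k]


# Hypothetical data: two hand-picked chunk centers and four items.
centers = [[1.0, 0.0], [0.0, 1.0]]
catalog = {"a": [0.9, 0.1], "b": [0.8, 0.3], "c": [0.1, 0.9], "d": [0.2, 0.8]}
chunks = build_index(catalog, centers)

results = search([1.0, 0.2], catalog, centers, chunks, k=2, n_probe=1)
print([item_id for _, item_id in results])  # ['a', 'b']
```

Raising `n_probe` is the “expand to the neighboring chunks” step: it checks more chunks, which costs more time but is less likely to miss a close match sitting just across a chunk boundary.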

Over the years, data scientists have come up with better solutions that usually involve making a graph connecting vectors to each other. When we add a new vector A to our space — which is the sciency term for what’s happening behind the scenes when you add a new product to your Algolia index — we link it to its neighbors (some near, some far) and keep their IDs and the length of the connections logged on vector A’s record in our database. This duplicates data slightly, but not nearly as much as IVF. Then, when we’re trying to search through this vector database, we generate a vector for our search query and follow these steps:

Pick a random entry point into the network. We’ll call this our home.
Identify all of the long connections from home to other nodes in the network and see if traveling to those nodes gets us any closer to our query vector. Whichever link gets us closest, we consider that our new home.
From our new home, we find all the medium-length connections to other nodes and see if traveling to those nodes gets us any closer to our query vector. Whichever link gets us closest, we consider that our new home.
Repeat this process with however many levels of connection length you want. Too many or too few and you’ll be adding unnecessary steps; however, this ideal step count shouldn’t change with the composition of your indexed data.
Once you’ve gotten all the way to the last level with the shortest connections between nodes, keep traveling until no move from your current home can bring you closer to the query vector. That means your current home is the closest vector to the query that the search could find, which is usually (though not always) the true closest vector in the database.

This is one approach to this process, and it’s called Hierarchical Navigable Small World (HNSW), but there are others that are even more mind-blowing.

If you’ve been into algorithms for a while, you might recognize this as basically a greedy best-first search (a close relative of the A* algorithm) played out on an invented network of high-dimensional vectors. You could think of the top level of long connections like the big highway system, while each level down gets to progressively smaller and less navigable streets. By the time you get to the lower levels, you should be so close to your destination that you’ll barely spend any time at all on the “back roads”.
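The layered walk from the steps above can be sketched in Python like this. It is a simplified illustration, not a full HNSW implementation: the three-level network is hand-built, the entry point is fixed rather than random so the output is repeatable, and the search follows a single “home” at a time (production HNSW implementations typically track several candidates at once and build the levels automatically).

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def greedy_walk(level, vectors, home, query):
    """Keep moving to whichever linked node is closest to the query, until none is closer."""
    while True:
        best, best_score = home, cosine_similarity(vectors[home], query)
        for neighbor in level.get(home, []):
            score = cosine_similarity(vectors[neighbor], query)
            if score > best_score:
                best, best_score = neighbor, score
        if best == home:
            return home
        home = best


def layered_search(levels, vectors, entry_point, query):
    """Walk the long links first, then progressively shorter ones."""
    home = entry_point
    for level in levels:
        home = greedy_walk(level, vectors, home, query)
    return home


def at_angle(degrees):
    return [math.cos(math.radians(degrees)), math.sin(math.radians(degrees))]


# Hypothetical data: 8 nodes spaced 20 degrees apart around a circle.
vectors = {i: at_angle(20 * i) for i in range(8)}
levels = [
    {0: [4], 4: [0]},                                # highways: long links
    {0: [2], 2: [0, 4], 4: [2, 6], 6: [4]},          # main roads: medium links
    {i: [j for j in (i - 1, i + 1) if 0 <= j < 8] for i in range(8)},  # back roads
]

print(layered_search(levels, vectors, entry_point=0, query=at_angle(105)))  # 5
```

In this run the search hops from node 0 to node 4 on the highway level, to node 6 on the main roads, and finally to node 5 on the back roads, scoring only a handful of vectors along the way instead of all eight.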

If you like this post, you may want to try Algolia

Want to offer your search-engine users or customers the best algorithmically calculated personalized suggestions for similar items? Check out Algolia Recommend.

Regardless of your use case, your developers can take advantage of our API to build the recommendations experiences best suited to your needs. Our recommendation algorithm applies content-based filtering to enhance your user engagement and inspire visitors to come back. That’s good news for conversion and your bottom line.

Get a customized demo, try us free, or chat with us soon about high-quality similar-content suggestions that are bound to resonate with your customer base. We’re looking forward to hearing from you!
