In the first part of this series on recommendations, we talked about the key components of a high-performance recommender system: (1) Data Sources, (2) Feature Engineering and Feature Store, (3) Machine Learning Models, (4 & 5) Predictions & Actions, (6) Results, (7) Evaluation, and (8) AI Ethics. The focus here is on 3 — Machine Learning as applied to Recommendations Models.
In our previous article on features recommendations, we learned that the most popular techniques used for recommender systems rely on three data source types: (1) user metadata — gender, age, occupation, location, etc., (2) attribute information about items — a merchant selling books may have descriptions of the books and keywords describing the content, title, and author, and (3) user-item interactions — click, views, transactions, reviews, etc.
In this article, we’ll be building the recommendation models themselves.
We categorize common recommendation approaches based on the way we handle these raw data-sources:
Let’s dive deeper into each one of these techniques.
In content-based filtering we recommend items based on their attributes. Thus, the input to a content-based recommender system consists of all the attributes described in the items content matrix.
The main assumption in content-based recommenders is that a user who expressed a preference for an item is likely to prefer similar items. For example, if I happen to enjoy Marvel movies and especially Ryan Reynolds in Green Lantern back in 2011, chances are I would enjoy Deadpool (2016 and 2018) … not to mention Deadpool 3 (rumour has it they’re starting production in 2022, yey!) because they share the same actor and they’re the same movie genre.
The Item Content Metrics (ICM) best represents the attributes of a pool of items, as seen below:
Now, the key question is: how do we measure the similarity of two items based on their attributes? For that, we compute cosine similarity on a matrix such as the one below:
Attribute 1 | Attribute 2 | … | Attribute N | |
Item 1 | 1 | 0 | … | 1 |
Item 2 | 0 | 1 | 0 | |
… | 1 | 0 | 1 | |
Item N | 0 | 1 | 0 |
The intuition is that if two item vectors have many attributes in common, we can assume that the two items are very similar. Mathematically, that can be expressed using the dot product operation:
And geometrically, it can be expressed as the cosine between the two vectors:
As can be seen, a smaller angle (larger cosine) means that the items have a higher degree of similarity. There are two other important concepts that will help us build the similarity matrix:
With this, the similarity matrix is computed using the formula:
Below, you can see how we compute recommendations based on the well known MovieLens dataset that consists of movie titles and user-provided tags:
First, we compute the cosine similarity between all the samples in the matrix and we save it in a new data frame that will be later used for returning top k similar movies:
# creating the tf-idf Vectorizer to analyze, at word level, unigrams and bigrams tf = TfidfVectorizer(analyzer='word', ngram_range=(1,2)) # applying the vectorizer on the 'tags' column tfidf_matrix = tf.fit_transform(movies_df['tags']) # compute the cosine similarity between all the samples in the matrix cosine_sim = cosine_similarity(tfidf_matrix) # saving the values in a DataFrame for better visualisation cosine_sim_df = pd.DataFrame(cosine_sim, index=movies_df['title'], columns=movies_df['title'])
def get_recommendations(movie_id, similarity_df, movies_df, k=10): # partitions the matrix such that the indices are in the position they would be in a sorted array ix = similarity_df.loc[:,movie_id].to_numpy().argpartition(range(-1,-k,-1)) # gets the corresponding movie titles for k+1 sorted indices closest = similarity_df.columns[ix[-1:-(k+2):-1]] # removes the queried title from the results closest = closest.drop(movie_id, errors='ignore') # returns top k most similar movies return pd.DataFrame(closest).merge(movies_df).head(k)
And finally, getting recommendations is as simple as:
x = 'Superman (1978)' # get recommendations for the given x title y = get_recommendations(x, cosine_sim_df, movies_df[['title', 'tags']])
This basic content-based recommendation algorithm will produce the following results:
The issue with the similarity matrix is that it’s very dense (computational heavy) and the similarity values are very small, introducing a lot of noise in the data, hence lowering the quality of recommendations. To solve this problem, we can use the K-Nearest Neighbors algorithm (KNN) — a method that consists in keeping only the K-most similar couples of items in the similarity matrix.
from sklearn.neighbors import NearestNeighbors # Create the unsupervised learner with k=10 and train it on our data knn = NearestNeighbors(n_neighbors=10).fit(cosine_sim_df) def get_recommendations(movie_title, similarity_df, movies_df): # Get the indices of the neighbors of the queried movie ix = knn.kneighbors(similarity_df[movie_title].to_numpy().reshape(1, -1), return_distance=False) # Get the titles of the corresponding indices closest = similarity_df.columns[ix].flatten() return pd.DataFrame(closest,columns=['title']).merge(movies_df)
The same query, using the same code as above, will produce the following results:
In the basic example above, we’ve taken into consideration only 1 attribute, namely tags. The content-based algorithm can be greatly improved by including more non-binary attributes such as title, actors, genre, etc.
The main idea of the Collaborative Filtering method is to infer the ratings a user has not explicitly given to the items by observing ratings from other highly correlated users and items where ratings have been given. They do not require any item attribute to work, but rely on the opinions of a community of users.
The input to a Collaborative Filtering recommender system is the User-Rating Matrix (URM), which contains all the user ratings on the items:
Interactions in the user rating matrix can be, for instance, explicit ratings. Explicit rating means that the user explicitly stated his or her opinion with an item. An example of an explicit rating can be the number of stars a user gave to an item. If there is no rating from a user for an item, the corresponding value in the URM is 0.
Item 1 | Item 2 | … | Item N | |
User 1 | 1 | 3 | … | 1 |
User 2 | 2 | 4 | 4 | |
… | ||||
User N | 3 | 5 | 2 |
Another type of interaction described in the URM can be implicit ratings. Implicit ratings simply state whether the user interacted with an item and therefore has only two values: 0 or 1. One means the user interacted with the item, while zero means no information.
In principle, it’s possible to design an algorithm that can use both explicit and implicit ratings. The simplest algorithms are designed to work only with one of the two types: implicit or explicit ratings.
Although there are different types of collaborative filtering techniques, below we’ll cover user-based and item-based techniques.
When we want to make recommendations to a user, user-based collaborative filtering can be used to search for other users with similar tastes and recommend the items those users like the most.
Thus, the first problem is to find a way to measure the similarity between users. How can we measure this similarity?
We start by looking at the ratings of the users and compare user tastes based on those ratings. If two users have given similar ratings to several items, we can assume that those two users have the same opinion on those items; and if they have the same opinions on a lot of items, we can assume that the two users are similar.
We compute the similarity between users based on their ratings, and we create a similarity matrix, where the value of row i and column j is the similarity score between user i and user j.
To compute the similarity between two users, we use the cosine similarity. For computational purposes we will compute the similarities between a user and all the other users at the same time:
# create URM URM = np.zeros((num_users, num_items), dtype=int) for _, row in data.iterrows(): URM[row['user_id'] - 1, row['item_id'] - 1 ] = row['rating']
# find the most N similar users to the current one (we picked N to be 3) def similar_users(user_id, URM, N = 3): number_of_users = URM.shape[0] # create the list of other users (take all users ids and then remove the id for the current one) other_users_ids = np.array(range(number_of_users), dtype=int) other_users_ids = np.delete(other_users_ids, user_id).tolist() # take the ratings made by the current user user = URM[user_id].tolist() # iterate through all the other users ids and compute the similarity to the current user other_users_similarity = [] for other_user_id in other_users_ids: other_user = URM[other_user_id].tolist() other_users_similarity.append(cosine_similarity_pair_of_users(user, other_user)) # create a dictionary with the user indexes and their similarity index_similarities = dict(zip(other_users_ids, other_users_similarity)) # sort by the similarity index_similarity_sorted = sorted(index_similarities.items(), key=lambda x: x[1], reverse=True) # get the first N users and their similarity top_user_similarity = index_similarity_sorted[:N] # get the ids for the top N users top_user_ids = [usr[0] for usr in top_user_similarity] return top_user_ids
At this point, identifying the similar user for a given user_id is straightforward:
# for example we take the user with the id 48 example_user_id = 48 top_similar_users = similar_users(example_user_id, URM)
The next step is to actually define the top K items to be recommended to our user. Here’s the code to do exactly that, once we have the top similar users identified:
def recommend_topK_items_for_user(user_id, similar_users_ids, URM, topK = 5): # get the similar users ratings similar_users = URM[similar_users_ids] # compute the average rating for every item rated by the similar users average_ratings = similar_users.mean(axis=0) # get the current user user = URM[user_id] # transpose it user_transposed = user.transpose() # get the items ids where the rating is 0 (the user didn't rate the item) items = np.where(user_transposed == 0)[0] # get the similar users ratings for the unrated items by the current user similar_users_ratings = similar_users[:, items] average_ratings = average_ratings[items] # order the items by the average of their ratings item_indexes = np.argsort(average_ratings)[::-1] # get the top k items indexes top_items_index = item_indexes[:topK] return top_items_index.tolist()
Final step is generating the user-based recommendations and comparing them with the rated movies:
recommended_items = recommend_topK_items_for_user(example_user_id, top_similar_users, URM) # items rated by the user user_ratings = URM[example_user_id] items_rated = np.where(user_ratings != 0)[0].tolist() items_info.loc[items_info['movie_id'].isin(items_rated)] # recommended items for user items_info.loc[items_info['movie_id'].isin(recommended_items)]
In an item-based collaborative filtering approach, in order to predict a user’s ratings for a target item, we have to determine the set of items that are most similar to the target item. The idea is to calculate the similarity between each pair of items according to how many users have rated them both. Then we use ratings specified by the user in that item to predict if he or she will like the target items.
Thus, the first step is to calculate the similarity between items as shown below:
def similar_items(item_id, URM, topK = 5): # get the current item item = URM[:, item_id].tolist() # get all the other items number_of_items = URM.shape[1] other_items_ids = np.array(range(number_of_items), dtype=int) other_items_ids = np.delete(other_items_ids, item_id).tolist() # compute the similarity between the current item and the other items other_items_similarity = [] for other_item_id in other_items_ids: other_item = URM[:, other_item_id].tolist() other_items_similarity.append(cosine_similarity_pair_of_items(item, other_item)) # create a dictionary with the item indexes and their similarity index_similarities = dict(zip(other_items_ids, other_items_similarity)) # sort by the similarity index_similarity_sorted = sorted(index_similarities.items(), key=lambda x: x[1], reverse=True) # get the first K items and their similarity top_item_similarity = index_similarity_sorted[:topK] # get the ids for the top K items top_item_ids = [itm[0] for itm in top_item_similarity] return top_item_ids
Just like with user-based collaborative filtering, at this stage, similarity between items can be calculated as follows:
example_item_id = 29 recommended_items = similar_items(example_item_id, URM)
And finally, getting recommendations:
example_item_id = 29 # recommended items items_info.loc[items_info['movie_id'].isin(recommended_items)]
After seeing how collaborative filtering works, you might still have a few pending questions, such as:
To answer these questions, we need to look at how similarity is calculated — the cosine similarity formula can be adapted to calculate the similarity between users and the similarity between items starting from the user rating matrix or URM.
Considering the similarity between items, the approach to the problem is the same as what we’ve seen with content-based filtering techniques. In content-based filtering, we can recommend items to the user based on what he or she has previously liked. The similarity is measured by taking into account attributes. If two items have a lot of attributes in common, we can say that these two items are similar.
Can we measure the similarity between the items without knowing the attributes?
Yes, we don’t need to know anything about the attributes of the items as we can compute our similarity matrix knowing the ratings coming from the user rating matrix or URM.
Up to now, we’ve used memory-based techniques for our collaborative filtering algorithms, in which we predict ratings on the basis of user similarity. They use the ratings of the user rating matrix or URM directly in production and they are easier to implement.
However, there is a second technique commonly used in collaborative filtering: model-based methods.
Model-based techniques do not rely on the whole data set when recommendations are computed, but they extract information from the data set to build a model. These techniques need two steps for the prediction: (1) the first step is to build the model; (2) the second step is to estimate ratings using a function that takes as inputs the model built in the first step, and the user profile.
What happens if we have a new user in the system?
In the case of memory-based techniques, we need to add the new user to the URM and recompute the similarity between the new user and all the other users. This operation is computationally heavy and we can make recommendations only to users present in the model.
In the case of model-based techniques, we don’t need to add the user to the URM, if the URM is big enough. We don’t have to recompute the similarity matrix; it can be updated once a day, or every week or month. Therefore, we can make recommendations even to users who are not in the model.
At a minimum, a hybrid recommender system combines one collaborative filtering technique with one content-based technique. However, we can build a hybrid recommender system by putting together as many techniques as we want.
There are dozens of different approaches we can use to build a hybrid recommender system and all these approaches can be roughly classified into five families:
The Factorization Machines algorithm is a general-purpose supervised learning algorithm that can be used for both classification and regression tasks, making it attractive for a collaborative filtering approach. The input for our algorithm is structured as a table, in which all the columns, except the last one, are divided into groups. For example, we have two groups: users and items. Each group corresponds to one dimension of the URM. These columns represent the features of the data set. The last column contains the ratings of the users for items, which makes it the target to be predicted in a recommender system scenario.
In order to represent users and items, we use One Hot Encoding. Using this method, we assign a column to every user and item. By setting the proper column to 1 and all the others to 0, we can select a specific user and a specific item. For example, the first column could be the first user in the URM, the second column could be the second user, and so on. By setting the first column to one and the others to zero, we select the first user. Similarly, in the item group, if we set the first column to one, and others to zero, we specify the first item.
Compared to the URM, this format uses a larger table with more zeroes and this makes the table more sparse than the URM. According to this representation, for each interaction in the URM we insert a row in the table. For instance, if the URM has 1000 interactions, we should insert 1000 rows in our table. Once we have defined the data structure of factorization machines, we can start studying the model used to estimate unknown ratings.
As much as we’d like to have one general recommendation system that fits all real life scenarios, it’s more realistic to get into the mindset of having a family of recommendation models, each one built for a specific purpose.
For example, if you’re an eCommerce business, you might want to use collaborative filtering techniques to offer item-based recommendations, just like we offer with Algolia Recommend. However, if you’re a media company with a portfolio of digital publications, offering content-based recommendations might make more sense.
Or, in case you’re looking for personalized recommendations, then user-based collaborative filtering is the way to go. Also, deciding on model-based vs. memory-based techniques is more of a technical consideration which comes into play when scalability becomes an issue.
Most of the time a hybrid approach is the best strategy and clearly the way to go for companies that have a machine learning team capable of iterating and fine tuning recommender models.
In the next blog post in this series we’ll focus on the actions various recommender systems can facilitate via the predictions they’re generating. Stay tuned! And if you have any questions: https://twitter.com/cborodescu
Ciprian Borodescu
AI Product Manager | On a mission to help people succeed through the use of AIPowered by Algolia AI Recommendations
Ciprian Borodescu
AI Product Manager | On a mission to help people succeed through the use of AICatherine Dee
Search and Discovery writerVincent Caruana
Senior Digital Marketing Manager, SEO