
Engineering features that go into a recommender system

In the first article in this blog post series, we talked about the key components of a high-performance recommender system: data sources, feature engineering and a feature store, machine-learning models, predictions, actions, results, evaluation, and AI ethics. In the second article, we delved into common data sources for a recommender system.

In this third post, we’ll do a deep dive into the vast topic of feature engineering for recommender systems. While the path from raw data to recommendations goes through various tools and systems, the process involves two mathematical entities that are the bread and butter of any recommendation system: features and models.

image from data to recommendations

A feature is a numeric representation of raw data. Feature engineering is the process of composing the most appropriate features given the data, model, and task. In a basic collaborative filtering scenario we do not actually have features because ratings are, in fact, labels.

Content-based systems work with a wide variety of item descriptions and knowledge about users. Feature engineering involves converting these different types of unstructured data into standardized descriptions. Although it’s possible to use any kind of representation, such as a multidimensional data representation, the most common approach is to extract keywords from the underlying data. 

Items have multiple fields in which various attributes are listed. For example, books have a description, title, and author. In some cases, these descriptions and attributes can be converted into keywords. 

ItemID: 0000031852
Title: 2034: A Novel of the Next World War
Authors: Elliot Ackerman, Admiral James Stavridis USN
Description: From two former military officers and award-winning authors, a chillingly authentic geopolitical thriller that imagines a naval clash between the US and China in the South China Sea in 2034, and the path from there to a nightmarish global conflagration.
Genre: Thrillers & Suspense
Price: $17.84

On the other hand, you can work directly with a multidimensional (structured) representation when attributes contain numerical quantities (e.g., price) or fields that are drawn from a small universe of possibilities (e.g., genre).

In addition to describing items, you’ll want to collect user profile data. For example, MovieLens, a classic recommendation data set, contains three user attributes: gender, age, and occupation. Because these are single-label attributes, they can be encoded during preprocessing using one-hot encoding.

UserID
Gender: M/F
Age:
  * 1: “Under 18”
  * 18: “18-24”
  * 25: “25-34”
  * 35: “35-44”
  * 45: “45-49”
  * 50: “50-55”
  * 56: “56+”
Occupation:
  * 0: “other” or not specified
  * 1: “academic/educator”
  * 2: “artist”
  * 3: “clerical/admin”
  * 4: “college/grad student”
  * 5: “customer service”
  * 6: “doctor/health care”
  * 7: “executive/managerial”
  * 8: “farmer”
  * 9: “homemaker”
  * 10: “K-12 student”
  * 11: “lawyer”
  * 12: “programmer”
  * 13: “retired”
  * 14: “sales/marketing”
  * 15: “scientist”
  * 16: “self-employed”
  * 17: “technician/engineer”
  * 18: “tradesman/craftsman”
  * 19: “unemployed”
  * 20: “writer”

Finally, not all features are created equal. You can apply feature weighting, which gives differential weights depending on a feature’s importance, or feature selection, which includes or excludes attributes based on relevance.

Now let’s explore feature engineering methodologies for the most common item and user attributes in a recommendation engine.

Numerical features


The price attribute contained in an item dataset is a continuous variable because it can take on an uncountable set of values, and it may contain any value within a given range. To transform this raw feature into a format that can be ingested by a machine-learning model, you’ll use quantization: essentially, mapping a continuous value to a discrete value. Conceptually, this can be thought of as an ordered sequence of statistical bins.

image of the discretization process

There are three typical approaches to binning: 

  • Uniform: all bins have identical widths
  • Quantile based: all bins have the same number of values
  • K-means based: the values in each bin share the same nearest center of a one-dimensional k-means cluster

Uniform binning is the simplest approach. It divides a range of possible values into N bins of the same width using the formula:

    \[width = \frac{maxvalue-minvalue}{N}\]

where N is the number of bins or intervals. 

N is normally determined through experimentation — there’s no rule of thumb here.

For example, if the variable interval is [10, 110] and you want to create 5 bins, then (110 - 10) / 5 = 20, so each bin’s width is 20 and the intervals will be [10, 30], [30, 50], [50, 70], [70, 90], and [90, 110].
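The worked example above can be reproduced with a few lines of Python. This is a minimal sketch, not a reference implementation: the helper names `uniform_bin_edges` and `assign_bin` and the sample prices are assumptions made for illustration. Because neighboring intervals share a boundary, the sketch assigns a boundary value such as 30 to the upper bin and clamps the maximum value into the last bin.

```python
def uniform_bin_edges(min_value, max_value, n_bins):
    """Return the n_bins + 1 edges of equally wide bins."""
    width = (max_value - min_value) / n_bins
    return [min_value + i * width for i in range(n_bins + 1)]


def assign_bin(value, edges):
    """Return the 0-based index of the bin that contains value."""
    n_bins = len(edges) - 1
    width = (edges[-1] - edges[0]) / n_bins
    index = int((value - edges[0]) // width)
    # Clamp so that the maximum value lands in the last bin.
    return min(max(index, 0), n_bins - 1)


edges = uniform_bin_edges(10, 110, 5)
print(edges)  # [10.0, 30.0, 50.0, 70.0, 90.0, 110.0]

prices = [10, 29.99, 30, 75, 110]  # hypothetical sample values
bins = [assign_bin(p, edges) for p in prices]
print(bins)  # [0, 0, 1, 3, 4]
```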

The histograms for uniform, quantile, and k-means binning look something like this:

images of bins: Uniform (bins = 10), Quantile (bins = 10), K-means (bins = 10)
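To see why the choice of strategy matters, here is a minimal sketch that compares uniform and quantile binning on a small long-tailed sample using only the Python standard library. The sample values and the `count_per_bin` helper are assumptions made for illustration. K-means binning is left out to keep the sketch short; scikit-learn's `KBinsDiscretizer` offers all three approaches through its `strategy` parameter ('uniform', 'quantile', 'kmeans').

```python
from bisect import bisect_right
from statistics import quantiles

# Hypothetical long-tailed sample (e.g., clicks per item).
values = [1, 2, 3, 4, 5, 8, 13, 21, 40, 200, 1000, 5000]
n_bins = 4


def count_per_bin(bin_indices, n_bins):
    """Count how many values were assigned to each bin."""
    counts = [0] * n_bins
    for index in bin_indices:
        counts[index] += 1
    return counts


# Uniform binning: every bin has the same width.
low, high = min(values), max(values)
width = (high - low) / n_bins
uniform_bins = [min(int((v - low) // width), n_bins - 1) for v in values]

# Quantile binning: cut points are chosen so bins hold equally many values.
cut_points = quantiles(values, n=n_bins)
quantile_bins = [bisect_right(cut_points, v) for v in values]

uniform_counts = count_per_bin(uniform_bins, n_bins)
quantile_counts = count_per_bin(quantile_bins, n_bins)
print(uniform_counts)   # [11, 0, 0, 1]
print(quantile_counts)  # [3, 3, 3, 3]
```

With uniform bins, the single large value stretches the range so that almost everything falls into the first bin, while quantile bins stay evenly populated.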

Normalization and standardization

The two most-discussed scaling methods are normalization (rescaling values into a range of [0,1]) and standardization (rescaling data to have a mean of 0 and a standard deviation of 1). Here are visual representations of the data after it has been normalized and standardized: 

image of the difference-between-normalization-standardization

You can use normalization for features such as pageviews, clicks, and transaction amounts because the values are not normally (Gaussian) distributed — most of the time they’re long tail.  

Here’s the formula for normalization:

    \[{X}' = \frac{X-X_{min} }{X_{max}-X_{min}}\]

where Xmax and Xmin are the maximum and minimum values of the feature, respectively.

Standardization is a better fit for features whose values roughly follow a Gaussian (normal) distribution, such as customer review scores. Here’s the formula:

    \[{X}' = \frac{X-\mu}{\sigma}\]

where μ is the mean of the feature values and σ is the standard deviation of the feature values. Note that in this case, the values are not restricted to a particular range.
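Both formulas translate directly into code. The following is a minimal sketch using only the Python standard library; the function names and the sample `pageviews` and `ratings` lists are assumptions made for illustration, and the sketch uses the population standard deviation for σ.

```python
from statistics import mean, pstdev


def min_max_normalize(values):
    """Rescale values into [0, 1]: X' = (X - Xmin) / (Xmax - Xmin)."""
    lo, hi = min(values), max(values)
    # Assumes the values are not all identical (hi > lo).
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Rescale values to mean 0 and standard deviation 1: X' = (X - mu) / sigma."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation; assumes sigma > 0
    return [(v - mu) / sigma for v in values]


pageviews = [3, 10, 25, 120, 4000]   # hypothetical long-tailed feature
ratings = [2, 3, 3, 4, 4, 4, 5, 5]   # hypothetical review scores

normalized = min_max_normalize(pageviews)
standardized = standardize(ratings)

print(min(normalized), max(normalized))  # 0.0 1.0
```

After standardization, the values have a mean of 0 and a standard deviation of 1 (up to floating-point error), but individual values can fall outside [0, 1].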

Categorical features

Often, features are represented as categorical instead of continuous values. As in the above example, users could have features such as gender ([male, female]), age ([Under 18, 18-24, 25-34, …]), and occupation ([other, academic/educator, artist, clerical/admin, …]). Such features can be efficiently coded as integers. For instance, [male, 18-24, clerical/admin] would be expressed as [0, 1, 3], while [female, 25-34, academic/educator] would be [1, 2, 1].

We have a few options for converting categorical features to numbers with scikit-learn:

  • Use the OrdinalEncoder. This estimator converts each categorical feature to one new feature of integers (0 to n_categories – 1).

  • Use the OneHotEncoder. This estimator uses a one-of-K scheme, also known as one-hot or dummy encoding, to convert each category into its own binary (0 or 1) feature.
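To make the two schemes concrete, here is a minimal pure-Python sketch of ordinal and one-hot encoding. It is an illustration of the idea rather than the scikit-learn API: the `CATEGORIES` table (with a hand-fixed, truncated category order) and both function names are assumptions.

```python
# Hypothetical, truncated category lists; the position in each list is the code.
CATEGORIES = {
    "gender": ["male", "female"],
    "age": ["Under 18", "18-24", "25-34"],
    "occupation": ["other", "academic/educator", "artist", "clerical/admin"],
}


def ordinal_encode(row):
    """Map each categorical value to its integer position."""
    return [CATEGORIES[name].index(value) for name, value in zip(CATEGORIES, row)]


def one_hot_encode(row):
    """Map each categorical value to a block of 0/1 flags, one per category."""
    encoded = []
    for name, value in zip(CATEGORIES, row):
        encoded.extend(1 if value == category else 0 for category in CATEGORIES[name])
    return encoded


user_a = ["male", "18-24", "clerical/admin"]
user_b = ["female", "25-34", "academic/educator"]

print(ordinal_encode(user_a))  # [0, 1, 3]
print(ordinal_encode(user_b))  # [1, 2, 1]
print(one_hot_encode(user_a))  # [1, 0, 0, 1, 0, 0, 0, 0, 1]
```

Ordinal codes are compact but imply an order between categories, which makes sense for age ranges and less so for occupations. One-hot encoding avoids that implied order at the cost of one column per category.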

Text embedding

These methods (binning, normalization, standardization, and categorical encoding) are used to compose features from structured data. None of them relies on a semantic understanding of language, so let’s take a look at how we can read text-based content.

Natural-language processing (NLP) is a subfield of AI that enables computers to understand and process human language. There are two techniques for achieving this task: applying a bag-of-words model to unprocessed text, and preprocessing text in order to use a neural network model later.

Bag of words

The bag-of-words model is the most commonly used process, as it’s easy to implement and understand. The idea is to create an occurrence matrix for sentences and documents without taking into account grammar or word order.

The resulting frequencies are used to train a classifier. Note that because preprocessing the sentences is not required, this approach brings a series of drawbacks, such as sparse representation of the resulting vectors, a poor job in making sense of text data, and suboptimal performance when dealing with a large number of documents.
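As a rough illustration, the occurrence matrix can be built with the standard library alone. This is a minimal sketch: the `bag_of_words` helper, the whitespace tokenization, and the two sample documents are assumptions made for illustration.

```python
from collections import Counter


def bag_of_words(documents):
    """Return (vocabulary, matrix) where matrix[i][j] counts word j in document i."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


docs = [
    "the thriller is a naval thriller",
    "the novel is a war novel",
]
vocabulary, matrix = bag_of_words(docs)

print(vocabulary)  # ['a', 'is', 'naval', 'novel', 'the', 'thriller', 'war']
print(matrix[0])   # [1, 1, 1, 0, 1, 2, 0]
print(matrix[1])   # [1, 1, 0, 2, 1, 0, 1]
```

Word order is lost, and every row gains one column for each new word in the corpus, which is where the sparsity mentioned above comes from.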

Preprocessing text

The standard order for preprocessing sentences is tokenization, removing unnecessary punctuation and stop words, stemming, and lemmatization. 

  • Tokenization

Tokenization consists of turning sentences into words. The word_tokenize function from the nltk package tokenizes a string to split off punctuation other than periods.

  • Removing stop words

Different packages have predefined stop words. The alternative is to define a custom list of stop words relevant to the corpus. 

  • Stemming

The words in the corpus are reduced to their roots by removing suffixes and prefixes. The stemmer looks for a list of common suffixes and prefixes and removes them.

  • Lemmatization

Lemmatization has the same expected output as stemming: reducing the word to either a common base or its root. However, the lemmatizer takes into account the morphological analysis of the word, using the same base for all its inflections. 
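The four steps can be sketched end to end in plain Python. This is a deliberately toy version, assuming a tiny stop-word list, a naive suffix-stripping stemmer, and a hand-written lemma dictionary (`STOP_WORDS`, `SUFFIXES`, and `LEMMAS` are all hypothetical). In practice you would rely on a library such as nltk for each step. The sketch applies stemming and lemmatization side by side to the same tokens so the difference is visible.

```python
import re

# Hypothetical, tiny resources for illustration only.
STOP_WORDS = {"a", "an", "the", "and", "of", "in", "to", "were"}
SUFFIXES = ("ing", "ed", "es", "s")
LEMMAS = {
    "authors": "author",
    "imagined": "imagine",
    "officers": "officer",
    "winning": "win",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (drops punctuation)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stop_words(tokens):
    return [token for token in tokens if token not in STOP_WORDS]


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def lemmatize(token):
    """Look up the dictionary form of a token; fall back to the token itself."""
    return LEMMAS.get(token, token)


text = "The authors imagined a naval clash, and the officers were winning."
tokens = remove_stop_words(tokenize(text))
stems = [stem(token) for token in tokens]
lemmas = [lemmatize(token) for token in tokens]

print(tokens)  # ['authors', 'imagined', 'naval', 'clash', 'officers', 'winning']
print(stems)   # ['author', 'imagin', 'naval', 'clash', 'officer', 'winn']
print(lemmas)  # ['author', 'imagine', 'naval', 'clash', 'officer', 'win']
```

The stemmer produces truncated forms such as "imagin" and "winn", while the lemmatizer returns real dictionary words.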

Image encoding

Before jumping into this topic, you must understand the way computers “see” images. Each image is represented as either 1 or 3 matrices of pixels. Each matrix represents a channel. For black-and-white images, there is only one channel, while for colored images, there are three: Red, Green, and Blue. Each pixel is in turn represented by a number between 0 and 255, which denotes the intensity of the color.

image of black and white

Pixel values as features

The simplest way to retrieve the features from an image is to rearrange all the pixels and generate a feature vector. For a grayscale image, this is easily achieved using NumPy:
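A minimal sketch of that rearrangement, assuming a hypothetical 3 x 4 grayscale image held in a NumPy array (in practice the array would come from an image-loading library):

```python
import numpy as np

# Hypothetical 3 x 4 grayscale image; each entry is a pixel intensity in 0..255.
image = np.array(
    [
        [0, 50, 100, 150],
        [25, 75, 125, 175],
        [200, 225, 250, 255],
    ],
    dtype=np.uint8,
)

# Rearrange the 2D pixel matrix into a 1D feature vector, row by row.
features = image.reshape(-1)

print(features.shape)  # (12,)
print(features[:4])    # [  0  50 100 150]
```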

The same technique can be used for RGB images. However, a more suitable approach would be to create the feature vector by using the mean value of pixels from all the channels.
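Both options for RGB images can be sketched as follows, assuming a hypothetical 2 x 2 image stored as a height x width x channel NumPy array. Flattening everything yields three values per pixel, while averaging over the channel axis yields one value per pixel.

```python
import numpy as np

# Hypothetical 2 x 2 RGB image with shape (height, width, channels).
rgb = np.zeros((2, 2, 3), dtype=np.uint8)
rgb[..., 0] = 30  # red channel
rgb[..., 1] = 60  # green channel
rgb[..., 2] = 90  # blue channel

# Option 1: flatten every channel value into one long vector.
flat = rgb.reshape(-1)

# Option 2: average the three channels per pixel, then flatten.
mean_features = rgb.mean(axis=2).reshape(-1)

print(flat.shape)     # (12,)
print(mean_features)  # [60. 60. 60. 60.]
```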

Edge detection

An edge is a set of points in the image where the brightness and color change sharply. There are different techniques for detecting edges, the most common being the Canny edge detector algorithm. Here’s an overview of how it works:

  1. Noise reduction using a Gaussian filter
  2. Gradient calculation
  3. Non-maximum suppression (the algorithm goes through all the points on the gradient intensity matrix and finds the pixels with the maximum values in the edge directions)
  4. Double threshold (splits the pixels into two categories: strong and weak)
  5. Edge tracking by hysteresis (converts the weak pixels into strong ones if and only if there’s another strong pixel as their neighbor) 
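Steps 4 and 5 are the easiest to illustrate in isolation. The sketch below is a simplified pure-Python version, not a full Canny implementation: it assumes steps 1 to 3 have already produced a gradient magnitude matrix, and the function names, thresholds, and sample matrix are assumptions made for illustration. Promoted weak pixels count as strong for their own neighbors, so an edge can be followed along a chain of weak pixels.

```python
STRONG, WEAK, NONE = 2, 1, 0


def double_threshold(magnitude, low, high):
    """Label each pixel as STRONG (>= high), WEAK (>= low), or NONE."""
    return [
        [STRONG if v >= high else WEAK if v >= low else NONE for v in row]
        for row in magnitude
    ]


def hysteresis(labels):
    """Keep strong pixels plus weak pixels connected to them (8-neighborhood)."""
    rows, cols = len(labels), len(labels[0])
    edges = [[label == STRONG for label in row] for row in labels]
    stack = [(r, c) for r in range(rows) for c in range(cols) if edges[r][c]]
    while stack:
        r, c = stack.pop()
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                inside = 0 <= nr < rows and 0 <= nc < cols
                if inside and labels[nr][nc] == WEAK and not edges[nr][nc]:
                    edges[nr][nc] = True  # promote the weak pixel
                    stack.append((nr, nc))
    return edges


# Hypothetical gradient magnitudes after non-maximum suppression.
magnitude = [
    [10, 10, 10, 10, 10],
    [90, 60, 55, 10, 10],
    [10, 10, 10, 10, 50],
    [10, 10, 10, 10, 10],
]

labels = double_threshold(magnitude, low=40, high=80)
edges = hysteresis(labels)

print(edges[1])  # [True, True, True, False, False]
print(edges[2])  # [False, False, False, False, False]
```

The weak pixel in the third row is discarded because it does not touch any strong or promoted pixel.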

Final word and feature stores

The ways features are maintained and served can differ significantly across projects and teams. This introduces infrastructure complexity and often results in duplication of work. Some of the challenges faced by distributed organizations include: 

  • Features are not reused
  • Feature definitions vary
  • Features take a long time to be computed
  • There is inconsistency between training and serving
  • Feature decay is unknown

To address these issues, a feature store acts as a central vault for storing documented, curated, and access-controlled features within an organization. 

Feature Store

Name: average_user_order_value
Description: The average order value for a user.
Metadata: Why the feature was added to the model, how it contributes to generalization, the name of the engineer responsible for maintaining the feature’s data source, the input type, and the output type.
Definition: Versioned code executed in a runtime environment and applied to the input to compute the feature value.


Essentially, a feature store allows data engineers to insert features. In turn, data analysts and machine-learning engineers use an API to get feature values they deem relevant. 

Additionally, feature values in a feature store should be versioned to ensure that the data analyst is able to rebuild the model with the same feature values as those used to train the previous model version. After the feature value for a given input is updated, the previous value is not erased; instead, it’s saved with a timestamp indicating when it was generated. 
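The versioning behavior described above can be sketched with a small in-memory class. This is a toy illustration under stated assumptions, not the API of any real feature store: the `FeatureStore` class, its `put` and `get` methods, the `as_of` parameter, and the sample user ID and values are all hypothetical.

```python
from datetime import datetime, timezone


class FeatureStore:
    """Toy in-memory store that keeps every value with its timestamp."""

    def __init__(self):
        # (feature_name, entity_id) -> list of (timestamp, value), oldest first
        self._history = {}

    def put(self, name, entity_id, value, timestamp=None):
        """Append a new value; earlier values are kept, never overwritten."""
        if timestamp is None:
            timestamp = datetime.now(timezone.utc)
        history = self._history.setdefault((name, entity_id), [])
        history.append((timestamp, value))
        history.sort(key=lambda pair: pair[0])

    def get(self, name, entity_id, as_of=None):
        """Return the latest value, or the latest value at or before as_of."""
        history = self._history.get((name, entity_id), [])
        candidates = [v for ts, v in history if as_of is None or ts <= as_of]
        return candidates[-1] if candidates else None


store = FeatureStore()
t1 = datetime(2023, 1, 1, tzinfo=timezone.utc)
t2 = datetime(2023, 2, 1, tzinfo=timezone.utc)

store.put("average_user_order_value", "user_42", 31.5, timestamp=t1)
store.put("average_user_order_value", "user_42", 36.0, timestamp=t2)

latest = store.get("average_user_order_value", "user_42")
training_value = store.get("average_user_order_value", "user_42", as_of=t1)

print(latest)          # 36.0
print(training_value)  # 31.5
```

Reading with `as_of` set to the original training time is what lets an analyst rebuild a model with the same feature values that the previous model version saw.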

Over the past few years, analysts and engineers have invented, experimented with, and validated various best practices that apply to feature engineering. In this article, we’ve looked at binning, normalization, standardization, and categorical encoding, along with text and image features. Other practices include generating simple features, reusing legacy systems, using IDs as features when needed, reducing cardinality when possible, using counts with caution, performing feature selection when necessary, carefully testing the code, keeping the code, model, and data in sync, isolating feature-extraction code, serializing the model and feature extractor together, and logging the values of features.

Feature engineering is a creative process, and as a machine-learning engineer, you’re in the best position to determine which features are good for your recommendation model. 

In the next post in this series, we’ll focus on building collaborative filtering recommendation models, which will be a walk in the park now that we’ve got feature engineering out of the way. Stay tuned! And if you have any questions, ask me on Twitter.
