In the first part of this series, we talked about the key components of a high-performance recommender system: (1) Data Sources, (2) Feature Store, (3) Machine Learning Models, (4 & 5) Predictions & Actions, (6) Results, (7) Evaluation and (8) AI Ethics.
In this third post, we’ll do a deep dive into the vast topic of feature engineering for recommender systems. While the path from raw data to recommendations goes through various tools and systems, the process involves two mathematical entities that are the bread and butter of any recommendation system: features and models.
A feature is a numeric representation of raw data. Feature engineering is the process of composing the most appropriate features given the data, model, and task. In a basic collaborative filtering scenario we do not actually have features because ratings are, in fact, labels.
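As a quick illustration (with made-up user IDs and ratings, reusing the book’s ItemID from the table below), here’s what such a ratings table looks like and how it pivots into the classic user-item matrix:

import pandas as pd

# toy ratings: in pure collaborative filtering these ratings are the labels we
# learn from; there are no engineered item or user features yet
ratings = pd.DataFrame({
    "userId": [1, 1, 2, 3],
    "itemId": ["0000031852", "B001", "0000031852", "B002"],
    "rating": [5, 3, 4, 2],
})

# the (sparse) user-item matrix; missing entries are the ratings we want to predict
matrix = ratings.pivot(index="userId", columns="itemId", values="rating")
print(matrix)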
Content-based systems work with a wide variety of item descriptions and knowledge about users. Feature engineering involves converting these different types of unstructured data into standardized descriptions. Although it’s possible to use any kind of representation, such as a multidimensional data representation, the most common approach is to extract keywords from the underlying data.
Items have multiple fields in which various attributes are listed. For example, books have a description, title, and author. In some cases, these descriptions and attributes can be converted into keywords.
ItemID | Title | Authors | Description | Genre | Price |
0000031852 | 2034: A Novel of the Next World War | Elliot Ackerman, Admiral James Stavridis USN | From two former military officers and award-winning authors, a chillingly authentic geopolitical thriller that imagines a naval clash between the US and China in the South China Sea in 2034–and the path from there to a nightmarish global conflagration. | Thrillers & Suspense | $17.84 |
On the other hand, you can work directly with a multidimensional (structured) representation when attributes contain numerical quantities (e.g., price) or fields that are drawn from a small universe of possibilities (e.g., genre).
In addition to describing items, you’ll want to collect user profile data. For example, MovieLens, a classic recommendation dataset, contains three user attributes: gender, age, and occupation. Because these are single-label attributes, they can be encoded during preprocessing using one-hot encoding.
UserID | Gender | Age | Occupation
Gender: M / F
Age: 1: “Under 18”, 18: “18-24”, 25: “25-34”, 35: “35-44”, 45: “45-49”, 50: “50-55”, 56: “56+”
Occupation: 0: “other” or not specified, 1: “academic/educator”, 2: “artist”, 3: “clerical/admin”, 4: “college/grad student”, 5: “customer service”, 6: “doctor/health care”, 7: “executive/managerial”, 8: “farmer”, 9: “homemaker”, 10: “K-12 student”, 11: “lawyer”, 12: “programmer”, 13: “retired”, 14: “sales/marketing”, 15: “scientist”, 16: “self-employed”, 17: “technician/engineer”, 18: “tradesman/craftsman”, 19: “unemployed”, 20: “writer”
Finally, not all features are created equal. You can apply feature weighting, which gives differential weights depending on a feature’s importance, or feature selection, which includes or excludes attributes based on relevance.
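For instance, here’s a rough sketch of both ideas using scikit-learn (the feature columns, weights, and labels are made up for illustration):

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# illustrative data: 6 items x 3 features (e.g., price, pageviews, rating count)
X = np.array([
    [17.84, 120, 35],
    [9.99,  300, 12],
    [24.50,  80, 50],
    [12.00, 150, 20],
    [31.25,  60, 70],
    [8.49,  400,  5],
])
y = np.array([1, 0, 1, 0, 1, 0])  # e.g., purchased / not purchased

# feature weighting: scale columns by hand-picked importance weights
weights = np.array([1.0, 0.5, 2.0])
X_weighted = X * weights

# feature selection: keep the k features most relevant to the label
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)
print(selector.get_support())  # boolean mask of the selected columns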
Now let’s explore feature engineering methodologies for the most common item and user attributes in a recommendation engine.
The price attribute contained in an item dataset is a continuous variable because it can take on an uncountable set of values, and it may contain any value within a given range. To transform this raw feature into a format that can be ingested by a machine-learning model, you’ll use quantization: essentially, mapping a continuous value to a discrete value. Conceptually, this can be thought of as an ordered sequence of statistical bins.
There are three typical approaches to binning: uniform, quantile, and k-means.
Uniform binning is the simplest approach. It divides the range of possible values into N bins of the same width using the formula:
width = (max value - min value) / N
where N is the number of bins or intervals.
N is normally determined through experimentation — there’s no rule of thumb here.
For example, if the variable interval is [10, 110] and you want to create 5 bins, the width is (110 - 10) / 5 = 20, so each bin’s width is 20 and the intervals will be [10, 30], [30, 50], [50, 70], [70, 90], and [90, 110].
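As a quick sanity check on that arithmetic, you can compute the edges with NumPy:

import numpy as np

# 5 uniform bins over [10, 110]: 6 edges spaced 20 apart
edges = np.linspace(10, 110, num=6)
print(edges)  # [ 10.  30.  50.  70.  90. 110.]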
The code and histograms for uniform, quantile, and k-means binning look something like this:
from sklearn.preprocessing import KBinsDiscretizer

# create the discretizer object with strategy uniform
# replace "uniform" with "quantile" or "kmeans" to change discretization strategies
discretizer = KBinsDiscretizer(n_bins=bins, encode='ordinal', strategy='uniform')
data_disc = discretizer.fit_transform(data)
[Histograms of the discretized data: Uniform (bins = 10), Quantile (bins = 10), K-means (bins = 10)]
The two most-discussed scaling methods are normalization (rescaling values into a range of [0,1]) and standardization (rescaling data to have a mean of 0 and a standard deviation of 1). Here are visual representations of the data after it has been normalized and standardized:
You can use normalization for features such as pageviews, clicks, and transaction amounts because the values are not normally (Gaussian) distributed — most of the time they’re long-tailed.
Here’s the formula for normalization:
X_norm = (X - X_min) / (X_max - X_min)
where X_max and X_min are the maximum and minimum values of the feature, respectively.
# data normalization with sklearn
from sklearn.preprocessing import MinMaxScaler

# fit scaler on training data
norm = MinMaxScaler().fit(X_train)

# transform training data
X_train_norm = norm.transform(X_train)

# transform testing data
X_test_norm = norm.transform(X_test)
Standardization is useful for features such as customer reviews because the data tends to follow a Gaussian (normal) distribution. Here’s the formula:
X_std = (X - μ) / σ
where μ is the mean of the feature values and σ is the standard deviation of the feature values. Note that in this case, the standardized values are not restricted to a particular range.
from sklearn.preprocessing import StandardScaler

# fit on training data
scale = StandardScaler().fit(X_train)

# transform the training data
X_train_stand = scale.transform(X_train)

# transform the testing data
X_test_stand = scale.transform(X_test)
Often, features are represented as categorical instead of continuous values. As in the above example, users could have features such as gender ([male, female]), age ([Under 18, 18-24, 25-34, …]), and occupation ([other, academic/educator, artist, clerical/admin, …]). Such features can be efficiently coded as integers. For instance, [male, 18-24, clerical/admin] would be expressed as [0, 1, 3], while [female, 25-34, academic/educator] would be [1, 2, 1].
We have a few options for converting categorical features to integers, such as ordinal encoding and one-hot encoding:
from sklearn.preprocessing import OrdinalEncoder

user_data = [['male', '18-24', 'clerical/admin'],
             ['female', '25-34', 'academic/educator']]
encoder = OrdinalEncoder().fit(user_data)
encoder.transform([['female', '25-34', 'clerical/admin']])
# array([[0., 1., 1.]])
from sklearn.preprocessing import OneHotEncoder

user_data = [['male', '18-24', 'clerical/admin'],
             ['female', '25-34', 'academic/educator']]
encoder = OneHotEncoder().fit(user_data)
encoder.transform([['female', '25-34', 'clerical/admin'],
                   ['male', '25-34', 'academic/educator']]).toarray()
# array([[1., 0., 0., 1., 0., 1.],
#        [0., 1., 0., 1., 1., 0.]])
These methods — normalization, standardization, and categorical encoding — are used to compose features from numeric and categorical attributes. They can’t, however, make sense of free text, which requires a semantic understanding of language. Let’s take a look at how we can read text-based content.
Natural-language processing (NLP) is a subfield of AI that enables computers to understand and process human language. There are two techniques for achieving this task: applying a bag-of-words model to unprocessed text, and preprocessing text in order to use a neural network model later.
The bag-of-words model is the most commonly used process, as it’s easy to implement and understand. The idea is to create an occurrence matrix for sentences and documents without taking into account grammar or word order.
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["From two former military officers and award-winning authors, a chillingly authentic geopolitical thriller that imagines a naval clash between the US and China in the South China Sea in 2034–and the path from there to a nightmarish global conflagration."]

# stop_words - please see the following guidelines before choosing a value for this param:
#   https://scikit-learn.org/stable/modules/feature_extraction.html#stop-words
# ngram_range - the range of n-values for the word n-grams to be extracted;
#   (1, 1) means the vectorizer will only take unigrams into account, i.e., the bag-of-words model
# min_df / max_df - the minimum and maximum document frequency required for a term
#   to be included in the vocabulary
vectorizer = CountVectorizer(stop_words=None, ngram_range=(1, 1), min_df=1, max_df=1)

X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names())  # in scikit-learn >= 1.0, use get_feature_names_out()
'''
['2034', 'and', 'authentic', 'authors', 'award', 'between', 'chillingly', 'china', 'clash', 'conflagration', 'former', 'from', 'geopolitical', 'global', 'imagines', 'in', 'military', 'naval', 'nightmarish', 'officers', 'path', 'sea', 'south', 'that', 'the', 'there', 'thriller', 'to', 'two', 'us', 'winning']
'''
print(X.toarray())
# [[1 3 1 1 1 1 1 2 1 1 1 2 1 1 1 2 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1]]
The resulting frequencies are used to train a classifier. Note that because the sentences are not preprocessed, this approach comes with a series of drawbacks: the resulting vectors are sparse, the model does a poor job of making sense of the text, and performance suffers when dealing with a large number of documents.
The standard order for preprocessing sentences is tokenization, removing unnecessary punctuation and stop words, stemming, and lemmatization.
Tokenization consists of turning sentences into words. The word_tokenize function from the nltk package tokenizes a string and splits off punctuation other than periods.
import nltk
nltk.download('punkt')  # a pre-trained Punkt tokenizer for English
from nltk.tokenize import word_tokenize

text = "From two former military officers and award-winning authors, a chillingly authentic geopolitical thriller that imagines a naval clash between the US and China in the South China Sea in 2034–and the path from there to a nightmarish global conflagration."
tokens = word_tokenize(text)
print(tokens)
'''
['From', 'two', 'former', 'military', 'officers', 'and', 'award-winning', 'authors', ',', 'a', 'chillingly', 'authentic', 'geopolitical', 'thriller', 'that', 'imagines', 'a', 'naval', 'clash', 'between', 'the', 'US', 'and', 'China', 'in', 'the', 'South', 'China', 'Sea', 'in', '2034–and', 'the', 'path', 'from', 'there', 'to', 'a', 'nightmarish', 'global', 'conflagration', '.']
'''
nltk.download('stopwords')  # 2,400 stopwords for 11 languages
from nltk.corpus import stopwords

# remove common English words that carry little meaning on their own;
# matching is case-sensitive, which is why capitalized words like 'From' and 'US' survive
stop_words = set(stopwords.words('english'))
tokens = [w for w in tokens if w not in stop_words]
print(tokens)
'''
['From', 'two', 'former', 'military', 'officers', 'award-winning', 'authors', ',', 'chillingly', 'authentic', 'geopolitical', 'thriller', 'imagines', 'naval', 'clash', 'US', 'China', 'South', 'China', 'Sea', '2034–and', 'path', 'nightmarish', 'global', 'conflagration', '.']
'''
The words in the corpus are reduced to their roots by removing suffixes and prefixes. The stemmer looks for a list of common suffixes and prefixes and removes them.
from nltk.stem.porter import PorterStemmer  # other stemmers are also available

porter = PorterStemmer()
stems = []
for t in tokens:
    stems.append(porter.stem(t))
print(stems)
'''
['from', 'two', 'former', 'militari', 'offic', 'award-win', 'author', ',', 'chillingli', 'authent', 'geopolit', 'thriller', 'imagin', 'naval', 'clash', 'us', 'china', 'south', 'china', 'sea', '2034–and', 'path', 'nightmarish', 'global', 'conflagr', '.']
'''
Lemmatization has the same expected output as stemming: reducing the word to either a common base or its root. However, the lemmatizer takes into account the morphological analysis of the word, using the same base for all its inflections.
nltk.download('wordnet')  # the lemmatizer needs a dictionary

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
lemmas = []
for t in tokens:
    # lemmatize() defaults to treating each word as a noun, which is why
    # 'imagines' is mapped to 'imago' (its plural as a noun)
    lemmas.append(lemmatizer.lemmatize(t))
print(lemmas)
'''
['From', 'two', 'former', 'military', 'officer', 'award-winning', 'author', ',', 'chillingly', 'authentic', 'geopolitical', 'thriller', 'imago', 'naval', 'clash', 'US', 'China', 'South', 'China', 'Sea', '2034–and', 'path', 'nightmarish', 'global', 'conflagration', '.']
'''
Before jumping into this topic, you must understand the way computers “see” images. Each image is represented as either 1 or 3 matrices of pixels. Each matrix represents a channel. For black-and-white images, there is only one channel, while for colored images, there are three: Red, Green, and Blue. Each pixel is in turn represented by a number between 0 and 255, which denotes the intensity of the color.
The simplest way to retrieve the features from an image is to rearrange all the pixels and generate a feature vector. For a grayscale image, this is easily achieved using NumPy:
import skimage
from skimage import data, io  # data has standard test images
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline

camera = data.camera()
io.imshow(camera)
plt.show()

print(camera.shape)  # (height, width)
# (512, 512)

features = np.reshape(camera, (512*512))
print(features.shape, features)
# ((262144,), array([156, 157, 160, ..., 121, 113, 111], dtype=uint8))
The same technique can be used for RGB images. However, a more suitable approach would be to create the feature vector by using the mean value of pixels from all the channels.
import skimage
from skimage import data, io  # data has standard test images
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline

astronaut = data.astronaut()
io.imshow(astronaut)
plt.show()

print(astronaut.shape)  # (height, width, no. of channels)
# (512, 512, 3)

feature_matrix = np.zeros((512, 512))  # matrix initialized with 0
for i in range(0, astronaut.shape[0]):
    for j in range(0, astronaut.shape[1]):
        # cast to int to avoid uint8 overflow when summing the three channel values
        feature_matrix[i][j] = (int(astronaut[i, j, 0]) + int(astronaut[i, j, 1]) + int(astronaut[i, j, 2])) / 3

# equivalently: feature_matrix = astronaut.mean(axis=2)
print(feature_matrix)
'''
array([[65.33333333, 26.66666667, 74.33333333, ..., 35.33333333, 29.        , 32.66666667],
       [ 2.33333333, 57.33333333, 31.66666667, ..., 33.66666667, 30.33333333, 28.66666667],
       [25.33333333,  7.66666667, 80.33333333, ..., 36.33333333, 32.66666667, 30.33333333],
       ...,
       [ 6.66666667,  7.        ,  3.        , ...,  0.        ,  0.33333333,  0.        ],
       [ 3.33333333,  2.66666667,  4.33333333, ...,  0.33333333,  1.        ,  0.        ],
       [ 3.66666667,  1.66666667,  0.33333333, ...,  0.        ,  1.        ,  0.        ]])
'''
An edge is a set of points in the image where the brightness or color changes sharply. There are different techniques for detecting edges, the most common being the Canny edge detector. Here’s an overview of how it works: the image is smoothed with a Gaussian filter, intensity gradients are computed, non-maximum suppression thins the candidate edges, and hysteresis thresholding keeps only the strong, connected edges.
import skimage
from skimage import data  # data has standard test images
import cv2
from matplotlib import pyplot as plt
%matplotlib inline

camera = data.camera()
# the image should be grayscale for Canny; data.camera() already is
# (otherwise convert it first, e.g. with skimage.color.rgb2gray)
edges = cv2.Canny(camera, 100, 200)  # thresholds for the hysteresis procedure

plt.subplot(121), plt.imshow(camera, cmap='gray')
plt.title('Original Image')
plt.subplot(122), plt.imshow(edges, cmap='gray')
plt.title('Edge Image')
plt.show()
The ways features are maintained and served can differ significantly across projects and teams. This introduces infrastructure complexity and often results in duplicated work, especially in distributed organizations.
To address these issues, a feature store acts as a central vault for storing documented, curated, and access-controlled features within an organization.
Feature Store
Name: average_user_order_value
Description: The average order value for a user.
Metadata: Why the feature was added to the model, how it contributes to generalization, the name of the engineer in the organization responsible for maintaining the feature’s data source, the input type, and the output type.
Definition: Versioned code, executed in a runtime environment and applied to the input, that computes the feature value.
Essentially, a feature store allows data engineers to insert features. In turn, data analysts and machine-learning engineers use an API to get feature values they deem relevant.
Additionally, feature values in a feature store should be versioned to ensure that the data analyst is able to rebuild the model with the same feature values as those used to train the previous model version. After the feature value for a given input is updated, the previous value is not erased; instead, it’s saved with a timestamp indicating when it was generated.
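To make this workflow concrete, here’s a minimal, purely illustrative in-memory sketch; the FeatureStore class, its methods, and the example value are hypothetical rather than the API of any particular feature store product:

from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional, Tuple

@dataclass
class FeatureStore:
    """Hypothetical in-memory store: feature name -> timestamped value history."""
    _history: Dict[str, List[Tuple[datetime, Any]]] = field(default_factory=dict)

    def insert(self, name: str, value: Any, computed_at: Optional[datetime] = None) -> None:
        # data engineers push new values; older values are kept, each with a timestamp
        ts = computed_at or datetime.now(timezone.utc)
        self._history.setdefault(name, []).append((ts, value))

    def get(self, name: str, as_of: Optional[datetime] = None) -> Any:
        # analysts / ML engineers read the latest value, or the value as of a past point in time
        history = self._history.get(name, [])
        if as_of is not None:
            history = [(ts, v) for ts, v in history if ts <= as_of]
        return history[-1][1] if history else None

store = FeatureStore()
store.insert("average_user_order_value", 42.17)  # illustrative value
print(store.get("average_user_order_value"))  # 42.17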
Throughout the past few years, analysts and engineers have invented, experimented with, and validated various best practices for feature engineering. In this article, we’ve looked at normalization, standardization, and categorical encoding. Other practices include generating simple features, reusing legacy systems, using IDs as features when needed, reducing cardinality when possible, using counts with caution, performing feature selection when necessary, testing the code carefully, keeping the code, model, and data in sync, isolating feature-extraction code, serializing the model and feature extractor together, and logging feature values.
Feature engineering is a creative process, and as a machine-learning engineer, you’re in the best position to determine which features are good for your recommendation model.
In the next post in this series, we’ll focus on building collaborative filtering recommendation models, which will be a walk in the park now that we’ve got feature engineering out of the way. Stay tuned! And if you have any questions, ask me on Twitter.
Ciprian Borodescu
AI Product Manager | On a mission to help people succeed through the use of AI