In the first part of this series, we talked about the key components of a high-performance recommender system: (1) Data Sources, (2) Feature Store, (3) Machine Learning Models, (4 & 5) Predictions & Actions, (6) Results, (7) Evaluation and (8) AI Ethics.
In this third post, we’ll do a deep dive into the vast topic of feature engineering for recommender systems. While the path from raw data to recommendations goes through various tools and systems, the process involves two mathematical entities that are the bread and butter of any recommendation system: features and models.
A feature is a numeric representation of raw data. Feature engineering is the process of composing the most appropriate features given the data, model, and task. In a basic collaborative filtering scenario we do not actually have features because ratings are, in fact, labels.
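As a quick illustration (with made-up user IDs and ratings, reusing the book’s ItemID from the table below), here’s what such a ratings table looks like and how it pivots into the classic user-item matrix:

import pandas as pd

# toy ratings: in pure collaborative filtering these ratings are the labels we
# learn from; there are no engineered item or user features yet
ratings = pd.DataFrame({
    "userId": [1, 1, 2, 3],
    "itemId": ["0000031852", "B001", "0000031852", "B002"],
    "rating": [5, 3, 4, 2],
})

# the (sparse) user-item matrix; missing entries are the ratings we want to predict
matrix = ratings.pivot(index="userId", columns="itemId", values="rating")
print(matrix)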
Content-based systems work with a wide variety of item descriptions and knowledge about users. Feature engineering involves converting these different types of unstructured data into standardized descriptions. Although it’s possible to use any kind of representation, such as a multidimensional data representation, the most common approach is to extract keywords from the underlying data.
Items have multiple fields in which various attributes are listed. For example, books have a description, title, and author. In some cases, these descriptions and attributes can be converted into keywords.
ItemID | Title | Authors | Description | Genre | Price |
0000031852 | 2034: A Novel of the Next World War | Elliot Ackerman, Admiral James Stavridis USN | From two former military officers and award-winning authors, a chillingly authentic geopolitical thriller that imagines a naval clash between the US and China in the South China Sea in 2034–and the path from there to a nightmarish global conflagration. | Thrillers & Suspense | $17.84 |
On the other hand, you can work directly with a multidimensional (structured) representation when attributes contain numerical quantities (e.g., price) or fields that are drawn from a small universe of possibilities (e.g., genre).
In addition to describing items, you’ll want to collect user profile data. For example, MovieLens, a classic recommendation dataset, contains three user attributes: gender, age, and occupation. Because these are single-label attributes, they can be encoded during preprocessing using one-hot encoding.
UserID | Gender | Age | Occupation
Gender: M / F
Age: 1: “Under 18”, 18: “18-24”, 25: “25-34”, 35: “35-44”, 45: “45-49”, 50: “50-55”, 56: “56+”
Occupation: 0: “other” or not specified, 1: “academic/educator”, 2: “artist”, 3: “clerical/admin”, 4: “college/grad student”, 5: “customer service”, 6: “doctor/health care”, 7: “executive/managerial”, 8: “farmer”, 9: “homemaker”, 10: “K-12 student”, 11: “lawyer”, 12: “programmer”, 13: “retired”, 14: “sales/marketing”, 15: “scientist”, 16: “self-employed”, 17: “technician/engineer”, 18: “tradesman/craftsman”, 19: “unemployed”, 20: “writer”
Finally, not all features are created equal. You can apply feature weighting, which gives differential weights depending on a feature’s importance, or feature selection, which includes or excludes attributes based on relevance.
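For instance, here’s a rough sketch of both ideas using scikit-learn (the feature columns, weights, and labels are made up for illustration):

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# illustrative data: 6 items x 3 features (e.g., price, pageviews, rating count)
X = np.array([
    [17.84, 120, 35],
    [9.99,  300, 12],
    [24.50,  80, 50],
    [12.00, 150, 20],
    [31.25,  60, 70],
    [8.49,  400,  5],
])
y = np.array([1, 0, 1, 0, 1, 0])  # e.g., purchased / not purchased

# feature weighting: scale columns by hand-picked importance weights
weights = np.array([1.0, 0.5, 2.0])
X_weighted = X * weights

# feature selection: keep the k features most relevant to the label
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)
print(selector.get_support())  # boolean mask of the selected columns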
Now let’s explore feature engineering methodologies for the most common item and user attributes in a recommendation engine.
The price attribute contained in an item dataset is a continuous variable because it can take on an uncountable set of values, and it may contain any value within a given range. To transform this raw feature into a format that can be ingested by a machine-learning model, you’ll use quantization: essentially, mapping a continuous value to a discrete value. Conceptually, this can be thought of as an ordered sequence of statistical bins.
There are three typical approaches to binning: uniform, quantile, and k-means.
Uniform binning is the simplest approach. It divides the range of possible values into N bins of the same width using the formula:
width = (max value - min value) / N
where N is the number of bins or intervals.
N is normally determined through experimentation — there’s no rule of thumb here.
For example, if the variable interval is [10, 110] and you want to create 5 bins, the width is (110 - 10) / 5 = 20, so each bin’s width is 20 and the intervals will be [10, 30], [30, 50], [50, 70], [70, 90], and [90, 110].
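As a quick sanity check on that arithmetic, you can compute the edges with NumPy:

import numpy as np

# 5 uniform bins over [10, 110]: 6 edges spaced 20 apart
edges = np.linspace(10, 110, num=6)
print(edges)  # [ 10.  30.  50.  70.  90. 110.]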
The code and histograms for uniform, quantile, and k-means binning look something like this:
from sklearn.preprocessing import KBinsDiscretizer

# create the discretizer object with strategy uniform
# replace "uniform" with "quantile" or "kmeans" to change discretization strategies
discretizer = KBinsDiscretizer(n_bins=bins, encode='ordinal', strategy='uniform')
data_disc = discretizer.fit_transform(data)
[Histograms of the discretized data: Uniform (bins = 10), Quantile (bins = 10), K-means (bins = 10)]
The two most-discussed scaling methods are normalization (rescaling values into a range of [0,1]) and standardization (rescaling data to have a mean of 0 and a standard deviation of 1). Here are visual representations of the data after it has been normalized and standardized:
You can use normalization for features such as pageviews, clicks, and transaction amounts because the values are not normally (Gaussian) distributed — most of the time they’re long-tailed.
Here’s the formula for normalization:
X_norm = (X - X_min) / (X_max - X_min)
where X_max and X_min are the maximum and minimum values of the feature, respectively.
# data normalization with sklearn
from sklearn.preprocessing import MinMaxScaler

# fit scaler on training data
norm = MinMaxScaler().fit(X_train)

# transform training data
X_train_norm = norm.transform(X_train)

# transform testing data
X_test_norm = norm.transform(X_test)
Standardization is useful for features such as customer reviews because the data tends to follow a Gaussian (normal) distribution. Here’s the formula:
X_std = (X - μ) / σ
where μ is the mean of the feature values and σ is the standard deviation of the feature values. Note that in this case, the standardized values are not restricted to a particular range.
from sklearn.preprocessing import StandardScaler

# fit on training data
scale = StandardScaler().fit(X_train)

# transform the training data
X_train_stand = scale.transform(X_train)

# transform the testing data
X_test_stand = scale.transform(X_test)
Often, features are represented as categorical instead of continuous values. As in the above example, users could have features such as gender ([male, female]), age ([Under 18, 18-24, 25-34, …]), and occupation ([other, academic/educator, artist, clerical/admin, …]). Such features can be efficiently coded as integers. For instance, [male, 18-24, clerical/admin] would be expressed as [0, 1, 3], while [female, 25-34, academic/educator] would be [1, 2, 1].
We have a few options for converting categorical features to integers, such as ordinal encoding and one-hot encoding:
from sklearn.preprocessing import OrdinalEncoder

user_data = [['male', '18-24', 'clerical/admin'],
             ['female', '25-34', 'academic/educator']]
encoder = OrdinalEncoder().fit(user_data)
encoder.transform([['female', '25-34', 'clerical/admin']])
# array([[0., 1., 1.]])
from sklearn.preprocessing import OneHotEncoder

user_data = [['male', '18-24', 'clerical/admin'],
             ['female', '25-34', 'academic/educator']]
encoder = OneHotEncoder().fit(user_data)
encoder.transform([['female', '25-34', 'clerical/admin'],
                   ['male', '25-34', 'academic/educator']]).toarray()
# array([[1., 0., 0., 1., 0., 1.],
#        [0., 1., 0., 1., 1., 0.]])
These methods — normalization, standardization, and categorical encoding — are used to compose features from numeric and categorical attributes. They can’t, however, make sense of free text, which requires a semantic understanding of language. Let’s take a look at how we can read text-based content.
Natural-language processing (NLP) is a subfield of AI that enables computers to understand and process human language. There are two techniques for achieving this task: applying a bag-of-words model to unprocessed text, and preprocessing text in order to use a neural network model later.
The bag-of-words model is the most commonly used process, as it’s easy to implement and understand. The idea is to create an occurrence matrix for sentences and documents without taking into account grammar or word order.
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["From two former military officers and award-winning authors, a chillingly authentic geopolitical thriller that imagines a naval clash between the US and China in the South China Sea in 2034–and the path from there to a nightmarish global conflagration."]

# stop_words - please see the following guidelines before choosing a value for this param:
#   https://scikit-learn.org/stable/modules/feature_extraction.html#stop-words
# ngram_range - the range of n-values for the word n-grams to be extracted;
#   (1, 1) means the vectorizer will only take unigrams into account, i.e., the bag-of-words model
# min_df / max_df - the minimum and maximum document frequency required for a term
#   to be included in the vocabulary
vectorizer = CountVectorizer(stop_words=None, ngram_range=(1, 1), min_df=1, max_df=1)

X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names())  # in scikit-learn >= 1.0, use get_feature_names_out()
'''
['2034', 'and', 'authentic', 'authors', 'award', 'between', 'chillingly', 'china', 'clash', 'conflagration', 'former', 'from', 'geopolitical', 'global', 'imagines', 'in', 'military', 'naval', 'nightmarish', 'officers', 'path', 'sea', 'south', 'that', 'the', 'there', 'thriller', 'to', 'two', 'us', 'winning']
'''
print(X.toarray())
# [[1 3 1 1 1 1 1 2 1 1 1 2 1 1 1 2 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1]]
The resulting frequencies are used to train a classifier. Note that because the sentences are not preprocessed, this approach comes with a series of drawbacks: the resulting vectors are sparse, the model does a poor job of making sense of the text, and performance suffers when dealing with a large number of documents.
The standard order for preprocessing sentences is tokenization, removing unnecessary punctuation and stop words, stemming, and lemmatization.
Tokenization consists of turning sentences into words. The word_tokenize function from the nltk package tokenizes a string and splits off punctuation other than periods.
import nltk
nltk.download('punkt')  # a pre-trained Punkt tokenizer for English
from nltk.tokenize import word_tokenize

text = "From two former military officers and award-winning authors, a chillingly authentic geopolitical thriller that imagines a naval clash between the US and China in the South China Sea in 2034–and the path from there to a nightmarish global conflagration."
tokens = word_tokenize(text)
print(tokens)
'''
['From', 'two', 'former', 'military', 'officers', 'and', 'award-winning', 'authors', ',', 'a', 'chillingly', 'authentic', 'geopolitical', 'thriller', 'that', 'imagines', 'a', 'naval', 'clash', 'between', 'the', 'US', 'and', 'China', 'in', 'the', 'South', 'China', 'Sea', 'in', '2034–and', 'the', 'path', 'from', 'there', 'to', 'a', 'nightmarish', 'global', 'conflagration', '.']
'''
nltk.download('stopwords')  # 2,400 stopwords for 11 languages
from nltk.corpus import stopwords

# remove common English words that carry little meaning on their own;
# matching is case-sensitive, which is why capitalized words like 'From' and 'US' survive
stop_words = set(stopwords.words('english'))
tokens = [w for w in tokens if w not in stop_words]
print(tokens)
'''
['From', 'two', 'former', 'military', 'officers', 'award-winning', 'authors', ',', 'chillingly', 'authentic', 'geopolitical', 'thriller', 'imagines', 'naval', 'clash', 'US', 'China', 'South', 'China', 'Sea', '2034–and', 'path', 'nightmarish', 'global', 'conflagration', '.']
'''
The words in the corpus are reduced to their roots by removing suffixes and prefixes. The stemmer looks for a list of common suffixes and prefixes and removes them.
from nltk.stem.porter import PorterStemmer  # other stemmers are also available

porter = PorterStemmer()
stems = []
for t in tokens:
    stems.append(porter.stem(t))
print(stems)
'''
['from', 'two', 'former', 'militari', 'offic', 'award-win', 'author', ',', 'chillingli', 'authent', 'geopolit', 'thriller', 'imagin', 'naval', 'clash', 'us', 'china', 'south', 'china', 'sea', '2034–and', 'path', 'nightmarish', 'global', 'conflagr', '.']
'''
Lemmatization has the same expected output as stemming: reducing the word to either a common base or its root. However, the lemmatizer takes into account the morphological analysis of the word, using the same base for all its inflections.
nltk.download('wordnet')  # the lemmatizer needs a dictionary

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
lemmas = []
for t in tokens:
    # lemmatize() defaults to treating each word as a noun, which is why
    # 'imagines' is mapped to 'imago' (its plural as a noun)
    lemmas.append(lemmatizer.lemmatize(t))
print(lemmas)
'''
['From', 'two', 'former', 'military', 'officer', 'award-winning', 'author', ',', 'chillingly', 'authentic', 'geopolitical', 'thriller', 'imago', 'naval', 'clash', 'US', 'China', 'South', 'China', 'Sea', '2034–and', 'path', 'nightmarish', 'global', 'conflagration', '.']
'''
Before jumping into this topic, you must understand the way computers “see” images. Each image is represented as either 1 or 3 matrices of pixels. Each matrix represents a channel. For black-and-white images, there is only one channel, while for colored images, there are three: Red, Green, and Blue. Each pixel is in turn represented by a number between 0 and 255, which denotes the intensity of the color.
The simplest way to retrieve the features from an image is to rearrange all the pixels and generate a feature vector. For a grayscale image, this is easily achieved using NumPy:
import skimage
from skimage import data, io  # data has standard test images
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline

camera = data.camera()
io.imshow(camera)
plt.show()

print(camera.shape)  # (height, width)
# (512, 512)

features = np.reshape(camera, (512*512))
print(features.shape, features)
# ((262144,), array([156, 157, 160, ..., 121, 113, 111], dtype=uint8))
The same technique can be used for RGB images. However, a more suitable approach would be to create the feature vector by using the mean value of pixels from all the channels.
import skimage
from skimage import data, io  # data has standard test images
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline

astronaut = data.astronaut()
io.imshow(astronaut)
plt.show()

print(astronaut.shape)  # (height, width, no. of channels)
# (512, 512, 3)

feature_matrix = np.zeros((512, 512))  # matrix initialized with 0
for i in range(0, astronaut.shape[0]):
    for j in range(0, astronaut.shape[1]):
        # cast to int to avoid uint8 overflow when summing the three channel values
        feature_matrix[i][j] = (int(astronaut[i, j, 0]) + int(astronaut[i, j, 1]) + int(astronaut[i, j, 2])) / 3

# equivalently: feature_matrix = astronaut.mean(axis=2)
print(feature_matrix)
'''
array([[65.33333333, 26.66666667, 74.33333333, ..., 35.33333333, 29.        , 32.66666667],
       [ 2.33333333, 57.33333333, 31.66666667, ..., 33.66666667, 30.33333333, 28.66666667],
       [25.33333333,  7.66666667, 80.33333333, ..., 36.33333333, 32.66666667, 30.33333333],
       ...,
       [ 6.66666667,  7.        ,  3.        , ...,  0.        ,  0.33333333,  0.        ],
       [ 3.33333333,  2.66666667,  4.33333333, ...,  0.33333333,  1.        ,  0.        ],
       [ 3.66666667,  1.66666667,  0.33333333, ...,  0.        ,  1.        ,  0.        ]])
'''
An edge is a set of points in the image where the brightness or color changes sharply. There are different techniques for detecting edges, the most common being the Canny edge detector. Here’s an overview of how it works: the image is smoothed with a Gaussian filter, intensity gradients are computed, non-maximum suppression thins the candidate edges, and hysteresis thresholding keeps only the strong, connected edges.
import skimage
from skimage import data  # data has standard test images
import cv2
from matplotlib import pyplot as plt
%matplotlib inline

camera = data.camera()
# the image should be grayscale for Canny; data.camera() already is
# (otherwise convert it first, e.g. with skimage.color.rgb2gray)
edges = cv2.Canny(camera, 100, 200)  # thresholds for the hysteresis procedure

plt.subplot(121), plt.imshow(camera, cmap='gray')
plt.title('Original Image')
plt.subplot(122), plt.imshow(edges, cmap='gray')
plt.title('Edge Image')
plt.show()
The ways features are maintained and served can differ significantly across projects and teams. This introduces infrastructure complexity and often results in duplicated work, especially in distributed organizations.
To address these issues, a feature store acts as a central vault for storing documented, curated, and access-controlled features within an organization.
Feature Store
Name: average_user_order_value
Description: The average order value for a user.
Metadata: Why the feature was added to the model, how it contributes to generalization, the name of the engineer in the organization responsible for maintaining the feature’s data source, the input type, and the output type.
Definition: Versioned code, executed in a runtime environment and applied to the input, that computes the feature value.
Essentially, a feature store allows data engineers to insert features. In turn, data analysts and machine-learning engineers use an API to get feature values they deem relevant.
Additionally, feature values in a feature store should be versioned to ensure that the data analyst is able to rebuild the model with the same feature values as those used to train the previous model version. After the feature value for a given input is updated, the previous value is not erased; instead, it’s saved with a timestamp indicating when it was generated.
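To make this workflow concrete, here’s a minimal, purely illustrative in-memory sketch; the FeatureStore class, its methods, and the example value are hypothetical rather than the API of any particular feature store product:

from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional, Tuple

@dataclass
class FeatureStore:
    """Hypothetical in-memory store: feature name -> timestamped value history."""
    _history: Dict[str, List[Tuple[datetime, Any]]] = field(default_factory=dict)

    def insert(self, name: str, value: Any, computed_at: Optional[datetime] = None) -> None:
        # data engineers push new values; older values are kept, each with a timestamp
        ts = computed_at or datetime.now(timezone.utc)
        self._history.setdefault(name, []).append((ts, value))

    def get(self, name: str, as_of: Optional[datetime] = None) -> Any:
        # analysts / ML engineers read the latest value, or the value as of a past point in time
        history = self._history.get(name, [])
        if as_of is not None:
            history = [(ts, v) for ts, v in history if ts <= as_of]
        return history[-1][1] if history else None

store = FeatureStore()
store.insert("average_user_order_value", 42.17)  # illustrative value
print(store.get("average_user_order_value"))  # 42.17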
Throughout the past few years, analysts and engineers have invented, experimented with, and validated various best practices for feature engineering. In this article, we’ve looked at normalization, standardization, and categorical encoding. Other practices include generating simple features, reusing legacy systems, using IDs as features when needed, reducing cardinality when possible, using counts with caution, performing feature selection when necessary, testing the code carefully, keeping the code, model, and data in sync, isolating feature-extraction code, serializing the model and feature extractor together, and logging feature values.
Feature engineering is a creative process, and as a machine-learning engineer, you’re in the best position to determine which features are good for your recommendation model.
In the next post in this series, we’ll focus on building collaborative filtering recommendation models, which will be a walk in the park now that we’ve got feature engineering out of the way. Stay tuned! And if you have any questions, ask me on Twitter.
Ciprian Borodescu
AI Product Manager | On a mission to help people succeed through the use of AI