Engineering features that go into a recommender system

In the first part of this series, we talked about the key components of a high-performance recommender system: (1) Data Sources, (2) Feature Store, (3) Machine Learning Models, (4 & 5) Predictions & Actions, (6) Results, (7) Evaluation and (8) AI Ethics.

In this third post, we’ll do a deep dive into the vast topic of feature engineering for recommender systems. While the path from raw data to recommendations goes through various tools and systems, the process involves two mathematical entities that are the bread and butter of any recommendation system: features and models.

[Image: from raw data to recommendations]

A feature is a numeric representation of raw data. Feature engineering is the process of composing the most appropriate features given the data, model, and task. In a basic collaborative filtering scenario we do not actually have features because ratings are, in fact, labels.

Content-based systems work with a wide variety of item descriptions and knowledge about users. Feature engineering involves converting these different types of unstructured data into standardized descriptions. Although it’s possible to use any kind of representation, such as a multidimensional data representation, the most common approach is to extract keywords from the underlying data. 

Items have multiple fields in which various attributes are listed. For example, books have a description, title, and author. In some cases, these descriptions and attributes can be converted into keywords. 

ItemID: 0000031852
Title: 2034: A Novel of the Next World War
Authors: Elliot Ackerman, Admiral James Stavridis USN
Description: From two former military officers and award-winning authors, a chillingly authentic geopolitical thriller that imagines a naval clash between the US and China in the South China Sea in 2034–and the path from there to a nightmarish global conflagration.
Genre: Thrillers & Suspense
Price: $17.84

On the other hand, you can work directly with a multidimensional (structured) representation when attributes contain numerical quantities (e.g., price) or fields that are drawn from a small universe of possibilities (e.g., genre).

In addition to describing items, you’ll want to collect user profile data. For example, MovieLens, a classic recommendation dataset, includes three user attributes: gender, age, and occupation. Because these are single-label attributes, they can be encoded in a preprocessing step using one-hot encoding.

UserID: unique user identifier

Gender: M/F

Age:
  •  1:  “Under 18”
  • 18:  “18-24”
  • 25:  “25-34”
  • 35:  “35-44”
  • 45:  “45-49”
  • 50:  “50-55”
  • 56:  “56+”

Occupation:
  •  0:  “other” or not specified
  •  1:  “academic/educator”
  •  2:  “artist”
  •  3:  “clerical/admin”
  •  4:  “college/grad student”
  •  5:  “customer service”
  •  6:  “doctor/health care”
  •  7:  “executive/managerial”
  •  8:  “farmer”
  •  9:  “homemaker”
  • 10:  “K-12 student”
  • 11:  “lawyer”
  • 12:  “programmer”
  • 13:  “retired”
  • 14:  “sales/marketing”
  • 15:  “scientist”
  • 16:  “self-employed”
  • 17:  “technician/engineer”
  • 18:  “tradesman/craftsman”
  • 19:  “unemployed”
  • 20:  “writer”
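
To make the one-hot encoding of these single-label attributes concrete, here is a minimal sketch using pandas; the column names and sample rows below are illustrative, not the actual MovieLens files.

import pandas as pd

# made-up user rows with MovieLens-style attributes
users = pd.DataFrame({
    'Gender': ['M', 'F', 'M'],
    'Age': ['18-24', '25-34', 'Under 18'],
    'Occupation': ['programmer', 'artist', 'K-12 student'],
})

# one binary column per category value
encoded = pd.get_dummies(users, columns=['Gender', 'Age', 'Occupation'])
print(sorted(encoded.columns))
# ['Age_18-24', 'Age_25-34', 'Age_Under 18', 'Gender_F', 'Gender_M', 'Occupation_K-12 student', 'Occupation_artist', 'Occupation_programmer']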

Finally, not all features are created equal. You can apply feature weighting, which gives differential weights depending on a feature’s importance, or feature selection, which includes or excludes attributes based on relevance.
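
As a quick illustration of feature selection, here is a hedged sketch using scikit-learn's SelectKBest; the feature matrix X and relevance labels y are made-up placeholders.

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
X = rng.random((100, 10))          # 100 samples, 10 candidate features
y = rng.integers(0, 2, 100)        # hypothetical binary relevance labels

selector = SelectKBest(score_func=f_classif, k=5)  # keep the 5 highest-scoring features
X_selected = selector.fit_transform(X, y)
print(selector.get_support())      # boolean mask of which features were kept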

Now let’s explore feature engineering methodologies for the most common item and user attributes in a recommendation engine.

Numerical features

Discretization

The price attribute in an item dataset is a continuous variable: it can take on any value within a given range rather than a fixed set of values. To transform this raw feature into a format that can be ingested by a machine-learning model, you’ll use quantization: essentially, mapping a continuous value to a discrete value. Conceptually, this can be thought of as an ordered sequence of statistical bins.

[Image: the discretization process]

There are three typical approaches to binning: 

  • Uniform: all bins have identical widths
  • Quantile based: all bins have the same number of values
  • K-means based: each bin belongs to the nearest one-dimensional k-means cluster

Uniform binning is the simplest approach. It divides a range of possible values into N bins of the same width using the formula:

    \[ \text{width} = \frac{\text{max value} - \text{min value}}{N} \]

where N is the number of bins or intervals. 

N is normally determined through experimentation — there’s no rule of thumb here.

For example, if the variable interval is [10, 110] and you want to create 5 bins, the width is (110 − 10) / 5 = 20, so the intervals will be [10, 30], [30, 50], [50, 70], [70, 90], and [90, 110].
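
As a quick sanity check of that arithmetic, you can compute the same uniform bin edges with numpy (a small sketch, separate from the discretizer snippet below):

import numpy as np

edges = np.linspace(10, 110, num=5 + 1)  # 5 uniform bins need 6 edges
print(edges)
# [ 10.  30.  50.  70.  90. 110.]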

The code for uniform, quantile, or k-means binning looks something like this, and each strategy produces a different histogram of the resulting bins:

from sklearn.preprocessing import KBinsDiscretizer

# `bins` is the desired number of bins and `data` is a 2-D array of feature values
# create the discretizer with the "uniform" strategy;
# replace "uniform" with "quantile" or "kmeans" to switch discretization strategies
discretizer = KBinsDiscretizer(n_bins=bins, encode='ordinal', strategy='uniform')
data_disc = discretizer.fit_transform(data)

[Images: histograms for uniform (bins = 10), quantile (bins = 10), and k-means (bins = 10) binning]

Normalization and standardization

The two most-discussed scaling methods are normalization (rescaling values into a range of [0,1]) and standardization (rescaling data to have a mean of 0 and a standard deviation of 1). Here are visual representations of the data after it has been normalized and standardized: 

[Image: the difference between normalization and standardization]

You can use normalization for features such as pageviews, clicks, and transaction amounts because the values are not normally (Gaussian) distributed — most of the time they’re long-tailed.

Here’s the formula for normalization:

    \[{X}' = \frac{X-X_{min} }{X_{max}-X_{min}}\]

where Xmax and Xmin are the maximum and minimum values of the feature, respectively.

# data normalization with sklearn
from sklearn.preprocessing import MinMaxScaler

# fit scaler on training data
norm = MinMaxScaler().fit(X_train)

# transform training data
X_train_norm = norm.transform(X_train)

# transform testing data
X_test_norm = norm.transform(X_test)

Standardization is useful for features like customer review scores, where the data tends to follow a Gaussian (normal) distribution:

    \[{X}' = \frac{X-\mu}{\sigma}\]

where μ is the mean of the feature values and 𝝈 is the standard deviation of the feature values. Note that in this case, the values are not restricted to a particular range.

from sklearn.preprocessing import StandardScaler

# fit on training data
scale = StandardScaler().fit(X_train)
    
# transform the training data 
X_train_stand = scale.transform(X_train)
    
# transform the testing data 
X_test_stand = scale.transform(X_test)

Categorical features

Often, features are represented as categorical instead of continuous values. As in the above example, users could have features such as gender ([male, female]), age ([Under 18, 18-24, 25-34, …]), and occupation ([other, academic/educator, artist, clerical/admin, …]). Such features can be efficiently coded as integers; for instance, [male, 18-24, clerical/admin] could be expressed as [0, 1, 3] while [female, 25-34, academic/educator] could be [1, 2, 1].

We have a few options for converting categorical features to integers:

  • Use the OrdinalEncoder. This estimator converts each categorical feature to one new feature of integers (0 to n_categories - 1).
from sklearn.preprocessing import OrdinalEncoder
 
user_data = [['male', '18-24', 'clerical/admin'], ['female', '25-34', 'academic/educator']]
encoder = OrdinalEncoder().fit(user_data)
encoder.transform([['female', '25-34', 'clerical/admin']])
 
# array([[0., 1., 1.]])
  • Use the OneHotEncoder. This scikit-learn estimator uses a one-of-K scheme, also known as one-hot or dummy encoding, to convert each category into a binary indicator feature.
from sklearn.preprocessing import OneHotEncoder
 
user_data = [['male', '18-24', 'clerical/admin'], ['female', '25-34', 'academic/educator']]
encoder = OneHotEncoder().fit(user_data)
encoder.transform([['female', '25-34', 'clerical/admin'],['male', '25-34', 'academic/educator']]).toarray()
 
# array([[1., 0., 0., 1., 0., 1.], [0., 1., 0., 1., 1., 0.]])

Text embedding

The methods above — discretization, normalization, standardization, and categorical encoding — compose features from numeric and categorical attributes. Text is different: turning it into features requires some understanding of language. Let’s take a look at how we can read text-based content.

Natural-language processing (NLP) is a subfield of AI that enables computers to understand and process human language. Two common techniques for turning text into features are applying a bag-of-words model to unprocessed text, and preprocessing the text so that it can later be fed to a neural network model.

Bag of words

The bag-of-words model is the most commonly used process, as it’s easy to implement and understand. The idea is to create an occurrence matrix for sentences and documents without taking into account grammar or word order.

from sklearn.feature_extraction.text import CountVectorizer
 
corpus = ["From two former military officers and award-winning authors, a chillingly authentic geopolitical thriller that imagines a naval clash between the US and China in the South China Sea in 2034–and the path from there to a nightmarish global conflagration."]
 
vectorizer = CountVectorizer(stop_words=None, ngram_range=(1, 1), min_df=1, max_df=1)
# stop_words - Please see the following guidelines before choosing a value for this param: https://scikit-learn.org/stable/modules/feature_extraction.html#stop-words
# ngram_range - Provides the range of n-values for the word n-grams to be extracted. (1, 1) means the vectorizer will only take into account unigrams, thus implementing the Bag-of-Words model
# min_df && max_df - The minimum and maximum document frequency required for a term to be included in the vocabulary
 
X = vectorizer.fit_transform(corpus)
 
print(vectorizer.get_feature_names_out().tolist())  # get_feature_names() was removed in recent scikit-learn versions
'''
['2034', 'and', 'authentic', 'authors', 'award', 'between', 'chillingly', 'china', 'clash', 'conflagration', 'former', 'from', 'geopolitical', 'global', 'imagines', 'in', 'military', 'naval', 'nightmarish', 'officers', 'path', 'sea', 'south', 'that', 'the', 'there', 'thriller', 'to', 'two', 'us', 'winning']
'''
 
print(X.toarray())
#[[1 3 1 1 1 1 1 2 1 1 1 2 1 1 1 2 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1]]

The resulting frequencies are used to train a classifier. Because the sentences are not preprocessed and word order is ignored, this approach has several drawbacks: the resulting vectors are sparse, it does a poor job of capturing the meaning of the text, and performance degrades when dealing with a large number of documents.

Preprocessing text

The standard order for preprocessing sentences is tokenization, removing unnecessary punctuation and stop words, stemming, and lemmatization. 

  • Tokenization

Tokenization consists of turning sentences into words. The word_tokenize function from the nltk package tokenizes a string, splitting off punctuation other than periods.

import nltk
nltk.download('punkt') # a pre-trained Punkt tokenizer for English
from nltk.tokenize import word_tokenize
 
text = "From two former military officers and award-winning authors, a chillingly authentic geopolitical thriller that imagines a naval clash between the US and China in the South China Sea in 2034–and the path from there to a nightmarish global conflagration."
 
tokens = word_tokenize(text)
print(tokens)
'''
['From', 'two', 'former', 'military', 'officers', 'and', 'award-winning', 'authors', ',', 'a', 'chillingly', 'authentic', 'geopolitical', 'thriller', 'that', 'imagines', 'a', 'naval', 'clash', 'between', 'the', 'US', 'and', 'China', 'in', 'the', 'South', 'China', 'Sea', 'in', '2034–and', 'the', 'path', 'from', 'there', 'to', 'a', 'nightmarish', 'global', 'conflagration', '.']
''' 
  • Removing stop words

Different packages have predefined stop words. The alternative is to define a custom list of stop words relevant to the corpus. 

nltk.download('stopwords') # 2,400 stopwords for 11 languages
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
tokens = [w for w in tokens if w not in stop_words]  # the check is case-sensitive, so capitalized words like "From" are kept

print(tokens)
'''
['From', 'two', 'former', 'military', 'officers', 'award-winning', 'authors', ',', 'chillingly', 'authentic', 'geopolitical', 'thriller', 'imagines', 'naval', 'clash', 'US', 'China', 'South', 'China', 'Sea', '2034–and', 'path', 'nightmarish', 'global', 'conflagration', '.']
'''
  • Stemming

The words in the corpus are reduced to their roots by removing suffixes and prefixes. The stemmer looks for a list of common suffixes and prefixes and removes them.

from nltk.stem.porter import PorterStemmer  # other stemmers are also available

porter = PorterStemmer()
stems = []
for t in tokens:    
    stems.append(porter.stem(t))

print(stems)
'''
['from', 'two', 'former', 'militari', 'offic', 'award-win', 'author', ',', 'chillingli', 'authent', 'geopolit', 'thriller', 'imagin', 'naval', 'clash', 'us', 'china', 'south', 'china', 'sea', '2034–and', 'path', 'nightmarish', 'global', 'conflagr', '.']
'''
  • Lemmatization

Lemmatization has the same expected output as stemming: reducing the word to either a common base or its root. However, the lemmatizer takes into account the morphological analysis of the word, using the same base for all its inflections. 

nltk.download('wordnet') # a dictionary is needed for a Lemmatizer
from nltk.stem import WordNetLemmatizer
  
lemmatizer = WordNetLemmatizer()
lemmas = []
for t in tokens:    
    lemmas.append(lemmatizer.lemmatize(t))
print(lemmas)

'''
['From', 'two', 'former', 'military', 'officer', 'award-winning', 'author', ',', 'chillingly', 'authentic', 'geopolitical', 'thriller', 'imago', 'naval', 'clash', 'US', 'China', 'South', 'China', 'Sea', '2034–and', 'path', 'nightmarish', 'global', 'conflagration', '.']
'''

Image encoding

Before jumping into this topic, you must understand the way computers “see” images. Each image is represented as either 1 or 3 matrices of pixels. Each matrix represents a channel. For black-and-white images, there is only one channel, while for colored images, there are three: Red, Green, and Blue. Each pixel is in turn represented by a number between 0 and 255, which denotes the intensity of the color.

[Image: a black-and-white image represented as a matrix of pixel values]
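
To see this representation directly, here is a short sketch using standard test images from scikit-image:

from skimage import data

gray = data.camera()      # grayscale test image: a single channel
color = data.astronaut()  # color test image: three channels (red, green, blue)

print(gray.shape, color.shape)                # (512, 512) (512, 512, 3)
print(color.dtype, color.min(), color.max())  # uint8 values, each in the 0-255 range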

Pixel values as features

The simplest way to retrieve the features from an image is to rearrange all the pixels and generate a feature vector. For a grayscale image, this is easily achieved using NumPy:

import skimage
from skimage import data, io # data has standard test images
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline

camera = data.camera() 
io.imshow(camera)
plt.show()

[Image: the grayscale 'camera' test image]

print(camera.shape) # (height, width)
# (512, 512)
features = np.reshape(camera, (512*512))
print(features.shape, features)
# ((262144,), array([156, 157, 160, ..., 121, 113, 111], dtype=uint8))

The same technique can be used for RGB images. However, a more suitable approach would be to create the feature vector by using the mean value of pixels from all the channels.

import skimage
from skimage import data, io # data has standard test images
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline

astronaut = data.astronaut() 
io.imshow(astronaut)
plt.show()


[Image: the color 'astronaut' test image]

print(astronaut.shape) # (height, width, no. of channels)
# (512, 512, 3)
feature_matrix = np.zeros((512, 512)) # matrix initialized with 0
for i in range(0, astronaut.shape[0]):
    for j in range(0, astronaut.shape[1]):
        # cast to int to avoid uint8 overflow when summing the three channels
        feature_matrix[i, j] = (int(astronaut[i, j, 0]) + int(astronaut[i, j, 1]) + int(astronaut[i, j, 2])) / 3

print(feature_matrix)
'''
array([[65.33333333, 26.66666667, 74.33333333, ..., 35.33333333,
        29.        , 32.66666667],
       [ 2.33333333, 57.33333333, 31.66666667, ..., 33.66666667,
        30.33333333, 28.66666667],
       [25.33333333,  7.66666667, 80.33333333, ..., 36.33333333,
        32.66666667, 30.33333333],
       ...,
       [ 6.66666667,  7.        ,  3.        , ...,  0.        ,
         0.33333333,  0.        ],
       [ 3.33333333,  2.66666667,  4.33333333, ...,  0.33333333,
         1.        ,  0.        ],
       [ 3.66666667,  1.66666667,  0.33333333, ...,  0.        ,
         1.        ,  0.        ]])
'''
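
As a side note, the same feature matrix can be computed with a single vectorized call, which is much faster than the explicit loops above:

from skimage import data

astronaut = data.astronaut()
# averaging over the channel (last) axis gives the same (512, 512) feature matrix
feature_matrix = astronaut.astype(float).mean(axis=2)
print(feature_matrix.shape)
# (512, 512)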

Edge detection

An edge is a set of points in the image where the brightness and color change sharply. There are different techniques for detecting edges, the most common being the Canny edge detector algorithm. Here’s an overview of how it works:

  1. Noise reduction using a Gaussian filter
  2. Gradient calculation
  3. Non-maximum suppression (the algorithm goes through all the points on the gradient intensity matrix and keeps only the pixels with the maximum values along the edge directions)
  4. Double threshold (splits the pixels into two categories: strong and weak)
  5. Edge tracking by hysteresis (converts the weak pixels into strong ones if and only if there’s another strong pixel as their neighbor)

Applying the Canny detector with OpenCV looks something like this:

import skimage
from skimage import data # data has standard test images
import cv2
from matplotlib import pyplot as plt
%matplotlib inline

camera = data.camera() 
# from skimage import color # the image should be grayscale for Canny
# camera = color.rgb2gray(camera) # however this one already is
edges = cv2.Canny(camera,100,200) # thresholds for the hysteresis procedure

plt.subplot(121), plt.imshow(camera, cmap = 'gray')
plt.title('Original Image')
plt.subplot(122), plt.imshow(edges, cmap = 'gray')
plt.title('Edge Image')
plt.show()

[Image: the original image next to its edge image]

Final word and feature stores

The ways features are maintained and served can differ significantly across projects and teams. This introduces infrastructure complexity and often results in duplication of work. Some of the challenges faced by distributed organizations include: 

  • Features are not reused
  • Feature definitions vary
  • Features take a long time to be computed
  • There is inconsistency between training and serving
  • Feature decay is unknown

To address these issues, a feature store acts as a central vault for storing documented, curated, and access-controlled features within an organization. 

An example entry in a feature store might include the following fields:

Name: average_user_order_value
Description: The average order value for a user.
Metadata: Why the feature was added to the model, how it contributes to generalization, the name of the engineer responsible for maintaining the feature’s data source, the input type, and the output type.
Definition: Versioned code, executed in a runtime environment and applied to the input, that computes the feature value.

Essentially, a feature store allows data engineers to insert features. In turn, data analysts and machine-learning engineers use an API to get feature values they deem relevant. 

Additionally, feature values in a feature store should be versioned to ensure that the data analyst is able to rebuild the model with the same feature values as those used to train the previous model version. After the feature value for a given input is updated, the previous value is not erased; instead, it’s saved with a timestamp indicating when it was generated. 
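
To make the idea concrete, here is a minimal, purely illustrative sketch of such a versioned feature record and lookup API in Python; the class and method names are hypothetical, not a real feature-store product.

from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Optional

@dataclass
class FeatureValue:
    value: float
    computed_at: datetime  # timestamp of when the value was generated

@dataclass
class FeatureStore:
    # feature name -> entity id -> full history of versioned values
    _store: Dict[str, Dict[str, List[FeatureValue]]] = field(default_factory=dict)

    def put(self, feature: str, entity_id: str, value: float) -> None:
        history = self._store.setdefault(feature, {}).setdefault(entity_id, [])
        history.append(FeatureValue(value, datetime.utcnow()))  # previous values are kept

    def get(self, feature: str, entity_id: str, as_of: Optional[datetime] = None) -> float:
        history = self._store[feature][entity_id]
        if as_of is None:
            return history[-1].value  # latest value
        # most recent value computed at or before `as_of` (for reproducible training)
        return max((fv for fv in history if fv.computed_at <= as_of),
                   key=lambda fv: fv.computed_at).value

store = FeatureStore()
store.put("average_user_order_value", "user_42", 37.5)
store.put("average_user_order_value", "user_42", 41.2)   # new version, old one kept
print(store.get("average_user_order_value", "user_42"))  # 41.2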

Throughout the past few years, analysts and engineers have invented, experimented with, and validated various best practices that apply to feature engineering. In this article, we’ve looked at normalization, standardization, and categorical features. Other practices include generating simple features, reusing legacy systems, using IDs as features when needed, reducing cardinality when possible, using counts with caution, performing feature selection when necessary, carefully testing the code, keeping the code, model, and data in sync, isolating feature-extraction code, serializing the model and feature extractor together, and logging the values of features.

Feature engineering is a creative process, and as a machine-learning engineer, you’re in the best position to determine which features are good for your recommendation model. 

In the next post in this series, we’ll focus on building collaborative filtering recommendation models, which will be a walk in the park now that we’ve got feature engineering out of the way. Stay tuned! And if you have any questions, ask me on Twitter.

About the author
Ciprian Borodescu

AI Product Manager | On a mission to help people succeed through the use of AI
