In the first part of this series, we talked about the key components of a high-performance recommender system: (1) Data Sources, (2) Feature Store, (3) Machine Learning Models, (4) Predictions & Actions, (6) Results, (7) Evaluation and (8) AI Ethics.
In this article we’re diving deeper into the common data sources required for collaborative filtering recommender systems. At the very least, the inputs for a recommendation engine include users, items, and ratings:
USERS/ITEMS | U1 | U2 | U3 | U4 | U5 |
I1 | 1 | ❓ | 3 | 4 | ❓ |
I2 | 3 | ❓ | ❓ | 2 | 3 |
I3 | 2 | 5 | 3 | ❓ | ❓ |
I4 | ❓ | 4 | 1 | ❓ | ❓ |
I5 | 5 | ❓ | 2 | ❓ | 5 |
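As a minimal sketch (not code from the original series), the matrix above can be stored sparsely in Python by keeping only the known ratings. The names `ratings`, `density`, and `missing` are illustrative assumptions.

```python
# Known ratings from the matrix above, stored sparsely as
# {item_id: {user_id: rating}}; the unknown cells are simply absent.
ratings = {
    "I1": {"U1": 1, "U3": 3, "U4": 4},
    "I2": {"U1": 3, "U4": 2, "U5": 3},
    "I3": {"U1": 2, "U2": 5, "U3": 3},
    "I4": {"U2": 4, "U3": 1},
    "I5": {"U1": 5, "U3": 2, "U5": 5},
}

users = sorted({user for row in ratings.values() for user in row})
items = sorted(ratings)

# Share of the matrix that is actually filled in.
known = sum(len(row) for row in ratings.values())
density = known / (len(users) * len(items))

# Cells the recommender would have to predict.
missing = [(item, user) for item in items for user in users
           if user not in ratings[item]]

print(f"{known} known ratings, density {density:.2f}, {len(missing)} to predict")
```

The unknown cells are exactly what a collaborative filtering model tries to estimate from the known ones.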
There are three types of recommender system datasets that you must prepare for a collaborative filtering system:
1. Items Dataset
Export data from your in-house ecommerce platform, or from your CMS platforms like Shopify, Magento, or WooCommerce. Include the product catalog (items) along with information such as price, SKU type, or availability.
2. Users Dataset
Metadata about your users might include information such as age, gender, or loyalty membership, which can be important signals for recommender systems.
3. Interactions Dataset
Google Analytics (or any third-party analytics platform) is often considered a good source of user interaction info, such as location or device (mobile, tablet, desktop), page views, time on site, and conversions.
Before you can send the items dataset to a recommendation engine, you need to extract data from one or several sources and format it in a way that the recommendation engine recognizes.
Let’s consider the straightforward scenario where you’re already using Algolia Search for your ecommerce and you’re sending your product catalog using one of the Algolia API clients. Here’s what the JSON might look like:
[
{
"item_id": "0000031852",
"title": "CeraVe Moisturizing Cream",
"description": "Developed with dermatologists, CeraVe Moisturizing Cream has a unique formula that provides 24-hour hydration and helps restore the protective skin barrier with three essential ceramides (1,3,6-II). This rich, non-greasy, fast-absorbing formula is ideal for sensitive skin on both the face and body.",
"price": 16.08,
"image": "../images/0000031852.jpg",
"categories": [
"Beauty & Personal Care",
"Body Creams"
],
"availability": "in stock"
},
{
"item_id": "0000042941",
"title": "REVLON One-Step Hair Dryer And Volumizer Hot Air Brush, Black, Packaging May Vary",
"description": "The Revlon One-Step Hair Dryer and Volumizer is a Hot Air Brush to deliver gorgeous volume and brilliant shine in a single step. The unique oval brush design smooth hair while the rounded edges quickly create volume at the root and beautifully full-bodied bends at the ends in a single pass, for salon blowouts at home.",
"price": 41.88,
"image": "../images/0000042941.jpg",
"categories": [
"Beauty & Personal Care",
"Hot-Air Hair Brushes"
],
"availability": "in stock"
},
{
"item_id": "0000053422",
"title": "Maybelline Lash Sensational Washable Mascara",
"description": "Get a sensational full-fan effect with Maybelline New York’s fan favorite Lash Sensational Washable Mascara! Lashes grow in more than one layer. This volumizing mascara can unfold layer upon layer of lashes thanks to its exclusive brush with ten layers of bristle.",
"price": 6.49,
"image": "../images/0000053422.jpg",
"categories": [
"Beauty & Personal Care",
"Makeup",
"Mascara"
],
"availability": "out of stock"
}
]
Obviously, you don’t need to include everything — you should be selective about what goes in the dataset, gathering solely information that’s useful for building your recommender system. For example, if the recommendation engine is not built to process images, you should simply disregard the “image” metadata.
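One way to apply this selection is sketched below, assuming the catalog is available as a JSON string shaped like the example above. The function name `catalog_to_csv`, the `KEEP_FIELDS` list, and the " > " category separator are assumptions made for illustration, and the sample description is shortened.

```python
import csv
import io
import json

# Fields to keep; "image" is dropped on the assumption that the engine
# does not process images. Adjust this tuple to your own model.
KEEP_FIELDS = ("item_id", "title", "description", "price", "categories", "availability")


def catalog_to_csv(catalog_json: str, fields=KEEP_FIELDS) -> str:
    """Convert a JSON product catalog into CSV text containing only `fields`."""
    records = json.loads(catalog_json)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields)
    writer.writeheader()
    for record in records:
        row = {field: record.get(field, "") for field in fields}
        # Flatten the category list into a single delimited cell.
        if isinstance(row.get("categories"), list):
            row["categories"] = " > ".join(row["categories"])
        writer.writerow(row)
    return buffer.getvalue()


sample = json.dumps([{
    "item_id": "0000031852",
    "title": "CeraVe Moisturizing Cream",
    "description": "Rich, non-greasy moisturizing cream.",  # shortened sample text
    "price": 16.08,
    "image": "../images/0000031852.jpg",
    "categories": ["Beauty & Personal Care", "Body Creams"],
    "availability": "in stock",
}])
items_csv = catalog_to_csv(sample)
print(items_csv)
```

Keeping `item_id` as a string preserves the leading zeros, which matters when the same ID is later joined against the interactions dataset.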
Another option is to export the product catalog in CSV format from your Shopify, Magento, or WooCommerce store. Below are the steps required for each of them:
Exporting products from Shopify
Exporting products from Magento
For Magento 2, data export is an asynchronous operation, which executes in the background so that you can continue working in the Admin without waiting for the operation to finish. Here are the steps:
Exporting products from WooCommerce
WooCommerce has a built-in product CSV importer and exporter that you can use by following the next steps:
Last but not least, you might already be using Google Merchant Center in which case you’re already in possession of a product feed that you can reuse for your recommendation system.
While the items dataset was easier to produce or export, when it comes to working with users datasets, things get a bit more complex, mostly due to privacy constraints rather than technical ones. Clearly, from a machine learning point of view, the more metadata you have on a user the better. However, this might not always be possible and you must be transparent in how you handle user data (e.g., information collected from or about a user, including device information).
In 2006, Netflix released insufficiently anonymized information about nearly half a million customers as part of its $1 million contest to improve its recommendation system, and some researchers were able to re-identify users. This clearly violated users’ privacy, and Netflix ended up paying $9 million to settle the resulting lawsuit.
In the EU, a lot of personal data is considered sensitive and is subject to specific processing conditions. When discussing Personally Identifiable Information (PII), we need to distinguish between authenticated users and unauthenticated (anonymous) users.
There’s one nuance that I’d like to emphasize here — according to EU regulations, anonymization is the process of creating anonymous information, which is “information that does not relate to an identified or identifiable natural person” or “to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable”. While we’re considering unauthenticated users as being anonymous, it is worth noting that ensuring 100% anonymity is very hard to achieve.
Unauthenticated/Anonymous | Authenticated |
cookie_id | user_id |
session_id | first and last name |
device | home address |
country | email address |
city | telephone number |
 | passport number |
 | driver’s license number |
 | social security number |
 | photo of a face |
 | credit card number |
 | account username |
 | fingerprints |
An important design decision needs to be made at this stage: should you build your recommender system for authenticated or for unauthenticated users? If you go for authenticated users, you have more metadata at your disposal and, in theory, your recommendation system could be more accurate, but the downside is that the volume of users that will see recommendations will be very, very small.
On the other hand, building a recommendation system with anonymous user metadata only translates into a much wider audience for your recommendations even if the accuracy might not be that high.
This is the kind of compromise that needs to be made in the early stages of implementing a recommendation model. Granted, at the later stages, a hybrid approach might be taken into consideration if the return on investment makes it worthwhile.
Exporting unauthenticated users from Google Analytics
The User Explorer report lets you isolate and examine individuals rather than aggregate user behavior. Individual user behavior is associated with either Client ID (anonymous) or User ID (authenticated).
For each client or user ID, you see the following data:
User Data (visible when clicking on a Client ID) | Interactions Data |
Additionally, when you segment the User Explorer report based on any combination of demographic, technology, and behavioral information, you have a list of all IDs associated with that segment that you can export. Note that Google Analytics 4 (GA4), which launched in October 2020 after a 2019 beta under the name App + Web, introduces some important nuances: there, User Explorer is one of the techniques available in the Analysis section.
However, the dashboard’s export functionality is limited to a few columns/dimensions, which makes it of little use for our Users Dataset.
The solution is to use the Google Analytics Reporting API v4. But there’s a caveat: by default it doesn’t export any client IDs from the User Explorer report, and you have to make these available by creating custom dimensions. This allows the analytics API to export data at the Client ID or Session level, instead of returning only aggregated data.
From the Google Analytics admin panel, go to the Admin section. In the Property section, go to Custom Definitions > Custom Dimensions. Add the following dimensions:
After adding all dimensions, your table will look similar to the one below.
Next, you need to set up client ID and session ID tracking. If the Google Analytics tracker is included directly in the source of your website, you can add the custom dimensions values by modifying the existing code.
Copy the following script and replace the tracking ID (something like UA-XXXXXXXX-Y) with the one corresponding to your website. Also, replace ‘dimension1’, ‘dimension2’, and (optionally) ‘dimension3’ with the actual dimension indexes created in Step 1. You can read more about adding the Client ID to a custom dimension with gtag.js.
<!-- Global site tag (gtag.js) - Google Analytics -->
<script async src="https://www.googletagmanager.com/gtag/js?id=UA-XXXXXXXX-Y"></script>
<script>
  window.dataLayer = window.dataLayer || [];
  function gtag() { dataLayer.push(arguments); }
  gtag('js', new Date());

  // Maps 'dimension1' to client id, 'dimension2' to session id and 'dimension3' to user id.
  gtag("config", "UA-XXXXXXXX-Y", {
    custom_map: {
      dimension1: "clientId",
      dimension2: "sessionId",
      dimension3: "userId" // Remove this line if you're not using the UserID custom dimension
    }
  });

  // Sends an event that sets the dimensions
  gtag("event", "custom_set_dimensions", {
    sessionId: new Date().getTime() + "." + Math.random().toString(36).substring(5),
    // Remove the following line if you're not using the UserID custom dimension
    userId: '## Replace with User ID from your CMS ##'
  });
</script>
The last part includes creating API credentials that can be used to authenticate the Google Analytics Reporting API and access the data. See Google’s docs on how to access the Analytics Reporting API.
You can use the following Python script to create an anonymous Users Dataset for your collaborative filtering recommender system. The script creates a dataframe with the Google Analytics data and saves it as a CSV file.
# sample_read_ga_data.py
from googleapiclient.discovery import build
from oauth2client.service_account import ServiceAccountCredentials
import pandas as pd

SCOPES = ['https://www.googleapis.com/auth/analytics.readonly']
KEY_FILE_LOCATION = '<REPLACE_WITH_JSON_FILE>'
VIEW_ID = '<REPLACE_WITH_VIEW_ID>'
CUSTOM_DIMENSION_USER = '<REPLACE_WITH_CUSTOM_DIMENSION_INDEX, ex. ga:dimension1>'


def initialize_analyticsreporting():
    """Initializes an Analytics Reporting API V4 service object.

    Returns:
        An authorized Analytics Reporting API V4 service object.
    """
    credentials = ServiceAccountCredentials.from_json_keyfile_name(
        KEY_FILE_LOCATION, SCOPES)
    # Build the service object.
    analytics = build('analyticsreporting', 'v4', credentials=credentials)
    return analytics


def get_report(analytics):
    """Queries the Analytics Reporting API V4.

    The API returns a maximum of 100,000 rows per request. If you have
    more data, you have to implement pagination:
    - https://developers.google.com/analytics/devguides/reporting/core/v4/basics#pagination
    - https://developers.google.com/analytics/devguides/reporting/core/v4/rest/v4/reports/batchGet#ReportRequest.FIELDS.page_size

    Args:
        analytics: An authorized Analytics Reporting API V4 service object.
    Returns:
        The Analytics Reporting API V4 response.
    """
    return analytics.reports().batchGet(
        body={
            'reportRequests': [
                {
                    'viewId': VIEW_ID,
                    'dateRanges': [{'startDate': '7daysAgo', 'endDate': 'today'}],
                    'metrics': [{'expression': 'ga:sessions'}],
                    'dimensions': [
                        {'name': CUSTOM_DIMENSION_USER},
                        {'name': 'ga:country'},
                        {'name': 'ga:deviceCategory'}
                    ],
                    'pageSize': 10000
                }]
        }
    ).execute()


def parse_response(response):
    """Parses the Analytics Reporting API V4 response and returns it in a Pandas dataframe.

    Args:
        response: An Analytics Reporting API V4 response.
    Returns:
        Pandas dataframe with the data.
    """
    entries = []
    for report in response.get('reports', []):
        columnHeader = report.get('columnHeader', {})
        dimensionHeaders = columnHeader.get('dimensions', [])
        metricHeaders = columnHeader.get('metricHeader', {}).get('metricHeaderEntries', [])
        for row in report.get('data', {}).get('rows', []):
            dimensions = row.get('dimensions', [])
            dateRangeValues = row.get('metrics', [])
            # One output row per API row: dimensions first, then metrics.
            entry = dict(zip(dimensionHeaders, dimensions))
            for values in dateRangeValues:
                for metricHeader, value in zip(metricHeaders, values.get('values', [])):
                    entry[metricHeader.get('name')] = value
            entries.append(entry)
    # DataFrame.append was removed in pandas 2.0, so build the frame in one go.
    return pd.DataFrame(entries)


def main():
    analytics = initialize_analyticsreporting()
    response = get_report(analytics)
    df = parse_response(response)
    print(df.count())
    print(df.head(10))
    df.to_csv("data.csv")


if __name__ == '__main__':
    main()
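The script's docstring notes that larger exports need pagination. A minimal sketch of that loop follows, relying on the `pageToken` request field and the `nextPageToken` response field of the Reporting API v4. The helper `fetch_all_rows`, its `run_report` callable, and the `max_pages` guard are hypothetical names introduced here, and the fake responder exists only so the sketch runs without credentials.

```python
def fetch_all_rows(run_report, max_pages=1000):
    """Collect rows from every page of a single-report batchGet query.

    `run_report(page_token)` is a hypothetical callable that executes the
    request for the given page token (None for the first page) and
    returns the parsed API response.
    """
    rows, page_token = [], None
    for _ in range(max_pages):  # hard stop in case a token never clears
        response = run_report(page_token)
        reports = response.get('reports') or [{}]
        report = reports[0]
        rows.extend(report.get('data', {}).get('rows', []))
        page_token = report.get('nextPageToken')
        if not page_token:
            break
    return rows


# Offline stand-in for the API, returning two pages of one row each.
def fake_run_report(page_token):
    pages = {
        None: {'reports': [{'data': {'rows': [{'dimensions': ['a']}]},
                            'nextPageToken': '1'}]},
        '1': {'reports': [{'data': {'rows': [{'dimensions': ['b']}]}}]},
    }
    return pages[page_token]


all_rows = fetch_all_rows(fake_run_report)
print(len(all_rows))
```

With the script above, `run_report` could wrap `analytics.reports().batchGet(...).execute()` and add `'pageToken': page_token` to the report request whenever a token is present.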
Exporting customers from Shopify
You can export a CSV file of all your customers and their details by following the next quick steps:
As you can see from this template, there are a lot of columns that you don’t actually need – for example, it doesn’t make sense to include in your Users Dataset the full name or e-mail address as inputs for the recommender system.
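A small sketch of that cleanup is shown below. The column names in `PII_COLUMNS`, the `Customer ID` header, and the sample row are made up for illustration and are not the exact headers of any platform's export. Replacing the ID with a salted hash is pseudonymization rather than anonymization, so the result should still be handled as personal data.

```python
import csv
import hashlib
import io

# Columns treated as direct identifiers; illustrative names only.
PII_COLUMNS = {"First Name", "Last Name", "Email", "Phone", "Address1"}


def build_users_dataset(customers_csv: str, id_column: str, salt: str) -> list:
    """Drop PII columns and replace the customer ID with a salted hash."""
    users = []
    for row in csv.DictReader(io.StringIO(customers_csv)):
        raw_id = row.pop(id_column)
        digest = hashlib.sha256((salt + raw_id).encode("utf-8")).hexdigest()
        user = {"UserId": digest[:16]}  # shortened digest used as the new key
        user.update({k: v for k, v in row.items() if k not in PII_COLUMNS})
        users.append(user)
    return users


# Made-up sample row, not real customer data.
sample = (
    "Customer ID,First Name,Last Name,Email,City,Country\n"
    "1001,Ada,Lovelace,ada@example.com,London,GB\n"
)
users = build_users_dataset(sample, id_column="Customer ID", salt="change-me")
print(users)
```

The same salt and hashing step would have to be applied to the user IDs in the interactions dataset so the two files can still be joined.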
Exporting customers from Magento
Similarly to exporting products from Magento 2, you can set Entity Type to “Customers Main File” and you’ll be provided with a list of customers that you can further clean up and use in your recommender system.
Exporting customers from WooCommerce
To manually export customers from WooCommerce:
3. Click Export.
The final piece of the dataset puzzle is generating the interactions dataset. Since close to 30M live websites use Google Analytics according to Builtwith.com, it makes sense to cover the best way to export user behavior for our recommendation system using the Google Analytics Reporting API v4.
We’ve already covered most of the process in the section on users datasets and you can pretty much use the same Python script to export the key events that will go into the recommender system. You can take a look at the Dimension & Metrics Explorer if you want to add additional columns.
# sample_read_ga_interactions.py
from googleapiclient.discovery import build
from oauth2client.service_account import ServiceAccountCredentials
import pandas as pd

SCOPES = ['https://www.googleapis.com/auth/analytics.readonly']
KEY_FILE_LOCATION = '<REPLACE_WITH_JSON_FILE>'
VIEW_ID = '<REPLACE_WITH_VIEW_ID>'
CUSTOM_DIMENSION_USER = '<REPLACE_WITH_CUSTOM_DIMENSION_INDEX, ex. ga:dimension1>'
CUSTOM_DIMENSION_SESSION = '<REPLACE_WITH_CUSTOM_DIMENSION_INDEX, ex. ga:dimension2>'


def initialize_analyticsreporting():
    """Initializes an Analytics Reporting API V4 service object.

    Returns:
        An authorized Analytics Reporting API V4 service object.
    """
    credentials = ServiceAccountCredentials.from_json_keyfile_name(
        KEY_FILE_LOCATION, SCOPES)
    # Build the service object.
    analytics = build('analyticsreporting', 'v4', credentials=credentials)
    return analytics


def get_report(analytics):
    """Queries the Analytics Reporting API V4.

    The API returns a maximum of 100,000 rows per request. If you have
    more data, you have to implement pagination:
    - https://developers.google.com/analytics/devguides/reporting/core/v4/basics#pagination
    - https://developers.google.com/analytics/devguides/reporting/core/v4/rest/v4/reports/batchGet#ReportRequest.FIELDS.page_size

    Args:
        analytics: An authorized Analytics Reporting API V4 service object.
    Returns:
        The Analytics Reporting API V4 response.
    """
    return analytics.reports().batchGet(
        body={
            'reportRequests': [
                {
                    'viewId': VIEW_ID,
                    'dateRanges': [{'startDate': '7daysAgo', 'endDate': 'today'}],
                    'metrics': [
                        {'expression': 'ga:productListViews'},
                        {'expression': 'ga:productDetailViews'},
                        {'expression': 'ga:productAddsToCart'},
                        {'expression': 'ga:productCheckouts'},
                        {'expression': 'ga:itemQuantity'},
                        {'expression': 'ga:itemRevenue'}
                    ],
                    'dimensions': [
                        {'name': CUSTOM_DIMENSION_USER},
                        {'name': CUSTOM_DIMENSION_SESSION},
                        {'name': 'ga:productSku'}
                    ],
                    'pageSize': 10000
                }]
        }
    ).execute()


def parse_response(response):
    """Parses the Analytics Reporting API V4 response and returns it in a Pandas dataframe.

    Args:
        response: An Analytics Reporting API V4 response.
    Returns:
        Pandas dataframe with the data.
    """
    entries = []
    for report in response.get('reports', []):
        columnHeader = report.get('columnHeader', {})
        dimensionHeaders = columnHeader.get('dimensions', [])
        metricHeaders = columnHeader.get('metricHeader', {}).get('metricHeaderEntries', [])
        for row in report.get('data', {}).get('rows', []):
            dimensions = row.get('dimensions', [])
            dateRangeValues = row.get('metrics', [])
            # One output row per API row: dimensions first, then metrics.
            entry = dict(zip(dimensionHeaders, dimensions))
            for values in dateRangeValues:
                for metricHeader, value in zip(metricHeaders, values.get('values', [])):
                    entry[metricHeader.get('name')] = value
            entries.append(entry)
    # DataFrame.append was removed in pandas 2.0, so build the frame in one go.
    return pd.DataFrame(entries)


def main():
    analytics = initialize_analyticsreporting()
    response = get_report(analytics)
    df = parse_response(response)
    print(df.count())
    print(df.head(10))
    df.to_csv("interactions.csv")


if __name__ == '__main__':
    main()
Event | Trigger | Parameters |
add_payment_info | when a user submits their payment information | coupon, currency, items, payment_type, value |
add_shipping_info | when a user submits their shipping information | coupon, currency, items, shipping_tier, value |
add_to_cart | when a user adds items to cart | currency, items, value |
add_to_wishlist | when a user adds items to a wishlist | currency, items, value |
begin_checkout | when a user begins checkout | coupon, currency, items, value |
generate_lead | when a user submits a form or request for information | value, currency |
purchase | when a user completes a purchase | affiliation, coupon, currency, items, transaction_id, shipping, tax, value (required parameter) |
refund | when a refund is issued | affiliation, coupon, currency, items, transaction_id, shipping, tax, value |
remove_from_cart | when a user removes items from a cart | currency, items, value |
select_item | when an item is selected from a list | items, item_list_name, item_list_id |
select_promotion | when a user selects a promotion | items, promotion_id, promotion_name, creative_name, creative_slot, location_id |
view_cart | when a user views their cart | currency, items, value |
view_item | when a user views an item | currency, items, value |
view_item_list | when a user sees a list of items/offerings | items, item_list_name, item_list_id |
view_promotion | when a promotion is shown to a user | items, promotion_id, promotion_name, creative_name, creative_slot, location_id |
Retail and ecommerce apps should log the events listed in the table above. Logging events along with their prescribed parameters ensures maximum available detail in reports and improves the overall performance of the collaborative filtering recommender system.
While having more user interactions increases the accuracy of the recommendation system, you might want to consider experimenting with fewer, more basic ones, like product detail page views (for unauthenticated users) and/or orders (for authenticated users). The intuition here is that the recommendation system will be able to recommend relevant items based solely on implicit ratings such as product detail page views and purchase history.
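To make the idea of implicit ratings concrete, here is a rough sketch that turns a list of events into one score per user-item pair. The event names come from the table above, but the weights in `EVENT_WEIGHTS` and the "keep the strongest signal" rule are assumptions chosen for illustration, not values prescribed by this series.

```python
from collections import defaultdict

# Illustrative weights (an assumption): stronger purchase intent
# gets a higher implicit rating.
EVENT_WEIGHTS = {"view_item": 1.0, "add_to_cart": 3.0, "purchase": 5.0}


def implicit_ratings(events, weights=EVENT_WEIGHTS):
    """Keep, per (user, item) pair, the strongest signal observed."""
    ratings = defaultdict(float)
    for user_id, item_id, event_type in events:
        weight = weights.get(event_type)
        if weight is None:
            continue  # ignore events that are not used as signals
        key = (user_id, item_id)
        ratings[key] = max(ratings[key], weight)
    return dict(ratings)


events = [
    ("u1", "0000031852", "view_item"),
    ("u1", "0000031852", "add_to_cart"),
    ("u2", "0000042941", "view_item"),
    ("u2", "0000042941", "select_promotion"),
    ("u2", "0000053422", "purchase"),
]
scores = implicit_ratings(events)
print(scores)
```

Starting with only `view_item` and `purchase` in the weights table would match the simpler setup described above; other schemes, such as summing weights or decaying them over time, are equally plausible.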
Exporting orders from Shopify
You can export orders along with their transaction histories or you can export only the transaction histories of your orders:
Exporting orders from Magento
In your Magento 2 backend, go to Sales > Orders, click the Export drop-down, and select CSV.
Exporting orders from WooCommerce
You can follow the same steps to export orders from WooCommerce as you did for exporting customers; the only difference is selecting “orders” for Export type.
Remember, for the first iteration of your collaborative filtering recommender system, you might need to clean up the interactions dataset export until it consists only of: UserId, Date, ProductIds.
If you reached this point, you should have three CSV files, each with the following structure:
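That cleanup can be sketched as follows, assuming an orders export where each row carries a customer identifier, a date, and a delimited list of product IDs. The column names (`customer`, `created_at`, `skus`), the `;` separator, and the sample rows are hypothetical; real exports differ per platform.

```python
import csv
import io


def orders_to_interactions(orders_csv, user_col, date_col, items_col, sep=";"):
    """Yield one (UserId, Date, ProductId) tuple per purchased product."""
    for row in csv.DictReader(io.StringIO(orders_csv)):
        for product_id in row[items_col].split(sep):
            product_id = product_id.strip()
            if product_id:  # skip empty fragments such as a trailing separator
                yield row[user_col], row[date_col], product_id


# Made-up sample export with one extra column that gets dropped.
sample = (
    "customer,created_at,skus,total\n"
    "u1,2021-03-01,0000031852;0000053422,22.57\n"
    "u2,2021-03-02,0000042941,41.88\n"
)
interactions = list(orders_to_interactions(sample, "customer", "created_at", "skus"))
print(interactions)
```

Emitting one row per product keeps the file in the long format that is easiest to turn into a user-item matrix later.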
Items Dataset
ItemId | Title | Description | Price | Other Item Metadata (…) |
Users Dataset
UserId | Country | City | Other User Metadata (…) |
Interactions Dataset
ItemId | UserId | Timestamp | EventType | Other Interaction Metadata (…) |
Note that you can ignore the users dataset altogether if you do not have (or do not want to include) any user metadata.
Interactions are the basis of building the User Rating Matrix (URM). Just by using user_id – item_id pairs, you can create the first version of the URM / baseline model. Other features can be added one by one, while measuring the performance of the model at each step.
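A baseline URM built only from user_id and item_id pairs might look like the sketch below. The function `build_urm` and the toy pairs are illustrative assumptions, and the dense list-of-lists is used purely for readability.

```python
def build_urm(pairs):
    """Build a binary user-item matrix from (user_id, item_id) pairs."""
    users = sorted({user for user, _ in pairs})
    items = sorted({item for _, item in pairs})
    user_index = {user: row for row, user in enumerate(users)}
    item_index = {item: col for col, item in enumerate(items)}

    # Dense list-of-lists for readability; real catalogs usually need a
    # sparse representation instead.
    urm = [[0] * len(items) for _ in users]
    for user, item in pairs:
        urm[user_index[user]][item_index[item]] = 1  # repeated pairs stay 1
    return urm, users, items


pairs = [("u1", "i1"), ("u1", "i3"), ("u2", "i2"), ("u3", "i1"), ("u1", "i1")]
urm, users, items = build_urm(pairs)
for user, row in zip(users, urm):
    print(user, row)
```

Each additional feature can then be introduced as a separate change on top of this baseline, so its effect on model performance can be measured in isolation.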
From all of the available datasets, properties that are numeric or can be one-hot encoded can be easily added into the models. Others will require extra processing, for example applying NLP to extract embeddings from text data, or computer vision to process product images.
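For readers unfamiliar with one-hot encoding, here is a small dependency-free sketch. The `one_hot` helper and the device values are illustrative; in practice a library encoder would typically be used instead.

```python
def one_hot(values):
    """One-hot encode a list of categorical values.

    Returns the sorted category list and one 0/1 vector per input value.
    """
    categories = sorted(set(values))
    position = {category: i for i, category in enumerate(categories)}
    vectors = []
    for value in values:
        vector = [0] * len(categories)
        vector[position[value]] = 1  # exactly one slot is set per value
        vectors.append(vector)
    return categories, vectors


devices = ["mobile", "desktop", "tablet", "mobile"]
categories, encoded = one_hot(devices)
print(categories)
print(encoded)
```

Low-cardinality properties such as device category or availability suit this encoding well, whereas free text like product descriptions falls into the group that needs the extra processing mentioned above.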
In the next blog post in this series we’ll focus on preparing our datasets and exploring feature engineering for our collaborative filtering recommender system, which represents a critical step in building a highly performant machine learning model.
Stay tuned!
And, if you have any questions: https://twitter.com/cborodescu
Ciprian Borodescu
AI Product Manager | On a mission to help people succeed through the use of AI