Handling common data sources for your recommender system

In the first part of this series, we talked about the key components of a high-performance recommender system: (1) Data Sources, (2) Feature Store, (3) Machine Learning Models, (4) Predictions & Actions, (6) Results, (7) Evaluation and (8) AI Ethics.

In this article we’re diving deeper into the common data sources required for your collaborative filtering type of recommender systems. At the very least, the inputs for a recommendation engine include users, items, and ratings:

USERS/ITEMS	U1	U2	U3	U4	U5
I1	1	❓	3	4	❓
I2	3	❓	❓	2	3
I3	2	5	3	❓	❓
I4	❓	4	1	❓	❓
I5	5	❓	2	❓	5
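
As a minimal sketch, here is one way the matrix above could be held in code, assuming pandas is available and the known ratings arrive as (item, user, rating) triples. The column names `item_id`, `user_id`, and `rating` are illustrative choices, not a required schema. Unknown ratings stay as NaN rather than zero, because "not rated" is different from a low rating.

```python
import pandas as pd

# Known ratings from the table above, as (item, user, rating) triples.
ratings = [
    ("I1", "U1", 1), ("I1", "U3", 3), ("I1", "U4", 4),
    ("I2", "U1", 3), ("I2", "U4", 2), ("I2", "U5", 3),
    ("I3", "U1", 2), ("I3", "U2", 5), ("I3", "U3", 3),
    ("I4", "U2", 4), ("I4", "U3", 1),
    ("I5", "U1", 5), ("I5", "U3", 2), ("I5", "U5", 5),
]

df = pd.DataFrame(ratings, columns=["item_id", "user_id", "rating"])

# Items as rows, users as columns; pairs without a rating become NaN.
matrix = df.pivot(index="item_id", columns="user_id", values="rating")

# Share of cells with no rating, a quick measure of how sparse the data is.
sparsity = float(matrix.isna().sum().sum()) / matrix.size
```

Even in this toy example a large share of the cells is empty, which is the usual situation for real catalogs.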

There are three types of recommender system datasets that you must prepare for a collaborative filtering system:

1. Items Dataset
Export data from your in-house ecommerce platform, or from your CMS platforms like Shopify, Magento, or WooCommerce. Include the product catalog (items) along with information such as price, SKU type, or availability.

2. Users Dataset
Metadata about your users might include information such as age, gender, or loyalty membership, which can be important signals for recommender systems.

3. Interactions Dataset
Google Analytics (or any third-party analytics platform) is often considered a good source of user interaction info, such as location or device (mobile, tablet, desktop), page views, time on site, and conversions.

How to prepare the Items Dataset

Before you can send the items dataset to a recommendation engine, you need to extract data from one or several sources and format it in a way that the recommendation engine recognizes.

Let’s consider the straightforward scenario where you’re already using Algolia Search for your ecommerce and you’re sending your product catalog using one of the Algolia API clients. Here’s what the JSON might look like:

[
  {
    "item_id": "0000031852",
    "title": "CeraVe Moisturizing Cream",
    "description": "Developed with dermatologists, CeraVe Moisturizing Cream has a unique formula that provides 24-hour hydration and helps restore the protective skin barrier with three essential ceramides (1,3,6-II). This rich, non-greasy, fast-absorbing formula is ideal for sensitive skin on both the face and body.",
    "price": 16.08,
    "image": "../images/0000031852.jpg",
    "categories": [
      "Beauty & Personal Care",
      "Body Creams"
    ],
    "availability": "in stock"
  },
  {
    "item_id": "0000042941",
    "title": "REVLON One-Step Hair Dryer And Volumizer Hot Air Brush, Black, Packaging May Vary",
    "description": "The Revlon One-Step Hair Dryer and Volumizer is a Hot Air Brush to deliver gorgeous volume and brilliant shine in a single step. The unique oval brush design smooth hair while the rounded edges quickly create volume at the root and beautifully full-bodied bends at the ends in a single pass, for salon blowouts at home.",
    "price": 41.88,
    "image": "../images/0000042941.jpg",
    "categories": [
      "Beauty & Personal Care",
      "Hot-Air Hair Brushes"
    ],
    "availability": "in stock"
  },
  {
    "item_id": "0000053422",
    "title": "Maybelline Lash Sensational Washable Mascara",
    "description": "Get a sensational full-fan effect with Maybelline New York’s fan favorite Lash Sensational Washable Mascara! Lashes grow in more than one layer. This volumizing mascara can unfold layer upon layer of lashes thanks to its exclusive brush with ten layers of bristle.",
    "price": 6.49,
    "image": "../images/0000053422.jpg",
    "categories": [
      "Beauty & Personal Care",
      "Makeup",
      "Mascara"
    ],
    "availability": "out of stock"
  }
]

Obviously, you don’t need to include everything — you should be selective about what goes in the dataset, gathering solely information that’s useful for building your recommender system. For example, if the recommendation engine is not built to process images, you should simply disregard the “image” metadata.
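
A minimal sketch of that selection step, assuming records shaped like the JSON above. The list of kept fields and the " > " separator used to flatten categories are illustrative choices; adjust both to whatever your recommendation engine expects.

```python
import csv
import io
import json

# Fields kept for the items dataset; "image" is left out on purpose.
KEPT_FIELDS = ["item_id", "title", "description", "price", "categories", "availability"]

def items_to_csv(records, fields=KEPT_FIELDS):
    """Writes the selected fields of each record to a CSV string."""
    buffer = io.StringIO()
    # extrasaction="ignore" silently drops keys that are not in `fields`.
    writer = csv.DictWriter(buffer, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    for record in records:
        row = dict(record)
        # Flatten the category list into a single delimited cell.
        row["categories"] = " > ".join(record.get("categories", []))
        writer.writerow(row)
    return buffer.getvalue()

catalog = json.loads("""[
  {"item_id": "0000031852", "title": "CeraVe Moisturizing Cream", "price": 16.08,
   "image": "../images/0000031852.jpg",
   "categories": ["Beauty & Personal Care", "Body Creams"], "availability": "in stock"}
]""")

items_csv = items_to_csv(catalog)
```

Fields that are missing from a record (the description in this shortened sample) are written as empty cells.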

Another option is to export the product catalog in a CSV format from your Shopify, Magento or WooCommerce store. Below are the steps required for each of them:

Exporting products from Shopify

From your Shopify admin, go to Products > All products. Note: If you want to export only some of your products, then you can filter your product list to view and select specific products for export.
Click Export. From the dialog box, choose the products you want to export: The Current page (of products), All products, Selected products (that you have selected), Current Search (products that match your search and filters).
Select which type of CSV file you want to export – use Plain CSV file.
Click Export products.
You should end up with a CSV file similar to this one that you can download for inspection.

Exporting products from Magento

For Magento 2, data export is an asynchronous operation, which executes in the background so that you can continue working in the Admin without waiting for the operation to finish. Here are the steps:

On the Admin sidebar, go to System > Data Transfer > Export.
In the Export Settings section, set Entity Type to “Products”.
Accept the default Export File Format of CSV.
By default, the Entity Attributes section lists all the available attributes in alphabetical order. You should select the specific attributes to include in your export.
Scroll down and click Continue in the lower-right corner of the page. By default, all exported files are located in the <Magento-root-directory>/var/export folder. If the Remote storage module is enabled, all exported files are located in the <remote-storage-root-directory>/import_export/export folder.

Exporting products from WooCommerce

WooCommerce has a built-in product CSV importer and exporter that you can use by following the next steps:

Go to: WooCommerce > Products.
Select Export at the top. The Export Products screen displays.
Select to Export all columns, Export all products, or Export all categories. Or select which columns, products, or categories to export by using the dropdown menu.
Tick the box to Export Custom Meta if you want to include more metadata for your recommender systems.
Select Generate CSV and wait for the export to finish.

Last but not least, you might already be using Google Merchant Center in which case you’re already in possession of a product feed that you can reuse for your recommendation system.
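
Whichever export you start from, the file usually needs to be trimmed down and renamed before it matches the items dataset. Below is a rough sketch of that step. The column names in `COLUMN_MAP` are assumptions made for illustration; check the header row of your own export, since names differ between platforms and versions.

```python
import csv
import io

# Hypothetical mapping from export column names to the items schema.
# Check the header row of your own export before relying on these names.
COLUMN_MAP = {
    "Variant SKU": "item_id",
    "Title": "title",
    "Body (HTML)": "description",
    "Variant Price": "price",
}

def normalize_export(csv_text, column_map=COLUMN_MAP):
    """Keeps only mapped columns, renames them, and skips rows without an id."""
    items = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        item = {target: (row.get(source) or "").strip()
                for source, target in column_map.items()}
        if not item["item_id"]:
            continue  # no usable identifier, so the row can't be joined later
        item["price"] = float(item["price"]) if item["price"] else None
        items.append(item)
    return items

sample = (
    "Handle,Title,Body (HTML),Variant SKU,Variant Price,Vendor\n"
    "cream,CeraVe Moisturizing Cream,Rich cream,0000031852,16.08,CeraVe\n"
    "cream,,,,,\n"
)
items = normalize_export(sample)
```

Rows without an identifier are dropped because they cannot be matched against the interactions later on.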

How to prepare the Users Dataset

While the items dataset was easier to produce or export, when it comes to working with users datasets, things get a bit more complex, mostly due to privacy constraints rather than technical ones. Clearly, from a machine learning point of view, the more metadata you have on a user the better. However, this might not always be possible and you must be transparent in how you handle user data (e.g., information collected from or about a user, including device information).

In 2006, Netflix disclosed insufficiently anonymized information about nearly half a million customers. During Netflix’s $1 million contest to improve its recommendation system, some researchers were able to re-identify users. It clearly violated users’ privacy, and the resulting lawsuit cost Netflix $9 million to settle.

In the EU, a lot of personal data is considered sensitive and is subject to specific processing conditions. When discussing Personally Identifiable Information (PII), we need to distinguish between authenticated, unauthenticated, or anonymous users.

There’s one nuance that I’d like to emphasize here — according to EU regulations, anonymization is the process of creating anonymous information, which is “information that does not relate to an identified or identifiable natural person” or “to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable”. While we’re considering unauthenticated users as being anonymous, it is worth noting that ensuring 100% anonymity is very hard to achieve.

Unauthenticated/Anonymous	Authenticated
cookie_id	user_id
session_id	first and last name
device	home address
country	email address
city	telephone number
	passport number
	driver’s license number
	social security number
	photo of a face
	credit card number
	account username
	fingerprints

An important design decision needs to be made at this stage: should you build your recommender system for authenticated or for unauthenticated users? If you go for authenticated users, you have more metadata at your disposal and, in theory, your recommendation system could be more accurate, but the downside is that the volume of users that will see recommendations will be very, very small.

On the other hand, building a recommendation system with anonymous user metadata only translates into a much wider audience for your recommendations even if the accuracy might not be that high.

This is the kind of compromise that needs to be made in the early stages of implementing a recommendation model. Granted, at the later stages, a hybrid approach might be taken into consideration if the return on investment makes it worthwhile.

Exporting unauthenticated users from Google Analytics

The User Explorer report lets you isolate and examine individuals rather than aggregate user behavior. Individual user behavior is associated with either Client ID (unauthenticated) or User ID (authenticated).

For each client or user ID, you see the following data:

User Data (visible when you click a Client ID)	Interactions Data
Device Category	Date Last Seen (when the user last initiated a session)
Device Platform	Sessions
Acquisition Date	Avg. Session Duration
Channel	Bounce Rate
Source/Medium	Revenue
Campaign	Transactions
	Goal Conversion Rate

Additionally, when you segment the User Explorer report based on any combination of demographic, technology, and behavioral information, you have a list of all IDs associated with that segment that you can export. Please note that Google Analytics 4 (GA4), released in 2020, introduces some important nuances: the User Explorer is one of the techniques available in the Analysis section.

However, the export functionality in the dashboard is limited to only a few columns/dimensions that you can get, which doesn’t make it really useful for our Users Dataset.

The solution is to use the Google Analytics Reporting API v4. But there’s a caveat: by default, the API doesn’t export any Client IDs from the User Explorer report, and you have to make these available by creating custom dimensions. This allows the analytics API to export data at the Client ID or Session level, instead of returning only aggregated data.

From the Google Analytics admin panel, go to the Admin section. In the Property section, go to Custom Definitions > Custom Dimensions. Add the following dimensions:

ClientID with the User scope

From the Custom Definitions section, go to the Custom Dimensions section and click the +New Custom Dimension button. Add the name “ClientID” and choose the User scope.

SessionID with the Session scope

From the Custom Definitions section, go to the Custom Dimensions section and click the +New Custom Dimension button. Add the name “SessionID” and choose the Session scope.

(Optional) UserID with the User scope.

From the Custom Definitions section, go to the Custom Dimensions section and click the +New Custom Dimension button. Add the name “UserID” and choose the User scope. This custom dimension will contain the users ids from your CMS. It should be used only when connecting GA to other platforms, such as Shopify, Magento or WooCommerce.

After adding all the dimensions, they should be listed together in the Custom Dimensions table, each with its own index.

Next, you need to set up client ID and session ID tracking. If the Google Analytics tracker is included directly in the source of your website, you can add the custom dimensions values by modifying the existing code.

Copy the following script and replace the tracking ID (it should look like UA-XXXXXXXX-Y) with the one corresponding to your website. Also, replace ‘dimension1’, ‘dimension2’, and (optionally) ‘dimension3’ with the actual indexes of the custom dimensions you created above. You can read more about adding the Client ID to a custom dimension with gtag.js.

<!-- Global site tag (gtag.js) - Google Analytics -->
<script async src="https://www.googletagmanager.com/gtag/js?id=UA-XXXXXXXX-Y"></script>
<script>
    window.dataLayer = window.dataLayer || [];
    function gtag() { dataLayer.push(arguments); }
    gtag('js', new Date());

    // Maps 'dimension1' to client id, 'dimension2' to session id and 'dimension3' to user id.
    gtag("config", "UA-XXXXXXXX-Y", {
        custom_map: {
            dimension1: "clientId",
            dimension2: "sessionId",
            dimension3: "userId", // Remove this line if you're not using the UserID custom dimension 
        }
    });

    // Sends an event that sets the dimensions
    gtag("event", "custom_set_dimensions", {
        sessionId: new Date().getTime() + "." + Math.random().toString(36).substring(5),

        // Remove following line if you're not using the UserID custom dimension
        userId: '## Replace with User ID from your CMS ##'
    });
</script>

The last part includes creating API credentials that can be used to authenticate the Google Analytics Reporting API and access the data. See Google’s docs on how to access the Analytics Reporting API.

You can use the following Python script to create an anonymous Users Dataset for your collaborative filtering recommender system. The script creates a dataframe with the Google Analytics data and saves it as a CSV file.

from googleapiclient.discovery import build
from oauth2client.service_account import ServiceAccountCredentials
import pandas as pd

SCOPES = ['https://www.googleapis.com/auth/analytics.readonly']
KEY_FILE_LOCATION = '<REPLACE_WITH_JSON_FILE>'
VIEW_ID = '<REPLACE_WITH_VIEW_ID>'
CUSTOM_DIMENSION_USER = '<REPLACE_WITH_CUSTOM_DIMENSION_INDEX, ex. ga:dimension1>'

def initialize_analyticsreporting():
  """Initializes an Analytics Reporting API V4 service object.

  Returns:
    An authorized Analytics Reporting API V4 service object.
  """
  credentials = ServiceAccountCredentials.from_json_keyfile_name(
      KEY_FILE_LOCATION, SCOPES)

  # Build the service object.
  analytics = build('analyticsreporting', 'v4', credentials=credentials)

  return analytics


def get_report(analytics):
  """Queries the Analytics Reporting API V4.
  The API returns a maximum of 100,000 rows per request. If you have more data, you have to implement pagination:
  - https://developers.google.com/analytics/devguides/reporting/core/v4/basics#pagination
  - https://developers.google.com/analytics/devguides/reporting/core/v4/rest/v4/reports/batchGet#ReportRequest.FIELDS.page_size
  
  Args:
    analytics: An authorized Analytics Reporting API V4 service object.
  Returns:
    The Analytics Reporting API V4 response.
  """
  return analytics.reports().batchGet(
      body={
        'reportRequests': [
        {
          'viewId': VIEW_ID,
          'dateRanges': [{'startDate': '7daysAgo', 'endDate': 'today'}],
          'metrics': [{'expression': 'ga:sessions'}],
          'dimensions': [{'name': CUSTOM_DIMENSION_USER}, {'name': 'ga:country'}, {'name': 'ga:deviceCategory'}],
          'pageSize': 10000
        }]
      }
  ).execute()


def parse_response(response):
  """Parses the Analytics Reporting API V4 response and returns it in a Pandas dataframe.

  Args:
    response: An Analytics Reporting API V4 response.
  Returns:
    Pandas dataframe with the data.
  """
  rows = []

  for report in response.get('reports', []):
    columnHeader = report.get('columnHeader', {})
    dimensionHeaders = columnHeader.get('dimensions', [])
    metricHeaders = columnHeader.get('metricHeader', {}).get('metricHeaderEntries', [])

    for row in report.get('data', {}).get('rows', []):
      dimensions = row.get('dimensions', [])
      dateRangeValues = row.get('metrics', [])

      # One entry per row: dimension values first, then metric values.
      entry = {}
      for header, dimension in zip(dimensionHeaders, dimensions):
        entry[header] = dimension

      # Only one date range is requested, so each metric appears once per row.
      for values in dateRangeValues:
        for metricHeader, value in zip(metricHeaders, values.get('values')):
          entry[metricHeader.get('name')] = value

      rows.append(entry)

  # DataFrame.append was removed in pandas 2.0, so build the frame in one go,
  # and return only after every report has been processed.
  return pd.DataFrame(rows)


def main():
  analytics = initialize_analyticsreporting()
  response = get_report(analytics)
  df = parse_response(response)
  print(df.count())
  print(df.head(10))
  df.to_csv("data.csv")

if __name__ == '__main__':
  main()
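
The docstring above mentions pagination without showing it. Here is a sketch of one way to do it, assuming the `pageToken` request field and the `nextPageToken` report field described in the pagination docs linked above. `FakeAnalytics` is a hypothetical stand-in for the real service object, included only so the loop can be tried without credentials; it is not part of any Google library.

```python
def fetch_all_reports(analytics, report_request):
    """Collects every page of a single report request by following nextPageToken."""
    pages = []
    page_token = None
    while True:
        request = dict(report_request)
        if page_token:
            request['pageToken'] = page_token
        response = analytics.reports().batchGet(
            body={'reportRequests': [request]}).execute()
        pages.append(response)
        # Only one request is sent, so only the first report is inspected.
        reports = response.get('reports', [])
        page_token = reports[0].get('nextPageToken') if reports else None
        if not page_token:
            return pages


class FakeAnalytics:
    """Hypothetical stand-in for the service object, for offline testing only."""
    def __init__(self, rows, page_size):
        self.rows, self.page_size, self._body = rows, page_size, None

    def reports(self):
        return self

    def batchGet(self, body):
        self._body = body
        return self

    def execute(self):
        request = self._body['reportRequests'][0]
        start = int(request.get('pageToken', 0))
        end = start + self.page_size
        report = {'data': {'rows': self.rows[start:end]}}
        if end < len(self.rows):
            report['nextPageToken'] = str(end)
        return {'reports': [report]}


pages = fetch_all_reports(FakeAnalytics(rows=list(range(5)), page_size=2),
                          {'viewId': 'demo'})
```

With the real service object, each response in `pages` can be passed to `parse_response` and the resulting dataframes combined with `pd.concat`.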

Exporting customers from Shopify

You can export a CSV file of all your customers and their details by following the next quick steps:

From your Shopify admin, go to Customers.
Click Export.
Click All customers to export all your store’s customers.
Select Plain CSV file format.
Click Export customers.

As you can see from this template, there are a lot of columns that you don’t actually need – for example, it doesn’t make sense to include in your Users Dataset the full name or e-mail address as inputs for the recommender system.
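
A small sketch of that cleanup, using only the Python standard library. The column names in `PII_COLUMNS` and the choice of the email address as the source of the identifier are assumptions for illustration. Note that a keyed hash is pseudonymization rather than anonymization, so the output should still be handled as personal data.

```python
import hashlib
import hmac

# Columns treated as direct identifiers in this sketch (hypothetical names).
PII_COLUMNS = {"First Name", "Last Name", "Email", "Phone", "Address1"}

def pseudonymize(value, secret_key):
    """Returns a keyed SHA-256 digest so the raw identifier isn't stored."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()

def to_user_record(row, secret_key, id_column="Email"):
    """Drops PII columns and adds a stable pseudonymous user_id."""
    record = {k: v for k, v in row.items() if k not in PII_COLUMNS}
    # Normalize before hashing so the same customer always maps to the same id.
    record["user_id"] = pseudonymize(row[id_column].strip().lower(), secret_key)
    return record

customer = {"First Name": "Ada", "Last Name": "L.", "Email": "Ada@Example.com ",
            "Country": "FR", "City": "Paris"}
user = to_user_record(customer, secret_key=b"replace-with-a-real-secret")
```

The same key has to be used when you build the interactions dataset, otherwise the user ids in the two files will not match.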

Exporting customers from Magento

Similarly to exporting products from Magento 2, you can set Entity Type to “Customers Main File” and you’ll be provided with a list of customers that you can further clean up and use in your recommender system.

Exporting customers from WooCommerce

To manually export customers from WooCommerce:

Go to WooCommerce > Export.
On the Manual Export tab, update the following settings:
- Output type: Choose to export your file in CSV format.
- Export type: Choose to export customers.
- Format: Select a predefined or custom format.
- Filename: Enter a name for the file generated by this export.
- Mark as exported: Enable to ensure the exported data is excluded from future exports.
- Batch processing: Only enable if your site does not support background processing.
Click Export.

How to prepare the Interactions Dataset

The final piece of the dataset puzzle is to generate the interactions datasets. Since there are close to 30M live websites using Google Analytics according to Builtwith.com, it makes sense to cover the best way to export user behavior for our recommendation system using the Google Analytics Reporting API v4.

We’ve already covered most of the process in the section on users datasets and you can pretty much use the same Python script to export the key events that will go into the recommender system. You can take a look at the Dimension & Metrics Explorer if you want to add additional columns.

from googleapiclient.discovery import build
from oauth2client.service_account import ServiceAccountCredentials
import pandas as pd

SCOPES = ['https://www.googleapis.com/auth/analytics.readonly']
KEY_FILE_LOCATION = '<REPLACE_WITH_JSON_FILE>'
VIEW_ID = '<REPLACE_WITH_VIEW_ID>'
CUSTOM_DIMENSION_USER = '<REPLACE_WITH_CUSTOM_DIMENSION_INDEX, ex. ga:dimension1>'
CUSTOM_DIMENSION_SESSION = '<REPLACE_WITH_CUSTOM_DIMENSION_INDEX, ex. ga:dimension2>'

def initialize_analyticsreporting():
  """Initializes an Analytics Reporting API V4 service object.

  Returns:
    An authorized Analytics Reporting API V4 service object.
  """
  credentials = ServiceAccountCredentials.from_json_keyfile_name(
      KEY_FILE_LOCATION, SCOPES)

  # Build the service object.
  analytics = build('analyticsreporting', 'v4', credentials=credentials)

  return analytics


def get_report(analytics):
  """Queries the Analytics Reporting API V4.
  The API returns a maximum of 100,000 rows per request. If you have more data, you have to implement pagination:
  - https://developers.google.com/analytics/devguides/reporting/core/v4/basics#pagination
  - https://developers.google.com/analytics/devguides/reporting/core/v4/rest/v4/reports/batchGet#ReportRequest.FIELDS.page_size
  
  Args:
    analytics: An authorized Analytics Reporting API V4 service object.
  Returns:
    The Analytics Reporting API V4 response.
  """
  return analytics.reports().batchGet(
      body={
        'reportRequests': [
        {
          'viewId': VIEW_ID,
          'dateRanges': [{'startDate': '7daysAgo', 'endDate': 'today'}],
          'metrics': [
                {'expression': 'ga:productListViews'},
                {'expression': 'ga:productDetailViews'},
                {'expression': 'ga:productAddsToCart'},
                {'expression': 'ga:productCheckouts'},
                {'expression': 'ga:itemQuantity'},
                {'expression': 'ga:itemRevenue'}
            ],
          'dimensions': [
              {'name': CUSTOM_DIMENSION_USER},
              {'name': CUSTOM_DIMENSION_SESSION},
              {'name': 'ga:productSku'}
            ],
          'pageSize': 10000
        }]
      }
  ).execute()


def parse_response(response):
  """Parses the Analytics Reporting API V4 response and returns it in a Pandas dataframe.

  Args:
    response: An Analytics Reporting API V4 response.
  Returns:
    Pandas dataframe with the data.
  """
  rows = []

  for report in response.get('reports', []):
    columnHeader = report.get('columnHeader', {})
    dimensionHeaders = columnHeader.get('dimensions', [])
    metricHeaders = columnHeader.get('metricHeader', {}).get('metricHeaderEntries', [])

    for row in report.get('data', {}).get('rows', []):
      dimensions = row.get('dimensions', [])
      dateRangeValues = row.get('metrics', [])

      # One entry per row: dimension values first, then metric values.
      entry = {}
      for header, dimension in zip(dimensionHeaders, dimensions):
        entry[header] = dimension

      # Only one date range is requested, so each metric appears once per row.
      for values in dateRangeValues:
        for metricHeader, value in zip(metricHeaders, values.get('values')):
          entry[metricHeader.get('name')] = value

      rows.append(entry)

  # DataFrame.append was removed in pandas 2.0, so build the frame in one go,
  # and return only after every report has been processed.
  return pd.DataFrame(rows)


def main():
  analytics = initialize_analyticsreporting()
  response = get_report(analytics)
  df = parse_response(response)
  print(df.count())
  print(df.head(10))
  df.to_csv("interactions.csv")

if __name__ == '__main__':
  main()
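
The export above has one wide row per user, session, and product, with a column per metric. Many recommendation engines expect one row per event instead, so here is a sketch of that reshaping. The event labels in `METRIC_TO_EVENT` and the default column names are assumptions; use the custom dimension indexes from your own property. With the dataframe from the script, the input rows could come from `df.to_dict("records")`.

```python
# Assumed mapping from metric columns to event names (hypothetical labels).
METRIC_TO_EVENT = {
    "ga:productDetailViews": "view",
    "ga:productAddsToCart": "add_to_cart",
    "ga:itemQuantity": "purchase",
}

def to_event_rows(rows, user_col="ga:dimension1", item_col="ga:productSku"):
    """Turns one wide row per (user, item) into one row per non-zero event type."""
    events = []
    for row in rows:
        for metric, event_type in METRIC_TO_EVENT.items():
            # The API returns metric values as strings.
            count = int(float(row.get(metric, 0)))
            if count > 0:
                events.append({"UserId": row[user_col], "ItemId": row[item_col],
                               "EventType": event_type, "Count": count})
    return events

wide = [
    {"ga:dimension1": "c1", "ga:productSku": "0000031852",
     "ga:productDetailViews": "3", "ga:productAddsToCart": "1", "ga:itemQuantity": "0"},
    {"ga:dimension1": "c2", "ga:productSku": "0000053422",
     "ga:productDetailViews": "1", "ga:productAddsToCart": "0", "ga:itemQuantity": "2"},
]
events = to_event_rows(wide)
```

This report carries no timestamp, so a Timestamp column would need an extra date dimension in the request.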

Event	Trigger	Parameters
add_payment_info	when a user submits their payment information	coupon, currency, items, payment_type, value
add_shipping_info	when a user submits their shipping information	coupon, currency, items, shipping_tier, value
add_to_cart	when a user adds items to cart	currency, items, value
add_to_wishlist	when a user adds items to a wishlist	currency, items, value
begin_checkout	when a user begins checkout	coupon, currency, items, value
generate_lead	when a user submits a form or request for information	value, currency
purchase	when a user completes a purchase	affiliation, coupon, currency, items, transaction_id, shipping, tax, value (required parameter)
refund	when a refund is issued	affiliation, coupon, currency, items, transaction_id, shipping, tax, value
remove_from_cart	when a user removes items from a cart	currency, items, value
select_item	when an item is selected from a list	items, item_list_name, item_list_id
select_promotion	when a user selects a promotion	items, promotion_id, promotion_name, creative_name, creative_slot, location_id
view_cart	when a user views their cart	currency, items, value
view_item	when a user views an item	currency, items, value
view_item_list	when a user sees a list of items/offerings	items, item_list_name, item_list_id
view_promotion	when a promotion is shown to a user	items, promotion_id, promotion_name, creative_name, creative_slot, location_id

Retail and ecommerce apps should log the events listed in the table above. Logging events along with their prescribed parameters ensures maximum available detail in reports and improves the overall performance of the collaborative filtering recommender system.

While having more user interactions increases the accuracy of the recommendation system, you might want to consider experimenting with fewer and more basic ones, like product detail page views (for unauthenticated users) and/or orders (for authenticated users). The intuition here is that the recommendation system will be able to recommend relevant items solely based on implicit ratings such as product detail page views and purchase history.
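
To make the idea of implicit ratings concrete, here is a minimal sketch that turns views and purchases into a single score per user and item. The weights and the cap are hypothetical starting values, not recommended settings; they are the kind of thing you would tune while measuring the model.

```python
from collections import defaultdict

# Hypothetical weights: a purchase counts for more than a page view.
EVENT_WEIGHTS = {"view": 1.0, "purchase": 5.0}

def implicit_ratings(events, weights=EVENT_WEIGHTS, cap=10.0):
    """Sums weighted events per (user, item) pair and caps the result."""
    scores = defaultdict(float)
    for user_id, item_id, event_type in events:
        # Event types without a weight contribute nothing.
        scores[(user_id, item_id)] += weights.get(event_type, 0.0)
    return {pair: min(score, cap) for pair, score in scores.items()}

events = [
    ("c1", "0000031852", "view"),
    ("c1", "0000031852", "view"),
    ("c1", "0000031852", "purchase"),
    ("c2", "0000053422", "view"),
]
ratings = implicit_ratings(events)
```

The cap keeps a handful of very active sessions from dominating the scores.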

Exporting orders from Shopify

You can export orders along with their transaction histories or you can export only the transaction histories of your orders:

From the Orders page, click Export.
In the Export orders window:
- Select the option for the orders that you want to export. For example, if you want to export your orders by date, then click Export orders by date and set the start and end dates for the orders that you want to export.
- Under Export as, select a file format.
If you want to download all information about your orders, then click Export orders. If you want to download your transaction information only, then click Export transaction histories.

Exporting orders from Magento

In your Magento 2 backend, go to Sales > Orders, open the Export drop-down, and select CSV.

Exporting orders from WooCommerce

You can follow the same steps to export orders from WooCommerce as you did for exporting customers. The only difference is selecting “orders” for Export type.

Remember, for the first iteration of your collaborative filtering recommendation system, you might need to clean up the interactions dataset export until it consists only of UserId, Date, and ProductIds.
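
A sketch of that cleanup for an orders export with one row per line item. The column names `order_id`, `customer_id`, `created_at`, and `sku` are hypothetical; map them to the headers of your own export.

```python
import csv
import io

def orders_to_interactions(csv_text):
    """Groups line-item rows into one record per order: UserId, Date, ProductIds."""
    orders = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        order = orders.setdefault(row["order_id"], {
            "UserId": row["customer_id"],
            "Date": row["created_at"][:10],  # keep the YYYY-MM-DD part
            "ProductIds": [],
        })
        # Skip empty SKUs and repeated line items within the same order.
        if row["sku"] and row["sku"] not in order["ProductIds"]:
            order["ProductIds"].append(row["sku"])
    return list(orders.values())

sample = (
    "order_id,customer_id,created_at,sku,total\n"
    "1001,u1,2021-03-02 10:15:00,0000031852,22.57\n"
    "1001,u1,2021-03-02 10:15:00,0000053422,22.57\n"
    "1002,u2,2021-03-03 08:00:00,0000042941,41.88\n"
)
interactions = orders_to_interactions(sample)
```

Everything else in the export (totals, shipping, payment details) is left behind.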

Putting it all together

If you reached this point, you should have three CSV files, each with the following structure:

*Items Dataset*
ItemId	Title	Description	Price	Other Item Metadata (…)

*Users Dataset*
UserId	Country	City	Other User Metadata (…)

*Interactions Dataset*
ItemId	UserId	Timestamp	EventType	Other Interaction Metadata (…)

Please note that it is absolutely possible to ignore the users dataset altogether if you do not have (or do not want to include) any user metadata.
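
Before feeding the files to a model, it can help to check that they agree with each other. The sketch below is one possible check, assuming each dataset has been loaded as a list of dictionaries keyed by the column names above (for example with `csv.DictReader`). The function name and the shape of the report are illustrative.

```python
def check_datasets(items, users, interactions):
    """Reports interactions that point at unknown items or users."""
    item_ids = {row["ItemId"] for row in items}
    # An empty or missing users dataset disables the user check.
    user_ids = {row["UserId"] for row in users} if users else None
    problems = []
    for n, row in enumerate(interactions, start=1):
        if row["ItemId"] not in item_ids:
            problems.append((n, "unknown ItemId", row["ItemId"]))
        if user_ids is not None and row["UserId"] not in user_ids:
            problems.append((n, "unknown UserId", row["UserId"]))
    return problems

items = [{"ItemId": "A"}, {"ItemId": "B"}]
users = [{"UserId": "u1"}]
interactions = [
    {"ItemId": "A", "UserId": "u1"},
    {"ItemId": "C", "UserId": "u2"},
]

with_users = check_datasets(items, users, interactions)
without_users = check_datasets(items, None, interactions)
```

Passing `None` for the users dataset covers the case where you decide to leave it out.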

[Diagram: The anatomy of a performant recommendation engine]

Interactions are the basis of building the User Rating Matrix (URM). Just by using user_id – item_id pairs, you can create the first version of the URM / baseline model. Other features can be added one by one, while measuring the performance of the model at each step.
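
Here is a rough sketch of that first version, using plain Python lists so the structure is easy to see. A real catalog would call for a sparse matrix instead of a dense one. The popularity baseline is a stand-in chosen for brevity; it is not the collaborative filtering model itself.

```python
def build_urm(pairs):
    """Builds a binary user-item matrix plus the index maps used to read it."""
    users = sorted({u for u, _ in pairs})
    items = sorted({i for _, i in pairs})
    user_index = {u: n for n, u in enumerate(users)}
    item_index = {i: n for n, i in enumerate(items)}
    urm = [[0] * len(items) for _ in users]
    for u, i in pairs:
        urm[user_index[u]][item_index[i]] = 1
    return urm, user_index, item_index

def popular_unseen(urm, user_row, top_n=2):
    """Baseline: most interacted-with items the user has not touched yet."""
    popularity = [sum(col) for col in zip(*urm)]
    unseen = [j for j, seen in enumerate(urm[user_row]) if not seen]
    # Most popular first; ties are broken by column index.
    return sorted(unseen, key=lambda j: (-popularity[j], j))[:top_n]

pairs = [("u1", "A"), ("u1", "B"), ("u2", "B"), ("u2", "C"), ("u3", "B")]
urm, user_index, item_index = build_urm(pairs)
suggestions = popular_unseen(urm, user_index["u3"])
```

A baseline like this gives you a number to beat when you start adding features one by one.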

From all of the available datasets, properties that are numeric or can be one-hot encoded can be easily added into the models. Others will require extra processing, for example applying NLP to extract embeddings from text data, or computer vision to process product images.
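
As a small illustration of one-hot encoding, the sketch below expands the availability field from the items dataset into 0/1 columns. The `column=value` naming of the new columns is an arbitrary choice made for readability.

```python
def one_hot(records, column):
    """Replaces a categorical column with one 0/1 column per observed value."""
    values = sorted({r[column] for r in records})
    encoded = []
    for r in records:
        # Copy every other field unchanged.
        row = {k: v for k, v in r.items() if k != column}
        for value in values:
            row[f"{column}={value}"] = int(r[column] == value)
        encoded.append(row)
    return encoded

items = [
    {"item_id": "0000031852", "price": 16.08, "availability": "in stock"},
    {"item_id": "0000053422", "price": 6.49, "availability": "out of stock"},
]
features = one_hot(items, "availability")
```

Numeric fields such as price pass through untouched and can be scaled separately if the model needs it.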

In the next blog post in this series we’ll focus on preparing our datasets and exploring feature engineering for our collaborative filtering recommender system, which represents a critical step in building a highly performant machine learning model.

Stay tuned!

And, if you have any questions: https://twitter.com/cborodescu

The anatomy of high-performance recommender systems, Part II
