
Handling common data sources for your recommender system

In the first part of this series, we talked about the key components of a high-performance recommender system: (1) Data Sources, (2) Feature Store, (3) Machine Learning Models, (4) Predictions & Actions, (5) Results, (6) Evaluation, and (7) AI Ethics.

In this article, we’re diving deeper into the common data sources required for a collaborative filtering type of recommender system. At the very least, the inputs for a recommendation engine include users, items, and ratings, typically arranged as a sparse user-item matrix (a blank cell means the user hasn’t rated that item):

USERS/ITEMS U1 U2 U3 U4 U5
I1 1 3 4
I2 3 2 3
I3 2 5 3
I4 4 1
I5 5 2 5
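As a sketch of how such a matrix comes together, here is a minimal example (the user/item IDs and ratings are made up; real ones come from your exports) that pivots raw (user, item, rating) triples into a sparse user-item matrix with pandas:

```python
import pandas as pd

# Hypothetical (user, item, rating) triples.
ratings = pd.DataFrame({
    "user_id": ["U1", "U2", "U2", "U3", "U5"],
    "item_id": ["I1", "I1", "I3", "I2", "I5"],
    "rating":  [1, 3, 5, 3, 5],
})

# Pivot into the sparse user-item matrix; NaN marks a missing rating.
urm = ratings.pivot_table(index="item_id", columns="user_id", values="rating")
```

Most real matrices are overwhelmingly sparse, which is exactly why the interactions dataset described below matters so much.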

There are three types of recommender system datasets that you must prepare for a collaborative filtering system:

1. Items Dataset
Export data from your in-house ecommerce platform, or from platforms like Shopify, Magento, or WooCommerce. Include the product catalog (items) along with information such as price, SKU type, or availability.

2. Users Dataset
Metadata about your users might include information such as age, gender, or loyalty membership, which can be important signals for recommender systems.

3. Interactions Dataset
Google Analytics (or any third-party analytics platform) is often considered a good source of user interaction info, such as location or device (mobile, tablet, desktop), page views, time on site, and conversions. 

 

How to prepare the Items Dataset

Before you can send the items dataset to a recommendation engine, you need to extract data from one or several sources and format it in a way that the recommendation engine recognizes. 

Let’s consider the straightforward scenario where you’re already using Algolia Search for your ecommerce site and you’re sending your product catalog using one of the Algolia API clients. Here’s what the JSON might look like:

[
  {
    "item_id": "0000031852",
    "title": "CeraVe Moisturizing Cream",
    "description": "Developed with dermatologists, CeraVe Moisturizing Cream has a unique formula that provides 24-hour hydration and helps restore the protective skin barrier with three essential ceramides (1,3,6-II). This rich, non-greasy, fast-absorbing formula is ideal for sensitive skin on both the face and body.",
    "price": 16.08,
    "image": "../images/0000031852.jpg",
    "categories": [
      "Beauty & Personal Care",
      "Body Creams"
    ],
    "availability": "in stock"
  },
  {
    "item_id": "0000042941",
    "title": "REVLON One-Step Hair Dryer And Volumizer Hot Air Brush, Black, Packaging May Vary",
    "description": "The Revlon One-Step Hair Dryer and Volumizer is a Hot Air Brush to deliver gorgeous volume and brilliant shine in a single step. The unique oval brush design smooth hair while the rounded edges quickly create volume at the root and beautifully full-bodied bends at the ends in a single pass, for salon blowouts at home.",
    "price": 41.88,
    "image": "../images/0000042941.jpg",
    "categories": [
      "Beauty & Personal Care",
      "Hot-Air Hair Brushes"
    ],
    "availability": "in stock"
  },
  {
    "item_id": "0000053422",
    "title": "Maybelline Lash Sensational Washable Mascara",
    "description": "Get a sensational full-fan effect with Maybelline New York’s fan favorite Lash Sensational Washable Mascara! Lashes grow in more than one layer. This volumizing mascara can unfold layer upon layer of lashes thanks to its exclusive brush with ten layers of bristle.",
    "price": 6.49,
    "image": "../images/0000053422.jpg",
    "categories": [
      "Beauty & Personal Care",
      "Makeup",
      "Mascara"
    ],
    "availability": "out of stock"
  }
]

Obviously, you don’t need to include everything — be selective about what goes in the dataset, gathering only the information that’s useful for building your recommender system. For example, if the recommendation engine is not built to process images, you can simply disregard the “image” metadata.
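As an illustration (the field names match the sample JSON above, but the allow-list itself is an assumption you’d adapt to your engine), filtering an item record down to useful fields could look like this:

```python
# Fields the recommendation engine can actually use (assumed allow-list).
KEEP = {"item_id", "title", "description", "price", "categories", "availability"}

def prepare_item(raw: dict) -> dict:
    """Drop metadata the engine can't process, e.g. image paths."""
    return {k: v for k, v in raw.items() if k in KEEP}

item = {
    "item_id": "0000031852",
    "title": "CeraVe Moisturizing Cream",
    "price": 16.08,
    "image": "../images/0000031852.jpg",
}
clean = prepare_item(item)  # the "image" key is dropped
```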

Another option is to export the product catalog in CSV format from your Shopify, Magento, or WooCommerce store. Below are the steps required for each of them:

 

Exporting products from Shopify

  1. From your Shopify admin, go to Products > All products. Note: If you want to export only some of your products, then you can filter your product list to view and select specific products for export.
  2. Click Export. From the dialog box, choose the products you want to export: The Current page (of products), All products, Selected products (that you have selected), Current Search (products that match your search and filters).
  3. Select which type of CSV file you want to export – use Plain CSV file.
  4. Click Export products.
  5. You should end up with a CSV file that you can download for inspection.

 

Exporting products from Magento

For Magento 2, data export is an asynchronous operation, which executes in the background so that you can continue working in the Admin without waiting for the operation to finish. Here are the steps:

  1. On the Admin sidebar, go to System > Data Transfer > Export.
  2. In the Export Settings section, set Entity Type to “Products”.
  3. Accept the default Export File Format of CSV.
  4. By default, the Entity Attributes section lists all the available attributes in alphabetical order. You should select the specific attributes to include in your export.
  5. Scroll down and click Continue in the lower-right corner of the page. By default, all exported files are located in the <Magento-root-directory>/var/export folder. If the Remote storage module is enabled, all exported files are located in the <remote-storage-root-directory>/import_export/export folder.

 

Exporting products from WooCommerce

WooCommerce has a built-in product CSV importer and exporter that you can use by following the next steps:

  1. Go to: WooCommerce > Products.
  2. Select Export at the top. The Export Products screen displays.
  3. Select to Export all columns, Export all products, or Export all categories. Or select which columns, products, or categories to export by using the dropdown menu.
  4. Tick the box to Export Custom Meta if you want to include more metadata for your recommender systems.
  5. Select Generate CSV and wait for the export to finish.

Last but not least, you might already be using Google Merchant Center, in which case you’re already in possession of a product feed that you can reuse for your recommendation system.
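Whichever platform you export from, the CSV still has to be mapped onto a common items schema. A minimal sketch (the Shopify-style column names and sample rows here are illustrative; adjust them to your actual export):

```python
import io
import pandas as pd

# Two sample rows in a Shopify-style export format (columns vary per store).
csv_text = """Handle,Title,Body (HTML),Variant Price
cerave-cream,CeraVe Moisturizing Cream,Rich non-greasy formula,16.08
revlon-brush,REVLON One-Step Hot Air Brush,Volume and shine,41.88
"""

df = pd.read_csv(io.StringIO(csv_text))

# Rename platform-specific columns onto the generic items schema.
items = df.rename(columns={
    "Handle": "item_id",
    "Title": "title",
    "Body (HTML)": "description",
    "Variant Price": "price",
})[["item_id", "title", "description", "price"]]
```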

 

How to prepare the Users Dataset

While the items dataset is relatively easy to produce or export, users datasets are more complex to work with, mostly due to privacy constraints rather than technical ones. Clearly, from a machine learning point of view, the more metadata you have on a user, the better. However, this might not always be possible, and you must be transparent in how you handle user data (e.g., information collected from or about a user, including device information).

In 2006, Netflix released insufficiently anonymized information about nearly half a million customers as part of its $1 million contest to improve its recommendation system, and researchers were able to re-identify some of those users. This clearly violated users’ privacy, and settling the resulting lawsuit cost Netflix $9 million.

In the EU, a lot of personal data is considered sensitive and is subject to specific processing conditions. When discussing Personally Identifiable Information (PII), we need to distinguish between authenticated, unauthenticated, or anonymous users. 

There’s one nuance that I’d like to emphasize here — according to EU regulations, anonymization is the process of creating anonymous information, which is “information that does not relate to an identified or identifiable natural person” or “to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable”. While we’re considering unauthenticated users as being anonymous, it is worth noting that ensuring 100% anonymity is very hard to achieve.

Unauthenticated/Anonymous: cookie_id, session_id, device, country, city

Authenticated: user_id, first and last name, home address, email address, telephone number, passport number, driver’s license number, social security number, photo of a face, credit card number, account username, fingerprints
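If you do need a stable identifier derived from the authenticated column, one common mitigation is to pseudonymize it before it enters the dataset. A minimal sketch using a salted SHA-256 hash (note this is pseudonymization, not true anonymization: the mapping can be recovered if the salt leaks or the inputs are guessable):

```python
import hashlib

SALT = "replace-with-a-secret-salt"  # keep this secret, out of your dataset

def pseudonymize(value: str) -> str:
    """Return a stable pseudonymous ID for a direct identifier."""
    return hashlib.sha256((SALT + value).encode("utf-8")).hexdigest()[:16]

user_key = pseudonymize("jane.doe@example.com")
```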

An important design decision needs to be made at this stage: should you build your recommender system for authenticated or for unauthenticated users? If you go for authenticated users, you have more metadata at your disposal and, in theory, your recommendation system could be more accurate, but the downside is that the volume of users who will actually see recommendations will be very small.

On the other hand, building a recommendation system with only anonymous user metadata translates into a much wider audience for your recommendations, even if the accuracy might not be as high.

This is the kind of compromise that needs to be made in the early stages of implementing a recommendation model. Granted, at the later stages, a hybrid approach might be taken into consideration if the return on investment makes it worthwhile.

 

Exporting unauthenticated users from Google Analytics

The User Explorer report lets you isolate and examine individuals rather than aggregate user behavior. Individual user behavior is associated with either a Client ID (unauthenticated) or a User ID (authenticated).

For each client or user ID, you see the following data:

User Data 

(visible when you click a Client ID in the report)

Interactions Data
  • Device Category 
  • Device Platform 
  • Acquisition Date 
  • Channel 
  • Source/Medium 
  • Campaign
  • Date Last Seen (when the user last initiated a session)
  • Sessions 
  • Avg. Session Duration 
  • Bounce Rate 
  • Revenue 
  • Transactions 
  • Goal Conversion Rate

Additionally, when you segment the User Explorer report based on any combination of demographic, technology, and behavioral information, you get a list of all IDs associated with that segment that you can export. Please note that Google Analytics 4 (GA4), released in 2020, introduces some important nuances – the User Explorer is one of the techniques available in its Analysis section.

However, the export functionality in the dashboard is limited to only a few columns/dimensions, which doesn’t make it really useful for our Users Dataset.

The solution is to use the Google Analytics Reporting API v4. But there’s a caveat: the default version doesn’t export any Client IDs from the User Explorer report, and you have to make these available by creating custom dimensions. This allows the Analytics API to export data at the Client ID or session level, instead of returning only aggregated data.

From the Google Analytics admin panel, go to the Admin section. In the Property section, go to Custom Definitions > Custom Dimensions. Add the following dimensions:

  • ClientID with the User scope

    From the Custom Definitions section, go to the Custom Dimensions section and click the +New Custom Dimension button. Add the name “ClientID” and choose the User scope.

 

  • SessionID with the Session scope

    From the Custom Definitions section, go to the Custom Dimensions section and click the +New Custom Dimension button. Add the name “SessionID” and choose the Session scope.

 

  • (Optional) UserID with the User scope.

    From the Custom Definitions section, go to the Custom Dimensions section and click the +New Custom Dimension button. Add the name “UserID” and choose the User scope. This custom dimension will contain the user IDs from your CMS. It should be used only when connecting GA to other platforms, such as Shopify, Magento, or WooCommerce.

 

After adding all the dimensions, your Custom Dimensions table will list ClientID, SessionID, and (optionally) UserID, each with its own index.

Next, you need to set up client ID and session ID tracking. If the Google Analytics tracker is included directly in the source of your website, you can add the custom dimensions values by modifying the existing code. 

Copy the following script and replace the tracking ID (it should look like UA-XXXXXXXX-Y) with the one corresponding to your website. Also, replace ‘dimension1’, ‘dimension2’, and (optionally) ‘dimension3’ with the actual dimension indexes you created above. You can read more about adding the Client ID to a custom dimension with gtag.js.

<!-- Global site tag (gtag.js) - Google Analytics -->
<script async src="https://www.googletagmanager.com/gtag/js?id=UA-XXXXXXXX-Y"></script>
<script>
    window.dataLayer = window.dataLayer || [];
    function gtag() { dataLayer.push(arguments); }
    gtag('js', new Date());

    // Maps 'dimension1' to client id, 'dimension2' to session id and 'dimension3' to user id.
    gtag("config", "UA-XXXXXXXX-Y", {
        custom_map: {
            dimension1: "clientId",
            dimension2: "sessionId",
            dimension3: "userId", // Remove this line if you're not using the UserID custom dimension 
        }
    });

    // Sends an event that sets the dimensions
    gtag("event", "custom_set_dimensions", {
        sessionId: new Date().getTime() + "." + Math.random().toString(36).substring(5),

        // Remove following line if you're not using the UserID custom dimension
        userId: '## Replace with User ID from your CMS ##'
    });
</script>

The last part includes creating API credentials that can be used to authenticate the Google Analytics Reporting API and access the data. See Google’s docs on how to access the Analytics Reporting API.

You can use the following Python script to create an anonymous Users Dataset for your collaborative filtering recommender system. The script loads the Google Analytics data into a dataframe and saves it as a CSV file.

from apiclient.discovery import build
from oauth2client.service_account import ServiceAccountCredentials
import pandas as pd

SCOPES = ['https://www.googleapis.com/auth/analytics.readonly']
KEY_FILE_LOCATION = '<REPLACE_WITH_JSON_FILE>'
VIEW_ID = '<REPLACE_WITH_VIEW_ID>'
CUSTOM_DIMENSION_USER = '<REPLACE_WITH_CUSTOM_DIMENSION_INDEX, ex. ga:dimension1>'

def initialize_analyticsreporting():
  """Initializes an Analytics Reporting API V4 service object.

  Returns:
    An authorized Analytics Reporting API V4 service object.
  """
  credentials = ServiceAccountCredentials.from_json_keyfile_name(
      KEY_FILE_LOCATION, SCOPES)

  # Build the service object.
  analytics = build('analyticsreporting', 'v4', credentials=credentials)

  return analytics


def get_report(analytics):
  """Queries the Analytics Reporting API V4.
  The API returns a maximum of 100,000 rows per request. If you have more data, you have to implement pagination:
  - https://developers.google.com/analytics/devguides/reporting/core/v4/basics#pagination
  - https://developers.google.com/analytics/devguides/reporting/core/v4/rest/v4/reports/batchGet#ReportRequest.FIELDS.page_size
  
  Args:
    analytics: An authorized Analytics Reporting API V4 service object.
  Returns:
    The Analytics Reporting API V4 response.
  """
  return analytics.reports().batchGet(
      body={
        'reportRequests': [
        {
          'viewId': VIEW_ID,
          'dateRanges': [{'startDate': '7daysAgo', 'endDate': 'today'}],
          'metrics': [{'expression': 'ga:sessions'}],
          'dimensions': [{'name': CUSTOM_DIMENSION_USER}, {'name': 'ga:country'}, {'name': 'ga:deviceCategory'}],
          'pageSize': 10000
        }]
      }
  ).execute()


def parse_response(response):
  """Parses the Analytics Reporting API V4 response and returns it in a Pandas dataframe.

  Args:
    response: An Analytics Reporting API V4 response.
  Returns:
    Pandas dataframe with the data.
  """
  rows = []

  for report in response.get('reports', []):
    columnHeader = report.get('columnHeader', {})
    dimensionHeaders = columnHeader.get('dimensions', [])
    metricHeaders = columnHeader.get('metricHeader', {}).get('metricHeaderEntries', [])

    for row in report.get('data', {}).get('rows', []):
      dimensions = row.get('dimensions', [])
      dateRangeValues = row.get('metrics', [])

      entry = {}
      for header, dimension in zip(dimensionHeaders, dimensions):
        entry[header] = dimension

      for values in dateRangeValues:
        for metricHeader, value in zip(metricHeaders, values.get('values', [])):
          entry[metricHeader.get('name')] = value

      rows.append(entry)

  # DataFrame.append() was removed in pandas 2.0, so collect the rows
  # first and build the dataframe in a single call.
  return pd.DataFrame(rows)


def main():
  analytics = initialize_analyticsreporting()
  response = get_report(analytics)
  df = parse_response(response)
  print(df.count())
  print(df.head(10))
  df.to_csv("data.csv")

if __name__ == '__main__':
  main()
sample_read_ga_data.py

 


Exporting customers from Shopify

You can export a CSV file of all your customers and their details by following the next quick steps:

  1. From your Shopify admin, go to Customers
  2. Click Export
  3. Click All customers to export all your store’s customers. 
  4. Select Plain CSV file format. 
  5. Click Export customers.

As you can see from the exported file, there are a lot of columns that you don’t actually need – for example, it doesn’t make sense to include the full name or email address as inputs for the recommender system.

 

Exporting customers from Magento

Similarly to exporting products from Magento 2, you can set Entity Type to “Customers Main File” and you’ll get a list of customers that you can further clean up and use in your recommender system.

 

Exporting customers from WooCommerce

To manually export customers from WooCommerce:

  1. Go to WooCommerce > Export.
  2. On the Manual Export tab, update the following settings:
    1. Output type: Choose to export your file in CSV format.
    2. Export type: Choose to export customers.
    3. Format: Select a predefined or custom format.
    4. Filename: Enter a name for the file generated by this export.
    5. Mark as exported: Enable to ensure the exported data is excluded from future exports.
    6. Batch processing: Only enable if your site does not support background processing.

  3. Click Export.

 

How to prepare the Interactions Dataset

The final piece of the dataset puzzle is generating the interactions dataset. Since there are close to 30M live websites using Google Analytics according to Builtwith.com, it makes sense to cover exporting user behavior for our recommendation system using the Google Analytics Reporting API v4.

We’ve already covered most of the process in the section on the users dataset, and you can use pretty much the same Python script to export the key events that will feed the recommender system. Take a look at the Dimensions & Metrics Explorer if you want to add additional columns.

from apiclient.discovery import build
from oauth2client.service_account import ServiceAccountCredentials
import pandas as pd

SCOPES = ['https://www.googleapis.com/auth/analytics.readonly']
KEY_FILE_LOCATION = '<REPLACE_WITH_JSON_FILE>'
VIEW_ID = '<REPLACE_WITH_VIEW_ID>'
CUSTOM_DIMENSION_USER = '<REPLACE_WITH_CUSTOM_DIMENSION_INDEX, ex. ga:dimension1>'
CUSTOM_DIMENSION_SESSION = '<REPLACE_WITH_CUSTOM_DIMENSION_INDEX, ex. ga:dimension2>'

def initialize_analyticsreporting():
  """Initializes an Analytics Reporting API V4 service object.

  Returns:
    An authorized Analytics Reporting API V4 service object.
  """
  credentials = ServiceAccountCredentials.from_json_keyfile_name(
      KEY_FILE_LOCATION, SCOPES)

  # Build the service object.
  analytics = build('analyticsreporting', 'v4', credentials=credentials)

  return analytics


def get_report(analytics):
  """Queries the Analytics Reporting API V4.
  The API returns a maximum of 100,000 rows per request. If you have more data, you have to implement pagination:
  - https://developers.google.com/analytics/devguides/reporting/core/v4/basics#pagination
  - https://developers.google.com/analytics/devguides/reporting/core/v4/rest/v4/reports/batchGet#ReportRequest.FIELDS.page_size
  
  Args:
    analytics: An authorized Analytics Reporting API V4 service object.
  Returns:
    The Analytics Reporting API V4 response.
  """
  return analytics.reports().batchGet(
      body={
        'reportRequests': [
        {
          'viewId': VIEW_ID,
          'dateRanges': [{'startDate': '7daysAgo', 'endDate': 'today'}],
          'metrics': [
                {'expression': 'ga:productListViews'},
                {'expression': 'ga:productDetailViews'},
                {'expression': 'ga:productAddsToCart'},
                {'expression': 'ga:productCheckouts'},
                {'expression': 'ga:itemQuantity'},
                {'expression': 'ga:itemRevenue'}
            ],
          'dimensions': [
              {'name': CUSTOM_DIMENSION_USER},
              {'name': CUSTOM_DIMENSION_SESSION},
              {'name': 'ga:productSku'}
            ],
          'pageSize': 10000
        }]
      }
  ).execute()


def parse_response(response):
  """Parses the Analytics Reporting API V4 response and returns it in a Pandas dataframe.

  Args:
    response: An Analytics Reporting API V4 response.
  Returns:
    Pandas dataframe with the data.
  """
  rows = []

  for report in response.get('reports', []):
    columnHeader = report.get('columnHeader', {})
    dimensionHeaders = columnHeader.get('dimensions', [])
    metricHeaders = columnHeader.get('metricHeader', {}).get('metricHeaderEntries', [])

    for row in report.get('data', {}).get('rows', []):
      dimensions = row.get('dimensions', [])
      dateRangeValues = row.get('metrics', [])

      entry = {}
      for header, dimension in zip(dimensionHeaders, dimensions):
        entry[header] = dimension

      for values in dateRangeValues:
        for metricHeader, value in zip(metricHeaders, values.get('values', [])):
          entry[metricHeader.get('name')] = value

      rows.append(entry)

  # DataFrame.append() was removed in pandas 2.0, so collect the rows
  # first and build the dataframe in a single call.
  return pd.DataFrame(rows)


def main():
  analytics = initialize_analyticsreporting()
  response = get_report(analytics)
  df = parse_response(response)
  print(df.count())
  print(df.head(10))
  df.to_csv("interactions.csv")

if __name__ == '__main__':
  main()
sample_read_ga_interactions.py

 

Event | Trigger | Parameters
add_payment_info | a user submits their payment information | coupon, currency, items, payment_type, value
add_shipping_info | a user submits their shipping information | coupon, currency, items, shipping_tier, value
add_to_cart | a user adds items to cart | currency, items, value
add_to_wishlist | a user adds items to a wishlist | currency, items, value
begin_checkout | a user begins checkout | coupon, currency, items, value
generate_lead | a user submits a form or request for information | value, currency
purchase | a user completes a purchase | affiliation, coupon, currency, items, transaction_id, shipping, tax, value (required parameter)
refund | a refund is issued | affiliation, coupon, currency, items, transaction_id, shipping, tax, value
remove_from_cart | a user removes items from a cart | currency, items, value
select_item | an item is selected from a list | items, item_list_name, item_list_id
select_promotion | a user selects a promotion | items, promotion_id, promotion_name, creative_name, creative_slot, location_id
view_cart | a user views their cart | currency, items, value
view_item | a user views an item | currency, items, value
view_item_list | a user sees a list of items/offerings | items, item_list_name, item_list_id
view_promotion | a promotion is shown to a user | items, promotion_id, promotion_name, creative_name, creative_slot, location_id

Retail and ecommerce apps should log the events listed above. Logging events along with their prescribed parameters ensures maximum available detail in reports and improves the overall performance of the collaborative filtering recommender system.

While having more user interactions increases the accuracy of the recommendation system, you might want to start by experimenting with fewer, more basic ones, like product detail page views (for unauthenticated users) and/or orders (for authenticated users). The intuition here is that the recommendation system can recommend relevant items based solely on implicit ratings such as product detail page views and purchase history.
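One way to turn those events into the “ratings” a collaborative filtering model expects is to assign each event type a weight and aggregate per user-item pair. The weights below are assumptions to tune, not prescribed values:

```python
# Assumed weights: stronger purchase intent maps to a higher implicit rating.
EVENT_WEIGHTS = {"view_item": 1.0, "add_to_cart": 3.0, "purchase": 5.0}

def implicit_rating(events):
    """Collapse a user's events on one item into a single implicit rating."""
    return max((EVENT_WEIGHTS.get(e, 0.0) for e in events), default=0.0)
```

Taking the max (rather than the sum) keeps repeated page views from outweighing a single purchase, but summing or decaying by recency are equally valid choices to experiment with.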

 

Exporting orders from Shopify

You can export orders along with their transaction histories or you can export only the transaction histories of your orders:

  1. From the Orders page, click Export.
  2. In the Export orders window:
    • Select the option for the orders that you want to export. For example, if you want to export your orders by date, then click Export orders by date and set the start and end dates for the orders that you want to export.
    • Under Export as, select a file format.
  3. If you want to download all information about your orders, then click Export orders. If you want to download your transaction information only, then click Export transaction histories.

 

Exporting orders from Magento

In your Magento 2 backend, go to Sales > Orders, click the Export dropdown, and select CSV.

 

Exporting orders from WooCommerce

You can follow the same steps to export orders from WooCommerce as you did for exporting customers; the only difference is selecting “orders” as the Export type.

Remember, for your first iteration of a collaborative filtering recommendation system, you might need to clean up the interactions dataset export until it consists only of: UserId, Date, ProductIds.

 

Putting it all together

If you’ve reached this point, you should have three CSV files, each with the following structure:

Items Dataset
ItemId | Title | Description | Price | Other Item Metadata (…)

 

Users Dataset
UserId | Country | City | Other User Metadata (…)

 

Interactions Dataset
ItemId | UserId | Timestamp | EventType | Other Interaction Metadata (…)

Note that it’s entirely possible to ignore the users dataset altogether if you do not have (or do not want to include) any user metadata.

Interactions are the basis for building the User Rating Matrix (URM). Just by using user_id – item_id pairs, you can create the first version of the URM as a baseline model. Other features can then be added one by one, while measuring the performance of the model at each step.
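A baseline URM built from bare user_id – item_id pairs can be sketched like this (the sample pairs are made up):

```python
import pandas as pd

# Hypothetical interaction pairs from the cleaned-up export.
interactions = pd.DataFrame({
    "UserId": ["u1", "u1", "u2", "u3"],
    "ItemId": ["i1", "i2", "i1", "i3"],
})

# Binary URM: 1 if the user interacted with the item, 0 otherwise.
urm = pd.crosstab(interactions["UserId"], interactions["ItemId"]).clip(upper=1)
```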

 

From all of the available datasets, properties that are numeric or can be one-hot encoded are easy to add to the models. Others require extra processing – for example, applying NLP to extract embeddings from text data, or computer vision to process product images.
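For instance, a categorical user property like country can be one-hot encoded with pandas (the sample values are illustrative):

```python
import pandas as pd

users = pd.DataFrame({"UserId": ["u1", "u2"], "Country": ["US", "FR"]})

# Expand the categorical Country column into binary feature columns.
features = pd.get_dummies(users, columns=["Country"])
```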

In the next blog post in this series we’ll focus on preparing our datasets and exploring feature engineering for our collaborative filtering recommender system, which represents a critical step in building a highly performant machine learning model.

Stay tuned! 

And, if you have any questions:  https://twitter.com/cborodescu

 

About the author
Ciprian Borodescu

AI Product Manager | On a mission to help people succeed through the use of AI
