
Add CSV data to your extracted records

If the data you want to include in your extracted records isn’t present on the site you’re crawling, you can add it from a CSV file. For example, you may be crawling an ecommerce site where details such as cost prices and inventory levels aren’t published on the pages themselves.

Prepare a spreadsheet in a crawler-compatible format

Your spreadsheet’s format must be compatible with the crawler.

  • It must contain a header row.
  • One of the header columns must be named url.
  • The url column should contain all URLs to which you want to add external data.
  • The remaining columns are for the data you want to add to the record for each URL. For an example, see this spreadsheet or the sample below.
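
For instance, a minimal compatible file might look like this. The URLs and the pageviews, category, and onsale columns are illustrative; they match the extractor example later on this page.

     url,pageviews,category,onsale
     https://example.com/products/shirt,1200,apparel,true
     https://example.com/products/mug,340,kitchen,false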

Publish your spreadsheet online

Add your spreadsheet data to Google Sheets and publish it online.

  • Ensure the spreadsheet’s sharing settings (Restrict access to the following) allow the crawler to access it.
  • Ensure the spreadsheet is published to the web as a CSV file. To do this, change the setting from Web page to Comma-separated values (.csv).
  • By default, the crawler uses the latest data in your Google spreadsheet. To use only the initially published data instead, clear the Automatically republish when changes are made checkbox.

Google Sheets online publishing options

After you click Publish, copy the URL. You’ll need this to create an external data source.
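
For a Google Sheet published as CSV, the copied URL typically looks something like this (the document ID is a placeholder):

     https://docs.google.com/spreadsheets/d/e/<DOCUMENT_ID>/pub?output=csv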

The CSV data doesn’t have to be in a Google Sheet, but it must be available online.

To link your spreadsheet, create an external data source and add it to the recordExtractor function in your crawler configuration.

Create an external data source

  1. From the Crawler page, select the External Data tab.
  2. Click Add External Data.
  3. For the External Data type, select CSV, add your CSV file’s URL, and click Create.

    CSV external data source options

  4. To test the data source, click Explore Data and then Refresh Data. It should extract the correct number of rows from your spreadsheet.

Add external data to extracted records

  1. Go to the Crawler page, select your crawler, and click Editor.
  2. Add the externalData parameter to your crawler. For example:

     externalData: ['myCSV']
    
  3. Add the dataSources parameter to the recordExtractor function and reference the columns from your CSV data source.

     recordExtractor: ({ url, dataSources }) => [
       {
         objectID: url.href,
         pageviews: dataSources.myCSV.pageviews,
         category: dataSources.myCSV.category,
         // Convert the string to a boolean because CSV doesn't have boolean types
         onsale: dataSources.myCSV.onsale === 'true',
       },
     ]
    

The crawler only keeps data from URLs that match your crawler configuration’s startUrls, extraUrls, and pathsToMatch parameters.
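
Putting it together, the relevant parts of a crawler configuration might look like this. It’s a condensed sketch, not a complete configuration: the index name, URLs, and column names are placeholders.

     {
       // Data sources are declared on the External Data tab and referenced by name
       externalData: ['myCSV'],
       startUrls: ['https://example.com/'],
       actions: [
         {
           indexName: 'products', // placeholder index name
           pathsToMatch: ['https://example.com/products/**'],
           recordExtractor: ({ url, dataSources }) => [
             {
               objectID: url.href,
               // CSV values are strings, so convert numeric columns explicitly
               pageviews: Number(dataSources.myCSV.pageviews),
               category: dataSources.myCSV.category,
             },
           ],
         },
       ],
     }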

Test the data

  1. In the URL Tester, enter the URL of a page with CSV data attached to it (one of those you added to your spreadsheet).
  2. Click Run Test.
  3. Confirm that the extracted records contain the data from your CSV.

If this doesn’t work as expected, check that the URLs in your spreadsheet exactly match the site URLs, including the protocol and any trailing slash.
