Add CSV data to your extracted records
If the data you want to include in your extracted records isn’t present on the site you’re crawling, you can add that data from a CSV file. For example, you may be crawling an ecommerce site where details such as pricing and inventory levels aren’t on the site.
Prepare a spreadsheet in a crawler-compatible format
Your spreadsheet’s format must be compatible with the crawler.
- It must contain a header row
- One of the header columns must be `url`
- The `url` column should contain all URLs to which you want to add external data.
- The remaining columns are for data that you want to add to the record for each URL. For an example, see this spreadsheet.
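
For illustration, a compatible spreadsheet exported as CSV could look like the following sketch. The `pageviews`, `category`, and `onsale` columns are example data columns (they match the extractor example later on this page); your own columns will differ:

```csv
url,pageviews,category,onsale
https://www.example.com/products/road-bike,1200,bikes,true
https://www.example.com/products/helmet,300,accessories,false
```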
Publish your spreadsheet online
Add your spreadsheet data to Google Sheets and publish it online.
- Ensure the spreadsheet’s sharing settings (Restrict access to the following) allow the crawler to access it.
- Ensure the spreadsheet is published to the web as a CSV file. To do this, change the setting from Web page to Comma-separated values (.csv).
- The crawler uses the latest data in your Google spreadsheet. To prevent this and use only the initially published data, clear the Automatically republish when changes are made checkbox.
After you click Publish, copy the URL. You’ll need this to create an external data source.
The CSV data doesn’t have to be in a Google Sheet, but it must be available online.
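
If you want to confirm that the published CSV is publicly reachable before linking it, you can fetch it from a short script. This is only an optional sketch (Node.js 18 or later, run as an ES module); the URL is a placeholder for the one you copied when publishing:

```js
// check-csv.mjs — confirm the published CSV is publicly reachable (Node.js 18+)
const csvUrl = 'https://example.com/your-published-sheet.csv'; // paste the URL you copied when publishing

const response = await fetch(csvUrl);
if (!response.ok) {
  throw new Error(`Could not fetch the CSV: HTTP ${response.status}`);
}

// The first line should be your header row, including the `url` column
const [headerRow] = (await response.text()).split('\n');
console.log(headerRow);
```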
Link the published spreadsheet in your crawler configuration
To link your spreadsheet, create an external data source and add it to the `recordExtractor` function in your crawler configuration.
Create an external data source
- From the Crawler page, select the External Data tab.
- Click Add External Data.
- As External Data type, select CSV, add your CSV file’s URL, and click Create.
- To test the data source, click Explore Data and then Refresh Data. It should extract the correct number of rows from your spreadsheet.
Add external data to extracted records
- Go to the Crawler page, select your crawler, and click Editor.
- Add the `externalData` parameter to your crawler. For example:

  ```js
  externalData: ['myCSV']
  ```
- Add the `dataSources` parameter to the `recordExtractor` function and reference the columns from your CSV data source:

  ```js
  recordExtractor: ({ url, dataSources }) => [
    {
      objectID: url.href,
      pageviews: dataSources.myCSV.pageviews,
      category: dataSources.myCSV.category,
      // Convert the string to a boolean because CSV doesn't have boolean types
      onsale: dataSources.myCSV.onsale === 'true',
    },
  ]
  ```
The crawler only keeps data from URLs that match your crawler configuration’s `startUrls`, `extraUrls`, and `pathsToMatch` parameters.
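
Putting the pieces together, a full configuration might look roughly like the following sketch. The app ID, API key, index name, and URLs are placeholders, and the overall shape assumes the standard crawler configuration editor; only `externalData` and the `dataSources` usage inside `recordExtractor` come from the steps above:

```js
new Crawler({
  appId: 'YOUR_APP_ID',       // placeholder
  apiKey: 'YOUR_API_KEY',     // placeholder
  startUrls: ['https://www.example.com/'],
  // Name of the external data source created on the External Data tab
  externalData: ['myCSV'],
  actions: [
    {
      indexName: 'example_index',
      // Only URLs matching startUrls, extraUrls, or pathsToMatch keep their CSV data
      pathsToMatch: ['https://www.example.com/**'],
      recordExtractor: ({ url, dataSources }) => [
        {
          objectID: url.href,
          pageviews: dataSources.myCSV.pageviews,
          category: dataSources.myCSV.category,
          // CSV has no boolean type, so convert the string
          onsale: dataSources.myCSV.onsale === 'true',
        },
      ],
    },
  ],
});
```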
Test the data
- In the URL Tester, enter the URL of a page with CSV data attached to it (one of the URLs you added to your spreadsheet).
- Click Run Test.
- Confirm that the extracted records contain the data from your CSV.
If this doesn’t work as expected, check that the URL in your spreadsheet exactly matches the URL of the crawled page.