actions
Type: Array of Objects
Required
Parameter syntax
{
  actions: [
    {
      indexName: 'index_name',
      pathsToMatch: ['url_path', ...],
      fileTypesToMatch: ['file_type', ...],
      autoGenerateObjectIDs: true|false,
      recordExtractor: ({ url, $, contentLength, fileType, dataSources })  => {
      }
    },
  ],
}

About this parameter

Determines which web pages are translated into Algolia records and in what way.

A single action defines:

  1. the subset of your crawler’s websites it targets,
  2. the extraction process for those websites,
  3. and the indices to which the extracted records are pushed.

A single web page can match multiple actions. In this case, your crawler creates a record for each matched action.
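As a sketch of the multi-match case, the configuration below declares two actions with the same pathsToMatch pattern but different target indices, so a page under that path would be extracted once per action. The index names and the extracted attributes are hypothetical and only serve the illustration.

```javascript
// Hypothetical configuration: both actions target the same pages.
const config = {
  actions: [
    {
      // Records keyed on the page title (index name is an assumption).
      indexName: 'blog_titles',
      pathsToMatch: ['https://blog.algolia.com/**'],
      recordExtractor: ({ url, $ }) => [
        { objectID: url.href, title: $('head title').text().trim() },
      ],
    },
    {
      // Records keyed on the main heading of the same pages
      // (index name is an assumption).
      indexName: 'blog_headings',
      pathsToMatch: ['https://blog.algolia.com/**'],
      recordExtractor: ({ url, $ }) => [
        { objectID: url.href, heading: $('h1').text().trim() },
      ],
    },
  ],
};
```

A page such as https://blog.algolia.com/some-post matches both patterns, so each index would receive its own record for that page.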

Examples

{
  actions: [
    {
      indexName: 'dev_blog_algolia',
      pathsToMatch: ['https://blog.algolia.com/**'],
      fileTypesToMatch: ['pdf'],
      autoGenerateObjectIDs: false,
      recordExtractor: ({ url, $, contentLength, fileType, dataSources })  => {
        ...
      }
    },
  ],
}

Parameters

pathsToMatch
type: string[]
Required

Determines which web pages match this action. The crawler checks each page's URL against this list of patterns using micromatch. You can use negation, wildcards, and more.
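A minimal sketch of what such a pattern list could look like. The /tags/ path is hypothetical, and how the crawler combines a negated pattern with the other patterns in the list is an assumption here, so check the result against your own site before relying on it.

```javascript
// Sketch of an action's matching patterns (URLs are illustrative).
const action = {
  indexName: 'dev_blog_algolia',
  pathsToMatch: [
    // Wildcard: every page under the blog.
    'https://blog.algolia.com/**',
    // Negation (assumed to exclude): leave out a hypothetical tag listing.
    '!https://blog.algolia.com/tags/**',
  ],
  recordExtractor: ({ url }) => [{ objectID: url.href }],
};
```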

selectorsToMatch
type: string[]
Optional

Checks for the presence or absence of DOM nodes.
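For illustration, the sketch below restricts an action to pages that contain one node and lack another. It assumes that a plain selector checks for presence and that a leading "!" checks for absence. The class names, URL, and index name are hypothetical.

```javascript
// Sketch only: selectors, URL, and index name are made up.
const action = {
  indexName: 'products',
  pathsToMatch: ['https://example.com/**'],
  selectorsToMatch: [
    '.products',  // the page must contain a node matching .products
    '!.featured', // assumption: "!" requires that no node matches .featured
  ],
  recordExtractor: ({ url }) => [{ objectID: url.href }],
};
```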

fileTypesToMatch
type: string[]
default: html
Optional

Set this value to index documents other than HTML pages. The crawler converts the chosen file types to HTML using Tika, then treats them as normal HTML pages. See the documents guide for a list of available file types.

autoGenerateObjectIDs
type: bool
default: true

Generates an objectID for records that don’t have one. When this parameter is set to false, the crawler raises an error if an extracted record doesn’t have an objectID.
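When automatic generation is turned off, each record has to carry its own objectID. The sketch below uses the page URL for that purpose, which is only one possible choice. The title attribute is illustrative.

```javascript
// Sketch: an action that supplies its own objectIDs.
const action = {
  indexName: 'dev_blog_algolia',
  pathsToMatch: ['https://blog.algolia.com/**'],
  autoGenerateObjectIDs: false,
  recordExtractor: ({ url, $ }) => [
    {
      // With auto-generation off, every record needs an explicit objectID.
      // Using the page URL here is an assumption, not a requirement.
      objectID: url.href,
      title: $('head title').text().trim(),
    },
  ],
};
```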

recordExtractor
type: function
Required

A recordExtractor is a custom JavaScript function that lets you execute your own code and extract what you want from a page. Your record extractor should return either an array of JSON objects or an empty array. If the function returns an empty array, the page is skipped.

recordExtractor: ({ url, $, contentLength, fileType }) => {
  return [
    {
      url: url.href,
      text: $('p').html(),
      // ...anything you want
    }
  ];
  // Return [] instead to skip the page.
}

action ➔ recordExtractor

$
type: object (Cheerio instance)
Optional

A Cheerio instance containing the HTML of the crawled page.

url
type: Location object
Optional

A Location object containing the URL and metadata for the crawled page.
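As a sketch, an extractor can derive record attributes from the parts of this object. The example assumes the usual Location properties (href, hostname, pathname). The section attribute is a hypothetical name for the first path segment.

```javascript
// Sketch: build record attributes from the page URL.
const recordExtractor = ({ url }) => [
  {
    url: url.href,
    hostname: url.hostname,
    // Hypothetical attribute: first path segment, or 'home' for the root page.
    section: url.pathname.split('/').filter(Boolean)[0] || 'home',
  },
];
```

For https://blog.algolia.com/engineering/post-1, this would produce a record whose section is engineering.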

fileType
type: string
Optional

The file type of the crawled page (e.g. html, pdf, …).

contentLength
type: number
Optional

The number of bytes in the crawled page.
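The fileType and contentLength arguments can drive simple decisions inside an extractor. The sketch below skips very small pages and stores the file type on the record. The 200-byte threshold and the attribute names are assumptions chosen for illustration.

```javascript
// Hypothetical threshold: treat very small responses as empty pages.
const MIN_CONTENT_BYTES = 200;

const recordExtractor = ({ url, $, contentLength, fileType }) => {
  if (contentLength < MIN_CONTENT_BYTES) {
    return []; // an empty array skips the page
  }
  return [
    {
      url: url.href,
      fileType, // for example 'html' or 'pdf'
      title: $('head title').text().trim(),
    },
  ];
};
```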

dataSources
type: object
Optional

Array of external data sources.

helpers
type: object
Optional

Collection of functions to help you extract content and generate records.

recordExtractor ➔ helpers

splitContentIntoRecords
type: function
Optional

The helpers.splitContentIntoRecords() function is callable from your recordExtractor. It extracts textual content from the resource (i.e. HTML page or document) and splits it into one or more records. Use it to index the textual content exhaustively while preventing record_too_big errors.

recordExtractor: ({ url, $, helpers }) => {
  const baseRecord = {
    url,
    title: $('head title').text().trim(),
  };
  const records = helpers.splitContentIntoRecords({
    baseRecord,
    $elements: $('body'),
    maxRecordBytes: 1000,
    textAttributeName: 'text',
    orderingAttributeName: 'part',
  });
  // You can still alter produced records
  // afterwards, if needed.
  return records;
}

In the example recordExtractor() function above, crawling a long HTML page returns an array of records, none of which exceeds the limit of 1000 bytes. The records extracted by the splitContentIntoRecords method would look similar to this:

[
  {
    url: 'http://test.com/index.html',
    title: 'Home - Test.com',
    part: 0,
    text: 'Welcome on test.com, the best resource to',
  },
  {
    url: 'http://test.com/index.html',
    title: 'Home - Test.com',
    part: 1,
    text: 'find interesting content online.',
  }
]

Assuming that the automatic generation of objectIDs is enabled in your configuration, the crawler generates an objectID for each of the generated records.

To prevent duplicate results when searching for a word that appears in multiple records belonging to the same resource (page), we recommend that you enable distinct in your index settings, set attributeForDistinct and searchableAttributes, and add a custom ranking that orders records from the first record on the page to the last:

initialIndexSettings: {
  'my-index': {
    distinct: true,
    attributeForDistinct: 'url',
    searchableAttributes: [ 'title', 'text' ],
    customRanking: [ 'asc(part)' ],
  }
}

Be aware that using distinct comes with some caveats.

helpers ➔ splitContentIntoRecords

$elements
type: object (Cheerio instance)
default: $("body")

A Cheerio instance that determines from which element(s) textual content will be extracted and turned into records.

baseRecord
type: object
default: {}

Attributes (and their values) to add to all resulting records.

maxRecordBytes
type: number
default: 10000

Maximum number of bytes allowed per record, on the resulting Algolia index. You can refer to the record size limits for your plan to prevent any errors regarding record size.

textAttributeName
type: string
default: text

Name of the attribute in which to store the text of each record.

orderingAttributeName
type: string
Optional

Name of the attribute in which to store the sequence number of each record, i.e. its position among the records generated from the same resource.
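To make the interplay of these parameters concrete, here is a simplified illustration that works on a plain string. It is not the crawler's implementation of helpers.splitContentIntoRecords(). The function name splitTextIntoRecords, the text parameter, and the way record size is measured (UTF-8 length of the JSON-serialized record) are all assumptions made for this sketch.

```javascript
// Illustration only: NOT how helpers.splitContentIntoRecords() is implemented.
// Assumption: size is the UTF-8 byte length of the JSON-serialized record.
const byteSize = (record) =>
  new TextEncoder().encode(JSON.stringify(record)).length;

function splitTextIntoRecords({
  text,
  baseRecord = {},
  maxRecordBytes = 10000,
  textAttributeName = 'text',
  orderingAttributeName, // optional: no ordering attribute unless named
}) {
  // Build one record: base attributes, the text chunk, and its position.
  const build = (content, part) => {
    const record = { ...baseRecord, [textAttributeName]: content };
    if (orderingAttributeName) record[orderingAttributeName] = part;
    return record;
  };

  const records = [];
  let current = '';
  for (const word of text.split(/\s+/).filter(Boolean)) {
    const candidate = current ? `${current} ${word}` : word;
    if (current && byteSize(build(candidate, records.length)) > maxRecordBytes) {
      records.push(build(current, records.length)); // close the full record
      current = word; // start the next record with this word
    } else {
      current = candidate;
    }
  }
  if (current) records.push(build(current, records.length));
  return records;
}

const parts = splitTextIntoRecords({
  text: 'aaa bbb ccc ddd',
  maxRecordBytes: 30,
  orderingAttributeName: 'part',
});
console.log(parts);
```

In this sketch, a single word that is larger than maxRecordBytes on its own still ends up in an oversized record. The sketch makes no attempt to handle that case.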
