actions
Type: Action[]
Required
Parameter syntax
{
  actions: [
    {
      indexName: 'index_name',
      pathsToMatch: ['url_path', ...],
      fileTypesToMatch: ['file_type', ...],
      autoGenerateObjectIDs: true|false,
      schedule: 'every 1 day',
      recordExtractor: ({ url, $, contentLength, fileType, dataSources }) => {
      }
    },
  ],
}

About this parameter

Determines which web pages are translated into Algolia records and in what way.

A single action defines:

  • The URLs to crawl
  • The extraction process for those pages
  • The indices to which the extracted records are added

A single web page can match multiple actions. In this case, your crawler creates a record for each matched action.
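For example, with the following two actions (index names and URLs are illustrative), a page under https://example.com/blog/ matches both patterns, so the crawler creates two records for it:

{
  actions: [
    {
      indexName: 'all_pages',
      pathsToMatch: ['https://example.com/**'],
      recordExtractor: ({ url, $ }) => [{ objectID: url.href, title: $('head title').text() }],
    },
    {
      indexName: 'blog_posts',
      pathsToMatch: ['https://example.com/blog/**'],
      recordExtractor: ({ url, $ }) => [{ objectID: url.href, headline: $('h1').text() }],
    },
  ],
}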

Examples

{
  actions: [
    {
      indexName: 'dev_blog_algolia',
      pathsToMatch: ['https://blog.algolia.com/**'],
      fileTypesToMatch: ['pdf'],
      autoGenerateObjectIDs: false,
      schedule: 'every 1 day',
      recordExtractor: ({ url, $, contentLength, fileType, dataSources }) => {
        ...
      }
    },
  ],
}

Parameters

Action

name
type: string
Optional

The unique identifier of this action (useful for debugging). Required if schedule is set.
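
For example, a scheduled action must also be named (the name value is illustrative):

{
  name: 'blog-daily-crawl',
  schedule: 'every 1 day',
  // ...
}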

indexName
type: string
Required

The index name targeted by this action. This value is appended to the indexPrefix, when specified.

schedule
type: string
Optional

How often to perform a complete crawl for this action. For more information, see schedule.

pathsToMatch
type: string[]
Required

Determines which webpages to process with this action. This list is checked against the URLs of webpages using the micromatch library. You can use negation, wildcards, and more.
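
For example, the following patterns (URLs illustrative) combine wildcards and negation:

pathsToMatch: [
  'https://example.com/docs/**',          // everything under /docs/
  'https://example.com/**/*.html',        // any HTML file on the site
  '!https://example.com/docs/drafts/**',  // but skip drafts
]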

selectorsToMatch
type: string[]
Optional

Checks for DOM elements matching the given selectors. If the page doesn't contain any element matching the selectors, it's ignored. You can also use negation: to ignore pages with a .main class, include !.main in the list.
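
For example (selectors illustrative):

selectorsToMatch: ['.product-detail', '!.main']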

fileTypesToMatch
type: string[]
default: ["html"]
Optional

Set this value to index non-HTML documents. For a list of available file types, see Non-HTML documents. Documents are first converted to HTML with Apache Tika, then processed as an HTML page.
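
For example, to process PDF documents in addition to HTML pages:

fileTypesToMatch: ['html', 'pdf']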

autoGenerateObjectIDs
type: bool
default: true
Optional

Generate an object ID for records that don’t have one. Set this parameter to false to raise an error if an extracted record doesn’t have an object ID.

recordExtractor
type: function
Required

A custom JavaScript function that lets you run your own code and extract what you want from a page. Your record extractor should return either an array of JSON objects or an empty array. If the function returns an empty array, the page is skipped.

recordExtractor: ({ url, $, contentLength, fileType }) => {
  return [
    {
      url: url.href,
      text: $('p').html(),
      ... /* anything you want */
    }
  ];
  // return [] skips the page
}

action ➔ recordExtractor

$
type: object (Cheerio instance)
Optional

A Cheerio instance with the HTML of the crawled page.

url
type: Location object
Optional

A Location object with the URL and metadata for the crawled page.
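
A sketch of typical usage, relying on standard Location properties:

recordExtractor: ({ url }) => {
  return [
    {
      objectID: url.href,     // full URL of the crawled page
      hostname: url.hostname, // for example, 'blog.algolia.com'
      path: url.pathname,     // everything after the hostname
    },
  ];
}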

fileType
type: string
Optional

The file type of the crawled page or document.
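
For example, you can branch on it in your extractor (a sketch):

recordExtractor: ({ url, $, fileType }) => {
  // Skip everything that isn't a regular HTML page
  if (fileType !== 'html') return [];
  return [{ objectID: url.href, title: $('head title').text() }];
}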

contentLength
type: number
Optional

The number of bytes of the crawled page.

dataSources
type: object
Optional

Object with the external data sources of the current URL. Each key of the object corresponds to an externalData object.

{
  dataSources: {
    dataSourceId1: { data1: 'val1', data2: 'val2' },
    dataSourceId2: { data1: 'val1', data2: 'val2' },
  }
}
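
You can merge this data into your records. A sketch, reusing the dataSourceId1 source from the preceding example:

recordExtractor: ({ url, dataSources }) => {
  return [
    {
      objectID: url.href,
      ...dataSources.dataSourceId1, // adds data1 and data2 to the record
    },
  ];
}
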
helpers
type: object
Optional

Functions that help you extract content and generate records.

recordExtractor ➔ helpers

docsearch
type: function
Optional

A function that extracts content and formats it to be compatible with DocSearch. It creates an optimized number of records for relevancy and hierarchy, and you can use it without DocSearch or to index non-documentation content.

recordExtractor: ({ url, $, helpers }) => {
  return helpers.docsearch({
    aggregateContent: true,
    indexHeadings: true,
    recordVersion: 'v3',
    recordProps: {
      lvl0: {
        selectors: "header h1",
      },
      lvl1: "article h2",
      lvl2: "article h3",
      lvl3: "article h4",
      lvl4: "article h5",
      lvl5: "article h6",
      content: "main p, main li",
    },
  });
}

For more examples, see the DocSearch documentation.

article
type: function
Optional

A function that extracts content from pages identified as articles. Articles are recognized by one of the following:

  • The og:type meta tag with the value article:

    <meta property="og:type" content="article"/>
    
  • One of the following JSON-LD schema types: Article, NewsArticle, Report, or BlogPosting

To use this helper:

recordExtractor: ({ url, $, helpers }) => {
  return helpers.article({ url, $ });
}

This helper returns an object with the following properties:

{
  /**
  * The object's unique identifier,
  * in this case, the article's URL
  */
  objectID: string,

  /**
  * The article's URL (without parameters or hashes)
  */
  url: string,

  /**
  * The language the page content is written in
  * - html[attr=lang]
  */
  lang?: string,

  /**
  * The article's headline (selected from one of the following in order of preference):
  * - meta[property="og:title"]
  * - meta[name="twitter:title"]
  * - head > title
  * - First <h1>
  */
  headline: string,

  /**
  * The article's description (selected from one of the following in order of preference):
  * - meta[name="description"]
  * - meta[property="og:description"]
  * - meta[name="twitter:description"]
  */
  description?: string,

  /**
  * Article keywords
  * - meta[name="keywords"]
  * - The `keywords` field of the JSON-LD Article object: https://schema.org/Article
  */
  keywords: string[],

  /**
  * Article tags
  * - meta[property="article:tag"]
  */
  tags: string[],

  /**
  * The image associated with the article (selected from one of the following in order of preference):
  * - meta[property="og:image"]
  * - meta[name="twitter:image"]
  */
  image?: string,

  /**
  * The article's author
  * - meta[property="article:author"]
  * - The `author` field of the JSON-LD Article object: https://schema.org/Article
  */
  authors?: string[],

  /**
  * Article publication date
  * - meta[property="article:published_time"]
  * - The `datePublished` field of the JSON-LD Article object: https://schema.org/Article
  */
  datePublished?: string,

  /**
  * The date when the article was last modified
  * - meta[property="article:modified_time"]
  * - The `dateModified` field of the JSON-LD Article object: https://schema.org/Article
  */
  dateModified?: string,

  /**
  * The article category
  * - meta[property="article:section"]
  * - The `category` field of the JSON-LD Article object: https://schema.org/Article
  */
  category?: string,

  /**
  * The article's content (body copy)
  */
  content?: string,
}
product
type: function
Optional

A function that extracts content from pages identified as product pages. Product pages are recognized by looking for one of the following JSON-LD schema types: Product, DietarySupplement, Drug, IndividualProduct, ProductCollection, ProductGroup, ProductModel, SomeProducts, or Vehicle.

recordExtractor: ({ url, $, helpers }) => {
  return helpers.product({ url, $ });
}

This helper returns the following object:

{
  /**
  * The object's unique identifier,
  * in this case, the product page's URL
  */
  objectID: string,

  /**
  * The product page URL (without parameters or hashes)
  */
  url: string,

  /**
  * The language the page content is written in
  * - html[attr=lang]
  */
  lang?: string,

  /**
  * The product name
  * - The `name` field of the JSON-LD Product object: https://schema.org/Product
  */
  name?: string,

  /**
  * The product SKU
  * - The `sku` field of the JSON-LD Product object: https://schema.org/Product
  */
  sku: string,

  /**
  * The product's description
  * - The `description` field of the JSON-LD Product object: https://schema.org/Product
  */
  description?: string,

  /**
  * The image associated with the product
  * - The `image` field of the JSON-LD Product object: https://schema.org/Product
  */
  image?: string,

  /**
  * The product's price (selected from one of the following in order of preference):
  * - The `offers.price` field of the JSON-LD Product object: https://schema.org/Product
  * - The `offers.highPrice` field of the JSON-LD Product object: https://schema.org/Product
  * - The `offers.lowPrice` field of the JSON-LD Product object: https://schema.org/Product
  */
  price?: string,

  /**
  * The product's currency
  * - The `offers.priceCurrency` field of the JSON-LD Product object: https://schema.org/Product
  */
  currency?: string,

  /**
  * The product category
  * - The `category` field of the JSON-LD Product object: https://schema.org/Product
  */
  category?: string,
}
page
type: function
Optional

A function that extracts text content from pages regardless of their type or category.

recordExtractor: ({ url, $, helpers }) => {
  return helpers.page({
    url,
    $,
    recordProps: {
      title: 'head title',
      content: 'body',
    },
  });
}

This helper returns the following object:

{
  /**
  * The object's unique identifier
  */
  objectID: string;

  /**
  * The page's URL
  */
  url: string;

  /**
  * The URL's hostname
  * - http://example.com/ = example.com
  */
  hostname: string;

  /**
  * The URL's path, everything after the hostname
  */
  path: string;

  /**
  * The URL depth, based on the number of slashes after the domain
  * - http://example.com/ = 1
  * - http://example.com/about = 1
  * - http://example.com/about/ = 2
  * - etc.
  */
  depth: number;

  /**
  * The page's file type.
  * One of: html, xml, json, pdf, doc, xls, ppt, odt, ods, odp, email
  */
  fileType: FileType;

  /**
  * The page length in bytes
  */
  contentLength: number;

  /**
  * The page title
  * - head > title
  */
  title?: string;

  /**
  * The page's description
  * - meta[name=description]
  */
  description?: string;

  /**
  * The page's keywords
  * - meta[name="keywords"]
  */
  keywords?: string[];

  /**
  * The image associated with the page
  * - meta[property="og:image"]
  */
  image?: string;

  /**
  * The page headers
  * - h1 and h2 tags content
  */
  headers?: string[];

  /**
  * The page's content (body copy)
  */
  content: string;
}
codeSnippets
type: function
Optional

Within a recordExtractor, the helpers.codeSnippets() function extracts code snippets from a page.

  • It looks for <pre> tags and extracts the content.
  • It also extracts the language class prefix from the <pre> tag.

Use this helper to populate one attribute in the record, not to create multiple records.

recordExtractor: ({ url, $, helpers }) => {
  // Both options are optional; these values are the documented defaults
  const code = helpers.codeSnippets({ tag: 'pre', languageClassPrefix: 'language-' });
  return { code };
}

The helper returns an array of code objects with the following properties:

{
  /**
  * The content of the code snippet
  * - pre, code, etc. default: pre
  */
  content: string;

  /**
  * The code snippet's language (if found)
  * - pre[class^=language-] default: language-
  */
  languageClassPrefix?: string;

  /**
  * The URL to the nearest sibling <a> tag
  * - pre + a
  */
  codeUrl?: string;

  /**
  * Text fragment URL with the code snippet
  * This is a selection of text within a page that's linked to from another page.
  * https://developer.mozilla.org/en-US/docs/Web/URI/Fragment/Text_fragments
  */
  fragmentUrl?: string;
}[]
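
For instance, you can store the extracted snippets in a single attribute alongside other page data (a sketch; the surrounding attributes are illustrative):

recordExtractor: ({ url, $, helpers }) => {
  return [
    {
      objectID: url.href,
      title: $('head title').text().trim(),
      code: helpers.codeSnippets({ tag: 'pre', languageClassPrefix: 'language-' }),
    },
  ];
}
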
splitContentIntoRecords
type: function
Optional

A function that extracts text from an HTML page and splits it into one or more records. This can help prevent record_too_big errors.

recordExtractor: ({ url, $, helpers }) => {
  const baseRecord = {
    url,
    title: $('head title').text().trim(),
  };
  const records = helpers.splitContentIntoRecords({
    baseRecord,
    $elements: $('body'),
    maxRecordBytes: 1000,
    textAttributeName: 'text',
    orderingAttributeName: 'part',
  });
  // You can still alter produced records
  // afterwards, if needed.
  return records;
}

In the preceding example, crawling a long HTML page returns an array of records, none of which exceeds 1,000 bytes. The records would look similar to this:

[
  {
    url: 'http://test.com/index.html',
    title: 'Home - Test.com',
    part: 0,
    text: 'Welcome on test.com, the best resource to',
  },
  {
    url: 'http://test.com/index.html',
    title: 'Home - Test.com',
    part: 1,
    text: 'find interesting content online.',
  }
]

To prevent duplicate results when a search matches several records from the same resource (page), set distinct to true in your index settings, set attributeForDistinct and searchableAttributes, and add a custom ranking that orders records from the first part of the page to the last:

initialIndexSettings: {
  'my-index': {
    distinct: true,
    attributeForDistinct: 'url',
    searchableAttributes: ['title', 'text'],
    customRanking: ['asc(part)'],
  }
}

helpers ➔ splitContentIntoRecords

$elements
type: string
default: $("body")
Optional

A Cheerio selector that determines which elements to extract text from and turn into records.

baseRecord
type: object
default: {}
Optional

Attributes (and their values) to add to all resulting records.

maxRecordBytes
type: number
default: 10000
Optional

Maximum number of bytes allowed per record, on the resulting Algolia index. To avoid errors, check the record size limits for your plan.

textAttributeName
type: string
default: text
Optional

Name of the attribute in which to store the text of each record.

orderingAttributeName
type: string
Optional

Name of the attribute in which to store each record's sequential number (such as part in the preceding example).

helpers ➔ docsearch

recordProps
type: object
Required

Main docsearch configuration.

aggregateContent
type: boolean
default: true
Optional

Whether the helper automatically merges sibling elements and separates them by a line break.

For: <p>Foo</p><p>Bar</p>

{
  aggregateContent: false,
  // creates 2 records
}
{
  aggregateContent: true,
  // creates 1 record
}
indexHeadings
type: boolean | object
default: true
Optional

Whether the helper creates records for headings.

If false, only records for the content level are created. If you provide an object with from and to properties, only records for the matching heading levels are created.

{
  indexHeadings: false,
}
{
  indexHeadings: { from: 4, to: 6 },
}
recordVersion
type: string
default: v2
Optional

Change the version of the extracted records. It’s not correlated with the DocSearch version and may be incremented independently.

  • v2: compatible with DocSearch >= @2
  • v3: compatible with DocSearch >= @3

docsearch ➔ recordProps

lvl0
type: object
Required

Select the main category of the page. You should index the title and h1 of the page in lvl1.

{
  lvl0: {
    selectors: '.page-category',
    defaultValue: 'documentation'
  }
}
lvl1
type: string | string[]
Required

Select the main title of the page.

{
  lvl1: 'head > title'
}
content
type: string | string[]
Required

Select the content elements of the page.

{
  content: 'body > p, main li'
}
pageRank
type: string
Optional

Add an attribute pageRank to the extracted records that you can use to boost the relevance of associated records in the index settings. You can pass any numeric value as a string, including negative values.

{
  pageRank: "30"
}
lvl2, lvl3, lvl4, lvl5, lvl6
type: string | string[]
Optional

Select other headings of the page.

{
  lvl2: "main h2",
  lvl3: "footer h3",
  lvl4: ["h4", "div.important"],
}
*
type: string | string[] | object
Optional

All extra keys are added to the extracted records.

{
  myCustomAttribute: '.myCustomClass',
  ogDesc: {
    selectors: 'head meta[name="og:desc"]',
    defaultValue: 'Default description'
  }
}