actions
Type: Action[]
Required
Parameter syntax
{
  actions: [
    {
      indexName: 'index_name',
      pathsToMatch: ['url_path', ...],
      fileTypesToMatch: ['file_type', ...],
      autoGenerateObjectIDs: true|false,
      schedule: 'every 1 day',
      recordExtractor: ({ url, $, contentLength, fileType, dataSources }) => {
      }
    },
  ],
}

About this parameter

Determines which web pages are translated into Algolia records and in what way.

A single action defines:

  • The URLs to crawl
  • The extraction process for those pages
  • The indices to which the extracted records are added

A single web page can match multiple actions. In this case, your crawler creates a record for each matched action.
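For example, with the following two actions (index names and URLs are illustrative), a page under https://example.com/blog/ matches both patterns, so the crawler creates two records for it:

{
  actions: [
    {
      indexName: 'all_pages',
      pathsToMatch: ['https://example.com/**'],
      recordExtractor: ({ url, $ }) => [{ objectID: url.href, title: $('head title').text() }],
    },
    {
      indexName: 'blog_posts',
      pathsToMatch: ['https://example.com/blog/**'],
      recordExtractor: ({ url, $ }) => [{ objectID: url.href, headline: $('h1').text() }],
    },
  ],
}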

Examples

{
  actions: [
    {
      indexName: 'dev_blog_algolia',
      pathsToMatch: ['https://blog.algolia.com/**'],
      fileTypesToMatch: ['pdf'],
      autoGenerateObjectIDs: false,
      schedule: 'every 1 day',
      recordExtractor: ({ url, $, contentLength, fileType, dataSources }) => {
        ...
      }
    },
  ],
}

Parameters

Action

name
type: string
Optional

The unique identifier of this action (useful for debugging). Required if schedule is set.
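
For example, a scheduled action must also be named (the name value is illustrative):

{
  name: 'blog-daily-crawl',
  schedule: 'every 1 day',
  // ...
}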

indexName
type: string
Required

The index name targeted by this action. This value is appended to the indexPrefix, when specified.

schedule
type: string
Optional

How often to perform a complete crawl for this action. For more information, see schedule.

pathsToMatch
type: string[]
Required

Determines which webpages to process with this action. This list is checked against the URLs of webpages using the micromatch library. You can use negation, wildcards, and more.
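
For example, the following patterns (URLs illustrative) combine wildcards and negation:

pathsToMatch: [
  'https://example.com/docs/**',          // everything under /docs/
  'https://example.com/**/*.html',        // any HTML file on the site
  '!https://example.com/docs/drafts/**',  // but skip drafts
]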

selectorsToMatch
type: string[]
Optional

Checks for DOM elements matching the given selectors. If the page doesn't contain any element matching the selectors, it's ignored. You can also use negation: to ignore pages with a .main class, include !.main in the list.
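
For example (selectors illustrative):

selectorsToMatch: ['.product-detail', '!.main']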

fileTypesToMatch
type: string[]
default: ["html"]
Optional

Set this value to index non-HTML documents. For a list of available file types, see Non-HTML documents. Documents are first converted to HTML with Apache Tika, then processed as an HTML page.
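
For example, to process PDF documents in addition to HTML pages:

fileTypesToMatch: ['html', 'pdf']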

autoGenerateObjectIDs
type: bool
default: true
Optional

Generate an object ID for records that don’t have one. Set this parameter to false to raise an error if an extracted record doesn’t have an object ID.

recordExtractor
type: function
Required

A custom JavaScript function that lets you run your own code and extract what you want from a page. Your record extractor should return either an array of JSON objects or an empty array. If the function returns an empty array, the page is skipped.

recordExtractor: ({ url, $, contentLength, fileType }) => {
  return [
    {
      url: url.href,
      text: $('p').html(),
      ... /* anything you want */
    }
  ];
  // return [] skips the page
}

action ➔ recordExtractor

$
type: object (Cheerio instance)
Optional

A Cheerio instance with the HTML of the crawled page.

url
type: Location object
Optional

A Location object with the URL and metadata for the crawled page.
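
A sketch of typical usage, relying on standard Location properties:

recordExtractor: ({ url }) => {
  return [
    {
      objectID: url.href,     // full URL of the crawled page
      hostname: url.hostname, // for example, 'blog.algolia.com'
      path: url.pathname,     // everything after the hostname
    },
  ];
}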

fileType
type: string
Optional

The file type of the crawled page or document.
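
For example, you can branch on it in your extractor (a sketch):

recordExtractor: ({ url, $, fileType }) => {
  // Skip everything that isn't a regular HTML page
  if (fileType !== 'html') return [];
  return [{ objectID: url.href, title: $('head title').text() }];
}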

contentLength
type: number
Optional

The number of bytes of the crawled page.

dataSources
type: object
Optional

Object with the external data sources of the current URL. Each key of the object corresponds to an externalData object.

{
  dataSources: {
    dataSourceId1: { data1: 'val1', data2: 'val2' },
    dataSourceId2: { data1: 'val1', data2: 'val2' },
  }
}
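
You can merge this data into your records. A sketch, reusing the dataSourceId1 source from the preceding example:

recordExtractor: ({ url, dataSources }) => {
  return [
    {
      objectID: url.href,
      ...dataSources.dataSourceId1, // adds data1 and data2 to the record
    },
  ];
}
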
helpers
type: object
Optional

Functions that help you extract content and generate records.

recordExtractor ➔ helpers

docsearch
type: function
Optional

A function that extracts content and formats it to be compatible with DocSearch. It creates an optimized number of records for relevancy and hierarchy, and you can use it without DocSearch or to index non-documentation content.

recordExtractor: ({ url, $, helpers }) => {
  return helpers.docsearch({
    aggregateContent: true,
    indexHeadings: true,
    recordVersion: 'v3',
    recordProps: {
      lvl0: {
        selectors: "header h1",
      },
      lvl1: "article h2",
      lvl2: "article h3",
      lvl3: "article h4",
      lvl4: "article h5",
      lvl5: "article h6",
      content: "main p, main li",
    },
  });
}

For more examples, see the DocSearch documentation.

article
type: function
Optional

A function that extracts content from pages identified as articles. Articles are recognized by one of the following:

  • The og:type meta tag with the value article:

    <meta property="og:type" content="article"/>
    
  • One of the following JSON-LD schema types: Article, NewsArticle, Report, or BlogPosting

To use this helper:

recordExtractor: ({ url, $, helpers }) => {
  return helpers.article({ url, $ });
}

This helper returns an object with the following properties:

{
  /**
  * The object's unique identifier,
  * in this case, the article's URL
  */
  objectID: string,

  /**
  * The article's URL (without parameters or hashes)
  */
  url: string,

  /**
  * The language the page content is written in
  * - html[attr=lang]
  */
  lang?: string,

  /**
  * The article's headline (selected from one of the following in order of preference):
  * - meta[property="og:title"]
  * - meta[name="twitter:title"]
  * - head > title
  * - First <h1>
  */
  headline: string,

  /**
  * The article's description (selected from one of the following in order of preference):
  * - meta[name="description"]
  * - meta[property="og:description"]
  * - meta[name="twitter:description"]
  */
  description?: string,

  /**
  * Article keywords
  * - meta[name="keywords"]
  * - The `keywords` field of the JSON-LD Article object: https://schema.org/Article
  */
  keywords: string[],

  /**
  * Article tags
  * - meta[property="article:tag"]
  */
  tags: string[],

  /**
  * The image associated with the article (selected from one of the following in order of preference):
  * - meta[property="og:image"]
  * - meta[name="twitter:image"]
  */
  image?: string,

  /**
  * The article's author
  * - meta[property="article:author"]
  * - The `author` field of the JSON-LD Article object: https://schema.org/Article
  */
  authors?: string[],

  /**
  * Article publication date
  * - meta[property="article:published_time"]
  * - The `datePublished` field of the JSON-LD Article object: https://schema.org/Article
  */
  datePublished?: string,

  /**
  * The date when the article was last modified
  * - meta[property="article:modified_time"]
  * - The `dateModified` field of the JSON-LD Article object: https://schema.org/Article
  */
  dateModified?: string,

  /**
  * The article category
  * - meta[property="article:section"]
  * - The `category` field of the JSON-LD Article object: https://schema.org/Article
  */
  category?: string,

  /**
  * The article's content (body copy)
  */
  content?: string,
}
product
type: function
Optional

A function that extracts content from pages identified as product pages. Product pages are recognized by looking for one of the following JSON-LD schema types: Product, DietarySupplement, Drug, IndividualProduct, ProductCollection, ProductGroup, ProductModel, SomeProducts, or Vehicle.

recordExtractor: ({ url, $, helpers }) => {
  return helpers.product({ url, $ });
}

This helper returns the following object:

{
  /**
  * The object's unique identifier,
  * in this case, the product page's URL
  */
  objectID: string,

  /**
  * The product page URL (without parameters or hashes)
  */
  url: string,

  /**
  * The language the page content is written in
  * - html[attr=lang]
  */
  lang?: string,

  /**
  * The product name
  * - The `name` field of the JSON-LD Product object: https://schema.org/Product
  */
  name?: string,

  /**
  * The product SKU
  * - The `sku` field of the JSON-LD Product object: https://schema.org/Product
  */
  sku: string,

  /**
  * The product's description
  * - The `description` field of the JSON-LD Product object: https://schema.org/Product
  */
  description?: string,

  /**
  * The image associated with the product
  * - The `image` field of the JSON-LD Product object: https://schema.org/Product
  */
  image?: string,

  /**
  * The product's price (selected from one of the following in order of preference):
  * - The `offers.price` field of the JSON-LD Product object: https://schema.org/Product
  * - The `offers.highPrice` field of the JSON-LD Product object: https://schema.org/Product
  * - The `offers.lowPrice` field of the JSON-LD Product object: https://schema.org/Product
  */
  price?: string,

  /**
  * The product's currency
  * - The `offers.priceCurrency` field of the JSON-LD Product object: https://schema.org/Product
  */
  currency?: string,

  /**
  * The product category
  * - The `category` field of the JSON-LD Product object: https://schema.org/Product
  */
  category?: string,
}
page
type: function
Optional

A function that extracts text content from pages regardless of their type or category.

recordExtractor: ({ url, $, helpers }) => {
  return helpers.page({
    url,
    $,
    recordProps: {
      title: 'head title',
      content: 'body',
    },
  });
}

This helper returns the following object:

{
  /**
  * The object's unique identifier
  */
  objectID: string;

  /**
  * The page's URL
  */
  url: string;

  /**
  * The URL's hostname
  * - http://example.com/ = example.com
  */
  hostname: string;

  /**
  * The URL's path, everything after the hostname
  */
  path: string;

  /**
  * The URL depth, based on the number of slashes after the domain
  * - http://example.com/ = 1
  * - http://example.com/about = 1
  * - http://example.com/about/ = 2
  * - etc.
  */
  depth: number;

  /**
  * The page's file type.
  * One of: html, xml, json, pdf, doc, xls, ppt, odt, ods, odp, email
  */
  fileType: FileType;

  /**
  * The page length in bytes
  */
  contentLength: number;

  /**
  * The page title
  * - head > title
  */
  title?: string;

  /**
  * The page's description
  * - meta[name=description]
  */
  description?: string;

  /**
  * The page's keywords
  * - meta[name="keywords"]
  */
  keywords?: string[];

  /**
  * The image associated with the page
  * - meta[property="og:image"]
  */
  image?: string;

  /**
  * The page headers
  * - h1 and h2 tags content
  */
  headers?: string[];

  /**
  * The page's content (body copy)
  */
  content: string;
}
codeSnippets
type: function
Optional

Within a recordExtractor, the helpers.codeSnippets() function extracts code snippets from a page.

  • It looks for <pre> tags and extracts the content.
  • It also extracts the language class prefix from the <pre> tag.

Use this helper to populate one attribute in the record, not to create multiple records.

recordExtractor: ({ url, $, helpers }) => {
  // Both options are optional; these values are the documented defaults
  const code = helpers.codeSnippets({ tag: 'pre', languageClassPrefix: 'language-' });
  return { code };
}

The helper returns an array of code objects with the following properties:

{
  /**
  * The content of the code snippet
  * - pre, code, etc. default: pre
  */
  content: string;

  /**
  * The code snippet's language (if found)
  * - pre[class^=language-] default: language-
  */
  languageClassPrefix?: string;

  /**
  * The URL to the nearest sibling <a> tag
  * - pre + a
  */
  codeUrl?: string;

  /**
  * Text fragment URL with the code snippet
  * This is a selection of text within a page that's linked to from another page.
  * https://developer.mozilla.org/en-US/docs/Web/URI/Fragment/Text_fragments
  */
  fragmentUrl?: string;
}[]
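
For instance, you can store the extracted snippets in a single attribute alongside other page data (a sketch; the surrounding attributes are illustrative):

recordExtractor: ({ url, $, helpers }) => {
  return [
    {
      objectID: url.href,
      title: $('head title').text().trim(),
      code: helpers.codeSnippets({ tag: 'pre', languageClassPrefix: 'language-' }),
    },
  ];
}
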
splitContentIntoRecords
type: function
Optional

A function that extracts text from an HTML page and splits it into one or more records. This can help prevent record_too_big errors.

recordExtractor: ({ url, $, helpers }) => {
  const baseRecord = {
    url,
    title: $('head title').text().trim(),
  };
  const records = helpers.splitContentIntoRecords({
    baseRecord,
    $elements: $('body'),
    maxRecordBytes: 1000,
    textAttributeName: 'text',
    orderingAttributeName: 'part',
  });
  // You can still alter produced records
  // afterwards, if needed.
  return records;
}

In the preceding example, crawling a long HTML page returns an array of records, none of which exceeds 1,000 bytes. The records would look similar to this:

[
  {
    url: 'http://test.com/index.html',
    title: 'Home - Test.com',
    part: 0,
    text: 'Welcome on test.com, the best resource to',
  },
  {
    url: 'http://test.com/index.html',
    title: 'Home - Test.com',
    part: 1,
    text: 'find interesting content online.',
  }
]

To prevent duplicate results when a search matches several records from the same resource (page), set distinct to true in your index settings, set attributeForDistinct and searchableAttributes, and add a custom ranking that orders records from the first part of the page to the last:

initialIndexSettings: {
  'my-index': {
    distinct: true,
    attributeForDistinct: 'url',
    searchableAttributes: ['title', 'text'],
    customRanking: ['asc(part)'],
  }
}

helpers ➔ splitContentIntoRecords

$elements
type: string
default: $("body")
Optional

A Cheerio selector that determines which elements to extract text from and turn into records.

baseRecord
type: object
default: {}
Optional

Attributes (and their values) to add to all resulting records.

maxRecordBytes
type: number
default: 10000
Optional

Maximum number of bytes allowed per record, on the resulting Algolia index. To avoid errors, check the record size limits for your plan.

textAttributeName
type: string
default: text
Optional

Name of the attribute in which to store the text of each record.

orderingAttributeName
type: string
Optional

Name of the attribute in which to store each record's sequential number (such as part in the preceding example).

helpers ➔ docsearch

recordProps
type: object
Required

Main docsearch configuration.

aggregateContent
type: boolean
default: true
Optional

Whether the helper automatically merges sibling elements and separates them by a line break.

For: <p>Foo</p><p>Bar</p>

{
  aggregateContent: false,
  // creates 2 records
}
{
  aggregateContent: true,
  // creates 1 record
}
indexHeadings
type: boolean | object
default: true
Optional

Whether the helper creates records for headings.

If false, only records for the content level are created. If you provide an object with from and to properties, only records for the matching heading levels are created.

{
  indexHeadings: false,
}
{
  indexHeadings: { from: 4, to: 6 },
}
recordVersion
type: string
default: v2
Optional

Change the version of the extracted records. It’s not correlated with the DocSearch version and may be incremented independently.

  • v2: compatible with DocSearch >= @2
  • v3: compatible with DocSearch >= @3

docsearch ➔ recordProps

lvl0
type: object
Required

Select the main category of the page. You should index the title and h1 of the page in lvl1.

{
  lvl0: {
    selectors: '.page-category',
    defaultValue: 'documentation'
  }
}
lvl1
type: string | string[]
Required

Select the main title of the page.

{
  lvl1: 'head > title'
}
content
type: string | string[]
Required

Select the content elements of the page.

{
  content: 'body > p, main li'
}
pageRank
type: string
Optional

Add an attribute pageRank to the extracted records that you can use to boost the relevance of associated records in the index settings. You can pass any numeric value as a string, including negative values.

{
  pageRank: "30"
}
lvl2, lvl3, lvl4, lvl5, lvl6
type: string | string[]
Optional

Select other headings of the page.

{
  lvl2: "main h2",
  lvl3: "footer h3",
  lvl4: ["h4", "div.important"],
}
*
type: string | string[] | object
Optional

All extra keys are added to the extracted records.

{
  myCustomAttribute: '.myCustomClass',
  ogDesc: {
    selectors: 'head meta[name="og:desc"]',
    defaultValue: 'Default description'
  }
}