> ## Documentation Index
> Fetch the complete documentation index at: https://algolia.com/llms.txt
> Use this file to discover all available pages before exploring further.

# actions

> Determines which web pages are translated into Algolia records and in what way.

* **Type**: `Action []`
* **Required**

A single action defines:

* The URLs to crawl
* The extraction process for those websites
* The indices to which the extracted records are added

A single web page can match multiple actions.
In this case, your crawler creates a record for each matched action.

## Examples

```js JavaScript icon=code theme={"system"}
{
  actions: [
    {
      indexName: 'dev_blog_algolia',
      pathsToMatch: ['https://blog.algolia.com/**'],
      fileTypesToMatch: ['pdf'],
      autoGenerateObjectIDs: false,
      schedule: 'every 1 day',
      recordExtractor: ({ url, $, contentLength, fileType, dataSources })  => {
        ...
      }
    },
  ],
}
```

## Parameters

### Action

<ParamField body="indexName" type="string" required>
  Index name targeted by this action.
  This value is appended to the `indexPrefix` if specified.
</ParamField>

<ParamField body="pathsToMatch" type="string[]" required>
  URL patterns for web pages to which this action should apply.
  The patterns are evaluated using the [`micromatch`](https://github.com/micromatch/micromatch) library.
  You can use wildcard characters, negation, and more.
</ParamField>

<ParamField body="recordExtractor" type="function" required>
  A JavaScript function to extract content from a web page and turn it into Algolia records.
  The function should return a JSON array which may be empty.
  An empty array means the page is skipped.

  **Example:**

  ```js JavaScript icon=code theme={"system"}
  recordExtractor: ({ url, $, contentLength, fileType }) => {
    return [
      {
        url: url.href,
        text: $("p").html(),
        // ... anything you want
      },
    ];
    // return []; skips the page
  };
  ```

  <Expandable>
    <ParamField body="recordExtractor.$" type="object">
      A [Cheerio instance](https://cheerio.js.org) with the HTML of the crawled page.
    </ParamField>

    <ParamField body="recordExtractor.contentLength" type="number">
      Number of bytes of the crawled page.
    </ParamField>

    <ParamField body="recordExtractor.dataSources" type="object">
      External data sources for the crawled page.
      Each key corresponds to an [`externalData`](/doc/tools/crawler/apis/configuration/external-data)
      object.

      **Example:**

      ```js JavaScript icon=code theme={"system"}
      {
        dataSources: {
          dataSourceId1: { data1: "val1", data2: "val2" },
          dataSourceId2: { data1: "val1", data2: "val2" },
        },
      }
      ```
    </ParamField>

    <ParamField body="recordExtractor.fileType" type="string">
      File type of the crawled page or document.
    </ParamField>

    <ParamField body="recordExtractor.helpers" type="object">
      Helper functions for extracting content and turning it into Algolia records.

      <Expandable>
        <ParamField body="helpers.article" type="function">
          A function that extracts content from pages identified as *articles*.
          Articles are pages with the `og:type` meta tag: `<meta property="og:type" content="article"/>`
          or with the [JSON-LD schema](https://schema.org/) types: `Article`, `NewsArticle`, `Report`, or `BlogPosting`.

          **Example:**

          ```js JavaScript icon=code theme={"system"}
          recordExtractor: ({ url, $, helpers }) => {
            return helpers.article({ url, $ });
          };
          ```

          The `article` helper returns an object with the following properties:

          ```ts TypeScript icon=code theme={"system"}
          {
            /**
            * The object's unique identifier,
            * in this case, the article's URL
            */
            objectID: string,

            /**
            * The article's URL (without parameters or hashes)
            */
            url: string,

            /**
            * The language the page content is written in
            * - html[attr=lang]
            */
            lang?: string,

            /**
            * The article's headline (selected from one of the following in order of preference):
            * - meta[property="og:title"]
            * - meta[name="twitter:title"]
            * - head > title
            * - First <h1>
            */
            headline : string,

            /**
            * The article's description (selected from one of the following in order of preference):
            * - meta[name="description"]
            * - meta[property="og:description"]
            * - meta[name="twitter:description"]
            */
            description?: string,

            /**
            * Article keywords
            * - meta[name="keywords"]
            * - The `keywords` field of the JSON-LD Article object: https://schema.org/Article
            */
            keywords: string[],

            /**
            * Article tags
            * - meta[property="article:tag"]
            */
            tags: string[],

            /**
            * The image associated with the article (selected from one of the following in order of preference):
            * - meta[property="og:image"]
            * - meta[name="twitter:image"]
            */
            image?: string,

            /**
            * The article's author
            * - meta[property="article:author"]
            * - The `author` field of the JSON-LD Article object: https://schema.org/Article

            */
            authors?: string[],

            /**
            * Article publication date
            * - meta[property="article:published_time"]
            * - The `datePublished` field of the JSON-LD Article object: https://schema.org/Article
            */
            datePublished?: string,

            /**
            * The date when the article was last modified
            * - meta[property="article:modified_time"]
            * - The `dateModified` field of the JSON-LD Article object: https://schema.org/Article
            */
            dateModified?: string,

            /**
            * The article category
            * - meta[property="article:section"]
            * - The `category` field of the JSON-LD Article object: https://schema.org/Article
            */
            category?: string,

            /**
            * The article's content (body copy)
            */
            content?: string,
          }
          ```
        </ParamField>

        <ParamField body="helpers.codeSnippets" type="function">
          A function that extracts code snippets from a page.
          It searches for HTML `pre` elements on the page and also extracts the programming language,
          in which the code snippet is written.

          If the crawler finds several code snippets on a page, this property returns a list of those snippets.

          **Example:**

          ```js JavaScript icon=code theme={"system"}
          recordExtractor: ({ url, $, helpers }) => {
            const code = helpers.codeSnippets({ tag, languageClassPrefix });
            return { code };
          };
          ```

          The `codeSnippets` helper returns an array of code objects with the following properties:

          ```ts TypeScript icon=code theme={"system"}
          {
            /**
            * The content of the code snippet
            * - pre, code, etc. default: pre
            */
            content: string;

            /**
            * The code snippet's language (if found)
            * - pre[class^=language-] default: language-
            */
            languageClassPrefix?: string;

            /**
            * The URL to the nearest sibling <a> tag
            * - pre + a
            */
            codeUrl?: string;
          }[]
          ```
        </ParamField>

        <ParamField body="helpers.docsearch" type="function">
          A function that extracts content and formats it to be compatible with [DocSearch](https://docsearch.algolia.com).
          It creates an optimal number of records for relevancy and hierarchy.
          You can also use it without DocSearch or to index non-documentation content.

          **Example:**

          ```js JavaScript icon=code theme={"system"}
          recordExtractor: ({ url, $, helpers }) => {
            return helpers.docsearch({
              aggregateContent: true,
              indexHeadings: true,
              recordVersion: "v3",
              recordProps: {
                lvl0: {
                  selectors: "header h1",
                },
                lvl1: "article h2",
                lvl2: "article h3",
                lvl3: "article h4",
                lvl4: "article h5",
                lvl5: "article h6",
                content: "main p, main li",
              },
            });
          };
          ```

          For more examples, see [Record extractor](https://docsearch.algolia.com/docs/record-extractor) in the DocSearch documentation.

          <Expandable>
            <ParamField body="docsearch.recordProps" type="object" required>
              Main DocSearch record properties.

              <Expandable>
                <ParamField body="recordProps.lvl0" type="object" required>
                  Selector for the main category or section of the crawled page.

                  **Example:**

                  ```js JavaScript icon=code theme={"system"}
                  {
                    lvl0: {
                      selectors: ".page-category",
                      defaultValue: "documentation"
                    },
                  }
                  ```
                </ParamField>

                <ParamField body="recordProps.lvl1" type="string | string[]" required>
                  Selector for the main title of the crawled page.

                  **Example:**

                  ```js JavaScript icon=code theme={"system"}
                  {
                    // ...
                    lvl1: "head > title",
                  }
                  ```
                </ParamField>

                <ParamField body="recordProps.content" type="string | string[]" required>
                  Selectors for the main content elements of the crawled page.

                  ```js JavaScript icon=code theme={"system"}
                  {
                    // ...
                    content: "body > p, main li",
                  }
                  ```
                </ParamField>

                <ParamField body="recordProps.pageRank" type="string">
                  Attribute for increasing or decreasing the relevance of this record.
                  You can pass any number as a string, including negative numbers.

                  ```js JavaScript icon=code theme={"system"}
                  {
                    // ...
                    pageRank: "30",
                  }
                  ```
                </ParamField>

                <ParamField body="recordProps.lvl2" type="string | string[]">
                  Selector for level 2 headings (`h2`) on the page.
                </ParamField>

                <ParamField body="recordProps.lvl3" type="string | string[]">
                  Selector for level 3 headings (`h3`) on the page.
                </ParamField>

                <ParamField body="recordProps.lvl4" type="string | string[]">
                  Selector for level 4 headings (`h4`) on the page.
                </ParamField>

                <ParamField body="recordProps.lvl5" type="string | string[]">
                  Selector for level 5 headings (`h5`) on the page.
                </ParamField>

                <ParamField body="recordProps.lvl6" type="string | string[]">
                  Selector for level 6 headings (`h6`) on the page.
                </ParamField>

                <ParamField body="recordProps.*" type="string | string[] | object">
                  Extra attributes to add to the extracted records.

                  **Example:**

                  ```js JavaScript icon=code theme={"system"}
                  {
                    myCustomAttribute: ".myCustomClass",
                    ogDesc: {
                      selectors: "head meta[name='og:desc']",
                      defaultValue: "Default description",
                    },
                  }
                  ```
                </ParamField>
              </Expandable>
            </ParamField>

            <ParamField body="docsearch.aggregateContent" type="boolean" default={true}>
              Whether to automatically merge sibling elements, separated by a line break.
              For example, `<p>Foo</p><p>Bar</p>` with `aggregateContent` set to `true`
              creates one record. With `aggregateContent` set to `false`, two records are created.
            </ParamField>

            <ParamField body="docsearch.indexHeadings" type="boolean | object" default={true}>
              Whether to create records for headings.
              If false, only records for the `content` level are created.
              You can provide an object with `from` and `to` properties to limit
              the range of extracted heading levels.

              **Example:**

              ```js JavaScript icon=code theme={"system"}
              {
                recordProps: {
                  indexHeadings: false
                  indexHeadings: { from: 4, to: 6 }
                }
              }
              ```
            </ParamField>

            <ParamField body="docsearch.recordVersion" type="string" default="v2">
              Version of the extracted records. Not correlated with the DocSearch version
              and may increase independently.

              * `v2`. Compatible with DocSearch version 2
              * `v3`. Compatible with DocSearch version 3
            </ParamField>
          </Expandable>
        </ParamField>

        <ParamField body="helpers.page" type="function">
          A function that extracts text content from pages regardless of its type or category.

          **Example:**

          ```js JavaScript icon=code theme={"system"}
          recordExtractor: ({ url, $, helpers }) => {
            return helpers.page({
              url,
              $,
              recordProps: {
                title: "head title",
                content: "body",
              },
            });
          };
          ```

          The `page` helper returns an object with the following properties:

          ```ts TypeScript icon=code theme={"system"}
          {
            /**
            * The object's unique identifier
            */
            objectID: string;

            /**
            * The page's URL
            */
            url: string;

            /**
            * The URL's hostname
            * - http://example.com/ = example.com
            */
            hostname: string;

            /**
            * The URL's path, everything after the hostname
            */
            path: string;

            /**
            * The URL depth, based on the number of slashes after the domain
            * - http://example.com/ = 1
            * - http://example.com/about = 1
            * - http://example.com/about/ = 2
            * - etc.
            */
            depth: number;

            /**
            * The page's file type.
            * One of: html, xml, json, pdf, doc, xls, ppt, odt, ods, odp, email
            */
            fileType: FileType;

            /**
            * The page length in bytes
            */
            contentLength: number;

            /**
            * The page title
            * - head > title
            */
            title?: string;

            /**
            * The page's description
            * - meta[name=description]
            */
            description?: string;

            /**
            * The page's keywords
            * - meta[name="keywords"]
            */
            keywords?: string[];

            /**
            * The image associated with the page
            * - meta[property="og:image"]
            */
            image?: string;

            /**
            * The page headers
            * - h1 and h2 tags content
            */
            headers?: string[];

            /**
            * The page's content (body copy)
            */
            content: string;
          }
          ```
        </ParamField>

        <ParamField body="helpers.product" type="function">
          A function that extracts content from pages with the following JSON-LD schema types:
          `Product`, `DietarySupplement`, `Drug`, `IndividualProduct`,
          `ProductCollection`, `ProductGroup`, `ProductModel`, `SomeProducts`, or `Vehicle`.

          **Example:**

          ```js JavaScript icon=code theme={"system"}
          recordExtractor: ({ url, $, helpers }) => {
            return helpers.product({ url, $ });
          };
          ```

          The `product` helper returns an object with the following properties:

          ```ts TypeScript icon=code theme={"system"}
          {
            /**
            * The object's unique identifier,
            * in this case, it will be the product page's URL
            */
            objectID: string,

            /**
            * The product page URL (without parameters or hashes)
            */
            url: string,

            /**
            * The product name
            * - html[attr=lang]
            */
            lang?: string,

            /**
            * The language the page content is written in
            * - `name` field of JSON-LD Product object: https://schema.org/Product
            */
            name?: string,

            /**
            * The product SKU
            * - The `sku` field of the JSON-LD Product object: https://schema.org/Product
            */
            sku : string,

            /**
            * The product's description
            * - The `description` field of the JSON-LD Product object: https://schema.org/Product
            */
            description?: string,

            /**
            * The image associated with the product
            * - The `image` field of the JSON-LD Product object: https://schema.org/Product
            */
            image?: string,

            /**
            * The product's price (selected from one of the following in order of preference):
            * - The `offers.price` field of the JSON-LD Product object: https://schema.org/Product
            * - The `offers.highPrice` field of the JSON-LD Product object: https://schema.org/Product
            * - The `offers.lowPrice` field of the JSON-LD Product object: https://schema.org/Product
            */
            price?: string,

            /**
            * The product's currency
            * - The `offers.priceCurrency` field of the JSON-LD Product object: https://schema.org/Product
            */
            price?: string,

            /**
            * The product category
            * - The `category` field of the JSON-LD Product object: https://schema.org/Product
            */
            category?: string,
          }
          ```
        </ParamField>

        <ParamField body="helpers.splitContentIntoRecords" type="function">
          A function that extracts text from a HTML page and splits it into one or more records.
          This reduces the size of individual records.

          **Example:**

          ```js JavaScript icon=code theme={"system"}
          recordExtractor: ({ url, $, helpers }) => {
            const baseRecord = {
              url,
              title: $("head title").text().trim(),
            };
            const records = helpers.splitContentIntoRecords({
              baseRecord,
              $elements: $("body"),
              maxRecordBytes: 1000,
              textAttributeName: "text",
              orderingAttributeName: "part",
            });
            // You can still alter produced records
            // afterwards, if needed.
            return records;
          };
          ```

          In the preceding example, crawling a long HTML page returns an array of records that don't exceed the limit of 1,000 bytes per record.
          Each extracted record would look similar to:

          ```js JavaScript icon=code theme={"system"}
          [
            {
              url: "http://test.com/index.html",
              title: "Home - Test.com",
              part: 0,
              text: "Welcome on test.com, the best resource to",
            },
            {
              url: "http://test.com/index.html",
              title: "Home - Test.com",
              part: 1,
              text: "find interesting content online.",
            },
          ];
          ```

          To prevent duplicate results when searching for a word that appears in multiple records from the same URL,
          set `distinct` to true in your index settings,
          set `url` as `attributeForDistinct`, and add a custom ranking from first record on your page to the last:

          ```js JavaScript icon=code theme={"system"}
          initialIndexSettings: {
            "my-index": {
              distinct: true,
              attributeForDistinct: "url",
              searchableAttributes: ["title", "text"],
              customRanking: ["asc(part)"],
            },
          }
          ```

          <Expandable>
            <ParamField body="splitContentIntoRecords.$elements" type="string" default="$('body')">
              A [Cheerio selector](https://cheerio.js.org) for the HTML element from which to extract the content.
            </ParamField>

            <ParamField body="splitContentIntoRecords.baseRecord" type="object">
              Attributes (and their values) to add to all extracted records.
            </ParamField>

            <ParamField body="splitContentIntoRecords.maxRecordBytes" type="number" default={1000}>
              Maximum number of bytes per record.
              To avoid errors, see [Record size limits](https://support.algolia.com/hc/en-us/articles/4406981897617-Is-there-a-size-limit-for-my-index-records).
            </ParamField>

            <ParamField body="splitContentIntoRecords.textAttributeName" type="string" default="text">
              Attribute name for storing the text of each extracted record.
            </ParamField>

            <ParamField body="splitContentIntoRecords.orderingAttributeName" type="string">
              Attribute name for storing the sequential number of each extracted record.
            </ParamField>
          </Expandable>
        </ParamField>
      </Expandable>
    </ParamField>

    <ParamField body="recordExtractor.url" type="object">
      A [Location](https://developer.mozilla.org/en-US/docs/Web/API/Location) object
      with the URL and metadata of the crawled page.
    </ParamField>
  </Expandable>
</ParamField>

<ParamField body="name" type="string">
  Unique identifier of this action.
  Required if `schedule` is set.
</ParamField>

<ParamField body="schedule" type="string">
  How often to perform a crawl.
  For more information, see [`schedule`](/doc/tools/crawler/apis/configuration/schedule).
</ParamField>

<ParamField body="selectorsToMatch" type="string[]">
  CSS selectors to identify content to index.
  The crawler includes only pages that contain at least one element matching these selectors.
  You can use wildcards and negation, for example, `!.main` ignores pages with an element with the `main` class.
</ParamField>

<ParamField body="fileTypesToMatch" type="string[]" default="[&#x22;html&#x22;]">
  Type of files to index.
  For a list of supported file types,
  see [Non-HTML documents](/doc/tools/crawler/extracting-data/non-html-documents).
  Non-HTML file types are first converted to HTML with Apache Tika,
  then processed as an HTML page.
</ParamField>

<ParamField body="autoGenerateObjectIDs" type="bool" default={true}>
  Whether to generate object IDs for records that don't already have an `objectID` field.
  If false, extracted records without object IDs throw an error.
</ParamField>
