# Extract data with Cheerio

> Learn the Cheerio syntax to extract data in the Algolia Crawler, and discover ready-to-use selectors and extractors.

export const Records = () => <Tooltip tip="A record is a searchable object in an Algolia index. Each record consists of named attributes." cta="Algolia records" href="/doc/guides/sending-and-managing-data/prepare-your-data#algolia-records">
    records
  </Tooltip>;

export const Index = () => <Tooltip tip="An Algolia index is a searchable dataset that consists of records and configuration settings. These settings define how the records are searched and ranked.">
    index
  </Tooltip>;

When creating a [`recordExtractor`](/doc/tools/crawler/apis/configuration/actions#param-record-extractor),
the most important parameter is the [Cheerio](https://cheerio.js.org) instance (`$`).
Cheerio is a server-side implementation of [jQuery](https://jquery.com/).
The Crawler uses it to expose the page's DOM so you can extract the content you want using Cheerio's
[Selectors API](https://cheerio.js.org/#selectors).

While Cheerio provides extensive documentation,
you may need to experiment with its syntax to successfully crawl your pages.

This guide outlines common techniques for building <Records /> from your site's content.

## Common extraction techniques

The following [helpers](/doc/tools/crawler/getting-started/concepts/#helpers) may be useful for extracting content from *your* pages.

### Extract content from metadata elements

To get content from [`meta` elements](/doc/tools/crawler/extracting-data/finding-and-extracting-data#meta-tags),
parse their `content` attribute.

```js JavaScript icon=code theme={"system"}
// Get `title` from <meta content="Page title" property="og:title">
const title = $('meta[property="og:title"]').attr("content");

// Get `description` from <meta content="Page description" name="description">
const description = $("meta[name=description]").attr("content");
```
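
Some pages may lack a given `meta` element, in which case `.attr("content")` returns `undefined`. The following is a minimal sketch of one way to fall back to other sources. The `firstNonEmpty` helper is a hypothetical name used for illustration. It isn't part of the Crawler or Cheerio.

```js JavaScript icon=code theme={"system"}
// Hypothetical helper: return the first candidate that is a non-empty string
// after trimming, or `undefined` if none qualifies.
function firstNonEmpty(...candidates) {
  for (const candidate of candidates) {
    if (typeof candidate === "string" && candidate.trim() !== "") {
      return candidate.trim();
    }
  }
  return undefined;
}

// Inside a `recordExtractor`, usage might look like this:
// const title = firstNonEmpty(
//   $('meta[property="og:title"]').attr("content"),
//   $("title").text(),
//   $("h1").first().text(),
// );
```

The order of the candidates decides which source wins, so list the most reliable one first.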

### Extract data from JSON-LD

To get content from [supported JSON-LD attributes](/doc/tools/crawler/extracting-data/finding-and-extracting-data#helpers):

```js JavaScript icon=code theme={"system"}
let jsonld;
const node = $('script[type="application/ld+json"]').get(0);

try {
  jsonld = JSON.parse(node.firstChild.data);
} catch (err) {
  // For debugging purposes, show the node in the developer console
  console.log(node);
}
```
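
JSON-LD content isn't always a single object: it can also be an array, or an object that groups its entities under the `@graph` keyword. The following sketch, under those assumptions, normalizes the parsed result into a flat array. The `parseJsonLd` helper is a hypothetical name, not a Crawler API.

```js JavaScript icon=code theme={"system"}
// Hypothetical helper: parse the text of one JSON-LD <script> element and
// return a flat array of entities. Returns [] if the text isn't valid JSON.
function parseJsonLd(rawText) {
  let parsed;
  try {
    parsed = JSON.parse(rawText);
  } catch (err) {
    return [];
  }

  // The top level can be a single object or an array of objects.
  const items = Array.isArray(parsed) ? parsed : [parsed];

  return items
    .filter((item) => item !== null && typeof item === "object")
    // Unwrap entities grouped under "@graph"
    .flatMap((item) => (Array.isArray(item["@graph"]) ? item["@graph"] : [item]));
}

// Inside a `recordExtractor`, usage might look like this:
// const node = $('script[type="application/ld+json"]').get(0);
// const entities = node && node.firstChild ? parseJsonLd(node.firstChild.data) : [];
// const product = entities.find((entity) => entity["@type"] === "Product");
```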

### Get text from several CSS selectors

To get content from several [CSS selectors](/doc/tools/crawler/extracting-data/finding-and-extracting-data#meta-tags),
query them all and retrieve an array of content.

```js JavaScript icon=code theme={"system"}
const allHeadings = $("h1, h2")
  .map((i, e) => $(e).text())
  .get(); // ["First <h1>", "First <h2>", "Second <h2>", "Second <h1>"]
```

## Build a UI hierarchy

The InstantSearch UI libraries provide a [`hierarchicalMenu`](/doc/api-reference/widgets/hierarchical-menu/js) widget
for displaying hierarchical information.
This widget expects your records to follow a specific format.
If your site shows a [breadcrumb](https://wikipedia.org/wiki/Breadcrumb_navigation),
you can turn it into a hierarchy in your records.

```html HTML icon=code-xml theme={"system"}
<ul class="breadcrumb">
  <li><a href="/home">Home</a></li>
  <li><a href="/home/pictures">Pictures</a></li>
  <li><a href="/home/pictures/summer15">Summer 15</a></li>
  <li>Italy</li>
</ul>
```

```js JavaScript icon=code theme={"system"}
function buildHierarchy(arr) {
  const hierarchy = {};

  for (let i = 0; i < arr.length; ++i) {
    hierarchy[`lvl${i}`] = arr.slice(0, i + 1).join(" > ");
  }

  return hierarchy;
}

const breadcrumb = $("ul.breadcrumb li")
  .map((i, e) => $(e).text())
  .get();

const hierarchy = buildHierarchy(breadcrumb); // This is compatible with InstantSearch's hierarchical menu widgets
```
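
To make the resulting shape concrete, the following standalone sketch applies the same logic to a hard-coded array that mirrors the breadcrumb HTML above, so it runs without Cheerio.

```js JavaScript icon=code theme={"system"}
// Build one `lvlN` attribute per breadcrumb level.
// Each level contains the full path up to that level.
function buildHierarchy(arr) {
  const hierarchy = {};

  for (let i = 0; i < arr.length; ++i) {
    hierarchy[`lvl${i}`] = arr.slice(0, i + 1).join(" > ");
  }

  return hierarchy;
}

const example = buildHierarchy(["Home", "Pictures", "Summer 15", "Italy"]);
console.log(example);
// {
//   lvl0: "Home",
//   lvl1: "Home > Pictures",
//   lvl2: "Home > Pictures > Summer 15",
//   lvl3: "Home > Pictures > Summer 15 > Italy"
// }
```

If you store this object in a record attribute such as `hierarchy` (a hypothetical attribute name), the widget would then be configured with `hierarchy.lvl0`, `hierarchy.lvl1`, and so on.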

## Index separate indices based on content

To add records to separate indices, create several [`actions`](/doc/tools/crawler/apis/configuration/actions),
each targeting a separate `indexName`.
You can then decide which pages each `action` processes by specifying the [`pathsToMatch`](/doc/tools/crawler/apis/configuration/actions#param-paths-to-match) parameter.

Sometimes you need to check the page content to determine which action should process it.
For example, if you have a separate <Index /> for each language, use the `html` tag's `lang` attribute to determine which index to use.

In the following example, both actions match the same pages, but each one extracts records from a page or skips it depending on the `lang` attribute.

```js JavaScript icon=code theme={"system"}
{
  // ...
  actions: [
    {
      indexName: "english",
      pathsToMatch: ["http://example.com/**"],
      recordExtractor: ({ $, url }) => {
        if ($("html").attr("lang") !== "en") {
          return []; // Skip non-English pages
        }

        return [
          {
            objectID: url.href,
            content: $("p").text(),
          },
        ];
      },
    },
    {
      indexName: "french",
      pathsToMatch: ["http://example.com/**"],
      recordExtractor: ({ $, url }) => {
        if ($("html").attr("lang") !== "fr") {
          return []; // Skip non-French pages
        }

        return [
          {
            objectID: url.href,
            content: $("p").text(),
          },
        ];
      },
    },
  ],
}
```
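
The `lang` attribute can include a region subtag, such as `en-US` or `fr-CA`, which a strict comparison with `"en"` or `"fr"` would skip. The following sketch shows one way to normalize the value before comparing. The `primaryLanguage` helper is a hypothetical name, and whether your pages use region subtags is an assumption to check against your own site.

```js JavaScript icon=code theme={"system"}
// Hypothetical helper: reduce a `lang` value such as "en-US" or "fr_CA"
// to its lowercase primary language subtag ("en", "fr").
function primaryLanguage(langAttr) {
  if (typeof langAttr !== "string") {
    return undefined; // The page has no `lang` attribute
  }

  return langAttr.trim().toLowerCase().split(/[-_]/)[0] || undefined;
}

// Inside a `recordExtractor`, usage might look like this:
// if (primaryLanguage($("html").attr("lang")) !== "en") {
//   return []; // Skip non-English pages
// }
```

Because the logic is a single expression, you can also inline it in each `recordExtractor` instead of declaring a separate function.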

## Split content

For better performance and relevance,
[split long content into several records](/doc/guides/sending-and-managing-data/prepare-your-data/how-to/indexing-long-documents).

### Split PDF files

The Crawler [transforms PDF documents into HTML with Apache Tika](/doc/tools/crawler/extracting-data/non-html-documents) and exposes it to you with Cheerio.
Use the **HTML** tab of the **URL Tester** to see the extracted HTML.

Based on the structure of the resulting HTML, you should be able to separate the content into individual records.

#### Basic PDF splitting

The HTML that Tika generates often consists mainly of `p` tags, which means that `$('p').text()` returns the complete text of your PDF.
Since PDF documents tend to be long and there's a [size limit](/doc/tools/crawler/troubleshooting/indexing-issues#records-exceed-the-maximum-for-your-algolia-plan) for Algolia records,
wrap such text with the [`splitContentIntoRecords`](/doc/tools/crawler/apis/configuration/actions#param-helpers-split-content-into-records) helper.
For example:

```js JavaScript icon=code theme={"system"}
{
  // ...
  recordExtractor: ({ url, $, contentLength, fileType, helpers }) => {
    const records = helpers.splitContentIntoRecords({
      baseRecord: { url },
      $elements: $('p'),
      maxRecordBytes: 10000,
    });

    return records;
  },
}
```

#### Advanced PDF splitting

PDF generation tools often create files with minimal structure.
It's typical to encounter `div` tags as a way to define individual pages.
For example, [this document](https://www.un.org/sustainabledevelopment/wp-content/uploads/2015/10/COP21-FAQs.pdf) has the following structure when transformed into HTML:

```html HTML icon=code-xml theme={"system"}
<body>
  <div class="page">
    <p></p>
    <p></p>
    ...
  </div>
  <!-- ... -->
</body>
```

With this structure, you can create one record per page by iterating over the `div.page` elements.

You can also combine this with a browser feature to open PDF documents on a given page: by adding `#page=n` at the end of a URL pointing to a PDF document, the browser opens it on that page.

By generating one record per page, you can redirect users to the page of the document that matches their search, which further improves their experience.
For example:

```js JavaScript icon=code theme={"system"}
{
  // ...
  recordExtractor: ({ url, $, contentLength, fileType }) => {
    const records = $('div.page')
      .map(function (i, e) {
        return {
          url: `${url}#page=${i + 1}`,
          content: $(e).text().trim(),
        };
      })
      .get();

    return records;
  },
};
```
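
Blank or image-only pages can produce records with empty content. The following sketch, assuming you first collect the text of each page into an array, skips those pages while keeping the original page numbers in the URLs. The `toPageRecords` helper is a hypothetical name, not a Crawler API.

```js JavaScript icon=code theme={"system"}
// Hypothetical helper: turn an array of page texts into records,
// skipping blank pages while keeping the original page numbers.
function toPageRecords(url, pageTexts) {
  return pageTexts
    .map((text, i) => ({
      url: `${url}#page=${i + 1}`, // Number pages before filtering
      content: text.trim(),
    }))
    .filter((record) => record.content !== "");
}

// Inside a `recordExtractor`, usage might look like this:
// const pageTexts = $("div.page")
//   .map((i, e) => $(e).text())
//   .get();
// return toPageRecords(url, pageTexts);
```

Numbering the pages before filtering matters: if you filtered first, every page after a blank one would link to the wrong `#page=n`.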

### Split pages using URI fragments

If your pages contain [URI fragments](https://en.wikipedia.org/wiki/URI_fragment),
point your records to them.
For example, with the following HTML:

```html HTML icon=code-xml theme={"system"}
<body>
  <h1 id="part1">Part 1</h1>
  <p></p>
  <p></p>
  <!-- ... -->
  <h1 id="part2">Part 2</h1>
  <p></p>
  <!-- ... -->
</body>
```

You can then create one record per heading,
so your users land on the relevant part of the page when they click a search result.

```js JavaScript icon=code theme={"system"}
{
  // ...
  recordExtractor: ({ url, $, contentLength, fileType }) => {
    const records = $('h1')
      .map(function (i, e) {
        return {
          url: `${url}#${$(e).attr('id')}`,
          content: $(e).nextUntil('h1').text(),
        };
      })
      .get();

    return records;
  },
};
```
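
If a heading has no `id` attribute, `$(e).attr('id')` returns `undefined` and the record URL ends with `#undefined`. The following sketch shows one way to guard against that. The `withFragment` helper is a hypothetical name, not a Crawler API.

```js JavaScript icon=code theme={"system"}
// Hypothetical helper: append a fragment to a page URL only when the
// heading has a usable `id`. Accepts a string or a URL object.
function withFragment(url, id) {
  const base = String(url).split("#")[0]; // Drop any existing fragment
  return id ? `${base}#${id}` : base;
}

// Inside the `.map()` callback, usage might look like this:
// url: withFragment(url, $(e).attr("id")),
```

Records for headings without an `id` then point to the top of the page instead of a broken anchor.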
