Extract data with Cheerio

When creating a recordExtractor, the most important parameter is the Cheerio instance ($). Cheerio is a server-side implementation of jQuery. The Crawler uses it to expose the page’s DOM so you can extract the content you want using Cheerio’s Selectors API. While Cheerio provides extensive documentation, you may need to experiment with its syntax to successfully crawl your pages. This guide outlines common techniques for building records from your site’s content.

Common extraction techniques

The following [helpers](/doc/tools/crawler/getting-started/concepts/#helpers) may be useful for extracting content from your pages.

Extract content from metadata elements

To get content from meta elements, parse their content attribute.

JavaScript

// Get `title` from <meta content="Page title" property="og:title">
const title = $('meta[property="og:title"]').attr("content");

// Get `description` from <meta content="Page description" name="description">
const description = $("meta[name=description]").attr("content");
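
Pages don't always define every meta element, and `.attr("content")` returns `undefined` when the element or attribute is missing. The following is a minimal sketch of a fallback chain for that case. `firstNonEmpty` is a hypothetical helper name introduced for illustration; it isn't part of the Crawler or Cheerio.

```javascript
// Hypothetical helper: return the first candidate that is a non-empty string
// after trimming, or `undefined` if none qualifies.
function firstNonEmpty(...candidates) {
  for (const candidate of candidates) {
    if (typeof candidate === "string" && candidate.trim() !== "") {
      return candidate.trim();
    }
  }
  return undefined;
}

// Inside a recordExtractor, this could be used as follows:
// const title = firstNonEmpty(
//   $('meta[property="og:title"]').attr("content"),
//   $("title").text()
// );
```

Listing the candidates from most to least specific keeps the preferred source first while still producing a value for pages that only have a title tag.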

Extract data from JSON-LD

To get content from supported JSON-LD attributes:

JavaScript

let jsonld;
const node = $('script[type="application/ld+json"]').get(0);

try {
  jsonld = JSON.parse(node.firstChild.data);
} catch (err) {
  // For debugging purposes, show the node in the developer console
  console.log(node);
}
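
JSON-LD content can be a single object, an array of objects, or an object that wraps its items in `@graph`. As a rough sketch, assuming you want the first item of a given `@type`, the parsed text could be searched as shown below. `findJsonLdByType` is a hypothetical name, not a Crawler helper.

```javascript
// Hypothetical helper: parse the text of a JSON-LD script tag and return the
// first item whose "@type" matches, or null if parsing fails or nothing matches.
function findJsonLdByType(rawText, type) {
  let parsed;
  try {
    parsed = JSON.parse(rawText);
  } catch (err) {
    return null;
  }

  // Normalize the three common shapes into a flat list of items
  const items = Array.isArray(parsed)
    ? parsed
    : parsed && Array.isArray(parsed["@graph"])
    ? parsed["@graph"]
    : [parsed];

  return (
    items.find((item) => {
      if (!item || typeof item !== "object") return false;
      // "@type" may be a string or an array of strings
      const types = [].concat(item["@type"] || []);
      return types.includes(type);
    }) || null
  );
}

// Possible usage inside a recordExtractor:
// const raw = $('script[type="application/ld+json"]').first().html();
// const article = findJsonLdByType(raw, "Article");
```

Returning `null` instead of throwing lets the extractor fall back to other sources, such as meta elements, when a page has no usable JSON-LD.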

Get text from several CSS selectors

To get content from several CSS selectors, query them all and retrieve an array of content.

JavaScript

const allHeadings = $("h1, h2")
  .map((i, e) => $(e).text())
  .get(); // ["First <h1>", "First <h2>", "Second <h2>", "Second <h1>"]

Build a UI hierarchy

The InstantSearch UI libraries provide a hierarchicalMenu widget for displaying hierarchical information. This widget expects your records to follow a specific format. If your site shows a breadcrumb, you can turn it into a hierarchy in your records.

HTML

<ul class="breadcrumb">
  <li><a href="/home">Home</a></li>
  <li><a href="/home/pictures">Pictures</a></li>
  <li><a href="/home/pictures/summer15">Summer 15</a></li>
  <li>Italy</li>
</ul>

JavaScript

function buildHierarchy(arr) {
  const hierarchy = {};

  for (let i = 0; i < arr.length; ++i) {
    hierarchy[`lvl${i}`] = arr.slice(0, i + 1).join(" > ");
  }

  return hierarchy;
}

const breadcrumb = $("ul.breadcrumb li")
  .map((i, e) => $(e).text())
  .get();

const hierarchy = buildHierarchy(breadcrumb); // This is compatible with InstantSearch's hierarchical menu widgets
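
Breadcrumb text extracted from real pages often carries stray whitespace or empty items. The sketch below follows the same approach as buildHierarchy above, but trims labels and drops empty ones first, and shows the resulting object for the breadcrumb in this section. `buildCleanHierarchy` is a hypothetical name used only for this illustration.

```javascript
// Same idea as buildHierarchy, with labels trimmed and empty labels removed
function buildCleanHierarchy(labels) {
  const cleaned = labels.map((label) => label.trim()).filter(Boolean);
  const hierarchy = {};

  cleaned.forEach((_, i) => {
    // Each level contains the full path up to and including that level
    hierarchy[`lvl${i}`] = cleaned.slice(0, i + 1).join(" > ");
  });

  return hierarchy;
}

const example = buildCleanHierarchy(["Home", " Pictures ", "", "Summer 15", "Italy"]);
// example:
// {
//   lvl0: "Home",
//   lvl1: "Home > Pictures",
//   lvl2: "Home > Pictures > Summer 15",
//   lvl3: "Home > Pictures > Summer 15 > Italy"
// }
```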

Index separate indices based on content

To add records to separate indices, create several actions, each targeting a separate indexName. You can then decide which pages each action processes by specifying the pathsToMatch parameter. Sometimes you need to check the page content to determine which action should process it. For example, if you have a separate index for each language, use the html tag’s lang attribute to determine which index to use. In the following example, both actions match the same pages, but each one extracts or skips a page depending on its lang attribute.

JavaScript

{
  // ...
  actions: [
    {
      indexName: "english",
      pathsToMatch: ["http://example.com/**"],
      recordExtractor: ({ $, url }) => {
        if ($("html").attr("lang") !== "en") {
          return []; // Skip non-English pages
        }

        return [
          {
            objectID: url.href,
            content: $("p").text(),
          },
        ];
      },
    },
    {
      indexName: "french",
      pathsToMatch: ["http://example.com/**"],
      recordExtractor: ({ $, url }) => {
        if ($("html").attr("lang") !== "fr") {
          return []; // Skip non-French pages
        }

        return [
          {
            objectID: url.href,
            content: $("p").text(),
          },
        ];
      },
    },
  ],
}
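
The strict comparisons above only match pages whose lang attribute is exactly "en" or "fr". If your pages use regional variants such as "en-US", one option is to compare only the primary language subtag. The following is a minimal sketch under that assumption. `primaryLanguage` is a hypothetical helper, and the logic is short enough to inline in each recordExtractor if your configuration can't reference outside functions.

```javascript
// Hypothetical helper: reduce a lang attribute such as "en-US" or "FR_ca"
// to its lowercase primary subtag ("en", "fr"). Returns "" when missing.
function primaryLanguage(langAttr) {
  if (typeof langAttr !== "string") return "";
  return langAttr.trim().toLowerCase().split(/[-_]/)[0];
}

// Possible usage inside a recordExtractor:
// if (primaryLanguage($("html").attr("lang")) !== "en") {
//   return []; // Skip non-English pages
// }
```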

Split content

For better performance and relevance, split long content into several records.

Split PDF files

The Crawler transforms PDF documents into HTML with Apache Tika and exposes it to you with Cheerio. Use the HTML tab of the URL Tester to see the extracted HTML. Based on the structure of the resulting HTML, you should be able to separate the content into individual records.

Basic PDF splitting

The HTML that Tika generates is often composed mainly of p tags, which means that $('p').text() returns the complete text of your PDF. Since PDF documents tend to be long and there’s a size limit for Algolia records, split such text with the splitContentIntoRecords helper. For example:

JavaScript

{
  // ...
  recordExtractor: ({ url, $, contentLength, fileType, helpers }) => {
    const records = helpers.splitContentIntoRecords({
      baseRecord: { url },
      $elements: $('p'),
      maxRecordBytes: 10000,
    });

    return records;
  },
}
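
To illustrate why a byte limit turns one long document into several records, here is a rough sketch of greedy grouping by UTF-8 size. It is not the implementation of splitContentIntoRecords. `chunkByBytes` is a hypothetical name, the grouping strategy is an assumption made for this example, and `Buffer` assumes a Node.js environment.

```javascript
// Greedily group paragraphs so each chunk's UTF-8 size stays within maxBytes.
// A single paragraph larger than maxBytes still becomes its own chunk.
function chunkByBytes(paragraphs, maxBytes) {
  const chunks = [];
  let current = "";

  for (const paragraph of paragraphs) {
    const candidate = current ? `${current} ${paragraph}` : paragraph;

    if (current && Buffer.byteLength(candidate, "utf8") > maxBytes) {
      // Adding this paragraph would exceed the budget: close the current chunk
      chunks.push(current);
      current = paragraph;
    } else {
      current = candidate;
    }
  }

  if (current) chunks.push(current);
  return chunks;
}

const chunks = chunkByBytes(["aaaa", "bbbb", "cccc"], 9);
// chunks: ["aaaa bbbb", "cccc"]
```

Measuring bytes rather than characters matters for non-ASCII text, where one character can take several bytes.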

Advanced PDF splitting

PDF generation tools often create files with minimal structure. It’s common to find div tags that delimit individual pages. For example, a document might have the following structure when transformed into HTML:

HTML

<body>
  <div class="page">
    <p></p>
    <p></p>
    ...
  </div>
  <!-- ... -->
</body>

With this structure, you can create one record per page. You can also combine this with a browser feature for opening PDF documents on a given page: when you add #page=n to the end of a URL pointing to a PDF document, the browser opens it on that page. By generating one record per page, you can send users directly to the page of the document that matches their search, which further improves their experience. For example:

JavaScript

{
  // ...
  recordExtractor: ({ url, $, contentLength, fileType }) => {
    const records = $('div.page')
      .map(function (i, e) {
        return {
          url: `${url}#page=${i + 1}`,
          content: $(e).text().trim(),
        };
      })
      .get();

    return records;
  },
};
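
Some PDFs contain blank pages, which would produce records with empty content. As a minimal sketch, assuming the per-page text has already been collected into an array, blank pages can be dropped while each remaining record keeps its original page number. `pagesToRecords` is a hypothetical helper name.

```javascript
// Hypothetical helper: turn an array of per-page texts into records,
// skipping blank pages but keeping each record's original page number.
function pagesToRecords(pdfUrl, pageTexts) {
  return pageTexts
    .map((text, i) => ({
      url: `${pdfUrl}#page=${i + 1}`, // Page numbers start at 1
      content: text.trim(),
    }))
    .filter((record) => record.content !== "");
}

// Possible usage inside a recordExtractor:
// const pageTexts = $("div.page").map((i, e) => $(e).text()).get();
// return pagesToRecords(url, pageTexts);
```

Numbering pages before filtering is the important detail: filtering first would shift the #page=n links for every page after a blank one.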

Split pages using URI fragments

If your pages contain URI fragments, it’s a good idea to point your records to them. Consider the following HTML:

HTML

<body>
  <h1 id="part1">Part 1</h1>
  <p></p>
  <p></p>
  <!-- ... -->
  <h1 id="part2">Part 2</h1>
  <p></p>
  <!-- ... -->
</body>

You can then create one record per heading, so your users land on the relevant part of the page when they click a search result.

JavaScript

{
  // ...
  recordExtractor: ({ url, $, contentLength, fileType }) => {
    const records = $('h1')
      .map(function (i, e) {
        return {
          url: `${url}#${$(e).attr('id')}`,
          content: $(e).nextUntil('h1').text(),
        };
      })
      .get();

    return records;
  },
};
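
If a heading has no id attribute, the template string above produces a URL ending in "#undefined". The following sketch, which assumes the standard URL class is available, appends a fragment only when an id exists. `withFragment` is a hypothetical helper name.

```javascript
// Hypothetical helper: append a fragment to a page URL only when an id exists
function withFragment(pageUrl, id) {
  const target = new URL(String(pageUrl));

  if (id) {
    target.hash = id; // The URL class adds the leading "#"
  }

  return target.href;
}

// Possible usage inside the map callback:
// url: withFragment(url, $(e).attr("id")),
```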
