
Extracting data with Cheerio

When creating a recordExtractor, the most important parameter is the Cheerio instance ($). Cheerio is a server-side implementation of jQuery. The Crawler uses it to expose the page’s DOM so you can extract the content you want using Cheerio’s Selectors API.

While Cheerio provides extensive documentation, you may need to experiment with its syntax to successfully crawl your pages.

This guide outlines common techniques for building records from your site’s content.
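
For context, here's a minimal sketch of a recordExtractor that returns one record per page. The title and content selectors are only illustrative; the sections below show more targeted extraction.

{
  // ...
  recordExtractor: ({ $, url }) => {
    // Return an array of records for the crawled page
    return [
      {
        objectID: url.href,
        title: $('head > title').text(),
        content: $('p').text(),
      },
    ];
  },
}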

Common extraction techniques

The following techniques may be useful for extracting content from your pages.

Extract content from metadata elements

To get content from <meta> elements, you need to parse their content attribute.

// Get `title` from <meta content="Page title" property="og:title">
const title = $('meta[property="og:title"]').attr('content');

// Get `description` from <meta content="Page description" name="description">
const description = $('meta[name="description"]').attr('content');

Extract data from JSON-LD

To extract content from a JSON-LD <script type="application/ld+json"> element:

let jsonld;
const node = $('script[type="application/ld+json"]').get(0);

try {
  jsonld = JSON.parse(node.firstChild.data);
} catch (err) {
  // For debugging purposes, show the node in the developer console
  console.log(node);
}
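
Once parsed, you can copy values from the JSON-LD object into your record. The name and description properties below are assumptions: adjust them to match your structured data.

// Hypothetical mapping: adapt the property names to your JSON-LD schema
const record = {
  objectID: url.href,
  name: jsonld && jsonld.name,
  description: jsonld && jsonld.description,
};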

Get text from several CSS selectors

To get content matching several CSS selectors, query them all at once and retrieve the results as an array.

const allHeadings = $('h1, h2')
  .map((i, e) => $(e).text())
  .get(); // ["First <h1>", "First <h2>", "Second <h2>", "Second <h1>"]
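
You can then store the array on a record as-is, or join it into a single string. The headings attribute name is an assumption.

const record = {
  objectID: url.href,
  headings: allHeadings, // or allHeadings.join(' ') for a single string
};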

Build a UI hierarchy

InstantSearch libraries provide a hierarchicalMenu widget to display hierarchical information. This widget expects a special format in your records. If your site displays a breadcrumb, you can turn it into a hierarchy in your records.

<ul class="breadcrumb">
  <li><a href="/home">Home</a></li>
  <li><a href="/home/pictures">Pictures</a></li>
  <li><a href="/home/pictures/summer15">Summer 15</a></li>
  <li>Italy</li>
</ul>
function buildHierarchy(arr) {
  const hierarchy = {};

  for (let i = 0; i < arr.length; ++i) {
    hierarchy[`lvl${i}`] = arr.slice(0, i + 1).join(' > ');
  }

  return hierarchy;
}

const breadcrumb = $('ul.breadcrumb li')
  .map((i, e) => $(e).text())
  .get();

const hierarchy = buildHierarchy(breadcrumb); // This is compatible with InstantSearch's hierarchical menu widgets
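
For the breadcrumb above, buildHierarchy produces the following levels. Storing them under a hierarchy attribute (the attribute name is an assumption) lets you reference them from the widget's configuration.

const record = {
  objectID: url.href,
  hierarchy: {
    lvl0: 'Home',
    lvl1: 'Home > Pictures',
    lvl2: 'Home > Pictures > Summer 15',
    lvl3: 'Home > Pictures > Summer 15 > Italy',
  },
};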

Index separate indices based on content

To add records to separate indices, create several actions, each targeting a separate indexName. You can then decide which pages each action processes by specifying the pathsToMatch parameter.

Sometimes you need to check the page content to determine which action should process it. For example, if you have a separate index for each language, use the <html> tag’s lang attribute to determine which index to use.

In the following example, both actions match the same pages but crawl or skip them depending on the lang attribute.

{
  // ...
  actions: [
    {
      indexName: 'english',
      pathsToMatch: ['http://example.com/**'],
      recordExtractor: ({ $, url }) => {
        if ($('html').attr('lang') !== 'en') {
          return []; // Skip non-English pages
        }

        return [
          {
            objectID: url.href,
            content: $('p').text(),
          },
        ];
      },
    },
    {
      indexName: 'french',
      pathsToMatch: ['http://example.com/**'],
      recordExtractor: ({ $, url }) => {
        if ($('html').attr('lang') !== 'fr') {
          return []; // Skip non-French pages
        }

        return [
          {
            objectID: url.href,
            content: $('p').text(),
          },
        ];
      },
    },
  ],
}

Split content

You should split long content into several records to enhance performance and relevance.

Split PDF files

The Crawler transforms PDF documents into HTML with Apache Tika and exposes it to you with Cheerio. Use the HTML tab of the URL Tester to see the extracted HTML.

Based on the structure of the resulting HTML, you should be able to separate the content into individual records.

Basic PDF splitting

The HTML that Tika generates is often mainly composed of <p> tags, meaning that $('p').text() should return the complete text of your PDF. Since PDFs tend to be long and Algolia records have a size limit, you should split this text with the splitContentIntoRecords helper. For example:

{
  // ...
  recordExtractor: ({ url, $, contentLength, fileType, helpers }) => {
    const records = helpers.splitContentIntoRecords({
      baseRecord: { url },
      $elements: $('p'),
      maxRecordBytes: 10000,
    });

    return records;
  },
}

Advanced PDF splitting

PDF generation tools often create files with minimal structure. It's common to encounter <div> tags that delimit individual pages. For example, such a document might have the following structure when transformed into HTML:

<body>
  <div class="page">
    <p></p>
    <p></p>
    ...
  </div>
  <!-- ... -->
</body>

Based on this structure, you can create one record per page.

You can combine this with a browser feature for opening PDF documents at a specific page: appending #page=n to a URL that points to a PDF makes the browser open it at that page.

By generating one record per page, you can redirect users to the page of the document that matches their search, which further improves their experience. For example:

{
  // ...
  recordExtractor: ({ url, $, contentLength, fileType }) => {
    const records = $('div.page')
      .map(function (i, e) {
        return {
          url: `${url}#page=${i + 1}`,
          content: $(e).text().trim(),
        };
      })
      .get();

    return records;
  },
};

Split pages using URI fragments

If your pages contain URI fragments (elements with id attributes), it's a good idea to point your records at them. Consider the following HTML:

<body>
  <h1 id="part1">Part 1</h1>
  <p></p>
  <p></p>
  <!-- ... -->
  <h1 id="part2">Part 2</h1>
  <p></p>
  <!-- ... -->
</body>

You can then create one record per heading, so your users land on the relevant part of the page when they click a search result.

{
  // ...
  recordExtractor: ({ url, $, contentLength, fileType }) => {
    const records = $('h1')
      .map(function (i, e) {
        return {
          url: `${url}#${$(e).attr('id')}`,
          content: $(e).nextUntil('h1').text(),
        };
      })
      .get();

    return records;
  },
};