Tools / Crawler / Guides / Extracting data / How to

Extracting data with Cheerio

When implementing your recordExtractor, the most important parameter is the Cheerio instance ($). Cheerio is a server-side implementation of jQuery. The Crawler uses it to expose the page’s DOM so you can extract the content you want using Cheerio’s Selectors API.

Both jQuery and Cheerio provide comprehensive documentation, but nailing the right syntax for your crawling needs can take some trial and error.

This guide provides you with the most common extractions strategies you need to build records out of your site’s content.

Common extraction strategies

Here’s a non-exhaustive list of common helpers you might find useful to extract content from your pages. Make sure to adapt them to your use case if need be.

Extract content from metadata elements

To get content from <meta> elements, you need to parse their content attribute.

1
2
3
4
5
// Get `title` from <meta content="Page title" property="og:title">
const title = $('meta[property="og:title"]').attr('content');

// Get `description` from <meta content="Page description" name="description">
const description = $('meta[name=description]').attr('content');

Extract data from JSON-LD

If your pages expose JSON-LD, you can access it like that:

1
2
3
4
5
6
7
8
9
let jsonld;
const node = $('script[type="application/ld+json"]').get(0);

try {
  jsonld = JSON.parse(node.firstChild.data);
} catch (err) {
  // In case of error, you can try to debug by logging the node
  console.log(node);
}

Get text from multiple selectors

If you need to get text from multiple selectors, you can query them all and retrieve an array of content.

1
2
3
const allHeadings = $('h1, h2')
  .map((i, e) => $(e).text())
  .get(); // ["First <h1>", "First <h2>", "Second <h2>", "Second <h1>"]

Build a hierarchy

InstantSearch libraries provide a hierarchicalMenu widget to display hierarchical information. This widget expects a special format in your records. If your site displays a breadcrumb, you can turn it into a hierarchy in your records.

1
2
3
4
5
6
<ul class="breadcrumb">
  <li><a href="/home">Home</a></li>
  <li><a href="/home/pictures">Pictures</a></li>
  <li><a href="/home/pictures/summer15">Summer 15</a></li>
  <li>Italy</li>
</ul>
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
function buildHierarchy(arr) {
  const hierarchy = {};

  for (let i = 0; i < arr.length; ++i) {
    res[`lvl${i}`] = arr.slice(0, i + 1).join(' > ');
  }

  return hierarchy;
}

const breadcrumb = $('ul.breadcrumb li')
  .map((i, e) => $(e).text())
  .get();

const hierarchy = buildHierarchy(breadcrumb); // This is compatible with InstantSearch's hierarchical menu widgets

Indexing in separate indices based on content

To push records in separate indices, you have to create multiple actions, each one targeting a separate indexName. You can then decide which pages each action processes by specifying the pathsToMatch.

Yet, sometimes you need to rely on the content of the page to determine which action needs to process it. For example, if you have an index per language, you might want to rely on the lang attribute of the <html> tag to know in where to index a page.

In the following example, both actions process the same pages, but might either crawl or skip them depending on the lang attribute.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
{
  // ...
  actions: [
    {
      indexName: 'english',
      pathsToMatch: ['http://example.com/**'],
      recordExtractor: ({ $, url }) => {
        if ($('html').attr('lang') !== 'en') {
          return []; // Skip non-English pages
        }

        return [
          {
            objectID: url.href,
            content: $('p').text(),
          },
        ];
      },
    },
    {
      indexName: 'french',
      pathsToMatch: ['http://example.com/**'],
      recordExtractor: ({ $, url }) => {
        if ($('html').attr('lang') !== 'fr') {
          return []; // Skip non-French pages
        }

        return [
          {
            objectID: url.href,
            content: $('p').text(),
          },
        ];
      },
    },
  ];
}

Splitting content

You should split long content into multiple records for performance and relevance reasons.

PDFs

The Crawler transforms PDF documents into HTML using Apache Tika and exposes it to you via Cheerio. You can use the HTML tab of the URL Tester to see the extracted HTML.

Depending on how the resulting HTML, you should be able to split the content into multiple records.

Basic PDF splitting

The HTML that Tika generates is often mainly composed of <p> tags, meaning the $('p').text() should return the complete text of your PDF. Yet, PDFs tend to be long, and since Algolia’s records size is limited, you should always wrap long text with the splitContentIntoRecords helper. Your PDF extractor could look like the following:

1
2
3
4
5
6
7
8
9
10
11
12
{
  // ...
  recordExtractor: ({ url, $, contentLength, fileType }) => {
    const records = helpers.splitContentIntoRecords({
      baseRecord: { url },
      $elements: $('p'),
      maxRecordBytes: 10000,
    });

    return records;
  },
}

Advanced PDF splitting

Many PDFs generators create PDFs with a minimal structure. It’s common to have <div> tags to identify the pages. For example, this document has the following structure when transformed to HTML:

1
2
3
4
5
6
7
8
<body>
  <div class="page">
    <p></p>
    <p></p>
    ...
  </div>
  <!-- ... -->
</body>

This lets you create one record per page.

You can also combine this with a browser feature to open PDFs documents on a given page: by adding #page=n at the end of a URL pointing to a PDF document, the browser opens it on that page.

By generating one record per page with their own URLs, you can redirect users to the page of the document that matched their search, which further improves the search experience. Your PDF extraction would look like this:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
{
  // ...
  recordExtractor: ({ url, $, contentLength, fileType }) => {
    const records = $('div.page')
      .map(function (i, e) {
        return {
          url: `${url}#page=${i + 1}`,
          content: $(e).text().trim(),
        };
      })
      .get();

    return records;
  },
};

Splitting using URI fragments

If you have URI fragments in your pages, it’s a good idea to have your records pointing to them. With the following HTML:

1
2
3
4
5
6
7
8
9
<body>
  <h1 id="part1">Part 1</h1>
  <p></p>
  <p></p>
  <!-- ... -->
  <h1 id="part2">Part 2</h1>
  <p></p>
  <!-- ... -->
</body>

You can then create one record per heading, so your users land on the relevant part of the page when they click a search result.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
{
  // ...
  recordExtractor: ({ url, $, contentLength, fileType }) => {
    const records = $('h1')
      .map(function (i, e) {
        return {
          url: `${url}#${$(e).attr('id')}`,
          content: $(e).nextUntil('h1').text(),
        };
      })
      .get();

    return records;
  },
};
Did you find this page helpful?