Extracting Data

This page provides an overview of the Crawler’s extraction process. We’ll cover how pages are selected and processed, and how records are extracted from those pages.

Processing a page

To understand extraction, it is important to first understand how pages are processed by the Crawler.

Pages are processed in five main steps:

  1. A page is fetched.
  2. Links and records are extracted from the page.
  3. The extracted records are indexed to Algolia.
  4. The extracted links are added to the Crawler’s URL database.
  5. For each new, non-excluded page added to the database, the process is repeated.
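
The five steps above can be sketched as a small loop. This is only an illustration of the flow, not the Crawler's actual implementation: the in-memory fakeSite, the crawl function, and the isExcluded callback are all hypothetical names, and a plain array stands in for the Algolia index.

```javascript
// Hypothetical in-memory "website": URL -> page title and outgoing links.
const fakeSite = {
  "https://example.com/": {
    title: "Home",
    links: ["https://example.com/docs", "https://example.com/admin"],
  },
  "https://example.com/docs": { title: "Docs", links: ["https://example.com/"] },
  "https://example.com/admin": { title: "Admin", links: [] },
};

function crawl(startUrls, site, isExcluded) {
  const urlDatabase = new Set(startUrls); // every URL seen so far
  const queue = [...startUrls]; // URLs still to process
  const index = []; // stand-in for the Algolia index

  while (queue.length > 0) {
    const url = queue.shift();
    const page = site[url]; // 1. fetch the page
    if (!page) continue; // nothing to extract from a missing page

    const records = [{ url, title: page.title }]; // 2. extract records...
    const links = page.links; //    ...and links
    index.push(...records); // 3. index the records

    for (const link of links) {
      if (urlDatabase.has(link)) continue; // already known, not a new page
      urlDatabase.add(link); // 4. add the link to the URL database
      if (!isExcluded(link)) queue.push(link); // 5. repeat for new, non-excluded pages
    }
  }
  return index;
}

const indexed = crawl(["https://example.com/"], fakeSite, (url) =>
  url.includes("/admin")
);
console.log(indexed.map((record) => record.title));
```

Here the /admin page is recorded in the URL database but never processed, because the hypothetical exclusion rule rejects it.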

Adding a page

When a crawl starts, your crawler adds all the URLs stored in the following parameters to its URL database:

For each of these pages, your crawler fetches linked pages. It looks for links in any of the following formats:

  • head > link[rel=alternate]
  • a[href]
  • iframe[src]
  • area[href]
  • head > link[rel=canonical]
  • redirect target when HTTP code is 301 or 302

However, not every matching link is added: a page can be skipped or ignored for a number of reasons.

If a page is not ignored, its content is extracted.

Extracting records

Records are extracted from pages by a recordExtractor. Extractors are assigned to actions through the recordExtractor parameter. This parameter holds a function that returns the data you want to index, organized as an array of JSON objects.

Anatomy of a recordExtractor

recordExtractor: ({ url, $, contentLength, fileType }) => {
  return [
    {
      url: url.href,
      title: $("head > title").text(),
      description: $("meta[name=description]").attr("content"),
      type: $('meta[property="og:type"]').attr("content"),
    },
  ];
}
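
To show where such an extractor sits, here is a minimal sketch of an action that carries a recordExtractor. It assumes an action also has an indexName and a pathsToMatch list; those two names, the index name, and the URL pattern are assumptions for illustration, so check the configuration reference before relying on them.

```javascript
// Hypothetical action definition; the index name and URL pattern are placeholders.
const action = {
  indexName: "example_pages",
  pathsToMatch: ["https://www.example.com/**"],
  // The extractor returns an array of JSON objects, one per record.
  recordExtractor: ({ url, $ }) => {
    return [
      {
        url: url.href,
        title: $("head > title").text(),
      },
    ];
  },
};

// Assumption: actions are listed under an `actions` key in the crawler configuration.
const crawlerConfig = {
  actions: [action],
};
```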

Extraction function

recordExtractor is a custom function that takes a website’s metadata and HTML (and potentially external data) and returns an array of JSON objects.

Parameters

This function receives an object with several properties to help you build your final records:

  1. $: a Cheerio instance that contains the crawled website’s content (explained in the “Extracting a site’s content” section).
  2. url: a Location object that contains the URL of the page being crawled.
  3. fileType: the file type of the webpage (html, pdf, etc.).
  4. contentLength: the length of the webpage’s content.
  5. dataSources: the external data sets that you’ve defined in your crawler and want to combine with your extraction data.
  6. helpers: a collection of functions to help you extract content and generate records.

url, fileType, and contentLength provide useful metadata on the page you are crawling. However, to extract content from your webpages, you need to use the Cheerio instance ($).
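
As a rough sketch of how the metadata and the Cheerio instance can be combined, the extractor below copies url, fileType, and contentLength into the record and reads the title through $. The section field, derived from the first path segment, is a hypothetical attribute added only for illustration.

```javascript
const recordExtractor = ({ url, $, contentLength, fileType }) => {
  // First path segment, e.g. "docs" for https://example.com/docs/intro.
  // Falls back to an empty string for the root page.
  const [section = ""] = url.pathname.split("/").filter(Boolean);

  return [
    {
      url: url.href,
      section, // hypothetical attribute, not required by the Crawler
      fileType,
      contentLength,
      title: $("head > title").text(), // page content still comes from Cheerio
    },
  ];
};
```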

Return structure

Each JSON object returned by your recordExtractor is converted directly into a record in your Algolia index.

Records can contain any type of value that is compatible with an Algolia record. However, each record must be smaller than 500 KB, and you can return at most 200 records per crawled URL.
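
These limits can be checked locally before you deploy an extractor. The helper below is a hypothetical sketch: checkRecords is not part of the Crawler, and it assumes that record size can be approximated by the UTF-8 byte length of the serialized JSON and that 500 KB means 500,000 bytes. The way Algolia measures size may differ.

```javascript
const MAX_RECORD_BYTES = 500 * 1000; // assumption: 500 KB read as 500,000 bytes
const MAX_RECORDS_PER_URL = 200;

// Returns a list of human-readable problems; an empty list means no limit was hit.
function checkRecords(records) {
  const problems = [];

  if (records.length > MAX_RECORDS_PER_URL) {
    problems.push(
      `too many records: ${records.length} > ${MAX_RECORDS_PER_URL}`
    );
  }

  records.forEach((record, i) => {
    // Approximate the record size by its serialized JSON length in bytes.
    const bytes = Buffer.byteLength(JSON.stringify(record), "utf8");
    if (bytes >= MAX_RECORD_BYTES) {
      problems.push(`record ${i} is too large: ${bytes} bytes`);
    }
  });

  return problems;
}
```

Calling checkRecords on the array returned by a recordExtractor during local testing gives an early warning before a crawl runs.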

Extracting a site’s content

Website content is accessible through the Cheerio instance ($) passed to your recordExtractor. Cheerio is “a lean implementation of core jQuery designed specifically for the server”. Check out Cheerio’s documentation for examples, syntax, and guidance.
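
As an illustration of working with $, the sketch below builds one record per h2 heading on the page. It assumes the page marks its sections with h2 elements that carry an id; the anchor, pageTitle, and heading attribute names are hypothetical.

```javascript
const recordExtractor = ({ url, $ }) => {
  const pageTitle = $("head > title").text().trim();

  // toArray() turns the Cheerio selection into a plain array of elements.
  return $("h2")
    .toArray()
    .map((element) => {
      const $heading = $(element); // wrap the element to use Cheerio methods
      return {
        url: url.href,
        anchor: $heading.attr("id"), // undefined when the heading has no id
        pageTitle,
        heading: $heading.text().trim(),
      };
    });
};
```

A page without any h2 element yields an empty array, so no record is produced for it.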

Extracting from JavaScript-based sites

You can also use your crawler on JavaScript-based websites. To do this, set renderJavaScript to true in your crawler’s configuration file.

Setting renderJavaScript to true makes crawling much slower, so you can enable it for only a subset of your website.
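
A minimal sketch of restricting rendering to part of a site follows. It assumes that renderJavaScript accepts a list of URL patterns in place of true; that form and the example pattern are assumptions, so confirm the accepted values in the configuration reference.

```javascript
const crawlerConfig = {
  // Assumption: only pages matching these patterns are rendered with JavaScript.
  renderJavaScript: ["https://www.example.com/app/**"],
};
```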

Extracting data from non-HTML documents

You can use the Crawler to index documents (such as PDF and DOC files). Documents are transformed into HTML by a dedicated Tika server.
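
Because documents reach the extractor as HTML, the same Cheerio calls apply to them. The sketch below branches on fileType; it assumes that PDFs are reported as "pdf" and that the converted document exposes its text under body, neither of which is confirmed here.

```javascript
const recordExtractor = ({ url, $, fileType }) => {
  if (fileType === "pdf") {
    // Assumption: the converted document keeps its text inside <body>.
    return [{ url: url.href, fileType, content: $("body").text().trim() }];
  }

  // Regular HTML pages: index the page title instead.
  return [{ url: url.href, fileType, title: $("head > title").text() }];
};
```

Long documents can produce large records, so the 500 KB limit per record is worth keeping in mind when indexing full document text.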
