linkExtractor

Type: function
Parameter syntax
linkExtractor: ({ $, url, defaultExtractor }) => {
  ...
  // return ['https://...']
}

About this parameter

Override the default logic used to extract URLs from pages.

By default, the Crawler queues every URL that complies with pathsToMatch, fileTypesToMatch, and exclusions. You can override this default logic by providing a custom function that runs on each crawled page and returns the URLs to queue.

The expected return value is an array of URLs (as strings).
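
As a minimal sketch of that contract, the function below merges the default results with one extra link and returns only unique, non-empty strings. The helper name toUrlList and the link[rel="next"] selector are assumptions made for illustration; they are not part of the Crawler API.

```javascript
// Hypothetical helper (not part of the Crawler API): keeps only non-empty
// strings and removes duplicates, preserving the original order.
const toUrlList = (values) => [
  ...new Set(values.filter((v) => typeof v === 'string' && v.length > 0)),
];

const returnValueConfig = {
  linkExtractor: ({ $, defaultExtractor }) => {
    // attr() returns undefined when no element matches the selector,
    // so the extra link is passed through toUrlList before being returned.
    const next = $('link[rel="next"]').attr('href');
    return toUrlList([...defaultExtractor(), next]);
  },
};
```

A guard like this may be useful whenever a value read from the DOM can be missing, since the return value is expected to contain only strings.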

Examples

  {
    linkExtractor: ({ $, url, defaultExtractor }) => {
      if (/example\.com\/doc\//.test(url.href)) {
        // For all pages under /doc, only queue the first found link
        return defaultExtractor().slice(0,1);
      }
      // Otherwise, use the default logic (queue all found links)
      return defaultExtractor();
    },
  }
{
  linkExtractor: ({ $, url, defaultExtractor }) => {
    // Turn off link discovery, except for URLs listed in sitemap.xml
    return /sitemap\.xml/.test(url.href) ? defaultExtractor() : [];
  },
}
{
  linkExtractor: ({ $ }) => {
    // Access the DOM and queue the href of the first element matching .my-link, if any
    const href = $('.my-link').attr('href');
    return href ? [href] : [];
  },
}

Parameters

url
type: URL
Optional

URL of the resource that was just crawled.

defaultExtractor
type: function
Optional

Default function used internally by the Crawler to discover URLs from a resource’s content. It returns an array of strings containing all URLs found on the current resource (if they match the configuration).
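
Because defaultExtractor returns a plain array of strings, its result can be post-processed with ordinary array methods. The sketch below drops every URL that carries a query string; that filtering rule is only an illustrative assumption, not a Crawler recommendation.

```javascript
const filteredConfig = {
  linkExtractor: ({ defaultExtractor }) =>
    // Keep the default discovery logic, then discard URLs containing "?"
    defaultExtractor().filter((link) => !link.includes('?')),
};
```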

$
type: object (Cheerio instance)
Optional

A Cheerio instance containing the HTML of the crawled page.
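
To collect every matching link rather than a single one, Cheerio's map() and get() can be combined, as in the sketch below. It assumes the links of interest carry a my-link class and that relative href values should be resolved against the crawled page's URL; both are assumptions made for illustration.

```javascript
const allLinksConfig = {
  linkExtractor: ({ $, url }) => {
    // Read the href attribute of every matching anchor element.
    const hrefs = $('a.my-link')
      .map((i, el) => $(el).attr('href'))
      .get();

    return hrefs
      .filter((href) => typeof href === 'string' && href.length > 0)
      .flatMap((href) => {
        try {
          // Resolve relative links against the URL of the current page.
          return [new URL(href, url.href).href];
        } catch {
          // Skip values that cannot be parsed as a URL.
          return [];
        }
      });
  },
};
```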
