# Crawler data extraction overview

> How the Algolia Crawler extracts data from pages.

export const Records = () => <Tooltip tip="A record is a searchable object in an Algolia index. Each record consists of named attributes." cta="Algolia records" href="/doc/guides/sending-and-managing-data/prepare-your-data#algolia-records">
    records
  </Tooltip>;

export const Index = () => <Tooltip tip="An Algolia index is a searchable dataset that consists of records and configuration settings. These settings define how the records are searched and ranked.">
    index
  </Tooltip>;

The Crawler processes pages as follows:

1. Retrieve a page.
2. Extract links and records from the page.
3. Send extracted <Records /> to Algolia.
4. Add extracted links to the crawler's URL database.

The process repeats until the crawler has processed all the required pages.
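
The loop above can be sketched in a few lines of JavaScript. This is a simplified illustration, not the Crawler's actual implementation: the in-memory `site` object, the `crawl` function, and the `fetchPage` callback are hypothetical names, and "sending to Algolia" is replaced by pushing to an array.

```js
// Hypothetical in-memory "site": each URL maps to a title and its outgoing links.
const site = {
  "https://example.com/": { title: "Home", links: ["https://example.com/a", "https://example.com/b"] },
  "https://example.com/a": { title: "A", links: ["https://example.com/b"] },
  "https://example.com/b": { title: "B", links: ["https://example.com/"] },
};

function crawl(startUrls, fetchPage) {
  const urlDatabase = new Set(startUrls); // every URL discovered so far
  const queue = [...startUrls]; // URLs that still need processing
  const sent = []; // stands in for records sent to Algolia

  while (queue.length > 0) {
    const url = queue.shift();
    const page = fetchPage(url); // 1. Retrieve a page
    if (!page) continue;

    const records = [{ url, title: page.title }]; // 2. Extract records (and links)
    sent.push(...records); // 3. Send the records

    for (const link of page.links) {
      // 4. Add links that aren't in the URL database yet
      if (!urlDatabase.has(link)) {
        urlDatabase.add(link);
        queue.push(link);
      }
    }
  }
  return sent;
}

const records = crawl(["https://example.com/"], (url) => site[url]);
console.log(records.map((r) => r.title)); // [ 'Home', 'A', 'B' ]
```

Because every discovered URL is stored in a set, each page is processed once even when pages link back to each other.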

## The crawler URL database

When a crawl starts, your crawler adds all the URLs in the following parameters to its URL database:

* [`startUrls`](/doc/tools/crawler/apis/configuration/start-urls)
* [`sitemaps`](/doc/tools/crawler/apis/configuration/sitemaps)
* [`extraUrls`](/doc/tools/crawler/apis/configuration/extra-urls)
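
As a rough illustration, these three parameters might look like this in a crawler configuration. The URLs are placeholders, and other configuration a real crawler needs is omitted.

```js
// Hypothetical URLs: replace them with pages from your own site.
const config = {
  // Pages where the crawl begins
  startUrls: ["https://www.example.com/"],
  // Sitemaps that list pages to crawl
  sitemaps: ["https://www.example.com/sitemap.xml"],
  // Additional pages to crawl, for example pages no other page links to
  extraUrls: ["https://www.example.com/landing/unlinked-page"],
};
```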

For each of these pages, your crawler fetches linked pages. It discovers links from the following sources:

* `head > link[rel=alternate]`
* `a[href]`
* `iframe[src]`
* `area[href]`
* `head > link[rel=canonical]`
* Redirect target (when HTTP code is `301` or `302`)

<Info>
  You can specify that some links [should be ignored](/doc/tools/crawler/getting-started/crawler-configuration#decide-what-you-want-to-exclude).
</Info>

## The record extractor

The [`recordExtractor`](/doc/tools/crawler/apis/configuration/actions#param-record-extractor) parameter is a function that receives a page's metadata and HTML and returns an array of JSON objects.
For example:

```js JavaScript icon=code theme={"system"}
// Part of an action in your crawler configuration
recordExtractor: ({ url, $, contentLength, fileType }) => {
  // Return one record built from the page's URL and <head> metadata
  return [
    {
      url: url.href,
      title: $("head > title").text(),
      description: $("meta[name=description]").attr("content"),
      type: $('meta[property="og:type"]').attr("content"),
    },
  ];
},
```

### `recordExtractor` properties

This function receives an object with several properties:

* `$`: a [Cheerio instance](https://cheerio.js.org) that gives access to the page's content
* `url`: a [Location](https://developer.mozilla.org/en-US/docs/Web/API/Location) object that contains the URL of the page being crawled
* `fileType`: the file type of the page (such as `html` or `pdf`)
* `contentLength`: the length of the page's content
* `datasources`: any [external data](/doc/tools/crawler/enriching-data/overview) you want to combine with your extracted data
* `helpers`: a collection of functions to help you extract content and generate records

`url`, `fileType`, and `contentLength` provide metadata about the page being crawled.
To extract content from the page itself, you must use the Cheerio instance (`$`).
For more details, see [`recordExtractor`](/doc/tools/crawler/apis/configuration/actions#param-record-extractor).
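
The following sketch shows one way to combine the metadata properties with `$`. It assumes that returning an empty array creates no records for a page. The `fake$` stub and the example URL are hypothetical and exist only so the snippet runs outside the Crawler.

```js
// Sketch: use page metadata to decide whether to extract anything at all.
const recordExtractor = ({ url, $, contentLength, fileType }) => {
  // Assumption: this extractor only handles HTML, so it skips other file types.
  if (fileType !== "html") {
    return [];
  }
  return [
    {
      url: url.href,
      title: $("head > title").text().trim(),
      contentLength, // keep the metadata next to the extracted content
    },
  ];
};

// Stand-in for the Cheerio instance, only so the sketch runs on its own.
const fake$ = () => ({ text: () => "  Example page  " });
const page = {
  url: new URL("https://example.com/docs"),
  $: fake$,
  contentLength: 1024,
};

console.log(recordExtractor({ ...page, fileType: "html" }));
console.log(recordExtractor({ ...page, fileType: "pdf" })); // []
```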

### `recordExtractor` return structure

The JSON objects returned by your `recordExtractor` are directly converted into records in your Algolia <Index />.

These objects can contain attributes of any type, as long as they're compatible with an [Algolia record](/doc/guides/sending-and-managing-data/prepare-your-data#algolia-records) and stay within these limits:

* Each record must be less than 500 KB.
* You can return a maximum of 200 records per crawled URL.
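
If you want to catch oversized output before it reaches Algolia, a check like the following could run at the end of your `recordExtractor`. It's a sketch under two assumptions: that 500 KB means 500,000 bytes, and that the UTF-8 size of the serialized JSON approximates how Algolia measures a record. The `checkRecords` function and both constants are hypothetical names.

```js
const MAX_RECORD_BYTES = 500 * 1000; // assumption: 500 KB read as 500,000 bytes
const MAX_RECORDS_PER_URL = 200;

// Returns a list of human-readable problems, or an empty list if none are found.
function checkRecords(records) {
  const problems = [];
  if (records.length > MAX_RECORDS_PER_URL) {
    problems.push(`too many records: ${records.length}`);
  }
  const encoder = new TextEncoder();
  records.forEach((record, i) => {
    // Approximate the record size by the UTF-8 length of its JSON form.
    const bytes = encoder.encode(JSON.stringify(record)).length;
    if (bytes >= MAX_RECORD_BYTES) {
      problems.push(`record ${i} is ${bytes} bytes`);
    }
  });
  return problems;
}

console.log(checkRecords([{ title: "Small record" }])); // []
```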

## Extract from JavaScript-based sites

You can use your crawler on JavaScript-based sites.
To do this, set [`renderJavaScript`](/doc/tools/crawler/apis/configuration/render-java-script) to `true` in your crawler's configuration.

<Note>
  Setting `renderJavaScript` to `true` slows the crawling process, so consider enabling it for only the subset of your site that needs it.
</Note>
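
As an illustration, the two variants might look like this. The second form assumes that `renderJavaScript` also accepts a list of URL patterns. Confirm this against the parameter reference before relying on it. The URL pattern is a placeholder.

```js
// Render every crawled page with JavaScript (slower).
const renderEverything = {
  renderJavaScript: true,
};

// Assumption: a list of URL patterns limits rendering to matching pages.
const renderSubset = {
  renderJavaScript: ["https://www.example.com/app/**"],
};
```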

## Further reading

* [Crawler configuration](/doc/tools/crawler/getting-started/crawler-configuration)
* [Data extraction examples](/doc/tools/crawler/extracting-data/data-extraction-examples)
* [Non-HTML documents](/doc/tools/crawler/extracting-data/non-html-documents)
* [Extraction issues](/doc/tools/crawler/troubleshooting/extraction-issues)
* [Concepts](/doc/tools/crawler/getting-started/concepts)
* [Algolia service limits](/doc/guides/scaling/algolia-service-limits)
