# Crawler data extraction overview

> How the Algolia Crawler extracts data from pages.

The Crawler processes pages as follows:

1. Retrieve a page.
2. Extract links and records from the page.
3. Send the extracted records to Algolia.
4. Add the extracted links to the crawler's URL database.

The process repeats until all the required pages have been processed.

## The crawler URL database

When a crawl starts, your crawler adds all the URLs in the following parameters to its URL database:

* [`startUrls`](/doc/tools/crawler/apis/configuration/start-urls)
* [`sitemaps`](/doc/tools/crawler/apis/configuration/sitemaps)
* [`extraUrls`](/doc/tools/crawler/apis/configuration/extra-urls)

For each of these pages, your crawler fetches linked pages. It looks for links in any of the following formats:

* `head > link[rel=alternate]`
* `a[href]`
* `iframe[src]`
* `area[href]`
* `head > link[rel=canonical]`
* Redirect target (when the HTTP status code is `301` or `302`)

You can specify that some links [should be ignored](/doc/tools/crawler/getting-started/crawler-configuration#decide-what-you-want-to-exclude).

## The record extractor

The [`recordExtractor`](/doc/tools/crawler/apis/configuration/actions#param-record-extractor) parameter takes a page's metadata and HTML and returns an array of JSON objects.
For example:

```js
recordExtractor: ({ url, $, contentLength, fileType }) => {
  // Return one record built from the page's URL and <head> metadata
  return [
    {
      url: url.href,
      title: $("head > title").text(),
      description: $("meta[name=description]").attr("content"),
      type: $('meta[property="og:type"]').attr("content"),
    },
  ];
};
```

### `recordExtractor` properties

This function receives an object with several properties:

* `$`: a [Cheerio instance](https://cheerio.js.org) that gives access to the page's content
* `url`: a [Location](https://developer.mozilla.org/en-US/docs/Web/API/Location) object that contains the URL of the page being crawled
* `fileType`: the file type of the page (such as `html` or `pdf`)
* `contentLength`: the length of the page's content
* `datasources`: any [external data](/doc/tools/crawler/enriching-data/overview) you want to combine with your extracted data
* `helpers`: a collection of functions to help you extract content and generate records

`url`, `fileType`, and `contentLength` provide useful metadata about the page you are crawling. However, to extract content from your pages, you must use the Cheerio instance (`$`).

For more details, see [`recordExtractor`](/doc/tools/crawler/apis/configuration/actions#param-record-extractor).

### `recordExtractor` return structure

The JSON objects returned by your `recordExtractor` are directly converted into records in your Algolia index. They can contain values of any type as long as they're compatible with an [Algolia record](/doc/guides/sending-and-managing-data/prepare-your-data#algolia-records):

* Each record must be less than 500 KB.
* You can return a maximum of 200 records per crawled URL.

## Extract from JavaScript-based sites

You can use your crawler on JavaScript-based sites. To do this, set [`renderJavaScript`](/doc/tools/crawler/apis/configuration/render-java-script) to `true` in your crawler's configuration.

Because setting `renderJavaScript` to `true` slows the crawling process, you can enable it for only a subset of your site.
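
As a rough sketch of limiting JavaScript rendering to part of a site, the fragment below assumes that `renderJavaScript` also accepts a list of URL patterns in place of `true`; check the `renderJavaScript` reference before relying on this. The `crawlerConfig` variable name, the `example.com` domain, and the `/app/**` path are hypothetical placeholders.

```js
// Sketch of a crawler configuration fragment, not a complete configuration.
// The domain and the path pattern are hypothetical placeholders.
const crawlerConfig = {
  startUrls: ["https://www.example.com/"],
  // Assumption: a list of URL patterns enables JavaScript rendering only for
  // matching pages, so the rest of the site is crawled without the slowdown.
  renderJavaScript: ["https://www.example.com/app/**"],
};
```

With a setup like this, only pages under the hypothetical `/app/` path would pay the rendering cost.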
## Further reading

* [Crawler configuration](/doc/tools/crawler/getting-started/crawler-configuration)
* [Data extraction examples](/doc/tools/crawler/extracting-data/data-extraction-examples)
* [Non-HTML documents](/doc/tools/crawler/extracting-data/non-html-documents)
* [Extraction issues](/doc/tools/crawler/troubleshooting/extraction-issues)
* [Concepts](/doc/tools/crawler/getting-started/concepts)
* [Algolia service limits](/doc/guides/scaling/algolia-service-limits)