# Extract data with Cheerio

> Learn the Cheerio syntax to extract data in the Algolia Crawler, and discover ready-to-use selectors and extractors.

When creating a [`recordExtractor`](/doc/tools/crawler/apis/configuration/actions#param-record-extractor), the most important parameter is the [Cheerio](https://cheerio.js.org) instance (`$`).

Cheerio is a server-side implementation of [jQuery](https://jquery.com/). The Crawler uses it to expose the page's DOM so you can extract the content you want using Cheerio's [Selectors API](https://cheerio.js.org/#selectors).

While Cheerio provides extensive documentation, you may need to experiment with its syntax to successfully crawl your pages. This guide outlines common techniques for building records from your site's content.

## Common extraction techniques

The following [helpers](/doc/tools/crawler/getting-started/concepts/#helpers) may be useful for extracting content from *your* pages.

### Extract content from metadata elements

To get content from [`meta` elements](/doc/tools/crawler/extracting-data/finding-and-extracting-data#meta-tags), parse their `content` attribute.

```js JavaScript icon=code theme={"system"}
// Get `title` from <meta property="og:title" content="...">
const title = $('meta[property="og:title"]').attr("content");
// Get `description` from <meta name="description" content="...">
const description = $("meta[name=description]").attr("content");
```

### Extract data from JSON-LD

To get content from [supported JSON-LD attributes](/doc/tools/crawler/extracting-data/finding-and-extracting-data#helpers):

```js JavaScript icon=code theme={"system"}
let jsonld;
const node = $('script[type="application/ld+json"]').get(0);
try {
  jsonld = JSON.parse(node.firstChild.data);
} catch (err) {
  // For debugging purposes, show the node in the developer console
  console.log(node);
}
```

### Get text from several CSS selectors

To get content from several [CSS selectors](/doc/tools/crawler/extracting-data/finding-and-extracting-data#meta-tags), query them all and retrieve an array of content.

```js JavaScript icon=code theme={"system"}
const allHeadings = $("h1, h2")
  .map((i, e) => $(e).text())
  .get();
// For example: ["First h1", "First h2", "Second h2"]
```

## Build a UI hierarchy

The InstantSearch UI libraries provide a [`hierarchicalMenu`](/doc/api-reference/widgets/hierarchical-menu/js) widget for displaying hierarchical information. This widget expects your records to follow a specific format.

If your site shows a [breadcrumb](https://wikipedia.org/wiki/Breadcrumb_navigation), you can turn it into a hierarchy in your records.

```html HTML icon=code-xml theme={"system"}

<ul class="breadcrumb">
  <li>Home</li>
  <li>Pictures</li>
  <li>Summer 15</li>
  <li>Italy</li>
</ul>
```

```js JavaScript icon=code theme={"system"}
function buildHierarchy(arr) {
  const hierarchy = {};
  for (let i = 0; i < arr.length; ++i) {
    // Each level holds the full path up to and including item i
    hierarchy[`lvl${i}`] = arr.slice(0, i + 1).join(" > ");
  }
  return hierarchy;
}

const breadcrumb = $("ul.breadcrumb li")
  .map((i, e) => $(e).text())
  .get();
const hierarchy = buildHierarchy(breadcrumb);
// This is compatible with InstantSearch's hierarchical menu widgets
```

## Index separate indices based on content

To add records to separate indices, create several [`actions`](/doc/tools/crawler/apis/configuration/actions), each targeting a separate `indexName`. You can then decide which pages each `action` processes by specifying the [`pathsToMatch`](/doc/tools/crawler/apis/configuration/actions#param-paths-to-match) parameter.

Sometimes you need to check the page content to determine which action should process it. For example, if you have a separate index for each language, use the `html` tag's `lang` attribute to determine which index to use.

In the following example, both actions process the same pages, but each one either extracts records from a page or skips it depending on the `lang` attribute.

```js JavaScript icon=code theme={"system"}
{
  // ...
  actions: [
    {
      indexName: "english",
      pathsToMatch: ["http://example.com/"],
      recordExtractor: ({ $, url }) => {
        if ($("html").attr("lang") !== "en") {
          return []; // Skip non-English pages
        }
        return [
          {
            objectID: url.href,
            content: $("p").text(),
          },
        ];
      },
    },
    {
      indexName: "french",
      pathsToMatch: ["http://example.com/"],
      recordExtractor: ({ $, url }) => {
        if ($("html").attr("lang") !== "fr") {
          return []; // Skip non-French pages
        }
        return [
          {
            objectID: url.href,
            content: $("p").text(),
          },
        ];
      },
    },
  ],
}
```

## Split content

For better performance and relevance, [split long content into several records](/doc/guides/sending-and-managing-data/prepare-your-data/how-to/indexing-long-documents).

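To make the idea of splitting concrete, here is a minimal sketch that cuts a long text into chunks under a byte budget and turns each chunk into a record. It isn't the Crawler's implementation: `splitTextByBytes`, `toRecords`, the `part` attribute, and the `objectID` format are hypothetical names chosen for illustration.

```js
// Hypothetical helper (not part of the Crawler API): split text on whitespace
// into chunks whose UTF-8 size is at most maxBytes.
// A single word larger than maxBytes still becomes its own (oversized) chunk.
function splitTextByBytes(text, maxBytes) {
  const encoder = new TextEncoder();
  const byteLength = (s) => encoder.encode(s).length;
  const chunks = [];
  let current = "";
  for (const word of text.split(/\s+/).filter(Boolean)) {
    const candidate = current ? `${current} ${word}` : word;
    if (current === "" || byteLength(candidate) <= maxBytes) {
      current = candidate; // the word fits, or it starts a new chunk
    } else {
      chunks.push(current); // close the full chunk and start the next one
      current = word;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

// Hypothetical record shape: one record per chunk, all pointing to the same page.
function toRecords(url, text, maxBytes) {
  return splitTextByBytes(text, maxBytes).map((content, part) => ({
    objectID: `${url}-part-${part}`,
    url,
    part,
    content,
  }));
}

const records = toRecords("https://example.com/guide", "alpha beta gamma delta", 11);
console.log(records.map((r) => r.content));
// ["alpha beta", "gamma delta"]
```

Measuring bytes rather than characters matters because record size limits apply to the encoded record, and non-ASCII characters take more than one byte in UTF-8.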
### Split PDF files

The Crawler [transforms PDF documents into HTML with Apache Tika](/doc/tools/crawler/extracting-data/non-html-documents) and exposes the result to you with Cheerio. Use the HTML tab of the URL Tester to see the extracted HTML. Based on the structure of the resulting HTML, you should be able to separate the content into individual records.

#### Basic PDF splitting

The HTML that Tika generates often consists mainly of `p` tags, meaning that `$('p').text()` returns the complete text of your PDF. Since PDF documents tend to be long and there's a [size limit](/doc/tools/crawler/troubleshooting/indexing-issues#records-exceed-the-maximum-for-your-algolia-plan) for Algolia records, wrap such text with the [`splitContentIntoRecords`](/doc/tools/crawler/apis/configuration/actions#param-helpers-split-content-into-records) helper. For example:

```js JavaScript icon=code theme={"system"}
{
  // ...
  recordExtractor: ({ url, $, contentLength, fileType, helpers }) => {
    const records = helpers.splitContentIntoRecords({
      baseRecord: { url },
      $elements: $('p'),
      maxRecordBytes: 10000,
    });
    return records;
  },
}
```

#### Advanced PDF splitting

PDF generation tools often create files with minimal structure. It's typical to encounter `div` tags as a way to define individual pages. For example, [this document](https://www.un.org/sustainabledevelopment/wp-content/uploads/2015/10/COP21-FAQs.pdf) has the following structure when transformed into HTML:

```html HTML icon=code-xml theme={"system"}

<div class="page">
  <p>...</p>
</div>
...
```

With this structure, you can create one record per page. You can also combine this with a browser feature to open PDF documents on a given page: by adding `#page=n` at the end of a URL pointing to a PDF document, the browser opens it on that page.

By generating one record per page, you can redirect users to the page of the document that matches their search, which further improves their experience. For example:

```js JavaScript icon=code theme={"system"}
{
  // ...
  recordExtractor: ({ url, $, contentLength, fileType }) => {
    const records = $('div.page')
      .map(function (i, e) {
        return {
          // Pages are 1-based in the `#page=n` fragment
          url: `${url}#page=${i + 1}`,
          content: $(e).text().trim(),
        };
      })
      .get();
    return records;
  },
}
```

### Split pages using URI fragments

If you have [URI fragments](https://en.wikipedia.org/wiki/URI_fragment) in your pages, it's a good idea to have your records point to them. With the following HTML:

```html HTML icon=code-xml theme={"system"}
<h1 id="part1">Part 1</h1>
<p>...</p>
<h1 id="part2">Part 2</h1>
<p>...</p>
```

You can then create one record per heading, so your users land on the relevant part of the page when they click a search result.

```js JavaScript icon=code theme={"system"}
{
  // ...
  recordExtractor: ({ url, $, contentLength, fileType }) => {
    const records = $('h1')
      .map(function (i, e) {
        return {
          // Link to the heading's anchor
          url: `${url}#${$(e).attr('id')}`,
          // Collect the text of everything up to the next heading
          content: $(e).nextUntil('h1').text(),
        };
      })
      .get();
    return records;
  },
}
```
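
Building the URL with string concatenation produces `#undefined` for a heading without an `id`, and would append a second `#` if the page URL ever carried a fragment already. The following is a minimal sketch of a more defensive approach using the standard `URL` class. `sectionUrl` is a hypothetical helper name, not part of the Crawler API, and the `headings` array is plain data standing in for what `$('h1')` would yield.

```js
// Hypothetical helper (not part of the Crawler API): build a section URL
// from a page URL and an optional heading id.
function sectionUrl(pageUrl, id) {
  // new URL() accepts a string or a URL object and returns a fresh copy
  const target = new URL(pageUrl);
  // An empty hash removes any fragment, so headings without an id
  // fall back to the plain page URL instead of "#undefined".
  target.hash = id || "";
  return target.href;
}

// Plain data standing in for the headings found on a page
const headings = [
  { id: "part1", text: "Part 1" },
  { id: undefined, text: "No anchor" },
];

const urls = headings.map((h) => sectionUrl("https://example.com/guide#top", h.id));
console.log(urls);
// ["https://example.com/guide#part1", "https://example.com/guide"]
```

Inside a `recordExtractor`, the same helper could be called as `sectionUrl(url, $(e).attr('id'))`.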