# Extract data from non-HTML documents

> Learn how to extract data from document formats such as PDF with your Algolia crawler.

The Crawler can extract data from files like PDF and Word documents. To do this, it uses [Apache Tika](https://tika.apache.org) to extract a document's content and transform it into a basic HTML file.

## Limitations

Because it's difficult to translate non-HTML documents into HTML, there are limitations:

* A PDF document can break if it's exported with an unknown font.
* The transformed HTML has little semantic value: headings, paragraphs, and lists in the original document might not be marked in the HTML. This makes good relevancy hard to achieve.
* Document indexing is slower than classic HTML indexing.
* Language detection isn't available.

## Enable document extraction

To enable document extraction, add the [`fileTypesToMatch`](/doc/tools/crawler/apis/configuration/actions#param-file-types-to-match) parameter to at least one of your crawler's actions. For a list of supported file types, see [Supported file types](#supported-file-types).

The document's transformed HTML is available in the [`recordExtractor.$`](/doc/tools/crawler/apis/configuration/actions#param-record-extractor) parameter. The file type is available in the [`recordExtractor.fileType`](/doc/tools/crawler/apis/configuration/actions#param-record-extractor) parameter.

```js JavaScript icon=code theme={"system"}
({
  // [...]
  actions: [
    {
      indexName: "crawler-example",
      pathsToMatch: ["https://www.example.com/**"],
      // Match PDF and Word documents in addition to HTML pages
      fileTypesToMatch: ["pdf", "doc"],
      recordExtractor: ({ url, $, fileType }) => {
        // Log the transformed HTML and the detected file type
        console.log($.html(), fileType);
      },
    },
  ],
});
```

### Sample crawler configuration

For an example configuration for document extraction, see [`config.documents.js`](https://github.com/algolia/crawler-configurations-examples/blob/master/config.documents.js) on GitHub.

## Supported file types

The metadata shown in these examples might not be included in every document.

### PDF documents

* Associated extension: `.pdf`
* `fileTypesToMatch`: `pdf`

For example, in this `.pdf` file, Tika exposes the following HTML, which the crawler then passes to your `recordExtractor`.

A PDF document with the text: Test PDF file content

```html HTML icon=code-xml expandable theme={"system"}
test-docx-file.pages

Test PDF file content

```

### Microsoft Word documents

* Associated extensions: `.doc`, `.docx`
* `fileTypesToMatch`: `doc`

For example, in this `.doc` file, Tika exposes the following HTML, which the crawler then passes to its `recordExtractor`.

A Microsoft Word document with the text: Test DOC file content

```html HTML icon=code-xml theme={"system"}

Test DOC file content

```

### OpenDocument text documents

* Associated extension: `.odt`
* `fileTypesToMatch`: `odt`

### Microsoft Excel spreadsheets

* Associated extensions: `.xls`, `.xlsx`
* `fileTypesToMatch`: `xls`

For example, in this `.xls` file, Tika exposes the following HTML, which the crawler then passes to its `recordExtractor`.

A Microsoft Excel spreadsheet with the text: Test XLS file content

```html HTML icon=code-xml theme={"system"}

Feuille 1

Test XLS file content

&C&"Helvetica,Regular"&12&K000000&P

```

### OpenDocument spreadsheets

* Associated extension: `.ods`
* `fileTypesToMatch`: `ods`

### Microsoft PowerPoint documents

* Associated extensions: `.ppt`, `.pptx`
* `fileTypesToMatch`: `ppt`

For example, in this `.ppt` file, Tika exposes the following HTML, which the crawler then passes to its `recordExtractor`.

A Microsoft PowerPoint slide with the text: Test PPT file content

```html HTML icon=code-xml theme={"system"}

Test PPT file content

```

### OpenDocument presentations

* Associated extension: `.odp`
* `fileTypesToMatch`: `odp`

### Emails

* Associated extension: `.msg`
* `fileTypesToMatch`: `email`

The file type `email` includes all documents related to email. The Crawler supports the Outlook Mail Message (`.msg`) format.

For example, Tika converts this email into the following HTML:

Screenshot of an email with 'From: from@domain.com', 'To: to@domain.com', 'Subject: this is a mail to test message file', and message text.

```html HTML icon=code-xml expandable theme={"system"}
this is a mail to test msg file

this is a mail to test msg file

From: from@domain.com
To: to@domain.com
Recipients: to@domain.com

This message was sent using a msg file

```
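
For any of the file types above, a `recordExtractor` could turn the transformed HTML into one record per document. The following is a minimal sketch under stated assumptions, not the Crawler's prescribed approach: the function name `extractDocumentRecords` and the record fields `title` and `content` are hypothetical, and the code assumes that `$` behaves like a Cheerio instance and that `url` is a standard `URL` object. Because the extracted metadata might be missing, the title falls back to the file name taken from the URL.

```js
// Sketch only: `extractDocumentRecords` and the record fields are hypothetical names.
function extractDocumentRecords({ url, $, fileType }) {
  // The extracted <title> may be missing, so fall back to the file name in the URL.
  const fileName = url.pathname.split("/").pop() || "";
  const title = $("title").text().trim() || fileName;

  // The transformed HTML has little structure, so index the body as plain text
  // and collapse runs of whitespace into single spaces.
  const content = $("body").text().replace(/\s+/g, " ").trim();

  // Return no records for documents without extractable text.
  if (!content) return [];

  return [{ objectID: url.href, title, fileType, content }];
}
```

Under the same assumptions, this function could be passed as `recordExtractor: extractDocumentRecords` in an action that sets `fileTypesToMatch`. Storing `fileType` on the record makes it possible to filter or facet results by document type later.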