Extract data from non-HTML documents
The Crawler can extract data from files like PDFs and DOCs. To do this, it uses Apache Tika to extract a document’s content and transform it into a basic HTML file.
Limitations
Because it’s difficult to translate non-HTML documents into HTML, there are limitations:
- A PDF can break if it’s exported with an unknown font.
- The transformed HTML has little semantic value: headings, paragraphs, and lists in the original document might not be marked in the HTML. This makes good relevancy hard to achieve.
- Document indexing is slower than classic HTML indexing.
- Language detection isn’t available.
Enable document extraction
To enable document extraction, add the fileTypesToMatch parameter to at least one of your crawler’s actions.
The available fileTypesToMatch values are:
- html for web pages. This is the default when no fileTypesToMatch parameter is present.
- pdf for PDF documents
- doc, xls, and ppt for Microsoft Office documents
- odt, ods, and odp for OpenDocument files
- email for electronic mail documents
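When planning which values to list in fileTypesToMatch, the extensions given under "Supported file types" on this page can be collected into a lookup table. The following is a minimal sketch: the names FILE_TYPE_BY_EXTENSION and fileTypeForUrl are hypothetical, and the helper is only a planning aid. It does not describe how the Crawler itself detects file types.

```javascript
// Extension-to-type associations, taken from the "Supported file types" section.
const FILE_TYPE_BY_EXTENSION = {
  '.pdf': 'pdf',
  '.doc': 'doc',
  '.docx': 'doc',
  '.odt': 'odt',
  '.xls': 'xls',
  '.xlsx': 'xls',
  '.ods': 'ods',
  '.ppt': 'ppt',
  '.pptx': 'ppt',
  '.odp': 'odp',
  '.msg': 'email',
};

// Hypothetical helper: returns the fileTypesToMatch value for a URL,
// or null when the path has no extension or the extension isn't listed.
function fileTypeForUrl(url) {
  const { pathname } = new URL(url);
  const dot = pathname.lastIndexOf('.');
  if (dot === -1) return null;
  const extension = pathname.slice(dot).toLowerCase();
  return FILE_TYPE_BY_EXTENSION[extension] ?? null;
}
```

For example, fileTypeForUrl('https://www.example.com/report.docx') yields doc, so doc would need to be present in fileTypesToMatch for that file.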
The document’s transformed HTML is available in the recordExtractor’s $ parameter.
The file type is available in the recordExtractor’s fileType parameter.
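new Crawler({
  // ... other crawler settings
  actions: [
    {
      indexName: 'crawler-example',
      pathsToMatch: ['https://www.example.com/**'],
      // Also process PDF and Word documents matched by pathsToMatch.
      fileTypesToMatch: ['pdf', 'doc'],
      recordExtractor: ({ url, $, fileType }) => {
        // Log the HTML generated by Tika and the detected file type.
        console.log($.html(), fileType);
      },
    },
  ],
});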
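The example above only logs the transformed HTML. A minimal sketch of an extractor that turns a document into a single record might look like the following. The function name, the record attributes (title, content), and the use of the URL as objectID are assumptions for illustration, and $ is assumed to behave like a Cheerio instance, as the $.html() call above suggests.

```javascript
// Hypothetical extractor: builds one record per document.
// Attribute names are illustrative, not required by the Crawler.
function extractDocumentRecord({ url, $, fileType }) {
  const href = String(url);
  // The Office samples on this page have an empty <title>,
  // so fall back to the URL when no title is available.
  const title = $('title').text().trim() || href;
  // Collapse whitespace, because the generated HTML has little structure.
  const content = $('body').text().replace(/\s+/g, ' ').trim();
  return [{ objectID: href, title, content, fileType }];
}
```

It could then be referenced from an action as recordExtractor: extractDocumentRecord.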
Sample crawler configuration
For an example configuration for document extraction, see config.documents.js on GitHub.
Supported file types
PDF document
- Associated extension: .pdf
- fileTypesToMatch: pdf
For example, for a sample .pdf file, Tika exposes the following HTML, which the crawler then passes to your recordExtractor.
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="date" content="2018-07-17T13:35:40Z"/>
<meta name="pdf:PDFVersion" content="1.3"/>
<meta name="pdf:docinfo:title" content="test-docx-file.pages"/>
<meta name="xmp:CreatorTool" content="Pages"/>
<meta name="access_permission:modify_annotations" content="true"/>
<meta name="access_permission:can_print_degraded" content="true"/>
<meta name="dcterms:created" content="2018-07-17T13:35:40Z"/>
<meta name="Last-Modified" content="2018-07-17T13:35:40Z"/>
<meta name="dcterms:modified" content="2018-07-17T13:35:40Z"/>
<meta name="dc:format" content="application/pdf; version=1.3"/>
<meta name="Last-Save-Date" content="2018-07-17T13:35:40Z"/>
<meta name="pdf:docinfo:creator_tool" content="Pages"/>
<meta name="access_permission:fill_in_form" content="true"/>
<meta name="pdf:docinfo:modified" content="2018-07-17T13:35:40Z"/>
<meta name="meta:save-date" content="2018-07-17T13:35:40Z"/>
<meta name="pdf:encrypted" content="false"/>
<meta name="dc:title" content="test-docx-file.pages"/>
<meta name="modified" content="2018-07-17T13:35:40Z"/>
<meta name="Content-Type" content="application/pdf"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.pdf.PDFParser"/>
<meta name="meta:creation-date" content="2018-07-17T13:35:40Z"/>
<meta name="created" content="Tue Jul 17 13:35:40 UTC 2018"/>
<meta name="access_permission:extract_for_accessibility" content="true"/>
<meta name="access_permission:assemble_document" content="true"/>
<meta name="xmpTPg:NPages" content="1"/>
<meta name="Creation-Date" content="2018-07-17T13:35:40Z"/>
<meta name="access_permission:extract_content" content="true"/>
<meta name="access_permission:can_print" content="true"/>
<meta name="producer" content="Mac OS X 10.13.5 Quartz PDFContext"/>
<meta name="access_permission:can_modify" content="true"/>
<meta name="pdf:docinfo:producer" content="Mac OS X 10.13.5 Quartz PDFContext"/>
<meta name="pdf:docinfo:created" content="2018-07-17T13:35:40Z"/>
<title>test-docx-file.pages</title>
</head>
<body>
<div class="page">
<p/>
<p>Test PDF file content</p>
<p/>
</div>
</body>
</html>
The metadata presented here isn’t guaranteed to appear on every document.
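Because a given meta tag may be missing, an extractor should read metadata defensively. The sketch below assumes $ behaves like a Cheerio instance; readMeta and pdfMetadata are hypothetical helper names, and the returned attribute names are illustrative.

```javascript
// Hypothetical helper: returns the content of the first matching
// <meta name="..."> tag, or the fallback when it is absent or empty.
function readMeta($, name, fallback = null) {
  const value = $(`meta[name="${name}"]`).attr('content');
  return value === undefined || value === '' ? fallback : value;
}

// Example use with the meta names shown in the PDF sample above.
function pdfMetadata($) {
  return {
    title: readMeta($, 'dc:title'),
    createdAt: readMeta($, 'dcterms:created'),
    // Defaults to 0 when the page count isn't exposed.
    pageCount: Number(readMeta($, 'xmpTPg:NPages', 0)),
  };
}
```

Inside a recordExtractor, the result of pdfMetadata($) could be merged into the record being returned.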
Word document
- Associated extensions: .doc, .docx
- fileTypesToMatch: doc
For example, for a sample .doc file, Tika exposes the following HTML, which the crawler then passes to your recordExtractor.
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.microsoft.OfficeParser"/>
<meta name="Content-Type" content="application/msword"/>
<title>
</title>
</head>
<body>
<div class="header"/>
<p class="body">Test DOC file content</p>
<div class="footer"/>
</body>
</html>
The metadata presented here isn’t guaranteed to appear on every document.
OpenDocument text
- Associated extension: .odt
- fileTypesToMatch: odt
Excel spreadsheet
- Associated extensions: .xls, .xlsx
- fileTypesToMatch: xls
For example, for a sample .xls file, Tika exposes the following HTML, which the crawler then passes to your recordExtractor.
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.microsoft.OfficeParser"/>
<meta name="Content-Type" content="application/vnd.ms-excel"/>
<title>
</title>
</head>
<body>
<div class="page">
<h1>Feuille 1</h1>
<table>
<tbody>
<tr>
<td>Test XLS file content</td>
</tr>
</tbody>
</table>
<div class="outside">&C&"Helvetica,Regular"&12&K000000&P</div>
</div>
</body>
</html>
The metadata presented here isn’t guaranteed to appear on every document.
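Spreadsheet content arrives as table, tr, and td elements, as in the sample above. A minimal sketch for reading it row by row follows; extractRows is a hypothetical helper name and $ is assumed to behave like a Cheerio instance.

```javascript
// Hypothetical helper: returns each table row as an array of trimmed cell strings.
function extractRows($) {
  return $('table tr')
    .toArray()
    .map((row) =>
      $(row)
        .find('td')
        .toArray()
        .map((cell) => $(cell).text().trim())
    )
    // Skip rows that contain no non-empty cell.
    .filter((cells) => cells.some((text) => text !== ''));
}
```

Each returned row could then be mapped to its own record, or the cells joined into a single content attribute, depending on how the spreadsheet should be searched.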
OpenDocument spreadsheet
- Associated extension: .ods
- fileTypesToMatch: ods
PowerPoint document
- Associated extensions: .ppt, .pptx
- fileTypesToMatch: ppt
For example, for a sample .ppt file, Tika exposes the following HTML, which the crawler then passes to your recordExtractor.
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.microsoft.OfficeParser"/>
<meta name="Content-Type" content="application/vnd.ms-powerpoint"/>
<title>
</title>
</head>
<body>
<div class="slideShow">
<div class="slide">
<div class="slide-master-content"/>
<div class="slide-content">
<p>Test PPT file content</p>
<p/>
</div>
</div>
<div class="ocr"/>
</div>
</body>
</html>
The metadata presented here isn’t guaranteed to appear on every document.
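Since each slide is wrapped in its own div with the slide class in the sample above, one option is to create one record per slide. The sketch below is an illustration under assumptions: extractSlideRecords is a hypothetical name, the objectID scheme and attribute names are made up for the example, and $ is assumed to behave like a Cheerio instance.

```javascript
// Hypothetical extractor: one record per non-empty slide.
function extractSlideRecords({ url, $ }) {
  return $('div.slide')
    .toArray()
    .map((slide, index) => ({
      // Illustrative objectID scheme: document URL plus the slide position.
      objectID: `${url}#slide-${index + 1}`,
      slide: index + 1,
      content: $(slide).find('.slide-content').text().replace(/\s+/g, ' ').trim(),
    }))
    // Filter after mapping so slide numbers keep their original position.
    .filter((record) => record.content !== '');
}
```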
OpenDocument presentation
- Associated extension: .odp
- fileTypesToMatch: odp
Email documents
- Associated extension: .msg
- fileTypesToMatch: email
The email file type includes all documents related to email. The Crawler supports the Outlook Mail Message (.msg) format.
For example, Tika converts a sample email into the following HTML:
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="date" content="2017-06-01T15:24:31Z" />
<meta name="Message:To-Email" content="to@domain.com" />
<meta name="dc:description" content="this is a mail to test msg file" />
<meta name="subject" content="this is a mail to test msg file" />
<meta name="dc:creator" content="from@domain.com" />
<meta name="Message:From-Email" content="from@domain.com" />
<meta name="dcterms:created" content="2017-06-01T15:24:31Z" />
<meta name="Message-To" content="to@domain.com" />
<meta name="dcterms:modified" content="2017-06-01T15:24:31Z" />
<meta name="Last-Modified" content="2017-06-01T15:24:31Z" />
<meta name="Message-Recipient-Address" content="to@domain.com" />
<meta name="Message:Raw-Header:X-Unsent" content="1" />
<meta name="Message:Raw-Header:Subject" content="this is a mail to test msg file" />
<meta name="meta:mapi-message-class" content="NOTE" />
<meta name="Message:To-Display-Name" content="to@domain.com" />
<meta name="Last-Save-Date" content="2017-06-01T15:24:31Z" />
<meta name="Message:Raw-Header:MIME-Version" content="1.0" />
<meta name="meta:save-date" content="2017-06-01T15:24:31Z" />
<meta name="dc:title" content="this is a mail to test msg file" />
<meta name="Message:Raw-Header:Message-ID" content="<c58b1b52f61f4789ba40339c6e993440>" />
<meta name="modified" content="2017-06-01T15:24:31Z" />
<meta name="Content-Type" content="application/vnd.ms-outlook" />
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser" />
<meta name="X-Parsed-By" content="org.apache.tika.parser.microsoft.OfficeParser" />
<meta name="creator" content="from@domain.com" />
<meta name="Message:Raw-Header:From" content="from@domain.com" />
<meta name="meta:author" content="from@domain.com" />
<meta name="meta:creation-date" content="2017-06-01T15:24:31Z" />
<meta name="meta:mapi-from-representing-email" content="from@domain.com" />
<meta name="Creation-Date" content="2017-06-01T15:24:31Z" />
<meta name="Message-Cc" content="" />
<meta name="Message-Bcc" content="" />
<meta name="meta:mapi-from-representing-name" content="from@domain.com" />
<meta name="Message:Raw-Header:To" content="to@domain.com" />
<meta name="Message:From-Name" content="from@domain.com" />
<meta name="Author" content="from@domain.com" />
<meta name="Message-From" content="from@domain.com" />
<meta name="Message:To-Name" content="" />
<title>this is a mail to test msg file</title>
</head>
<body>
<h1>this is a mail to test msg file</h1>
<dl>
<dt>From</dt>
<dd>from@domain.com</dd>
<dt>To</dt>
<dd>to@domain.com</dd>
<dt>Recipients</dt>
<dd>to@domain.com</dd>
</dl>
<div class="message-body">
<p>This message was sent using a msg file </p>
</div>
</body>
</html>
The metadata presented here isn’t guaranteed to appear on every document.
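The email sample above exposes the sender, recipient, and subject as meta tags and the message text in a div with the message-body class. A minimal sketch of an extractor built on those names follows. extractEmailRecord is a hypothetical name, the record attributes are illustrative, and $ is assumed to behave like a Cheerio instance. Missing metadata results in null values.

```javascript
// Hypothetical extractor: builds one record from a converted .msg file.
function extractEmailRecord({ url, $ }) {
  // Returns null when the meta tag is missing or empty.
  const meta = (name) => $(`meta[name="${name}"]`).attr('content') || null;
  return [
    {
      objectID: String(url),
      // Prefer the subject meta tag, then fall back to <title>.
      subject: meta('subject') || $('title').text().trim() || null,
      from: meta('Message:From-Email'),
      to: meta('Message:To-Email'),
      date: meta('date'),
      body: $('div.message-body').text().replace(/\s+/g, ' ').trim(),
    },
  ];
}
```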