Crawler: Actions
Action[]
{
  actions: [
    {
      indexName: 'index_name',
      pathsToMatch: ['url_path', ...],
      fileTypesToMatch: ['file_type', ...],
      autoGenerateObjectIDs: true|false,
      schedule: 'every 1 day',
      recordExtractor: ({ url, $, contentLength, fileType, dataSources }) => {
      },
    },
  ],
}
About this parameter
Determines which web pages are turned into Algolia records, and how.
A single action defines:
- The URLs to crawl
- The extraction process for those pages
- The indices to which the extracted records are added
A single web page can match multiple actions. In this case, your crawler creates a record for each matched action.
Examples
{
  actions: [
    {
      indexName: 'dev_blog_algolia',
      pathsToMatch: ['https://blog.algolia.com/**'],
      fileTypesToMatch: ['pdf'],
      autoGenerateObjectIDs: false,
      schedule: 'every 1 day',
      recordExtractor: ({ url, $, contentLength, fileType, dataSources }) => {
        // ...
      }
    },
  ],
}
Parameters
Action
name
type: string
Optional
The unique identifier of this action (useful for debugging).
Required if schedule is set.
indexName
type: string
Required
The index name targeted by this action.
This value is appended to the indexPrefix, when specified.
schedule
type: string
Optional
How often to perform a complete crawl for this action.
For more information, see the schedule parameter.
pathsToMatch
type: string[]
Required
Determines which web pages to process with this action.
This list is checked against the URL of web pages using micromatch. You can use negation, wildcards, and more.
selectorsToMatch
type: string
Optional
Checks for DOM elements matching the given selectors. If the page doesn't contain any element matching the selectors, it's ignored.
You can use negation: prefix a selector with ! to ignore pages containing an element that matches it.
fileTypesToMatch
type: string[]
default: ["html"]
Optional
Set this value to index non-HTML documents. For a list of available file types, see Non-HTML documents. Documents are first converted to HTML with Apache Tika, then processed as an HTML page.
autoGenerateObjectIDs
type: bool
default: true
Generate an object ID for records that don't have one.
Set this parameter to false to use your own objectID values; the crawl fails if an extracted record lacks one.
recordExtractor
type: function
Required
A custom JavaScript function that lets you run your own code and extract what you want from a page. Your record extractor should return either an array of JSON objects or an empty array. If the function returns an empty array, the page is skipped.
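As a minimal sketch, an extractor might pull the title and meta description from each page (the record attribute names here are illustrative, not required):

```javascript
recordExtractor: ({ url, $ }) => {
  const title = $('head title').text().trim();
  // Returning an empty array skips pages without a title
  if (!title) return [];
  return [
    {
      objectID: url.href,
      title,
      description: $('meta[name="description"]').attr('content') || '',
    },
  ];
},
```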
action ➔ recordExtractor
$
type: object (Cheerio instance)
Optional
A Cheerio instance with the HTML of the crawled page.
url
type: Location object
Optional
A Location object with the URL of the crawled page.
fileType
type: string
Optional
The file type of the crawled page or document.
contentLength
type: number
Optional
The size of the crawled page, in bytes.
dataSources
type: object
Optional
Object with the external data sources of the current URL.
Each key of the object corresponds to an external data source configured for your crawler.
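For illustration, assuming a crawler with an external data source registered under the hypothetical key myCSV, an extractor could merge its fields into each record:

```javascript
recordExtractor: ({ url, $, dataSources }) => {
  return [
    {
      objectID: url.href,
      title: $('head title').text(),
      // `myCSV` is a hypothetical data source key from the crawler configuration
      ...dataSources.myCSV,
    },
  ];
},
```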
helpers
type: object
Optional
Functions that help you extract content and generate records.
recordExtractor ➔ helpers
docsearch
type: function
Optional
A function that extracts content and formats it to be compatible with DocSearch. It creates an optimized number of records for relevancy and hierarchy, and you can use it without DocSearch or to index non-documentation content.
For more examples, see the DocSearch documentation.
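A typical call looks like the sketch below; the selectors are assumptions about the target site's markup, not required values:

```javascript
recordExtractor: ({ url, $, helpers }) => {
  return helpers.docsearch({
    recordProps: {
      lvl0: {
        selectors: 'header .category', // hypothetical selector
        defaultValue: 'Documentation',
      },
      lvl1: 'article h1',
      lvl2: 'article h2',
      content: 'article p, article li',
    },
    aggregateContent: true,
  });
},
```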
article
type: function
Optional
A function that extracts content from pages identified as articles. Articles are recognized by one of the following:
This helper returns an object with the following properties:
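A hedged sketch of the usage pattern, assuming the helper accepts the extractor's url and $ and returns a single record object:

```javascript
recordExtractor: ({ url, $, helpers }) => {
  // Assumption: helpers.article returns one object describing the article;
  // wrap it in an array to produce one record for the page.
  const record = helpers.article({ url, $ });
  return [record];
},
```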
product
type: function
Optional
A function that extracts content from pages identified as product pages.
Product pages are recognized by looking for one of the following JSON-LD schema types:
This helper returns the following object:
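A hedged sketch, under the same assumption as the article helper (the helper takes the extractor's url and $ and returns one record object):

```javascript
recordExtractor: ({ url, $, helpers }) => {
  // Assumption: helpers.product reads the page's JSON-LD product data
  // and returns a single record object.
  const record = helpers.product({ url, $ });
  return [record];
},
```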
page
type: function
Optional
A function that extracts text content from pages, regardless of their type or category.
This helper returns the following object:
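A hedged sketch of the usage pattern, assuming the same calling convention as the other page-level helpers:

```javascript
recordExtractor: ({ url, $, helpers }) => {
  // Assumption: helpers.page returns one object with the page's text content.
  const record = helpers.page({ url, $ });
  return [record];
},
```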
codeSnippets
type: function
Optional
Within a
Use this property to populate one attribute in the record, not to create multiple records.
The helper returns an array of code objects with the following properties:
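A sketch of the intended pattern: one record whose code attribute (name illustrative) holds the array of snippet objects, rather than one record per snippet. The no-argument call is an assumption:

```javascript
recordExtractor: ({ url, $, helpers }) => {
  return [
    {
      objectID: url.href,
      title: $('head title').text(),
      // one attribute holding all snippets found on the page
      code: helpers.codeSnippets(),
    },
  ];
},
```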
splitContentIntoRecords
type: function
Optional
A function that extracts text from an HTML page and splits it into one or more records.
This can help prevent "record too big" errors.
For example, crawling a long HTML page can return an array of records that never exceeds a configured limit of 1,000 bytes per record.
To prevent duplicate results when searching for a word that appears in multiple records belonging to the same resource (page), enable deduplication in your index settings: set attributeForDistinct to an attribute shared by the records, such as url, and enable distinct.
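A sketch using the parameters documented below (part is an illustrative ordering attribute name):

```javascript
recordExtractor: ({ url, $, helpers }) => {
  // Attributes shared by every chunk produced from this page
  const baseRecord = {
    url: url.href,
    title: $('head title').text().trim(),
  };
  return helpers.splitContentIntoRecords({
    $elements: $('body'),
    baseRecord,
    maxRecordBytes: 1000,
    textAttributeName: 'text',
    orderingAttributeName: 'part',
  });
},
```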
helpers ➔ splitContentIntoRecords
$elements
type: string
default: $("body")
Optional
A Cheerio selector that determines which elements' textual content is extracted and turned into records.
baseRecord
type: object
default: {}
Optional
Attributes (and their values) to add to all resulting records.
maxRecordBytes
type: number
default: 10000
Optional
Maximum number of bytes allowed per record in the resulting Algolia index. To avoid errors, check the record size limits for your plan.
textAttributeName
type: string
default: text
Optional
Name of the attribute in which to store the text of each record.
orderingAttributeName
type: string
Optional
Name of the attribute in which to store the sequence number of each record.
helpers ➔ docsearch
recordProps
type: object
Required
Main docsearch configuration.
aggregateContent
type: boolean
default: true
Optional
Whether the helper automatically merges sibling elements and separates them by a line break.
indexHeadings
type: boolean | object
default: true
Optional
Whether the helper creates records for headings. If false, only records for the content level are created.
||||
recordVersion
|
type: string
default: v2
Optional
Change the version of the extracted records. It’s not correlated with the DocSearch version and may be incremented independently.
|
docsearch ➔ recordProps
lvl0
type: object
Required
Select the main category of the page.
You should index the
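For example (both the selector and the default value are assumptions about the target site):

```javascript
lvl0: {
  selectors: 'header .category', // hypothetical selector for the active category
  defaultValue: 'Documentation', // used when the selector matches nothing
},
```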
lvl1
type: string | string[]
Required
Select the main title of the page.
content
type: string | string[]
Required
Select the content elements of the page.
pageRank
type: string
Optional
Add a pageRank attribute to the extracted records.
||
lv2, lvl3, lvl4, lvl5, lvl6
|
type: string | string[]
Optional
Select other headings of the page.
Copy
|
||
*
type: string | string[] | object
Optional
All extra keys are added to the extracted records.
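For example, mixing a static value with a selector-based one (both extra keys and their selectors are illustrative):

```javascript
recordProps: {
  lvl1: 'article h1',
  content: 'article p',
  // extra keys: copied into every extracted record
  language: 'en', // static value
  tags: {
    selectors: 'head meta[name="keywords"]', // hypothetical selector
    defaultValue: 'general',
  },
},
```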