Crawler: Actions
Action[]
{
  actions: [
    {
      indexName: 'index_name',
      pathsToMatch: ['url_path', ...],
      fileTypesToMatch: ['file_type', ...],
      autoGenerateObjectIDs: true|false,
      schedule: 'every 1 day',
      recordExtractor: ({ url, $, contentLength, fileType, dataSources }) => { }
    },
  ],
}
About this parameter
Determines which web pages are translated into Algolia records and in what way.
A single action defines:
- the subset of your crawler’s websites it targets,
- the extraction process for those websites,
- and the indices to which the extracted records are pushed.
A single web page can match multiple actions. In this case, your crawler creates a record for each matched action.
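As an illustration of a page matching several actions, here is a minimal sketch. The two actions, the `example.com` URLs, and the `matchesPattern` and `matchingActions` functions are hypothetical and not part of the crawler. The matcher is a deliberately simplified stand-in that only understands a trailing `/**`, whereas the crawler itself relies on micromatch.

```javascript
// Two hypothetical actions whose path patterns overlap.
const actions = [
  { indexName: 'all_pages', pathsToMatch: ['https://example.com/**'] },
  { indexName: 'blog_posts', pathsToMatch: ['https://example.com/blog/**'] },
];

// Simplified stand-in for the crawler's matching logic (assumption):
// it only handles a trailing "/**" wildcard or an exact URL.
function matchesPattern(url, pattern) {
  if (pattern.endsWith('/**')) {
    // Drop the "**" and keep the trailing slash as a prefix.
    return url.startsWith(pattern.slice(0, -2));
  }
  return url === pattern;
}

// Returns every action with at least one pattern matching the URL.
function matchingActions(url, actionList) {
  return actionList.filter((action) =>
    action.pathsToMatch.some((pattern) => matchesPattern(url, pattern))
  );
}

// A blog post matches both actions, so one record per action would be created.
const matched = matchingActions('https://example.com/blog/hello', actions);
const matchedIndexNames = matched.map((action) => action.indexName);
```

In this sketch the blog post URL is picked up by both actions, while a URL outside `/blog/` is picked up only by the first one.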
Examples
{
  actions: [
    {
      indexName: 'dev_blog_algolia',
      pathsToMatch: ['https://blog.algolia.com/**'],
      fileTypesToMatch: ['pdf'],
      autoGenerateObjectIDs: false,
      schedule: 'every 1 day',
      recordExtractor: ({ url, $, contentLength, fileType, dataSources }) => {
        ...
      }
    },
  ],
}
Parameters
Action
name
type: string
Optional
The unique identifier of this action (useful for debugging).
Required if

indexName
type: string
Required
The index name targeted by this action. This value is appended to the

schedule
type: string
Optional
How often to perform a complete crawl for this action. See main property

pathsToMatch
type: string[]
Required
Determines which web pages match this action. This list is checked against the URL of each web page using micromatch. You can use negation, wildcards, and more.

selectorsToMatch
type: string[]
Optional
Checks for the presence of DOM nodes matching the given selectors: if the page doesn’t contain any node matching the selectors, it’s ignored.
You can also check for the absence of selectors by using negation: if you want to ignore pages that contain a

fileTypesToMatch
type: string[]
default: ["html"]
Optional
Set this value if you want to index documents. Chosen file types are converted to HTML using Apache Tika, then treated as normal HTML pages. See the documents guide for a list of available

autoGenerateObjectIDs
type: boolean
default: true
Optional
Generate an

recordExtractor
type: function
Required
A
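The matching parameters above can be combined in a single action. The following is a sketch only: the index name, URLs, and CSS classes are hypothetical, and the array form of `selectorsToMatch` with a `!` prefix for negation is an assumption based on the description above.

```javascript
// Hypothetical action: index product pages, skipping the archive
// and any page that contains a ".featured" node.
const productAction = {
  indexName: 'products',
  pathsToMatch: [
    // Wildcard: everything under /products/
    'https://example.com/products/**',
    // Negation: exclude the archive subtree
    '!https://example.com/products/archive/**',
  ],
  // Assumed syntax: require ".product", reject pages containing ".featured"
  selectorsToMatch: ['.product', '!.featured'],
  // Also process PDF documents in addition to the default HTML pages
  fileTypesToMatch: ['html', 'pdf'],
  // Minimal extractor: one record per page, keyed by the page URL
  recordExtractor: ({ url, fileType }) => [
    { objectID: url.href, fileType },
  ],
};
```

Listing `'html'` explicitly matters here: once `fileTypesToMatch` is set, the default of `["html"]` no longer applies on its own.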
action ➔ recordExtractor
$
type: object (Cheerio instance)
Optional
A Cheerio instance containing the HTML of the crawled page.

url
type: Location object
Optional
A

fileType
type: string
Optional
The file type of the crawled page (e.g., html, pdf, …).

contentLength
type: number
Optional
The number of bytes in the crawled page.

dataSources
type: object
Optional
Object containing the external data sources of the current URL. Each key of the object corresponds to an

helpers
type: object
Optional
Collection of functions to help you extract content and generate records.
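To show how these parameters fit together, here is a minimal sketch of a `recordExtractor`. The attribute names (`title`, `description`) and the CSS selectors are illustrative assumptions, and returning an empty array to skip a page reflects how we understand the crawler to behave rather than something stated on this page.

```javascript
// Sketch of a recordExtractor that builds one record per page.
const basicExtractor = ({ url, $, contentLength, fileType }) => {
  // Read the page title through the Cheerio instance.
  const title = $('head title').text().trim();

  // Assumption: an empty array means "index nothing for this page".
  if (!title) {
    return [];
  }

  return [
    {
      // Use the full page URL as a stable identifier.
      objectID: url.href,
      title,
      // Fall back to an empty string when the meta tag is missing.
      description: $('meta[name="description"]').attr('content') || '',
      fileType,
      contentLength,
    },
  ];
};
```

Setting `objectID` explicitly keeps the extractor valid even when `autoGenerateObjectIDs` is set to `false`.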
recordExtractor ➔ helpers
docsearch
type: function
Optional
You can call the
You can find more examples in the DocSearch documentation.

article
type: function
Optional
Within a
This helper returns an object with the following properties:

product
type: function
Optional
Within a
This helper returns the following object:

page
type: function
Optional
Within a
This helper returns the following object:

splitContentIntoRecords
type: function
Optional
The
In the preceding
Assuming that the automatic generation of
To prevent duplicate results when searching for a word that appears in multiple records belonging to the same resource (page), we recommend that you enable
Please be aware that using
helpers ➔ splitContentIntoRecords
$elements
type: string
default: $("body")
Optional
A Cheerio selector that determines from which elements textual content is extracted and turned into records.

baseRecord
type: object
default: {}
Optional
Attributes (and their values) to add to all resulting records.

maxRecordBytes
type: number
default: 10000
Optional
Maximum number of bytes allowed per record in the resulting Algolia index. Refer to the record size limits for your plan to prevent record-size errors.

textAttributeName
type: string
default: text
Optional
Name of the attribute in which to store the text of each record.

orderingAttributeName
type: string
Optional
Name of the attribute in which to store the number of each record.
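A call to `splitContentIntoRecords` using the parameters above might look like the following sketch. The `title` attribute, the `part` attribute name, and the 1000-byte limit are illustrative choices, not recommended values.

```javascript
// Sketch: split a long page into several smaller records.
const splittingExtractor = ({ url, $, helpers }) => {
  // Attributes copied onto every record produced by the helper.
  const baseRecord = {
    url: url.href,
    title: $('head title').text().trim(),
  };

  return helpers.splitContentIntoRecords({
    baseRecord,
    // Extract text from the whole body (the documented default).
    $elements: $('body'),
    // Keep each record well under the plan's record size limit.
    maxRecordBytes: 1000,
    // Each chunk of text is stored in the "text" attribute.
    textAttributeName: 'text',
    // Each record's number is stored in the hypothetical "part" attribute.
    orderingAttributeName: 'part',
  });
};
```

Because every record shares the same `url` through `baseRecord`, that attribute is a natural candidate for de-duplicating results from the same page at query time.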
helpers ➔ docsearch
recordProps
type: object
Required
Main docsearch configuration.

aggregateContent
type: boolean
default: true
Optional
Whether the helpers automatically merge sibling elements and separate them by a line break. For:

indexHeadings
type: boolean | object
default: true
Optional
Whether the helpers create records for headings. When

recordVersion
type: string
default: v2
Optional
Change the version of the extracted records. It’s not correlated with the DocSearch version and can be incremented independently.
docsearch ➔ recordProps
lvl0
type: object
Required
Select the main category of the page.
You should index the

lvl1
type: string | string[]
Required
Select the main title of the page.

content
type: string | string[]
Required
Select the content elements of the page.

pageRank
type: string
Optional
Add an attribute

lvl2, lvl3, lvl4, lvl5, lvl6
type: string | string[]
Optional
Select other headings of the page.

*
type: string | string[] | object
Optional
All extra keys are added to the extracted records.
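Putting the `docsearch` options and `recordProps` together, a call might look like the sketch below. All CSS selectors are hypothetical and depend on your site's markup. The `selectors` and `defaultValue` keys inside `lvl0` and the `language` extra key are assumptions, since this page doesn't spell out the shape of those values.

```javascript
// Sketch of a recordExtractor delegating to the docsearch helper.
const docsExtractor = ({ helpers }) =>
  helpers.docsearch({
    recordProps: {
      // Main category (assumed shape: selectors plus a fallback value).
      lvl0: {
        selectors: '.page-category',
        defaultValue: 'Documentation',
      },
      // Main title, then progressively deeper headings.
      lvl1: 'article h1',
      lvl2: 'article h2',
      lvl3: 'article h3',
      // Content elements of the page.
      content: ['article p', 'article li'],
      // Passed as a string, per the parameter table above.
      pageRank: '30',
      // Extra key (hypothetical): added to the extracted records.
      language: 'en',
    },
    // Both options are shown with their documented defaults.
    aggregateContent: true,
    indexHeadings: true,
  });
```

Only `lvl0`, `lvl1`, and `content` are required in `recordProps`; the other keys can be dropped if your pages don't need them.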