Crawler concepts
Learn more about configuration templates, helpers, actions, and crawler indices.
Configuration templates
Configuration templates help you create your crawler configuration. They contain pre-built actions for extracting data from your site, based on known page layouts. When creating your crawler, you can choose between these templates:
- Default
- Static site generators: these templates extract content from sites built with generators such as Docusaurus and VuePress.
After choosing a template, you can edit the configuration to change or extend it.
Helpers
Helpers are functions that make it easier to extract relevant content from your page. You can use them in your actions. For example:
- If you have a page that you declared as an Article with metadata, you can use the `helpers.article()` helper to extract records from it.
- If you have a long page, you can use the `helpers.splitContentIntoRecords()` helper to split the page into smaller chunks, as sketched below.
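For example, here's a minimal sketch of `helpers.splitContentIntoRecords()` inside a `recordExtractor`. The selector, `baseRecord` attributes, and `maxRecordBytes` value are illustrative assumptions, not recommendations; check the Helpers reference for the exact parameters.

```js
recordExtractor: ({ url, $, helpers }) => {
  // Split the page body into multiple records, each under ~10 KB.
  return helpers.splitContentIntoRecords({
    $elements: $("article"), // Cheerio selection of the content to split (assumed selector)
    baseRecord: {
      url: url.href, // attributes shared by every resulting record
      title: $("head > title").text(),
    },
    maxRecordBytes: 10000, // assumed size limit; tune for your Algolia plan
  });
},
```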
For more information, see Helpers.
Actions
Actions tell the crawler which information to extract from matching URLs. They're part of your crawler configuration. A crawler can have up to 30 actions.
Each action in the configuration must include:

- `indexName`: the name of the Algolia index where you want to store the extracted records.
- `pathsToMatch`: patterns for URLs to which this action should apply. For example, `https://www.algolia.com/blog/**` tells the crawler to run this action on all pages of the Algolia blog.
- `recordExtractor`: a function that defines what information to extract from each visited page and formats it as records for your Algolia index. You can use helpers to write less code, as shown in the examples below.
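For orientation, here's a minimal `recordExtractor` that builds records directly, without helpers. The `$` argument is a Cheerio-like object for querying the fetched page; the selectors here are placeholders for your own page structure.

```js
recordExtractor: ({ url, $ }) => {
  // Return an array of records to index for this page.
  return [
    {
      objectID: url.pathname, // unique identifier for the record
      title: $("head > title").text(),
      content: $("article p").text(), // placeholder selector, adapt to your layout
    },
  ];
},
```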
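The following action extracts DocSearch-compatible records from the Algolia documentation using the `helpers.docsearch()` helper: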
```js
actions: [
  {
    indexName: "algolia-docs",
    pathsToMatch: ["https://www.algolia.com/doc/**"],
    recordExtractor: ({ helpers }) => {
      return helpers.docsearch({
        recordProps: {
          lvl1: ["header h1", "article h1", "main h1", "h1", "head > title"],
          content: ["article p, article li", "main p, main li", "p", "li"],
          lvl0: {
            selectors: "",
            defaultValue: "Documentation",
          },
          lvl2: ["article h2", "main h2", "h2"],
          lvl3: ["article h3", "main h3", "h3"],
          lvl4: ["article h4", "main h4", "h4"],
          lvl5: ["article h5", "main h5", "h5"],
          lvl6: ["article h6", "main h6", "h6"],
        },
        aggregateContent: true,
        recordVersion: "v3",
      });
    },
  },
],
```
For complete configurations, see the examples repository on GitHub.
Indices created by the crawler
An index is where the Algolia Crawler stores the data extracted from your pages. In most cases, you'll have one index for each content type, such as articles or products. You can find all your indices, including those that weren't created by the Crawler, in the Algolia dashboard.
The Algolia Crawler creates three types of indices:
- Production indices don't have suffixes. The production index contains the records extracted by the latest crawl. If you manually start a new crawl while a scheduled crawl is ongoing, these records are added to a temporary index instead.
- Backup indices have the `.bak` suffix. For extra safety, you can keep a backup of your last production index. To learn more, see the `saveBackup` parameter, sketched after this list.
- Temporary indices have the `.tmp` suffix. During a crawl, the Crawler adds extracted records to a temporary index. If the crawl is successful, the temporary index from the latest crawl replaces the production index.
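As a sketch, backups are enabled with a single top-level flag in the crawler configuration, assuming `saveBackup` behaves as described above; the other values are placeholders:

```js
{
  appId: "YOUR_APP_ID", // placeholder credentials
  apiKey: "YOUR_API_KEY",
  saveBackup: true, // keep a backup of the last production index (.bak suffix)
  actions: [
    // ...
  ],
}
```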