Crawler Configuration API

Parameters

appId

The ID of the Algolia application where the crawler stores its extracted records.

apiKey

API key for your targeted application.

indexPrefix

Prefix added to the names of all indices defined in the crawler’s configuration.

rateLimit

Maximum number of concurrent crawling tasks per second that can run for this configuration.

schedule

How often a complete crawl should be performed.
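Taken together, the identification and scheduling parameters above might appear in a configuration fragment like this sketch. All values are placeholders, and the schedule string syntax shown is an assumption to verify against the crawler documentation:

```javascript
// Sketch of a crawler configuration fragment (placeholder values).
const config = {
  appId: 'YOUR_APP_ID',        // target Algolia application
  apiKey: 'YOUR_API_KEY',      // key with write access to the target indices
  indexPrefix: 'crawler_',     // prepended to every index name in this config
  rateLimit: 8,                // crawling tasks per second
  schedule: 'every 1 day at 3:00 pm', // assumed example of a full-crawl schedule
};
```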

startUrls

The crawler uses these URLs as entry points to start crawling.

sitemaps

URLs found in sitemaps are treated as startUrls for the crawler: they are used as starting points for the crawl.

ignoreRobotsTxtRules

When set to true, the crawler will ignore rules set in your robots.txt.

extraUrls

URLs found in extraUrls are treated as startUrls for your crawler: they are used as starting points for the crawl.
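The entry-point parameters above could be declared together as in the following sketch. The URLs are placeholders for illustration only:

```javascript
// Sketch: entry points for a crawl (placeholder URLs).
const config = {
  // The crawl starts from these URLs.
  startUrls: ['https://www.example.com/'],
  // URLs discovered in these sitemaps are also used as starting points.
  sitemaps: ['https://www.example.com/sitemap.xml'],
  // Additional starting points, beyond startUrls and sitemaps.
  extraUrls: ['https://www.example.com/changelog'],
  // When true, rules in robots.txt are not honored.
  ignoreRobotsTxtRules: false,
};
```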

maxUrls

Limits the number of URLs your crawler can process.

maxDepth

Limits crawling to URLs up to the specified link depth, inclusive.
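The two limits above bound the scope of a crawl; a minimal sketch with assumed example values:

```javascript
// Sketch: bounding the scope of a crawl (example values).
const config = {
  maxUrls: 5000, // stop after processing 5,000 URLs
  maxDepth: 3,   // don't follow links deeper than 3 levels from an entry point
};
```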

saveBackup

Whether to save a backup of your production index before it is overwritten by the index generated during a crawl.

renderJavaScript

When true, all web pages are rendered with a headless Chrome browser, and the crawler extracts content from the rendered HTML.

initialIndexSettings

Defines the settings for the indices that the crawler updates.
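As a sketch, index settings are keyed by index name; the index name and attribute names below are placeholders, while the setting keys themselves are standard Algolia index settings:

```javascript
// Sketch: initial settings for a crawler-managed index (placeholder names).
const config = {
  initialIndexSettings: {
    crawler_docs: {
      searchableAttributes: ['title', 'description', 'content'],
      attributesForFaceting: ['category'],
    },
  },
};
```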

exclusionPatterns

URL patterns that tell the crawler which URLs to skip.

ignoreQueryParams

Filters out specified query parameters from crawled URLs. This can help you avoid indexing duplicate URLs.
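These two filtering parameters might be combined as follows. The glob-style pattern matching shown is an assumption to verify, and all patterns and parameter names are placeholders:

```javascript
// Sketch: excluding URLs and stripping query parameters (placeholder values).
const config = {
  // Skip matching URLs entirely (glob-style patterns assumed).
  exclusionPatterns: ['https://www.example.com/private/**', '**/*.pdf'],
  // Strip these tracking parameters so URL variants index as one page.
  ignoreQueryParams: ['utm_source', 'utm_medium', 'ref'],
};
```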

requestOptions

Set a proxy and headers for the crawler’s web requests.
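A sketch of what this could look like; the exact shape of requestOptions is an assumption, and the proxy address and header values are placeholders:

```javascript
// Sketch: proxy and custom headers for crawler requests (assumed shape).
const config = {
  requestOptions: {
    proxy: 'http://proxy.example.com:8080',
    headers: {
      'Accept-Language': 'en-US',
    },
  },
};
```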

linkExtractor

Determines the function used to extract URLs from pages.

externalDataSources

Defines external data sources you want to retrieve during every crawl and make available to your extractor function.

login

This property defines how the crawler acquires a session cookie.

actions

Determines which web pages are translated into Algolia records and in what way.
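An action pairs a set of URL patterns with an extraction function that turns each matching page into records. The sketch below assumes a Cheerio-like `$` selector is passed to the extractor; the index name, patterns, and selectors are placeholders:

```javascript
// Sketch: one action mapping matched pages to records (placeholder values).
const config = {
  actions: [
    {
      indexName: 'crawler_docs',
      pathsToMatch: ['https://www.example.com/docs/**'],
      // Receives the page and returns an array of records to index.
      recordExtractor: ({ url, $ }) => [
        {
          objectID: url.href,          // one record per page, keyed by URL
          title: $('h1').text(),
          content: $('main p').text(),
        },
      ],
    },
  ],
};
```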