What IP address can I use for IP whitelisting?
The Crawler uses a static IP address that you can whitelist if necessary.
What is the user agent of the Crawler (useful for whitelisting)?
When fetching pages, the crawler identifies itself with the following user agent:
`Algolia Crawler/xx.xx.xx`, where each `xx` is a number. The version at the end of the user agent changes regularly as the product evolves.
You might want to specify our user agent in your `robots.txt` file. In that case, specify it as `User-agent: Algolia Crawler`, without any version number.
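For example, a `robots.txt` that allows the Crawler everywhere while restricting other bots could look like this (the `/private/` path is a placeholder):

```txt
User-agent: Algolia Crawler
Disallow:

User-agent: *
Disallow: /private/
```

An empty `Disallow:` directive means the matching user agent is allowed to crawl everything.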
That being said, whitelisting the user agent manually (e.g., through nginx or other custom validation) may need updating over time to keep letting the crawler fetch pages. Consider using IP validation instead.
One of my pages was not crawled
There are several possibilities for why this might have happened:
- Crawling a website completely can take hours, depending on its size: make sure that the crawling operation has finished.
- Some pages may not be linked to from other pages: make sure there is a way to navigate from the website's start pages to the missing page, or that the missing page is listed in the sitemaps. If it is inaccessible, you may want to add its URL as a start URL in your crawler's configuration file.
- The page may have been ignored if it refers to a canonical URL, if it does not match the `pathsToMatch` of any of your crawler's actions, or if it matches any `exclusionPatterns`. For more information, check out the question: when are pages skipped or ignored?
- If the page is generated with JavaScript, set `renderJavaScript: true` in your configuration (note: this makes the crawling process slower).
- If the page is behind a login wall, you may need to set up the `login` property of your configuration.
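The points above map onto a handful of configuration properties. As a rough sketch (URLs and the index name are placeholders, and the exact property set depends on your setup):

```javascript
const config = {
  startUrls: ["https://www.example.com/"],
  sitemaps: ["https://www.example.com/sitemap.xml"],
  exclusionPatterns: ["https://www.example.com/private/**"],
  renderJavaScript: true, // for JavaScript-generated pages; slows down crawling
  login: { /* credentials for pages behind a login wall */ },
  actions: [
    {
      indexName: "example_index",
      pathsToMatch: ["https://www.example.com/**"],
      recordExtractor: ({ url, $ }) => [{ objectID: url.href, title: $("title").text() }],
    },
  ],
};
```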
If none of these solve your problem, an error may have occurred while crawling the page. Check your logs in the Monitoring and URL Inspector tabs. You can also use the URL Tester in the Editor tab of the Admin to get details on why a URL was skipped or ignored.
The Crawler doesn’t see the same HTML as me
Sometimes, websites behave differently depending on the user agent they receive. You can see the HTML discovered by the Crawler in the URL Tester.
If this HTML is missing information, the last thing to check after debugging your selectors is whether the difference is due to the Crawler's user agent. You can do this with browser extensions or with cURL: send several requests with different user agents and compare the results.
```sh
curl http://example.com
curl -H "User-Agent: Algolia Crawler" http://example.com
curl -H "User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:71.0) Gecko/20100101 Firefox/71.0" http://example.com
```
How do I see the Crawler tab in the Algolia Dashboard?
- Make sure that your crawler configuration file was saved on crawler.algolia.com.
- Make sure that you have admin ACL permissions on the indices specified in your crawler’s configuration file.
- Make sure that your `appId` and `apiKey` are correct and that they grant the crawler admin permission on these indices.
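As a sketch, the relevant credential properties at the top of a crawler configuration look roughly like this (the values are placeholders):

```javascript
const config = {
  appId: "YOUR_APP_ID",         // the application that owns the target indices
  apiKey: "YOUR_ADMIN_API_KEY", // must grant the admin ACL on those indices
  actions: [
    {
      indexName: "example_index",
      pathsToMatch: ["https://www.example.com/**"],
      recordExtractor: ({ url }) => [{ objectID: url.href }],
    },
  ],
};
```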
When are pages skipped or ignored?
At addition time
Pages skipped at addition time are not added to the URL database. This can happen for two reasons:
- The URL doesn't match at least one of your actions' `pathsToMatch`.
- The URL matches one of your crawler's `exclusionPatterns`.
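These two addition-time filters can be sketched as follows. This is an illustrative approximation, not the Crawler's actual matcher: it converts glob-style patterns like `https://example.com/docs/**` into regular expressions, and the function names are hypothetical.

```javascript
// Convert a glob pattern to a RegExp: "**" matches across path segments,
// a single "*" stays within one segment.
function globToRegExp(glob) {
  // Escape regex metacharacters, but keep "*" for wildcard handling.
  const escaped = glob.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
  const doubleStar = escaped.replace(/\*\*/g, "\u0000"); // placeholder for "**"
  const singleStar = doubleStar.replace(/\*/g, "[^/]*");
  return new RegExp("^" + singleStar.replace(/\u0000/g, ".*") + "$");
}

function passesAdditionTimeFilters(url, pathsToMatch, exclusionPatterns) {
  const matchesAny = (patterns) => patterns.some((p) => globToRegExp(p).test(url));
  // The URL must match at least one action's pathsToMatch...
  if (!matchesAny(pathsToMatch)) return false;
  // ...and must not match any exclusionPatterns.
  return !matchesAny(exclusionPatterns);
}
```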
At retrieval time
Pages skipped at retrieval time are added to the URL database, retrieved, but not processed. Those are flagged “Ignored” in our interface. This can happen for a number of reasons:
- The `robots.txt` check didn't allow this URL to be crawled.
- The URL is a redirect (note, we add the redirect target to the URL database but skip the rest of the page).
- The page’s HTTP status code is not 200.
- The content type is not one of the expected ones.
- The page contains a canonical link. Note: we add the canonical target as a page to crawl (according to the same addition-time filters) and then skip the current page.
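The retrieval-time checks above can be sketched as a simple decision function. This is an illustrative sketch only, not the Crawler's internals: the `page` shape and the expected content types are assumptions made for the example.

```javascript
const EXPECTED_CONTENT_TYPES = ["text/html"]; // assumption for this sketch

// Returns why a fetched page would be flagged "Ignored", or ignored: false
// if it would be processed.
function retrievalTimeStatus(page) {
  if (!page.allowedByRobotsTxt)
    return { ignored: true, reason: "robots.txt disallows this URL" };
  if (page.isRedirect)
    return { ignored: true, reason: "redirect (target queued instead)" };
  if (page.statusCode !== 200)
    return { ignored: true, reason: `HTTP status ${page.statusCode}` };
  if (!EXPECTED_CONTENT_TYPES.includes(page.contentType))
    return { ignored: true, reason: "unexpected content type" };
  if (page.canonicalUrl && page.canonicalUrl !== page.url)
    return { ignored: true, reason: "canonical link points elsewhere (canonical queued instead)" };
  return { ignored: false };
}
```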
When are records deleted?
Launching a crawl completely clears the state of your URL database. When the crawl completes, your old indices are overwritten by the data indexed during the new run. If you want to keep a backup of your old index, set the `saveBackup` parameter of your crawler to `true`.
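In the configuration file, that setting is a single top-level boolean, sketched here alongside a placeholder comment for the rest of the configuration:

```javascript
const config = {
  saveBackup: true, // keep a copy of the previous index before it is overwritten
  // ...the rest of your crawler configuration
};
```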