
Troubleshooting site access issues

The crawler reports site access (fetch) issues when it can’t access your site. The crawler might run into the following site access issues, each indicated by one or more error messages:

  • Problems with SSL certificates: SSL error
  • URL redirects: HTTP redirect (301, 302), HTTP redirect (301, 302) - Not followed, JavaScript redirect - Not followed
  • Document type actions: File type didn't match any action
  • Document extraction timeout: Document extraction timeout reached
  • Tika errors: Document extraction: unprocessable document, Tika Error
  • Robots.txt restrictions: Forbidden by robots.txt
  • HTTP status errors: Content-Type not supported, HTTP Bad Request (400), HTTP Connection Error (502, 503, 504), HTTP Forbidden (403), HTTP Gone (410), HTTP Internal Server Error (500), HTTP Not Found (404), HTTP Not Implemented (501), HTTP Status Code not supported, HTTP Too Many Requests (429), HTTP Unauthorized (401)
  • DNS errors: DNS error
  • Network errors: Fetch timeout, Network error

SSL issues

SSL (Secure Sockets Layer) errors indicate that your site’s domain name doesn’t match the domain name listed in its SSL certificate. This can happen if the domain name changed but the SSL certificate wasn’t updated, or if you accidentally use one certificate for several domains.

When the Algolia Crawler checks your site’s SSL certificate, it validates its authenticity by comparing it to trusted certificates issued by a certificate authority (CA). Different browsers maintain their own lists of trusted CAs, so a site might render in a browser but cause issues for the Algolia Crawler (or vice versa). The error message UNABLE_TO_VERIFY_LEAF_SIGNATURE indicates that the Crawler couldn’t verify the authenticity of your SSL certificate.

Investigation

If an intermediate certificate authority (CA) issued your SSL certificate, you must also install the intermediate CA’s certificate on your server so the full certificate chain can be verified. To check that your SSL certificate is set up correctly, use a tool such as OpenSSL.

Using OpenSSL, run a command like this from your terminal:

openssl s_client -showcerts -connect algolia.com:443 -servername algolia.com

If your SSL certificates are set up correctly, the output from the command includes the statement Verify return code: 0 (ok), and no errors.

If there’s a problem, you’ll see error messages indicating what’s wrong. For example:

CONNECTED(00000005)
depth=0 C = FR, L = PARIS, O = Algolia, CN = Algolia.com
verify error:num=20:unable to get local issuer certificate
verify return:1
depth=0 C = FR, L = PARIS, O = Algolia, CN = Algolia.com
verify error:num=21:unable to verify the first certificate
verify return:1

Solution

How you fix your SSL certificate setup depends on the web server you’re using. For information about different web servers, see What’s My Chain Cert.

Redirect issues

The crawler might follow a URL redirect but not crawl the destination page. This can occur if the destination page isn’t matched by any of the crawler’s actions.

Solution

Ensure the redirect’s destination URL is matched by one of your crawler’s actions.
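
For example, if the redirect’s destination sits under a path the crawler doesn’t currently match, you can widen an action’s URL patterns. The following is a minimal sketch in the same JSON-style fragment used elsewhere on this page; the index name and URL pattern are placeholders, and a complete action also needs a record extractor:

"actions": [
  {
    "indexName": "my_index",
    "pathsToMatch": [
      "https://www.example.com/**"
    ]
  }
]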

Issues crawling non-HTML documents

The Crawler uses the Apache Tika toolkit to extract content from non-HTML documents such as PPT, XLS, and PDF files. This process can fail for the reasons described in the following sections.

Document type actions

If the crawler encounters a document with no matching action in the crawler configuration, it will generate an error.

Solution

Review your crawler configuration and ensure that the document type is included in an action.
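
For example, assuming your configuration uses the fileTypesToMatch action parameter to control which document types an action handles, an action that should process PDF files could list that type explicitly. This is a sketch with placeholder values, not a complete configuration:

"actions": [
  {
    "indexName": "my_index",
    "pathsToMatch": [
      "https://www.example.com/docs/**"
    ],
    "fileTypesToMatch": ["html", "pdf"]
  }
]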

Document extraction timeout

The request took too much time to resolve. This can happen when encountering an unusually large or complex document, such as a PDF or an Excel spreadsheet with many tables.

Solution

If possible, split the document into smaller parts.

Tika errors

PDF document extraction failed due to a Tika error. Tika generates errors for a variety of reasons.

Solutions

  • PDF is password-protected: remove the password protection.
  • PDF is too big: split the document into smaller parts.
  • PDF is in an incompatible format: convert the PDF to an earlier version.

Domain issues

Some site access issues are caused by domain-level problems, such as robots.txt restrictions, HTTP status errors, DNS errors, and network errors.

Forbidden by robots.txt

The crawler won’t follow a URL if the path isn’t allowed by your robots.txt directives.

Solution

Update your robots.txt file for each domain to ensure paths are allowed.
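
For example, a robots.txt file like this (with placeholder paths) lets crawlers fetch everything except a private section:

# Allow all user agents to crawl the site, except the /private/ section
User-agent: *
Disallow: /private/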

HTTP status errors

If a URL returns an HTTP error status, the crawler can’t access that page.

Solution

  1. Check that you haven’t hit a rate limit for your Algolia plan.
  2. Use an HTML markup validator to identify and resolve HTML errors in your source content.
  3. Fix any HTTP server errors on the domain itself: ensure the correct HTTP response codes are being sent (you can check them with a command such as the one shown after this list).
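
For example, you can check the status code and headers a URL returns with curl (replace the URL with one of the affected pages):

curl -I https://www.example.com/some-page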

DNS errors

You may see DNS errors if the crawler can’t connect to your domain due to a:

  • DNS timeout: your DNS server didn’t reply to the request in time
  • DNS lookup error: your DNS server failed to locate your domain name.

Solution

  • If you’re running your own DNS server, ensure it’s online and not overloaded.
  • Ensure the Crawler’s IP address (34.66.202.43) isn’t blocked by a firewall rule
  • Ensure that both UDP and TCP requests are allowed
  • Look at your DNS records. Check that your A and CNAME records point to the correct IP address and hostname, respectively.
  • Check that all your name servers point to your site’s IP addresses.

If you’ve changed your DNS configuration within the last 72 hours, you may need to wait for your changes to propagate across the global DNS network.
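
For example, you can query your DNS records directly with dig (the domain is a placeholder):

# Check the A record
dig www.example.com A +short

# Check the CNAME record
dig www.example.com CNAME +short

# List the domain's authoritative name servers
dig example.com NS +short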

Network errors

Network errors can happen either before or while crawling a URL. These errors typically mean the server is overloaded.

Solution

  • Ensure the Crawler’s IP address (34.66.202.43) isn’t blocked by a firewall rule
  • Use tools like tcpdump and Wireshark to check traffic and look for anomalies. The error may be in any server component that handles network traffic. For example, overloaded network interfaces may drop packets leading to timeouts (inability to establish a connection) and reset connections (because a port was mistakenly closed).

If you can’t find anything suspicious, contact your hosting company.
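
For example, you can capture the traffic exchanged with the Crawler’s IP address using tcpdump and then inspect the capture file in Wireshark. The network interface and file name are placeholders:

sudo tcpdump -i eth0 -w crawler-traffic.pcap host 34.66.202.43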

Meta tags

If a crawled page contains the noindex or nofollow meta tags, the crawler won’t:

  • noindex: index anything on that page
  • nofollow: follow any links from that page.

Solution

Either remove the meta tags from the source content or ignore them by setting ignoreNoIndex or ignoreNoFollowTo to true.
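
For example, to make the crawler ignore both meta tags, set the corresponding parameters in your crawler configuration. This is a fragment in the same JSON style as the other snippets on this page; adjust it to fit your configuration format:

"ignoreNoIndex": true,
"ignoreNoFollowTo": true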

Ad blocker

If your site uses a tool like Google Tag Manager to load content from third-party services, such as product listings from online stores, the Crawler’s ad blocker can prevent that content from being extracted.

Solution

Update your crawler’s configuration:

  • Turn off the Crawler’s ad blocker
  • Enable renderJavaScript so the crawler can extract the dynamic content loaded by third-party scripts.
"renderJavaScript": {
  "enabled": true,
  "adblock": false
}