
Troubleshooting site access issues

The crawler reports site access (fetch) issues when it can’t access your site. The crawler might run into the following site access issues, each indicated by one or more error messages:

  • Problems with SSL certificates: SSL error
  • URL redirects: HTTP redirect (301, 302), HTTP redirect (301, 302) - Not followed, JavaScript redirect - Not followed
  • Document type actions: File type didn't match any action
  • Document extraction timeout: Document extraction timeout reached
  • Tika errors: Document extraction: unprocessable document, Tika Error
  • Robots.txt restrictions: Forbidden by robots.txt
  • HTTP status errors: Content-Type not supported, HTTP Bad Request (400), HTTP Connection Error (502, 503, 504), HTTP Forbidden (403), HTTP Gone (410), HTTP Internal Server Error (500), HTTP Not Found (404), HTTP Not Implemented (501), HTTP Status Code not supported, HTTP Too Many Requests (429), HTTP Unauthorized (401)
  • DNS errors: DNS error
  • Network errors: Fetch timeout, Network error

SSL issues

SSL (Secure Sockets Layer) errors indicate that your site’s domain name doesn’t match the domain name listed in its SSL certificate. This can happen if the domain name changed but the SSL certificate wasn’t updated, or if you accidentally use one certificate for several domains.

When the Algolia Crawler checks your site’s SSL certificate, it validates its authenticity by comparing it to trusted certificates issued by a certificate authority (CA). Different browsers maintain their own lists of trusted CAs, so a site might render in a browser but cause issues for the Algolia Crawler (or vice versa). The error message UNABLE_TO_VERIFY_LEAF_SIGNATURE indicates that the Crawler couldn’t verify the authenticity of your SSL certificate.

Investigation

If an intermediate certificate authority (CA) issued your SSL certificate, you must also install the intermediate CA’s certificate on your server so the full certificate chain can be verified. To check that your SSL certificate is set up correctly, use a tool such as OpenSSL.

Using OpenSSL, run a command like this from your terminal:

openssl s_client -showcerts -connect algolia.com:443 -servername algolia.com

If your SSL certificates are set up correctly, the output from the command includes the statement Verify return code: 0 (ok), and no errors.

If there’s a problem, you’ll see error messages indicating what’s wrong. For example:

CONNECTED(00000005)
depth=0 C = FR, L = PARIS, O = Algolia, CN = Algolia.com
verify error:num=20:unable to get local issuer certificate
verify return:1
depth=0 C = FR, L = PARIS, O = Algolia, CN = Algolia.com
verify error:num=21:unable to verify the first certificate
verify return:1

Solution

How you fix your SSL certificate setup depends on the web server you’re using. For information about different web servers, see What’s My Chain Cert.

Redirect issues

The crawler might follow a URL redirect but not crawl the destination page. This can occur if the destination page isn’t matched by any of the crawler’s actions.

Solution

Ensure the redirect’s destination URL is matched by one of your crawler’s actions.
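
For example, if the redirect’s destination sits under a path the crawler doesn’t currently match, you can widen an action’s URL patterns. The following is a minimal sketch in the same JSON-style fragment used elsewhere on this page; the index name and URL pattern are placeholders, and a complete action also needs a record extractor:

"actions": [
  {
    "indexName": "my_index",
    "pathsToMatch": [
      "https://www.example.com/**"
    ]
  }
]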

Issues crawling non-HTML documents

The Crawler uses the Apache Tika toolkit to extract content from non-HTML documents such as PPT, XLS, and PDF files. This process can fail for the reasons described in the following sections.

Document type actions

If the crawler encounters a document with no matching action in the crawler configuration, it will generate an error.

Solution

Review your crawler configuration and ensure that the document type is included in an action.
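
For example, assuming your configuration uses the fileTypesToMatch action parameter to control which document types an action handles, an action that should process PDF files could list that type explicitly. This is a sketch with placeholder values, not a complete configuration:

"actions": [
  {
    "indexName": "my_index",
    "pathsToMatch": [
      "https://www.example.com/docs/**"
    ],
    "fileTypesToMatch": ["html", "pdf"]
  }
]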

Document extraction timeout

The request took too much time to resolve. This can happen when encountering an unusually large or complex document, such as a PDF or an Excel spreadsheet with many tables.

Solution

If possible, split the document into smaller parts.

Tika errors

PDF document extraction failed due to a Tika error. Tika generates errors for a variety of reasons.

Solutions

  • PDF is password-protected: remove the password protection.
  • PDF is too big: split the document into smaller parts.
  • PDF is in an incompatible format: convert the PDF to an earlier version.

Domain issues

Some site access issues are caused by domain-level problems, such as robots.txt restrictions, HTTP status errors, DNS errors, and network errors.

Forbidden by robots.txt

The crawler won’t follow a URL if the path isn’t allowed by your robots.txt directives.

Solution

Update your robots.txt file for each domain to ensure paths are allowed.
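
For example, a robots.txt file like this (with placeholder paths) lets crawlers fetch everything except a private section:

# Allow all user agents to crawl the site, except the /private/ section
User-agent: *
Disallow: /private/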

HTTP status errors

If a URL returns an HTTP error status, the crawler can’t access that page.

Solution

  1. Check that you haven’t hit a rate limit for your Algolia plan.
  2. Use an HTML markup validator to identify and resolve HTML errors in your source content.
  3. Fix any HTTP server errors on the domain itself: ensure the correct HTTP response codes are being sent (you can check them with a command such as the one shown after this list).
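
For example, you can check the status code and headers a URL returns with curl (replace the URL with one of the affected pages):

curl -I https://www.example.com/some-page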

DNS errors

You may see DNS errors if the crawler can’t connect to your domain due to a:

  • DNS timeout: your DNS server didn’t reply to the request in time
  • DNS lookup error: your DNS server failed to locate your domain name.

Solution

  • If you’re running your own DNS server, ensure it’s online and not overloaded.
  • Ensure the Crawler’s IP address (34.66.202.43) isn’t blocked by a firewall rule
  • Ensure that both UDP and TCP requests are allowed
  • Look at your DNS records. Check that your A and CNAME records point to the correct IP address and hostname, respectively.
  • Check that all your name servers point to your site’s IP addresses.

If you’ve changed your DNS configuration within the last 72 hours, you may need to wait for your changes to propagate across the global DNS network.
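
For example, you can query your DNS records directly with dig (the domain is a placeholder):

# Check the A record
dig www.example.com A +short

# Check the CNAME record
dig www.example.com CNAME +short

# List the domain's authoritative name servers
dig example.com NS +short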

Network errors

Network errors can happen either before or while crawling a URL. These errors typically mean the server is overloaded.

Solution

  • Ensure the Crawler’s IP address (34.66.202.43) isn’t blocked by a firewall rule
  • Use tools like tcpdump and Wireshark to check traffic and look for anomalies. The error may be in any server component that handles network traffic. For example, overloaded network interfaces may drop packets leading to timeouts (inability to establish a connection) and reset connections (because a port was mistakenly closed).

If you can’t find anything suspicious, contact your hosting company.
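
For example, you can capture the traffic exchanged with the Crawler’s IP address using tcpdump and then inspect the capture file in Wireshark. The network interface and file name are placeholders:

sudo tcpdump -i eth0 -w crawler-traffic.pcap host 34.66.202.43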

Meta tags

If a crawled page contains the noindex or nofollow meta tags, the crawler won’t:

  • noindex: index anything on that page
  • nofollow: follow any links from that page.

Solution

Either remove the meta tags from the source content or ignore them by setting ignoreNoIndex or ignoreNoFollowTo to true.
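
For example, to make the crawler ignore both meta tags, set the corresponding parameters in your crawler configuration. This is a fragment in the same JSON style as the other snippets on this page; adjust it to fit your configuration format:

"ignoreNoIndex": true,
"ignoreNoFollowTo": true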

Ad blocker

If your site uses a tool like Google Tag Manager to load content from third-party services, such as product listings from online stores, the Crawler’s ad blocker can prevent that content from being extracted.

Solution

Update your crawler’s configuration:

  • Turn off the Crawler’s ad blocker
  • Enable renderJavaScript so the crawler can extract the dynamic content loaded by third-party scripts.
"renderJavaScript": {
  "enabled": true,
  "adblock": false
}