Troubleshooting site access issues
The crawler reports site access (fetch) issues when it can’t access your site. Issues can be indicated by:
- The crawl status (`Failed` or `Skipped`) and category (`Fetch`). Reasons for this status include specific meta tags on your page, or the page being blocked by the Crawler's ad blocker.
- An error message.
The Crawler might run into site access issues due to:
| Issue | Error messages |
|---|---|
| Problems with SSL certificates | `SSL error` |
| URL redirects | `HTTP redirect (301, 302)`, `HTTP redirect (301, 302) - Not followed`, `JavaScript redirect - Not followed` |
| Document type actions | `File type didn't match any action` |
| Document extraction timeout | `Document extraction timeout reached` |
| Tika errors | `Document extraction: unprocessable document`, `Tika Error` |
| Robots.txt restrictions | `Forbidden by robots.txt` |
| HTTP status errors | `Content-Type not supported`, `HTTP Bad Request (400)`, `HTTP Connection Error (502, 503, 504)`, `HTTP Forbidden (403)`, `HTTP Gone (410)`, `HTTP Internal Server Error (500)`, `HTTP Not Found (404)`, `HTTP Not Implemented (501)`, `HTTP Status Code not supported`, `HTTP Too Many Requests (429)`, `HTTP Unauthorized (401)` |
| DNS error | `DNS error` |
| Network error | `Fetch timeout`, `Network error` |
SSL issues
SSL (Secure Sockets Layer) errors indicate that your site's domain name doesn't match the domain name listed in its SSL certificate. This can happen if the domain name changed but the certificate wasn't updated, or if you accidentally use one certificate for several domains.
When the Algolia Crawler checks your site’s SSL certificate, it validates its authenticity by comparing it to trusted certificates issued by a certificate authority (CA).
Different browsers maintain their own lists of trusted CAs. As a result, a site may render correctly in a browser but still cause issues for the Algolia Crawler (or vice versa).
The error message `UNABLE_TO_VERIFY_LEAF_SIGNATURE` indicates that the Crawler couldn't verify the authenticity of your SSL certificate.
Investigation
If an intermediate certificate authority (CA) has verified your SSL certificate, you must also install their verification certificate on your site. To ensure your SSL certificate is set up correctly, use tools like OpenSSL.
Using OpenSSL, run a command like this from your terminal:
```sh
openssl s_client -showcerts -connect algolia.com:443 -servername algolia.com
```
If your SSL certificates are set up correctly, the output from the command includes the statement `Verify return code: 0 (ok)` and no errors.
If there’s a problem, you’ll see error messages indicating what’s wrong. For example:
```
CONNECTED(00000005)
depth=0 C = FR, L = PARIS, O = Algolia, CN = Algolia.com
verify error:num=20:unable to get local issuer certificate
verify return:1
depth=0 C = FR, L = PARIS, O = Algolia, CN = Algolia.com
verify error:num=21:unable to verify the first certificate
verify return:1
```
Solution
How you fix your SSL certificate setup depends on the web server you’re using. For information about different web servers, see What’s My Chain Cert.
Redirect issues
The crawler might follow a URL redirect but not crawl the new page. This can occur if the redirect target isn't covered by the crawler's actions.
Solution
- Add any missing domains.
- Update `pathsToMatch` in the crawler configuration to allow the redirected page to be crawled (see the sketch below).
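For example, if pages on your site redirect to a different domain, that domain must match one of your actions. The following is a minimal sketch with placeholder URLs and index name, not your actual configuration:

```js
{
  startUrls: ["https://www.example.com/"],
  actions: [
    {
      indexName: "my_index", // placeholder index name
      pathsToMatch: [
        "https://www.example.com/**",
        // Add the domain and path the redirect points to,
        // otherwise the redirected page isn't crawled:
        "https://docs.example.com/**",
      ],
      recordExtractor: ({ url, $ }) => [
        { objectID: url.href, title: $("title").text() },
      ],
    },
  ],
}
```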
Issues crawling non-HTML documents
The Crawler uses the Apache Tika toolkit to extract content from non-HTML documents such as PPT, XLS, and PDF. This process can fail if:
- You don’t have an action to handle this document type
- It took too long to extract content from the document
- Tika generated an error due to an incompatible or password-protected document.
Document type actions
If the crawler encounters a document with no matching action in the crawler configuration, it will generate an error.
Solution
Review your crawler configuration and ensure that the document type is included in an action.
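For example, to have one action handle PDF, XLS, and PPT files as well as HTML pages, list those types in the action. This is a minimal sketch with placeholder values; it assumes the `fileTypesToMatch` action parameter and the `fileType` argument to `recordExtractor` available in current Crawler configurations:

```js
{
  actions: [
    {
      indexName: "my_index", // placeholder index name
      pathsToMatch: ["https://www.example.com/files/**"],
      // List every type this action should handle, not just HTML:
      fileTypesToMatch: ["html", "pdf", "xls", "ppt"],
      recordExtractor: ({ url, fileType }) => [
        // Minimal record; add the attributes you need.
        { objectID: url.href, fileType },
      ],
    },
  ],
}
```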
Document extraction timeout
Content extraction took too long to complete. This can happen with an unusually large or complex document, such as a PDF or an Excel spreadsheet with many tables.
Solution
If possible, split the document into smaller parts.
Tika errors
PDF document extraction failed due to a Tika error. Tika generates errors for a variety of reasons.
Solutions
| Cause of Tika error | Solution |
|---|---|
| PDF is password-protected | Remove password protection |
| PDF is too big | Split the document into smaller parts |
| PDF is in an incompatible format | Convert the PDF to an earlier version |
Domain issues
Some site access issues are caused by domain-level problems such as:
Forbidden by robots.txt
The crawler won't follow a URL if the path isn't allowed by your `robots.txt` directives.
Solution
Update your `robots.txt` file for each domain to ensure paths are allowed.
HTTP status errors
If a URL returns an HTTP error status, the crawler can't access that page.
Solution
- Check that you haven't hit a rate limit for your Algolia plan.
- Use an HTML markup validator to identify and resolve HTML errors in your source content.
- Fix any HTTP server errors on the domain itself and ensure the correct HTTP response codes are being sent. To confirm what the server actually returns, you can reproduce the request yourself (see the sketch below).
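As a quick, generic check (not part of the Crawler), you can fetch the failing URL and compare the returned status code with the error the Crawler reports. This sketch uses Node.js 18+ with its built-in `fetch`; the URL and user agent string are placeholders:

```js
// Quick status check with Node.js 18+ built-in fetch.
// The URL and User-Agent string are placeholders.
const url = "https://www.example.com/page-that-fails";

fetch(url, { headers: { "User-Agent": "Algolia Crawler" } })
  .then((res) => console.log(`${res.status} ${res.statusText}`))
  .catch((err) => console.error("Request failed:", err.message));
```

If the status code differs from what your browser shows, the server is likely treating bot traffic differently, for example through rate limiting or a firewall rule.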
DNS errors
You may see DNS errors if your crawler can’t connect with your domain due to a:
- DNS timeout: your DNS server didn’t reply to the request in time
- DNS lookup error: your DNS server failed to locate your domain name.
Solution
- If you’re running your own DNS server, ensure it’s online and not overloaded.
- Ensure the Crawler's IP address (`34.66.202.43`) isn't blocked by a firewall rule.
- Ensure that both UDP and TCP requests are allowed.
- Look at your DNS records. Check that your A and CNAME records point to the correct IP address and hostname, respectively.
- Check that all your name servers point to your site’s IP addresses.
If you’ve changed your DNS configuration within the last 72 hours, you may need to wait for your changes to propagate across the global DNS network.
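To confirm that your domain resolves from outside your own network, you can run a quick lookup. This sketch uses Node.js's built-in `dns` module with a placeholder domain; it's a generic check, not part of the Crawler:

```js
// Quick DNS check using Node.js built-ins. The domain is a placeholder.
const dns = require("node:dns").promises;

const domain = "www.example.com";

Promise.all([
  dns.resolve4(domain),                      // A records
  dns.resolveCname(domain).catch(() => []),  // CNAME records (may not exist)
])
  .then(([a, cname]) => {
    console.log("A records:", a);
    console.log("CNAME records:", cname);
  })
  .catch((err) => console.error("DNS lookup failed:", err.code));
```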
Network errors
Network errors can happen either before or while crawling a URL. These errors typically mean the server is overloaded.
Solution
- Ensure the Crawler's IP address (`34.66.202.43`) isn't blocked by a firewall rule.
- Use tools like tcpdump and Wireshark to check traffic and look for anomalies. The error may be in any server component that handles network traffic. For example, overloaded network interfaces may drop packets, leading to timeouts (inability to establish a connection) and reset connections (because a port was mistakenly closed).
If you can’t find anything suspicious, contact your hosting company.
Meta tags
If a crawled page contains the `noindex` or `nofollow` meta tags, the crawler won't:

- `noindex`: index anything on that page
- `nofollow`: follow any links from that page.
Solution
Either remove the meta tags from the source content or ignore them by setting `ignoreNoIndex` or `ignoreNoFollowTo` to `true`.
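For example, a minimal sketch of the relevant options, assuming both are set at the top level of your crawler configuration:

```js
{
  // Index pages even if they contain a "noindex" meta tag:
  ignoreNoIndex: true,
  // Follow links even from pages with a "nofollow" meta tag:
  ignoreNoFollowTo: true,
}
```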
Ad blocker
If your site uses a tool like Google Tag Manager to load content from third-party services, such as product listings from online stores, the Crawler's ad blocker can prevent that content from being extracted.
Solution
Update your crawler’s configuration:
- Turn off the Crawler's ad blocker.
- Enable `renderJavaScript` so the crawler can extract the dynamic content loaded by third-party scripts.
```js
"renderJavaScript": {
  "enabled": true,
  "adblock": false
}
```