# Troubleshooting site access issues

> The crawler reports site access (fetch) issues when it can't access your site.

Issues can be indicated by:

* The [crawl status (`Failed` or `Skipped`) and category (`Fetch`)](/doc/tools/crawler/troubleshooting/crawl-status). Reasons for this status include specific [meta tags](#meta-tags) on your page, or the page being blocked by the Crawler's [ad blocker](#ad-blocker).
* An [error message](/doc/tools/crawler/troubleshooting/error-messages)

The Crawler might run into site access issues due to:

| Issue                                                                                                 | Error messages                                                                                                                                                                                                                                                                                                                                                          |
| ----------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| [Problems with SSL certificates](#ssl-issues)                                                         | `SSL error`                                                                                                                                                                                                                                                                                                                                                             |
| [URL redirects](#redirect-issues)                                                                     | `HTTP redirect (301, 302)`<br />`HTTP redirect (301, 302) - Not followed`<br />`JavaScript redirect - Not followed`                                                                                                                                                                                                                                                     |
| [Document type actions](#document-extraction-timeout)                                                 | `File type didn't match any action`                                                                                                                                                                                                                                                                                                                                     |
| [Document extraction timeout](#document-extraction-timeout)                                           | `Document extraction timeout reached`                                                                                                                                                                                                                                                                                                                                   |
| [Tika errors](#tika-errors)                                                                           | `Document extraction: unprocessable document`<br />`Tika Error`                                                                                                                                                                                                                                                                                                         |
| [Safety check breaches](#safety-check-breaches)                                                       | `Too many failed URLs`<br />`Too many missing records`                                                                                                                                                                                                                                                                                                                  |
| [Robots.txt restrictions](/doc/tools/crawler/troubleshooting/fetching-issues#forbidden-by-robots-txt) | `Forbidden by robots.txt`                                                                                                                                                                                                                                                                                                                                               |
| [HTTP status errors](#http-status-errors)                                                             | `Content-Type not supported`<br />`HTTP Bad Request (400)`<br />`HTTP Connection Error (502,503,504)`<br />`HTTP Forbidden (403)`<br />`HTTP Gone (410)`<br />`HTTP Internal Server Error (500)`<br />`HTTP Not Found (404)`<br />`HTTP Not Implemented (501)`<br />`HTTP Status Code not supported`<br />`HTTP Too Many Requests (429)`<br />`HTTP Unauthorized (401)` |
| [DNS error](#dns-errors)                                                                              | `DNS error`                                                                                                                                                                                                                                                                                                                                                             |
| [Network error](#network-errors)                                                                      | `Fetch timeout`<br />`Network error`                                                                                                                                                                                                                                                                                                                                    |

## SSL issues

[SSL (Secure Sockets Layer)](https://en.wikipedia.org/wiki/Transport_Layer_Security#SSL_1.0,_2.0,_and_3.0) errors indicate that your site's domain name doesn't match the domain name listed in its SSL certificate.
This can happen if the domain name changed but the SSL certificate wasn't updated, or if you accidentally use one certificate for several domains.

When the Algolia Crawler checks your site's SSL certificate, it validates its authenticity by comparing it to trusted certificates issued by a certificate authority (CA).
Different browsers maintain their own trusted CA lists.
As a result, a site can render correctly in a browser but still cause issues for the Algolia Crawler (or vice versa).
The error message `UNABLE_TO_VERIFY_LEAF_SIGNATURE` indicates that the Crawler couldn't verify the authenticity of your SSL certificate.

To learn more, see [How certificate chains work](https://knowledge.digicert.com/solution/how-certificate-chains-work).

### Investigation

If an intermediate certificate authority (CA) has verified your SSL certificate,
you must also install *their* verification certificate on your site.
To ensure your SSL certificate is set up correctly, use tools like [OpenSSL](https://www.openssl.org/).

Using OpenSSL, run a command like this from your terminal:

```sh Command line icon=square-terminal theme={"system"}
openssl s_client -showcerts -connect algolia.com:443 -servername algolia.com
```

If your SSL certificates are set up correctly,
the output from the command includes the statement `Verify return code: 0 (ok)`, and no errors.

If there's a problem, you'll see error messages indicating what's wrong. For example:

```sh Command line icon=square-terminal theme={"system"}
CONNECTED(00000005)
depth=0 C = FR, L = PARIS, O = Algolia, CN = Algolia.com
verify error:num=20:unable to get local issuer certificate
verify return:1
depth=0 C = FR, L = PARIS, O = Algolia, CN = Algolia.com
verify error:num=21:unable to verify the first certificate
verify return:1
```
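To pick the verification problems out of longer `openssl s_client` output, a short script can help. The following is a minimal Python sketch, not part of the Crawler. The function name `find_verify_errors` is hypothetical, and the sketch assumes the `verify error:num=<code>:<message>` line format shown in the example output above.

```python
import re

# Matches lines such as:
# "verify error:num=20:unable to get local issuer certificate"
VERIFY_ERROR = re.compile(r"^verify error:num=(\d+):(.+)$", re.MULTILINE)


def find_verify_errors(openssl_output: str) -> list[tuple[int, str]]:
    """Return (code, message) pairs for each verify error found in the output."""
    return [
        (int(code), message.strip())
        for code, message in VERIFY_ERROR.findall(openssl_output)
    ]


if __name__ == "__main__":
    # Sample text copied from the example output in this section.
    sample = (
        "CONNECTED(00000005)\n"
        "depth=0 C = FR, L = PARIS, O = Algolia, CN = Algolia.com\n"
        "verify error:num=20:unable to get local issuer certificate\n"
        "verify return:1\n"
    )
    for code, message in find_verify_errors(sample):
        print(f"{code}: {message}")
```

An empty result means the output contained no `verify error` lines. It doesn't replace checking for `Verify return code: 0 (ok)`.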

### Solution

How you fix your SSL certificate setup depends on the web server you're using.
For information about different web servers, see [What's My Chain Cert](https://whatsmychaincert.com/).

## Redirect issues

A crawler might follow a URL redirect but won't crawl the new page.
This might occur if the new page isn't included in the crawler's actions.

### Solution

* [Add any missing domains](/doc/tools/crawler/getting-started/create-crawler).
* Update [`pathsToMatch`](/doc/tools/crawler/apis/configuration/actions#param-paths-to-match) in the crawler configuration to allow the redirected page to be crawled.
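To decide which of the two fixes applies, it can help to resolve the redirect target and compare its hostname with the domains you've added. The following Python sketch is illustrative only: `redirect_target` and `is_domain_configured` are hypothetical helpers, the domain comparison is an exact hostname match, and the sketch doesn't evaluate `pathsToMatch` patterns.

```python
from urllib.parse import urljoin, urlsplit


def redirect_target(original_url: str, location: str) -> str:
    """Resolve a Location header value (absolute or relative) against the original URL."""
    return urljoin(original_url, location)


def is_domain_configured(url: str, configured_domains: set[str]) -> bool:
    """Check whether the URL's hostname is one of the given domains (exact match only)."""
    host = urlsplit(url).hostname or ""
    return host in {domain.lower() for domain in configured_domains}


if __name__ == "__main__":
    # example.com and www.example.org are placeholder domains.
    target = redirect_target("https://example.com/old", "https://www.example.org/new")
    print(target, is_domain_configured(target, {"example.com"}))
```

If the target's hostname isn't configured, add the domain. If it is, check whether the target path is covered by `pathsToMatch`.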

## Issues crawling non-HTML documents

The Crawler uses the [Apache Tika toolkit](https://tika.apache.org/) to extract content from non-HTML documents such as PPT,
XLS, and PDF.
This process can fail if:

* You don't have an [action to handle this document type](#document-type-actions)
* It took [too long to extract content](#document-extraction-timeout) from the document
* [Tika generated an error](#tika-errors) due to an incompatible or password-protected document.

For more information, see [Extract data from non-HTML documents](/doc/tools/crawler/extracting-data/non-html-documents).

### Document type actions

If the crawler encounters a document with no matching action in the crawler configuration,
it generates an error.

#### Solution

Review your [crawler configuration](/doc/tools/crawler/getting-started/crawler-configuration#decide-what-content-you-want-to-include) and ensure that the document type is included in an action.

### Document extraction timeout

The request took too much time to resolve.
This can happen when encountering an unusually large or complex document,
such as a PDF or an Excel spreadsheet with many tables.

#### Solution

If possible, split the document into smaller parts.

### Tika errors

PDF document extraction failed due to a Tika error.
Tika generates errors for a variety of reasons.

For more information, see the [Apache Tika documentation](https://tika.apache.org/3.3.1/parser.html).

#### Solutions

| Cause of Tika error              | Solution                              |
| -------------------------------- | ------------------------------------- |
| PDF is password-protected        | Remove password-protection            |
| PDF is too big                   | Split the document into smaller parts |
| PDF is in an incompatible format | Convert the PDF to an earlier version |

## Safety check breaches

If you've configured a [safety check](/doc/tools/crawler/getting-started/crawler-configuration#safety-checks)
to stop crawling at a certain threshold,
this error message is shown when that limit is reached.

### Solution

Use the [URL Inspector](/doc/tools/crawler/getting-started/monitoring#url-inspector) to see what caused the failures.

## Domain issues

Some site access issues are caused by domain-level problems such as:

* [`robots.txt` restrictions](/doc/tools/crawler/troubleshooting/fetching-issues#forbidden-by-robots-txt)
* [HTTP status errors](#http-status-errors)
* [Network errors](#network-errors).

### Forbidden by `robots.txt`

The crawler won't follow a URL if the path isn't allowed by your `robots.txt` directives.

#### Solution

Update your `robots.txt` file for each domain to ensure paths are allowed.
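To test a `robots.txt` change before deploying it, you can evaluate the rules locally. This is a minimal sketch using Python's standard `urllib.robotparser` module. The helper name `is_allowed` and the user agent `ExampleBot` are placeholders, and Python's rule matching may differ in edge cases from how the Crawler interprets `robots.txt`.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical rules: block everything under /private/ for all user agents.
    rules = "User-agent: *\nDisallow: /private/\n"
    for url in ("https://example.com/docs/", "https://example.com/private/page"):
        print(url, is_allowed(rules, "ExampleBot", url))
```

Replace `ExampleBot` with the user agent your crawler sends, and run the check against the `robots.txt` file of each domain.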

### HTTP status errors

If a URL returns an HTTP error status code, the crawler can't access that page.

#### Solution

1. Check that you haven't hit a [rate limit for your Algolia plan](/doc/guides/security/api-keys/in-depth/api-key-restrictions#rate-limit).
2. Use an [HTML markup validator](https://validator.w3.org/) to identify and resolve HTML errors in your source content. Fix any HTTP server errors on the domain itself: ensure the correct HTTP response codes are being sent.
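When you reproduce a failing request yourself, for example with a browser's developer tools, you can match the status code you observe against the error messages listed in the table at the top of this page. The lookup below is a hedged sketch: the name `crawler_message_for` is hypothetical, and it assumes each message corresponds to the status codes that appear in its name. It returns `None` for any other code, because this page doesn't say how those codes are reported.

```python
from typing import Optional

# Status codes and the matching error messages from the table on this page.
STATUS_MESSAGES = {
    400: "HTTP Bad Request (400)",
    401: "HTTP Unauthorized (401)",
    403: "HTTP Forbidden (403)",
    404: "HTTP Not Found (404)",
    410: "HTTP Gone (410)",
    429: "HTTP Too Many Requests (429)",
    500: "HTTP Internal Server Error (500)",
    501: "HTTP Not Implemented (501)",
    502: "HTTP Connection Error (502,503,504)",
    503: "HTTP Connection Error (502,503,504)",
    504: "HTTP Connection Error (502,503,504)",
}


def crawler_message_for(status: int) -> Optional[str]:
    """Return the listed error message for a status code, or None if it isn't listed."""
    return STATUS_MESSAGES.get(status)


if __name__ == "__main__":
    for status in (404, 503, 200):
        print(status, crawler_message_for(status))
```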

### DNS errors

You may see DNS errors if your crawler can't connect with your domain due to one of these conditions:

* **DNS timeout:** your DNS server didn't reply to the request in time
* **DNS lookup error:** your DNS server failed to locate your domain name.

#### Solution

* If you're running your own DNS server, ensure it's online and not overloaded.
* Ensure the Crawler's IP address (`34.66.202.43`) isn't blocked by a firewall rule.
* Ensure that both UDP and TCP requests are allowed.
* Look at your DNS records. Check that your A and CNAME records point to the correct IP address and hostname, respectively.
* Check that all your name servers point to your site's IP addresses.

<Info>
  If you've changed your DNS configuration within the last 72 hours,
  you may need to wait for your changes to propagate across the global DNS network.
</Info>
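As a quick check of the records themselves, you can ask your local resolver which addresses a hostname maps to. This is a minimal Python sketch using the standard `socket` module. The helper name `resolve` is hypothetical, and the result reflects DNS as seen from your machine, which may differ from what the Crawler sees.

```python
import socket


def resolve(hostname: str) -> list[str]:
    """Return the sorted, de-duplicated IP addresses found for hostname.

    Returns an empty list if the lookup fails.
    """
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return []
    # Each entry ends with a socket address whose first item is the IP address.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    # Replace example.com with your own domain.
    print(resolve("example.com"))
```

Compare the returned addresses with the IP addresses your A records should point to.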

### Network errors

Network errors can happen either before or while crawling a URL.
These errors typically mean the server is overloaded.

#### Solution

* Ensure the Crawler's IP address (`34.66.202.43`) isn't blocked by a firewall rule.
* Use tools like [tcpdump](https://www.tcpdump.org/manpages/tcpdump.1.html) and [Wireshark](https://www.wireshark.org/) to check traffic and look for anomalies. The error may be in any server component that handles network traffic. For example, overloaded network interfaces may drop packets leading to timeouts (inability to establish a connection) and reset connections (because a port was mistakenly closed).

If you can't find anything suspicious, contact your hosting company.

## Meta tags

If a crawled page contains the [`noindex` or `nofollow` meta tags](https://developers.google.com/search/docs/crawling-indexing/block-indexing),
the crawler won't:

* `noindex`: index anything on that page
* `nofollow`: follow any links from that page.

### Solution

Either remove the meta tags from the source content or ignore them by setting [`ignoreNoIndex`](/doc/tools/crawler/apis/configuration/ignore-no-index) or
[`ignoreNoFollowTo`](/doc/tools/crawler/apis/configuration/ignore-no-follow-to) to `true`.
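To find out which pages carry these tags, you can scan the page source. The sketch below uses Python's standard `html.parser` module. The names `RobotsMetaParser` and `robots_directives` are hypothetical, and the sketch only inspects `<meta name="robots">` tags in the HTML you pass in, not HTTP headers or tags added by JavaScript.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() != "robots":
            return
        for directive in attributes.get("content", "").split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def robots_directives(html: str) -> set[str]:
    """Return the set of robots meta directives found in the HTML."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


if __name__ == "__main__":
    page = '<head><meta name="robots" content="noindex, nofollow"></head>'
    print(sorted(robots_directives(page)))
```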

## Ad blocker

If your site loads content from third-party services,
such as product listings from online stores,
with a tool like Google Tag Manager,
the Crawler's ad blocker can prevent extraction of that content.

### Solution

Update your crawler's configuration:

* Turn off the Crawler's ad blocker
* Enable [`renderJavaScript`](/doc/tools/crawler/apis/configuration/render-java-script)
  so the crawler can extract the dynamic content loaded by third-party scripts.

```json JSON icon=braces theme={"system"}
"renderJavaScript": {
  "enabled": true,
  "adblock": false
}
```
