Liberate your web content with Algolia Crawler

Let people searching for your content easily find it with our customizable, hosted web crawler that catalogs and stores your site's web pages.

Get a Demo | Start Free Trial

Leading brands use Algolia to power website search and discovery

Adobe
Dior
NPR

How our website crawler works

A site crawler tool that uncovers all your content, no matter where it's stored

Provide your users with great site search

Is your website content siloed in separate systems and managed by different teams? The first step in providing a high-quality site search experience is implementing a first-rate crawling process.

Our web spider can save your company time and lower your expenses by eliminating the need for building data pipelines between each of your content repositories and your site search software, as well as the project management that entails.

Turn your site into structured content

You can tell our website crawler exactly how to operate so that it accurately interprets your content. For example, in addition to standard web pages, you can ensure that it lets users search for and navigate news articles, job postings, and financial reports, including information that's in documents, PDFs, HTML, and JavaScript.

You don't need to add meta tags

You can have your content extracted without first adding meta tags to your site. Our web crawler doesn't rely on custom metadata. Instead, it provides your technical team with an easy-to-use editor for defining which content you want to extract and how to structure it.
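As a rough sketch of what such an extraction definition can look like (illustrative only — the domain, index name, and selectors below are placeholders, and the exact field names should be checked against the Crawler documentation):

```javascript
// Illustrative crawler configuration: extract a title, description, and body
// from pages matching a URL pattern, producing one search record per page.
{
  startUrls: ['https://www.example.com'],
  actions: [
    {
      indexName: 'site_pages',
      pathsToMatch: ['https://www.example.com/**'],
      recordExtractor: ({ url, $ }) => [
        {
          objectID: url.href,
          title: $('head > title').text(),
          description: $('meta[name="description"]').attr('content'),
          body: $('main').text(),
        },
      ],
    },
  ],
}
```

The point is that the extraction rules live in the crawler configuration, not in meta tags on your pages: the record extractor reads whatever markup is already there.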

Enrich your content to make it more relevant

To enhance search-result relevance for your users, you can enrich your extracted content with business web data, including from Google Analytics and Adobe Analytics. With Algolia Crawler, you can use data about visitor behavior and page performance to adjust your search engine rankings, attach categories to your content to power advanced navigation, and more.

Configure your crawling as needed

Want to save time and avoid unnecessary work? With our website crawler, you can index only selected parts of your site when you need them crawled.

Learn about the business value of Algolia Crawler

Configure our site crawler tool to scan site data on a fixed schedule.

You can configure our site crawler tool to scan your web data on a set schedule, such as every night at 9 p.m., with a recrawl at noon the next day.
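A scheduled crawl might be expressed like this — a sketch assuming a `schedule` configuration property that takes a human-readable interval expression; verify the exact syntax in the Crawler documentation:

```javascript
// Illustrative: recrawl the whole site nightly at 9 p.m.
{
  startUrls: ['https://www.example.com'],
  schedule: 'every day at 9:00 pm',
}
```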

Manually trigger a site crawl on specific sections of your website.

If necessary, you can manually trigger crawling of a particular section of your website, or even the whole thing.

Define what parts of your content you want crawled or avoided.

You can define which parts of your site, or which web pages, you want crawled (or avoided) by our web spider, or you can let it automatically crawl everywhere.
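Include/exclude rules can be sketched as follows (again illustrative: `pathsToMatch` and `exclusionPatterns` stand for the kind of glob-style URL filters a crawler config supports, but check the documentation for the exact option names):

```javascript
// Illustrative: crawl only the blog and docs sections, skip internal pages.
{
  startUrls: ['https://www.example.com'],
  exclusionPatterns: ['https://www.example.com/internal/**'],
  actions: [
    {
      indexName: 'site_content',
      pathsToMatch: [
        'https://www.example.com/blog/**',
        'https://www.example.com/docs/**',
      ],
      recordExtractor: ({ url, $ }) => [
        { objectID: url.href, title: $('title').text() },
      ],
    },
  ],
}
```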

Configure our crawler to explore and index login protected pages.

If you need to crawl pages protected by log-in credentials, it's not a problem: you can configure our crawler to take care of that.

Keep your searchable content up to date

Our production-ready crawler includes a set of tools that lets you provide continually fresh search content. These include URL Inspector, Monitoring, Data Analysis, and Path Explorer.

Get all the details and data for each site crawling session performed.

URL Inspector

On the Inspector tab, you can see and inspect all your crawled URLs, noting whether each crawl succeeded, when it was completed, and the records that were generated.

Get crawl reports on URLs, including errors.

Monitoring

On the Monitoring tab, you can view the details on the latest crawl, plus sort your crawled URLs by status (success, ignored, failed).

Analyze crawl data, and assess the quality of your web-crawler-generated index.

Data Analysis

On the Data Analysis tab, you can assess the quality of your web-crawler-generated index and see whether any records are missing attributes. 

Analyze crawling paths, URLs crawled, records extracted, and errors encountered.

Path Explorer

On the Path Explorer tab, you can see which paths the crawler has explored and, for each path, how many URLs were crawled, how many records were extracted, and how many errors occurred during the crawling process.

“We realized that search should be a core competence of the LegalZoom enterprise, and we see Algolia as a revenue generating product.”

Mrinal Murari
Tools team lead & senior software engineer
LegalZoom
Read the full story


Website Crawler FAQ

  • A web crawler (or "web spider") is a bot (software program) that systematically gathers and indexes web data so it can be found by people using a search engine.

    A website crawler achieves this by visiting a website (or multiple sites), downloading web pages, and diligently following links on sites to discover newly created content. The site crawler tool catalogs the information it discovers in a searchable index.

    There are several types of website crawler. Some crawlers find and index data across the entire World Wide Web. Large-scale, well-known web crawlers include Googlebot (Google), Bingbot (Microsoft Bing), Baidu Spider (Baidu), and YandexBot (Yandex). In addition, many smaller and lesser-known web crawlers focus their crawling on certain types of web data, such as images, videos, or email.
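    The visit-download-follow loop described above can be sketched in a few lines of Python. This is a toy breadth-first crawler run against an in-memory "site" (so it needs no network access); a real crawler would add politeness delays, robots.txt handling, and error handling:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    """Collects href targets from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, fetch):
    """Breadth-first crawl: fetch each page once, follow its links, and build
    an index mapping URL -> page content (a stand-in for a searchable catalog)."""
    index, queue, seen = {}, deque([start_url]), {start_url}
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        index[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return index

# A tiny in-memory "website" so the example runs without network access.
SITE = {
    "https://example.com/": '<a href="/about">About</a><a href="/blog">Blog</a>',
    "https://example.com/about": "<p>About us</p>",
    "https://example.com/blog": '<a href="/">Home</a><p>Posts</p>',
}
index = crawl("https://example.com/", SITE.get)
print(sorted(index))  # all three pages discovered
```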

  • A database crawler is a specific type of web crawler that parses and catalogs information stored in database tables. Once this information is cataloged, people can find it by using search engines.

    Different types of databases require different configuration in order for the crawler to extract their information in an intelligent way. You specify the type of data and fields you want crawled and determine a crawling schedule.

    A database crawler treats each row in a table as a separate document, parsing and indexing column values as searchable fields. 

    A database crawler can also be set up to crawl various tables by using a plug-in. In a relational database, this allows the joining of rows from multiple tables that have the same key fields and treating them as one document. Then, when the document is displayed in search results, the data from the joined tables appears as additional fields.
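    The row-to-document idea can be sketched with Python's built-in sqlite3 module and a made-up two-table schema: each joined row becomes one searchable document whose fields are the column values.

```python
import sqlite3

# Hypothetical schema: products joined to categories on a shared key.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, category_id INTEGER);
    CREATE TABLE categories (id INTEGER PRIMARY KEY, label TEXT);
    INSERT INTO categories VALUES (1, 'Shoes'), (2, 'Hats');
    INSERT INTO products VALUES (10, 'Trail runner', 1), (11, 'Sun hat', 2);
""")

def crawl_tables(conn):
    """Treat each joined row as one document, with column values as fields.
    The joined category label appears as an extra field on the document."""
    rows = conn.execute("""
        SELECT p.id, p.name, c.label
        FROM products p JOIN categories c ON p.category_id = c.id
        ORDER BY p.id
    """)
    return [{"objectID": pid, "name": name, "category": label}
            for pid, name, label in rows]

documents = crawl_tables(conn)
print(documents)
```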

  • Like other web content, a website's XML sitemap can be crawled by a web crawler. If a website lists a sitemap URL in its robots.txt file, the sitemap will be crawled automatically. You can also separately download and crawl the XML sitemap URLs with a tool such as Screaming Frog.

    To convert a sitemap file into a format that a program like Screaming Frog can crawl, you can import the file into Microsoft Excel and copy the URLs to a text file.

    If a sitemap has any "dirt" in it, that is, it references outdated pages that lead to error response codes (such as 404), redirects, or application errors, the data turned up and indexed by a crawler and made available to search engines can be error-prone. This is why it makes sense to spend the effort needed to crawl a sitemap and then correct any issues.

    How do you know if your sitemap is dirty? In Google Search Console (formerly Google Webmaster Tools), the "Sitemaps" section shows you both the number of pages submitted in the sitemap and the number of pages indexed. This ratio should be roughly 1 to 1. If many pages are submitted but few are indexed, there may be errors in the sitemap's URLs.
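    The sitemap check can be sketched in Python with the standard library: pull the `<loc>` URLs out of the sitemap XML, then compare submitted vs. indexed counts. The 0.8 threshold below is an arbitrary illustration, not a Google rule.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(xml_text):
    """Extract all <loc> URLs from a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text for loc in root.iter(SITEMAP_NS + "loc")]

def looks_dirty(submitted, indexed, threshold=0.8):
    """Flag a sitemap when far fewer pages are indexed than submitted."""
    return indexed / submitted < threshold

sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/old-page</loc></url>
</urlset>"""
urls = sitemap_urls(sitemap)
print(urls)
print(looks_dirty(submitted=len(urls), indexed=1))  # True: only half indexed
```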


  • The goal of a web crawler software program (a.k.a. "web spider") is to explore web pages, discover and fetch data, and index it so that it can be accessed by people using a search engine. A website crawler completes this mission by systematically examining a website (or multiple sites), downloading its web pages, and following its links to identify new content. The site crawler tool then catalogs the information it uncovers in a searchable index for quick retrieval.

  • Web crawling is having a software program (a "bot") systematically explore websites and index the data it finds, making it easy for people to locate by using a search engine.

    Web scraping, a slightly different form of gathering web data, involves collecting (downloading) specific types of information, for instance, about pricing. 

    In ecommerce, both of these types of data gathering are especially valuable because the data collected and analyzed can lead marketers to data-based decisions that boost sales.

    For instance, marketers can compare data about products sold on other sites with their own listings of the same products.

    If they find out that shoppers are routinely entering certain keywords in a search engine to locate a given product, they might decide to add those words to the product description to attract potential buyers to the product listing.

    Consumers typically want the best deals, and they can easily search for the lowest prices on the web. If a company sees that a competitor has a lower price on a product it offers, it can lower its own price so that prospective customers won't choose the competitor solely because of cost.

    By gathering product review and ranking data, marketers and businesspeople can uncover information about flaws in their own and competitors' products.

    They can also use crawler technology to monitor product reviews and rankings so that they can swiftly respond when people post negative comments, thereby improving their customer service.

    They can find out which products are bestsellers and potentially identify hot new markets.

    All of this revenue-impacting activity makes ecommerce an especially important and lucrative domain for web crawling and web scraping.