If you have a lot of data hosted on a variety of sites and in a variety of formats, it can be daunting to change platforms, centralize your data, or upload your records to Algolia. If your data is already published on a collection of websites, our Crawler can make this process a lot easier!
What is the Crawler?
The Crawler is an automated web scraping program. Given a set of start URLs, it visits those pages and extracts their content, then follows the links it finds to further pages and repeats the process for each page it discovers. With little configuration, the Crawler can populate and maintain Algolia indices for you by periodically extracting content from your web pages.
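To make the link-following behavior concrete, here is a minimal sketch of a breadth-first crawl. It is an illustration of the general technique only, not the Crawler's actual implementation: the `PAGES` dictionary, the `crawl` function, and the `example.com` URLs are all hypothetical, and the "site" is an in-memory structure rather than real pages fetched over HTTP.

```python
from collections import deque

# Hypothetical in-memory "site": each URL maps to (page content, outgoing links).
# A real crawler would fetch and parse these pages over HTTP instead.
PAGES = {
    "https://example.com/": ("Home", ["https://example.com/docs", "https://example.com/blog"]),
    "https://example.com/docs": ("Docs", ["https://example.com/docs/api", "https://example.com/"]),
    "https://example.com/blog": ("Blog", ["https://example.com/docs"]),
    "https://example.com/docs/api": ("API reference", []),
}


def crawl(start_urls, pages):
    """Visit the start URLs, then every URL reachable from them, once each."""
    start = list(dict.fromkeys(start_urls))  # drop duplicate start URLs, keep order
    queue = deque(start)
    seen = set(start)
    records = []
    while queue:
        url = queue.popleft()
        if url not in pages:
            continue  # page not available in this toy site: skip it
        content, links = pages[url]
        records.append({"url": url, "content": content})
        for link in links:
            if link not in seen:  # never queue the same URL twice
                seen.add(link)
                queue.append(link)
    return records


records = crawl(["https://example.com/"], PAGES)
for record in records:
    print(record["url"], "->", record["content"])
```

The `seen` set is what keeps the crawl from looping when pages link back to each other, as the docs page does to the home page here. Each dictionary in `records` stands in for the kind of record that would then be uploaded to an index.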
What can it do for you?
The Crawler can help you extract content from multiple sites, format that content, and upload it to Algolia. The Crawler:
- Quickly aggregates your distributed content.
- Automatically and periodically updates your aggregated content.
- Enables you to quickly and accurately search through your records, and to provide this same search experience to your clients!
When should you use it?
You should consider using the Crawler if you have a substantial amount of data that matches one or more of the following descriptions:
- inaccessible from its original source,
- stored in different formats,
- hosted on various websites,
- maintained by different people,
- displayed on a single web page, but drawn from different sources.
For the Crawler to work effectively, most of the information you want to index should appear on websites. However, the Crawler can work even if your web content is presented inconsistently, or if your web pages require login credentials.
How to use the Crawler
You can work with your crawlers through two tools: the Admin Console and the Crawler REST API.
The Admin Console
You can create, configure, and launch crawlers through the Crawler Admin Console. You can also test your crawler on specific URLs and analyze the results. The Admin Console is the best way to get familiar with the Crawler and to start using it.
Crawler REST API
While there are use cases for the Crawler REST API, we encourage everyone to use the Admin Console. Use of the REST API is not covered by any SLA.