Netlify plugin

The plugin uses the Algolia Crawler to crawl your website and index its content into Algolia. Algolia is schemaless: to offer an effortless experience, the plugin creates your Algolia index with a standard record schema.

Standard record schema

All root-level properties are computed from selectors, with fallbacks.

All properties that aren’t marked as optional are always present in the final record. Optional properties are only present if the Algolia Crawler finds relevant information.

Default schema

By default, the Netlify plugin extracts one record per page, with the following schema:

{
  /**
   * The object's unique identifier
   */
  objectID: string;

  /**
   * The URL where the Algolia Crawler found the record
   */
  url: string;

  /**
   * The lang of the page
   * - html[attr=lang]
   */
  lang?: string;

  /**
   * The title of the page
   * - og:title
   * - head > title
   */
  title?: string;

  /**
   * The description of the page
   * - meta[name=description]
   * - meta[property="og:description"]
   */
  description?: string;

  /**
   * The keywords of the page
   * - meta[name="keywords"]
   * - meta[property="article:tag"]
   */
  keywords?: string[];

  /**
   * The image of the page
   * - meta[property="og:image"]
   */
  image?: string;

  /**
   * The authors of the page
   * - `author` field of JSON-LD Article object: https://schema.org/Article
   * - meta[property="article:author"]
   */
  authors?: string[];

  /**
   * The publish date of the page
   * - `datePublished` field of JSON-LD Article object: https://schema.org/Article
   * - meta[property="article:published_time"]
   */
  datePublished?: number;

  /**
   * The modified date of the page
   * - `dateModified` field of JSON-LD Article object: https://schema.org/Article
   * - meta[property="article:modified_time"]
   */
  dateModified?: number;

  /**
   * The category of the page
   * - meta[property="article:section"]
   * - meta[property="product:category"]
   */
  category?: string;

  /**
   * The URL depth, based on the number of slashes after the domain
   * - http://example.com/ = 1
   * - http://example.com/about = 1
   * - http://example.com/about/ = 2
   * - etc.
   */
  urlDepth?: number;

  /**
   * The content of your page
   */
  content: string;
}

Example record

{
  "objectID": "https://www.algolia.com/products/crawler/#0",
  "lang": "en",
  "title": "Crawler | Web Crawler | Ecommerce Crawler",
  "url": "https://www.algolia.com/products/crawler/",
  "image": "https://res.cloudinary.com/hilnmyskv/image/upload/v1527077656/Algolia_OG_image_m3xgjb.png",
  "urlDepth": 2,
  "content": "Algolia Crawler Unleash your content Algolia Crawler is a hosted and highly customizable web crawler that makes sense of any content of a website and makes it deliverable through a seamless experience Request a demo World's leading brands use Algolia to power their Site Search and Discovery Accelerate time to value Great Site Search experiences are based on various types of content, but this content is siloed in disparate systems managed by different teams. By automatically extracting content from your websites, Algolia Crawler removes the need for building data pipelines between each of your content repository and Algolia, and avoids complex internal project management, saving time and resources. Turn web pages into structured content Tailor the crawler to make sure it accurately interprets your content. It allows your users to search and navigate news articles, job posts, FAQ answers, financial reports or any type of content your website offers, including JavaScript, PDFs and Docs, instead of generic web pages. Extract content without editing your website Extract structured content without the need to add any metatag to your website. Algolia Crawler provides an easy to use editor for your technical team, so they can define what content to extract and how to structure it, ensuring an optimal end user experience. Enrich your content to improve the experience Algolia Crawler can enrich the extracted content with business data, including Google Analytics data, to enhance the relevance of the end user experience. From using your visitor behaviors and page performance to adjust the search rankings, to attaching categories to your content to power advanced navigation, possibilities are endless. Configure the crawler to your needs Algolia Crawler gives you the options to index the parts of your websites you need, when you need it. Schedule automatic crawls at the timing of your choice Manually trigger a crawl of part or all your websites when necessary Define what parts of your websites the crawler should or should not explore, or let it explore your websites automatically Configure the crawler to explore login protected pages when necessary Rely on a Production Ready crawler Algolia Crawler comes with a complete set of tools to make sure you always fuel your site search experience with up to date and accurate content. URL Inspector Search and inspect all the crawled URLs. For each URL, check when it was last crawled, whether the crawl was successful, and the records it generated. Monitoring Get a detailed report of the errors encountered during the last crawl. Data analysis Assess the quality of the extracted data. For each type of content, the Data Analyser compares all the extracted content to identify missing data. Path Explorer Assess which paths the Crawler explores, and for each path, how many URLs were crawled, how many records were extracted, what errors happened. We realized that search should be a core competence of the LegalZoom enterprise, and we see Algolia as a revenue generating product. Mrinal Murari Tools team lead & senior software engineer Read the full story Additional Resources",
  "description": "Surface the most relevant content with Algolia's Crawler. Our custom crawler makes sense of all your content and delivers an enhanced end user experience."
}

Content splitting for better relevance

For better relevance and to stay within Algolia’s record size limits, Algolia splits the content of long pages into multiple records. Algolia creates all indices with the index settings { distinct: true, attributeForDistinct: 'url' }, so that duplicate records for the same URL are deduplicated at query time.
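
For example, a long page might produce several records that share the same url but carry distinct objectID values. This is a hypothetical illustration; the actual split points depend on the page:

[
  {
    "objectID": "https://www.example.com/guide/#0",
    "url": "https://www.example.com/guide/",
    "title": "Guide",
    "content": "First part of the page content…"
  },
  {
    "objectID": "https://www.example.com/guide/#1",
    "url": "https://www.example.com/guide/",
    "title": "Guide",
    "content": "Second part of the page content…"
  }
]

With distinct: true and attributeForDistinct: 'url', a search returns at most one of these records per URL.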

Hierarchical schema

If your website has a hierarchical structure, you might want to create one Algolia record per section. To support this, set template = "hierarchical" in your netlify.toml configuration file:

template = "hierarchical"
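
For context, a minimal netlify.toml sketch might look like the following. The package name and the [plugins.inputs] placement are assumptions here; keep whatever plugin entry your site already uses:

[[plugins]]
package = "@algolia/netlify-plugin-crawler"

  [plugins.inputs]
  template = "hierarchical"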

With this template, the plugin creates a new Algolia record for each header tag (<h1>, <h2>, and so on). If your headers have an id, it’s appended to the record’s url, letting users navigate directly to the relevant section.

The schema of the records is close to the default schema:

{
  ...defaultSchema,

  /**
   * The current hierarchy of the extracted record, in an InstantSearch compatible format.
   */
  hierarchy: { lvl0: 'H1 heading', lvl1: 'H2 heading', lvl2: 'H3 heading', ... };

  /**
   * The same hierarchy, in a hierarchical facets compatible format.
   */
  hierarchicalCategories: { lvl0: 'H1 heading', lvl1: 'H1 heading > H2 heading', lvl2: 'H1 heading > H2 heading > H3 heading', ... };

  /**
   * Length of the extracted content for the current section.
   * Can be used for custom ranking, for example to rank sections with more content first.
   */
  contentLength: number;
}
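
A hypothetical record produced by this template could look like the following (only the hierarchical fields and a few defaults are shown; all values are placeholders):

{
  "objectID": "https://www.example.com/docs/getting-started/#installation",
  "url": "https://www.example.com/docs/getting-started/#installation",
  "lang": "en",
  "title": "Getting started",
  "hierarchy": { "lvl0": "Getting started", "lvl1": "Installation" },
  "hierarchicalCategories": { "lvl0": "Getting started", "lvl1": "Getting started > Installation" },
  "content": "Run npm install to get started.",
  "contentLength": 31
}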

Supported JSON-LD attributes

The Algolia Crawler supports a limited set of JSON-LD attributes, which must follow the Schema.org structure. If present, the Algolia Crawler prioritizes the attributes found in JSON-LD during extraction. The currently supported attributes are listed below, with an example after the list:

  • Article
    • author
    • datePublished
    • dateModified
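
For instance, an Article object embedded in the page as JSON-LD lets the Crawler pick up the authors and dates. This is a minimal sketch following the Schema.org structure; the values are placeholders:

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Article",
  "author": { "@type": "Person", "name": "Jane Doe" },
  "datePublished": "2023-01-15T08:00:00+00:00",
  "dateModified": "2023-02-01T10:30:00+00:00"
}
</script>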

Sitemaps

The plugin automatically discovers the sitemaps declared in your robots.txt file, as well as sitemaps at common sitemap URLs.
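
For example, a sitemap declared with the standard Sitemap directive in robots.txt is picked up automatically (the URL is a placeholder):

Sitemap: https://www.example.com/sitemap.xml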

Exclude URLs from crawls

The plugin supports several standard ways of excluding URLs from a crawl.

Exclude URLs with a robots.txt file

To allow or disallow Algolia to crawl URLs, use the standard robots.txt syntax:

User-agent: Algolia Crawler
Disallow: /foo/bar

User-agent: Algolia Crawler
Allow: /*

Exclude URLs with robots meta tags

To exclude a specific page, include a robots meta tag, or a meta tag that only targets the Algolia Crawler.

<head>
  <!-- Will target only the Algolia Crawler -->
  <meta name="Algolia crawler" content="noindex" />
</head>

<head>
  <!-- Will target all robots: Algolia Crawler, Google, Bing, etc. -->
  <meta name="robots" content="noindex" />
</head>

Exclude URLs with canonical URLs

You can indirectly exclude a page from the crawl and redirect the Algolia Crawler to another page by using a canonical link tag. Canonical URLs are especially helpful for excluding page variants with query parameters, such as pagination parameters or search terms.

<head>
  <link rel="canonical" href="/another-page.html" />
</head>

Enable JavaScript to crawl client-side rendered pages

If your website is client-side rendered, you can turn on JavaScript rendering by setting the renderJavaScript option to true in your netlify.toml file.
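
For example, assuming the option is passed as a plugin input in the [plugins.inputs] block shown earlier (a sketch, not a confirmed location for the option):

  [plugins.inputs]
  renderJavaScript = true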

Password protection

If your Netlify site is password-protected, the Crawler automatically uses this password to crawl your website.

The Algolia Crawler stores your encrypted password.
