Tools / Crawler / Extraction Strategy

Extraction Strategy

The plugin uses the Algolia Crawler to crawl your website and index it into Algolia. Algolia is a schemaless search engine, but to offer an effortless experience, the plugin populates your Algolia index with a standard record schema.

Here’s an overview of what the Algolia Crawler extracts and how you can improve the crawl.

Standard record schema

All root-level properties are a computation of selectors, with a fallback.

This extraction logic might change, with a proper deprecation period.

All properties that aren’t marked as optional are present in the final record. Others can be missing if the Algolia Crawler couldn’t find any relevant information.

Default schema

By default, the Netlify plugin tries to extract one record per web page, with the following schema:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
{
  /**
   * The object's unique identifier
   */
  objectID: string;

  /**
   * The URL where the Algolia Crawler found the record
   */
  url: string;

  /**
   * The lang of the page
   * - html[attr=lang]
   */
  lang?: string;

  /**
   * The title of the page
   * - og:title
   * - head > title
   */
  title?: string;

  /**
   * The description of the page
   * - meta[name=description]
   * - meta[property="og:description"]
   */
  description?: string;

  /**
   * The keywords of the page
   * - meta[name="keywords"]
   * - meta[property="article:tag"]
   */
  keywords?: string[];

  /**
   * The image of the page
   * - meta[property="og:image"]
   */
  image?: string;

  /**
   * The authors of the page
   * - `author` field of JSON-LD Article object: https://schema.org/Article
   * - meta[property="article:author"]
   */
  authors?: string[];

  /**
   * The publish date of the page
   * - `datePublished` field of JSON-LD Article object: https://schema.org/Article
   * - meta[property="article:published_time"]
   */
  datePublished?: number;

  /**
   * The modified date of the page
   * - `dateModified` field of JSON-LD Article object: https://schema.org/Article
   * - meta[property="article:modified_time"]
   */
  dateModified?: number;

  /**
   * The category of the page
   * - meta[property="article:section"
   * - meta[property="product:category"]
   */
  category?: string;

  /**
   * The URL depth, based on the number of slashes after the domain
   * - http://example.com/ = 1
   * - http://example.com/about = 1
   * - http://example.com/about/ = 2
   * - etc.
   */
  urlDepth?: number;

  /**
   * The content of your page
   */
  content: string;
}

Example of record

Based on the preceding configuration, here’s what a record could look like for Algolia Crawler product page:

1
2
3
4
5
6
7
8
9
10
{
  "objectID": "https://www.algolia.com/products/crawler/#0",
  "lang": "en",
  "title": "Crawler | Web Crawler | Ecommerce Crawler",
  "url": "https://www.algolia.com/products/crawler/",
  "image": "https://res.cloudinary.com/hilnmyskv/image/upload/v1527077656/Algolia_OG_image_m3xgjb.png",
  "urlDepth": 2,
  "content": "Algolia Crawler Unleash your content Algolia Crawler is a hosted and highly customizable web crawler that makes sense of any content of a website and makes it deliverable through a seamless experience Request a demo World’s leading brands use Algolia to power their Site Search and Discovery Accelerate time to value Great Site Search experiences are based on various types of content, but this content is siloed in disparate systems managed by different teams. By automatically extracting content from your websites, Algolia Crawler removes the need for building data pipelines between each of your content repository and Algolia, and avoids complex internal project management, saving time and resources. Turn web pages into structured content Tailor the crawler to make sure it accurately interprets your content. It allows your users to search and navigate news articles, job posts, FAQ answers, financial reports or any type of content your website offers, including JavaScript, PDFs and Docs, instead of generic web pages. Extract content without editing your website Extract structured content without the need to add any metatag to your website. Algolia Crawler provides an easy to use editor for your technical team, so they can define what content to extract and how to structure it, ensuring an optimal end user experience. Enrich your content to improve the experience Algolia Crawler can enrich the extracted content with business data, including Google Analytics data, to enhance the relevance of the end user experience. From using your visitor behaviors and page performance to adjust the search rankings, to attaching categories to your content to power advanced navigation, possibilities are endless. Configure the crawler to your needs Algolia Crawler gives you the options to index the parts of your websites you need, when you need it. Schedule automatic crawls at the timing of your choice Manually trigger a crawl of part or all your websites when necessary Define what parts of your websites the crawler should or should not explore, or let it explore your websites automatically Configure the crawler to explore login protected pages when necessary Rely on a Production Ready crawler Algolia Crawler comes with a complete set of tools to make sure you always fuel your site search experience with up to date and accurate content. URL Inspector Search and inspect all the crawled URLs. For each URL, check when it was last crawled, whether the crawl was successful, and the records it generated. Monitoring Get a detailed report of the errors encountered during the last crawl. Data analysis Assess the quality of the extracted data. For each type of content, the Data Analyser compares all the extracted content to identify missing data. Path Explorer Assess which paths the Crawler explores, and for each path, how many URLs were crawled, how many records were extracted, what errors happened.“We realized that search should be a core competence of the LegalZoom enterprise, and we see Algolia as a revenue generating product.” Mrinal Murari Tools team lead & senior software engineer Read the full story Additional Resources",
  "description": "Surface the most relevant content with Algolia’s Crawler. Our custom crawler makes sense of all your content and delivers an enhanced end user experience."
}

Content splitting for relevance

For better relevance and to stay within Algolia’s records size limits, Algolia splits the content of big pages into multiple records. Algolia creates all indices with the index settings { distinct: true, attributeForDistinct: 'url' } to deduplicate them at search time.

hierarchical schema

If your website has a hierarchical structure (as it’s often the case for documentation websites), it can be more relevant to create one Algolia record per section. The Netlify crawler plugin supports such content extraction logic, and you can enable it by specifying the following parameter in your .toml configuration file:

1
template = "hierarchical"

With this template enabled, the plugin creates a new Algolia record for each header tag (<h1>, <h2>, etc.), so the content of each page is split into several records. If your headers have an id, it’s put in the url to permits search results to point directly to it, as defined in the HTML specification.

The schema of the records is close to the default schema, augmented with a couple of fields:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
{
  ...defaultSchema,
  /**
   * The current hierarchy of the extracted record
   */
  hierarchy: { lvl0: 'H1 heading', lvl1: 'H2 heading', lvl2: 'H3 heading', ... };
  /**
   * The current hierarchy of the extracted record, in an InstantSearch compatible format.
   * https://www.algolia.com/doc/api-reference/widgets/hierarchical-menu/js/#requirements
   */
  hierarchicalCategories: { lvl0: 'H1 heading', lvl1: 'H1 heading > H2 heading', lvl3: 'H1 heading > H2 heading > H3 heading', ... };
  /**
   * Lengh of the extracted content for the current section.
   * Can be used for custom ranking, for e.g. redirect users to sections with more content first.
   */
  contentLength: number;
}

Supported JSON-LD attributes

The Algolia Crawler supports a limited set of JSON-LD attributes. They should follow the Schema.org structure. If present, the Algolia Crawler prioritizes the attributes found in JSON-LD during the extraction. The current list of supported attributes are:

Reach out on the Algolia Discourse forum if you’d like the plugin to support other specific attributes.

Sitemaps

The plugin automatically find sitemaps in your robots.txt and tries several standard URLs that can match sitemaps.

Ignoring URLs or patterns

The plugin supports several standard way of excluding URLs from a crawl:

Using robots.txt

You can allow or disallow Algolia to crawl certain pages using the standard robots.txt syntax.

1
2
User-agent: Algolia Crawler
Disallow: /foo/bar
1
2
User-agent: Algolia Crawler
Allow: /*

Using robots meta tags

You can exclude a specific page by including a robots meta tag in its HTML code.

1
2
3
<head>
  <meta name="robots" content="noindex" />
</head>

Using canonical URLs

You can indirectly exclude a page from the crawl and redirect the Algolia Crawler to an other page by using the canonical meta tag. It also may be useful to ignore query parameters, such as pagination or search terms.

1
2
3
<head>
  <link rel=“canonical” href=“/another-page.html” />
</head>

Executing JavaScript

If your website renders on the client (for example, if you have a single-page React application without server-side rendering), you can enable JavaScript rendering by setting the renderJavaScript option in your netlify.toml file.

Password protection

If you’ve configured password protection through your Netlify site’s settings (Settings > Access Control > Visitor Access), the Crawler automatically use this password to crawl your website.

The Algolia Crawler stores your encrypted password.

Did you find this page helpful?