
Extract data from crawled pages

When you create a new crawler, a configuration file is automatically generated. This file helps the crawler understand what information to collect from different parts of your site, like product pages or blog articles. Following the first crawl, you can refine this configuration file to collect more comprehensive or granular details.

To target specific content for extraction, you can use meta tags, CSS selectors, or helpers.

Be aware that any change to the site's structure or attributes might affect crawling. Prefer attributes that are unlikely to change even when other aspects of the site do.

If site attributes do change, update the crawler's configuration so that data extraction continues to work.

Helpers

The Crawler can use helpers to extract supported JSON-LD attributes from the Article and Product schemas.

To identify JSON-LD attributes on your site, use an online schema markup validator (https://validator.schema.org/).

For example, to analyze the blog post https://jsonld.com/jsonld-webpage-vs-website/, copy that URL into the validator. You'll find the author information in the WebPage schema.
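
JSON-LD is ordinary JSON embedded in a script tag, so you can inspect what a helper would see by parsing a page's markup yourself. A minimal, self-contained sketch (the payload and field names below are made up for illustration):

```javascript
// Hypothetical JSON-LD payload, as it might appear inside a
// <script type="application/ld+json"> tag on a blog post.
const raw = `{
  "@context": "https://schema.org",
  "@type": "WebPage",
  "name": "JSON-LD: WebPage vs WebSite",
  "author": { "@type": "Person", "name": "Jane Doe" }
}`;

// JSON-LD is plain JSON, so standard parsing applies.
const jsonld = JSON.parse(raw);
console.log(jsonld.author.name); // "Jane Doe"
```

Parsing the real contents of a page's script tag this way shows exactly which attributes are available for extraction.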

Author information in JSON-LD attributes of a blog post

Unsupported JSON-LD attributes

To extract unsupported JSON-LD attributes and schemas in your pages, define a custom helper by adding the following instructions to your crawler’s configuration:

let jsonld;
const node = $('script[type="application/ld+json"]').get(0);
try {
  jsonld = JSON.parse(node.firstChild.data);
} catch (err) {
  // Log the offending node so parsing errors can be debugged
  console.log(node);
}

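Once parsing succeeds, the resulting object can supply record attributes from your recordExtractor. A hedged sketch, assuming the page's JSON-LD exposes headline and author fields (your pages may use different field names):

```javascript
// Continuing after the try/catch above; the field names are assumptions.
return [
  {
    headline: jsonld ? jsonld.headline : undefined,
    author: jsonld && jsonld.author ? jsonld.author.name : undefined,
  },
];
```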
Meta tags

Meta tags are often a good starting point for specifying the content a crawler should extract, because they're less likely to change during site updates or redesigns, which reduces how often you need to update the crawler configuration. However, meta tags might not include everything you want. To get information that isn't in a meta tag, such as a blog post's author, use CSS selectors or helpers instead.

Extract data with meta tags

You can set up actions in your crawler’s configuration file to look for specific meta tags on your pages. The following example captures a blog post’s description by finding that meta tag in the <head> section of that page’s HTML and creating an action to tell the crawler how to extract it:

description: $("meta[name=description]").attr("content"),

The description meta tag in the head section of the page
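
In context, that instruction is one attribute of a record returned by the recordExtractor. A minimal sketch, assuming one record per page and using the page URL as the objectID (both are illustrative choices, not requirements):

```javascript
recordExtractor: ({ url, $ }) => {
  return [
    {
      objectID: url.href, // assumption: the page URL as a unique record ID
      description: $("meta[name=description]").attr("content"),
    },
  ];
},
```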

Change what you collect

Add or remove Algolia record attributes by modifying the corresponding instruction in the configuration’s recordExtractor. For example:

  • To stop capturing the description attribute for blog posts in the next crawl, delete $("meta[name=description]").attr("content") from the configuration.
  • To add an attribute, add a corresponding instruction with the relevant meta tag to the configuration's recordExtractor.
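
For example, to also capture an author attribute, you could point a new instruction at a hypothetical author meta tag (this only works if your pages actually include one):

```javascript
// Hypothetical: requires <meta name="author" content="..."> on the page
author: $("meta[name=author]").attr("content"),
```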

CSS selectors

You can use CSS selectors to pinpoint the information you want to extract from the <body> of a page.

For example, to extract the author name from an MDN blog post (https://developer.mozilla.org/en-US/blog/getting-started-with-css-container-queries/), the appropriate CSS selector is .author.

Author information in the CSS classes of a blog post

To extract this, add the following to your configuration’s recordExtractor:

author: $(".author").text(),
