Extract data from crawled pages
When you create a new crawler, a configuration file is automatically generated. This file helps the crawler understand what information to collect from different parts of your site, like product pages or blog articles. Following the first crawl, you can refine this configuration file to collect more comprehensive or granular details.
To target specific content for extraction, you can use meta tags, CSS selectors, or helpers.
Be aware that any change to the site’s structure and attributes might affect crawling. Use attributes that are less likely to change, even if other aspects of the site do. For example:
- Prefer data-* global attributes.
- Prefer simple CSS selectors over complex ones (with children or siblings).
If the site’s attributes do change, update the crawler’s configuration to ensure uninterrupted data extraction.
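As a sketch, an extraction keyed to a data-* attribute survives layout changes that would break a structural selector. The attribute name and markup below are hypothetical:

```javascript
// Hypothetical markup: <h1 data-page-title>Getting started</h1>
// Resilient: keyed to a data-* attribute the site controls deliberately.
title: $("[data-page-title]").text(),
// Fragile: breaks as soon as the wrapper structure changes.
// title: $("div.header > div:first-child > h1 > span").text(),
```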
Helpers
The Crawler can use helpers to extract supported JSON-LD attributes from the Article and Product schemas.
To identify JSON-LD attributes on your site, use an online schema markup validator (https://validator.schema.org/).
For example, to analyze the blog post https://jsonld.com/jsonld-webpage-vs-website/, copy that URL into the validator.
You’ll find the author information in the WebPage schema.
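For reference, a typical JSON-LD block that such a validator detects looks like the following. This is a hypothetical example, not the actual markup of that page:

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "BlogPosting",
  "headline": "JSON-LD: WebPage vs WebSite",
  "author": { "@type": "Person", "name": "Jane Doe" }
}
</script>
```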
Unsupported JSON-LD attributes
To extract unsupported JSON-LD attributes and schemas from your pages, define a custom helper by adding the following instructions to your crawler’s configuration:
```javascript
let jsonld;
const node = $('script[type="application/ld+json"]').get(0);
try {
  jsonld = JSON.parse(node.firstChild.data);
} catch (err) {
  // Log parsing errors in the console
  console.log(node);
}
```
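Once the JSON-LD is parsed, you can copy any attribute into your record, including ones the built-in helpers don’t support. Outside the crawler, the same parse-then-pick logic looks like this plain JavaScript sketch; the payload and attribute names are hypothetical:

```javascript
// Parse a JSON-LD payload and pick out attributes for a record.
// The payload and attribute names below are hypothetical.
const raw = '{"@type": "BlogPosting", "datePublished": "2024-01-15", "author": {"name": "Jane Doe"}}';

let jsonld;
try {
  jsonld = JSON.parse(raw);
} catch (err) {
  jsonld = {}; // fall back to an empty object on malformed JSON-LD
}

const record = {
  datePublished: jsonld.datePublished,
  authorName: jsonld.author && jsonld.author.name,
};
console.log(record); // → { datePublished: '2024-01-15', authorName: 'Jane Doe' }
```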
Meta tags
Meta tags are often a good starting point for specifying the content a crawler should extract. This is because meta tags are less likely to change during site updates or redesigns, minimizing the need to update the crawler configuration. However, meta tags might not include everything you want. If you’re trying to get information that’s not in a meta tag, like a blog post’s author, you can:
- Add the missing information to your site’s meta tags
- Use helpers
- Use CSS selectors.
Extract data with meta tags
You can set up actions in your crawler’s configuration file to look for specific meta tags on your pages.
The following example captures a blog post’s description by finding that meta tag in the <head> section of the page’s HTML and creating an action that tells the crawler how to extract it:

```javascript
description: $("meta[name=description]").attr("content"),
```
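In context, such an action might look like the following sketch of a recordExtractor. The og:title attribute, the objectID choice, and the ({ url, $ }) signature are illustrative assumptions; check your own configuration for the exact shape:

```javascript
// Sketch only: attribute names and the ({ url, $ }) signature are assumptions.
recordExtractor: ({ url, $ }) => {
  return [
    {
      objectID: url.href,
      title: $("meta[property='og:title']").attr("content"),
      description: $("meta[name=description]").attr("content"),
    },
  ];
},
```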
Change what you collect
Add or remove Algolia record attributes by modifying the corresponding instruction in the configuration’s recordExtractor.
For example:
- To stop capturing the description attribute for blog posts in the next crawl, delete $("meta[name=description]").attr("content") from the configuration.
- To add an attribute, add a corresponding instruction with the relevant meta tag to the configuration’s recordExtractor.
CSS selectors
You can use CSS selectors to pinpoint the information you want to extract from the <body> of a page.
For example, to extract the author name from an MDN blog post (https://developer.mozilla.org/en-US/blog/getting-started-with-css-container-queries/), the appropriate CSS selector is .author.
To extract this, add the following to your configuration’s recordExtractor:

```javascript
author: $(".author").text(),
```