Structuring content for websites using a logical and well-organized information architecture makes it easier for crawlers and search engines to index relevant content for specific queries.
That’s because search is at its best when you organize online content into a hierarchy of web pages and structure each page into small bits of attributes, like title, description, sections, and paragraphs.
Luckily, a large majority of information on the web can be structured in this way. However, some content does not break down so easily, such as document search or purely textual content like blogs, technical documentation, and online news journals.
Document-based search may seem easy at first: it’s just matching the text of the content to the text of the query. But there are several pitfalls to web page structures that you need to be aware of and avoid. This article explains those pitfalls and proposes an index and web page structure optimized for large document search.
While the suggestions in this article may apply to any website that offers an organized collection of textual content (blog, newsfeed), we discuss only large texts within the context of technical documentation, using our Laravel technical documentation implementation.
Before diving into the details of those pitfalls, let’s pull out the essentials.
Break up each page into small chunks, and save each chunk as a separate record. Incorporate the hierarchy of the website into each record.
Algolia does not use statistics or technology like NLP or TF-IDF to understand or interpret the text. Instead, it focuses on the characters of the text itself, creating relevance by textually matching the query with the content, and then applying a ranking formula, typo tolerance, proximity, and other smart ways to read the text and order the results.
Show only parts of the text in the search results, not a large portion. Otherwise, it’s too much to read or scan. Additionally, only allow two instances of the same page to show up in the results, to make room for other pages that match.
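To make the last point concrete, here is a minimal Python sketch of capping the number of hits per page. It assumes the hits arrive already ranked and that each one carries the `link` attribute used in the records shown later in this article. The function name `cap_hits_per_page` and the `cache#configuration` link are hypothetical, and this is an illustration of the idea rather than how the engine applies the limit.

```python
from collections import defaultdict

def cap_hits_per_page(hits, max_per_page=2):
    """Keep at most `max_per_page` hits for each page, preserving rank order."""
    seen = defaultdict(int)
    kept = []
    for hit in hits:
        # The page is the part of the link before the anchor.
        page = hit["link"].split("#", 1)[0]
        if seen[page] < max_per_page:
            kept.append(hit)
            seen[page] += 1
    return kept

# Hypothetical ranked hits: three chunks of one page, one chunk of another.
hits = [
    {"link": "validation#introduction"},
    {"link": "validation#validation-quickstart"},
    {"link": "validation#custom-error-messages"},
    {"link": "cache#configuration"},
]

for hit in cap_hits_per_page(hits):
    print(hit["link"])
```

The third chunk of the validation page is dropped, which leaves room for the hit from the other page.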
That’s the general strategy for structuring large texts. Now for the implementation details and some pitfalls to avoid.
Take a look at DocSearch, the easiest and fastest way to add search to your documentation. It’s free!
Developer documentation often means lengthy pages filled with a lot of content. Most people try to index each complete page as one entry in their search engine. Later, they discover many edge cases and try to fix them through relevance tuning, but this quickly becomes a never-ending task because the issue comes from the indexing itself:
For example, the query "composer upgrade" will match the QuickStart page because the menu contains "Upgrade Guide" and the first paragraph contains the word "composer". This is not the kind of match that provides a good user experience.
Developers don’t like to change web pages too often and they like to have long pages containing a lot of information. If such a page is indexed as one document, it will almost systematically trigger relevance issues. This is why we do not recommend using a standard web crawler, but rather a scraper that has access to the original content (most of the time available in Markdown).
For example, querying "cache incrementing value" will match the Query Builder page because it contains a paragraph with the word "cache" and another paragraph with the words "incrementing" and "value". This is a false positive because it is not relevant: the more text you have on a page, the more irrelevant results you will get.
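The false positive is easy to reproduce. The sketch below, in Python, compares matching against a whole page with matching against each paragraph separately. The two paragraphs are invented stand-ins for the Query Builder page, not its real text, and the matching is a naive "all words present" check rather than the engine's own logic.

```python
import re

def tokens(text):
    """Lowercased set of alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def matches_all(query, text):
    """True when every query word appears somewhere in the text."""
    return tokens(query) <= tokens(text)

# Invented paragraphs standing in for one long documentation page.
paragraphs = [
    "You may cache the results of a query.",
    "There are also methods for incrementing the value of a given column.",
]
query = "cache incrementing value"

# Indexed as one document, the page matches all three words.
page_matches = matches_all(query, " ".join(paragraphs))

# Indexed as separate chunks, no single paragraph matches all three.
chunk_matches = any(matches_all(query, p) for p in paragraphs)

print(page_matches, chunk_matches)
```

The whole page matches while no individual paragraph does, which is exactly the kind of irrelevant hit that per-page indexing produces.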
In order to deliver the best user experience, it is key to open the page at the exact position of the match. This is very difficult if you only index one document per page. That’s why so many documentation searches just open the page at the top, leaving the user to scroll or use the browser’s search to jump to the right section. This is not always easy and is a waste of time.
Indexing the titles of your documentation page will probably answer common queries but this is not enough. The underlying paragraphs contain most of the words your users will search for. To obtain a great level of relevance, it’s important to index the whole content, body text included.
In this example, the text is required to correctly answer the "rememberForever" or "cache driver" queries.
With most search engines, relevance is the trickiest part of the configuration because it is often defined by a single, complex formula that mixes a lot of information and is almost impossible to manage. Engineers often adjust the formula or add some bonus/penalty scoring to improve the results on one specific query. Since they don’t have any non-regression tools, they cannot measure the real impact for all queries. The consequences can be significant.
In order to keep the ranking under control, it is key to split the ranking formula into several pieces that you understand and can tune independently. In practice, we are able to split the ranking formula with a Tie-Breaking algorithm.
Let’s imagine your ranking formula is split into two parts: the textual relevance and the use-case/business relevance.
You can then apply the textual relevance first and move to the use-case/business relevance (importance) only if two hits have the same value. This is the best way to ensure your end users will always see relevant hits first (from a text point of view, matching their query words exactly) and then, when the textual relevance is equal, break the tie using the business relevance.
Since you’re not mixing the text and business relevance together (but applying them one after another), you can modify the business relevance without impacting how the text relevance works.
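A tie-breaking sort can be sketched in a few lines of Python by sorting on a tuple. The `text_score` field and its values are hypothetical placeholders for the textual relevance, and `importance` stands for the business relevance (smaller is better); neither is meant to mirror the engine's internal scores.

```python
# Hypothetical hits: a higher text_score means a better textual match,
# a lower importance means a more important record.
hits = [
    {"title": "paragraph, partial match", "text_score": 2, "importance": 6},
    {"title": "paragraph, full match", "text_score": 3, "importance": 6},
    {"title": "page title, full match", "text_score": 3, "importance": 0},
]

def rank(hits):
    # Tuples compare element by element, so importance is only consulted
    # when two hits have the same text_score.
    return sorted(hits, key=lambda h: (-h["text_score"], h["importance"]))

for hit in rank(hits):
    print(hit["title"])
```

Because the business value sits in the second slot of the tuple, changing it can reorder hits that tie on text, but it can never lift a weaker textual match above a stronger one.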
To solve all those pitfalls, we split the page into many smaller chunks, indexed as separate records, by using the HTML structure of the page (H1, H2, H3, H4, P).
See the Validation page of Laravel’s documentation:
The first record generated will be the Validation page title. It will be transformed into the following JSON object. The “link” attribute only contains the last part of the URL, the first part being easily rebuilt with the tag:
{
"h1": "Validation",
"link": "validation",
"importance": 0,
"_tags": [
"5.1"
],
"objectID": "master-validation-13148717f8faa9037f37d28971dfc219"
}
Then, the first section of the page (the Introduction) will be turned into the following record. The link now contains an anchor, and the record keeps the title of the page:
{
"h1": "Validation",
"h2": "Introduction",
"link": "validation#introduction",
"importance": 1,
"_tags": [
"5.1"
],
"objectID": "master-validation#introduction-eeafb566c2af34e739e2685efdb45524"
}
A paragraph of this page under an H3 section would be translated into the following record:
{
"h1": "Validation",
"h2": "Validation Quickstart",
"h3": "Defining the Routes",
"link": "validation#validation-quickstart",
"content": "First, let's assume we have the following routes defined in our `app/Http/routes.php` file:",
"importance": 6,
"_tags": [
"5.1"
],
"objectID": "5.1-validation#validation-quickstart-380c9827712413dbe75b5db515cd3e59"
}
This approach fixes pitfalls #1 and #2. We have solved the problem by indexing each chunk of text as an independent record while keeping the title hierarchy in each record.
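The chunking step can be sketched as follows, assuming the source content is available as Markdown with `#` headings. The function names, the sample page text, and the choice to anchor each record at its nearest heading are assumptions made for illustration. The `importance` and `objectID` attributes are left out to keep the sketch short.

```python
import re

HEADING = re.compile(r"^(#{1,4})\s+(.*)$")

def slugify(title):
    """Turn a heading into an anchor, e.g. 'Defining the Routes' -> 'defining-the-routes'."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

def page_to_records(page_slug, markdown, version):
    """Split one Markdown page into one record per heading and per paragraph."""
    records = []
    hierarchy = {}    # titles currently in scope, e.g. {"h1": ..., "h2": ...}
    link = page_slug
    for line in markdown.splitlines():
        line = line.strip()
        if not line:
            continue
        match = HEADING.match(line)
        if match:
            level = len(match.group(1))
            # A new heading closes every title at the same depth or deeper.
            hierarchy = {k: v for k, v in hierarchy.items() if int(k[1]) < level}
            hierarchy[f"h{level}"] = match.group(2)
            link = page_slug if level == 1 else f"{page_slug}#{slugify(match.group(2))}"
            records.append({**hierarchy, "link": link, "_tags": [version]})
        else:
            # A paragraph inherits the full title hierarchy above it.
            records.append({**hierarchy, "link": link, "content": line, "_tags": [version]})
    return records

# Invented sample page, loosely shaped like the Validation page.
sample = """# Validation
## Introduction
Some introductory text.
## Validation Quickstart
### Defining the Routes
First, some routes.
"""

for record in page_to_records("validation", sample, "5.1"):
    print(record)
```

Each paragraph record carries its `h1`, `h2`, and `h3` titles, so a query can match a title and a paragraph in the same record without pulling in unrelated text from the rest of the page.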
Algolia is designed natively to use a Tie-Breaking algorithm so that everyone can understand and tune the ranking. Now, pitfall #3 can be easily resolved by applying the settings we recommend for a documentation search implementation:
Matching hits will now be sorted against those six ranking criteria: the first 5 are related to text relevance and the last one is the custom business relevance.
First, we sort by the number of query words found in the records. We have decided to process the query with all words as mandatory (AND between query terms). If there are not enough matching words, we run the query again with all words as optional (OR between query terms). This process is configured with a single index setting and allows you to get the best of both worlds: AND reduces the number of false positives, while OR returns results even if the query is too narrow.
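The AND-then-OR behavior can be approximated in Python as below. This is a sketch under simplifying assumptions: records are plain strings, the fallback triggers only when the AND pass returns nothing, and the sample records are invented. The real behavior is driven by an index setting, not by application code like this.

```python
import re

def tokens(text):
    """Lowercased set of alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def search(query, records):
    terms = tokens(query)
    # First pass: every query word is mandatory (AND).
    hits = [r for r in records if terms <= tokens(r)]
    if hits:
        return hits
    # Fallback: every query word is optional (OR); records matching
    # more words come first, and the sort keeps ties in input order.
    scored = [(len(terms & tokens(r)), r) for r in records]
    scored = [(count, r) for count, r in scored if count > 0]
    scored.sort(key=lambda pair: -pair[0])
    return [r for _, r in scored]

# Invented records for illustration.
records = [
    "The cache configuration is located at config/cache.php",
    "Validation quickstart",
    "Cache driver prerequisites",
]

print(search("cache configuration", records))
print(search("cache driver tuning", records))
```

The first query is answered by the AND pass alone. The second has no record containing all three words, so the OR pass returns the records that match some of them, best match first.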
If two records match with the same number of search terms, we use the number of typos as the differentiator (so we have exact matches first, then matches with 1 typo, then matches with 2 typos, …).
For example if the query is “validator”, the record that contains “validate” will match with some typos but will be retrieved after the record containing “validator”.
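As a rough illustration of counting typos, the sketch below uses plain Levenshtein edit distance as a stand-in. That choice is an assumption: the engine's own typo counting may differ (for instance in how it treats prefixes or transposed letters), so the numbers here are only indicative.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]

query = "validator"
candidates = ["validate", "validator"]

# Fewer typos first: the exact word outranks the near match.
ranked = sorted(candidates, key=lambda word: edit_distance(query, word))
print(ranked)
```

Sorting by distance puts the record containing the exact word ahead of the one that only matches with typos.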
When two records are identical for the words and typos ranking criteria, we then move to the next criterion, which compares the proximity of the query terms in the record. It will basically count the number of words in between them until a limit is reached (after a certain point they are considered as “too far”).
For example, the "cache configuration" query will have a proximity of 1 when it matches the sentence "The cache configuration is ..." and a proximity of 2 when it matches the sentence "... in the config/cache.php configuration file". We sort this value in increasing order because we prefer records that contain the query terms close together.
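The two proximity values in this example can be reproduced with a small Python sketch. It assumes words are split on any non-alphanumeric character (so `config/cache.php` yields three words) and sums the smallest positional gap between consecutive query terms. The upper limit mentioned above is not modeled, and the function is an illustration rather than the engine's algorithm.

```python
import re

def words_of(text):
    """Lowercased alphanumeric words, in order of appearance."""
    return re.findall(r"[a-z0-9]+", text.lower())

def proximity(query, text):
    """Sum of the smallest position gaps between consecutive query terms.

    Returns None when a query term is missing from the text.
    """
    words = words_of(text)
    positions = [[i for i, w in enumerate(words) if w == term]
                 for term in words_of(query)]
    if any(not found for found in positions):
        return None
    total = 0
    for left, right in zip(positions, positions[1:]):
        total += min(abs(r - l) for l in left for r in right)
    return total

print(proximity("cache configuration", "The cache configuration is ..."))
print(proximity("cache configuration",
                "... in the config/cache.php configuration file"))
```

In the second sentence the word "php" sits between "cache" and "configuration", which is why its proximity value is higher.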
If two records are identical for the first three ranking criteria, we use the name of the matched attribute to determine which hit needs to be retrieved first. In the index settings, just order the attributes you want to search by order of importance:
That means that a match inside h1 ranks better than a match in h2, which ranks better than a match in h3, etc. You can also notice there is an “unordered” flag on each attribute. It means that the position of the match inside the attribute is not considered in the ranking. That’s why the query "cache" will match with the same attribute score for a record that contains "[Cache Configuration]" or "[Obtaining a cache instance]" for the same attribute.
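The attribute criterion can be sketched in Python as a lookup of the first attribute that contains the query word. The attribute order `h1, h2, h3, h4, content` is an assumption based on the records shown earlier, and the sample records are invented. Because only the attribute is considered, not the word's position inside it, this mirrors the “unordered” behavior.

```python
import re

# Assumed order of importance, most important first.
SEARCHABLE = ["h1", "h2", "h3", "h4", "content"]

def attribute_rank(query_word, record):
    """Index of the most important attribute containing the word, or None."""
    for rank, attribute in enumerate(SEARCHABLE):
        words = re.findall(r"[a-z0-9]+", record.get(attribute, "").lower())
        if query_word.lower() in words:
            return rank
    return None

# Invented records: the word sits at different positions in the same attribute.
first = {"h1": "Cache", "h2": "Cache Configuration"}
second = {"h1": "Validation", "h2": "Obtaining a cache instance"}
third = {"h1": "Cache", "h2": "Obtaining a cache instance"}

print(attribute_rank("cache", first))
print(attribute_rank("cache", second))
print(attribute_rank("cache", third))
```

The first and third records both match in `h1` and get the same rank, while the second only matches in `h2` and ranks after them, regardless of where the word appears inside the title.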
If two records are identical for the first four criteria, then we use the number of query terms that match exactly in the record to determine which hit needs to be retrieved first. Because we’re returning results after each keystroke, the last query term will mostly match as a prefix (it will match the beginning of words). This criterion is used to rank an exact match before a prefix match.
For example, the query “valid” will retrieve the records containing “valid” before the ones containing “validation”.
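A minimal Python sketch of the exact criterion follows. It assumes the last query term is matched as a prefix and all earlier terms as whole words; the two record texts are invented, and the helper names are hypothetical.

```python
import re

def words_of(text):
    """Lowercased alphanumeric words, in order of appearance."""
    return re.findall(r"[a-z0-9]+", text.lower())

def matches(query, text):
    """All terms but the last must be whole words; the last may be a prefix."""
    terms, words = words_of(query), words_of(text)
    if not terms:
        return False
    *complete, last = terms
    return (all(term in words for term in complete)
            and any(word.startswith(last) for word in words))

def exact_count(query, text):
    """Number of query terms that appear as whole words in the text."""
    words = set(words_of(text))
    return sum(1 for term in words_of(query) if term in words)

# Invented record texts for illustration.
records = ["Validation quickstart", "Check that the input is valid"]
query = "valid"

hits = [r for r in records if matches(query, r)]
# More exact matches first: the whole word beats the prefix match.
hits.sort(key=lambda r: -exact_count(query, r))
print(hits)
```

Both records match while the user is typing, but the one containing the complete word is returned first.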
There is still one important thing missing: your use-case/business criterion. If all previous criteria are identical for two records, we use the custom ranking which is defined by the user.
For example, searching for "Validation" will match the two following records using the most important “h1” attribute. That results in a tie on all previous criteria but we want to retrieve the page title first because the other record is a paragraph. This is how the "importance" attribute plays out when added to the records.
{
"h1": "Validation",
"link": "validation",
"importance": 0,
"_tags": [
"5.1"
],
"objectID": "master-validation-13148717f8faa9037f37d28971dfc219"
}
{
"h1": "Validation",
"h2": "Working With Error Messages",
"h3": "Custom Error Messages",
"link": "validation#custom-error-messages",
"content": "If needed, you may use custom error messages for validation instead of the defaults. There are several ways to specify custom messages. First, you may pass the custom messages as the third argument to the `Validator::make` method:",
"importance": 6,
"_tags": [
"5.1"
],
"objectID": "5.1-validation#custom-error-messages-380c9827712413dbe75b5db515cd3e59"
}
The “importance” value is an integer that goes from 0 (page title) to 7 (text section under h4) and that we use in the custom ranking in an ascending order (the smaller, the better):
The complete scale of importance is the following:
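One way to compute such a scale is sketched below in Python. The end points come from the values quoted in this article (0 for a page title, 1 for an h2 title, 6 for a paragraph under h3, 7 for text under h4); the intermediate values are an assumption that follows the same pattern, and the function name is hypothetical.

```python
HEADINGS = ["h1", "h2", "h3", "h4"]

def importance(record):
    """0 to 3 for a title record (h1 to h4), 4 to 7 for text under h1 to h4.

    Assumes every record carries at least an h1 title.
    """
    depth = max(i for i, heading in enumerate(HEADINGS) if heading in record)
    return depth + len(HEADINGS) if "content" in record else depth

title = {"h1": "Validation"}
paragraph = {
    "h1": "Validation",
    "h2": "Working With Error Messages",
    "h3": "Custom Error Messages",
    "content": "If needed, you may use custom error messages ...",
}

# Ascending custom ranking: the smaller the importance, the better.
for record in sorted([paragraph, title], key=importance):
    print(importance(record), record["h1"], record.get("h3", ""))
```

With this scale, the page title record sorts before the paragraph record when both tie on every textual criterion.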
We have successfully applied this recipe to several technical documentation sites, such as Laravel, Bootstrap, and many others. The way results are displayed differs, but we use exactly the same approach and the same API.
One of our missions is to help all developers navigate technical documentation. If you are working on an open source project, we’d be happy to help you: here’s how to get started with DocSearch. We will provide you with a free Algolia account and with any support needed to make your implementation a best-in-class reference. Drop us a note!