Structuring content for websites using a logical and well-organized information architecture makes it easier for crawlers and search engines to index relevant content for specific queries.
That’s because search is at its best when you organize online content into a hierarchy of web pages and structure each page into small bits of attributes, like title, description, sections, and paragraphs.
Luckily, a large majority of information on the web can be structured in this way. However, some content does not break down so easily, such as document search or purely textual content like blogs, technical documentation, and online news journals.
Document-based search may seem easy at first – it’s just matching the text of the content to the text of the query. But there are several pitfalls to web page structures that you need to be aware of and avoid. This article explains those pitfalls, proposing the most optimized index and web page structure in large document search.
While the suggestions in this article may apply to any website that offers an organized collection of textual content (blog, newsfeed), we discuss only large texts within the context of technical documentation, using our Laravel technical documentation implementation.
Website information architecture and document search
Before diving into the details of those pitfalls, let’s pull out the essentials.
Indexing – how to structure the data
Break up each page into small chunks, and save each chunk as a separate record. Incorporate the hierarchy of the website into each record.
Search Engine – finding and ranking records
Algolia does not use statistics or techniques like NLP or TF-IDF to understand or interpret the text. Instead, it focuses on the characters of the text itself, creating relevance by textually matching the query with the content, and then applying a ranking formula, typo tolerance, proximity, and other smart ways to read the text and order the results.
Front-end UI/UX – best practices
Show only parts of the text in the search results, not a large portion. Otherwise, it’s too much to read or scan. Additionally, only allow two instances of the same page to show up in the results, to make room for other pages that match.
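The two-hits-per-page rule can be sketched on the client side as follows. This is a minimal illustration, assuming each hit carries a `link` attribute of the form `page#anchor`, as in the records shown later in this article; the function name and the sample links are hypothetical.

```python
from collections import defaultdict


def limit_hits_per_page(hits, max_per_page=2):
    """Keep the ranked order, but at most `max_per_page` hits from each page."""
    seen = defaultdict(int)
    kept = []
    for hit in hits:
        page = hit["link"].split("#", 1)[0]  # the page is the link without its anchor
        if seen[page] < max_per_page:
            seen[page] += 1
            kept.append(hit)
    return kept


# Hypothetical ranked hits: three from the same page, one from another.
hits = [
    {"link": "validation#validation-quickstart"},
    {"link": "validation#custom-error-messages"},
    {"link": "validation#introduction"},
    {"link": "cache#configuration"},
]
visible = [h["link"] for h in limit_hits_per_page(hits)]
print(visible)
```

The third hit from the validation page is dropped, which leaves room for the cache page.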
That’s the general strategy for structuring large texts. Now for the implementation details and some pitfalls to avoid.
Pitfall Nº1: the web page as the default entry
Developer documentation often means lengthy pages filled with a lot of content. Most people try to index the complete page as one entry in their search engine. Later on, they discover a lot of edge cases and try to fix them through relevance tuning, but this quickly becomes an endless story because the issue comes from the indexing itself:
1. Relevant content only
For example, the query "composer upgrade" will match the QuickStart page because the menu contains "Upgrade Guide" and the first paragraph contains the word "composer". This is not the kind of match that provides a good user experience.
2. Pages contain overly long pieces of text
Developers don’t like to change web pages too often and they like to have long pages containing a lot of information. If such a page is indexed as one document, it will almost systematically trigger relevance issues. This is why we do not recommend using a standard web crawler, but rather a scraper that has access to the original content (most of the time available in Markdown).
For example, querying "cache incrementing value" will match the Query Builder page because it contains a paragraph with the word "cache" and another paragraph with the words "incrementing" and "value". This is a false positive because it is not relevant: the more text you have on a page, the more irrelevant results you will get.
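The false positive above can be reproduced with a few lines of Python. This is only a sketch: the two paragraphs are hypothetical stand-ins for the Query Builder page, and the matcher is a naive "all query words present" check, not Algolia's engine.

```python
import re


def words(text):
    """Lowercased set of alphanumeric words in `text`."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def matches_all(query, text):
    """True when every query word appears somewhere in `text`."""
    return words(query) <= words(text)


# Hypothetical paragraphs standing in for one long documentation page.
paragraphs = [
    "You may cache the results of a query.",
    "Helper methods for incrementing or decrementing the value of a column.",
]
page = " ".join(paragraphs)
query = "cache incrementing value"

page_match = matches_all(query, page)
chunk_matches = [matches_all(query, p) for p in paragraphs]
print(page_match, chunk_matches)
```

Indexed as one document, the page matches the query. Indexed paragraph by paragraph, neither chunk does, so the false positive disappears.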
3. The right anchored section
In order to deliver the best user experience, it is key to open the page at the exact position of the match. This is very difficult if you only index one document per page. That’s why so many documentation searches just open the page at the top, and the user needs to scroll or use the browser’s search to jump to the right section. This is not always easy and is a waste of time.
Pitfall Nº2: indexing titles only
Indexing the titles of your documentation page will probably answer common queries but this is not enough. The underlying paragraphs contain most of the words your users will search for. To obtain a great level of relevance, it’s important to index the whole content, body text included.
In this example, the text is required to correctly answer the "rememberForever" or "cache driver" queries.
Pitfall Nº3: poor relevance
With most search engines, relevance is the trickiest part of the configuration because it is often defined by a single complex formula that mixes a lot of information and is almost impossible to manage. Engineers often adjust the formula or add some bonus/malus scoring to improve the results on one specific query. Since they don’t have any non-regression tools, they cannot measure the real impact on all queries. The consequences can be significant.
In order to keep the ranking under control, it is key to split the ranking formula into several pieces that you understand and can tune independently. In practice, we are able to split the ranking formula with a Tie-Breaking algorithm.
Let’s imagine your ranking formula is split into two parts:
the first one defines the textual relevance of a matching hit,
the second one defines the importance of a matching hit (from a use-case/business POV).
You can then first apply the textual relevance and, only if two hits have the same value, move to the use-case/business relevance (importance). This is the best way to ensure your end-users will always have relevant hits first (from a text POV, matching exactly their query words) and then, when the text relevance is equal, break the tie using the business relevance.
Since you’re not mixing together the text and the business relevance (but applying them one after another), you can modify the business relevance without impacting how the text relevance works.
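A tie-breaking sort of this kind can be sketched with a tuple sort key. The field names `text_score` (higher is better) and `business_rank` (lower is better) and the sample hits are assumptions made for this illustration.

```python
# Hypothetical hits: B and C tie on text relevance, A is textually weaker.
hits = [
    {"title": "A", "text_score": 2, "business_rank": 6},
    {"title": "B", "text_score": 3, "business_rank": 6},
    {"title": "C", "text_score": 3, "business_rank": 0},
]

# Tuples compare element by element: business_rank is only consulted
# when two hits have the same text_score.
ranked = sorted(hits, key=lambda h: (-h["text_score"], h["business_rank"]))
order = [h["title"] for h in ranked]
print(order)
```

Changing the `business_rank` of A can never lift it above B or C, because the text score is compared first.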
Our recipe
1. Create small hierarchical records
In order to avoid all those pitfalls, we split the page into many smaller chunks, indexed as separate records, by using the HTML structure of the page (H1, H2, H3, H4, P).
See the Validation page of Laravel’s documentation:
The first record generated will be the Validation page title. It will be transformed into the following JSON object. The “link” attribute only contains the last part of the URL, the first part being easily rebuilt with the tag:
Then, the first section of the page (the Introduction) will be turned into the following record. The link now contains an anchor, and the record keeps the title of the page:
A paragraph of this page under an H3 section would be translated into the following record:
{
  "h1": "Validation",
  "h2": "Validation Quickstart",
  "h3": "Defining the Routes",
  "link": "validation#validation-quickstart",
  "content": "First, let's assume we have the following routes defined in our `app/Http/routes.php` file:",
  "importance": 6,
  "_tags": [
    "5.1"
  ],
  "objectID": "5.1-validation#validation-quickstart-380c9827712413dbe75b5db515cd3e59"
}
This approach fixes pitfalls #1 and #2. We have solved the problem by indexing each chunk of text as an independent record while keeping the titles hierarchy in each record.
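The splitting step can be sketched with Python's standard `html.parser` module. This is a simplified illustration rather than the scraper actually used: the class name `ChunkParser` is hypothetical, and the `link`, `importance`, `_tags` and `objectID` attributes are left out to keep the focus on the heading hierarchy.

```python
from html.parser import HTMLParser

HEADINGS = ["h1", "h2", "h3", "h4"]


class ChunkParser(HTMLParser):
    """Turn each heading and paragraph into a record that keeps its parent titles."""

    def __init__(self):
        super().__init__()
        self.hierarchy = {}      # current title for each heading level
        self.records = []
        self.current_tag = None  # heading or <p> currently being read
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS or tag == "p":
            self.current_tag = tag
            self.buffer = []

    def handle_data(self, data):
        if self.current_tag:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self.current_tag:
            return  # e.g. an inline </code> inside a paragraph
        text = "".join(self.buffer).strip()
        if tag in HEADINGS:
            level = HEADINGS.index(tag)
            # A new heading invalidates the titles at its own and deeper levels.
            self.hierarchy = {k: v for k, v in self.hierarchy.items()
                              if HEADINGS.index(k) < level}
            self.hierarchy[tag] = text
            self.records.append(dict(self.hierarchy))
        else:
            self.records.append({**self.hierarchy, "content": text})
        self.current_tag = None


html = """
<h1>Validation</h1>
<h2>Validation Quickstart</h2>
<h3>Defining the Routes</h3>
<p>First, let's assume we have the following routes defined in our
<code>app/Http/routes.php</code> file:</p>
"""
parser = ChunkParser()
parser.feed(html)
for record in parser.records:
    print(record)
```

Each record is small enough to match precisely, yet it still carries the titles needed to rank it and to display where it sits in the page.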
2. Use a tie-breaking ranking algorithm
Algolia is designed natively to use a Tie-Breaking algorithm to make sure everyone understands and is able to tune the ranking. Pitfall #3 can now be easily resolved by applying the settings we recommend for a documentation search implementation:
Matching hits will now be sorted against those six ranking criteria: the first 5 are related to text relevance and the last one is the custom business relevance.
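As a rough sketch, settings along these lines could be written as a plain Python dictionary. The values are assumptions reconstructed from the descriptions in this article, not a copy of a production configuration. The keys are Algolia index setting names; note that `searchableAttributes` was called `attributesToIndex` in older API versions.

```python
# Assumed settings, reconstructed from the six criteria described in this article.
settings = {
    # Attributes to search, most important first; `unordered` ignores
    # the position of the match inside the attribute.
    "searchableAttributes": [
        "unordered(h1)",
        "unordered(h2)",
        "unordered(h3)",
        "unordered(h4)",
        "unordered(content)",
    ],
    # Tie-breaking order: five text criteria, then the business criterion.
    "ranking": ["words", "typo", "proximity", "attribute", "exact", "custom"],
    # Business criterion: smaller `importance` values rank first.
    "customRanking": ["asc(importance)"],
    # Retry with all query words optional when the strict query finds nothing.
    "removeWordsIfNoResults": "allOptional",
}

print(settings["ranking"])
```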
Ranking criterion Nº1: number of matched words (words)
First, we sort by the number of query words found in the records. We process the query with all words as mandatory (AND between query terms). If there are not enough matching words, we run the query again with all words as optional (OR between query terms). This process is configured with a single index setting and gives you the best of both worlds: AND reduces the number of false positives, while OR still returns results when the query is too narrow.
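The AND-then-OR behavior can be illustrated with a small in-memory search. This is a sketch under simplifying assumptions: the records are hypothetical one-line strings, matching is on whole words only, and the fallback is triggered when the strict query returns nothing.

```python
import re


def tokens(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def search(query, records):
    """Require all query words first; fall back to any word if nothing matches."""
    q = set(tokens(query))
    scored = [(len(q & set(tokens(r))), r) for r in records]
    strict = [(n, r) for n, r in scored if n == len(q)]      # AND between terms
    pool = strict or [(n, r) for n, r in scored if n > 0]    # OR between terms
    pool.sort(key=lambda pair: -pair[0])  # most matched words first (stable sort)
    return [r for _, r in pool]


# Hypothetical records.
records = [
    "Cache configuration is stored in config/cache.php",
    "Upgrade guide for the framework",
    "Install dependencies with composer",
]
strict_hits = search("cache configuration", records)
fallback_hits = search("composer upgrade", records)
print(strict_hits)
print(fallback_hits)
```

The first query is satisfied by the strict pass alone. The second has no record containing both words, so the optional pass returns the partial matches instead of an empty result page.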
Ranking criterion Nº2: number of typos (typo)
If two records match with the same number of search terms, we use the number of typos as the differentiator (so we have exact matches first, then matches with 1 typo, then matches with 2 typos, …).
For example if the query is “validator”, the record that contains “validate” will match with some typos but will be retrieved after the record containing “validator”.
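To make the idea of "number of typos" concrete, here is a sketch that uses the classic Levenshtein edit distance as a stand-in. The engine's actual typo counting is not described in this article, so treat the exact numbers as illustrative only.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free when equal
            ))
        prev = cur
    return prev[-1]


query = "validator"
candidates = ["validate", "validator"]
# Fewer typos first: the exact word outranks the approximate one.
ranked = sorted(candidates, key=lambda w: edit_distance(query, w))
print(ranked, [edit_distance(query, w) for w in ranked])
```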
Ranking criterion Nº3: proximity between query terms (proximity)
When two records are identical for the words and typos ranking criteria, we then move to the next criterion, which compares the proximity of the query terms in the record. It basically counts the number of words in between them until a limit is reached (after a certain point they are considered as “too far”).
For example, the "cache configuration" query will have a proximity of 1 when it matches the sentence "The cache configuration is ..." and a proximity of 2 when it matches the sentence "... in the config/cache.php configuration file". We sort by this value in increasing order because we prefer records that contain the query terms close together.
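The two proximity values above can be reproduced with a short function. It is a sketch that assumes proximity is the smallest distance in word positions between consecutive query terms, capped at a hypothetical `limit` of 8; the real cap is not stated in this article.

```python
import re


def tokens(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def proximity(query, text, limit=8):
    """Sum, over consecutive query terms, of their closest distance in `text`."""
    q = tokens(query)
    words = tokens(text)
    total = 0
    for a, b in zip(q, q[1:]):
        pos_a = [i for i, w in enumerate(words) if w == a]
        pos_b = [i for i, w in enumerate(words) if w == b]
        gaps = [abs(i - j) for i in pos_a for j in pos_b]
        # Missing or distant terms count as "too far" and are capped at `limit`.
        total += min(min(gaps, default=limit), limit)
    return total


near = proximity("cache configuration", "The cache configuration is ...")
far = proximity("cache configuration", "... in the config/cache.php configuration file")
print(near, far)
```

In the second sentence, "config/cache.php" is split into three words, so "php" sits between "cache" and "configuration" and the distance is 2.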
Ranking criterion Nº4: the matched attribute (attribute)
If two records are identical for the first three ranking criteria, we use the name of the matched attribute to determine which hit needs to be retrieved first. In the index settings, just order the attributes you want to search by order of importance:
That means that if the match is identified inside h1, it will be better than in h2, better than in h3, etc. You can also notice there is an “unordered” flag on each attribute. It means that the position of the match inside the attribute is not considered in the ranking. That’s why the query "cache" will match with the same attribute score for a record that contains "[Cache Configuration]" or "[Obtaining a cache instance]" for the same attribute.
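The attribute criterion can be sketched as a lookup over an ordered attribute list. The order `h1` to `h4` then `content` follows the record structure used in this article; the function name and the sample records are hypothetical.

```python
# Searchable attributes, most important first (assumed order).
ATTRIBUTES = ["h1", "h2", "h3", "h4", "content"]


def attribute_rank(query_word, record):
    """Index of the best attribute containing the word (lower is better).

    Mirrors the `unordered` flag: where the word sits inside the
    attribute does not matter, only which attribute it is in.
    """
    for rank, name in enumerate(ATTRIBUTES):
        if query_word.lower() in record.get(name, "").lower().split():
            return rank
    return len(ATTRIBUTES)  # no match in any attribute


rec_a = {"h2": "Cache Configuration"}
rec_b = {"h2": "Obtaining a Cache Instance"}
rec_c = {"content": "The cache configuration is located at config/cache.php"}
ranks = [attribute_rank("cache", r) for r in (rec_a, rec_b, rec_c)]
print(ranks)
```

The first two records tie even though "cache" is the first word in one title and the third word in the other, while the body-text match ranks behind both.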
Ranking criterion Nº5: the number of terms matching exactly (exact)
If two records are identical for the first four criteria, then we use the number of query terms that match exactly in the record to determine which hit needs to be retrieved first. Because we’re returning results after each keystroke, the last query term will mostly match as a prefix (it will match the beginning of words). This criterion is used to rank an exact match before a prefix match.
For example the query “valid” will retrieve the records containing “valid” before the ones containing “validation”.
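A minimal sketch of the exact-before-prefix rule, using two hypothetical record texts and whole-word comparison on whitespace-separated words:

```python
def matches_prefix(term, text):
    """True when any word of `text` starts with `term` (as-you-type matching)."""
    return any(w.startswith(term.lower()) for w in text.lower().split())


def exact_matches(query, text):
    """Number of query terms that appear as complete words in `text`."""
    words = text.lower().split()
    return sum(term in words for term in query.lower().split())


# Hypothetical record texts; both match the prefix "valid".
records = ["Validation rules", "A valid email address"]
hits = [r for r in records if matches_prefix("valid", r)]
hits.sort(key=lambda r: -exact_matches("valid", r))  # exact matches first
print(hits)
```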
Ranking criterion Nº6: business ranking (custom)
There is still one important thing missing: your use-case/business criterion. If all previous criteria are identical for two records, we use the custom ranking which is defined by the user.
For example, searching for "Validation" will match the two following records using the most important “h1” attribute. That results in a tie on all previous criteria but we want to retrieve the page title first because the other record is a paragraph. This is how the "importance" attribute plays out when added to the records.
{
  "h1": "Validation",
  "h2": "Working With Error Messages",
  "h3": "Custom Error Messages",
  "link": "validation#custom-error-messages",
  "content": "If needed, you may use custom error messages for validation instead of the defaults. There are several ways to specify custom messages. First, you may pass the custom messages as the third argument to the `Validator::make` method:",
  "importance": 6,
  "_tags": [
    "5.1"
  ],
  "objectID": "5.1-validation#custom-error-messages-380c9827712413dbe75b5db515cd3e59"
}
The “importance” value is an integer that goes from 0 (page title) to 7 (text section under h4), which we use in the custom ranking in ascending order (the smaller, the better):
The complete scale of importance is the following:
0 for h1,
1 for h2,
2 for h3,
3 for h4,
4 for text under h1,
5 for text under h2,
6 for text under h3,
and 7 for text under h4.
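The scale above follows a simple rule: a heading gets its level index, and text gets the index of its closest heading plus four. A small sketch, with a hypothetical function name:

```python
HEADINGS = ("h1", "h2", "h3", "h4")


def importance(level, is_heading):
    """Importance of a record: headings score 0-3, text under them 4-7."""
    offset = HEADINGS.index(level)
    return offset if is_heading else offset + len(HEADINGS)


scale = [importance(h, True) for h in HEADINGS] + [importance(h, False) for h in HEADINGS]
print(scale)
```

A paragraph under an h3, like the records shown in this article, therefore gets `importance("h3", False)`, which is 6.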
This is a generic recipe
We have successfully applied this recipe to several technical documentation sites, such as Laravel, Bootstrap, and many others. The way results are displayed differs, but we use exactly the same approach and the same API.
One of our missions is to help all developers navigate technical documentation. If you are working on an open source project, we’d be happy to help you: here’s how to get started with DocSearch. We will provide you with a free Algolia account and with any support to make your implementation a best-in-class reference. Drop us a note!