Broadly speaking, a search index is like the index at the end of a book, where a small, non-exhaustive list of words and subjects is given with page numbers. More precisely, it’s the mapping of a query to the content in a corpus (a large set of online books and documents, or a product or film catalog). In computer jargon, it’s an inverted list (index) of words that a search engine uses to locate any word in any document within a corpus.
But is the metaphor of the book index actually correct? As in all matters related to technology, it’s hard to find a good balance between providing an overview of a subject and diving in deep – without losing meaning or your audience.
In the past, we’ve answered the question What is a search index? in different ways:
This article covers a middle ground between the functional and the technical, defining the capabilities of the powerful search indexes we often see in Google, Amazon, and Netflix, and providing an introduction to how these indexes can perform at such fast speeds.
A book index for a biography looks like this:
A search index can be represented in a very similar manner:
The book metaphor is useful because it underscores the general idea that an index is a separate object from the underlying content, and that it is used to find specific parts of that content easily and quickly (pages in a book, documents in a collection of documents).
To use another metaphor, an index helps us navigate a book the way a compass helps us navigate a map, where the compass replaces the need to scan the map. In the same way, an index at the end of a book is far more efficient than scanning the whole book for one phrase: it saves you time and is more reliable. In the example above, the index directs you reliably to the exact sections in a biography that discuss the “early life” of the subject.
A metaphor only goes so far. The book metaphor doesn’t fully capture the capabilities, purposes, and mechanisms, nor our expectations, of a search engine index.
For example:
Let’s just say that the metaphor of a book index gets you in the door to understanding what an index does, but details like the above (and there are many more) help you understand the full potential of what a search engine index can accomplish and how it has transformed our lives.
A search index can be used in two different contexts:
Now, you can also search for books by tagging them with subjects, themes, authors, etc., but if the underlying goal of the search is to find content, the expectation is that every word and sentence in the book is searchable.
A successful object-based search (as we’ve defined it here) relies on a set of attributes that describe objects sufficiently so that a searcher can find what they are looking for using a reasonably small set of well-chosen keywords. A keyword can be one or more words, or even the first few characters of the first word. For example, while looking for the film Star Wars, a user might only need to type in “star”; but if the search engine bases its search algorithm on popularity (that is, it favors popular films in the first results), then “st” should be sufficient to find the blockbuster Star Wars.
If you want to find a movie, you most likely need only a few attributes, such as title, description, cast, crew, year, and a few others. If you want to perform more general research, you’ll add attributes like themes, dialogues, cross-references, and additional background information. However, the list of attributes can get quite large. For example, cars have thousands of attributes: material used, the name, type, and year of each part, owner history, factories and repair history, speed, and so on.
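The “st” example can be sketched in a few lines. This is only an illustration, not how any particular search engine works: the catalog, the popularity scores, and the `prefix_search` function are all hypothetical, and the sketch matches a prefix against any word in the title rather than only the first.

```python
# Hypothetical catalog: titles and popularity scores are made up for illustration.
FILMS = [
    {"title": "Star Wars", "popularity": 98},
    {"title": "Stand by Me", "popularity": 74},
    {"title": "Stalker", "popularity": 61},
    {"title": "A Star Is Born", "popularity": 80},
    {"title": "Jaws", "popularity": 85},
]


def prefix_search(query, films):
    """Return films with a title word starting with `query`, most popular first."""
    q = query.lower()
    hits = [
        film for film in films
        if any(word.startswith(q) for word in film["title"].lower().split())
    ]
    # Popularity decides the order of the matching films.
    return sorted(hits, key=lambda film: film["popularity"], reverse=True)


results = prefix_search("st", FILMS)
print([film["title"] for film in results])
```

With this made-up data, two characters are enough to bring Star Wars to the top, because the popularity score breaks the tie between all the titles that match “st”.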
What all objects have in common is the notion of keywords. Keywords are the words the owners of the content use as they build an object’s attributes, such as title, brand, author, year, and price. Or from another point of view: keywords are the “words” that a search engine uses to match the words in an index with the query of the searcher.
As we’ve outlined above, a search engine identifies documents (books, web pages, products) that match a user’s query (keywords). Scanning every document at query time would be far too slow, so it uses an index: either an exhaustive index of every word, or an attribute-based index with a subset of the most important descriptions.
An index is created before a user searches. It is a pre-scan of the underlying content. It’s also stored separately from that content on the server. For example, in a content-based scenario, the search engine pre-scans every document and saves all the unique words in an index. Many search engines structure their index as an “inverted index”, as we describe in the last (fun) section.
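To make the pre-scan idea concrete, here is a minimal sketch that contrasts scanning documents at query time with looking up a word in an index built ahead of time. The mini-corpus and the function names (`tokenize`, `scan_search`, `build_index`, `index_search`) are assumptions made for this illustration.

```python
import re
from collections import defaultdict

# Hypothetical mini-corpus; document IDs and text are invented for illustration.
DOCUMENTS = {
    1: "The aardvark is a nocturnal mammal",
    2: "A biography covers the early life of its subject",
    3: "The early aardvark gets the ants",
}


def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def scan_search(word, documents):
    """Query-time scan: every document is read on every search."""
    return {doc_id for doc_id, text in documents.items() if word in tokenize(text)}


def build_index(documents):
    """Index-time pre-scan: record each unique word and the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return dict(index)


def index_search(word, index):
    """Query-time lookup: no document is read, only the index."""
    return index.get(word, set())


INDEX = build_index(DOCUMENTS)  # built once, before any user searches
print(index_search("early", INDEX))
```

Both functions return the same documents. The difference is where the work happens: `scan_search` pays the cost of reading the corpus on every query, while `build_index` pays it once up front.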
Search indexes come with an order. For online searches like Google and Amazon, search results are usually ordered by the “best” matches, not the most “accurate” ones.
In those contexts, it’s not only about accurate results. If a user types in “brad” and Brad Pitt comes up, that doesn’t mean it’s accurate. Other results will include Brad Davis or the Brady theater. They are all relevant in different ways, but none of them can be considered “accurate”. One user who types in “brad” might choose to go to Brad Pitt’s Wikipedia page, another might go to Brad Pitt’s IMDb page. Accuracy doesn’t really capture the meaning of these choices.
It’s all about how right the result feels to a given user, or how well the result matches the intent of the searcher. To return to the compass metaphor, a compass helps us navigate by combining accuracy and relevance. It gives us accuracy in terms of north and south, but it also gives us relevance by pointing in the general direction of our destination, helping us match our intentions with our knowledge of the physical world. On the other hand, we expect a GPS system to be accurate, not relevant.
Consider the bank employee who looks up your records and finds out that you owe the bank some money. The bank employee’s search results had better be completely accurate. Likewise, when store employees or customers look for precise products, they are not interested in relevance: they rely on accurate, exact product identifiers.
This is not to say that searching by relevance does not contain an aspect of “accuracy”. For example, if someone types in “ball point pen”, the accurate part is finding all products that have an attribute containing the words “ball point pen”. However, accuracy then gives way to relevance: which ball point pen to show first.
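The two stages in the “ball point pen” example, matching first and ranking second, can be sketched as follows. The product records, the sales figures, and the choice of sales as the ranking signal are all hypothetical; a real engine would combine many more signals.

```python
# Hypothetical product records; names and sales figures are invented.
PRODUCTS = [
    {"name": "Blue ball point pen", "sales": 120},
    {"name": "Fountain pen", "sales": 300},
    {"name": "Ball point pen, pack of 10", "sales": 950},
    {"name": "Tennis ball", "sales": 500},
]


def search(query, products):
    """Match on every query word, then order the matches by sales."""
    terms = query.lower().split()
    # Accuracy: keep only products whose name contains every query word.
    matches = [
        product for product in products
        if all(
            term in product["name"].lower().replace(",", " ").split()
            for term in terms
        )
    ]
    # Relevance: decide which matching product to show first (here, by sales).
    return sorted(matches, key=lambda product: product["sales"], reverse=True)


for product in search("ball point pen", PRODUCTS):
    print(product["name"])
```

The filter step answers “which products match?”, and the sort step answers “which match goes first?”. Swapping the sort key is enough to change the ranking without touching the matching logic.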
A more technical way to explain the difference is to consider the difference between a database and a search index. Database-like indexing (the bank example) is centered around accuracy: ensuring that exact matches are properly sorted and exhaustive. A search-index-based search like Google is more flexible: the matching is a mix of textual accuracy and relevance (hence the practice of optimizing your content, which we call SEO, or search engine optimization).
Similar to Google, the site search we see on Amazon and Netflix, and on websites where search is provided by Algolia, relies on a combination of structured sets of attributes and a ranking system that bases relevance on popularity, trends, likes, and a business’s product-promotional needs.
Okay, so let’s open up the hood. A search engine index is saved in a structure that enables fast retrieval. We call this structure an inverted index. One thing to note: an index is saved separately on the server, in a different location than the data.
While there are many types of inverted indexes, with many nuances, the following diagram sums up the idea:
As you can see, with an inverted index, the search engine inverts the logic. Instead of reading (scanning) a document looking for words, it uses the words to find the documents. Here’s an example of an inverted index:
… and so on. Let’s say there are tens of thousands of unique words in 999 documents.
In the above diagram, the search engine followed this process to find “aardvark” in the inverted index:
That’s how every word in a set of documents is stored in an index. It gets more complicated for non-prefix, middle-of-the-word queries, but you get the idea.
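As a rough illustration of the idea, here is a minimal sketch of an inverted index with exact and prefix lookups. It is not the structure any specific engine uses: the words, the document IDs, and the helper names (`find_documents`, `find_by_prefix`) are hypothetical, and the sorted word list with a binary search is just one simple way to support prefixes.

```python
from bisect import bisect_left

# Hypothetical inverted index: each word maps to the IDs of documents containing it.
INVERTED_INDEX = {
    "aardvark": [3, 17, 256],
    "aaron": [42],
    "abacus": [5, 17],
    "early": [2, 3, 88],
    "earth": [9],
}

# Keep the vocabulary sorted so words sharing a prefix sit next to each other.
VOCABULARY = sorted(INVERTED_INDEX)


def find_documents(word):
    """Exact lookup: go from the word straight to its document IDs."""
    return INVERTED_INDEX.get(word, [])


def find_by_prefix(prefix):
    """Prefix lookup: binary-search to the first candidate word, then walk forward."""
    doc_ids = set()
    start = bisect_left(VOCABULARY, prefix)
    for word in VOCABULARY[start:]:
        if not word.startswith(prefix):
            break  # sorted order means no later word can match
        doc_ids.update(INVERTED_INDEX[word])
    return sorted(doc_ids)


print(find_documents("aardvark"))
print(find_by_prefix("ear"))
```

Neither lookup opens a document. This sketch handles only whole words and prefixes; middle-of-the-word queries would need a different structure, which is the complication mentioned above.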
And that’s all … Well, there are a lot more details. If you’re interested in more, check out the Algolia CTO’s article on the inside story of indexing.
Peter Villani
Sr. Tech & Business Writer