Broadly speaking, a search index is like the index at the end of a book, where a small, non-exhaustive list of words and subjects are listed with page numbers. More precisely, it’s the mapping of a query to the content in a corpus (a large set of online books and documents, a product or film catalog). In computer-jargon, it’s an inverted list (index) of words that a search engine uses to find every word in every document within a corpus. 

But is the metaphor of the book index actually correct? As in all matters related to technology, it’s hard to find a good balance between providing an overview of a subject and diving in deep – without losing meaning or your audience.

In the past, we’ve answered the question What is a search index? in different ways: 

  • At the highest level, we described what an index does, focusing especially on how it has transformed our way of shopping online 
  • At a functional level, we illustrated how a fast indexing process can be leveraged to drive a number of different and surprising scenarios 
  • At the technical level, we dove inside our seach engine index to describe the actual physical structure of an index, and how this structure enables our search engine to store millions of records and still be able to perform a search in milliseconds 

This article covers a middle ground between the functional and the technical, defining the capabilities of the powerful search indexes we often see in Google, Amazon, and Netflix, and providing an introduction to how these indexes can perform at such fast speeds. 

The book metaphor: a search index is like the index at the back of a book

A book index for a biography looks like this: 

  • Marriage: pages 30, 50-67, 246
  • Early life: pages 2-15, 77-89
  • Father: pages 10, 83-85, 150-178

A search index can be represented in a very similar manner:

  • Marriage -> documents 1, 5, 400, 900
  • Early life  -> documents 33, 45, 98, 235
  • Father  -> documents 101, 345, 785

The book metaphor is useful because it underscores the general idea that an index is a separate object from the underlying content, which is used to (easily and quickly) find specific parts of the content (pages in a book, documents in a collection of documents)

To use another metaphor, an index helps us navigate a book like a compass to a map, where the compass replaces the need to scan the map. In the same way, an index at the end of a book is far more efficient than scanning the whole book for one phrase: it obviously saves you time and is more reliable. In the example above, the index directs you reliably to the exact sections in a biography that discuss the “early life” of the subject. 

But is a book index the same as a search index?

A metaphor only goes so far. The book metaphor doesn’t fully capture the capabilities, purposes, and mechanisms, nor our expectations, of a search engine index.

For example:

  • A search index normally indexes every word and part of a book or set of content. A book’s index would be too long if it contained every word and theme in the book. However, search technology is powerful enough to create and search through an exhaustive index of words and themes.
  • We expect that a search engine will show us every piece of content that mentions the words we type in when entering a search query.
  • A search index usually searches more than one book. This is one of its most powerful aspects, to combine and search through large amounts of content in a split second. Examples of content: products on a company’s website (site search), a library’s full online catalog (every word in every book), or everything crawlable on the web (Google).
  • A computer is tolerant: it allows typos and spelling errors, ignores superfluous words like ‘the” and “is”, and understands multiple languages with complex grammars.
  • A search index is dynamic, updating its data with every change in the underlying content. And it does this as fast as possible. Some changes are expected to be in real-time (Twitter’s real time indexing, Uber, airline reservations), others in regular intervals, such as every 5 minutes, or a few times over the course of a day or week. 
  • An index provides relevance – a central topic in search technology. For our purposes here, relevance helps you discover content, connections, and other terms that you might not have thought about, and puts them in an order that makes sense to your current research.

Let’s just say that the metaphor of a book index gets you in the door to understanding what an index does, but details like the above (and there are many more), help you understand the full potential of what a search engine index can accomplish and how it has transformed our lives.

So, what is a search index?

A search index can be used in two different contexts:

  • Searching for content, like the text in books, blogs, newspaper articles, system logs, and any other “document” that contains a lot of text. The expectation here is that every word counts, because the searcher is looking for the exact words and phrases used in the content. 
  • Searching for objects, like products on Amazon, films on Netflix, services in a hospital – any object that relies on attributes to define it. The expectation here is that the object is titled and described accurately enough so that a searcher using a sensible set of keywords will find the object.

Now, you can also search for books by tagging them with subjects, themes, authors, etc., but if the underlying goal of the search is to find content, the expectation is that every word and sentence in the book is searchable.

Object-based search and attributes

A successful object-based search (as we’ve defined it here) relies on a set of attributes that describe objects sufficiently so that a searcher can find what they are looking for using a reasonably small set of well-chosen keywords. A keyword can be one or more words, or even the first few characters of the first word. For example, while looking for the film Star Wars, a user might only need to type in “star”; but if the search engine bases its search algorithm on popularity (that is, it favors popular films in the first results), then “st” should be sufficient enough to find the blockbuster Star Wars.  

The number of attributes

If you want to find a movie, you most likely need only a few attributes, such as title, description, cast, crew, year, and a few others. If you want to perform a more general research, you’ll add attributes like themes, dialogues, cross-references, and additional background information. However, the list of attributes can get quite large. For example, cars have 1000s of attributes – material used, the name, type, and year of each part, owner history, factories and repair history, speed, and so on. 

What all objects have in common is the notion of keywords. Keywords are the words the owners of the content use as they build an object’s attributes, such as title, brand, author, year, and price. Or from another point of view: keywords are the “words” that a search engine uses to match the words in an index with the query of the searcher. 

Creating an index

As we’ve outlined above, a search engine identifies documents (books, web pages, products) that match a user’s query (keywords). To do this, it cannot scan every document. So it uses an index, either an exhaustive index of every word, or an attribute-based index with a subset of the most important descriptions.

An index is created before a user searches. It is a pre-scan of the underlying content. It’s also in a separate part of the server. For example, in a content-based scenario, the search engine pre-scans every document and saves all the unique words in an index. Many search engines structure their index in an “inverted index”, as we describe in the last (fun) section.  

Different kinds of indexes – ordering results with accuracy and relevance

Search indexes come with an order. For online searches like Google and Amazon, search results are usually ordered on the “best” not “accurate” matches. 

In those contexts, it’s not only about accurate results. If a user types in “brad” and Brad Pitt comes up, that doesn’t mean it’s accurate. Other results will include Brad Davis or the Brady theater. They are all relevant in different ways, but none of them can be considered “accurate”. One user who types in “brad” might choose to go to Brad Pitt’s Wikipedia page, another might go to Brad Pitt’s IMDB page. Accuracy doesn’t really capture the meaning of these choices. 

It’s all about how right the result feels to a given user, or how the result matches the intent of the searcher. To return to the compass metaphor: a compass helps us navigate by combining accuracy and relevance: a compass gives us accuracy in terms of north and south; but it also gives us relevance by pointing in the general direction of our destination and helping us match our intentions with our knowledge of the physical world to reach our destination. On the other hand, we expect a GPS system to be accurate not relevant.

Consider the bank employee who looks up your records and finds out that you owe the bank some money. The bank employee’s search results better be completely accurate. Likewise, when store employees or customers look for precise products, they are not interested in relevance: they rely on an accurate, exact product identifiers. 

This is not to say that searching by relevance does not contain an aspect of “accuracy”. For example, if someone types in “ball point pen”, the accuracy is to find all products that have an attribute with the words “ball point pen”. However, accuracy gives way to relevance: the relevance is which ball point pen to show first. 

A more technical way to explain the difference is to consider the difference between a database and a search index. Database-like indexing (the bank example) is centered around accuracy – ensuring that exact matches are properly sorted and exhaustive. A search-index-based search like Google is more flexible, where the textual matching is a mix between textual accuracy and relevance (optimizing your content for what we call SEO (search engine optimization)).

Similar to Google, the site search we see on Amazon and Netflix, and on websites where search is provided by Algolia, rely on a combination of structured sets of attributes and a ranking system that bases relevance on popularity, trends, likes, and a business’s product-promotional needs.

The search index on the search engine’s server(s) 

Okay, so let’s open up the hood. A search engine index is saved in a structure that enables fast retrieval. We call this structure an inverted index. One thing to note, an index is saved separately on the server, in a different location than the data.

While there are many types of inverted indexes, with many nuances, the following diagram sums up the idea:

inverted index

As you can see, with an inverted index, the search engine inverses the logic. So, instead of reading (scanning) a document looking for words, it inverts that process and uses the words to find the documents. Here’s an example of an inverted index:

  • Aardvark -> documents 1, 5, 400, 900
  • Amoeba  -> documents 101, 345, 785
  • Animals -> documents 33, 45, 98, 235

… and so on. Let’s say there are 10,000s unique words in 999 documents.

In the above diagram, the search engine’s logic to search an inverted index followed this process to find “aardvark”: 

  • The index has a row with the 26 letters of the alphabet and the 10 digits. Underneath this row is an upside down tree of all the words. The first row, first column has every word that begins with “a”, the next column begins with “b”, and so on. 
  • When you type in “a”, you get every word that starts with “a”, and subsequently remove every other column.
  • This probably removes 90% of the words, so the engine has a lot less to surface.
  • You type in “aa”, which removes nearly all of the remaining words. You’re probably left with less than 1% of the original list.
  • You type in “aaa”, which removes all of the remaining words in the list: there is no document with a word that begins with “aaa”.
  • So you delete the “a” from your query and add “r” to get “aar” and only one word appears “aardvark”, which leads you to 4 documents that include information about aardvarks.  

That’s how every word in a set of documents is stored in an index. It gets more complicated for non-prefix, middle of the word queries, but you get the idea.

And that’s all … Well, there are a lot more details. If you’re interesting in more, check out Algolia CTO’s article on the inside story of indexing.

 

Recommended Articles

Powered by Algolia AI Recommendations

Get the AI search that shows users what they need