This is the first article in a three-part series of blog posts that describe the technical and data aspects of facets and faceted search. Here in part 1, we describe what facets are and how they structure your data, using JSON as our guide.

What is a facet? What is faceted search? And what kind of data best represents facets?

Quick definition: A facet is an onscreen filter that allows end-users to narrow down their search, giving them more control over their search results’ relevance. A typical facet for an e-commerce product is an attribute like “brand” or “price”, and a facet’s values are the individual brands and prices. By clicking on a facet value, users can include and exclude whole categories of products. For example, by selecting “Apple” in the “Brand” category, a user can exclude every product except Apple products.

This is what we call a faceted search experience, where categories and common terms drive the search just as much as the text in a search bar. Every great online search experience offers a faceted search.

The data behind faceted search 

In this article, we look at facets from the ground up, from the data to the UI. Specifically, we describe how to use JSON to create a great faceted search experience. A well-thought-out data strategy for faceting ensures effective filtering and high-speed performance. It also makes it easy for front-end developers to code, a critical consideration in a fast-paced coding environment.

Facets clean up your data 

Good search always starts with clean data. The best search index is well-structured, well-written, and includes nothing misleading or extraneous. An index is a searchable set of data, and indexing is the process that creates an index. These are the standard terms in search technology to describe how a search engine structures its data. Even though every engine structures its data differently, the end result is always an index. We use the words data and index, not dataset, database, or any other such term.  

Every search index includes facets because they organize data and ensure simplicity and completeness. Consider the following: the first example contains no facets; the next contains facets.

Without facets

Title: Star Wars: Episode V – The Empire Strikes Back

Description: Popular science fiction movie from the 70s, an American space opera, where Luke Skywalker, along with Han Solo, Princess Leia, and Chewbacca, fight Darth Vader and the Galactic Empire to save the Rebel Alliance. Created by George Lucas, the movie was produced by Lucasfilm and is now owned and distributed by Walt Disney Studios films. The ensemble cast includes Mark Hamill, Harrison Ford, Carrie Fisher, Billy Dee Williams, Anthony Daniels, David Prowse, Kenny Baker, Peter Mayhew, and Frank Oz. The film became the highest-grossing film of 1980 with $440 million.

With facets

Title: Star Wars: Episode V – The Empire Strikes Back

Description: An American space opera. Luke Skywalker, along with Han Solo, Princess Leia, and Chewbacca, fight Darth Vader and the Galactic Empire to save the Rebel Alliance.

Genre: science fiction, space opera

Era: 70s, 80s

Story: George Lucas

Saga: Star Wars

Studio: Lucasfilm, Walt Disney Studios

Production facts: The film is produced by Lucasfilm, now owned by Walt Disney Studios films.

Actors: Mark Hamill, Harrison Ford, Carrie Fisher, Billy Dee Williams, Anthony Daniels, David Prowse, Kenny Baker, Peter Mayhew, Frank Oz.

Characters: Luke Skywalker, Han Solo, Princess Leia, Chewbacca, Darth Vader, Yoda, C-3PO, R2-D2

As you can see, we’ve shortened “description” considerably, getting instantly to the point. We’ve moved its content into several different attributes, like “production facts”, “genre”, “actors”, and “saga”. Some of these attributes will be displayed in the search results. However, most of “description” breaks down into useful chunks of information for searching and faceting. This analytical process is a crucial step in creating a superior searchable index and faceted search experience for the end user. 

Breaking up your data in this way, into small bits of bite-size info, ensures that every attribute counts. This is one benefit of using facets.

The basic principle is: large amounts of text in one attribute are not as good for search as smaller chunks of text in multiple attributes, many of which can be used as facets. With these facet-chunks, users can filter their results. Some examples: 

  • Include all films produced and written by George Lucas.
  • Exclude all films except science fiction films, sub-genre space opera.
  • See all Star Wars films, thanks to a “Saga” facet.
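As a rough illustration, the three filters above can be sketched against a small in-memory array. The sample records (apart from the Star Wars entry), the attribute names, and the `hasFacetValue` helper are assumptions made for this example; they are not part of any particular search engine's API.

```javascript
// Made-up sample records for illustration; only the first one comes from the article.
const films = [
  {
    title: "Star Wars: Episode V - The Empire Strikes Back",
    story: ["George Lucas"],
    genre: ["science fiction", "space opera"],
    saga: "Star Wars",
  },
  {
    title: "Example Space Film",
    story: ["Jane Doe"],
    genre: ["science fiction", "space opera"],
    saga: "Example Saga",
  },
  { title: "Example Comedy", story: ["George Lucas"], genre: ["comedy"] },
];

// True when the record's facet holds the value, whether stored as a string or an array.
function hasFacetValue(record, facet, value) {
  const stored = record[facet];
  if (stored === undefined) return false;
  return Array.isArray(stored) ? stored.includes(value) : stored === value;
}

// Include all films written by George Lucas.
const byLucas = films.filter((f) => hasFacetValue(f, "story", "George Lucas"));

// Exclude everything except science fiction films in the space opera sub-genre.
const spaceOperas = films.filter(
  (f) =>
    hasFacetValue(f, "genre", "science fiction") &&
    hasFacetValue(f, "genre", "space opera")
);

// See all Star Wars films through the "saga" facet.
const starWars = films.filter((f) => hasFacetValue(f, "saga", "Star Wars"));

console.log(byLucas.map((f) => f.title));
console.log(spaceOperas.map((f) => f.title));
console.log(starWars.map((f) => f.title));
```

A real engine would run these filters against its index and not against an array, but the principle is the same: each filter is a simple test on one small attribute.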

Most companies already have structured data, so it’s just a matter of transferring that structure when building the searchable index. On the other hand, some companies have multiple data sources, so they have to design new structures while merging these data sources. 

Faceted search — facets are not only for filtering

Setting up your data with bite-size pieces of information not only enables your search engine to filter records. It also focuses attention on the information, making unfiltered searches more precise. This is only possible if you define your facet attributes as searchable, meaning you tell your search engine to look into facet attributes such as “year” and “type” before looking into “title” and “description”. Now, a user can type in “70s sci-fi movies” and find Star Wars without using a filter.
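Here is a minimal sketch of that idea, assuming a hypothetical priority list of searchable attributes. The attribute names “era” and “genre” stand in for the facets mentioned above, the “sci-fi” genre value is added for the example, and `matchQuery` is an illustrative helper, not an engine API.

```javascript
// Hypothetical priority list: facet attributes are searched before free text.
const searchableAttributes = ["era", "genre", "title", "description"];

const record = {
  title: "Star Wars: Episode V - The Empire Strikes Back",
  description: "An American space opera.",
  genre: ["science fiction", "sci-fi", "space opera"], // "sci-fi" added for this example
  era: ["70s", "80s"],
};

// For each query word, returns the first attribute (in priority order) that contains it.
function matchQuery(rec, query, attributes) {
  const words = query.toLowerCase().split(/\s+/).filter(Boolean);
  const matches = {};
  for (const word of words) {
    for (const attr of attributes) {
      // Normalize single values and arrays to an array of values.
      const values = [].concat(rec[attr] ?? []);
      if (values.some((v) => String(v).toLowerCase().includes(word))) {
        matches[word] = attr;
        break; // stop at the highest-priority attribute that matches
      }
    }
  }
  return matches;
}

const matched = matchQuery(record, "70s sci-fi movies", searchableAttributes);
console.log(matched);
```

In this sketch, “70s” is found in the era facet and “sci-fi” in the genre facet, so the record matches on its facets alone. The word “movies” matches nothing here; handling plurals and synonyms is left to the engine.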

Let’s get technical — engine-level faceting & JSON

There’s a lot to consider technically when building facets into your index:

  • Representing facets as input into your index.
  • Representing facets as input into a search request.
  • Receiving facets in response to a search request.
  • Combining facets with AND, OR, NOT.
  • How the engine represents facets.

In this article, we lay down the foundations of facet filtering and faceted search by discussing only the first item: how to break down and represent your data as facets. Though we touch on the other aspects, we’ll save more detailed discussions for future articles.
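To give a feel for two of the other items, the sketch below combines facet filters with AND, OR, and NOT, and counts facet values the way a search response might report them. The records and the helper names (`is`, `and`, `or`, `not`, `countFacetValues`) are assumptions for this example only.

```javascript
// Hypothetical records with different sets of attributes.
const records = [
  { objectID: 1, type: "movie", genre: ["science fiction", "space opera"], series: "star wars" },
  { objectID: 2, type: "toy", brand: "Lego", series: "star wars" },
  { objectID: 3, type: "movie", genre: ["comedy"] },
];

// Small predicate builders; each returns a function that tests one record.
const is = (facet, value) => (rec) => [].concat(rec[facet] ?? []).includes(value);
const and = (...preds) => (rec) => preds.every((p) => p(rec));
const or = (...preds) => (rec) => preds.some((p) => p(rec));
const not = (pred) => (rec) => !pred(rec);

// series is "star wars" AND (type is "movie" OR type is "toy") AND NOT brand is "Lego"
const filter = and(
  is("series", "star wars"),
  or(is("type", "movie"), is("type", "toy")),
  not(is("brand", "Lego"))
);
const hits = records.filter(filter);

// Counts how many records carry each value of a facet.
function countFacetValues(recs, facet) {
  const counts = {};
  for (const rec of recs) {
    for (const value of [].concat(rec[facet] ?? [])) {
      counts[value] = (counts[value] || 0) + 1;
    }
  }
  return counts;
}

const typeCounts = countFacetValues(records, "type");
console.log(hits.map((r) => r.objectID), typeCounts);
```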

Representing facets as input into your search index

The best facet data format allows for:

  • A totally customizable set of facets, as determined by your application/business needs.
  • An easily understood data structure. 
  • Flexibility, so that every record in the index can have a different set of attributes.
  • A scalable structure that allows for multiple facet values and category hierarchies.

In a word, JSON.
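A small sketch of what this flexibility can look like: two records with different attributes, a facet holding multiple values, and a category hierarchy stored as one value per level. The “category” attribute, the “ > ” separator, and the `hierarchyLevels` helper are assumptions for this example; engines differ in how they expect hierarchies to be represented.

```javascript
// Expands a category path into one value per level, so each level can be faceted on.
function hierarchyLevels(path, separator = " > ") {
  const parts = path.split(separator);
  return parts.map((_, i) => parts.slice(0, i + 1).join(separator));
}

const movie = {
  objectID: 42,
  title: "Star Wars: Episode V - The Empire Strikes Back",
  genre: ["science fiction", "space opera"], // a facet with multiple values
  category: hierarchyLevels("Movies > Science fiction > Space opera"),
};

const toy = {
  objectID: 43,
  title: "LEGO: Star Wars Yoda",
  brand: "Lego", // the toy has a brand but no genre
  category: hierarchyLevels("Toys > Building sets"),
};

console.log(JSON.stringify([movie, toy], null, 2));
```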

JSON vs. relational databases

Here’s a very, very quick history: Relational databases established a standard with foreign keys, atomic data, and other such improvements over previous data technologies. Relational databases allowed people to normalize their data by distributing every item’s characteristics in a database over a system of small, clearly defined linked tables. It created homogeneity, with a reliable and easily understood consistent structure. However, this homogeneity did not always meet the needs of every application with different data needs. Thus, the solution was to create multiple databases with different structures, sometimes with the same underlying information — a difficult-to-maintain solution that required extra resources. 

The Entity-Attribute-Value (EAV) model added flexibility to the relational model: now, every item could be defined differently from each other in the same database. EAV was all about heterogeneity — allowing a single database to contain a wide variety of information in an efficient manner, where every item could have its own unique and complete set of descriptive tags called key/values. EAV kept to the principle of no repetition of data by using multiple schemas and often localizing its unique values in a single table. 

JSON both encapsulates the EAV principle and breaks away from EAV’s relational roots, most notably by doing away with the schema of tables and rows and by relaxing the principle of no repetition of data. JSON repeats and repeats information, which is why it’s so easy to use and so powerful and flexible. There are no more constraints: every record lives in isolation, containing its own unique set of attributes. There is also no more predefined schema that an entity must follow. Now, there are only JSON objects (or records), and keys, and values.
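To make the relationship between EAV and JSON concrete, here is a minimal sketch that folds EAV-style rows into self-contained JSON records. The row layout (`entity`, `attribute`, `value`) and the `eavToRecords` helper are assumptions for this example.

```javascript
// Hypothetical EAV rows: one row per (entity, attribute, value) triple.
const eavRows = [
  { entity: 42, attribute: "title", value: "Star Wars: Episode V - The Empire Strikes Back" },
  { entity: 42, attribute: "genre", value: "science fiction" },
  { entity: 42, attribute: "genre", value: "space opera" },
  { entity: 43, attribute: "title", value: "LEGO: Star Wars Yoda" },
  { entity: 43, attribute: "brand", value: "Lego" },
];

// Groups rows by entity; an attribute that appears more than once becomes an array.
function eavToRecords(rows) {
  const byEntity = new Map();
  for (const { entity, attribute, value } of rows) {
    if (!byEntity.has(entity)) byEntity.set(entity, { objectID: entity });
    const rec = byEntity.get(entity);
    if (!(attribute in rec)) {
      rec[attribute] = value;
    } else {
      rec[attribute] = [].concat(rec[attribute], value);
    }
  }
  return [...byEntity.values()];
}

const jsonRecords = eavToRecords(eavRows);
console.log(JSON.stringify(jsonRecords, null, 2));
```

Each resulting record carries only the keys it needs, with no table schema shared between them.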

Why is JSON so good for building a searchable index? 

A search engine needs to treat every record in its index atomically, not relying on relationships with other records. It needs to de-normalize its data, not normalize it like a relational database. 
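As an illustration of de-normalizing, the sketch below joins a few normalized tables into one self-contained record per movie. The table layouts and the `denormalize` helper are hypothetical and exist only for this example.

```javascript
// Hypothetical normalized tables, linked by IDs as in a relational database.
const movies = [
  { id: 42, title: "Star Wars: Episode V - The Empire Strikes Back", studioId: 1 },
];
const studios = [{ id: 1, name: "Lucasfilm" }];
const actors = [
  { id: 7, name: "Mark Hamill" },
  { id: 8, name: "Harrison Ford" },
];
const movieActors = [
  { movieId: 42, actorId: 7 },
  { movieId: 42, actorId: 8 },
];

// Resolves every foreign key so the record no longer depends on other tables.
function denormalize(movie) {
  const studio = studios.find((s) => s.id === movie.studioId);
  const cast = movieActors
    .filter((link) => link.movieId === movie.id)
    .map((link) => actors.find((a) => a.id === link.actorId).name);
  return {
    objectID: movie.id,
    title: movie.title,
    studio: studio ? studio.name : undefined,
    actors: cast,
  };
}

const searchRecords = movies.map(denormalize);
console.log(JSON.stringify(searchRecords[0], null, 2));
```

The names are now repeated inside every record that uses them, which is exactly the repetition a relational design avoids and a search index accepts.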

Additionally, the engine needs to be agnostic about what a record contains and therefore be capable of searching into any index, whether commercial products, movies, or professional services. It only needs to return any record that matches the search query. Most search engines do not rely on semantics; instead, they rely on textual matching, which means that a search engine only needs to match the content in an item’s attributes with the text in the search bar. The match can be full or partial. JSON, with its readability, focuses the data exclusively on its textual content.
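A deliberately naive sketch of such content-agnostic textual matching follows. It looks at every attribute of every record and reports whether the query matches a value fully or partially. The `matchKind` and `search` helpers are assumptions for illustration; real engines use far more efficient structures than a loop over all records.

```javascript
// Sample records with different attributes, as an engine might receive them.
const records = [
  {
    objectID: 42,
    title: "Star Wars: Episode V - The Empire Strikes Back",
    type: "movie",
    genre: ["science fiction", "space opera"],
    series: "star wars",
  },
  { objectID: 43, title: "LEGO: Star Wars Yoda", type: "toy", brand: "Lego", series: "star wars" },
];

// Returns "full" for an exact match, "partial" for a substring match, otherwise null.
function matchKind(value, query) {
  const v = String(value).toLowerCase();
  const q = query.toLowerCase();
  if (v === q) return "full";
  if (v.includes(q)) return "partial";
  return null;
}

// Checks every attribute of every record without knowing what the record describes.
function search(recs, query) {
  const hits = [];
  for (const rec of recs) {
    for (const [key, raw] of Object.entries(rec)) {
      if (key === "objectID") continue;
      for (const value of [].concat(raw)) {
        const kind = matchKind(value, query);
        if (kind) hits.push({ objectID: rec.objectID, attribute: key, kind });
      }
    }
  }
  return hits;
}

const hits = search(records, "star wars");
console.log(hits);
```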

JSON key/values are easy to read and easy to search.

[
  {
    "objectID": 42,
    "title": "Star Wars: Episode V - The Empire Strikes Back",
    "type": "movie",
    "genre": ["science fiction", "space opera"],
    "series": "star wars"
  },
  {
    "objectID": 43,
    "title": "LEGO: Star Wars Yoda",
    "type": "toy",
    "brand": "Lego",
    "series": "star wars"
  }
]

It’s pretty clear what’s going on here. We have two records. One is a movie, the other is a toy. They have different attributes: the movie has a “genre”, the toy has a “brand”. However, they both have “series”, which can be used optionally to link the two records on “Star Wars”.

Some search engines use JSON formats directly as the basis of their search, with no conversion required. Most, however, convert this data into the format of their own proprietary index. In either case, the process is the same: loop through every record and key, then save (or search) the values. The best algorithms expect nothing but keys and values, though most engines ask for a minimum of required attributes. One example is an “objectID”: a unique identifier for every record, used to find and update records while indexing. But don’t be misled. The “objectID” doesn’t function like a primary key, as in a relational database. In search, there is little use for the primary/foreign key relationship. Interestingly, instead of keys, search relies on facets to link records. You can see that in our above example: toys and movies were linked by the “series” facet.
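The loop described above can be sketched as a tiny inverted index that maps each key/value pair to the objectIDs of the records containing it. This is an illustrative structure under simplifying assumptions, not how any specific engine stores its index; the `buildIndex` helper and the "key:value" entry format are made up for this example.

```javascript
const records = [
  {
    objectID: 42,
    title: "Star Wars: Episode V - The Empire Strikes Back",
    type: "movie",
    genre: ["science fiction", "space opera"],
    series: "star wars",
  },
  { objectID: 43, title: "LEGO: Star Wars Yoda", type: "toy", brand: "Lego", series: "star wars" },
];

// Loops through every record and key, and saves each value with the record's objectID.
function buildIndex(recs) {
  const index = new Map(); // "key:value" -> Set of objectIDs
  for (const rec of recs) {
    for (const [key, raw] of Object.entries(rec)) {
      if (key === "objectID") continue; // the identifier itself is not indexed
      for (const value of [].concat(raw)) {
        const entry = `${key}:${String(value).toLowerCase()}`;
        if (!index.has(entry)) index.set(entry, new Set());
        index.get(entry).add(rec.objectID);
      }
    }
  }
  return index;
}

const index = buildIndex(records);

// The "series" facet links the movie and the toy without any foreign key.
const linked = [...index.get("series:star wars")];
console.log(linked);
```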

REST APIs and the joys of generating and visualizing JSON

Like JSON, REST APIs simplified an existing standard by offering an alternative to SOAP APIs. They did this for similar reasons: to add more flexibility with less formality. SOAP is based on a protocol with strict specifications and rules. REST does away with that. A REST API provides an endpoint and accepts different data formats, such as CSV, XML, and JSON. It is up to the API’s developer to tell users what to send in the formatted data. The best APIs don’t require too much specific data. They focus on the information needed to perform their job.

For example, here’s how to search:

curl -X POST \
     -H "API-Key: 123" \
     -H "Application-Id: app_movies" \
     --data-binary '{ "params": "query=star wars" }' \
     "https://some-movie-api.com/indexes/movies/query"

The “--data-binary” option carries the whole message: execute a search using these parameters. In this case, no surprise, it is a search for “star wars”.
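For comparison, the same request can be assembled in JavaScript before handing it to an HTTP client. This is only a sketch: the URL, header names, and credentials are the placeholders from the curl command above, `buildSearchRequest` is a hypothetical helper, and URL-encoding the query inside “params” is an assumption of this example.

```javascript
// Builds the request that the curl command sends, without sending it.
function buildSearchRequest(query) {
  return {
    url: "https://some-movie-api.com/indexes/movies/query", // placeholder endpoint
    options: {
      method: "POST",
      headers: { "API-Key": "123", "Application-Id": "app_movies" },
      // Assumption: the query is URL-encoded inside the "params" string.
      body: JSON.stringify({ params: `query=${encodeURIComponent(query)}` }),
    },
  };
}

const request = buildSearchRequest("star wars");
console.log(request.url);
console.log(request.options.body);

// To send it (not executed here, since the endpoint is a placeholder):
// fetch(request.url, request.options).then((res) => res.json());
```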

From Curl to API Clients

Here’s how to run the same search and then create a record in the “movies” index using an API client. The best search engines wrap curl’s tedious syntax into API clients for different programming languages. For example, the following code uses a JavaScript API client:

// `index` is the index object provided by the API client.
index.search("star wars");
index.save({
  "objectID": 42,
  "title": "Star Wars: Episode V - The Empire Strikes Back",
  "description": "An American space opera. Luke Skywalker, along with Han Solo, Princess Leia, and Chewbacca, fight Darth Vader and the Galactic Empire to save the Rebel Alliance.",
  "genre": ["science fiction", "space opera"],
  "era": ["70s", "80s"],
  "story": "George Lucas",
  "saga": "Star Wars",
  "studio": ["Lucasfilm", "Walt Disney Studios"],
  "production facts": "The film is produced by Lucasfilm, now owned by Walt Disney Studios films.",
  "actors": ["Mark Hamill", "Harrison Ford", "Carrie Fisher", "Billy Dee Williams", "Anthony Daniels", "David Prowse", "Kenny Baker", "Peter Mayhew", "Frank Oz"],
  "characters": ["Luke Skywalker", "Han Solo", "Princess Leia", "Chewbacca", "Darth Vader", "Yoda", "C-3PO", "R2-D2"]
});

Whichever format you use, curl or JavaScript, you can see how easy it is to encapsulate the full “Star Wars: Episode V – The Empire Strikes Back” example from the start of this article and to send it in a format that both humans and machines can read.

Parting words and a few more details on faceted search

One key takeaway is that facets are not optional. They play a role in every aspect of the search experience. Functionally, facets are used for searching data, filtering records, and ordering results. Technically, they organize your data and ensure its quality and precision, central to creating a quality search experience. 

The other key takeaway is that how you structure your facets in your index determines their functional effectiveness. Modern search needs the flexibility to adapt to the fast-paced changes of every online business. It also needs the diversity to manage a large and constantly evolving variety of user tastes and needs. JSON enables this kind of flexibility and diversity. Developers need a schemaless data structure like JSON to represent the multiplicity of facets and filtering in the best way possible. Nested attributes are one example. We’ll see that in Part 2 of this series on facets & faceted search.

 
