Inside the Engine: Better relevance via dedup


One of the most distinctive and most-used features of Algolia is the Distinct feature: it enables developers to deduplicate records on the fly at query time, based on one specific attribute. We introduced this feature three years ago, opening up a broad range of new use cases such as the deduplication of product variants, a must-have for any eCommerce search. It has also considerably changed the way we recommend handling big documents like PDFs or web pages (we recommend splitting each document into several records).

This post highlights the advantages of this feature and the implementation challenges it presents.

Document search: state of the art and limitations

Searching inside large documents is probably the oldest challenge of information retrieval. It was inspired by the index at the end of books and gave birth to an entire sub-genre of statistics. Among those methods, tf-idf and BM25 are the two most popular. They are very effective for ranking documents, but they don’t handle false positives well: they only push them to the bottom of the results.

To illustrate this problem, you can perform the “states programming” query on Google and Wikipedia. According to Google Trends, this query is as popular as the Rust programming language and seems to come from developers searching for how to develop an algorithm with a state. The same query on Wikipedia is unfortunately not very relevant! The first issue probably comes from a stemming algorithm or a synonym that treats “program” as the same as “programming.” In Algolia, you can expand singular and plural forms without polluting results with stemming by using the ignorePlurals feature, which is based on a linguistic lemmatizer.

That said, even if you scroll through the first 1000 hits of Wikipedia search, you won’t find the article which appears first in Google. There are hundreds of articles that contain both the words “states” and “programming.” Even the “United States” page contains both terms, and one of them is in the title! In this case, tf-idf and BM25 are not the most useful ranking criteria. The position of the two query terms in the document is more important, in addition to their relative distance (finding them close together is always better).
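To make the idea of relative distance concrete, here is a minimal Python sketch that measures how far apart two query terms are inside a text. It is only an illustration: the tokenizer, the function names and the two sample texts are assumptions made for this example, and this is not the proximity criterion used by the Algolia engine.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def min_distance(text, term_a, term_b):
    """Return the smallest gap (in words) between any occurrence of term_a
    and any occurrence of term_b, or None if one of the terms is missing."""
    tokens = tokenize(text)
    positions_a = [i for i, token in enumerate(tokens) if token == term_a]
    positions_b = [i for i, token in enumerate(tokens) if token == term_b]
    if not positions_a or not positions_b:
        return None
    return min(abs(a - b) for a in positions_a for b in positions_b)


# Hypothetical texts: both contain the two terms, but at very different distances.
close = "State machines are a classic way of programming with states in embedded software"
far = ("The United States is a federal republic. "
       + "filler " * 50
       + "Computer programming is taught in many schools.")

print(min_distance(close, "states", "programming"))  # 2
print(min_distance(far, "states", "programming"))    # 56
```

A purely frequency-based score sees both texts as matches for “states programming”; the distance makes it obvious that only the first one is really about the query.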

Why do we split large documents

One of the best ways to avoid such relevance problems is to split big pages into several records – for example, you could create one record per section of the Wikipedia page. If a Wikipedia article were to have 4 main sections, we could create 4 different records, with the same article title, but different body content.

For example, instead of the following big record with the four sections embedded:

	{
	"title": "This is the title",
	"summary": "article summary",
	"sections": [
	{
	"title": "first section",
	"text": "..."
	},
	{
	"title": "second section",
	"text": "..."
	},
	{
	"title": "third section",
	"text": "..."
	},
	{
	"title": "fourth section",
	"text": "..."
	}
	]
	}


We could have 5 different records: one main record and one per section. All those records share the same source attribute, which tells the engine that they are all components of the same article:

	[
	{
	"title": "This is the title",
	"summary": "article summary",
	"source": "articleName"
	},
	{
	"sectionTitle": "first section",
	"text": "...",
	"source": "articleName"
	},
	{
	"sectionTitle": "second section",
	"text": "...",
	"source": "articleName"
	},
	{
	"sectionTitle": "third section",
	"text": "...",
	"source": "articleName"
	},
	{
	"sectionTitle": "fourth section",
	"text": "...",
	"source": "articleName"
	}
	]

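The transformation from the first format to the second can be sketched in a few lines of Python. The function name split_article is hypothetical, and the field names simply mirror the two JSON examples above; adapt them to your own data.

```python
def split_article(article, source):
    """Turn one article with embedded sections into a main record plus
    one record per section, all sharing the same `source` value."""
    records = [{
        "title": article["title"],
        "summary": article["summary"],
        "source": source,
    }]
    for section in article.get("sections", []):
        records.append({
            "sectionTitle": section["title"],
            "text": section["text"],
            "source": source,
        })
    return records


article = {
    "title": "This is the title",
    "summary": "article summary",
    "sections": [
        {"title": "first section", "text": "..."},
        {"title": "second section", "text": "..."},
        {"title": "third section", "text": "..."},
        {"title": "fourth section", "text": "..."},
    ],
}

records = split_article(article, source="articleName")
print(len(records))  # 5
```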

We can use this same approach for technical documentation, which often consists of just a few long pages in order to improve the developer experience: scrolling is better than loading new pages! This is exactly why we have decided to split pages into several records in our DocSearch project; you can learn more about our approach in this blog post.

It might sound counter-intuitive, but the problem is even more visible when your data set is smaller! While the problem only shows up on well-selected queries in the Wikipedia use case, it becomes very apparent when you search inside a website with less content. You have a high probability of getting false positives in your search results, which will frustrate your users.

The need for deduplication

Splitting a page into several records is usually easy as you can use the formatting as a way to split. The problem with several records per document is that you will introduce duplicates in your search results. For example, all paragraphs of the United States Wikipedia article will match for the “United States” query. In this case, you would like to keep only the best match and avoid duplicates in the search results.

This is where the distinct feature comes into play! You just have to introduce one attribute with the same value for all records of the same source (typically an ID of the document) and declare it as the attributeForDistinct in your index settings. At query time, only the best match for each value of the attribute for distinct is kept; all the other records are removed to avoid any duplicates.
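Here is a minimal Python sketch of what this deduplication means. It only illustrates the semantics and is not the engine's implementation: it assumes the hits are already sorted from best to worst match, and the function name and sample hits are made up for the example.

```python
def deduplicate(ranked_hits, attribute_for_distinct):
    """Keep only the first (best-ranked) hit for each distinct value.

    `ranked_hits` must already be sorted from best to worst match.
    """
    seen = set()
    kept = []
    for hit in ranked_hits:
        key = hit[attribute_for_distinct]
        if key in seen:
            continue  # a better record with the same key was already kept
        seen.add(key)
        kept.append(hit)
    return kept


# Hypothetical ranked hits for the "United States" query.
ranked_hits = [
    {"title": "United States", "sourceArticle": "United States"},
    {"title": "Etymology", "sourceArticle": "United States"},
    {"title": "History", "sourceArticle": "United States"},
    {"title": "United States dollar", "sourceArticle": "United States dollar"},
]

for hit in deduplicate(ranked_hits, "sourceArticle"):
    print(hit["title"])
# United States
# United States dollar
```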

If we split the United States Wikipedia article by section and subsections, it would generate 39 records. You can find the first four records on this gist:


	[{
	"title": "United States",
	"synonyms": ["United States of America", "America", "US", "U.S.", "USA", "U.S.A.", "the landmass encompassing North America and South America America", "Americas"],
	"text": "The United States of America (USA), commonly referred to as the United States (U.S.) or America, is a federal republic composed of 50 states, a federal district, five major self-governing territories, and various possessions. Forty-eight of the fifty states and the federal district are contiguous and located in North America between Canada and Mexico. The state of Alaska is in the far northwestern corner of North America, with a land border to the east with Canada and separated by the Bering Strait from Russia. The state of Hawaii is an archipelago in the mid-Pacific. The territories are scattered about the Pacific Ocean and the Caribbean Sea. Nine time zones are covered. The geography, climate and wildlife of the country are extremely diverse.\nAt 3.8million square miles (9.8million km2) and with over 324 million people, the United States is the world's fourth-largest country by total area (and fourth-largest by land area) and the third-most populous. It is one of the world's most ethnically diverse and multicultural nations, and is home to the world's largest immigrant population. Urbanization climbed to over 80% in 2010 and leads to growing megaregions. The country's capital is Washington, D.C. and its largest city is New York City; the other major metropolitan areas, all with around five million or more inhabitants, are Los Angeles, Chicago, San Francisco, Boston, Dallas, Philadelphia, Houston, Miami, and Atlanta.\nPaleo-Indians migrated from Asia to the North American mainland at least 15,000 years ago. European colonization began in the 16th century. The United States emerged from 13 British colonies along the East Coast. Numerous disputes between Great Britain and the colonies in the aftermath of the Seven Years' War led to the American Revolution, which began in 1775. On July 4, 1776, as the colonies were fighting Great Britain in the American Revolutionary War, delegates from the 13 colonies unanimously adopted the Declaration of Independence. 
The war ended in 1783 with recognition of the independence of the United States by Great Britain, and was the first successful war of independence against a European colonial empire. The current constitution was adopted in 1788, after the Articles of Confederation, adopted in 1781, were felt to have provided inadequate federal powers. The first ten amendments, collectively named the Bill of Rights, were ratified in 1791 and designed to guarantee many fundamental civil liberties.\nThe United States embarked on a vigorous expansion across North America throughout the 19th century, displacing American Indian tribes, acquiring new territories, and gradually admitting new states until it spanned the continent by 1848. During the second half of the 19th century, the American Civil War led to the end of legal slavery in the country. By the end of that century, the United States extended into the Pacific Ocean, and its economy, driven in large part by the Industrial Revolution, began to soar. The Spanish–American War and World War I confirmed the country's status as a global military power. The United States emerged from World War II as a global superpower, the first country to develop nuclear weapons, the only country to use them in warfare, and a permanent member of the United Nations Security Council. It is a founding member of the Organization of American States (UAS) and various other Pan-American and international organizations. The end of the Cold War and the dissolution of the Soviet Union in 1991 left the United States as the world's sole superpower.\nThe United States is a highly developed country, with the world's largest economy by nominal GDP. It ranks highly in several measures of socioeconomic performance, including average wage, human development, per capita GDP, and productivity per person. While the U.S. 
economy is considered post-industrial, characterized by the dominance of services and knowledge economy, the manufacturing sector remains the second-largest in the world. Though its population is only 4.4% of the world total, the United States accounts for nearly a quarter of world GDP and almost a third of global military spending, making it the world's foremost military and economic power. The United States is a prominent political and cultural force internationally, and a leader in scientific research and technological innovations.",
	"recordType": "main",
	"recordTypeScore": 3,
	"popularity": 1812,
	"sourceArticle": "United States",
	"anchor": ""
	},
	{
	"title":"Etymology",
	"text": "See also: Naming of America, Names for United States citizens, American (word), and Names of the United States In 1507 the German cartographer Martin Waldseemüller produced a world map on which he named the lands of the Western Hemisphere \"America\" after the Italian explorer and cartographer Amerigo Vespucci (Latin: Americus Vespucius). The first documentary evidence of the phrase \"United States of America\" is from a letter dated January 2, 1776, written by Stephen Moylan, Esq., George Washington's aide-de-camp and Muster-Master General of the Continental Army. Addressed to Lt. Col. Joseph Reed, Moylan expressed his wish to carry the \"full and ample powers of the United States of America\" to Spain to assist in the revolutionary war effort. The first known publication of the phrase \"United States of America\" was in an anonymous essay in The Virginia Gazette newspaper in Williamsburg, Virginia, on April 6, 1776. The second draft of the Articles of Confederation, prepared by John Dickinson and completed by June 17, 1776, at the latest, declared \"The name of this Confederation shall be the 'United States of America.'\" The final version of the Articles sent to the states for ratification in late 1777 contains the sentence \"The Stile of this Confederacy shall be 'The United States of America'\". In June 1776, Thomas Jefferson wrote the phrase \"UNITED STATES OF AMERICA\" in all capitalized letters in the headline of his \"original Rough draught\" of the Declaration of Independence. This draft of the document did not surface until June 21, 1776, and it is unclear whether it was written before or after Dickinson used the term in his June 17 draft of the Articles of Confederation. In the final Fourth of July version of the Declaration, the title was changed to read, \"The unanimous Declaration of the thirteen united States of America\". 
The preamble of the Constitution states \"...establish this Constitution for the United States of America.\" The short form \"United States\" is also standard. Other common forms are the \"U.S.\", the \"USA\", and \"America\". Colloquial names are the \"U.S. of A.\" and, internationally, the \"States\". \"Columbia\", a name popular in poetry and songs of the late 18th century, derives its origin from Christopher Columbus; it appears in the name \"District of Columbia\". In non-English languages, the name is frequently the translation of either the \"United States\" or \"United States of America\", and colloquially as \"America\". In addition, an abbreviation (e.g. USA) is sometimes used. The phrase \"United States\" was originally plural, a description of a collection of independent states—e.g., \"the United States are\"—including in the Thirteenth Amendment to the United States Constitution, ratified in 1865. The singular form—e.g., \"the United States is\"— became popular after the end of the American Civil War. The singular form is now standard; the plural form is retained in the idiom \"these United States\". The difference is more significant than usage; it is a difference between a collection of states and a unit. A citizen of the United States is an \"American\". \"United States\", \"American\" and \"U.S.\" refer to the country adjectivally (\"American values\", \"U.S.forces\"). \"American\" rarely refers to subjects not connected with the United States.",
	"recordType": "section",
	"recordTypeScore": 2,
	"popularity": 1812,
	"sourceArticle": "United States",
	"anchor": "Etymology"
	},
	{
	"title":["History", "Indigenous and European contact"],
	"text": "Further information: Pre-Columbian era and Colonial history of the United States An artistic recreation of The Kincaid Site from the prehistoric Mississippian culture as it may have looked at its peak 1050-1400 AD Italian explorer Christoper Columbus arrives in America and takes possession of Guanahani The first inhabitants of North America migrated from Siberia by way of the Bering land bridge and arrived at least 15,000 years ago, though increasing evidence suggests an even earlier arrival. Some, such as the pre-Columbian Mississippian culture, developed advanced agriculture, grand architecture, and state-level societies. After the Spanish conquistadors made the first contacts, the native population declined for various reasons, primarily from diseases such as smallpox and measles. Violence was not a significant factor in the overall decline among Native Americans, though conflict among themselves and with Europeans affected specific tribes and various colonial settlements. In the Hawaiian Islands, the earliest indigenous inhabitants arrived around 1 AD from Polynesia. Europeans under the British explorer Captain James Cook arrived in the Hawaiian Islands in 1778. In the early days of colonization, many European settlers were subject to food shortages, disease, and attacks from Native Americans. Native Americans were also often at war with neighboring tribes and allied with Europeans in their colonial wars. At the same time, however, many natives and settlers came to depend on each other. Settlers traded for food and animal pelts, natives for guns, ammunition and other European wares. Natives taught many settlers where, when and how to cultivate corn, beans and squash. European missionaries and others felt it was important to \"civilize\" the Native Americans and urged them to adopt European agricultural techniques and lifestyles.",
	"recordType": "subsection",
	"recordTypeScore": 1,
	"popularity": 1812,
	"sourceArticle": "United States",
	"anchor": "Indigenous_and_European_contact"
	},
	{
	"title":["History", "Settlements"],
	"text":"Further information: European colonization of the Americas and Thirteen Colonies Globe showing North America from 1602. Castillo de San Marcos in St. Augustine, Florida, the oldest continuously occupied European-established settlement in the United States The signing of the Mayflower Compact, 1620 After Spain sent Columbus on his first voyage to the New World in 1492, other explorers followed. The Spanish set up small settlements in New Mexico and Florida. France had several small settlements along the Mississippi River. Successful English settlement on the eastern coast of North America began with the Virginia Colony in 1607 at Jamestown and the Pilgrims' Plymouth Colony in 1620. Early experiments in communal living failed until the introduction of private farm holdings. Many settlers were dissenting Christian groups who came seeking religious freedom. The continent's first elected legislative assembly, Virginia's House of Burgesses created in 1619, and the Mayflower Compact, signed by the Pilgrims before disembarking, established precedents for the pattern of representative self-government and constitutionalism that would develop throughout the American colonies. Most settlers in every colony were small farmers, but other industries developed within a few decades as varied as the settlements. Cash crops included tobacco, rice and wheat. Extraction industries grew up in furs, fishing and lumber. Manufacturers produced rum and ships, and by the late colonial period Americans were producing one-seventh of the world's iron supply. Cities eventually dotted the coast to support local economies and serve as trade hubs. English colonists were supplemented by waves of Scotch-Irish and other groups. As coastal land grew more expensive freed indentured servants pushed further west. 
Slave cultivation of cash crops began with the Spanish in the 1500s, and was adopted by the English, but life expectancy was much higher in North America because of less disease and better food and treatment, leading to a rapid increase in the numbers of slaves. Colonial society was largely divided over the religious and moral implications of slavery and colonies passed acts for and against the practice. But by the turn of the 18th century, African slaves were replacing indentured servants for cash crop labor, especially in southern regions. With the British colonization of Georgia in 1732, the 13 colonies that would become the United States of America were established. All had local governments with elections open to most free men, with a growing devotion to the ancient rights of Englishmen and a sense of self-government stimulating support for republicanism. With extremely high birth rates, low death rates, and steady settlement, the colonial population grew rapidly. Relatively small Native American populations were eclipsed. The Christian revivalist movement of the 1730s and 1740s known as the Great Awakening fueled interest in both religion and religious liberty. During the Seven Years' War (also known as the French and Indian War), British forces seized Canada from the French, but the francophone population remained politically isolated from the southern colonies. Excluding the Native Americans, who were being conquered and displaced, those 13 colonies had a population of over 2.1 million in 1770, about one-third that of Britain. Despite continuing new arrivals, the rate of natural increase was such that by the 1770s only a small minority of Americans had been born overseas. The colonies' distance from Britain had allowed the development of self-government, but their success motivated monarchs to periodically seek to reassert royal authority.",
	"recordType": "subsection",
	"recordTypeScore": 1,
	"popularity": 1812,
	"sourceArticle": "United States",
	"anchor": "Settlements"
	}]


This split by section greatly reduces the probability of noise on large articles. It prevents the query “Etymology Indigenous” from matching this article (which would otherwise match because there is a section called “Etymology” and another one called “Indigenous and European contact”). To improve the relevance, we have also divided the records into three different types that we order from the most important to the least important via the recordTypeScore attribute:

“main”: the first section of the article, including the title and synonyms (score of 3)
“section”: the main sections of the article (score of 2)
“subsection”: the subdivisions inside the sections of the article (score of 1)

The recordTypeScore attribute will be used in the ranking to give more importance to the “main” records than the “section” and “subsection”.

Another good property of this split is that it allows linking to the right section of the page when you click on the search result. For example, if you search for “united states Settlements”, you will be able to open the United States page with the anchor text stored in the record.

Here is the list of the four Algolia settings we applied on this index:

	{
	"attributeForDistinct": "sourceArticle",
	"searchableAttributes": [
	"title",
	"synonyms",
	"text",
	"sourceArticle"
	],
	"customRanking": [
	"desc(recordTypeScore)",
	"desc(popularity)"
	],
	"ignorePlurals": [
	"en"
	]
	}


You can set those settings via the setSettings API call or directly on the dashboard:

searchableAttributes and customRanking can be configured in the Ranking tab

Algolia Dashboard: Searchable & Ranking Attributes

ignorePlurals can also be configured in the Ranking tab. For the moment, the dashboard only lets you enable it for all languages (by setting it to true); we will improve this setting so that you can also configure the language.
attributeForDistinct can be configured in the Display tab of the dashboard.

Each of those settings is important for the relevance:

The customRanking uses the recordTypeScore to give more importance to the main record when the textual relevance is equal. If records are still tied, we use the popularity attribute, which represents the number of backlinks inside Wikipedia to this article (all records of an article share the same popularity)
The order of attributes in searchableAttributes gives the matching attributes the following decreasing importance: title > synonyms > text > sourceArticle
The attributeForDistinct is set to the name of the article, so only the best record is displayed for each matched article
We set ignorePlurals to “en” to consider all singular and plural forms of English words as identical without introducing any noise (you can also set this setting at query time if you have a search in a different language using the same index)
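The tie-breaking role of customRanking can be pictured with a small Python sketch. The textScore field is a hypothetical stand-in for everything the engine computes about textual relevance; it is not a real attribute, and the sample hits and their values are invented. Only the ordering logic is the point here.

```python
def rank(hits):
    """Sort by textual relevance first, then break ties with
    desc(recordTypeScore), then desc(popularity)."""
    return sorted(
        hits,
        key=lambda hit: (-hit["textScore"],
                         -hit["recordTypeScore"],
                         -hit["popularity"]),
    )


# Hypothetical hits: the first three tie on textual relevance.
hits = [
    {"title": "Etymology", "textScore": 10, "recordTypeScore": 2, "popularity": 1812},
    {"title": "Settlements", "textScore": 10, "recordTypeScore": 1, "popularity": 1812},
    {"title": "United States", "textScore": 10, "recordTypeScore": 3, "popularity": 1812},
    {"title": "Better text match", "textScore": 12, "recordTypeScore": 1, "popularity": 5},
]

for hit in rank(hits):
    print(hit["title"])
# Better text match
# United States
# Etymology
# Settlements
```

The record with the better textual match always wins; recordTypeScore and popularity only decide between records that are otherwise equal.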

With those settings, you can easily request a set of results without duplicates with the query parameter distinct=true. You can go even further by requesting several hits per deduplication key. You can, for example, pass distinct=3 in your query to retrieve up to three results per distinct key – this feature is usually known as grouping.

Keeping several records per distinct key

Search engines are also used for navigation or criteria-based search, because those queries are cheaper than on a regular database. Keeping several records per distinct key makes it easy to display search results differently.

For example, if you want to build a list of job offers, you will probably let the user search via different criteria like the role, location, job type, etc. You might also want to aggregate the search results per company instead of having each job display company information. This is what AngelList is doing when you search for a job on their platform: they display companies along with their three best job offers. This type of display can be built very quickly by:

Having one record per job offer with an attribute containing the company ID
Configuring the company ID attribute as attributeForDistinct in the index settings
Passing distinct=3 in your query

With those three simple actions, you can build an interface like the AngelList one below (note that SafeGraph has only two job offers that match the criteria; the value three is the upper bound on the number of records we want per distinct key). Some customers also implement a “Show more” button on each company that displays all offers for this company. To implement it, you just have to filter on the company name and specify distinct=false in the query to retrieve all offers of this company.

Figure: AngelList job search displaying several jobs per company
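The grouping behavior and the “Show more” button can be sketched in Python as follows. This is a simulation of the semantics on an in-memory list, not a call to the Algolia API: the function names, the companyId attribute and the sample offers are assumptions made for the example, and the hits are assumed to be sorted from best to worst match.

```python
def group_hits(ranked_hits, attribute_for_distinct, distinct=1):
    """Keep at most `distinct` hits per key, preserving the ranking order."""
    groups = {}
    for hit in ranked_hits:
        bucket = groups.setdefault(hit[attribute_for_distinct], [])
        if len(bucket) < distinct:
            bucket.append(hit)
    return groups


def show_more(ranked_hits, attribute_for_distinct, value):
    """Equivalent of filtering on one key with distinct disabled."""
    return [hit for hit in ranked_hits if hit[attribute_for_distinct] == value]


# Hypothetical job offers, already ranked for the user's criteria.
offers = [
    {"role": "Backend engineer", "companyId": "company-a"},
    {"role": "Data engineer", "companyId": "company-b"},
    {"role": "Frontend engineer", "companyId": "company-a"},
    {"role": "Site reliability engineer", "companyId": "company-a"},
    {"role": "Data scientist", "companyId": "company-b"},
    {"role": "Mobile engineer", "companyId": "company-a"},
]

groups = group_hits(offers, "companyId", distinct=3)
print(len(groups["company-a"]))  # 3 (the fourth offer is dropped)
print(len(groups["company-b"]))  # 2 (three is only an upper bound)
print(len(show_more(offers, "companyId", "company-a")))  # 4
```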

The Impact of Distinct on faceting

Faceting can very quickly become a headache when distinct is enabled. Let’s illustrate the problem with a very simple example containing two records sharing the same distinct key.

	[
	{
	"name": "First",
	"facet": [
	"A",
	"B"
	],
	"distinctKey": "key1"
	},
	{
	"name": "Second",
	"facet": [
	"B",
	"C"
	],
	"distinctKey": "key1"
	}
	]


The first record contains the facet values “A” and “B” and the second record contains the facet values “B” and “C”.

Let’s imagine that you have a query that retrieves both records and that faceting and distinct are both enabled in your query. You can imagine three different behaviors in this case:

Computing the faceting before the deduplication (using distinct). In that case, the possible facet values will be “A=1”, “B=2” and “C=1”. All categories are retrieved, but the counts won’t match what you have on screen.
Computing the faceting after deduplication (distinct). In this case, the result would be “A=1”, “B=1” OR “B=1”, “C=1” depending on the best matching record. In both cases, the result is not what you expect.
Applying deduplication independently for each facet value. In this case, we would have A=1, B=1 and C=1 (for each facet value, we need to look at all distinct keys and perform a deduplication at this level).
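The three behaviors can be reproduced on the two records above with a short Python sketch. The function names are hypothetical and the code only illustrates the counting logic on an in-memory list (assumed to be sorted from best to worst match); it says nothing about how the engine computes facets internally.

```python
from collections import Counter

records = [
    {"name": "First", "facet": ["A", "B"], "distinctKey": "key1"},
    {"name": "Second", "facet": ["B", "C"], "distinctKey": "key1"},
]


def facets_before_distinct(ranked_hits):
    """Count facet values on every matching record, duplicates included."""
    counts = Counter()
    for hit in ranked_hits:
        counts.update(hit["facet"])
    return dict(counts)


def facets_after_distinct(ranked_hits):
    """Count facet values only on the best record of each distinct key."""
    seen, counts = set(), Counter()
    for hit in ranked_hits:
        if hit["distinctKey"] in seen:
            continue
        seen.add(hit["distinctKey"])
        counts.update(hit["facet"])
    return dict(counts)


def facets_distinct_per_value(ranked_hits):
    """For each facet value, count the number of different distinct keys."""
    keys_per_value = {}
    for hit in ranked_hits:
        for value in hit["facet"]:
            keys_per_value.setdefault(value, set()).add(hit["distinctKey"])
    return {value: len(keys) for value, keys in keys_per_value.items()}


print(facets_before_distinct(records))     # {'A': 1, 'B': 2, 'C': 1}
print(facets_after_distinct(records))      # {'A': 1, 'B': 1}
print(facets_distinct_per_value(records))  # {'A': 1, 'B': 1, 'C': 1}
```

The third function has to keep a set of distinct keys for every facet value that matches, which gives an idea of why this strategy becomes expensive on a real index.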

The third behavior is the holy grail as the result of each facet count would be perfect. Unfortunately, the computation cost is insane and doesn’t scale as it directly depends on the number of facet values which match. In practice, it’s impossible to compute it in real time.

Let’s look at the advantages and drawbacks of the two approaches which can be computed for each query in a reasonable time:

Computing the faceting before the deduplication: all retrieved facets are valid, but the counts are misleading because they include the duplicates. That said, this approach works very well if you do not want to display the counts, and it handles every case, including records that share the same distinct key but have different facets.
Computing the faceting after the deduplication: this approach works well when you use distinct=1 and all records with the same distinct key share the same facet values, which corresponds exactly to the use case of splitting big documents. In other use cases, the results will be confusing for users, as some facet values can appear or disappear depending on the record which is kept after deduplication.

In other words, there is no perfect solution that solves all problems. This is why we have decided to implement both approaches. As faceting after the deduplication can be misleading and cause surprising effects, we have decided to use faceting before the deduplication as the standard behavior. We have a query parameter called facetingAfterDistinct=true that can be added to your query when you are splitting big records. In this case, you get a perfect result because the facets are identical in all records sharing the same distinct key.

Great features take time

We introduced the distinct feature in December 2013 and improved it significantly in July 2015 by adding support for distinct=N in the engine. The different faceting strategies have been a long-standing challenge, and it took us a lot of time to find the best approach. The query parameter facetingAfterDistinct=true was beta tested with real users for a few months before becoming public in December 2016.

This feature opened a lot of possibilities for our users and allowed a better indexing of big records, which later gave birth to our DocSearch project.

If you want to know more about the internals of the Algolia engine, we recommend reading the other posts in this series.

