One of the most distinctive and most-used features of Algolia is the Distinct feature: it lets developers deduplicate records on the fly at query time based on one specific attribute. We introduced this feature three years ago, opening up a broad range of new use cases like the deduplication of product variants – a must-have for any eCommerce search. It has also considerably changed the way we recommend handling big documents like PDFs or web pages (we recommend splitting each document into several records).
This post highlights the advantages of this feature and the challenges it raises in terms of implementation.
Searching inside large documents is probably the oldest challenge of information retrieval. It was inspired by the index at the end of books and gave birth to an entire sub-genre of statistics. Among those methods, tf-idf and BM25 are the two most popular. They are very effective for ranking documents, but they don’t handle false positives well – they only push them to the bottom of the results.
To illustrate this problem, you can perform the “states programming” query on Google and Wikipedia. According to Google Trends, this query is as popular as the Rust programming language and seems to come from developers searching for how to develop an algorithm with a state. The same query on Wikipedia is unfortunately not very relevant! The first issue probably comes from a stemming algorithm or a synonym that treats “program” as the same as “programming.” In Algolia, you can expand singular/plural forms without polluting results with stemming by using the ignorePlurals feature, which is based on a linguistic lemmatizer.
That said, even if you scroll through the first 1000 hits of Wikipedia search, you won’t find the article which appears first in Google. There are hundreds of articles that contain both the word “states” and “programming.” Even the “United States” page contains both terms and one is in the title! In this case, tf-idf and BM25 are not the most useful ranking criteria. The position of the two query terms in the document is more important, in addition to their relative distance (finding them close together is always better).
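To make the proximity argument concrete, here is a minimal sketch that measures how far apart two query terms are in a text. It is only an illustration of the idea of relative distance, not the way the Algolia engine computes proximity; the function name and the sample texts are assumptions made up for this example.

```python
from itertools import product
from typing import Optional


def min_term_distance(text: str, first: str, second: str) -> Optional[int]:
    """Return the smallest gap (in word positions) between the two terms.

    Returns None when one of the terms is absent from the text.
    """
    # Naive tokenization: split on whitespace and strip common punctuation.
    words = [w.strip('.,;:!?"()').lower() for w in text.split()]
    first_positions = [i for i, w in enumerate(words) if w == first]
    second_positions = [i for i, w in enumerate(words) if w == second]
    if not first_positions or not second_positions:
        return None
    return min(abs(a - b) for a, b in product(first_positions, second_positions))


# Hypothetical texts: both contain "states" and "programming".
close = "Dynamic programming over states is a classic technique."
far = (
    "The United States has a long history. "
    + "Filler word. " * 50
    + "Programming is taught in schools."
)

print(min_term_distance(close, "programming", "states"))
print(min_term_distance(far, "programming", "states"))
```

Both texts contain the two words, so a pure term-frequency approach sees them as matches, but the distance makes it clear that only the first one is really about the query.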
One of the best ways to avoid such relevance problems is to split big pages into several records – for example, you could create one record per section of the Wikipedia page. If a Wikipedia article were to have 4 main sections, we could create 4 different records, with the same article title, but different body content.
For example, instead of the following big record with the four sections embedded:
{
  "title": "This is the title",
  "summary": "article summary",
  "sections": [
    {
      "title": "first section",
      "text": "..."
    },
    {
      "title": "second section",
      "text": "..."
    },
    {
      "title": "third section",
      "text": "..."
    },
    {
      "title": "fourth section",
      "text": "..."
    }
  ]
}
We could have 5 different records: one for the main content and one per section. All those records share the same source attribute, which tells the engine that they are all components of the same article:
[
  {
    "title": "This is the title",
    "summary": "article summary",
    "source": "articleName"
  },
  {
    "sectionTitle": "first section",
    "text": "...",
    "source": "articleName"
  },
  {
    "sectionTitle": "second section",
    "text": "...",
    "source": "articleName"
  },
  {
    "sectionTitle": "third section",
    "text": "...",
    "source": "articleName"
  },
  {
    "sectionTitle": "fourth section",
    "text": "...",
    "source": "articleName"
  }
]
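The transformation from the embedded record to the separated records can be sketched in a few lines of Python. This is only one possible way to do the split; the function name split_article is hypothetical, and the attribute names simply mirror the two JSON examples above.

```python
def split_article(article: dict, source: str) -> list:
    """Turn one record with embedded sections into one record per section.

    Every produced record carries the same "source" value so that the
    engine can later recognize them as parts of the same article.
    """
    main_record = {
        "title": article["title"],
        "summary": article["summary"],
        "source": source,
    }
    section_records = [
        {"sectionTitle": section["title"], "text": section["text"], "source": source}
        for section in article.get("sections", [])
    ]
    return [main_record] + section_records


article = {
    "title": "This is the title",
    "summary": "article summary",
    "sections": [
        {"title": "first section", "text": "..."},
        {"title": "second section", "text": "..."},
        {"title": "third section", "text": "..."},
        {"title": "fourth section", "text": "..."},
    ],
}

records = split_article(article, source="articleName")
print(len(records))
```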
We can use the same approach for technical documentation, which often consists of just a few long pages in order to improve the developer experience: scrolling is better than loading new pages! This is exactly why we decided to split pages into several records in our DocSearch project; you can learn more about our approach in this blog post.
It might sound counter-intuitive, but the problem is even more visible when your data set is smaller! While the problem only shows up on well-selected queries in the Wikipedia use case, it becomes very apparent when you search inside a website with less content. You have a high probability of getting false positives in your search, which will frustrate your users.
Splitting a page into several records is usually easy as you can use the formatting as a way to split. The problem with several records per document is that you will introduce duplicates in your search results. For example, all paragraphs of the United States Wikipedia article will match for the “United States” query. In this case, you would like to keep only the best match and avoid duplicates in the search results.
This is where the distinct feature comes into play! You just have to introduce one attribute with the same value for all records of the same source (typically an ID of the document) and declare it as the attributeForDistinct in your index settings. At query time, only the best match for each value of the attribute for distinct is kept; all the other records are removed to avoid any duplicates.
If we split the United States Wikipedia article by sections and subsections, it would generate 39 records. Here are the first four records:
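Conceptually, the deduplication can be pictured as a single pass over the ranked hits. The sketch below is a simplified model of that behavior, not the engine's actual implementation; it assumes the hits are already sorted from best to worst match and that every hit carries the distinct attribute. The function name and the sample hits are made up for the illustration.

```python
def deduplicate(ranked_hits, attribute_for_distinct, max_per_key=1):
    """Keep at most max_per_key hits per distinct value, preserving rank order."""
    kept = []
    seen = {}  # distinct value -> number of hits already kept
    for hit in ranked_hits:
        key = hit[attribute_for_distinct]
        if seen.get(key, 0) < max_per_key:
            kept.append(hit)
            seen[key] = seen.get(key, 0) + 1
    return kept


# Hypothetical ranked results: three records of one article, one of another.
ranked_hits = [
    {"title": "United States", "sourceArticle": "United States"},
    {"title": "Etymology", "sourceArticle": "United States"},
    {"title": "United States dollar", "sourceArticle": "United States dollar"},
    {"title": "Settlements", "sourceArticle": "United States"},
]

for hit in deduplicate(ranked_hits, "sourceArticle"):
    print(hit["title"])
```

Because the hits are visited in ranking order, the record kept for each key is always the best-ranked one for that key.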
[{
  "title": "United States",
  "synonyms": ["United States of America", "America", "US", "U.S.", "USA", "U.S.A.", "the landmass encompassing North America and South America America", "Americas"],
"text": "The United States of America (USA), commonly referred to as the United States (U.S.) or America, is a federal republic composed of 50 states, a federal district, five major self-governing territories, and various possessions. Forty-eight of the fifty states and the federal district are contiguous and located in North America between Canada and Mexico. The state of Alaska is in the far northwestern corner of North America, with a land border to the east with Canada and separated by the Bering Strait from Russia. The state of Hawaii is an archipelago in the mid-Pacific. The territories are scattered about the Pacific Ocean and the Caribbean Sea. Nine time zones are covered. The geography, climate and wildlife of the country are extremely diverse.\nAt 3.8million square miles (9.8million km2) and with over 324 million people, the United States is the world's fourth-largest country by total area (and fourth-largest by land area) and the third-most populous. It is one of the world's most ethnically diverse and multicultural nations, and is home to the world's largest immigrant population. Urbanization climbed to over 80% in 2010 and leads to growing megaregions. The country's capital is Washington, D.C. and its largest city is New York City; the other major metropolitan areas, all with around five million or more inhabitants, are Los Angeles, Chicago, San Francisco, Boston, Dallas, Philadelphia, Houston, Miami, and Atlanta.\nPaleo-Indians migrated from Asia to the North American mainland at least 15,000 years ago. European colonization began in the 16th century. The United States emerged from 13 British colonies along the East Coast. Numerous disputes between Great Britain and the colonies in the aftermath of the Seven Years' War led to the American Revolution, which began in 1775. On July 4, 1776, as the colonies were fighting Great Britain in the American Revolutionary War, delegates from the 13 colonies unanimously adopted the Declaration of Independence. 
The war ended in 1783 with recognition of the independence of the United States by Great Britain, and was the first successful war of independence against a European colonial empire. The current constitution was adopted in 1788, after the Articles of Confederation, adopted in 1781, were felt to have provided inadequate federal powers. The first ten amendments, collectively named the Bill of Rights, were ratified in 1791 and designed to guarantee many fundamental civil liberties.\nThe United States embarked on a vigorous expansion across North America throughout the 19th century, displacing American Indian tribes, acquiring new territories, and gradually admitting new states until it spanned the continent by 1848. During the second half of the 19th century, the American Civil War led to the end of legal slavery in the country. By the end of that century, the United States extended into the Pacific Ocean, and its economy, driven in large part by the Industrial Revolution, began to soar. The Spanish–American War and World War I confirmed the country's status as a global military power. The United States emerged from World War II as a global superpower, the first country to develop nuclear weapons, the only country to use them in warfare, and a permanent member of the United Nations Security Council. It is a founding member of the Organization of American States (UAS) and various other Pan-American and international organizations. The end of the Cold War and the dissolution of the Soviet Union in 1991 left the United States as the world's sole superpower.\nThe United States is a highly developed country, with the world's largest economy by nominal GDP. It ranks highly in several measures of socioeconomic performance, including average wage, human development, per capita GDP, and productivity per person. While the U.S. 
economy is considered post-industrial, characterized by the dominance of services and knowledge economy, the manufacturing sector remains the second-largest in the world. Though its population is only 4.4% of the world total, the United States accounts for nearly a quarter of world GDP and almost a third of global military spending, making it the world's foremost military and economic power. The United States is a prominent political and cultural force internationally, and a leader in scientific research and technological innovations.",
  "recordType": "main",
  "recordTypeScore": 3,
  "popularity": 1812,
  "sourceArticle": "United States",
  "anchor": ""
},
{
  "title": "Etymology",
"text": "See also: Naming of America, Names for United States citizens, American (word), and Names of the United States In 1507 the German cartographer Martin Waldseemüller produced a world map on which he named the lands of the Western Hemisphere \"America\" after the Italian explorer and cartographer Amerigo Vespucci (Latin: Americus Vespucius). The first documentary evidence of the phrase \"United States of America\" is from a letter dated January 2, 1776, written by Stephen Moylan, Esq., George Washington's aide-de-camp and Muster-Master General of the Continental Army. Addressed to Lt. Col. Joseph Reed, Moylan expressed his wish to carry the \"full and ample powers of the United States of America\" to Spain to assist in the revolutionary war effort. The first known publication of the phrase \"United States of America\" was in an anonymous essay in The Virginia Gazette newspaper in Williamsburg, Virginia, on April 6, 1776. The second draft of the Articles of Confederation, prepared by John Dickinson and completed by June 17, 1776, at the latest, declared \"The name of this Confederation shall be the 'United States of America.'\" The final version of the Articles sent to the states for ratification in late 1777 contains the sentence \"The Stile of this Confederacy shall be 'The United States of America'\". In June 1776, Thomas Jefferson wrote the phrase \"UNITED STATES OF AMERICA\" in all capitalized letters in the headline of his \"original Rough draught\" of the Declaration of Independence. This draft of the document did not surface until June 21, 1776, and it is unclear whether it was written before or after Dickinson used the term in his June 17 draft of the Articles of Confederation. In the final Fourth of July version of the Declaration, the title was changed to read, \"The unanimous Declaration of the thirteen united States of America\". 
The preamble of the Constitution states \"...establish this Constitution for the United States of America.\" The short form \"United States\" is also standard. Other common forms are the \"U.S.\", the \"USA\", and \"America\". Colloquial names are the \"U.S. of A.\" and, internationally, the \"States\". \"Columbia\", a name popular in poetry and songs of the late 18th century, derives its origin from Christopher Columbus; it appears in the name \"District of Columbia\". In non-English languages, the name is frequently the translation of either the \"United States\" or \"United States of America\", and colloquially as \"America\". In addition, an abbreviation (e.g. USA) is sometimes used. The phrase \"United States\" was originally plural, a description of a collection of independent states—e.g., \"the United States are\"—including in the Thirteenth Amendment to the United States Constitution, ratified in 1865. The singular form—e.g., \"the United States is\"— became popular after the end of the American Civil War. The singular form is now standard; the plural form is retained in the idiom \"these United States\". The difference is more significant than usage; it is a difference between a collection of states and a unit. A citizen of the United States is an \"American\". \"United States\", \"American\" and \"U.S.\" refer to the country adjectivally (\"American values\", \"U.S.forces\"). \"American\" rarely refers to subjects not connected with the United States.",
  "recordType": "section",
  "recordTypeScore": 2,
  "popularity": 1812,
  "sourceArticle": "United States",
  "anchor": "Etymology"
},
{
  "title": ["History", "Indigenous and European contact"],
  "text": "Further information: Pre-Columbian era and Colonial history of the United States An artistic recreation of The Kincaid Site from the prehistoric Mississippian culture as it may have looked at its peak 1050-1400 AD Italian explorer Christopher Columbus arrives in America and takes possession of Guanahani The first inhabitants of North America migrated from Siberia by way of the Bering land bridge and arrived at least 15,000 years ago, though increasing evidence suggests an even earlier arrival. Some, such as the pre-Columbian Mississippian culture, developed advanced agriculture, grand architecture, and state-level societies. After the Spanish conquistadors made the first contacts, the native population declined for various reasons, primarily from diseases such as smallpox and measles. Violence was not a significant factor in the overall decline among Native Americans, though conflict among themselves and with Europeans affected specific tribes and various colonial settlements. In the Hawaiian Islands, the earliest indigenous inhabitants arrived around 1 AD from Polynesia. Europeans under the British explorer Captain James Cook arrived in the Hawaiian Islands in 1778. In the early days of colonization, many European settlers were subject to food shortages, disease, and attacks from Native Americans. Native Americans were also often at war with neighboring tribes and allied with Europeans in their colonial wars. At the same time, however, many natives and settlers came to depend on each other. Settlers traded for food and animal pelts, natives for guns, ammunition and other European wares. Natives taught many settlers where, when and how to cultivate corn, beans and squash. European missionaries and others felt it was important to \"civilize\" the Native Americans and urged them to adopt European agricultural techniques and lifestyles.",
  "recordType": "subsection",
  "recordTypeScore": 1,
  "popularity": 1812,
  "sourceArticle": "United States",
  "anchor": "Indigenous_and_European_contact"
},
{
  "title": ["History", "Settlements"],
"text":"Further information: European colonization of the Americas and Thirteen Colonies Globe showing North America from 1602. Castillo de San Marcos in St. Augustine, Florida, the oldest continuously occupied European-established settlement in the United States The signing of the Mayflower Compact, 1620 After Spain sent Columbus on his first voyage to the New World in 1492, other explorers followed. The Spanish set up small settlements in New Mexico and Florida. France had several small settlements along the Mississippi River. Successful English settlement on the eastern coast of North America began with the Virginia Colony in 1607 at Jamestown and the Pilgrims' Plymouth Colony in 1620. Early experiments in communal living failed until the introduction of private farm holdings. Many settlers were dissenting Christian groups who came seeking religious freedom. The continent's first elected legislative assembly, Virginia's House of Burgesses created in 1619, and the Mayflower Compact, signed by the Pilgrims before disembarking, established precedents for the pattern of representative self-government and constitutionalism that would develop throughout the American colonies. Most settlers in every colony were small farmers, but other industries developed within a few decades as varied as the settlements. Cash crops included tobacco, rice and wheat. Extraction industries grew up in furs, fishing and lumber. Manufacturers produced rum and ships, and by the late colonial period Americans were producing one-seventh of the world's iron supply. Cities eventually dotted the coast to support local economies and serve as trade hubs. English colonists were supplemented by waves of Scotch-Irish and other groups. As coastal land grew more expensive freed indentured servants pushed further west. 
Slave cultivation of cash crops began with the Spanish in the 1500s, and was adopted by the English, but life expectancy was much higher in North America because of less disease and better food and treatment, leading to a rapid increase in the numbers of slaves. Colonial society was largely divided over the religious and moral implications of slavery and colonies passed acts for and against the practice. But by the turn of the 18th century, African slaves were replacing indentured servants for cash crop labor, especially in southern regions. With the British colonization of Georgia in 1732, the 13 colonies that would become the United States of America were established. All had local governments with elections open to most free men, with a growing devotion to the ancient rights of Englishmen and a sense of self-government stimulating support for republicanism. With extremely high birth rates, low death rates, and steady settlement, the colonial population grew rapidly. Relatively small Native American populations were eclipsed. The Christian revivalist movement of the 1730s and 1740s known as the Great Awakening fueled interest in both religion and religious liberty. During the Seven Years' War (also known as the French and Indian War), British forces seized Canada from the French, but the francophone population remained politically isolated from the southern colonies. Excluding the Native Americans, who were being conquered and displaced, those 13 colonies had a population of over 2.1 million in 1770, about one-third that of Britain. Despite continuing new arrivals, the rate of natural increase was such that by the 1770s only a small minority of Americans had been born overseas. The colonies' distance from Britain had allowed the development of self-government, but their success motivated monarchs to periodically seek to reassert royal authority.",
  "recordType": "subsection",
  "recordTypeScore": 1,
  "popularity": 1812,
  "sourceArticle": "United States",
  "anchor": "Settlements"
}]
This split by section greatly reduces the probability of noise on large articles. It prevents the query “Etymology Indigenous” from matching this article (which would otherwise match because there is a section called “Etymology” and another one called “Indigenous and European contact”). To improve the relevance, we have also divided the records into three types that we order from the most important to the least important via the recordTypeScore attribute:
- main: the main record of the article (recordTypeScore of 3)
- section: one record per section (recordTypeScore of 2)
- subsection: one record per subsection (recordTypeScore of 1)
The recordTypeScore attribute is used in the ranking to give more importance to the “main” records than to the “section” and “subsection” records.
Another good property of this split is that it allows linking to the right section of the page when you click on the search result. For example, if you search for “united states Settlements”, you will be able to open the United States page with the anchor text stored in the record.
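As an illustration, the link for a search result can be derived from the sourceArticle and anchor attributes of the hit. The sketch below assumes the English Wikipedia URL scheme as the base URL and uses a hypothetical helper name; adapt both to your own site.

```python
from urllib.parse import quote

# Assumption for this example: pages live under the English Wikipedia URL scheme.
BASE_URL = "https://en.wikipedia.org/wiki/"


def hit_url(hit: dict) -> str:
    """Build the URL of a hit, pointing at the right section when an anchor is set."""
    page = quote(hit["sourceArticle"].replace(" ", "_"))
    anchor = hit.get("anchor", "")
    if anchor:
        return f"{BASE_URL}{page}#{quote(anchor)}"
    return f"{BASE_URL}{page}"


print(hit_url({"sourceArticle": "United States", "anchor": "Settlements"}))
print(hit_url({"sourceArticle": "United States", "anchor": ""}))
```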
Here are the four Algolia settings we applied to this index:
{
  "attributeForDistinct": "sourceArticle",
  "searchableAttributes": [
    "title",
    "synonyms",
    "text",
    "sourceArticle"
  ],
  "customRanking": [
    "desc(recordTypeScore)",
    "desc(popularity)"
  ],
  "ignorePlurals": [
    "en"
  ]
}
You can apply those settings via the setSettings API call or directly on the dashboard.
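For readers who prefer to see the raw call, here is a hedged sketch that builds the HTTP request behind a settings update using only the Python standard library. It assumes the REST endpoint PUT /1/indexes/{indexName}/settings on the {applicationID}.algolia.net host with the X-Algolia-Application-Id and X-Algolia-API-Key headers; please check the official API reference before relying on it. The index name wikipedia_en, the credentials and the helper name are placeholders, and the request is only built, not sent.

```python
import json
import urllib.request
from urllib.parse import quote


def build_set_settings_request(app_id, api_key, index_name, settings):
    """Build (without sending) the HTTP request that updates the index settings."""
    url = f"https://{app_id}.algolia.net/1/indexes/{quote(index_name, safe='')}/settings"
    return urllib.request.Request(
        url,
        data=json.dumps(settings).encode("utf-8"),
        method="PUT",
        headers={
            "X-Algolia-Application-Id": app_id,
            "X-Algolia-API-Key": api_key,
            "Content-Type": "application/json",
        },
    )


settings = {
    "attributeForDistinct": "sourceArticle",
    "searchableAttributes": ["title", "synonyms", "text", "sourceArticle"],
    "customRanking": ["desc(recordTypeScore)", "desc(popularity)"],
    "ignorePlurals": ["en"],
}

# Placeholder credentials and index name, to be replaced with your own.
request = build_set_settings_request(
    "YourApplicationID", "YourAdminAPIKey", "wikipedia_en", settings
)
print(request.get_method(), request.full_url)
# Sending it would be: urllib.request.urlopen(request)
```

In practice, the official API clients wrap this call for you, so this sketch is mostly useful to see what is sent over the wire.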
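Each of those settings is important for the relevance:
- attributeForDistinct: declares sourceArticle as the attribute used to deduplicate the records of the same article.
- searchableAttributes: lists the attributes to search in, ordered by importance (title first, then synonyms, text and sourceArticle).
- customRanking: favors the “main” records over the “section” and “subsection” records via recordTypeScore, then uses popularity.
- ignorePlurals: matches singular and plural forms in English without resorting to stemming.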
With those settings, you can easily request a set of results without duplicates with the query parameter distinct=true. You can go even further by requesting several hits per deduplication key. You can, for example, pass distinct=3 in your query to retrieve up to three results per distinct key – this feature is usually known as grouping.
Search engines are also used for navigation or criteria-based search because those queries are cheaper than on a regular database. Keeping several records per distinct key makes it easy to offer a different display of search results.
For example, if you want to build a list of job offers, you will probably let the user search via different criteria like the role, location, job type, etc. You might also want to aggregate the search results per company instead of having each job display company information. This is what AngelList does when you search for a job on their platform: they display companies along with their three best job offers. This type of display can be built very quickly by:
- indexing one record per job offer, including the company information
- declaring the company as the attributeForDistinct
- passing distinct=3 in the query
With those three simple actions, you can build an interface like the one on AngelList (note that SafeGraph has only two job offers that match the criteria; the value three is only an upper bound on the number of records returned per distinct key). Some customers also implement a “Show more” button on each company that displays all offers for this company. To implement it, you just have to filter on the company name and specify distinct=false in the query to retrieve all offers of this company.
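The grouped display and the “Show more” button can be modeled with two small helpers. This is a client-side simulation meant to explain the behavior of distinct=3 and of the follow-up filtered query, not the engine's code; the offers, the company name Acme and the helper names are invented for the example (SafeGraph simply echoes the case mentioned above).

```python
def group_by_company(ranked_offers, per_company=3):
    """Group ranked offers by company, keeping at most per_company offers each."""
    groups = {}  # dicts keep insertion order, so companies stay in ranking order
    for offer in ranked_offers:
        bucket = groups.setdefault(offer["company"], [])
        if len(bucket) < per_company:
            bucket.append(offer)
    return groups


def all_offers_for(ranked_offers, company):
    """Model of the "Show more" query: filter on the company, no deduplication."""
    return [offer for offer in ranked_offers if offer["company"] == company]


# Hypothetical offers, already sorted from best to worst match.
ranked_offers = [
    {"role": "Backend Engineer", "company": "Acme"},
    {"role": "Data Engineer", "company": "SafeGraph"},
    {"role": "Frontend Engineer", "company": "Acme"},
    {"role": "DevOps Engineer", "company": "Acme"},
    {"role": "Software Engineer", "company": "SafeGraph"},
    {"role": "QA Engineer", "company": "Acme"},
]

for company, offers in group_by_company(ranked_offers).items():
    print(company, [offer["role"] for offer in offers])

print(len(all_offers_for(ranked_offers, "Acme")))
```

Note how the second company only shows two offers: the value three is a cap, not a guarantee.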
Faceting can very quickly become a headache when distinct is enabled. Let’s illustrate the problem with a very simple example containing two records sharing the same distinct key.
[
  {
    "name": "First",
    "facet": [
      "A",
      "B"
    ],
    "distinctKey": "key1"
  },
  {
    "name": "Second",
    "facet": [
      "B",
      "C"
    ],
    "distinctKey": "key1"
  }
]
The first record contains the facet values “A” and “B” and the second record contains the facet values “B” and “C”.
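Let’s imagine that you have a query that retrieves both records and that faceting and distinct are both enabled in your query. You can imagine three different behaviors in this case:
- Faceting is computed before the deduplication: A=1, B=2, C=1.
- Faceting is computed after the deduplication, on the only record that is kept (say, the first one): A=1, B=1.
- Faceting is computed so that each distinct key is counted only once per facet value: A=1, B=1, C=1.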
The third behavior is the holy grail as the result of each facet count would be perfect. Unfortunately, the computation cost is insane and doesn’t scale as it directly depends on the number of facet values which match. In practice, it’s impossible to compute it in real time.
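To see the difference on the two records above, here is a small sketch that computes facet counts in three ways: before deduplication, after deduplication, and with each distinct key counted once per facet value. This is our reading of the three behaviors, written as a toy model; the function names are made up and the code says nothing about how the engine implements them.

```python
from collections import Counter

records = [
    {"name": "First", "facet": ["A", "B"], "distinctKey": "key1"},
    {"name": "Second", "facet": ["B", "C"], "distinctKey": "key1"},
]


def facets_before_distinct(hits):
    """Count every facet value of every matching record."""
    return Counter(value for hit in hits for value in hit["facet"])


def facets_after_distinct(hits):
    """Count facet values only on the first (best) record of each distinct key."""
    best = {}
    for hit in hits:
        best.setdefault(hit["distinctKey"], hit)
    return Counter(value for hit in best.values() for value in hit["facet"])


def facets_per_value_distinct(hits):
    """Count, for each facet value, the number of distinct keys that carry it."""
    keys_per_value = {}
    for hit in hits:
        for value in hit["facet"]:
            keys_per_value.setdefault(value, set()).add(hit["distinctKey"])
    return Counter({value: len(keys) for value, keys in keys_per_value.items()})


print(dict(facets_before_distinct(records)))
print(dict(facets_after_distinct(records)))
print(dict(facets_per_value_distinct(records)))
```

The last function has to remember a set of distinct keys for every facet value that matches, which hints at why this approach becomes too expensive on a large index.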
Let’s look at the advantages and drawbacks of the two approaches which can be computed for each query in a reasonable time:
In other words, there is no perfect solution that solves all problems. This is why we decided to implement both approaches. Because facet counts computed after the deduplication can be misleading and cause weird effects, we decided to keep faceting before the deduplication as the standard behavior. The query parameter facetingAfterDistinct=true can be added to your query when you are splitting big records. In this case, you get a perfect result because the facets are identical in all records sharing the same distinct key.
We introduced the distinct feature in December 2013 and improved it significantly in July 2015 by adding support for distinct=N in the engine. The different faceting strategies have been a long-standing challenge and it took us a lot of time to find the best approach. The query parameter facetingAfterDistinct=true was beta tested with real users for a few months before becoming public in December 2016.
This feature opened up a lot of possibilities for our users and allowed better indexing of big records, which later gave birth to our DocSearch project.
If you want to know more about the internals of the Algolia engine, we recommend reading the other posts in this series.
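The split-document case can be checked with a toy example: when every record of an article carries the same facet values, counting after the deduplication gives one count per article, whereas counting before it counts every section. The category attribute, the sample hits and the helper below are assumptions made for this illustration; only distinct, facets and facetingAfterDistinct are actual query parameter names.

```python
from collections import Counter

# Hypothetical hits for one query: three records of the same article, one of another.
hits = [
    {"sourceArticle": "United States", "category": "Countries", "anchor": ""},
    {"sourceArticle": "United States", "category": "Countries", "anchor": "Etymology"},
    {"sourceArticle": "United States", "category": "Countries", "anchor": "Settlements"},
    {"sourceArticle": "Rust", "category": "Programming languages", "anchor": ""},
]


def count_facet(hits, facet, after_distinct):
    """Count a single-valued facet, optionally after keeping one hit per article."""
    if after_distinct:
        first_per_key = {}
        for hit in hits:
            first_per_key.setdefault(hit["sourceArticle"], hit)
        hits = list(first_per_key.values())
    return Counter(hit[facet] for hit in hits)


# Query parameters one would send in this scenario.
search_params = {
    "query": "states",
    "distinct": True,
    "facets": ["category"],
    "facetingAfterDistinct": True,
}

print(dict(count_facet(hits, "category", after_distinct=False)))
print(dict(count_facet(hits, "category", after_distinct=True)))
```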
Julien Lemoine
Co-founder & former CTO at Algolia