Indexing Long Documents

If you want to index long documents with Algolia, you need to split them up into smaller records. Record size is limited for performance reasons, meaning each new “chunk” should realistically only be a paragraph or two.

Consider the case of a lengthy Wikipedia page with many paragraphs, and people adding new content all the time. If you indexed the whole page as a single record, you would likely hit the record size limit. Besides, we don’t recommend indexing large amounts of content in a single record, as it degrades search relevance. A better approach is to create small, hierarchical records based on the structure of the page.

This approach results in some redundancy of data. For that reason, we can leverage Algolia’s distinct feature to de-duplicate records based on a single attribute.

Looking into indexing a documentation website? Take a look at DocSearch. It’s free!

Modifying the data

Before

If we took the approach of creating one record per page, our dataset could look like this:

[
  {
    "title": "Algolia",
    "permalink": "https://en.wikipedia.org/wiki/Algolia",
    "content": "Algolia is a U.S. startup company offering a web search product through a SaaS (software as a service) model.",
    "sections": [
      {
        "name": "Company",
        "content": "Algolia was founded in 2012 by Nicolas Dessaigne and Julien Lemoine, who are originally from Paris, France. It was originally a company focused on offline search on mobile phones. Later it was selected to be part of Y Combinator's[1] Winter 2014 class.\nStarting with two data centres in Europe and the US, Algolia opened a third centre in Singapore in March 2014,[2] and as of 2016, claimed to be present in 47 locations across 15 worldwide regions.[3] It serves roughly 1,600 customers, handling 12 billion user queries per month.[4] Those customers are among e-commerce, medium and other fields, including DC Shoes, Medium and vevo.[5] In May 2015, Algolia received 18.3 million dollars in a series A investment from a financial group led by Accel Partners,[6] and in 2017 a $53M series B investment, also led by Accel Partners[7] From June 2016 to June 2017, the usage of Algolia by small websites has increased from 632 to 1,591 in the \"top 1mio websites\" evaluated by BuiltWith. In the same timeframe, BuiltWith recorded no significant usage increase among their \"top 10k homepages\".[8]"
      },
      {
        "name": "Products and Technology",
        "content": "The Algolia model provides search as a service, offering web search across a client's website using an externally hosted search engine.[9][10] Although in-site search has long been available from general web search providers such as Google, this is typically done as a subset of general web searching. The search engine crawls or spiders the web at large, including the client site, and then offers search features restricted to only that target site. This is a large and complex task, available only to large organisations at the scale of Google or Microsoft."
      }
    ]
  },
  ...
]

We don’t recommend this. As more and more people add new content, you would likely hit the record size limit (if you haven’t already). Also, this doesn’t make for a great search experience. Full pages contain too much text, which leads to returning irrelevant results when a user performs searches.
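To spot oversized records before indexing, you can approximate a record's size from its JSON serialization. This is a minimal sketch: `recordSizeInBytes` is a hypothetical helper, not part of the Algolia API client, and `$maxBytes` is a placeholder value rather than an official limit, so replace it with the limit that applies to your own application.

```php
<?php
// Hypothetical helper: approximate size of a record once serialized as JSON.
function recordSizeInBytes(array $record): int
{
    return strlen((string) json_encode($record, JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_SLASHES));
}

// Placeholder threshold for illustration only, not an official Algolia limit.
$maxBytes = 10000;

// A deliberately large record standing in for a full page.
$record = [
    'title' => 'Algolia',
    'content' => str_repeat('Lorem ipsum. ', 2000),
];

$size = recordSizeInBytes($record);
if ($size > $maxBytes) {
    echo "Record is {$size} bytes: split it before indexing\n";
}
```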

The right approach would be to split content into smaller records, by paragraph.

After

When you split by paragraph, a single page results in several Algolia records. Here’s an example of what this might look like:

[
  {
    "title": "Algolia",
    "permalink": "https://en.wikipedia.org/wiki/Algolia",
    "content": "Algolia is a U.S. startup company offering a web search product through a SaaS (software as a service) model."
  },
  {
    "title": "Algolia",
    "section": "Company",
    "permalink": "https://en.wikipedia.org/wiki/Algolia",
    "content": "Algolia was founded in 2012 by Nicolas Dessaigne and Julien Lemoine, who are originally from Paris, France. It was originally a company focused on offline search on mobile phones. Later it was selected to be part of Y Combinator's[1] Winter 2014 class."
  },
  {
    "title": "Algolia",
    "section": "Company",
    "permalink": "https://en.wikipedia.org/wiki/Algolia",
    "content": "Starting with two data centres in Europe and the US, Algolia opened a third centre in Singapore in March 2014,[2] and as of 2016, claimed to be present in 47 locations across 15 worldwide regions.[3] It serves roughly 1,600 customers, handling 12 billion user queries per month.[4] Those customers are among e-commerce, medium and other fields, including DC Shoes, Medium and vevo.[5] In May 2015, Algolia received 18.3 million dollars in a series A investment from a financial group led by Accel Partners,[6] and in 2017 a $53M series B investment, also led by Accel Partners[7] From June 2016 to June 2017, the usage of Algolia by small websites has increased from 632 to 1,591 in the \"top 1mio websites\" evaluated by BuiltWith. In the same timeframe, BuiltWith recorded no significant usage increase among their \"top 10k homepages\".[8]"
  },
  {
    "title": "Algolia",
    "section": "Products and Technology",
    "permalink": "https://en.wikipedia.org/wiki/Algolia",
    "content": "The Algolia model provides search as a service, offering web search across a client's website using an externally hosted search engine.[9][10] Although in-site search has long been available from general web search providers such as Google, this is typically done as a subset of general web searching. The search engine crawls or spiders the web at large, including the client site, and then offers search features restricted to only that target site. This is a large and complex task, available only to large organisations at the scale of Google or Microsoft."
  },
  ...
]

With this approach, you greatly reduce the risk of hitting the record size limit, and search results become much more precise. You can handle the duplicated data with Algolia’s distinct feature, for example to retrieve only one result per section.
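The transformation from the "Before" shape to the "After" shape can be sketched in plain PHP. This is one possible implementation under stated assumptions: `splitPageIntoRecords` is a hypothetical helper, not part of the Algolia API client, and it assumes that paragraphs inside a section are separated by newlines. The sample page uses short placeholder text instead of the full Wikipedia content.

```php
<?php
// Hypothetical helper: turns one page-level record into paragraph-level records.
function splitPageIntoRecords(array $page): array
{
    // The page introduction becomes a record without a "section" attribute.
    $records = [[
        'title' => $page['title'],
        'permalink' => $page['permalink'],
        'content' => $page['content'],
    ]];

    foreach ($page['sections'] ?? [] as $section) {
        // Assumption: paragraphs within a section are separated by newlines.
        $paragraphs = preg_split('/\R+/', $section['content'], -1, PREG_SPLIT_NO_EMPTY);
        foreach ($paragraphs as $paragraph) {
            $paragraph = trim($paragraph);
            if ($paragraph === '') {
                continue; // skip whitespace-only lines
            }
            // Each paragraph repeats the page title, section name, and permalink.
            $records[] = [
                'title' => $page['title'],
                'section' => $section['name'],
                'permalink' => $page['permalink'],
                'content' => $paragraph,
            ];
        }
    }

    return $records;
}

$page = [
    'title' => 'Algolia',
    'permalink' => 'https://en.wikipedia.org/wiki/Algolia',
    'content' => 'Intro paragraph.',
    'sections' => [
        ['name' => 'Company', 'content' => "First paragraph.\nSecond paragraph."],
        ['name' => 'Products and Technology', 'content' => 'Only paragraph.'],
    ],
];

$records = splitPageIntoRecords($page);
echo count($records), "\n"; // 4: one intro record plus three paragraph records
```

Repeating `title`, `section`, and `permalink` on every record is the redundancy mentioned earlier. It is what lets each paragraph stand alone as a search result and what gives distinct an attribute to group on.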

Set attributeForDistinct="attribute_name" and distinct=true

Using the API

At indexing time

To use distinct, first set section as attributeForDistinct at indexing time. Only then can you set distinct to true to de-duplicate your results. Setting distinct at indexing time is optional: you can set it at query time instead.

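// Assumes $index is an already initialized Algolia index object.
$index->setSettings([
  'attributeForDistinct' => 'section', // group records by their section
  'distinct' => true                   // return one record per group
]);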

At query time

Once attributeForDistinct is set, you can enable distinct by setting it to true. Note that you can set distinct to true or 1 interchangeably.

// Enables distinct for this query only; attributeForDistinct must already be set.
$results = $index->search('query', [
  'distinct' => true
]);
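To build intuition for what distinct does to a result list, the following sketch mimics it locally on an already ranked array of hits. It is an illustration only: Algolia applies distinct on its servers, `keepFirstHitPerGroup` is a hypothetical helper, and keeping hits that lack the attribute is an assumption of this sketch rather than documented engine behavior.

```php
<?php
// Illustration only: mimics distinct=true by keeping the best-ranked hit per group.
function keepFirstHitPerGroup(array $rankedHits, string $attributeForDistinct): array
{
    $seen = [];
    $kept = [];

    foreach ($rankedHits as $hit) {
        // Assumption of this sketch: hits without the attribute are always kept.
        if (!isset($hit[$attributeForDistinct])) {
            $kept[] = $hit;
            continue;
        }

        $group = $hit[$attributeForDistinct];
        if (isset($seen[$group])) {
            continue; // a better-ranked hit from this group was already kept
        }

        $seen[$group] = true;
        $kept[] = $hit;
    }

    return $kept;
}

// Placeholder hits, ordered from most to least relevant.
$rankedHits = [
    ['section' => 'Company', 'content' => 'First Company paragraph'],
    ['section' => 'Company', 'content' => 'Second Company paragraph'],
    ['section' => 'Products and Technology', 'content' => 'Only Products paragraph'],
];

$hits = keepFirstHitPerGroup($rankedHits, 'section');
echo count($hits), "\n"; // 2: one hit per section
```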

Using the Dashboard

You can also set your attribute for distinct and enable distinct in your Algolia dashboard.

  • Go to your dashboard and select your index.
  • Click the Configuration tab.
  • In the Search behavior section, select Deduplication and Grouping.
  • Set the Distinct dropdown to true.
  • Select your attribute in the Attribute for Distinct dropdown.
  • Don’t forget to save your changes.