Indexing Markdown content with Algolia

Back to all blogs

I teamed up with Starschema Full Stack Engineer Soma Osvay to write about a project near and dear to my heart: developer documentation. Soma and I both hope you enjoy this article where we cover the use-case and challenges and the implementation Soma came up with.

Here at Starschema, we have a lot of projects we’ve done as part of a consultation with markdown documentation. When we’re supporting these existing solutions or wanting to develop a new project, we often want to search through all of the documentation. We currently don’t have a solution, costing us man-hours since we have to do this work manually.

Use Case

We need to be able to search for projects related to specific topics to check implementation specifics, get ideas from the code, and much more. The Sales team also needs a solution to determine what kind of projects our team has completed for certain topics and be able to quickly communicate it back to potential clients. It would also be great to have something like an internal Stack Overflow for our Developers since we are doing a lot of deep technical work in Tableau. Project Managers also need to be able to determine which employees have worked with certain technologies so they can get questions answered quickly and easily.

In short, we need to be able to search for our own projects based on the following attributes:

Technologies used (coding languages, databases, etc)
Project keywords (tags)
Project documentation itself such as install instructions, technical documentation, etc.
Timeframe of when the project occurred

All of these attributes exist within the internal documentation markdown files – we just need a way to search them.

Implementation Plan

The plan is to use a NodeJS CLI as a proof of concept which will:

Scrape the top GitHub public repositories and grabs the README markdown file (which will ultimately represent our internal projects)
Stores the documentation files into Algolia alongside the project’s basic information (title, programming language(s), tags, etc)

The CLI will contain advanced logging and command line arguments for ease of use. We also want to host it on the web so people can try it out.

Challenge

The biggest challenge at hand is the record size — Algolia only allows our records to be 100KB at the most. However, most markdown documentation files are much larger. The solution is that we need to split up the markdown files into multiple pieces within the Index. We also then need to make sure that when we search for something, a single project only appears once—even though it is split into multiple records.

Fortunately, Algolia has a distinct feature so we can de-duplicate these results very easily.

Indexer Implementation

To make using the Indexer as easy as possible, I opted to create a CLI for it as mentioned above. After supplying the arguments required, the tool will automatically initialize the repository, remove any existing records, and configure the relevancy settings.

Powering the tool is a straightforward GitHub API that grabs the requested number of top repositories and extracts all of their metadata and downloads the README file. It’ll also filter out repositories that are missing an owner or missing a README file, giving us the best results. Finally, it’ll also convert the markdown content to HTML for easier rendering in the frontend.

To keep the record size in check, the tool will automatically split up READMEs over 50k characters into additional records. This way the records won’t get too large but one record should still serve almost all repositories. It’ll then sync this information over to Algolia 100 records at a time so we can meet their batching recommendations as described here in the documentation.

Frontend Implementation

To serve as my starting point, I took advantage of the create-instantsearch-app library released by Algolia and launched an InstantSearch.js boilerplate. From here, I was able to add additional widgets provided by InstantSearch.js such as the pagination and the page size selector which worked great.

As we also collected repository metadata with the markdown, we also needed to customize the hits component to include this additional information. Often times the metadata is as important as the library description so developers can at a glance see if it’s a popular library, who released it, the tags, and much more. I also added facets so Users could filter by the programming language, tags, or how many times it has been forked.

The final piece to the puzzle was adding the ‘Open documentation’ button allowing you to quickly and easily read the markdown content for the repository in a pop-up without leaving the application. If the record we are clicking on has multiple rows, it’ll automatically load the additional records and concatenate them in the display—awesome!

Conclusion

This project was a fun test and really showed me how flexible Algolia is for different use cases such as ours. The ready-to-go widgets saved me a lot of time during prototyping and having relevant results from the first few keystrokes is super impressive. I also think it would be super interesting if we were able to harness the power of Algolia Recommend if we were able to generate enough events from Users clicking on projects internally.

You can view a live demo of the GitHub test project here, with a button that will set you up with the default demo credentials to view our index. Interested in the backend indexing code? You can find that here on GitHub!

We hope that you enjoyed this in-depth article from Soma! If you are looking for more content like this, we have many more topics that we’ve covered on the Algolia Engineering Blog! If you’re new to Algolia, you can try it out by signing up for a free tier account.