Building a multi-lingual rhyming dictionary with Wiktionary and Algolia – Part 1

Bom dia! I’m Jaden, and I’m learning Portuguese. I’m loving it so far — its simple, logical rules mesh well with my developer brain.

Here’s the thing though: I just love writing songs and poetry, and it irks me that I don’t have enough of a command of Portuguese to pursue that hobby in my new language. I’m getting the hang of the grammar, and meter isn’t too difficult (several of those logical rules make this part fairly predictable), but rhyming — it feels almost impossible. Linguistic theory says this shouldn’t be too difficult, but I’m definitely struggling with it. Luckily, this should be a solvable problem if we augment my limited abilities with modern technology.

This blog post is in two parts. You are currently reading part one, where we ingest data and configure our index. In part two, we’ll do some further configuration and build the frontend.

A little background here: I often find myself on Wiktionary (Wikipedia’s sister dictionary project), where you can look up a word and get entries for it in tons of languages. For example, for the word avó, Wiktionary’s page gives us the entry in Galician (where it means “grandfather”) and in Portuguese (where it means “grandmother”). Here’s a screenshot of the Portuguese section:

Notice the subsections here:

  1. Etymology: how the word came to be. This is helpful for those of us who like to geek out over linguistics, but it’s otherwise irrelevant to our use case.
  2. Pronunciation: how the word is said. This is our target data for our rhyme-maker app.
  3. Definitions: what the word means. It would probably be helpful for us to display these alongside the word in our app.
  4. Other terms: words related to this one. These are great for language learners, but again, somewhat irrelevant to our use case.

So let’s use a Python library to turn a simple list of Portuguese words into an array of objects containing pronunciations and definitions. I found Suyash458’s amazing WiktionaryParser on GitHub, which turned out to be just the tool I needed. I Googled for a complete list of Portuguese dictionary words and whipped up a quick Python script to run them all through a slightly customized version of WiktionaryParser. I ran it on 10 different threads so it would output the JSON of the words’ definitions and pronunciations faster.
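
The original script isn’t reproduced here, so what follows is only a minimal sketch of the idea, not the author’s code. It assumes the parser returns a list of entries shaped like WiktionaryParser’s documented output (a `definitions` list with `partOfSpeech` and `text`, and a `pronunciations` dict with a `text` list). The names `build_record`, `build_records`, and `wiktionary_fetch` are hypothetical, and the customizations mentioned above are not included.

```python
from concurrent.futures import ThreadPoolExecutor


def build_record(word, fetch):
    """Look up one word and keep only its definitions and first pronunciation."""
    entries = fetch(word)  # assumed shape: a list of dicts, one per etymology
    definitions = []
    pronunciation = None
    for entry in entries:
        for definition in entry.get("definitions", []):
            # Join the headline and senses into one string per part of speech
            text = "; ".join(definition.get("text", []))
            definitions.append(f"{definition.get('partOfSpeech', '')}: {text}")
        texts = entry.get("pronunciations", {}).get("text", [])
        if pronunciation is None and texts:
            pronunciation = texts[0]
    return {"word": word, "definitions": definitions, "pronunciation": pronunciation}


def build_records(words, fetch, workers=10):
    """Run build_record over every word on a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input words
        return list(pool.map(lambda w: build_record(w, fetch), words))


def wiktionary_fetch(word):
    """Assumed real fetcher: needs `pip install wiktionaryparser` and network access."""
    from wiktionaryparser import WiktionaryParser  # imported lazily on purpose

    # A fresh parser per call, so no parser state is shared between threads
    return WiktionaryParser().fetch(word, "portuguese")
```

Calling `build_records(words, wiktionary_fetch)` on the word list would then give a list of dicts ready to dump with `json.dumps(..., ensure_ascii=False)`. Passing the fetcher in as an argument keeps the threading logic testable without hitting Wiktionary.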

This particular list contains all sorts of different word variants, which is exactly what we’re looking for in a rhyme-maker app.

These processes are spitting out JSON that is essentially already an Algolia index. Here’s a sample of that JSON:

{
	"word": "abacaxi",
	"definitions": [
		"noun: abacaxi (plural abacaxis); (Brazil) pineapple (fruit); (Brazil) pineapple (plant); (Portugal) a certain fragrant, sweet cultivar of pineapple; (Brazil, slang) a difficult situation; (Brazil, military slang) pineapple (hand grenade)"
	],
	"pronunciation": "a.ba.kaˈʃi",
	"ultimate_syllable": "ʃi",
	"penultimate_syllable": "ka",
	"antepenultimate_syllable": "ba",
	"stress_from_end": 1
},
...
{
	"word": "abono",
	"definitions": [
		"noun: abono (plural abonos); guarantee; benefit",
		"verb: abono; first-person singular present indicative of abonar"
	],
	"pronunciation": "ɐˈbo.nu",
	"ultimate_syllable": "nu",
	"penultimate_syllable": "bo",
	"antepenultimate_syllable": "ɐ",
	"stress_from_end": 2
},
...
{
	"word": "absurdo",
	"definitions": [
		"adjective: absurdo (feminine absurda, masculine plural absurdos, feminine plural absurdas, comparable, comparative mais absurdo, superlative o mais absurdo or absurdíssimo); absurd",
		"noun: absurdo (plural absurdos); absurdity"
	],
	"pronunciation": "abˈsuʁ.du",
	"ultimate_syllable": "du",
	"penultimate_syllable": "suʁ",
	"antepenultimate_syllable": "ab",
	"stress_from_end": 2
},
...

Each word from the bland list of dictionary words was turned into an object chock-full of definições (definitions) and detalhes de pronúncia (pronunciation information). Uploading this as an Algolia index gives us (nearly immediately) a searchable database of thousands of Portuguese words with enough detail to build a rhyming dictionary.
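
The syllable fields in the sample records look like they can be derived straight from the pronunciation string. Here is a small sketch of one way to do that, assuming syllables are separated by `.` and that the primary stress mark `ˈ` sits directly before the stressed syllable (which is how the three samples above are written). The function name `syllable_fields` is hypothetical, and this is not necessarily how the original script computed them.

```python
import re

PRIMARY_STRESS = "ˈ"  # IPA primary stress mark (U+02C8)


def syllable_fields(pronunciation):
    """Derive the last three syllables and the stress position from an IPA string."""
    # Dots and stress marks both act as syllable boundaries
    syllables = [s for s in re.split(r"[.ˈˌ]", pronunciation) if s]

    # Count the syllables from the stress mark to the end of the word
    stress_from_end = None  # left as None when no stress mark is present
    if PRIMARY_STRESS in pronunciation:
        tail = pronunciation.rsplit(PRIMARY_STRESS, 1)[1]
        stress_from_end = len([s for s in re.split(r"[.ˌ]", tail) if s])

    def from_end(n):
        # Shorter words simply have no antepenultimate (or penultimate) syllable
        return syllables[-n] if len(syllables) >= n else None

    return {
        "ultimate_syllable": from_end(1),
        "penultimate_syllable": from_end(2),
        "antepenultimate_syllable": from_end(3),
        "stress_from_end": stress_from_end,
    }
```

For example, `syllable_fields("ɐˈbo.nu")` gives `nu`, `bo`, and `ɐ` as the last three syllables with `stress_from_end` equal to 2, matching the abono record above.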

Before we can use this index, we’ll need to make a couple of tweaks to our index settings. Assuming you’ve already uploaded the JSON, you should be able to make these few changes under the Configuration tab in the index:

  1. We actually have the chance here to let our users both search for rhyming words and search through the results. The former is going to be powered by Algolia’s facets (more on that in part 2 of this article), but to enable the latter, we’ll go under Searchable Attributes and add the word attribute as searchable.
  2. We want our results returned in alphabetical order, so we’ll add the word attribute under Ranking and Sorting, set to sort in ascending order. Use the handles at the left to drag that new attribute to the top so it’ll take precedence over ranking by anything else.
  3. Under Typo-tolerance, we’ll want to turn all typo correction off. We’re building a single-word search index here, so the likelihood of a typo occurring is far outweighed by the likelihood of Algolia’s internal typo-fixing logic screwing with our custom logic.
  4. Under the Language section, we’ll need to add Portuguese so that it knows to parse the inputs correctly.
  5. In Portuguese, words can be differentiated simply by diacritics. I kept running into this bug where it would confuse abandonara (which means “maybe he had abandoned”) and abandonará (which means “he definitely will abandon”), which differ only by the little mark over that last a. Turns out, this isn’t a bug: Algolia does the smart thing for English and ignores diacritic marks (so that, for example, pedantically searching for “naïve” matches results that spell it as “naive” without the diacritic, like most English speakers do). You can override this behavior though (as we’ll need to for Portuguese, since different rules apply) in the Special characters section of your index config. Just add the diacritics you don’t want stripped in the second input, which for me was all the diacritics used in Portuguese (àáâãéêíîóôõúû).
  6. As I mentioned earlier, the actual rhyming logic is going to be handled through facets. We’ll get into more of how this works in part 2 of this article (though perhaps you can start to theorize based on how the query result objects are structured), but for now, let’s mark some attributes as facet-able under the Facets section.
  7. For the same logic that we’ll set up in part 2, we’ll need to let those facets be “displayed” in the Facet display section. They won’t actually show up to the user in the final product, but this crucial step surfaces the facet data to our frontend where the logic happens.
  8. Under the Pagination section, we’ll want to set hitsPerPage to something more reasonable for our application (50 looks about right). The same section also has a cap on how many hits can be reached through pagination (the paginationLimitedTo setting). We don’t care about somebody being able to scrape our entire dataset (it’s publicly available data from Wiktionary anyway), and performance is still stellar with a high number, so we’ll just set that cap higher than the result count any particular search could summon. Since there are about 14k words in my dataset, 10k seems like a good number here.
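
For anyone who would rather script these steps than click through the dashboard, here is a rough sketch of the same configuration as a settings object. The setting names are my best mapping from the dashboard sections to Algolia’s index settings, so treat them as assumptions and check them against the settings reference. The list of facet attributes is a guess (the steps above don’t name them), including both index and query languages for step 4 is also an assumption, and step 7 (facet display) is left out entirely.

```python
# Assumption: the syllable fields from the sample records are the facets.
SYLLABLE_FACETS = [
    "ultimate_syllable",
    "penultimate_syllable",
    "antepenultimate_syllable",
    "stress_from_end",
]

INDEX_SETTINGS = {
    # Step 1: let users search through results by the word itself
    "searchableAttributes": ["word"],
    # Step 2: alphabetical sort first, then Algolia's default ranking criteria
    "ranking": [
        "asc(word)",
        "typo", "geo", "words", "filters",
        "proximity", "attribute", "exact", "custom",
    ],
    # Step 3: no typo correction
    "typoTolerance": False,
    # Step 4: Portuguese language handling (assumed to cover both settings)
    "indexLanguages": ["pt"],
    "queryLanguages": ["pt"],
    # Step 5: characters whose diacritics should be kept
    "keepDiacriticsOnCharacters": "àáâãéêíîóôõúû",
    # Step 6: attributes available for faceting (assumed list, see above)
    "attributesForFaceting": SYLLABLE_FACETS,
    # Step 8: page size and the pagination cap
    "hitsPerPage": 50,
    "paginationLimitedTo": 10000,
}
```

A dictionary like this is what the Algolia API clients accept in their set-settings call. The exact call differs between client versions, so it isn’t shown here.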

That’s it! In this article, we’ve built a dataset of thousands of Portuguese words, tagged them with their pronunciation, and given Algolia the tools it needs to search through them by how they rhyme with each other. That’s already a big accomplishment of data engineering, something that would have been nearly impossible years ago without tools like Wiktionary or Algolia. On top of that, we’ve kept the process general, so we could easily adapt it for any language.

In the next article in this series, we take this a step further and build a frontend to make use of this dataset. That frontend will leverage Algolia’s facet system to build out a rhyming logic while still allowing the user to search through the results, and we’ll make this frontend still work with whatever underlying language dataset we want to use.

Have fun playing around with this! Shoot us a message on Discord at Algolia if you’d like to share your ideas and suggestions.

About the author: Jaden Baptista, Technical Writer
