We invited our friends at Starschema to write about an example of using Algolia in combination with MongoDB. We hope that you enjoy this four-part series by Full Stack Engineer Soma Osvay.
If you’d like to look back or skip ahead, here are the other links:
Part 1 – Use-case, architecture, and current challenges
Part 2 – Proposed solution and design
Part 4 – Frontend implementation and conclusion
Just a note before I get started: you can follow along with the implementation here.
In the last article, we analyzed our data pipeline architecture and left an open question about how we would run the Python scripts that load the Algolia index, listing three options.
After discussions with our engineering team, I decided to go with the first option, because we already have an established, sophisticated way of running our current data preparation pipeline, with many existing scripts to clean, aggregate, and format our data before loading it into our database. Adding an extra script here won't take much effort, and all the maintenance and monitoring tools are readily available. Once the architecture was decided, I chose to write a single script that both performs the initial data load into Algolia and keeps the index up-to-date, instead of one script for each of those actions.
Thankfully, Algolia supports this kind of use-case by exposing a `replace_all_objects` method that creates a new temporary index first and then swaps it with the live one once it's done building. That makes for a near-instant transition between the old and the refreshed index, without any downtime or data inconsistency.
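To make that mechanism concrete, here is a toy, in-memory simulation of the swap, with plain dictionaries standing in for Algolia indices (no client calls involved; the index names are made up):

```python
# Toy simulation of a zero-downtime index swap: build the replacement in full,
# then repoint the live name at it in a single step.
indexes = {"listings": {"old_id": {"name": "stale record"}}}

# 1. Build a temporary index with the fresh data; searches still hit "listings"
indexes["listings_tmp"] = {"1": {"name": "fresh record"}}

# 2. Atomic move: "listings" now refers to the fully built index
indexes["listings"] = indexes.pop("listings_tmp")

print(indexes["listings"]["1"]["name"])  # fresh record
```

Searches never see a half-built index: they read the old data until the move, and the fresh data immediately after.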
Before implementing my Python script, I had to register for a free Algolia account and create a sample dataset in MongoDB Atlas that I could use to fill my index. I chose the default AirBnB dataset that comes with Atlas out-of-the-box, because its format and use-case are very similar to my real-life data. I also made the sample dataset publicly hosted for anybody who is following along or would like to experiment:

- Host: `algolialistingstest.vswcm0y.mongodb.net`
- User: `ReadOnly`
- Password: `AlgoliaTest`
- Database: `sample_airbnb`
- Collection: `listingsAndReviews`
I decided to implement the script using a Jupyter Notebook, because it lets me test pieces of my code independently, annotate my code with Markdown, model the data structure iteratively, and easily export the finished Python code as a script file. It's very versatile and interactive, and I generally love using it. I'm hosting it on Google Colab, so I can share the code easily without anybody having to install a local Jupyter environment. You can find the implemented script here. We're using the script to connect to Algolia, read the sample data from MongoDB, transform the records, configure the index, and load the data into it.
The first step is generating an Admin API key in the Algolia dashboard.
We’ll need to install the Algolia Python client first, but afterwards, here’s what our connection code looks like:
```python
# The Application ID of your Algolia Application
algolia_app_id = "[your_algolia_app_id_here]"
# The Admin API Key of your Algolia Application
algolia_admin_key = "[your_algolia_admin_key_here]"

# Define the Algolia Client and Index that we will use for this test
from algoliasearch.search_client import SearchClient

algolia_client = SearchClient.create(algolia_app_id, algolia_admin_key)
algolia_index = algolia_client.init_index("test_index")
```
```python
# Test the index that we just created. We wrap this in a function because these
# variables are not needed later
def test_algolia_index(index):
    # Clear the index, in case it contains any records
    index.clear_objects()
    # Create a sample record
    record = {"objectID": 1, "name": "test_record"}
    # Save it to the index
    index.save_object(record).wait()
    # Search the index for 'test_record'
    search = index.search("test_record")
    # Clear all items again to remove our test record
    index.clear_objects()
    # Verify that the first hit is our object (Algolia stores objectID as a string)
    if len(search["hits"]) == 1 and search["hits"][0]["objectID"] == "1":
        print("Algolia index test successful")
    else:
        raise Exception("Algolia test failed")

# Call our test function
test_algolia_index(algolia_index)
```
First, install PyMongo, the Python MongoDB client, then use this code to connect to our sample MongoDB database and read the sample data. Note that we only fetch 5,000 items so that we don't overwhelm our free-tier usage:
```python
# Define MongoDB connection parameters
# Change these values if you are running your own MongoDB instance
db_host = "algolialistingstest.vswcm0y.mongodb.net"
db_name = "sample_airbnb"
db_user = "ReadOnly"
db_password = "AlgoliaTest"
collection_name = "listingsAndReviews"
connection_string = f"mongodb+srv://{db_user}:{db_password}@{db_host}/{db_name}?retryWrites=true&w=majority"

# Connect to MongoDB and get the MongoDB Database and Collection instances
from pymongo import MongoClient

# Create MongoDB Client
mongo_client = MongoClient(connection_string)
# Get database instance
mongo_database = mongo_client[db_name]
# Get collection instance
mongo_collection = mongo_database[collection_name]

# Retrieve the first 5000 records from the collection
initial_items = list(mongo_collection.find().limit(5000))
```
The objects in our MongoDB sample dataset contain many attributes, some of which are irrelevant to our Algolia index. We only keep those that are required either for searching or ranking:

- The `_id` property will be kept, as it will serve as the Algolia object ID as well.
- The following attributes will be kept as-is: `name`, `space`, `description`, `neighborhood_overview`, `transit`, `property_type`, `address`, `accommodates`, `bedrooms`, `beds`, `number_of_reviews`, `bathrooms`, `price`, `weekly_price`, `security_deposit`, `cleaning_fee`, and `images`.
- The `review_scores` attribute on the Airbnb entry will be transformed into a `scores` property, which will contain the number of stars given to the listing.
- A `_geoloc` property will be added to the object based on fields in the original `address` object. This will be used for geosearching.
- The following attributes will be dropped: `summary`, `listings_url`, `notes`, `access`, `interaction`, `house_rules`, `room_type`, `bed_type`, `minimum_nights`, `maximum_nights`, `cancellation_policy`, `last_scraped`, `calendar_last_scraped`, `first_review`, `last_review`, `amenities`, `extra_people`, `guests_included`, `host`, `availability`, `review_scores`, and `reviews`.

Here is the transformation code:
```python
# First, define a helper that truncates long texts. Some sample records have very
# long descriptions, which are irrelevant to our use-case and take up a lot of
# space to display, so we keep the first 350 characters and cut back to the last
# full stop (.)
def strip_long_text(obj, trailWithDot):
    if isinstance(obj, str):
        ret = obj[:350].rsplit(".", 1)[0]
        if trailWithDot and len(ret) > 0 and not ret.endswith("."):
            ret += "."
        return ret
    else:
        return obj

# Next, a function to validate number values coming from MongoDB. MongoDB stores
# numbers in Decimal128 format, which Algolia only accepts as a string. This function:
# 1. Converts numbers from Decimal128 to float
# 2. Caps each number at a maximum value, as some values in MongoDB are outliers
#    (incorrectly filled out) and would give range filters an unrealistic max value
def validate_number(num, maxValue):
    if num is None:
        return num
    val = float(str(num))
    if val > maxValue:
        return maxValue
    return val

def prepare_algolia_object(mongo_object):
    # Create the Algolia object to index, and set its objectID from the mongo _id
    r = {}
    r["objectID"] = mongo_object["_id"]
    # Prepare the string attributes; the flag controls the trailing-dot handling
    for string_property in [
        ["name", True],
        ["space", True],
        ["description", True],
        ["neighborhood_overview", True],
        ["transit", True],
        ["address", False],
        ["property_type", False],
    ]:
        if string_property[0] in mongo_object:
            r[string_property[0]] = strip_long_text(
                mongo_object[string_property[0]], string_property[1]
            )
    # Prepare the numeric properties, each capped at a sensible maximum
    for num_property in [
        ["accommodates", 100],
        ["bedrooms", 20],
        ["beds", 100],
        ["number_of_reviews", 1000000],
        ["bathrooms", 100],
        ["price", 1000],
        ["weekly_price", 1000],
        ["security_deposit", 1000],
        ["cleaning_fee", 1000],
    ]:
        if num_property[0] in mongo_object:
            r[num_property[0]] = validate_number(
                mongo_object[num_property[0]], num_property[1]
            )
    # Set the star rating, if any (ratings are 0-100, so /20 gives 0-5 stars)
    if (
        "review_scores" in mongo_object
        and "review_scores_rating" in mongo_object["review_scores"]
    ):
        stars = round(mongo_object["review_scores"]["review_scores_rating"] / 20, 0)
        r["scores"] = {
            "stars": stars,
            "has_one": stars >= 1,
            "has_two": stars >= 2,
            "has_three": stars >= 3,
            "has_four": stars >= 4,
            "has_five": stars >= 5,
        }
    # Set images
    if "images" in mongo_object:
        r["images"] = mongo_object["images"]
    # Set the geolocation, if any (GeoJSON coordinates are [lng, lat])
    if "address" in mongo_object:
        if "location" in mongo_object["address"]:
            if mongo_object["address"]["location"]["type"] == "Point":
                r["_geoloc"] = {
                    "lng": mongo_object["address"]["location"]["coordinates"][0],
                    "lat": mongo_object["address"]["location"]["coordinates"][1],
                }
    return r
```
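As a quick sanity check of the two trickier mappings, here is a small self-contained example using a made-up listing fragment (the values below are hypothetical, not taken from the dataset):

```python
# Hypothetical listing fragment shaped like the sample_airbnb documents
listing = {
    "review_scores": {"review_scores_rating": 92},
    "address": {"location": {"type": "Point", "coordinates": [-73.96, 40.79]}},
}

# Ratings in the dataset are on a 0-100 scale, so dividing by 20 gives 0-5 stars
stars = round(listing["review_scores"]["review_scores_rating"] / 20, 0)

# GeoJSON coordinates come as [longitude, latitude]; Algolia's _geoloc wants named keys
coords = listing["address"]["location"]["coordinates"]
geoloc = {"lng": coords[0], "lat": coords[1]}

print(stars)   # 5.0
print(geoloc)  # {'lng': -73.96, 'lat': 40.79}
```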
Now let's tell Algolia what to do with the properties we've given it. We'll set [attributesToRetrieve](https://www.algolia.com/doc/api-reference/api-parameters/attributesToRetrieve/), the attributes that Algolia will return per search result for display in our UI, to an array of these properties: `summary`, `description`, `space`, `neighborhood`, `transit`, `address`, `number_of_reviews`, `scores`, `price`, `cleaning_fee`, `property_type`, `accommodates`, `bedrooms`, `beds`, `bathrooms`, `security_deposit`, `images/picture_url`, and `_geoloc`. Our [attributesForFaceting](https://www.algolia.com/doc/api-reference/api-parameters/attributesForFaceting/) array will contain `property_type`, `address/country`, `scores/stars`, `price`, and `cleaning_fee`.
We'll also set [searchableAttributes](https://www.algolia.com/doc/api-reference/api-parameters/searchableAttributes/), the attributes that are considered when a query is processed. Algolia won't waste time looking outside of this list for potential search matches, which speeds up the query, and it lets us set the priority order from highest to lowest:

1. `name`, `address/street`, `address/suburb`
2. `address/market`, `address/country`
3. `description` (this will be an unordered attribute)
4. `space` (another unordered attribute)
5. `neighborhood_overview` (another unordered attribute)
6. `transit`
We will also update the default ranking logic for our index:

1. `geo` – providing nearby search results is the top priority for us
2. `typo`
3. `words`
4. `filters`
5. `proximity`
6. `attribute`
7. `exact`
8. `custom`
We're also updating our index to ignore plurals (which you might not think about much, but your users will definitely notice when search doesn't behave the way they expect). You can find other great settings and resources on the official Algolia API Reference page. Here's what our code for this looks like:
```python
algolia_index.set_settings(
    {
        "searchableAttributes": [
            "name,address.street,address.suburb",
            "address.market,address.country",
            "unordered(description)",
            "unordered(space)",
            "unordered(neighborhood_overview)",
            "transit",
        ],
        "attributesForFaceting": [
            "property_type",
            "searchable(address.country)",
            "scores.stars",
            "price",
            "cleaning_fee",
        ],
        "attributesToRetrieve": [
            "images.picture_url",
            "summary",
            "description",
            "space",
            "neighborhood",
            "transit",
            "address",
            "number_of_reviews",
            "scores",
            "price",
            "cleaning_fee",
            "property_type",
            "accommodates",
            "bedrooms",
            "beds",
            "bathrooms",
            "security_deposit",
            "_geoloc",
        ],
        "ranking": [
            "geo",
            "typo",
            "words",
            "filters",
            "proximity",
            "attribute",
            "exact",
            "custom",
        ],
        "ignorePlurals": True,
    }
)
```
This short piece of code loads the dataset into the Algolia index, replacing the existing index so there are no out-of-date records.
```python
# Prepare the Algolia objects
algolia_objects = list(map(prepare_algolia_object, initial_items))
# Atomically replace the whole index; "safe" makes the call wait for each batch
algolia_index.replace_all_objects(algolia_objects, {"safe": True}).wait()
```
Overall, I found that loading an Algolia index from Python is quite a straightforward task, even though my Python skills are a little rusty. Most of my time actually went into preparing the AirBnB listing objects and transforming them into the shape I wanted inside Algolia. This would probably have been much simpler if I had been working with our own datasets, as there wouldn't have been as much transformation needed.
I learned that Algolia exposes a wonderful Python API: it's simpler to use than I expected, and its great documentation guided me through the entire process step-by-step. The code required to prepare and load the index is minimal, and it felt intuitive to me. It also performed well when loading the index: it needed just under 5 seconds to load and replace the entire index of 5,000 records, even when run from a resource-limited, cloud-hosted server. When I ran it on one of our high-speed servers with a fast Internet connection, it took only about 2 seconds. Our production dataset is much larger (about 40k records), but our standard pipelines that prepare the listings data already run for over an hour every day, so I am confident that Algolia will not affect our overall performance. So far, its simplicity and speed have far outweighed any drawbacks.
In the first article of this series, I talked about our use-case, architecture and the search challenges we are facing.
In the second article of this series, I covered the design specification of the PoC and talked about the implementation possibilities.
In the fourth article of this series, I will implement a sample frontend so we can evaluate the product from the user’s perspective and give the developers a head-start if they choose to go with this option.
Soma Osvay
Full Stack Engineer, Starschema