Recently, Algolia announced a new pricing plan that gives developers access to every part of Algolia for free! In addition to the feature access though, you’ll also get a million records for free! That’s a hundred times more than you could previously use on the free plan. Let’s try to build something that’ll make use of a bunch of the things we can now do for free!
One thing that immediately sprang to mind is the old Million Song Dataset Challenge, which asks contestants to train an AI recommendation model on the Million Song Dataset using the listening histories of a million users. Let’s try to solve this challenge with Algolia! It breaks down into a few steps, the first of which is getting the song metadata into an Algolia index. Once it’s cleaned up and uploaded, each record looks something like this:
{
"artist": "Johnny Cash",
"title": "Ring Of Fire",
"objectID": "SOWJPWJ12D021B20C2"
}
I had the script filter out some of the more profane titles as well, so that’s why we aren’t quite hitting our 1,000,000 record limit. Each objectID is the song ID from the original dataset, which we’ll be using in later steps.
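For reference, records shaped like this can be pushed up with one of Algolia’s API clients. Here’s a minimal sketch in Python — the index name, credentials, and `to_record` helper are illustrative assumptions, not the exact script used for this project:

```python
# Minimal sketch of building and uploading song records.
# The index name, credentials, and to_record helper are assumptions for
# illustration; the real script also filtered out profane titles.

def to_record(song_id, artist, title):
    # objectID is the song ID from the original dataset
    return {"artist": artist, "title": title, "objectID": song_id}

records = [to_record("SOWJPWJ12D021B20C2", "Johnny Cash", "Ring Of Fire")]

# With the official client (pip install algoliasearch), the upload itself
# would look roughly like:
#   from algoliasearch.search_client import SearchClient
#   client = SearchClient.create("YourAppID", "YourAdminAPIKey")
#   client.init_index("songs").save_objects(records)

print(records[0]["objectID"])  # -> SOWJPWJ12D021B20C2
```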
Next, we need to convert the listening-history .txt file included in the challenge into a .csv file to pretrain the Algolia model with. Here’s the script if you’re curious:
import time, random

output = [
    'userToken,timestamp,objectID,eventType,eventName' # the header line
]
count = 0.0
opening_timestamp = int(time.time() * 1000) - 1 # ms
timerange = 7689600000 # ms, this is 89 days

with open("kaggle_visible_evaluation_triplets.txt") as f:
    lines = f.readlines()
    for triplet_str in lines:
        # each line is "userID \t songID \t play count"
        triplet = triplet_str.split("\t")
        # one event per recorded play
        for _ in range( int( triplet[2] ) ):
            output.append(
                ",".join([
                    triplet[0], # userToken
                    str( random.randint(opening_timestamp - timerange, opening_timestamp) ),
                    triplet[1], # objectID
                    "conversion",
                    "Song Played"
                ])
            )
        count += 1.0
        print(str(count / len(lines) * 100) + "% done")

with open("events.csv", "w") as f:
    f.write( "\n".join( output ) )
This creates a line in the CSV file for every time a song has been played by some user. It assigns each event a random timestamp from the last 89 days so that all of the events fall within the window the model actually considers. It also maps the user ID from the input data directly to the userToken field in the CSV, so Algolia knows which songs have been played by the same person; if lots of users have listened to the same batch of songs, someone who has played one of them should be more likely to be recommended the rest.
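To make that transformation concrete, here’s a small worked example that expands a single triplet the same way the script above does (the user ID and timestamps here are made up for illustration):

```python
import random

def expand_triplet(triplet_str, opening_timestamp, timerange):
    """Expand one 'userID \t songID \t play count' line into CSV event rows,
    mirroring the conversion script above."""
    user, song, plays = triplet_str.split("\t")
    rows = []
    for _ in range(int(plays)):
        # random timestamp within the event window, in milliseconds
        ts = random.randint(opening_timestamp - timerange, opening_timestamp)
        rows.append(",".join([user, str(ts), song, "conversion", "Song Played"]))
    return rows

rows = expand_triplet("user123\tSOWJPWJ12D021B20C2\t3\n",
                      opening_timestamp=1_700_000_000_000,  # example value
                      timerange=7_689_600_000)              # 89 days in ms
print(len(rows))  # -> 3: one event row per recorded play
```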
Then in the Algolia dashboard, under the Recommendations page, we can set up a Complementary recommendations implementation. We’ll select the right index:
Then once you upload the CSV we created, you can start to train the model! It’ll take a little while, especially since we have so many records, but once it’s done, it’ll have trained an artificial intelligence to correlate certain records with others just by their shared listening history. Assuming there’s some quality shared between two songs that makes one more interesting to someone who has listened to the other, the AI should theoretically (if given enough data) home in on that quality and just know what kind of music you’d like given what you’ve enjoyed in the past.
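Once training finishes, the model can be queried through the Recommend API. Here’s a hedged sketch of what that request might look like from Python — the index name, credentials, and exact model identifier are assumptions, so check your own dashboard for the real values:

```python
# Sketch: asking the trained model for songs that go with one objectID.
# "songs" and the credentials are placeholders; "bought-together" is the
# identifier Algolia Recommend uses for its complementary model at the
# time of writing -- verify it against your dashboard.

query = {
    "indexName": "songs",                # assumed index name
    "model": "bought-together",          # complementary recommendations
    "objectID": "SOWJPWJ12D021B20C2",    # "Ring Of Fire" from earlier
    "maxRecommendations": 5,
}

# With the official client, the request itself would look roughly like:
#   from algoliasearch.recommend_client import RecommendClient
#   client = RecommendClient.create("YourAppID", "YourSearchAPIKey")
#   hits = client.get_recommendations([query])["results"][0]["hits"]

print(query["objectID"])  # -> SOWJPWJ12D021B20C2
```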
Before April 2023, this project could have run you hundreds of dollars per month just to keep your index filled with all that data, let alone the costs for all the Recommend requests you’d probably receive on an application that’s large and busy enough to require a million records. Now it’s all free! Development with Algolia used to be restricted by paywalls, but now the paywall is where it should be — between development and production. As a general rule of thumb, you’ll only start paying for Algolia when it starts earning you revenue. We know you’ll make great use of this to build some amazing projects — make sure to give us a shout if you’d like to show off your handiwork!
Jaden Baptista
Freelance Writer at Authors Collective