Solving the Million Songs Challenge


Recently, Algolia announced a new pricing plan that gives developers access to every part of Algolia for free! On top of the feature access, you’ll also get a million records, which is a hundred times more than you could previously use on the free plan. Let’s try to build something that makes use of a bunch of the things we can now do at no cost!

One thing that immediately sprang to mind is the old Million Song Dataset Challenge, which asks contestants to train an AI recommendation model on the Million Song Dataset using the listening history of a million users. Let’s try to solve this challenge with Algolia! It breaks down into a few steps:

Consume the Million Song Dataset as an Algolia index.
This part involved creating a little Python script to turn the existing dataset into a JSON file, with each record looking like this:

{ "artist": "Johnny Cash", "title": "Ring Of Fire", "objectID": "SOWJPWJ12D021B20C2" }
I also had the script filter out some of the more profane titles, which is why we aren’t quite hitting our 1,000,000-record limit. Each objectID is the song ID from the original dataset, which we’ll be using in later steps.
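For the curious, here’s a minimal sketch of what that conversion step might look like. It isn’t the script I actually ran. It assumes the input is the dataset’s track list with one `track_id<SEP>song_id<SEP>artist<SEP>title` entry per line, and the `BLOCKED_WORDS` set, the `parse_tracks` helper, and the sample track IDs are all hypothetical placeholders.

```python
import json

# Hypothetical blocklist: a stand-in for whatever profanity filter you prefer.
BLOCKED_WORDS = {"badword"}


def parse_tracks(lines, blocked_words=BLOCKED_WORDS):
    """Turn '<SEP>'-delimited track lines into Algolia-style records."""
    records = []
    seen_ids = set()
    for line in lines:
        parts = line.rstrip("\n").split("<SEP>")
        if len(parts) != 4:
            continue  # skip malformed lines
        _track_id, song_id, artist, title = parts
        if song_id in seen_ids:
            continue  # objectID has to be unique within an index
        if any(word in blocked_words for word in title.lower().split()):
            continue  # drop titles containing a blocked word
        seen_ids.add(song_id)
        records.append({"artist": artist, "title": title, "objectID": song_id})
    return records


# Two made-up input lines; only the first should survive the filter.
sample = [
    "TRAAAAA000000000001<SEP>SOWJPWJ12D021B20C2<SEP>Johnny Cash<SEP>Ring Of Fire\n",
    "TRAAAAA000000000002<SEP>SOAAAAA00000000002<SEP>Some Artist<SEP>A Badword Song\n",
]
records = parse_tracks(sample)
print(json.dumps(records, indent=2))
```

Writing `records` out with `json.dump` gives you a file you can upload to the index from the dashboard.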
Create a UI for the index so we can easily search through the songs.
Algolia makes this part easy too — under the UI Demos tab in the index page, we can create a preformatted demo.

We’re not building this into a production application, so this’ll skip over a bunch of unnecessary work. This kind of demo is perfect for prototypes and pitches, since it demonstrates almost everything you can do with the frontend, but it strips out all of the styling and layout complexity that you’d have to handle if you built the demo yourself. It also requires absolutely no coding knowledge at all, so non-techy managers and executives can toy around with it without causing any problems for the engineers.
Format the listening history data and train a Recommendations model with it.
Here’s where the fun part starts! I whipped up another quick Python script to convert the .txt file included in the challenge into a .csv file to pretrain the Algolia model with. Here’s the script if you’re curious:

```python
import random
import time

output = [
    "userToken,timestamp,objectID,eventType,eventName"  # the header line
]
count = 0.0
opening_timestamp = int(time.time() * 1000) - 1  # ms
timerange = 7689600000  # ms, this is 89 days

with open("kaggle_visible_evaluation_triplets.txt") as f:
    lines = f.readlines()

for triplet_str in lines:
    # Each input line is "<user ID>\t<song ID>\t<play count>"
    triplet = triplet_str.strip().split("\t")
    # Emit one event row per play
    for _ in range(int(triplet[2])):
        output.append(",".join([
            triplet[0],  # userToken
            str(random.randint(opening_timestamp - timerange, opening_timestamp)),
            triplet[1],  # objectID
            "conversion",
            "Song Played",
        ]))
    count += 1.0
    print(str(count / len(lines) * 100) + "% done")

with open("events.csv", "w") as f:
    f.write("\n".join(output))
```
This creates a line in the CSV file for every time a song has been played by some user. Each line gets a random timestamp from the last 89 days so that all of the events end up affecting the model. The script also maps the user ID from the input data directly to the userToken field in the CSV, so Algolia knows which songs have been played by the same person. The implication is that someone who has listened to one song from a user’s batch should be more likely to be recommended the rest of that batch.
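To make that concrete, here’s a small sketch of what happens to a single triplet. The `triplet_to_rows` helper and the `user123` ID are hypothetical, and the clock is pinned to a fixed value so the example is repeatable. The logic is meant to mirror the script: one row per play, each with a random timestamp inside the 89-day window.

```python
import random


def triplet_to_rows(triplet_str, now_ms, window_ms, rng):
    """Expand one 'user<TAB>song<TAB>play_count' line into CSV event rows."""
    user_id, song_id, play_count = triplet_str.rstrip("\n").split("\t")
    return [
        ",".join([
            user_id,
            str(rng.randint(now_ms - window_ms, now_ms)),
            song_id,
            "conversion",
            "Song Played",
        ])
        for _ in range(int(play_count))
    ]


now_ms = 1_700_000_000_000            # fixed "now" in ms, for reproducibility only
window_ms = 89 * 24 * 60 * 60 * 1000  # 89 days in ms, i.e. 7689600000

# A made-up triplet: this user played the song three times.
rows = triplet_to_rows(
    "user123\tSOWJPWJ12D021B20C2\t3\n", now_ms, window_ms, random.Random(0)
)
for row in rows:
    print(row)
```

A play count of 3 turns into three rows that share the same userToken and objectID and differ only in their timestamps.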

Then in the Algolia dashboard, on the Recommendations page, we can set up a Complementary recommendations implementation and select the index we created earlier.

Once we upload the CSV we created, we can start training the model! It’ll take a little while, especially since we have so many records, but once it’s done, it’ll have trained an artificial intelligence to correlate certain records with others just by their shared listening history. Assuming two songs share some quality that makes one more interesting to someone who has listened to the other, the AI should theoretically (if given enough data) start to home in on that quality and just know what kind of music you’d like given what you’ve enjoyed in the past.
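If you want a feel for the intuition, here’s a toy sketch that just counts how often two songs show up in the same user’s history. To be clear, this is not how Algolia’s model works under the hood. It’s a deliberately naive co-occurrence counter, and the `co_occurrence` and `recommend` helpers and the tiny `history` dictionary are made up for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations


def co_occurrence(history):
    """Count, for each song, how many users also played each other song.

    history maps a user ID to the set of song IDs that user has played.
    """
    counts = defaultdict(Counter)
    for songs in history.values():
        for a, b in combinations(sorted(songs), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def recommend(counts, song_id, k=3):
    """Return up to k songs most often played by the same users as song_id."""
    return [song for song, _ in counts[song_id].most_common(k)]


history = {
    "user_a": {"song_1", "song_2", "song_3"},
    "user_b": {"song_1", "song_2"},
    "user_c": {"song_2", "song_4"},
}
counts = co_occurrence(history)
print(recommend(counts, "song_1"))  # ['song_2', 'song_3']
```

Two of the three listeners share song_1 and song_2, so song_2 tops the list for song_1. A real model has far more to work with, but the signal it starts from is the same shared listening history.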

Before April 2023, this project could have run you hundreds of dollars per month just to keep your index filled with all that data, let alone the costs for all the Recommend requests you’d probably receive on an application that’s large and busy enough to require a million records. Now it’s all free! Development with Algolia used to be restricted by paywalls, but now the paywall is where it should be — between development and production. As a general rule of thumb, you’ll only start paying for Algolia when it starts earning you revenue. We know you’ll make great use of this to build some amazing projects — make sure to give us a shout if you’d like to show off your handiwork!
