Engineering

Generate a transcription index for your YouTube content using Whisper

The theme of our 2022 Algolia Developer Conference was “Index the world and put your data in motion” so naturally, as soon as the last video was uploaded to YouTube, talk turned to how we could put all of this great new content in motion for our customers.

I knew I wanted the videos to be searchable by title and description and discoverable by category, but I wanted to do more. I wanted to be able to help developers find the exact spot in the video that matched their search. This meant indexing the transcripts of the videos. We’ve tried to do this in the past using YouTube’s captioning capabilities with mixed results. Fortunately for us, at almost the exact same time, the team at OpenAI released a new neural network called Whisper for automatic speech recognition.

In the rest of this post, I'll describe the toolchain I used to build the A/VSearch CLI, a command-line tool for generating and indexing transcripts from a YouTube channel or playlist.

You can check out the results on our demo site here!

Machine Learning – Whisper

As mentioned above, OpenAI recently released Whisper, a general-purpose speech recognition model. It can recognize many different languages and can even translate between them. Since it is natively exposed in Python, it was the perfect candidate to team up with Algolia's Python API client. Overall, I was thoroughly impressed with the transcription quality: even the medium model transcribed our videos without a single mistake, aside from some specific technology names.

One big feature I personally wish Whisper had is speaker diarization, where the model identifies the different speakers in a recording and determines when each one spoke. Right now, you would have to manually clean up segments and assign them to speakers. It is possible to combine Whisper with another tool, like PyAnnote, to do this, which I'd like to add as a feature in the future. Whisper also appears to have a limitation when multiple languages are spoken in the same audio file, though this could improve over time.
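If you want to experiment with that combination yourself, here is a rough sketch of the pairing. This is not part of A/VSearch; the pretrained pipeline name and the midpoint-matching heuristic are my assumptions, and PyAnnote's gated models require a Hugging Face access token.

# A rough sketch: label each Whisper segment with the PyAnnote speaker
# whose turn contains the segment's midpoint. Illustrative only.
import whisper
from pyannote.audio import Pipeline

model = whisper.load_model("medium")
result = model.transcribe("talk.wav")

pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization")
diarization = pipeline("talk.wav")

def speaker_at(timestamp):
    # Find the diarization turn that contains this point in time
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        if turn.start <= timestamp <= turn.end:
            return speaker
    return None

for segment in result["segments"]:
    midpoint = (segment["start"] + segment["end"]) / 2
    print(speaker_at(midpoint), segment["text"])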

Because the segments Whisper produces are sometimes quite short, it can be hard to determine a segment's true context. To address this, we added a context field that contains the previous segment and the following one. This makes it clear what is being discussed in a particular segment, which should lead to a higher success rate when searching for a specific clip.
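To make that concrete, here is a minimal sketch of how the context field could be stitched together from consecutive segments. The record shape mirrors the example output shown later in this post, but the helper itself is illustrative.

# Illustrative: attach the neighboring segments to each record as context
def add_context(segments):
    records = []
    for i, segment in enumerate(segments):
        before = segments[i - 1] if i > 0 else None
        after = segments[i + 1] if i + 1 < len(segments) else None
        record = dict(segment)
        record["context"] = {
            "before": {"start": before["start"], "text": before["text"]}
            if before else {"start": segment["start"], "text": ""},
            "after": {"start": after["start"], "text": after["text"]}
            if after else {"start": segment["end"], "text": ""},
        }
        records.append(record)
    return records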

How I built it

Since Whisper requires an audio file to run transcription, I needed a way to take a YouTube video and convert it to audio. I chose YouTube-DL because it is well supported in Python and can download just the audio track of a video, saving me from having to convert any downloads before transcription. And since some users will want to run the program from the command line, I added the Click library to power a CLI interface.
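Stripped of the CLI layer and error handling, the heart of the pipeline looks something like the sketch below; the file names and options are simplified, not A/VSearch's actual code.

# Simplified sketch of the download-then-transcribe pipeline
import youtube_dl
import whisper

url = "https://www.youtube.com/watch?v=epSVL87_sqA"

# Grab only the audio track, so nothing needs converting before transcription
options = {"format": "bestaudio", "outtmpl": "audio.%(ext)s"}
with youtube_dl.YoutubeDL(options) as ydl:
    info = ydl.extract_info(url)

model = whisper.load_model("medium")
result = model.transcribe("audio." + info["ext"])

for segment in result["segments"]:
    print(segment["start"], segment["end"], segment["text"])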

Sometimes there are words (or company names) that Whisper can't detect, so I added support for supplying patterns to perform search-and-replace logic on the output. My colleague Chuck had a great idea to also add a categorization feature, where you provide keywords for A/VSearch to detect during transcription so it can automatically apply predefined categories. To use these features, you simply pass a JSON file with the patterns defined within, and A/VSearch will parse and apply them during processing.
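I won't reproduce the exact schema here, so treat the shape below as illustrative and check the repository for the real format.

# Illustrative only -- see the GitHub repository for the actual JSON schema
import json

config = {
    # Search/replace patterns for terms Whisper tends to mishear
    "patterns": [
        {"search": "algo Leah", "replace": "Algolia"},
    ],
    # Keywords that map matching segments to predefined categories
    "categories": [
        {"keywords": ["autocomplete", "instantsearch"], "category": "Front end"},
    ],
}

with open("config.json", "w") as f:
    json.dump(config, f, indent=2)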

How to use it

Using A/VSearch is super simple: you can download a release from GitHub and install it that way, or just use pip to install it via the GitHub URL, which pulls the latest release. Since it has a fully featured CLI, you can export your Algolia credentials as environment variables and get straight into the action. The CLI accepts URLs for playlists, channels, and individual videos, and writes the transcript records to the Algolia index name you provide.

Whisper's transcription is much faster with access to a GPU. On an NVIDIA Tesla T4, transcribing a three-minute video took 25 seconds, while the same video took 45 seconds on a 32-vCPU VM. The speedup is especially helpful for longer videos, which can be processed in a fraction of the time.
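Whisper picks a GPU up automatically through PyTorch, so before kicking off a long batch it's worth confirming one is visible:

# Whisper defaults to CUDA whenever PyTorch can see a GPU
import torch
import whisper

print(torch.cuda.is_available())      # True means transcription will use the GPU
model = whisper.load_model("medium")  # or force it with device="cuda"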

# Create and activate a virtualenv
python3 -m venv av-search-test && cd av-search-test
source bin/activate

# Install via Pip or grab a release from GitHub
python3 -m pip install git+https://github.com/algolia-samples/avsearch

export ALGOLIA_APP_ID=AAAAA12345
export ALGOLIA_INDEX_NAME=transcriptions
export ALGOLIA_API_KEY=6c4dba625a960b4cc54b7b5312f9117d

# Transcribe a video, playlist, channel, etc.
av-search --targets "https://www.youtube.com/watch?v=epSVL87_sqA"

More information on advanced usage can be found in the GitHub repository.

How to automate it

The best way to automate A/VSearch is to integrate it into a Python application. This way you can handle any errors gracefully and easily integrate other solutions you may need (event notifications, for example).

import os

from avsearch import AVSearch

avs = AVSearch(app_id='AAAAA12345', api_key=os.environ.get('ALGOLIA_API_KEY'), ...)
result = avs.transcribe([
    "https://www.youtube.com/watch?v=qSBm7d3McRI"
])

print(result)
# [
#    {
#      "objectID": "zOz-Sk4K-64-0",
#      "videoID": "zOz-Sk4K-64",
#      "videoTitle": "Welcome to Algolia DevCon! Keynote and product demos",
#      "videoDescription": "...",
#      "url": "https://youtu.be/zOz-Sk4K-64?t=0",
#      "thumbnail": "https://i.ytimg.com/...",
#      "text": "Hi everyone and welcome to DevCon 2022.",
#      "start": 0,
#      "end": 12,
#      "categories": [],
#      "context": {
#        "before": {
#          "start": 0,
#          "text": ""
#        },
#        "after": {
#          "start": 12,
#          "text": "I'm thrilled to be here with you today at Algolia's first ever developer conference"
#        }
#      }
#    },
#    ...
# ]

Configuration

Once you have some data in the index, there are some settings you should adjust to deliver the best search experience. You can do this via the Algolia Dashboard or, my personal favorite, the Algolia CLI! We've prepared a configuration file you can upload directly to your newly created index to get the best settings out of the box:

# Download the settings file from the repository or fetch it manually
wget https://raw.githubusercontent.com/algolia-samples/avsearch/main/examples/settings.json.example

# Transcription index name
export MY_INDEX_NAME='transcriptions'

# Overwrite index settings
algolia settings set $MY_INDEX_NAME -F settings.json.example

If you are new to our CLI, you can find more information on it here. If the Dashboard is more your speed, you can also upload the configuration by navigating to your index, clicking ‘Manage Index’, and selecting ‘Import Configuration’.
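If you would rather stay in Python, the same settings can be applied with Algolia's Python API client. The attribute values below are my guesses at sensible choices for transcript records, not necessarily the contents of settings.json.example.

# Applying index settings programmatically; the attribute choices are
# illustrative, not necessarily those in settings.json.example
import os
from algoliasearch.search_client import SearchClient

client = SearchClient.create(
    os.environ["ALGOLIA_APP_ID"], os.environ["ALGOLIA_API_KEY"]
)
index = client.init_index(os.environ["ALGOLIA_INDEX_NAME"])

index.set_settings({
    "searchableAttributes": ["text", "videoTitle", "videoDescription"],
    "attributesForFaceting": ["categories"],
    "attributesToSnippet": ["text:20"],
})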

Building a frontend

I built an autocomplete search experience (including a cmd-K binding) to simplify integrating with the existing Algolia Developer Conference homepage. Making the search interface a modal lets me provide a rich UX, with space for previews and thumbnails, without needing to redesign the whole home page. Algolia's AutocompleteJS library is great for building this type of autocomplete experience; I used our own documentation search as a model for inspiration. The large preview pane gives users more context from the video transcripts to help them find the right clip.

A/VSearch Example Frontend

I leaned into AutocompleteJS’s plugin architecture, including the official plugins for query suggestions and click events. I also created a custom plugin to load the selected video into an embedded iFrame on the same webpage (createLoadVideoPlugin).

You can see this example front end in the examples directory in the code repo or try out the live demo.

Wrap Up

If you have any questions about A/VSearch such as how it works, implementation questions, or feature requests, feel free to drop us a line over at our Discord forum! Our team would love to hear from you about A/VSearch or about any other Algolia-related questions you may have.

Want to get started transcribing your own content? Head over to the GitHub repository and grab the latest release!

We hope that you enjoyed this in-depth look at A/VSearch and how we used it to power the DevCon Session search feature. If you’re new to Algolia, you can try us out by signing up for a free tier account.

About the author

Michael King
Developer Advocate
