How To Install And Deploy Whisper, The Best Open-Source Alternative To Google Speech-To-Text

Google Speech-To-Text

Google's automatic speech recognition (speech-to-text) API is very popular. It can transcribe audio and video files in 125 languages, and it offers specialized AI models for phone call transcription, medical transcription, and more.

This API also has nice additional features like content filtering, automatic punctuation (in beta only for the moment), and speaker diarization (in beta too).

Finally, the API can be installed on premises. Note, however, that the on-prem AI model keeps sending data to Google in order to report API usage, which might be a concern from a privacy standpoint.

Google charges $0.006 per 15 seconds for basic speech-to-text, and $0.009 per 15 seconds for specific use cases like video transcription or phone transcription.

Let's say you want to automatically analyze phone calls made to your support team (in order to later perform sentiment analysis or entity extraction on them for example). If you have 5 support agents spending 4h each per day on the phone with customers, Google's speech-to-text API will cost you $1,400 per month.
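The arithmetic behind this kind of estimate can be sketched in a few lines of Python. The 30-day month, the use of the $0.009 phone-call rate, and the rounding to whole 15-second increments are assumptions on our side, so the result lands close to the figure above rather than exactly on it:

```python
import math

def monthly_cost(agents, hours_per_day, price_per_increment,
                 days_per_month=30, increment_seconds=15):
    """Estimate the monthly bill for a given volume of call audio.

    days_per_month and increment_seconds are assumptions made for this
    sketch, not values taken from Google's documentation.
    """
    seconds_per_day = agents * hours_per_day * 3600
    increments_per_day = math.ceil(seconds_per_day / increment_seconds)
    return increments_per_day * price_per_increment * days_per_month

# 5 agents, 4 hours each per day, billed at the phone-call rate
cost = monthly_cost(agents=5, hours_per_day=4, price_per_increment=0.009)
print(f"${cost:,.2f} per month")  # $1,296.00 per month
```

Changing days_per_month or the per-increment rate shows how sensitive the bill is to call volume.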

If you are concerned about costs or privacy, you might want to switch to an open-source alternative: OpenAI Whisper.

Whisper: The Best Alternative To Google Speech-To-Text

Whisper is an open-source AI model that has just been released by OpenAI.

OpenAI has a history of open-sourcing great AI projects. For example, GPT-2 was developed by OpenAI a couple of years ago. At the time it was the best generative natural language processing model ever created, and it paved the way for much more advanced models like GPT-3, GPT-J, OPT, and Bloom. More recently, they also released a nice GPU programming language and compiler called Triton.

Not all of OpenAI's models have been open-sourced though. Their 2 most exciting models, GPT-3 and DALL-E, are still private models that can only be used through their paid API.

Whisper is taking the speech-to-text ecosystem by storm: it can automatically detect the input language, then transcribe text in around 100 languages, automatically punctuate the result, and even translate the result if needed. Accuracy is very good, and you can apply this model to any kind of input (audio, video, phone calls, medical discussions, etc.).

And of course, another great advantage of Whisper is that you can deploy it by yourself on your own servers, which is great from a privacy standpoint.

Whisper is free of course, but if you want to install it by yourself you will need to spend some human time on it, and pay for the underlying servers and GPUs. If you prefer to benefit from a managed version, you can use an API like NLP Cloud: try Whisper for free on NLP Cloud now!

Installing And Deploying OpenAI Whisper

For the moment, you have 2 options if you want to install and deploy Whisper. The first one is to use OpenAI's whisper Python library, and the second one is to use the Hugging Face Transformers implementation of Whisper. Let's explore both solutions.

Using the whisper Python lib

This solution is the simplest one. You basically need to follow OpenAI's instructions on the GitHub repository of the Whisper project.

First install the whisper Python lib:

pip install git+https://github.com/openai/whisper.git

Then install ffmpeg on your system if it is not already installed:

# on Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg

# on Arch Linux
sudo pacman -S ffmpeg

# on macOS using Homebrew (https://brew.sh/)
brew install ffmpeg

# on Windows using Chocolatey (https://chocolatey.org/)
choco install ffmpeg

# on Windows using Scoop (https://scoop.sh/)
scoop install ffmpeg
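
Before going further, you may want to confirm that the ffmpeg binary is actually visible from Python, since the whisper library calls it to decode audio files. Here is a minimal check that only uses the standard library (the helper name ffmpeg_available is ours, not part of whisper):

```python
import shutil

def ffmpeg_available():
    """Return True if an ffmpeg executable can be found on the PATH."""
    return shutil.which("ffmpeg") is not None

if ffmpeg_available():
    print("ffmpeg found at", shutil.which("ffmpeg"))
else:
    print("ffmpeg not found: install it with one of the commands above")
```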

Several flavors of Whisper are available: tiny, base, small, medium, and large. Larger models are more accurate, but they are also slower and need more memory, so if you are looking for state-of-the-art results we recommend the large version. Here is a very simple Python script that opens an mp3 audio file stored on your disk, automatically detects the input language, and transcribes it:

import whisper

model = whisper.load_model("large")
result = model.transcribe("audio.mp3")
print(result["text"])

Simple, isn't it?
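
The dictionary returned by transcribe() also contains a "segments" list with start and end times for each piece of text, and transcribe() accepts a task option that can be set to "translate" to get an English translation. Here is a small sketch that turns those segments into timestamped lines. The helper names are ours, and the function expects an already loaded whisper model:

```python
def format_timestamp(seconds):
    """Format a duration in seconds as MM:SS."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"

def transcribe_with_timestamps(model, path, task="transcribe"):
    """Run a loaded whisper model on `path` and return one line per segment.

    `task` can be "transcribe" or "translate" (translation into English).
    """
    result = model.transcribe(path, task=task)
    lines = []
    for segment in result["segments"]:
        start = format_timestamp(segment["start"])
        end = format_timestamp(segment["end"])
        lines.append(f"[{start} -> {end}] {segment['text'].strip()}")
    return lines
```

With the whisper library installed, you would call it as transcribe_with_timestamps(whisper.load_model("large"), "audio.mp3") and print the returned lines.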

Using the Hugging Face implementation

In order to use Hugging Face's implementation of Whisper you will first need to install HF Transformers, librosa, and PyTorch:

pip install transformers
pip install librosa
pip install torch

You also need to install ffmpeg (see above).

Now, here is a Python script that does transcription in English:

from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa

# Whisper expects 16 kHz audio, so resample while loading
# (librosa defaults to 22,050 Hz otherwise)
speech, _ = librosa.load("audio.mp3", sr=16000)

processor = WhisperProcessor.from_pretrained("openai/whisper-large")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large")

# Force the source language and the task, since the language is not detected automatically here
model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language = "en", task = "transcribe")
input_features = processor(speech, sampling_rate=16000, return_tensors="pt").input_features
predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens = True)

print(transcription)

There are 2 limitations with this Hugging Face implementation. First, you need to manually set the source language (no automatic input language detection is implemented yet). Second, no automatic chunking is applied, which means that you cannot transcribe content that is longer than 30 seconds.

Maybe these limitations will be solved in future releases?
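
In the meantime, one possible workaround for the 30-second limit is to split the audio yourself and transcribe each piece separately. The sketch below only shows the splitting step, on a plain sequence of samples. It is a naive approach: a cut can land in the middle of a word, so accuracy may suffer around chunk boundaries. The function name and parameters are ours:

```python
def split_into_chunks(samples, sample_rate=16000, chunk_seconds=30):
    """Yield consecutive slices of `samples`, each at most `chunk_seconds` long.

    Works on any sliceable sequence, for example the array returned by librosa.load.
    """
    chunk_size = sample_rate * chunk_seconds
    for start in range(0, len(samples), chunk_size):
        yield samples[start:start + chunk_size]

# Example with a dummy 70-second signal (silence) sampled at 16 kHz
samples = [0.0] * (16000 * 70)
chunks = list(split_into_chunks(samples))
print([len(chunk) / 16000 for chunk in chunks])  # [30.0, 30.0, 10.0]
```

Each chunk can then go through the processor and model.generate() as in the script above, and the resulting texts can be joined together.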

A nice thing though is that there is a TensorFlow implementation available too, which means that you can use XLA compilation and get much faster response times.

Hardware Requirements

As we saw above, Whisper is fairly easy to install. However, it requires advanced hardware. A GPU is recommended if you want to use the large version of the model.

If you use the whisper Python lib (see above) you will need around 10GB of RAM and 11GB of VRAM. It means that in practice you will need a 16GB GPU at least. It could be an NVIDIA Tesla T4 for example, or an NVIDIA A10.
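
If your GPU is smaller than that, a smaller flavor of Whisper may still fit. Here is a rough sketch of how you could pick one. The VRAM figures for the smaller models are approximate values we recall from the Whisper README, the figure for large is the one quoted above, and the 1GB headroom is an arbitrary assumption, so treat the whole table as indicative only:

```python
# Approximate VRAM needs in GB, ordered from smallest to largest model.
# Indicative values only: check the Whisper README for current figures.
APPROX_VRAM_GB = [("tiny", 1), ("base", 1), ("small", 2), ("medium", 5), ("large", 11)]

def largest_model_for(vram_gb, headroom_gb=1.0):
    """Return the largest model name that fits, or None if nothing fits.

    headroom_gb is a hypothetical safety margin kept free on the GPU.
    """
    usable = vram_gb - headroom_gb
    choice = None
    for name, needed in APPROX_VRAM_GB:
        if needed <= usable:
            choice = name
    return choice

print(largest_model_for(16))  # large
print(largest_model_for(8))   # medium
```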

On a Tesla T4, you will transcribe 30 seconds of audio in around 6 seconds.
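
To get a feel for what this means at scale, you can turn that measurement into a real-time factor (processing time divided by audio duration) and estimate how much GPU time a given volume of audio needs. This sketch assumes that processing time scales linearly with audio length and that audio is processed one stream at a time on a single GPU, which will not hold exactly in practice:

```python
def real_time_factor(processing_seconds, audio_seconds):
    """Processing time divided by audio duration (lower is faster)."""
    return processing_seconds / audio_seconds

def gpu_hours_needed(audio_hours, rtf):
    """GPU time needed for `audio_hours` of audio, assuming linear scaling."""
    return audio_hours * rtf

rtf = real_time_factor(processing_seconds=6, audio_seconds=30)
print(rtf)  # 0.2

# 5 support agents x 4 hours of calls each = 20 hours of audio per day
print(gpu_hours_needed(audio_hours=20, rtf=rtf))  # 4.0
```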

Performance Considerations

If you want to improve the default performance mentioned above, here are several strategies you can explore:

• Use a higher-end GPU. For example, you will get a better response time with GPUs using the Ampere platform like A10, A40, or A100.
• Work on batch inference in order to improve the throughput.
• Leverage XLA compilation with TensorFlow or JAX.
• Export the model to ONNX or TensorRT, and then serve it through the NVIDIA Triton Inference Server.
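
As an illustration of the batch inference idea, here is a minimal sketch based on the Hugging Face script shown earlier. It assumes that processor and model are the WhisperProcessor and WhisperForConditionalGeneration objects loaded above, that every clip is a 16 kHz array of 30 seconds or less, and that the batch size of 8 (an arbitrary value) fits in your GPU memory. The function names are ours:

```python
def batched(items, batch_size):
    """Yield successive lists of at most `batch_size` items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def transcribe_clips(clips, processor, model, batch_size=8, sampling_rate=16000):
    """Transcribe a list of short audio clips, with one generate() call per batch."""
    texts = []
    for batch in batched(clips, batch_size):
        # The processor pads every clip to the same length and stacks them
        features = processor(batch, sampling_rate=sampling_rate,
                             return_tensors="pt").input_features
        # If the model lives on a GPU, move `features` to that device first
        predicted_ids = model.generate(features)
        texts.extend(processor.batch_decode(predicted_ids, skip_special_tokens=True))
    return texts
```

Whether this actually improves throughput depends on your GPU and on the batch size, so it is worth measuring before and after.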

Conclusion

OpenAI Whisper is a revolution in the speech-to-text world. For the first time, anyone can easily access state-of-the-art automatic speech recognition thanks to this open-source model, which makes Whisper a good alternative to Google's speech-to-text API.

Installing and deploying such an AI model is still a challenge though because of the hardware required under the hood. The large version of Whisper cannot really run on consumer hardware.

If you want to try Whisper easily without bothering with infrastructure considerations, please try it on the NLP Cloud API: try Whisper for free on NLP Cloud now!

Abhinav
Devops engineer at NLP Cloud