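OpenAI Whisper is currently the best open-source alternative to Google Speech-to-Text. It works natively in about 100 languages (detected automatically), adds punctuation, and can even translate the result if needed. In this article, we show you how to install Whisper and deploy it to production.
If you are concerned about cost or privacy, you may want to switch to an open-source alternative: OpenAI Whisper.
Whisper is free, of course, but if you want to install it yourself you will need to spend some human time on it and pay for the underlying servers and GPUs. If you would rather benefit from a managed version, you can use an API like NLP Cloud. Try Whisper for free on NLP Cloud now!
If you want to install and deploy Whisper yourself, you currently have two options. The first is to use OpenAI's whisper Python library, and the second is to use the Hugging Face Transformers implementation of Whisper. Let's explore both solutions.
This solution is the simplest. You basically need to follow OpenAI's instructions in the Whisper project's GitHub repository.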
First install the whisper Python lib:
pip install git+
Then install ffmpeg on your system if it is not already installed:
# on Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg
# on Arch Linux
sudo pacman -S ffmpeg
# on MacOS using Homebrew (
brew install ffmpeg
# on Windows using Chocolatey (
choco install ffmpeg
# on Windows using Scoop (
scoop install ffmpeg
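Before loading a model, it can be worth confirming that ffmpeg is actually reachable from Python, since the whisper library relies on it to decode audio files. The snippet below is a minimal sketch using only the standard library; the function name ffmpeg_available is a hypothetical helper, not part of Whisper.

```python
import shutil


def ffmpeg_available():
    """Return True if an `ffmpeg` executable can be found on the PATH."""
    return shutil.which("ffmpeg") is not None


if ffmpeg_available():
    print("ffmpeg found at", shutil.which("ffmpeg"))
else:
    print("ffmpeg not found; install it with one of the commands above")
```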
Several flavors of Whisper are available: tiny, base, small, medium, and large. Larger models are generally more accurate but also slower and more memory-hungry, so if you are looking for state-of-the-art results we recommend the large version. Here is a very simple Python script that opens an mp3 audio file stored on your disk, automatically detects the input language, and transcribes it:
import whisper
model = whisper.load_model("large")
# The input language is detected automatically
result = model.transcribe("audio.mp3")
# Print the transcribed text
print(result["text"])
Simple, isn't it?
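The dictionary returned by transcribe contains more than the full text: it also holds the detected language and a list of timestamped segments. The sketch below shows one possible way to turn those segments into readable lines. The helper format_segments is hypothetical (it is not part of the whisper library), and the commented usage assumes the same audio.mp3 file as above.

```python
def format_segments(result):
    """Turn the `segments` of a whisper result dict into timestamped lines."""
    lines = []
    for segment in result.get("segments", []):
        text = segment["text"].strip()
        lines.append(f"[{segment['start']:.2f}s -> {segment['end']:.2f}s] {text}")
    return lines


# Hypothetical usage (requires the whisper library and a local audio file):
# import whisper
# model = whisper.load_model("large")
# # task="translate" asks Whisper to translate the speech into English
# result = model.transcribe("audio.mp3", task="translate")
# print(result["language"])  # language code detected from the audio
# print("\n".join(format_segments(result)))
```

Keeping the formatting in a separate function makes it easy to try out on a hand-written dictionary before loading a large model.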
In order to use Hugging Face's implementation of Whisper, you will first need to install Hugging Face Transformers, librosa, and PyTorch:
pip install transformers
pip install librosa
pip install torch
You also need to install ffmpeg (see above).
Now, here is a Python script that does transcription in English:
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa
# Whisper expects 16 kHz audio, so resample while loading
speech, _ = librosa.load("audio.mp3", sr=16000)
processor = WhisperProcessor.from_pretrained("openai/whisper-large")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large")
# Force English transcription (the language is not detected automatically here)
model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language="en", task="transcribe")
input_features = processor(speech, sampling_rate=16000, return_tensors="pt").input_features
predicted_ids = model.generate(input_features)
# Decode once, dropping special tokens
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription)
There are two limitations with this Hugging Face implementation. First, you need to set the source language manually (automatic input language detection is not implemented yet). Second, no automatic chunking is applied, which means that you cannot transcribe audio longer than 30 seconds.
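One possible workaround for the 30-second limit is to split the audio yourself and transcribe each piece separately. The sketch below only computes the sample ranges of fixed 30-second windows; chunk_bounds is a hypothetical helper, and the commented usage assumes the audio was loaded at 16 kHz. Note that cutting at fixed positions can split a word in two, so this is a naive illustration rather than a production-ready approach.

```python
def chunk_bounds(num_samples, sample_rate=16000, chunk_seconds=30):
    """Return (start, end) sample indices covering the audio in fixed-size windows."""
    if sample_rate <= 0 or chunk_seconds <= 0:
        raise ValueError("sample_rate and chunk_seconds must be positive")
    chunk_size = int(sample_rate * chunk_seconds)
    return [
        (start, min(start + chunk_size, num_samples))
        for start in range(0, num_samples, chunk_size)
    ]


# Hypothetical usage with the `speech`, `processor`, and `model` objects
# from the script above (assuming `speech` is sampled at 16 kHz):
# texts = []
# for start, end in chunk_bounds(len(speech)):
#     features = processor(
#         speech[start:end], sampling_rate=16000, return_tensors="pt"
#     ).input_features
#     ids = model.generate(features)
#     texts.append(processor.batch_decode(ids, skip_special_tokens=True)[0])
# print(" ".join(texts))
```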
Maybe these limitations will be solved in future releases?
A nice thing, though, is that a TensorFlow implementation is available too, which means that you can use XLA compilation and get much faster response times.
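If you use the whisper Python library (see above), you will need around 10GB of RAM and 11GB of VRAM. In practice, this means that you need a GPU with at least 16GB of memory. It could be an NVIDIA Tesla T4, for example, or an NVIDIA A10.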
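As a rough sanity check of these figures, the sketch below compares a GPU's total memory with the 11GB VRAM requirement. Both helper names are hypothetical, and the PyTorch query is only attempted if torch is installed and a CUDA device is visible; otherwise it simply returns None.

```python
def has_enough_vram(total_bytes, required_gb=11.0):
    """Return True if `total_bytes` of GPU memory covers `required_gb` (in GiB)."""
    return total_bytes >= required_gb * 1024 ** 3


def local_gpu_bytes():
    """Return the total memory of the first CUDA device, or None if unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_properties(0).total_memory


total = local_gpu_bytes()
if total is None:
    print("No CUDA GPU detected (or PyTorch is not installed)")
else:
    print("Enough VRAM for Whisper:", has_enough_vram(total))
```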
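OpenAI Whisper is a revolution in the speech-to-text world. Thanks to this open-source model, anyone can, for the first time, easily access state-of-the-art automatic speech recognition, which makes Whisper a good alternative to the Google Speech-to-Text API.
If you would like to try Whisper easily, without having to worry about infrastructure, try it on the NLP Cloud API.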
Julien Salinas