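How to Install and Deploy Whisper, the Best Open-Source Alternative to Google Speech-To-Text

OpenAI Whisper is currently the best open-source alternative to Google Speech-To-Text. It works natively in about 100 languages (detected automatically), adds punctuation, and can even translate the result if needed. In this article we show you how to install Whisper and deploy it to production.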

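Automatic Speech Recognition

Google Speech-To-Text

Google's automatic speech recognition (speech-to-text) API is very popular. It can transcribe audio and video files in 125 languages, and it offers dedicated AI models for phone call transcription, medical transcription, and more.

The API also has nice additional features such as content filtering, automatic punctuation (currently in beta only), and speaker diarization (also in beta).

Finally, the API can be installed on-premise. Note, however, that in order to report API usage, the on-premise AI model continuously sends data to Google, which may be a problem from a privacy standpoint.

Google's pricing is basically $0.006 per 15 seconds for basic speech-to-text, and $0.009 per 15 seconds for specific use cases such as video transcription or phone call transcription.

Let's say you want to automatically analyze the phone calls made to your support team (for example, to perform sentiment analysis or entity extraction on them later). If you have 5 support agents who each spend 4 hours a day on the phone with customers, Google's speech-to-text API will cost you about $1,400 per month.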
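
The arithmetic behind this kind of estimate can be sketched in a few lines. The function below is only an illustration: the 30 billable days per month and the choice of the $0.009 phone-call rate are assumptions, not figures stated above, and the result moves a lot with them.

```python
def monthly_cost(agents, hours_per_day, price_per_15s, days_per_month):
    """Estimate a monthly bill for an API billed per 15 seconds of audio."""
    seconds_per_day = agents * hours_per_day * 3600
    billed_units_per_day = seconds_per_day / 15
    return billed_units_per_day * price_per_15s * days_per_month


# Assumed inputs: 5 agents, 4 hours each, phone-call rate, 30 days per month
estimate = monthly_cost(agents=5, hours_per_day=4, price_per_15s=0.009, days_per_month=30)
print(f"Estimated monthly cost: ${estimate:,.0f}")
```

With these assumptions the estimate lands in the same range as the figure quoted above. Using the basic rate, or counting working days only, gives a lower number.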

If you are concerned about cost or privacy, you may want to switch to an open-source alternative: OpenAI Whisper.

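Whisper: The Best Alternative to Google Speech-To-Text

Whisper is an open-source AI model that OpenAI has just released.

OpenAI has a history of open-sourcing great AI projects. For example, GPT-2 was developed by OpenAI a few years ago. At the time it was the best generative natural language processing model ever created, and it paved the way for more advanced models such as GPT-3, GPT-J, OPT, Bloom... Recently they also released a nice CUDA programming framework called Triton.

Not all of OpenAI's models have been open-sourced, though. Their two most exciting models, GPT-3 and DALL-E, are still private models that can only be used through their paid API.

Whisper is taking the speech-to-text ecosystem by storm: it automatically detects the input language, transcribes speech in about 100 languages, punctuates the result automatically, and even translates it if needed. The accuracy is very good, and you can apply the model to any kind of input (audio, video, phone calls, medical discussions, etc.).

Another huge advantage of Whisper is that you can deploy it yourself on your own servers, which is great from a privacy standpoint.

Whisper is free, of course, but if you want to install it yourself you will need to spend some human time on it and pay for the underlying servers and GPUs. If you prefer to benefit from a managed version, you can use an API like NLP Cloud. Try Whisper for free on NLP Cloud now!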

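Installing and Deploying OpenAI Whisper

For the moment, you have 2 options if you want to install and deploy Whisper. The first is to use OpenAI's whisper Python library, and the second is to use the Hugging Face Transformers implementation of Whisper. Let's explore both solutions.

Using the whisper Python Library

This solution is the simplest one. You basically need to follow OpenAI's instructions in the Whisper project's GitHub repository.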

First install the whisper Python lib:

pip install git+https://github.com/openai/whisper.git

Then install ffmpeg on your system if it is not already installed:

# on Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg

# on Arch Linux
sudo pacman -S ffmpeg

# on macOS using Homebrew (https://brew.sh/)
brew install ffmpeg

# on Windows using Chocolatey (https://chocolatey.org/)
choco install ffmpeg

# on Windows using Scoop (https://scoop.sh/)
scoop install ffmpeg

Several flavors of Whisper are available: tiny, base, small, medium, and large. Larger models are more accurate but also slower and more memory-hungry, so if you are looking for state-of-the-art results we recommend the large version. Here is a very simple Python script that opens an mp3 audio file stored on your disk, automatically detects the input language, and transcribes it:

import whisper

model = whisper.load_model("large")
result = model.transcribe("audio.mp3")
print(result["text"])

Simple, isn't it?
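
Besides the full text, the dictionary returned by transcribe also contains a "segments" list with start and end times in seconds. As an illustration, here is a minimal sketch that turns those segments into SRT subtitles. The names to_srt_timestamp, segments_to_srt, and transcribe_to_srt are ours, not part of the whisper library, and whisper is imported inside the last function so that the two formatting helpers can be used without loading a model.

```python
def to_srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments):
    """Build SRT text from dicts that have "start", "end" and "text" keys."""
    blocks = []
    for index, segment in enumerate(segments, start=1):
        start = to_srt_timestamp(segment["start"])
        end = to_srt_timestamp(segment["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{segment['text'].strip()}\n")
    return "\n".join(blocks)


def transcribe_to_srt(audio_path, srt_path, model_name="large"):
    """Transcribe an audio file with whisper and write the result as SRT."""
    import whisper  # imported lazily: only needed when a model is actually run

    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path)
    with open(srt_path, "w", encoding="utf-8") as srt_file:
        srt_file.write(segments_to_srt(result["segments"]))
```

Calling transcribe_to_srt("audio.mp3", "audio.srt") would then write the subtitles next to the audio file.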

Using the Hugging Face Implementation

In order to use Hugging Face's implementation of Whisper you will first need to install Hugging Face Transformers, librosa, and PyTorch:

pip install transformers
pip install librosa
pip install torch

You also need to install ffmpeg (see above).

Now, here is a Python script that does transcription in English:

from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa

# Whisper expects 16 kHz audio, so resample while loading
speech, sampling_rate = librosa.load("audio.mp3", sr=16000)

processor = WhisperProcessor.from_pretrained("openai/whisper-large")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large")

# Force the decoder to transcribe in English
model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language="en", task="transcribe")
input_features = processor(speech, sampling_rate=sampling_rate, return_tensors="pt").input_features
predicted_ids = model.generate(input_features)

# Decode the generated token ids into text, dropping special tokens
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)

print(transcription)

There are 2 limitations with this Hugging Face implementation. First, you need to set the source language manually (no automatic input language detection is implemented yet). Second, no automatic chunking is applied, which means that you cannot transcribe content that is longer than 30 seconds.

Maybe these limitations will be solved in future releases?
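
In the meantime, one workaround is to split long audio into windows of at most 30 seconds and transcribe each window separately. The helper below is a minimal sketch of that idea. split_into_chunks is a hypothetical name, and cutting at fixed positions can split a word in two, so a production setup would rather cut on silences or use overlapping windows.

```python
def split_into_chunks(samples, sampling_rate, chunk_seconds=30):
    """Split a 1-D sequence of audio samples into chunks of at most chunk_seconds."""
    chunk_size = int(sampling_rate * chunk_seconds)
    if chunk_size <= 0:
        raise ValueError("sampling_rate and chunk_seconds must be positive")
    return [samples[i:i + chunk_size] for i in range(0, len(samples), chunk_size)]


# Example with 70 seconds of silence at 16 kHz:
# two full 30-second chunks and one 10-second remainder
chunks = split_into_chunks([0.0] * (70 * 16000), sampling_rate=16000)
print([len(chunk) / 16000 for chunk in chunks])
```

The function only relies on len and slicing, so it also works on the array returned by librosa.load. Each chunk can then go through the processor and model.generate as in the script above, and the decoded texts can be joined together.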

A nice thing, though, is that a TensorFlow implementation is available too, which means that you can use XLA compilation and get much faster response times.

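Hardware Requirements

As we saw above, Whisper is fairly easy to install. However, it requires advanced hardware. A GPU is recommended if you want to use the large version of the model.

If you use the whisper Python library (see above), you will need around 10GB of RAM and 11GB of VRAM. In practice this means you need a GPU with at least 16GB of memory. It can be an NVIDIA Tesla T4, for example, or an NVIDIA A10.

On a Tesla T4, you will transcribe 30 seconds of audio in around 6 seconds.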
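
To turn this figure into a rough capacity estimate, you can compute how much GPU time a given volume of audio needs. The sketch below assumes that processing time scales linearly with audio duration, which is only an approximation, and reuses the support team example (5 agents, 4 hours of calls each per day).

```python
def gpu_hours_needed(audio_hours, seconds_of_audio=30, seconds_to_process=6):
    """Estimate GPU hours needed, assuming time scales linearly with audio length."""
    real_time_factor = seconds_to_process / seconds_of_audio
    return audio_hours * real_time_factor


daily_audio_hours = 5 * 4  # 5 agents, 4 hours of calls each
print(f"{gpu_hours_needed(daily_audio_hours):.1f} GPU hours per day on a Tesla T4")
```

Under these assumptions, a single T4 would comfortably keep up with that daily volume.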

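Performance Considerations

If you want to improve on the default performance mentioned above, here are several strategies you can explore:

• Use a higher-end GPU. For example, an Ampere-based GPU such as the A10, A40, or A100 will give you better response times
• Work on batch inference in order to improve throughput
• Take advantage of XLA compilation with TensorFlow or JAX
• Export the model to ONNX or TensorRT, then serve it through the NVIDIA Triton Inference Server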
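
As an illustration of the batching idea, the sketch below groups audio clips into fixed-size batches so that the model is called once per batch instead of once per clip. Both transcribe_in_batches and the transcribe_batch callable are hypothetical: transcribe_batch stands for whatever function runs your model on a list of clips and returns one text per clip.

```python
def transcribe_in_batches(clips, transcribe_batch, batch_size=8):
    """Call transcribe_batch on groups of clips and return the texts in order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    texts = []
    for start in range(0, len(clips), batch_size):
        batch = clips[start:start + batch_size]
        texts.extend(transcribe_batch(batch))
    return texts


# Stand-in for a real model call, only here to show the control flow
def fake_transcribe_batch(batch):
    return [f"text for {clip}" for clip in batch]


print(transcribe_in_batches(["a.mp3", "b.mp3", "c.mp3"], fake_transcribe_batch, batch_size=2))
```

The right batch size depends on the available VRAM and on the length of the clips, so it is worth measuring throughput for a few values rather than guessing.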

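Conclusion

OpenAI Whisper is a revolution in the speech-to-text world. Thanks to this open-source model, anyone can for the first time easily access state-of-the-art automatic speech recognition, which makes Whisper a good alternative to the Google Speech-To-Text API.

Installing and deploying such an AI model is still a challenge, though, because of the hardware required under the hood. The large version of Whisper cannot really run on consumer hardware.

If you want to try Whisper easily, without worrying about infrastructure, try it on the NLP Cloud API. Try Whisper for free on NLP Cloud now!

Julien Salinas
CTO at NLP Cloud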