Automatic Speech Recognition (Speech-To-Text) Whisper API

What Is Automatic Speech Recognition (Speech-To-Text)?

Automatic speech recognition (also known as speech-to-text) is about extracting text from an audio file. This is often an important first step in an AI pipeline. Great progress has been made in the last few years, and it is now possible to extract text from an audio or video file with great accuracy.

For example, here is a chapter from a LibriVox audio book (The Metal Giants, by Edmond Hamilton), stored on Archive.org: https://ia801400.us.archive.org/10/items/metalgiants_2209_librivox/metalgiants_03_hamilton_64kb.mp3.


Once we perform automatic speech recognition on this file on NLP Cloud, we get the following text:

Chapter three of The Medal Giants by Edmound Hamilton. This Librivox recording is in the public domain. Read by Ben Tucker. Chapter three: Lanier arrived [...] In a thousand homes the evening meal was being prepared, and the day's gossip related. In the west, the sun sank lower and lower, and all around, beyond the encircling hills, death marched toward the city with crashing giant strides. End of chapter three.

This is a very good text extraction: apart from two proper nouns in the introduction ("Medal" for "Metal" and "Edmound" for "Edmond"), there are no spelling mistakes, and punctuation was automatically added.

You can also get word-level timestamps, which is useful for subtitling.
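As an illustration of how word-level timestamps can feed subtitling, here is a minimal Python sketch that groups timestamped words into SRT cues. The input shape (a list of dictionaries with "word", "start" and "end" keys, times in seconds) is an assumption made for this example, and the sample timings are invented; check the documentation for the actual response format.

```python
from typing import Dict, List, Union

# Assumed shape of one timestamped word: {"word": str, "start": float, "end": float}
Word = Dict[str, Union[str, float]]


def format_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: List[Word], max_words: int = 7) -> str:
    """Group words into cues of at most max_words and render them as SRT."""
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        start = format_timestamp(float(chunk[0]["start"]))
        end = format_timestamp(float(chunk[-1]["end"]))
        text = " ".join(str(w["word"]).strip() for w in chunk)
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    # Cues are separated by a blank line, as the SRT format expects.
    return "\n".join(cues)


# Invented sample data, for illustration only.
sample_words = [
    {"word": "Chapter", "start": 0.0, "end": 0.42},
    {"word": "three", "start": 0.42, "end": 0.8},
    {"word": "of", "start": 0.8, "end": 0.95},
    {"word": "The", "start": 0.95, "end": 1.1},
]
srt_text = words_to_srt(sample_words, max_words=2)
print(srt_text)
```

Grouping by a fixed number of words keeps the sketch short; a real subtitling tool would probably also split cues on punctuation, pauses, or a maximum line length.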

Why Use Speech-To-Text?

The quality of speech-to-text has improved dramatically in recent years, which has led to many interesting applications. Here are some examples:

Customer Support

Thanks to automatic speech recognition, you can now automatically analyze customer calls and extract valuable information. For example, you can automatically find out which support discussions went well and which ones didn't, so you can act accordingly.

Voice Message Analysis

It's sometimes hard to address all incoming voice messages in a timely manner. With speech-to-text, you can automatically analyze each incoming message to extract the intent, categorize it, detect the urgency, etc., so you can easily adapt your response.

Medical Reports

It is very common for doctors to record their discussions with their patients, or to record a summary of the discussion. They can now automatically convert these recordings into text and then apply several kinds of post-processing, like conversation summarization, entity extraction, etc.

Video Subtitling

Videos are everywhere today. Automatic video subtitling is a great way to increase accessibility and make the content of the video more SEO-friendly. As a second step, you can easily translate your subtitles to make the video available worldwide.

Automatic Speech Recognition with OpenAI Whisper Large

Whisper Large is an advanced speech recognition AI model released by OpenAI in order to dramatically improve automatic speech recognition in 97 languages.

This model automatically detects the language from the input audio or video file, and it automatically adds punctuation to the result. It can also extract word-level timestamps, which is very useful for subtitling. You can find the Whisper open-source project here. This model was fine-tuned on popular datasets like Common Voice, Librispeech, and VoxPopuli, and it is the most advanced multilingual speech-to-text model as of this writing.

Whisper Large API on NLP Cloud

NLP Cloud offers a fast speech-to-text API that allows you to perform automatic speech recognition out of the box, based on OpenAI Whisper Large, at an affordable price.

For more details, see our documentation about automatic speech recognition here.
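To give an idea of what calling such an API can look like, here is a minimal Python sketch that uses only the standard library. The endpoint URL, the "Token" authorization scheme and the "url" request field are assumptions made for this example and should be checked against the documentation; the token is a placeholder.

```python
import json
import urllib.request

# Assumed endpoint: confirm the exact URL in the NLP Cloud documentation.
ASR_ENDPOINT = "https://api.nlpcloud.io/v1/gpu/whisper/asr"

AUDIO_URL = (
    "https://ia801400.us.archive.org/10/items/"
    "metalgiants_2209_librivox/metalgiants_03_hamilton_64kb.mp3"
)


def build_asr_request(token: str, audio_url: str) -> urllib.request.Request:
    """Build (without sending) a POST request asking for a transcription."""
    # Assumed request body: a JSON object pointing to the audio file.
    payload = json.dumps({"url": audio_url}).encode("utf-8")
    return urllib.request.Request(
        ASR_ENDPOINT,
        data=payload,
        headers={
            "Authorization": f"Token {token}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )


def transcribe(token: str, audio_url: str, timeout: float = 300.0) -> dict:
    """Send the request and return the parsed JSON response."""
    request = build_asr_request(token, audio_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request does not touch the network.
request = build_asr_request("<your_api_token>", AUDIO_URL)
print(request.get_method(), request.full_url)
```

Calling transcribe("<your_api_token>", AUDIO_URL) with a valid token would send the request and return the parsed response; the names of the fields in that response are not shown here because they depend on the actual API.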

Testing speech-to-text locally is one thing, but using it reliably in production is another. With NLP Cloud you can do both!

Frequently Asked Questions

What is automatic speech recognition?

Automatic speech recognition (ASR) is a technology that enables computers or other devices to recognize and transcribe human speech into textual data. It involves converting spoken language into a machine-readable format, which can then be used for various applications such as voice-to-text transcription, voice-activated commands, and natural language processing.

What is Whisper?

Whisper is an advanced open-source ASR (speech-to-text) model created by OpenAI. It is able to transcribe audio in 97 languages with very good accuracy.

Can I try the Whisper API for free?

Yes, like all the models on NLP Cloud, the Whisper API can be tested for free.

Can I use the Whisper API to transcribe audio in several languages?

Yes, Whisper is able to transcribe audio in 97 languages.

Does Whisper automatically add punctuation?

Yes, Whisper automatically adds punctuation to the transcribed text.

Can I use Whisper to transcribe audio and automatically translate to another language?

No. You will need to use our translation endpoint once your audio is transcribed: see our translation documentation here.

Does Whisper return the timestamps?

Yes, Whisper can return word-level timestamps.

Does the Whisper API support live transcription (token streaming)?

No, not for the moment.

How does your AI API handle data privacy and security during the speech recognition process?

NLP Cloud is focused on data privacy by design: we do not log or store the content of the requests you make on our API. NLP Cloud is both HIPAA and GDPR compliant.