Automatic Speech Recognition (Speech to Text)

For long files (more than 200 seconds) you will need to use the asynchronous mode: see more in the documentation.
The API also returns word-level timestamps you can use for subtitling.
The API also accepts base-64 encoded files instead of URLs.
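
The two notes above (word-level timestamps for subtitling, and base-64 input) can be illustrated with a short Python sketch. This is not the API's documented client. The response shape is an assumption (a list of dictionaries with `word`, `start`, and `end` keys, times in seconds), and the function names are hypothetical. It shows how such timestamps could be grouped into SRT subtitle cues, and how raw audio bytes could be base-64 encoded before being sent.

```python
import base64


def encode_audio_base64(audio_bytes: bytes) -> str:
    """Return raw audio bytes as a base-64 ASCII string."""
    return base64.b64encode(audio_bytes).decode("ascii")


def format_srt_time(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words, max_words=7):
    """Group word-level timestamps into numbered SRT cues.

    Each cue spans from the start of its first word to the end of its
    last word. The "word"/"start"/"end" keys are an assumed shape.
    """
    cues = []
    for index in range(0, len(words), max_words):
        group = words[index:index + max_words]
        start = format_srt_time(group[0]["start"])
        end = format_srt_time(group[-1]["end"])
        text = " ".join(w["word"] for w in group)
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)


# Hypothetical word-level timestamps (illustrative values, not real API output).
sample_words = [
    {"word": "Chapter", "start": 0.0, "end": 0.42},
    {"word": "three", "start": 0.42, "end": 0.80},
    {"word": "of", "start": 0.80, "end": 0.95},
    {"word": "The", "start": 0.95, "end": 1.10},
]
print(words_to_srt(sample_words, max_words=2))
```

Grouping by a fixed number of words keeps the sketch simple. A real subtitler would more likely break cues on punctuation or on pauses between words.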


What Is Automatic Speech Recognition (Speech To Text)?

Automatic speech recognition (also known as speech to text) is about extracting text from an audio file. This is often a critical first step in an AI pipeline. Great progress has been made in the last few years, and it is now possible to extract text from an audio or video file with high accuracy.

For example, here is a chapter from a LibriVox audio book (The Metal Giants, by Edmond Hamilton), stored on Archive.org: https://ia801400.us.archive.org/10/items/metalgiants_2209_librivox/metalgiants_03_hamilton_64kb.mp3.

Once we perform automatic speech recognition on this file, we get the following text:

Chapter three of The Medal Giants by Edmound Hamilton. This Librivox recording is in the public domain. Read by Ben Tucker. Chapter three: Lanier arrived in Stockton early the next morning. His face was drawn and haggard, as it had been since he first read a certain humorous newspaper despatch, and in his mind was an immense perplexity, a vague, chilling fear. Until late in the afternoon, he tramped warily through the town, asking in all quarters the same question: do you know of anyone named Det Mold who lives in or around Stockton? A tall, strong man? And from all he questioned, he got no trace until he happened into the office of a small trucking and hauling company. None of them knew anything of Deadmold, but they had done some work for a certain Foster who corresponded exactly to Lanier's description. This man lived several miles from the city in a northeastern direction and had hired them to haul some boxes from the railroad to his home, an old farmhouse. A mighty bad road it was too, and this Foster had been very particular about the moving of his stuff. Yes, they could direct him to the place. He went out such and such a concrete road and turned up a ruddy lane, very steep. By the time the sun hung poised above the western horizon, Lanier was already ascending that steep, twisted road. More than once he glanced back at the city below. A city bathed in the golden afternoon sunlight. Its streets were filled now with workers returning home from the mills, tired and blackened, calling out to the friends they met for the latest news on that "Morgan criter" as they termed it. A quiet serenity, a dreamy contented peace, pervaded Stockton, contrasting with the tense excitement of the preceding night. In a thousand homes the evening meal was being prepared, and the day's gossip related. In the west, the sun sank lower and lower, and all around, beyond the encircling hills, death marched toward the city with crashing giant strides. End of chapter three.

This is a high-quality transcription: apart from a few misheard proper nouns ("Medal" for "Metal", "Edmound" for "Edmond"), there are no spelling mistakes, and punctuation was added automatically.

If the speaker talks too fast, or uses vocabulary the model does not know, the transcription can still contain unexpected errors. The good news is that AI is making a lot of progress, so such errors are becoming less and less frequent.

Why Use Speech To Text?

The quality of speech to text has improved dramatically in recent years, which has led to many interesting applications. Here are some examples:

Customer Support

Thanks to automatic speech recognition, you can now automatically analyze customer calls and extract valuable information from them. For example, you can automatically find out which support discussions went well and which ones didn't, so you can act accordingly.

Voice Message Analysis

It is sometimes hard to address every incoming voice message in a timely manner. With speech to text, you can automatically analyze each incoming message, extract its intent, categorize it, and detect its urgency, so you can easily adapt your response.

Medical Reports

It is very common for doctors to record their discussions with their patients, or to record a summary of the discussion. They can now automatically convert these recordings into text and then apply several kinds of post-processing, such as conversation summarization or entity extraction.