Automatic Speech Recognition (Speech to Text)
For large files (over 100MB or longer than 200 seconds), you will need to use asynchronous mode: see the documentation for details.
The API also returns word-level timestamps you can use for subtitling.
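As an illustration of the subtitling use case, here is a minimal sketch that groups timed words into SubRip (SRT) cues. It assumes each word arrives as a dictionary with "text", "start", and "end" keys, with times in seconds; these field names are an assumption made for this example, so check the documentation for the actual response format.

```python
def format_srt_time(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def words_to_srt(words, max_words=7):
    """Group timed words into numbered SRT cues of at most max_words words.

    Each word is assumed (hypothetically) to look like:
    {"text": "Chapter", "start": 0.0, "end": 0.5}
    """
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        start = format_srt_time(chunk[0]["start"])  # cue starts with its first word
        end = format_srt_time(chunk[-1]["end"])     # cue ends with its last word
        text = " ".join(w["text"].strip() for w in chunk)
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)


if __name__ == "__main__":
    sample = [
        {"text": "Chapter", "start": 0.0, "end": 0.5},
        {"text": "three", "start": 0.5, "end": 1.25},
    ]
    print(words_to_srt(sample))
```

Splitting on a fixed word count keeps the sketch short; a real subtitle generator would probably also break cues on punctuation or long pauses.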
The API also accepts base64-encoded files instead of URLs.
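Sending a file this way could look like the sketch below, which base64-encodes raw audio bytes into a JSON-ready payload. The field name "encoded_file" is an assumption made for this example and is not confirmed by this page; check the documentation for the exact parameter name.

```python
import base64
import json


def encoded_file_payload(audio_bytes):
    """Build a JSON-serializable payload that carries the file as base64 text.

    The "encoded_file" key is a hypothetical field name used for illustration.
    """
    encoded = base64.b64encode(audio_bytes).decode("ascii")
    return {"encoded_file": encoded}


if __name__ == "__main__":
    # In practice the bytes would come from open("audio.mp3", "rb").read().
    payload = encoded_file_payload(b"fake audio bytes")
    print(json.dumps(payload))
```

Keep in mind that base64 encoding makes the request body roughly one third larger than the original file.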
What Is Automatic Speech Recognition (Speech To Text)?
Automatic speech recognition (also known as speech to text) is about extracting text from an audio file. This is often a critical first step in an AI pipeline. Great progress has been made over the last few years, and it is now possible to extract text from an audio or video file with great accuracy.
For example, here is a chapter from a LibriVox audio book (The Metal Giants, by Edmond Hamilton), stored on Archive.org: https://ia801400.us.archive.org/10/items/metalgiants_2209_librivox/metalgiants_03_hamilton_64kb.mp3.
Once we perform automatic speech recognition on this file, we get the following text:
Chapter three of The Medal Giants by Edmound Hamilton. This Librivox recording is in the public domain. Read by Ben Tucker. Chapter three: Lanier arrived in Stockton early the next morning. His face was drawn and haggard, as it had been since he first read a certain humorous newspaper despatch, and in his mind was an immense perplexity, a vague, chilling fear. Until late in the afternoon, he tramped warily through the town, asking in all quarters the same question: do you know of anyone named Det Mold who lives in or around Stockton? A tall, strong man? And from all he questioned, he got no trace until he happened into the office of a small trucking and hauling company. None of them knew anything of Deadmold, but they had done some work for a certain Foster who corresponded exactly to Lanier's description. This man lived several miles from the city in a northeastern direction and had hired them to haul some boxes from the railroad to his home, an old farmhouse. A mighty bad road it was too, and this Foster had been very particular about the moving of his stuff. Yes, they could direct him to the place. He went out such and such a concrete road and turned up a ruddy lane, very steep. By the time the sun hung poised above the western horizon, Lanier was already ascending that steep, twisted road. More than once he glanced back at the city below. A city bathed in the golden afternoon sunlight. Its streets were filled now with workers returning home from the mills, tired and blackened, calling out to the friends they met for the latest news on that "Morgan criter" as they termed it. A quiet serenity, a dreamy contented peace, pervaded Stockton, contrasting with the tense excitement of the preceding night. In a thousand homes the evening meal was being prepared, and the day's gossip related. In the west, the sun sank lower and lower, and all around, beyond the encircling hills, death marched toward the city with crashing giant strides. End of chapter three.
This is a strong transcription: the text reads fluently and punctuation was added automatically, although a few proper nouns are misheard (for example, "Medal Giants" instead of "Metal Giants").
If the speaker talks too fast, or uses vocabulary the model does not know, the transcription can contain unexpected errors. The good news is that speech recognition models keep improving, so such errors are becoming less and less frequent.
Why Use Speech To Text?
The quality of speech to text has improved dramatically in recent years, which has led to many interesting applications. Here are some examples:
Customer Support
Thanks to automatic speech recognition, you can now automatically analyze customer calls and extract valuable information from them. For example, you can automatically find out which support discussions went well and which ones didn't, so you can act accordingly.
Voice Message Analysis
It can be hard to handle every incoming voice message in a timely manner. With speech to text, you can automatically analyze each message to extract the intent, categorize it, detect its urgency, etc., so you can easily adapt your response.
Medical Reports
It is very common for doctors to record their discussions with their patients, or to record a summary of the discussion. They can now automatically convert these recordings into text and then apply several kinds of post-processing, such as conversation summarization, entity extraction, etc.
Use GPU
Controls whether the model runs on a GPU. Machine learning models run much faster on GPUs.
URL
The URL of your audio or video file. In synchronous mode, the file size must be at most 100MB and the duration at most 200 seconds. In asynchronous mode, the file size must be at most 600MB and the duration at most 60,000 seconds. The input language is automatically detected.
Note that services like YouTube, Google Drive, Dropbox, etc. do not give access to the underlying raw file. So if your file is stored there, you will first need to export it and serve it yourself, either from your own server or from an online bucket like AWS S3, Google Cloud Storage, etc. If you are unsure about how to handle this, please contact us!
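The size and duration limits given for the URL parameter can be checked on the client side before sending a request. The sketch below is one possible way to pick a mode. It assumes that "MB" means decimal megabytes (1,000,000 bytes), which this page does not specify, and the function and constant names are made up for this example.

```python
MB = 1_000_000  # assumption: decimal megabytes, the page does not say which unit is meant

# Limits taken from the URL parameter description.
SYNC_LIMITS = {"size": 100 * MB, "duration": 200}
ASYNC_LIMITS = {"size": 600 * MB, "duration": 60_000}


def fits(size_bytes, duration_seconds, limits):
    """Return True if both the size and the duration are within the limits."""
    return size_bytes <= limits["size"] and duration_seconds <= limits["duration"]


def choose_mode(size_bytes, duration_seconds):
    """Return "synchronous" or "asynchronous", or raise if the file is too large."""
    if fits(size_bytes, duration_seconds, SYNC_LIMITS):
        return "synchronous"
    if fits(size_bytes, duration_seconds, ASYNC_LIMITS):
        return "asynchronous"
    raise ValueError("file exceeds the asynchronous mode limits")


if __name__ == "__main__":
    print(choose_mode(50 * MB, 120))   # a short, small file
    print(choose_mode(50 * MB, 3600))  # a small but long file
```

Checking both size and duration matters because a file can be small yet long (low bitrate audio) or short yet large (high resolution video).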
Input Language
The language of your file, as an ISO code. Optional: if no input language is passed, the model will try to detect the language automatically.
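Since the input language is optional, a request body can simply leave it out when it is unknown. Here is a minimal sketch of that idea; the field names "url" and "input_language" are assumptions made for this example, so check the documentation for the exact parameter names.

```python
def build_asr_payload(url, input_language=None):
    """Build a request body, including the language only when it is provided.

    "url" and "input_language" are hypothetical field names used for illustration.
    """
    payload = {"url": url}
    if input_language:
        # Normalize the ISO code so that " EN " and "en" are treated the same.
        payload["input_language"] = input_language.strip().lower()
    return payload


if __name__ == "__main__":
    print(build_asr_payload("https://example.com/audio.mp3"))
    print(build_asr_payload("https://example.com/audio.mp3", input_language="EN"))
```

Omitting the key entirely, rather than sending an empty value, leaves the decision to the automatic language detection.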