Automatic Speech Recognition (Speech To Text) API

What Is Automatic Speech Recognition (Speech To Text)?

Automatic speech recognition (also known as speech to text) extracts text from an audio file. It is often a critical first step in an AI pipeline. Great progress has been made over the last few years, and it is now possible to extract text from an audio or video file with high accuracy.

For example, here is a chapter from a LibriVox audio book (The Metal Giants, by Edmond Hamilton), stored on Archive.org: https://ia801400.us.archive.org/10/items/metalgiants_2209_librivox/metalgiants_03_hamilton_64kb.mp3.

Once we perform automatic speech recognition on this file, we get the following text:

Chapter three of The Medal Giants by Edmound Hamilton. This Librivox recording is in the public domain. Read by Ben Tucker. Chapter three: Lanier arrived in Stockton early the next morning. His face was drawn and haggard, as it had been since he first read a certain humorous newspaper despatch, and in his mind was an immense perplexity, a vague, chilling fear. Until late in the afternoon, he tramped warily through the town, asking in all quarters the same question: do you know of anyone named Det Mold who lives in or around Stockton? A tall, strong man? And from all he questioned, he got no trace until he happened into the office of a small trucking and hauling company. None of them knew anything of Deadmold, but they had done some work for a certain Foster who corresponded exactly to Lanier's description. This man lived several miles from the city in a northeastern direction and had hired them to haul some boxes from the railroad to his home, an old farmhouse. A mighty bad road it was too, and this Foster had been very particular about the moving of his stuff. Yes, they could direct him to the place. He went out such and such a concrete road and turned up a ruddy lane, very steep. By the time the sun hung poised above the western horizon, Lanier was already ascending that steep, twisted road. More than once he glanced back at the city below. A city bathed in the golden afternoon sunlight. Its streets were filled now with workers returning home from the mills, tired and blackened, calling out to the friends they met for the latest news on that "Morgan criter" as they termed it. A quiet serenity, a dreamy contented peace, pervaded Stockton, contrasting with the tense excitement of the preceding night. In a thousand homes the evening meal was being prepared, and the day's gossip related. In the west, the sun sank lower and lower, and all around, beyond the encircling hills, death marched toward the city with crashing giant strides. End of chapter three.

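This is a great text extraction, not only because almost every word is spelled correctly (only a few proper nouns, such as the book title and the author's first name, are transcribed wrongly), but also because punctuation was automatically added.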

If the speaker talks too fast, or uses vocabulary the model does not know, the result can contain unexpected errors. The good news is that AI is progressing quickly, so such errors are becoming less and less frequent.
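One common way to put a number on such errors is the word error rate (WER): the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal illustration and not the evaluation method used for the example above. The `word_error_rate` function is a hypothetical helper, and the reference sentence is an assumption reconstructed from the book title and author name given earlier.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] is the edit distance between the reference words processed
    # so far (minus the current one) and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# Assumed reference wording (based on the title and author named above)
# compared with the opening words of the automatic transcript.
reference = "chapter three of the metal giants by edmond hamilton"
hypothesis = "chapter three of the medal giants by edmound hamilton"
wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.2%}")
```

Here two of the nine reference words are substituted, so the score reflects exactly the kind of proper-noun mistakes described above. A lower WER means a more accurate transcript.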

Why Use Speech To Text?

The quality of speech to text has improved dramatically in recent years, which has led to many interesting applications. Here are some examples:

Customer Support

Thanks to automatic speech recognition, you can now automatically analyze customer calls and extract valuable information. For example, you can automatically find out which support discussions went well and which ones didn't, so you can act accordingly.

Voice Message Analysis

It's sometimes hard to address all incoming voice messages in a timely manner. But you can automatically analyze each incoming message to extract the intent, categorize it, detect the urgency, and so on, so you can easily adapt your response.

Medical Reports

It is very common for doctors to record their discussions with their patients, or to record a summary of the discussion. They can now automatically convert these recordings into text and then apply several kinds of post-processing, such as conversation summarization or entity extraction.

Automatic Speech Recognition With Wav2Vec2 And XLS-R

Wav2Vec2 and XLS-R are great AI technologies released by Meta/Facebook in order to dramatically improve automatic speech recognition. Wav2Vec2 was first trained on English audio, and XLS-R extends the same approach to many other languages.

These models can easily be fine-tuned on specific data, as was done for the Wav2Vec2 XLS-R 1B English model released by Jonatas Grosman. You can find it here. This model was fine-tuned on popular datasets such as Common Voice, Librispeech, and VoxPopuli, and gives great results.

Automatic Speech Recognition API

Building an inference API for automatic speech recognition becomes interesting as soon as you want to use speech to text in production. But keep in mind that building such an API is not necessarily easy. First, you need to code the API (the easy part). Second, you need to build a highly available, fast, and scalable infrastructure to serve your models under the hood (the hardest part). Machine learning models consume a lot of resources (memory, disk space, CPU, GPU...), which makes it hard to achieve high availability and low latency at the same time.
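To give an idea of the "easy part", here is a minimal sketch of an HTTP endpoint built only with the Python standard library. Everything in it is an assumption made for illustration: the JSON shape (`{"url": ...}` in, `{"text": ...}` out), the function names, and the `transcribe` placeholder, which a real service would replace with an actual model call. It deliberately ignores the hard part (scaling, redundancy, GPU scheduling).

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Tuple


def transcribe(audio_url: str) -> str:
    """Hypothetical placeholder: a real service would fetch the audio and run a model here."""
    return f"(transcript of {audio_url})"


def handle_asr_request(body: bytes) -> Tuple[int, dict]:
    """Validate the JSON payload and return an HTTP status with a JSON-serializable response."""
    try:
        payload = json.loads(body)
    except ValueError:  # covers invalid JSON and undecodable bytes
        return 400, {"error": "body must be valid JSON"}
    if not isinstance(payload, dict) or not isinstance(payload.get("url"), str):
        return 400, {"error": "expected a JSON object with a 'url' string"}
    return 200, {"text": transcribe(payload["url"])}


class AsrHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read exactly the number of bytes announced by the client.
        length = int(self.headers.get("Content-Length", 0))
        status, response = handle_asr_request(self.rfile.read(length))
        data = json.dumps(response).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(port: int = 8000) -> None:
    """Start a single-process server. Not called here, so the sketch can be imported safely."""
    HTTPServer(("", port), AsrHandler).serve_forever()
```

Calling `serve()` starts the server on port 8000. Keeping the request logic in `handle_asr_request`, separate from the HTTP handler, makes it easy to test without opening a socket.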

Leveraging such an API is very interesting because it is completely decoupled from the rest of your stack (microservice architecture), so you can easily scale it independently and ensure high availability of your models through redundancy. An API is also the way to go in terms of language interoperability. Most machine learning frameworks are developed in Python, but it's likely that you want to access them from other languages like JavaScript, Go, Ruby... In such situations, an API is a great solution.
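Because the contract is just JSON over HTTP, a client needs nothing beyond an HTTP library, whatever the language. The sketch below shows the idea in Python with the standard library only. The endpoint URL, the `Token` authorization scheme, and the `{"url": ...}` / `{"text": ...}` payloads are hypothetical placeholders, not a documented API, so adapt them to the service you actually call.

```python
import json
import urllib.request


def build_asr_request(api_url: str, audio_url: str, token: str) -> urllib.request.Request:
    """Build a POST request asking a (hypothetical) ASR endpoint to transcribe an audio URL."""
    payload = json.dumps({"url": audio_url}).encode("utf-8")
    return urllib.request.Request(
        api_url,
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Token {token}",  # assumed auth scheme
        },
        method="POST",
    )


def transcribe_remote(request: urllib.request.Request, timeout: float = 60.0) -> str:
    """Send the request and return the 'text' field of the (assumed) JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["text"]


# Building the request does not touch the network; only transcribe_remote does.
request = build_asr_request(
    "https://asr.example.com/v1/asr",  # hypothetical endpoint
    "https://ia801400.us.archive.org/10/items/metalgiants_2209_librivox/metalgiants_03_hamilton_64kb.mp3",
    "<your_token>",
)
```

The same request can be reproduced from JavaScript, Go, or Ruby with their own HTTP clients, which is the point of exposing the model behind an API.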

NLP Cloud's Speech To Text API

NLP Cloud offers a speech to text API that lets you perform automatic speech recognition out of the box, based on the Wav2Vec2 XLS-R 1B English model by Jonatas Grosman.
This model is very computationally intensive, so a GPU is needed to get a decent response time. You can either use the pre-trained model, train your own model, or upload your own custom model!

For more details, see our documentation about automatic speech recognition here.

Testing speech to text locally is one thing, but using it reliably in production is another. With NLP Cloud, you can do both!