Providing STT services to end users #2337
wasertech started this conversation in Show and tell
I'd like to open this discussion about STT and the way we provide this technology to end users.
As you might have guessed by now, STT isn't made with the end user in mind. If you need a concrete example of this, you only need to look at the way we use our STT models in the wild.
The `stt` command can't access the microphone and can only transcribe from files. Plus, you have to specify which acoustic and language models to use every time, e.g.:
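Something along these lines — flag names as in Coqui STT's command-line client; the model, scorer and audio filenames are placeholders:

```sh
# Every invocation needs the acoustic model, the scorer and a pre-recorded file.
stt --model model.tflite --scorer huge-vocabulary.scorer --audio my_recording.wav
```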
Why is this an issue?
Currently, you need to build a custom stack (based on examples from STT-examples) to use pre-trained models (don't even get me started on custom models). This hinders the ability of anyone to use and expand upon the technology, as you need to be fluent in at least one programming language to do so; even a bare-bones transcription script looks like the sketch below.
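Roughly, with the `stt` Python package — a sketch assuming a 16-bit, 16 kHz mono WAV and placeholder model/scorer paths, using the DeepSpeech-style calls Coqui STT exposes:

```python
# Minimal transcription script with the Coqui STT Python API (sketch).
import wave

import numpy as np
import stt

# Placeholder paths: swap in your own acoustic model and scorer.
model = stt.Model("model.tflite")
model.enableExternalScorer("huge-vocabulary.scorer")

# Read a 16-bit, 16 kHz mono WAV into a raw sample buffer.
with wave.open("my_recording.wav", "rb") as wav:
    frames = wav.readframes(wav.getnframes())

audio = np.frombuffer(frames, dtype=np.int16)
print(model.stt(audio))
```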
How can we solve this issue?
Training any neural network is always going to be an involved process by nature. We won't change that fact; we can only ease the process.
For that purpose I have written the Training Wizard for STT which aims to be your personal wizard for training and fine-tuning your STT models. This piece of software takes care of everything you need to train your custom acoustic model and augment your scorers for the task at hand. (In its current form, the wizard aims to train models for the purposes of my Assistant using its Data Manager but can be easily expanded for any other use).
If you have shared your voice with the CommonVoice Project, the Training Wizard is able to download your personal data to fine-tune the acoustic models on, using Coqui's data importer.
Have a look at the Wizard in action (click on the picture).
Providing a unified interface to quickly start transcribing
With our custom models taken care of by the Training Wizard, we still need a unified way to infer from our models.
This is where Listen comes into play.
Listen provides STT models as a service, meaning it serves our models in the background, always ready to be queried.
With models at the ready, you can quickly and easily ask for any audio to be transcribed from anywhere.
Demonstration
Let's install `listen` first; the next step is to run the service so it can download any missing models and start serving them on a socket. Both steps are sketched below.
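A minimal sketch of both steps, assuming `listen` installs with `pip` (check the project's README for the exact package name or source) and using the service command shown later in this post:

```sh
# Install the listen package (assumption: it installs via pip;
# the exact package name or source may differ).
pip install listen

# Run the STT service: it downloads any missing models for your
# locale and serves them on a socket.
python -m listen.STT.as_service
```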
The first time you launch the service, it will error out with `NotAllowedToListenError`. This is a security mechanism to prevent (or at least slow down) anybody from recording you without your consent.
Set `is_allowed` to `true` inside the STT service configuration file `~/.assistant/stt.toml`; this authorizes the system to record your voice. A sketch of the relevant line is shown below.
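Only the key named above is grounded in this post; the file may contain other settings that are not shown here:

```toml
# ~/.assistant/stt.toml
# Allow the STT service to record from the microphone.
is_allowed = true
```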
You can now restart the service.
It will fetch the latest available pre-trained STT models for the language of your system.
Listen uses the environment variable `$LANG` to choose which models to download, using stt-models-locals and the Model Manager for STT. This means you can choose which models to fetch and serve by passing a custom value for `$LANG` directly before running the service:

```sh
❯ LANG="en_US.UTF-8" python -m listen.STT.as_service
```
If available, it will also fetch a punctuation model to punctuate your sentences.
Once the models have been downloaded, they will be served and you should be ready to transcribe anything using the `listen` client.
Using custom models with Listen
If you used the Training Wizard for STT, `listen` will automatically use your latest custom models. If you already trained models by hand and wish to use them with `listen`, you can simply put your `*.tflite` and `*.scorer` files into the directory `~/.assistant/models/${lang}/STT/`, where `$lang` is the language of the models (e.g. `en`, `fr`, `de`, etc.); `listen` will pick them up automatically, as sketched below.
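For example, installing hand-trained English models could look like this (the source filenames are placeholders; whether `listen` expects specific filenames inside that directory is not stated here):

```sh
# Copy hand-trained English models where listen looks for them.
# Source filenames are placeholders from your own training run.
mkdir -p ~/.assistant/models/en/STT/
cp output_graph.tflite ~/.assistant/models/en/STT/
cp kenlm.scorer ~/.assistant/models/en/STT/
```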
Getting transcription results from another program
My Assistant uses Listen to understand when I speak, which means Listen needs to be able to integrate with other programs.
Since we serve the models as a service, we only need to send a request with audio to the service to receive the transcription back.
Here is a Python example modelled on the Assistant listen sub-module:
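The sketch below assumes the service is reachable on a local TCP socket and uses a made-up, length-prefixed wire format purely for illustration; Listen's actual client helpers and protocol may differ, so prefer whatever the `listen` package ships:

```python
# Sketch of a client asking the Listen service for a transcription.
# ASSUMPTIONS: the address, the length-prefixed framing and the plain-text
# reply are invented for illustration and are not Listen's real API.
import socket

HOST, PORT = "127.0.0.1", 5063  # hypothetical address of the local STT service


def transcribe(wav_path: str) -> str:
    """Send a WAV file to the STT service and return the transcript."""
    with open(wav_path, "rb") as f:
        audio = f.read()

    with socket.create_connection((HOST, PORT)) as conn:
        # Announce the payload size, then stream the audio bytes.
        conn.sendall(len(audio).to_bytes(8, "big") + audio)
        conn.shutdown(socket.SHUT_WR)

        # Read the transcription until the service closes the connection.
        chunks = []
        while chunk := conn.recv(4096):
            chunks.append(chunk)

    return b"".join(chunks).decode("utf-8")


if __name__ == "__main__":
    print(transcribe("my_recording.wav"))
```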
Tell us about your experience
How easy is it for you to apply this technology to your own stack?
Do you have any particular pain points you would like to call out?
Can listen (and the Training Wizard) help your workflow? Why? How?
This discussion is here for you.