Providing STT services to end users #2337
wasertech started this conversation in Show and tell
I'd like to open this discussion about STT and the way we provide this technology to end users.
As you might have guessed by now, STT isn't made with the end user in mind. If you need a concrete example of this, you only need to look at the way we use our STT models in the wild.
The `stt` command can't access the microphone and can only transcribe from files. Plus, you have to specify which acoustic and language models to use every time, e.g.:
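Something along these lines — flag names as in Coqui STT's command-line client; the model, scorer and audio filenames are placeholders:

```sh
# Every invocation needs the acoustic model, the scorer and a pre-recorded file.
stt --model model.tflite --scorer huge-vocabulary.scorer --audio my_recording.wav
```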
Why is this an issue?
Currently, you need to build a custom stack (based on examples from STT-examples) to use pre-trained models (don't even get me started on custom models). This hinders the ability of anyone to use and expand upon the technology, as you need to be fluent in at least one programming language to do so; even a bare-bones transcription script looks like the sketch below.
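Roughly, with the `stt` Python package — a sketch assuming a 16-bit, 16 kHz mono WAV and placeholder model/scorer paths, using the DeepSpeech-style calls Coqui STT exposes:

```python
# Minimal transcription script with the Coqui STT Python API (sketch).
import wave

import numpy as np
import stt

# Placeholder paths: swap in your own acoustic model and scorer.
model = stt.Model("model.tflite")
model.enableExternalScorer("huge-vocabulary.scorer")

# Read a 16-bit, 16 kHz mono WAV into a raw sample buffer.
with wave.open("my_recording.wav", "rb") as wav:
    frames = wav.readframes(wav.getnframes())

audio = np.frombuffer(frames, dtype=np.int16)
print(model.stt(audio))
```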
How can we solve this issue?
Training any neural network is always going to be an involved process by nature. We won't change that fact; we can only ease the process.
For that purpose I have written the Training Wizard for STT which aims to be your personal wizard for training and fine-tuning your STT models. This piece of software takes care of everything you need to train your custom acoustic model and augment your scorers for the task at hand. (In its current form, the wizard aims to train models for the purposes of my Assistant using its Data Manager but can be easily expanded for any other use).
If you have shared your voice with the CommonVoice Project, the Training Wizard is able to download your personal data to fine-tune the acoustic models on, using Coqui's data importer.
Have a look at the Wizard in action (click on the picture).
Providing a unified interface to quickly start transcribing
With our custom models taken care of by the Training Wizard, we still need a unified way to infer from our models.
This is where Listen comes into play.
Listen provides STT models as a service, meaning it serves our models in the background, always ready to be queried.
With models at the ready, you can quickly and easily ask for any audio to be transcribed from anywhere.
Demonstration
Let's install `listen` first; the next step is to run the service so it can download any missing models and start serving them on a socket. Both steps are sketched below.
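A minimal sketch of both steps, assuming `listen` installs with `pip` (check the project's README for the exact package name or source) and using the service command shown later in this post:

```sh
# Install the listen package (assumption: it installs via pip;
# the exact package name or source may differ).
pip install listen

# Run the STT service: it downloads any missing models for your
# locale and serves them on a socket.
python -m listen.STT.as_service
```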
The first time you launch the service, it will error out with `NotAllowedToListenError`. This is a security mechanism to prevent (or at least slow down) anybody from recording you without your consent.
Set `is_allowed` to `true` inside the STT service configuration file `~/.assistant/stt.toml`; this authorizes the system to record your voice. A sketch of the relevant line is shown below.
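Only the key named above is grounded in this post; the file may contain other settings that are not shown here:

```toml
# ~/.assistant/stt.toml
# Allow the STT service to record from the microphone.
is_allowed = true
```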
You can now restart the service.
It will fetch the latest available pre-trained STT models for the language of your system.
Listen uses the environment variable `$LANG` to choose which models to download, using stt-models-locals and the Model Manager for STT. This means you can choose which models to fetch and serve by passing a custom value for `$LANG` directly before running the service:

```sh
❯ LANG="en_US.UTF-8" python -m listen.STT.as_service
```
If available, it will also fetch a punctuation model to punctuate your sentences.
Once the models have been downloaded, they will be served and you should be ready to transcribe anything using the `listen` client.
Using custom models with Listen
If you used the Training Wizard for STT, `listen` will automatically use your latest custom models. If you already trained models by hand and wish to use them with `listen`, you can simply put your `*.tflite` and `*.scorer` files into the directory `~/.assistant/models/${lang}/STT/`, where `$lang` is the language of the models (e.g. `en`, `fr`, `de`, etc.); `listen` will pick them up automatically, as sketched below.
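For example, installing hand-trained English models could look like this (the source filenames are placeholders; whether `listen` expects specific filenames inside that directory is not stated here):

```sh
# Copy hand-trained English models where listen looks for them.
# Source filenames are placeholders from your own training run.
mkdir -p ~/.assistant/models/en/STT/
cp output_graph.tflite ~/.assistant/models/en/STT/
cp kenlm.scorer ~/.assistant/models/en/STT/
```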
Getting transcription results from another program
My Assistant uses Listen to understand when I speak, which means Listen needs to be able to integrate with other programs.
Since we serve the models as a service, we only need to send a request with audio to the service to receive the transcription back.
Here is a Python example modelled on the Assistant listen sub-module:
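The sketch below assumes the service is reachable on a local TCP socket and uses a made-up, length-prefixed wire format purely for illustration; Listen's actual client helpers and protocol may differ, so prefer whatever the `listen` package ships:

```python
# Sketch of a client asking the Listen service for a transcription.
# ASSUMPTIONS: the address, the length-prefixed framing and the plain-text
# reply are invented for illustration and are not Listen's real API.
import socket

HOST, PORT = "127.0.0.1", 5063  # hypothetical address of the local STT service


def transcribe(wav_path: str) -> str:
    """Send a WAV file to the STT service and return the transcript."""
    with open(wav_path, "rb") as f:
        audio = f.read()

    with socket.create_connection((HOST, PORT)) as conn:
        # Announce the payload size, then stream the audio bytes.
        conn.sendall(len(audio).to_bytes(8, "big") + audio)
        conn.shutdown(socket.SHUT_WR)

        # Read the transcription until the service closes the connection.
        chunks = []
        while chunk := conn.recv(4096):
            chunks.append(chunk)

    return b"".join(chunks).decode("utf-8")


if __name__ == "__main__":
    print(transcribe("my_recording.wav"))
```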
Tell us about your experience
How easy is it for you to apply this technology to your own stack?
Do you have any particular pain points you would like to call out?
Can listen (and the Training Wizard) help your workflow? Why? How?
This discussion is here for you.