Wake Word Detector

Background

Personal assistant devices like Google Home, Amazon Alexa, and Apple HomePod constantly listen for a specific set of wake words such as “Ok, Google”, “Alexa”, or “Hey Siri”. Once this sequence of words is detected, the device prompts the user for the next command and responds appropriately.

Introduction

The aim is to create an open-source custom wake word detector that takes audio as input and prompts the user once the sequence of words is detected.

The goal is to provide a configurable custom detector so that anyone can use it in their own application to perform operations once the configured wake words are detected.

Related Work

  • Firefox Voice
    • The model was trained on the Mozilla Common Voice dataset, using the PyTorch library (see the Howl paper) to extract audio features and train a res8 model. A custom MeydaMelSpectrogram transform was used to train the model.
    • Used Meyda, an audio feature extraction library for the Web Audio API, for client-side feature extraction. Mel-frequency cepstral coefficients (MFCCs) are extracted from the audio stream.
    • Used Honkling (written purely in JavaScript) to run inference on a model created with TensorFlow.js, copying the PyTorch model weights above into the TensorFlow.js model.
  • This project
    • The model was trained on the MCV dataset plus data generated with Google Text-to-Speech. Used the PyTorch library to extract audio features and train a 2-layer CNN. Log mel spectrograms were used to train the model.
    • Server-side inference: used WebSockets to stream audio from the browser to the backend and ran inference on the model there.
    • Client-side inference
      • Used magenta-js for audio feature extraction; log mel spectrograms are extracted from the audio stream.
      • Converted the PyTorch model to an Open Neural Network Exchange (ONNX) model and used microsoft/onnxjs to run inference on the ONNX model client side.
      • Converted the ONNX model to a TensorFlow model and used TensorFlow.js to run inference on it.
      • Converted the TensorFlow model to a TFLite model and used the TFLite JS runtime to run inference on it.

Implementation

Preparing labelled dataset

Using the Mozilla Common Voice dataset:

  • Go through each wake word and check the transcripts for a match
  • If a match is found, the sample goes into the positive dataset
  • If not, the sample goes into the negative dataset
  • Load the corresponding mp3 files and trim the silent parts
  • Save the audio as a .wav file and the transcript as a .lab file
  • Code reference: fetch_dataset_mcv.py; a minimal sketch of this flow follows the list
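  • A minimal sketch of this flow, assuming librosa and soundfile are installed (the wake words and paths are illustrative; the real logic lives in fetch_dataset_mcv.py)
    import os
    import librosa
    import soundfile as sf
    
    WAKE_WORDS = ["hey", "fourth", "brain"]  # assumption: example wake-word sequence
    
    def prepare_sample(mp3_path, transcript, out_dir):
        # Positive if the transcript contains any wake word, negative otherwise
        label = "positive" if any(w in transcript.lower() for w in WAKE_WORDS) else "negative"
        audio, sr = librosa.load(mp3_path, sr=16000, mono=True)
        trimmed, _ = librosa.effects.trim(audio)  # trim leading/trailing silence
        base = os.path.splitext(os.path.basename(mp3_path))[0]
        os.makedirs(os.path.join(out_dir, label, "audio"), exist_ok=True)
        sf.write(os.path.join(out_dir, label, "audio", base + ".wav"), trimmed, sr)
        with open(os.path.join(out_dir, label, "audio", base + ".lab"), "w") as f:
            f.write(transcript)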

Word Alignment

  • For the positive dataset, used the Montreal Forced Aligner (MFA) to get timestamps of each word in the audio.
  • Download the stable version
    wget https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner/releases/download/v1.0.1/montreal-forced-aligner_linux.tar.gz
    tar -xf montreal-forced-aligner_linux.tar.gz
    rm montreal-forced-aligner_linux.tar.gz
  • Download the Librispeech Lexicon dictionary
    wget https://www.openslr.org/resources/11/librispeech-lexicon.txt
  • Known issues in MFA
    # known mfa issue https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner/issues/109
    cp montreal-forced-aligner/lib/libpython3.6m.so.1.0 montreal-forced-aligner/lib/libpython3.6m.so
    cd montreal-forced-aligner/lib/thirdparty/bin && rm libopenblas.so.0 && ln -s ../../libopenblasp-r0-8dca6697.3.0.dev.so libopenblas.so.0
  • Creating aligned data
    montreal-forced-aligner/bin/mfa_align -q positive/audio librispeech-lexicon.txt montreal-forced-aligner/pretrained_models/english.zip aligned_data

Generated textgrid file

Fix data imbalance

Check for data imbalance: if the dataset does not have enough samples containing the wake words, consider using a text-to-speech service to generate more samples.

  • Used the Google Text-to-Speech API; set the environment variable GOOGLE_APPLICATION_CREDENTIALS to your key file.
  • Used various speaking rates, pitches, and voices to generate data for the wake words.
  • Code reference: generate_dataset_google_tts.py; a hedged sketch of such a sweep follows the list
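  • A minimal sketch of such a sweep, assuming the google-cloud-texttospeech package (the phrase, voice name, and parameter values are illustrative; the real logic lives in generate_dataset_google_tts.py)
    from google.cloud import texttospeech
    
    client = texttospeech.TextToSpeechClient()  # reads GOOGLE_APPLICATION_CREDENTIALS
    
    def synthesize(text, voice_name, rate, pitch, out_path):
        response = client.synthesize_speech(
            input=texttospeech.SynthesisInput(text=text),
            voice=texttospeech.VoiceSelectionParams(language_code="en-US", name=voice_name),
            audio_config=texttospeech.AudioConfig(
                audio_encoding=texttospeech.AudioEncoding.LINEAR16,
                sample_rate_hertz=16000,
                speaking_rate=rate,
                pitch=pitch,
            ),
        )
        with open(out_path, "wb") as f:
            f.write(response.audio_content)  # LINEAR16 responses include a WAV header
    
    # sweep a few rates and pitches for one (hypothetical) voice and phrase
    for rate in (0.8, 1.0, 1.2):
        for pitch in (-2.0, 0.0, 2.0):
            synthesize("hey fourth brain", "en-US-Wavenet-D", rate, pitch,
                       f"tts_{rate}_{pitch}.wav")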

Extract audio features

  • Below is how the sound looks when plotted with time (x-axis) against amplitude (y-axis)
    import librosa
    import matplotlib.pyplot as plt
    sounddata = librosa.core.load("hey.wav", sr=16000, mono=True)[0]
    
    # plotting the signal in time series
    plt.plot(sounddata)
    plt.title('Signal')
    plt.xlabel('Time (samples)')
    plt.ylabel('Amplitude')
  • When the short-time Fourier transform (STFT) is computed, below is how the spectrogram looks
    from torchaudio.transforms import Spectrogram
    spectrogram  = Spectrogram(n_fft=512,hop_length=200)
    spectrogram.to(device)
    
    inp = torch.from_numpy(sounddata).float().to(device)
    hey_spectrogram = spectrogram(inp.float())
    plot_spectrogram(hey_spectrogram.cpu(), title="Spectrogram")
  • A mel spectrogram is a spectrogram where the frequencies are converted to the mel scale.
    from torchaudio.transforms import MelSpectrogram
    mel_spectrogram  = MelSpectrogram(n_mels=40,sample_rate=16000,
                                    n_fft=512,hop_length=200,
                                    norm="slaney")
    
    mel_spectrogram.to(device)
    inp = torch.from_numpy(sounddata).float().to(device)
    hey_mels_slaney = mel_spectrogram(inp.float())
    plot_spectrogram(hey_mels_slaney.cpu(), title="MelSpectrogram", ylabel='mel freq')
  • After adding a small offset and taking the log of the mel values, below is how the final mel spectrogram looks
    log_offset = 1e-7
    log_hey_mel_specgram = torch.log(hey_mels_slaney + log_offset)
    plot_spectrogram(log_hey_mel_specgram.cpu(), title="MelSpectrogram (Log)", ylabel='mel freq')

Audio transformations

  • Used MelSpectrogram from torchaudio to generate the mel spectrogram
  • Hyperparameters
    Sample rate = 16000 (16 kHz)
    Max window length = 750 ms (12000 samples)
    Number of mel bins = 40
    Hop length = 200
    Mel spectrogram matrix size = 40 x 61
    
  • Used zero-mean, unit-variance scaling on the values
  • Code references: transformers.py and audio_collator.py; an illustrative sketch of the pipeline follows the list
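  • A sketch of the transform pipeline implied by these hyperparameters (the function name is illustrative, not the one in transformers.py)
    import torch
    from torchaudio.transforms import MelSpectrogram
    
    SAMPLE_RATE = 16000
    WINDOW_SAMPLES = 12000  # 750 ms at 16 kHz
    mel = MelSpectrogram(sample_rate=SAMPLE_RATE, n_mels=40,
                         n_fft=512, hop_length=200, norm="slaney")
    
    def transform(waveform: torch.Tensor) -> torch.Tensor:
        # waveform: [12000] samples -> log-mel spectrogram of shape [40, 61]
        spec = torch.log(mel(waveform) + 1e-7)
        # zero-mean, unit-variance scaling
        return (spec - spec.mean()) / (spec.std() + 1e-7)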

Define model architecture

  • Given the above transformations, a mel spectrogram of size 40x61 is fed to the model
  • Below is the CNN model used
  • Code reference: model.py; a shape-compatible sketch follows the list
  • Below is the CNN model summary
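  • A hypothetical 2-layer CNN with the same input/output shapes ([N, 1, 40, 61] in, 4 classes out); the actual architecture is defined in model.py
    import torch
    import torch.nn as nn
    
    class WakeWordCNN(nn.Module):
        def __init__(self, n_classes: int = 4):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            )
            # 40x61 -> 20x30 -> 10x15 after two 2x max-pools
            self.classifier = nn.Linear(32 * 10 * 15, n_classes)
    
        def forward(self, x):                      # x: [N, 1, 40, 61]
            x = self.features(x)
            return self.classifier(x.flatten(1))   # logits: [N, n_classes]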

Train model

  • Used a batch size of 16, so a tensor of size [16, 1, 40, 61] is fed to the model
  • Trained for 20 epochs; below is how the train vs. validation loss looks without noise
  • As you can see, without noise there is an overfitting problem
  • It is resolved after adding noise; below is how the train vs. validation loss looks
  • Code reference: train.py; a minimal sketch of the loop follows the list
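  • A minimal sketch of a standard training loop under these settings (train_loader and the model class are assumptions; the real loop is in train.py)
    import torch
    import torch.nn as nn
    
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = WakeWordCNN().to(device)        # e.g. the sketch model above, or model.py
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    
    for epoch in range(20):
        model.train()
        for specs, labels in train_loader:  # assumed DataLoader of ([16, 1, 40, 61], [16]) batches
            specs, labels = specs.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(specs), labels)
            loss.backward()
            optimizer.step()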

Test Model

  • Below is how the model performed on the test dataset; the model achieved 87% accuracy (an evaluation sketch follows this list)
  • Below is the confusion matrix
  • Below is the ROC curve
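  • A sketch of how such metrics can be computed with scikit-learn (y_true, y_score, and the wake-word class index are assumptions; collect them from the test loader)
    import numpy as np
    from sklearn.metrics import accuracy_score, confusion_matrix, roc_curve, auc
    
    def evaluate(y_true: np.ndarray, y_score: np.ndarray, wake_class: int = 1):
        # y_true: [N] integer labels; y_score: [N, 4] class probabilities from the model
        y_pred = np.argmax(y_score, axis=1)
        print("accuracy:", accuracy_score(y_true, y_pred))
        print("confusion matrix:\n", confusion_matrix(y_true, y_pred))
        # one-vs-rest ROC for the (hypothetical) wake-word class
        fpr, tpr, _ = roc_curve((y_true == wake_class).astype(int), y_score[:, wake_class])
        print("AUC:", auc(fpr, tpr))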

Inference

Below are the methods used to run the above model on live streaming audio.

Using PyAudio

  • Used PyAudio to get input from the microphone
  • Capture a 750 ms window of the audio buffer
  • After n batches, apply the transformations and run inference on the model (a capture sketch follows this list)
  • Code reference: infer.py
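  • A minimal sketch of capturing 750 ms windows with PyAudio at 16 kHz, mono, 16-bit (the downstream call into the transformations and model lives in infer.py)
    import numpy as np
    import pyaudio
    
    RATE = 16000
    WINDOW = 12000  # 750 ms at 16 kHz
    
    p = pyaudio.PyAudio()
    stream = p.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                    input=True, frames_per_buffer=WINDOW)
    try:
        while True:
            raw = stream.read(WINDOW, exception_on_overflow=False)
            window = np.frombuffer(raw, dtype=np.int16).astype(np.float32) / 32768.0
            # hand `window` to the mel-spectrogram transform and the model (see infer.py)
    finally:
        stream.stop_stream()
        stream.close()
        p.terminate()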

Using web sockets

  • Used Flask-SocketIO on the server to capture the audio buffer from the client.
  • On the client, used socket.io to send the audio buffer over the socket connection.
  • Capture the audio buffer using getUserMedia, convert it to an ArrayBuffer, and stream it to the server.
  • Inference happens on the server after n batches of the 750 ms window.
  • If the sequence is detected, send a detection prompt to the client.
  • Server Code - application.py
  • Client Code - main.js
  • To run this locally
    cd server
    python -m venv .venv
    .venv/bin/pip install -r requirements.txt
    FLASK_ENV=development FLASK_APP=application.py .venv/bin/flask run --port 8011
  • Use the Dockerfile & Dockerrun.aws.json to containerize the app and deploy it to AWS Elastic Beanstalk
  • Elastic Beanstalk initialize app
    eb init -p docker-19.03.13-ce wakebot-app --region us-west-2
  • Create Elastic Beanstalk instance
    eb create wakebot-app --instance_type t2.large --max-instances 1
  • A disadvantage of this method is privacy, since the audio buffer is sent to the server for inference; a server-side sketch follows
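  • A minimal sketch of the server side of this setup with Flask-SocketIO (the "audio" event name and the detect() helper are assumptions; the real handlers are in application.py)
    from flask import Flask
    from flask_socketio import SocketIO, emit
    
    app = Flask(__name__)
    socketio = SocketIO(app, cors_allowed_origins="*")
    
    def detect(buffer: bytes) -> bool:
        # placeholder: buffer windows, run the transformations and the model
        return False
    
    @socketio.on("audio")
    def handle_audio(buffer):
        # buffer: raw audio bytes streamed from the browser over socket.io
        if detect(buffer):
            emit("wakeword", {"detected": True})  # prompt the client
    
    if __name__ == "__main__":
        socketio.run(app, port=8011)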

Using ONNX

  • Used PyTorch's ONNX export to convert the PyTorch model to an ONNX model (an export sketch follows this list)
  • PyTorch to ONNX conversion code: convert_to_onnx.py
  • Once converted, the ONNX model can be used client side for inference
  • On the client, used onnx.js to run inference
  • Capture the audio buffer on the client using getUserMedia and convert it to an ArrayBuffer
  • Used fft.js to compute the Fourier transform
  • Used methods from Magenta.js audio utils to compute audio transformations such as mel spectrograms
  • Below is the comparison of client-side vs. server-side audio transformations
  • Client side code - main.js
  • To run locally
    cd standalone
    python -m venv .venv
    .venv/bin/pip install -r requirements.txt
    FLASK_ENV=development FLASK_APP=application.py .venv/bin/flask run --port 8011
  • To deploy to AWS Elastic Beanstalk, first initialize app
    eb init -p python-3.7 wakebot-std-app --region us-west-2
  • Create Elastic Beanstalk instance
    eb create wakebot-std-app --instance_type t2.large --max-instances 1
  • Refer to standalone_onnx for a client version without Flask; you can deploy it on any static server, or even to IPFS
  • The recent version shows plots and the audio buffer for each wake word the model inferred; click a wake word button to see which buffer was inferred for that word.
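  • A hedged sketch of the PyTorch-to-ONNX export (the actual script is convert_to_onnx.py; the dynamic batch axis is an assumption, and the input/output names match the signature shown in the next section)
    import torch
    
    # `model` is the trained PyTorch wake-word model (e.g. loaded from a checkpoint)
    model.eval()
    dummy = torch.randn(1, 1, 40, 61)  # [N, 1, n_mels, frames]
    torch.onnx.export(model, dummy, "onnx_model.onnx",
                      input_names=["input"], output_names=["output"],
                      dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}})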

Using tensorflowjs

  • Used onnx-tensorflow to convert the ONNX model to a TensorFlow model
  • ONNX to TensorFlow conversion code: convert_onnx_to_tf.py
    import onnx
    from onnx_tf.backend import prepare
    
    onnx_model = onnx.load("onnx_model.onnx")  # load onnx model
    tf_rep = prepare(onnx_model)  # prepare tf representation
    
    # Input nodes to the model
    print("inputs:", tf_rep.inputs)
    
    # Output nodes from the model
    print("outputs:", tf_rep.outputs)
    
    # All nodes in the model
    print("tensor_dict:")
    print(tf_rep.tensor_dict)
    
    tf_rep.export_graph("hey_fourth_brain")  # export the model
    
  • Verify the model using the command below
    python .venv/lib/python3.8/site-packages/tensorflow/python/tools/saved_model_cli.py show --dir hey_fourth_brain --all
    
    Output
    MetaGraphDef with tag-set: 'serve' contains the following SignatureDefs:
    
    signature_def['__saved_model_init_op']:
    The given SavedModel SignatureDef contains the following input(s):
    The given SavedModel SignatureDef contains the following output(s):
        outputs['__saved_model_init_op'] tensor_info:
            dtype: DT_INVALID
            shape: unknown_rank
            name: NoOp
    Method name is: 
    
    signature_def['serving_default']:
    The given SavedModel SignatureDef contains the following input(s):
        inputs['input'] tensor_info:
            dtype: DT_FLOAT
            shape: (-1, 1, 40, 61)
            name: serving_default_input:0
    The given SavedModel SignatureDef contains the following output(s):
        outputs['output'] tensor_info:
            dtype: DT_FLOAT
            shape: (-1, 4)
            name: PartitionedCall:0
    Method name is: tensorflow/serving/predict
    
    Defined Functions:
    Function Name: '__call__'
            Named Argument #1
            input
    
    Function Name: 'gen_tensor_dict'
    
  • Refer to onnx_to_tf for the generated files
  • Test the converted model using test_tf.py
  • Used tensorflowjs[wizard] to convert the SavedModel to a web model
    (.venv) (base) ➜  onnx_to_tf git:(main) ✗ tensorflowjs_wizard 
    Welcome to TensorFlow.js Converter.
    ? Please provide the path of model file or the directory that contains model files. 
    If you are converting TFHub module please provide the URL.  hey_fourth_brain
    ? What is your input model format? (auto-detected format is marked with *)  Tensorflow Saved Model *
    ? What is tags for the saved model?  serve
    ? What is signature name of the model?  signature name: serving_default
    ? Do you want to compress the model? (this will decrease the model precision.)  No compression (Higher accuracy)
    ? Please enter shard size (in bytes) of the weight files?  4194304
    ? Do you want to skip op validation? 
    This will allow conversion of unsupported ops, 
    you can implement them as custom ops in tfjs-converter.  No
    ? Do you want to strip debug ops? 
    This will improve model execution performance.  Yes
    ? Do you want to enable Control Flow V2 ops? 
    This will improve branch and loop execution performance.  Yes
    ? Do you want to provide metadata? 
    Provide your own metadata in the form: 
    metadata_key:path/metadata.json 
    Separate multiple metadata by comma.  
    ? Which directory do you want to save the converted model in?  web_model
    converter command generated:
    tensorflowjs_converter --control_flow_v2=True --input_format=tf_saved_model --metadata= --saved_model_tags=serve --signature_name=serving_default --strip_debug_ops=True --weight_shard_size_bytes=4194304 hey_fourth_brain web_model
    
    ...
    File(s) generated by conversion:
    Filename                           Size(bytes)
    group1-shard1of1.bin                729244
    model.json                          28812
    Total size:                         758056
    
  • Once the above step is done, copy the files into the web application. Example:
    ├── index.html
    └── static
        └── audio
            ├── audio_utils.js
            ├── fft.js
            ├── main.js
            ├── mic128.png
            ├── model
            │   ├── group1-shard1of1.bin
            │   └── model.json
            ├── prompt.mp3
            └── styles.css
    
  • On the client side, used tfjs to load the model and run inference
  • Loading the TensorFlow model
    let tfModel;
    async function loadModel() {
        tfModel = await tf.loadGraphModel('static/audio/model/model.json');
    }
    loadModel()
    
  • Run inference using the above model
    let outputTensor = tf.tidy(() => {
        let inputTensor = tf.tensor(dataProcessed, [batch, 1, MEL_SPEC_BINS, dataProcessed.length/(batch * MEL_SPEC_BINS)], 'float32');
        return tfModel.predict(inputTensor);
    });
    let outputData = await outputTensor.data();
    

Using tflite

  • Once the TensorFlow model is created, it can be converted to TFLite using the code below
    import numpy as np
    import tensorflow as tf
    
    model = tf.saved_model.load("hey_fourth_brain")
    input_shape = [1, 1, 40, 61]
    func = tf.function(model).get_concrete_function(input=tf.TensorSpec(shape=input_shape, dtype=np.float32, name="input"))
    converter = tf.lite.TFLiteConverter.from_concrete_functions([func])
    tflite_model = converter.convert()
    open("hey_fourth_brain.tflite", "wb").write(tflite_model)
    
  • Note: tf.lite.TFLiteConverter.from_saved_model("hey_fourth_brain") did not work; it threw conv.cc:349 input->dims->data[3] != filter->dims->data[3] (0 != 1) on inference, so the above method was used instead.
  • Copy the TFLite model to the web application (a Python sanity check of the converted model is sketched after this list)
  • Used the TFLite JS runtime to load the model and run inference
  • Loading the TFLite model
    let tfliteModel;
    async function loadModel() {
        tfliteModel = await tflite.loadTFLiteModel('static/audio/hey_fourth_brain.tflite');
    }
    loadModel()
    
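  • The converted .tflite model can also be sanity-checked with the Python TFLite interpreter before copying it into the web app (illustrative sketch using random input)
    import numpy as np
    import tensorflow as tf
    
    interpreter = tf.lite.Interpreter(model_path="hey_fourth_brain.tflite")
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]
    out = interpreter.get_output_details()[0]
    
    interpreter.set_tensor(inp["index"], np.random.rand(1, 1, 40, 61).astype(np.float32))
    interpreter.invoke()
    print(interpreter.get_tensor(out["index"]))  # logits of shape [1, 4]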

Demo

Slides

Please use this link for slides

Dataset

You can download the dataset that was used in the project from here

Conclusion

In this project, we went through how to extract audio features, train a model, and detect wake words, using an end-to-end example with source code. Go through the wake_word_detection.ipynb Jupyter notebook for a complete walkthrough of this project.

Enhancements

  • Explore a different number of mel bins; this project used 40, but values in the range of 32 to 128 could be tried to see whether accuracy improves.
  • Use an RNN, LSTM, GRU, or attention-based model to see whether better results can be achieved.
  • Try computing MFCCs (which are derived from mel spectrograms) and see whether they bring any improvement.
  • Use different audio augmentation methods such as TimeStretch, TimeMasking, and FrequencyMasking.
