Conversational agents are sweet! See Hume's EVI, Call Alice, and Fixie's ai.town. These three somehow know when they are being interrupted. Candidate ways to detect an interruption:
- A speaker diarization model
- A speech activity model
- DSP in the browser
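Not sure which of these the products above actually use. As a feel for the DSP route, here is a minimal energy-threshold sketch in Python (numpy only; the frame sizes and threshold are made up and would need tuning, and a browser version would be the same math in JS / Web Audio). The idea: if the mic shows speech energy while the assistant is talking, call it an interruption.

```python
import numpy as np

def frame_energy(x, frame_len=400, hop=160):
    # Short-time energy per frame; assumes 16 kHz mono float audio (25 ms frames, 10 ms hop)
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return np.array([np.mean(x[i * hop : i * hop + frame_len] ** 2) for i in range(n_frames)])

def is_speech_active(x, threshold=1e-3):
    # Crude voice-activity flag: any frame whose energy clears a fixed threshold
    return bool(np.any(frame_energy(x) > threshold))

silence = np.zeros(16000, dtype=np.float32)                    # 1 s of silence
speechish = 0.1 * np.random.randn(16000).astype(np.float32)    # 1 s of noise standing in for speech
print(is_speech_active(silence), is_speech_active(speechish))  # False True
```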
I made a dataset of overlapped speech called grid-overlap from audio in the GRID audiovisual sentence corpus.
- I recorded six clips of phrases I might use to interrupt an assistant. See the filenames in the interrupts directory.
- I overlaid those clips onto half of the GRID audio recordings. See add_overlaps.py (a rough sketch of the mixing step follows below).
- 50% with interruption, 50% without interruption.
- 10% validation, 20% test, 70% train
audio_25k from the GRID audiovisual dataset contains 1k recordings for each of 30 speakers.
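add_overlaps.py is the source of truth for the mixing; below is only a rough sketch of that step, with hypothetical filenames, assuming mono clips at the same sample rate.

```python
import numpy as np
import soundfile as sf

# Hypothetical filenames; the real list lives in add_overlaps.py and the interrupts directory
grid, sr = sf.read("grid/s1/bbaf2n.wav", dtype="float32")
interrupt, sr_i = sf.read("interrupts/hold_on.wav", dtype="float32")
assert sr == sr_i, "resample first if the rates differ"

# Drop the interrupt in at a random offset, truncating it if it runs past the end of the GRID clip
offset = np.random.randint(0, max(1, len(grid) - len(interrupt)))
end = min(len(grid), offset + len(interrupt))
mixed = np.copy(grid)
mixed[offset:end] += interrupt[: end - offset]
mixed = np.clip(mixed, -1.0, 1.0)  # keep the sum inside full scale

sf.write("grid-overlap/bbaf2n_overlap.wav", mixed, sr)
```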
See this train_overlap Colab notebook
- Code for inference is at the end of the above notebook
- A fine-tuned wav2vec2 (training setup sketched below) running on a T4 took about 1.41 seconds to classify a <1 second recording.
- This is too slow for real-time interruption detection.
see vui
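The train_overlap notebook holds the actual training code. Below is only a minimal sketch of what fine-tuning wav2vec2 as a binary overlap classifier could look like with transformers, assuming train_ds and val_ds are already-preprocessed datasets with 16 kHz input_values and label columns.

```python
from transformers import (AutoFeatureExtractor, AutoModelForAudioClassification,
                          Trainer, TrainingArguments)

# Binary head: 0 = no_overlap, 1 = overlap
labels = ["no_overlap", "overlap"]
extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = AutoModelForAudioClassification.from_pretrained(
    "facebook/wav2vec2-base",
    num_labels=len(labels),
    label2id={l: i for i, l in enumerate(labels)},
    id2label={i: l for i, l in enumerate(labels)},
)

args = TrainingArguments(
    output_dir="wav2vec2-base-detect-overlap",
    per_device_train_batch_size=8,
    learning_rate=3e-5,
    num_train_epochs=3,
)

# train_ds / val_ds are assumed: datasets whose audio has been run through `extractor`
# at 16 kHz into an "input_values" column, plus an integer "label" column
trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=val_ds)
trainer.train()
trainer.save_model("wav2vec2-base-detect-overlap")
```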
wav2vec in the browser
Q: how long does it take to
- feature-extract 1 second of audio
- classify that audio as overlap or not overlap
A:
1.41 s ± 318 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
^^ measured on a T4 in Python, using the transformers pipeline abstraction:

from transformers import pipeline

path_to_model = "/content/drive/MyDrive/wav2vec2-base-detect-overlap"
classifier = pipeline("audio-classification", model=path_to_model)
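The pipeline call bundles both steps. To answer the Q separately, something like the sketch below (same model directory; the dummy audio and timings are illustrative) would time feature extraction and the forward pass on their own.

```python
import time
import numpy as np
import torch
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification

path_to_model = "/content/drive/MyDrive/wav2vec2-base-detect-overlap"
extractor = AutoFeatureExtractor.from_pretrained(path_to_model)
model = AutoModelForAudioClassification.from_pretrained(path_to_model).eval()

audio = np.random.randn(16000).astype(np.float32)  # stand-in for 1 s of 16 kHz audio

t0 = time.perf_counter()
inputs = extractor(audio, sampling_rate=16000, return_tensors="pt")
t1 = time.perf_counter()
with torch.no_grad():
    logits = model(**inputs).logits
t2 = time.perf_counter()

pred = model.config.id2label[int(logits.argmax(-1))]
print(f"feature extraction: {t1 - t0:.3f}s, classification: {t2 - t1:.3f}s, label: {pred}")
```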
Recon: Is there a wav2vec model that can run in the browser? Yes. It might be worth measuring how long the browser takes to classify audio.
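One route to that is exporting the fine-tuned checkpoint to ONNX and running it with onnxruntime-web or transformers.js. A hedged sketch of the export step (paths assumed; the opset may need adjusting):

```python
import torch
from transformers import AutoModelForAudioClassification

path_to_model = "/content/drive/MyDrive/wav2vec2-base-detect-overlap"
model = AutoModelForAudioClassification.from_pretrained(path_to_model)
model.config.return_dict = False  # return plain tuples so tracing/export is straightforward
model.eval()

dummy = torch.randn(1, 16000)  # 1 s of 16 kHz audio as raw input_values
torch.onnx.export(
    model,
    (dummy,),
    "wav2vec2-detect-overlap.onnx",
    input_names=["input_values"],
    output_names=["logits"],
    dynamic_axes={"input_values": {0: "batch", 1: "samples"}},
    opset_version=14,
)
```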
Fear: Is tokenizing the input going to take a long time? Maybe not, since browsers already ship speech recognition (the Web Speech API), so heavier audio processing is clearly feasible there.
Caveats with grid-overlap:
- The dataset is at 44 kHz, but wav2vec2 models expect 16 kHz input, so it needs resampling (rough sketch below)
- The same speaker is used for every interrupt
- There are only 6 interrupt variants, and these are likely easy to learn
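Assuming the 44 kHz concern is about wav2vec2's 16 kHz expectation, resampling on load is one fix; a minimal sketch with librosa (filenames hypothetical):

```python
import librosa
import soundfile as sf

# librosa resamples to the requested rate on load
audio, sr = librosa.load("grid-overlap/bbaf2n_overlap.wav", sr=16000)
sf.write("grid-overlap-16k/bbaf2n_overlap.wav", audio, sr)
```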