Realtime? (low-latency streaming inference) #43

Open
willstott101 opened this issue Oct 11, 2021 · 5 comments

Comments

@willstott101
Contributor

Thanks for allosaurus; my experiments with it have been fruitful so far. Very impressive work!

I'm curious whether the architecture of this package is suitable for operating on streaming audio at reasonably low latency?

I haven't dug much further than what I needed to load a file with pydub and get some output, but I'm happy to dig further. I thought it would be a good idea to start a conversation about this: perhaps the system and models are totally unsuitable for real-time use, or perhaps it just requires a bit of engineering effort from me.
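For context, what I've done so far is just the basic offline API, roughly like this (a minimal sketch; `sample.wav` is a placeholder):

```python
from allosaurus.app import read_recognizer

# Offline usage: load the default model once, then transcribe a file.
model = read_recognizer()
print(model.recognize("sample.wav"))  # prints the recognized phone sequence
```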

Thanks in advance

@xinjli
Owner

xinjli commented Oct 11, 2021

Hi, thanks for asking!

Unfortunately, the current model is not able to do real-time transcription. A real-time model would need a special architecture, which is not implemented in the current model.
If you want to use it for real-time purposes, maybe the best way for now is to feed your audio stream into the model in fixed-length chunks (e.g. 2 seconds) and then concatenate the outputs, as in the sketch below.
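A rough sketch of what I mean (the file paths and the 2-second chunk length are just placeholders, and pydub is only one way to do the slicing):

```python
from pydub import AudioSegment
from allosaurus.app import read_recognizer

model = read_recognizer()  # load the default model once

# Slice the input into fixed-length chunks and recognize each one
# independently, then concatenate the per-chunk phone strings.
audio = AudioSegment.from_wav("input.wav")  # placeholder for your stream
chunk_ms = 2000                             # 2-second chunks

phones = []
for start in range(0, len(audio), chunk_ms):
    chunk = audio[start:start + chunk_ms]
    chunk.export("chunk.wav", format="wav")
    phones.append(model.recognize("chunk.wav"))

print(" ".join(phones))
```

One caveat: a phone straddling a chunk boundary may get split or dropped, so overlapping the chunks slightly and merging the outputs could help.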

@willstott101
Contributor Author

Fair enough. I'm curious about the theoretical minimum latency of the model. I see there is a `"window_size": 0.025` in pm_config.json and a `"window_size": 3` in am_config.json (uni2005). Is the minimum latency therefore basically 0.025 * 3 (seconds, I assume), or am I wrong in assuming those window sizes are the overall limit on the data passed to any given execution of the neural network? Perhaps the network keeps state as windows are passed to it. Perhaps those windows aren't actually what I think they are. 🤷

@xinjli
Owner

xinjli commented Oct 20, 2021

Hi, sorry for the late reply.

For this model, the minimum latency would be 0.025 + 0.01 + 0.01 seconds, because successive windows are shifted by 0.01 s. And of course you also need to account for the time spent on feature extraction and inference.
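In code, assuming the standard 10 ms frame shift (which the 0.025 window in pm_config.json suggests):

```python
# Back-of-the-envelope minimum input context before the first output,
# assuming a 25 ms analysis window with a 10 ms frame shift.
window = 0.025   # analysis window (s), "window_size" in pm_config.json
shift = 0.010    # frame shift (s)
am_frames = 3    # acoustic-model context, "window_size" in am_config.json

min_latency = window + (am_frames - 1) * shift  # 0.025 + 0.01 + 0.01
print(f"{min_latency * 1000:.0f} ms")           # -> 45 ms
```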

@padster06

Hi, continuing Will's thread about real-time audio streaming: we ran into a bit of a blocker. The lifter function (pm.feature.lifter) seems to change its output based on the length of the array passed in as `cepstra`. The same array with fewer elements gets returned with different values. Is there an obvious way to make this function invariant to input array length? Or do we need to keep state and do a rolling-average type thing?
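For reference, the textbook sinusoidal lifter (as implemented in e.g. python_speech_features — whether pm.feature.lifter follows it exactly is an assumption on my part) scales each coefficient by a factor that depends only on the coefficient index, not the number of frames:

```python
import numpy as np

def lifter(cepstra, L=22):
    """Apply sinusoidal liftering to cepstra of shape (n_frames, n_coeff).

    The lift factor depends only on the coefficient index n, so in this
    form the result for a given frame is invariant to how many frames
    are in the input.
    """
    if L <= 0:
        return cepstra
    _, n_coeff = np.shape(cepstra)
    n = np.arange(n_coeff)
    lift = 1 + (L / 2.0) * np.sin(np.pi * n / L)
    return lift * cepstra
```

If the output really does vary with input length, the dependence presumably comes from another step in the pipeline (e.g. a normalization over frames) rather than this formula.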

Thanks

@IpsumDominum

Hello, just want to clarify: is it because the current model uses a bidirectional LSTM that this is not possible?

@willstott101 willstott101 changed the title Realtime? Realtime? (low-latency streaming inference) Mar 14, 2024