Realtime? (low-latency streaming inference) #43

Open
willstott101 opened this issue Oct 11, 2021 · 5 comments

Comments

@willstott101
Contributor

Thanks for allosaurus; my experiments with it have been fruitful so far. Very impressive work!

I'm curious whether the architecture of this package is suitable for operating on streaming audio at reasonably low latency?

I haven't dug much further than what I needed to load a file with pydub and get some output, but I'm happy to dig further. I thought it would be a good idea to start a conversation about this: perhaps the system and models are totally unsuitable for real-time use, or perhaps it just requires a bit of engineering effort from me.
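For context, what I've done so far is just the basic offline API, roughly like this (a minimal sketch; `sample.wav` is a placeholder):

```python
from allosaurus.app import read_recognizer

# Offline usage: load the default model once, then transcribe a file.
model = read_recognizer()
print(model.recognize("sample.wav"))  # prints the recognized phone sequence
```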

Thanks in advance

@xinjli
Owner

xinjli commented Oct 11, 2021

Hi, thanks for asking!

Unfortunately, the current model is not able to do real-time transcription. A real-time model would need a special architecture, which is not implemented in the current model.
If you want to use it for real-time purposes, maybe the best way for now is to feed your audio stream into the model in fixed-length chunks (e.g. 2 seconds) and then concatenate the outputs, as in the sketch below.
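A rough sketch of what I mean (the file paths and the 2-second chunk length are just placeholders, and pydub is only one way to do the slicing):

```python
from pydub import AudioSegment
from allosaurus.app import read_recognizer

model = read_recognizer()  # load the default model once

# Slice the input into fixed-length chunks and recognize each one
# independently, then concatenate the per-chunk phone strings.
audio = AudioSegment.from_wav("input.wav")  # placeholder for your stream
chunk_ms = 2000                             # 2-second chunks

phones = []
for start in range(0, len(audio), chunk_ms):
    chunk = audio[start:start + chunk_ms]
    chunk.export("chunk.wav", format="wav")
    phones.append(model.recognize("chunk.wav"))

print(" ".join(phones))
```

One caveat: a phone straddling a chunk boundary may get split or dropped, so overlapping the chunks slightly and merging the outputs could help.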

@willstott101
Contributor Author

Fair enough. I'm curious about the theoretical minimum latency of the model. I see there is a `"window_size": 0.025` in pm_config.json and a `"window_size": 3` in am_config.json (uni2005). Is the minimum latency therefore basically 0.025 * 3 (seconds, I assume), or am I wrong in assuming those window sizes are the overall limit on the data passed to any given execution of the neural network? Perhaps the network keeps state as windows are passed to it. Perhaps those windows aren't actually what I think they are. 🤷

@xinjli
Owner

xinjli commented Oct 20, 2021

Hi, sorry for the late reply.

For this model, the minimum latency would be 0.025 + 0.01 + 0.01 seconds, because successive windows are shifted by 0.01 s. And of course you also need to account for the time spent on feature extraction and inference.
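In code, assuming the standard 10 ms frame shift (which the 0.025 window in pm_config.json suggests):

```python
# Back-of-the-envelope minimum input context before the first output,
# assuming a 25 ms analysis window with a 10 ms frame shift.
window = 0.025   # analysis window (s), "window_size" in pm_config.json
shift = 0.010    # frame shift (s)
am_frames = 3    # acoustic-model context, "window_size" in am_config.json

min_latency = window + (am_frames - 1) * shift  # 0.025 + 0.01 + 0.01
print(f"{min_latency * 1000:.0f} ms")           # -> 45 ms
```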

@padster06

Hi, continuing Will's thread about real-time audio streaming: we ran into a bit of a blocker. The lifter function (pm.feature.lifter) seems to change its output based on the length of the array passed in as `cepstra`. The same array with fewer elements gets returned with different values. Is there an obvious way to make this function invariant to input array length? Or do we need to keep state and do a rolling-average type thing?
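For reference, the textbook sinusoidal lifter (as implemented in e.g. python_speech_features — whether pm.feature.lifter follows it exactly is an assumption on my part) scales each coefficient by a factor that depends only on the coefficient index, not the number of frames:

```python
import numpy as np

def lifter(cepstra, L=22):
    """Apply sinusoidal liftering to cepstra of shape (n_frames, n_coeff).

    The lift factor depends only on the coefficient index n, so in this
    form the result for a given frame is invariant to how many frames
    are in the input.
    """
    if L <= 0:
        return cepstra
    _, n_coeff = np.shape(cepstra)
    n = np.arange(n_coeff)
    lift = 1 + (L / 2.0) * np.sin(np.pi * n / L)
    return lift * cepstra
```

If the output really does vary with input length, the dependence presumably comes from another step in the pipeline (e.g. a normalization over frames) rather than this formula.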

Thanks

@IpsumDominum

Hello, just want to clarify: is it because the current model uses a bidirectional LSTM that this is not possible?

@willstott101 willstott101 changed the title Realtime? Realtime? (low-latency streaming inference) Mar 14, 2024