[New feature] Adding visemes as part of the output #99
base: main
Conversation
adding speech_to_visemes
Updating the viseme branch
Updating the speech-to-speech fork with visemes
FYI: I have updated the branch with the latest changes from the upstream branch. Feel free to review the code whenever you have time.
779a7ee to 1414ed4
Generally, this looks good, but there are some things that could be more solid. I would put that huge set of mappings into a dictionary, and I would rethink a bit how we are unpacking things on the clients. There are also a few parts commented out and other loose ends. I think Visemes is a great feature, but I would untie it from the TTS_handlers. The model seems to only process the audio, so I don't see a reason why it couldn't work the same way as all the other handlers, with something like piping the TTS to Visemes and then to the connector. That would make this implementation more in line with the rest of the library, it would keep Visemes from affecting any TTS, and it would make extending TTSs simpler.
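A rough sketch of the piping idea, using an illustrative queue-based worker rather than the library's actual handler API (the queue names and the `process` method are assumptions):

```python
import queue
import threading

# Illustrative sketch of the suggested decoupling: TTS audio chunks are piped
# into a standalone visemes stage, whose output then goes to the connector.
# Queue names and the `process` method are assumptions, not the library's API.
tts_to_visemes = queue.Queue()
visemes_to_connector = queue.Queue()

def visemes_worker(speech_to_visemes):
    while True:
        audio_chunk = tts_to_visemes.get()
        if audio_chunk is None:  # sentinel: the TTS stream is finished
            break
        visemes = speech_to_visemes.process(audio_chunk)
        visemes_to_connector.put((audio_chunk, visemes))

# The worker would run in its own thread, like the other handlers, e.g.:
# threading.Thread(target=visemes_worker, args=(stv_handler,), daemon=True).start()
```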
Thank you @andimarafioti for the review. Overall, this was mainly a first attempt to explore the feasibility and latency of the speech-to-viseme conversion, so I was more focused on the output structure and seeing if you were happy with it. Since the latency seems acceptable and the output looks good, I can now refine the code with a stronger focus on readability and modularity. More specifically:
- I can definitely do this! It will clean up the code and make it easier to maintain.
- Totally fair. Those were more for internal testing and to show how to access the info without overwhelming the terminal logs. I'll clean it up and find a better way to manage that information.
- I initially tied speech-to-visemes to the TTS handlers because certain commercial APIs (like Amazon Polly) offer TTS with time-stamped visemes, and I wanted to emulate that interface. But I agree with your suggestion of decoupling speech-to-visemes and TTS: it would definitely improve modularity, simplify future extensions, and keep things consistent across the tool. I'll work on it.
Cleaning code
Adding speech to visemes as a child of BaseHandler
Hi @andimarafioti, I have addressed all your points above. I hope you like my changes:
Introducing SpeechToVisemes 🗣️
This PR addresses issue #37 and introduces the `SpeechToVisemes` module, a submodule of `TTS` 🤖. This new functionality converts speech into visemes, which is crucial for apps that require visual representations of spoken words, like lip-syncing in animations or improving accessibility features 🎬.
Example Usage 📹
demo_s2s.mov
How it works 🤔
The tool generates timestamped sequences of visemes (22 mouth shapes in total, following Microsoft's documentation 📚) by transcribing the synthesized speech with the Hugging Face ASR pipeline using phoneme-recognition models. The default model is "bookbot/wav2vec2-ljspeech-gruut", which provides decent results with low latency and no external dependencies. Other alternatives include "ct-vikramanantha/phoneme-scorer-v2-wav2vec2" and "Bluecast/wav2vec2-Phoneme".
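A minimal sketch of this step, assuming the standard `transformers` ASR pipeline API (the audio file name is illustrative):

```python
from transformers import pipeline

# Phoneme recognition with character-level timestamps, using the default
# model mentioned above. The audio file name here is just a placeholder.
phoneme_asr = pipeline(
    "automatic-speech-recognition",
    model="bookbot/wav2vec2-ljspeech-gruut",
)
result = phoneme_asr("synthesized_speech.wav", return_timestamps="char")

# For CTC models the pipeline returns per-character (here: per-phoneme) chunks,
# e.g. [{"text": "ɛ", "timestamp": (0.05, 0.12)}, ...]
print(result["chunks"])
```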
The recognized phonemes are then mapped to visemes, also taking language-specific sounds into account.
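For illustration, a toy excerpt of the phoneme-to-viseme step (the table below covers only a handful of phonemes, and the helper is an assumption, not the PR's actual dictionary):

```python
# Illustrative excerpt of a phoneme-to-viseme table using Microsoft-style IDs;
# the full mapping covers all phonemes plus language-specific sounds.
PHONEME_TO_VISEME = {
    "p": 21, "b": 21, "m": 21,  # bilabial closure
    "f": 18, "v": 18,           # labiodental
    "s": 15, "z": 15,           # alveolar fricatives
}

def chunks_to_visemes(chunks):
    """Map timestamped phoneme chunks to timestamped viseme events."""
    events = []
    for chunk in chunks:
        viseme_id = PHONEME_TO_VISEME.get(chunk["text"])
        if viseme_id is not None:
            start, _end = chunk["timestamp"]
            events.append({"viseme_id": viseme_id, "start_time": start})
    return events
```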
Notes on the server architecture 🛠️
I've implemented STV as a submodule of TTS to leverage the existing architecture (I didn't want to make any major edits here). However, I have ideas on how we might restructure the entire tool architecture to make it more generalizable. I propose a sensor-engine-actuator framework (instead of having just STT, LM, and TTS), where the sensor includes submodules like STT and speech emotion recognition (which can also run in parallel), the engine comprises the LLM and application-specific rules, and the actuator includes TTS, viseme generation, and potentially more output instructions. Such a framework would enable emotion-aware agents, which are fundamental in a lot of scenarios (e.g., therapy)! @andimarafioti Let's discuss this further if you're interested! 💬
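As a very rough illustration of the proposed split (all class and field names below are hypothetical):

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical sketch of the sensor-engine-actuator idea: each stage is just a
# collection of callables applied to the previous stage's output.
@dataclass
class AgentPipeline:
    engine: Callable                                          # e.g. LLM + application-specific rules
    sensors: List[Callable] = field(default_factory=list)    # e.g. STT, speech emotion recognition
    actuators: List[Callable] = field(default_factory=list)  # e.g. TTS, viseme generation

    def step(self, audio_chunk) -> list:
        percepts = [sense(audio_chunk) for sense in self.sensors]
        plan = self.engine(percepts)
        return [act(plan) for act in self.actuators]
```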
Your feedback is welcome! 🤗