
No reading tracking with Piper speech synthesis #361

Open
patrick-emmabuntus opened this issue Jan 26, 2024 · 4 comments

@patrick-emmabuntus

Hello,

I am using Calibre 6.13 with ebook-speaker on Debian 12.

The goal is to allow blind people to listen to the content of ebooks.

To get a better-sounding reading, I want to replace eSpeak-ng with Piper. Playback with Piper works well, but unlike eSpeak-ng it does not report the playback tracking data back to Calibre (see the screenshot below).

In the eSpeak-ng synthesizer engine, the "EspeakIndexing" option is set to 1, which activates word tracking.
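
For reference, the option looks roughly like this in the module configuration (the exact file path is an assumption and may differ per distribution):

```
# /etc/speech-dispatcher/modules/espeak-ng.conf   (assumed path)
# Enables index marks so the client (Calibre) can track the spoken word.
EspeakIndexing 1
```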

This function is very important: because Calibre follows the spoken text, reopening an ebook takes you back to where you left off.

Do you know if such a function is available in Piper?

And if so, how do I activate it?

Thank you in advance for your advice.

[Screenshot: Calibre_espeak_ng_scroll_speech]

@SeymourNickelson

This would be an awesome feature to have. It should be possible to add code that synthesizes each word independently and provides a callback just before the audio is played at each word boundary, but I would assume the voice wouldn't sound as realistic because you're feeding the model one word at a time.
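
A very rough sketch of that idea in Python, just to show where the per-word callback would fire. The Piper command-line flags and the model path are assumptions (check `piper --help` for your build), and `aplay` stands in for whatever audio output you use:

```python
import subprocess
import tempfile

def synthesize_to_wav(text: str, wav_path: str) -> None:
    # Hypothetical Piper invocation; flags and model path are placeholders.
    subprocess.run(
        ["piper", "--model", "en_US-lessac-medium.onnx", "--output_file", wav_path],
        input=text.encode("utf-8"),
        check=True,
    )

def speak_word_by_word(sentence: str, on_word) -> None:
    """Synthesize each word on its own and fire a callback before playing it."""
    for word in sentence.split():
        with tempfile.NamedTemporaryFile(suffix=".wav") as tmp:
            synthesize_to_wav(word, tmp.name)
            on_word(word)  # e.g. tell Calibre to highlight this word
            subprocess.run(["aplay", tmp.name], check=True)

speak_word_by_word("Reading an ebook aloud", lambda w: print("now speaking:", w))
```

Per-word synthesis like this loses sentence-level prosody, which is exactly the quality concern above.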

I wonder if there is another way to highlight words as they are played without impacting the quality of the output.

@patrick-emmabuntus
Author

Thank you @SeymourNickelson for your advice.

Indeed, if the words are synthesized one by one, this will degrade the quality of the spoken output.

On the other hand, the speech synthesis must keep reading the words normally while sending a reading position to Calibre, so there may be a small gap between the word being spoken and the word displayed in Calibre. The goal is to let a blind person return to the position they had reached in the book during the previous reading, rather than having to reread the entire last chapter from the beginning.

@contentnation

contentnation commented Feb 22, 2024

I had time to look into the way Calibre works.
Sadly, I have bad news.
Short version: Calibre uses speech-dispatcher to generate the audio. You can add custom text-to-speech tools (like Piper).
But for the highlighting feature you need to add direct support for Piper in speech-dispatcher to provide the "magic",
plus some work on the Piper side for the other half of the magic.

For those who want to keep developing this, a few notes (or TODOs):

- speech-dispatcher needs marker functionality for the Piper module similar to what src/modules/espeak.c already has: as soon as such a marker is received, wait for the audio data and tell upstream about the marker.
- On the Piper side, the markers need to be used to split the input; whenever a marker is reached, send the generated audio's timestamp and the audio data produced up to that point (see the sketch after this list).
- The current generic output module always filters those markers out before the text is sent to Piper (or any other external TTS).
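
To make the Piper-side half more concrete, here is a minimal sketch in Python. This is not the actual speech-dispatcher module API: `synthesize_chunk` is a hypothetical stand-in for a real Piper call, the sample rate is assumed, and the `<mark/>` tags follow standard SSML syntax:

```python
import re

MARK_RE = re.compile(r'<mark\s+name="([^"]+)"\s*/>')
SAMPLE_RATE = 22050       # assumed output rate
BYTES_PER_SAMPLE = 2      # 16-bit mono PCM

def synthesize_chunk(text: str) -> bytes:
    # Hypothetical stand-in for a real Piper call. For the sketch, return
    # 0.3 s of silence per word so the reported timestamps look plausible.
    n_words = max(1, len(text.split()))
    return b"\x00" * int(0.3 * n_words * SAMPLE_RATE * BYTES_PER_SAMPLE)

def pcm_duration(audio: bytes) -> float:
    return len(audio) / (SAMPLE_RATE * BYTES_PER_SAMPLE)

def synthesize_with_marks(text: str, report_mark, emit_audio) -> None:
    """Split the input at <mark/> tags; after synthesizing each chunk,
    report the marker together with the audio timestamp reached so far."""
    elapsed = 0.0
    pos = 0
    for m in MARK_RE.finditer(text):
        chunk = text[pos:m.start()]
        if chunk.strip():
            audio = synthesize_chunk(chunk)
            emit_audio(audio)
            elapsed += pcm_duration(audio)
        report_mark(m.group(1), elapsed)  # tell upstream: marker X at time T
        pos = m.end()
    tail = text[pos:]
    if tail.strip():
        emit_audio(synthesize_chunk(tail))

synthesize_with_marks(
    'Hello <mark name="w2"/>brave <mark name="w3"/>new world',
    report_mark=lambda name, t: print(f"mark {name} at {t:.2f} s"),
    emit_audio=lambda audio: None,
)
```

A real module would hand the PCM to speech-dispatcher's audio layer and report markers upstream instead of the lambdas used here.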

@patrick-emmabuntus
Author

Thank you very much @contentnation for your advice.
