Evaluate TTS Engines #60

kristiankielhofner · 2023-05-23T10:25:51Z

SpeechT5 is included because it's in Transformers and it's an easy first pick for TTS.

There are several others (in no particular order):

Now that WIS has been released, I'm very interested in community feedback on different engines, voices, etc., so we can select the best default for future versions of WIS.

DePingus · 2023-05-24T02:35:06Z

Rhasspy released their new engine Piper recently. Their samples sound a lot like Mycroft's Mimic 3 engine.

kristiankielhofner · 2023-05-24T03:46:20Z

Mycroft's Mimic 3 engine is licensed as AGPL, which has some considerable legal implications. I'd also be hesitant to use anything Mycroft based as their future is uncertain.

You can see from both Mimic and Piper they're using a lot of the same standard components as the frameworks above - VITS, espeak-ng, etc.

nils-se · 2023-05-25T10:33:01Z

Strong advocate of Coqui. I tried a lot of voices and frameworks and liked it the best in German and English. Here is a little Python code for trying it out. I wrote a helper function for speaking long texts: the audio for the first sentence is generated and played, then the next, and so on. If all the sentences were computed at once, there would be a huge delay before any audio is heard. It works nicely on CPU in sub-realtime, so maybe it plays nicely with your GPU-focused STT and LLM. Great work, by the way!

Please mind my non-coding background when trying out my code ;)

Also, on a second run the audio files are reused: each file is named after a hash of its sentence, so say() does not generate them again.

```python
from TTS.api import TTS
import subprocess
import time
import os
import hashlib
import pysbd  # sentence splitting with natural language processing

# Init TTS with the target model name. To list all models, use TTS.list_models():
# list_voices = TTS.list_models()
# print(list_voices)
# exit()

audiofiles_path = "/tmp/"

## German:
# tts = TTS(model_name="tts_models/de/thorsten/tacotron2-DDC", progress_bar=False, gpu=False)
# segmenter_language = "de"

## English:
tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC", progress_bar=False, gpu=False)
segmenter_language = "en"  # keep in sync with the language of the selected model


def say(text, blocking=1):
    sentences = pysbd.Segmenter(language=segmenter_language, clean=False).segment(text)
    print(sentences)

    play_process = None
    for i_sentence, sentence in enumerate(sentences):
        print("----------------------------------------")
        print("")
        print(sentence)
        print("")
        print("----------------------------------------")

        # Name the file after a hash of the sentence so repeated runs reuse it.
        filename = hashlib.md5(sentence.encode()).hexdigest() + ".wav"
        filepath = os.path.join(audiofiles_path, filename)
        if not os.path.exists(filepath):
            tts.tts_to_file(sentence, file_path=filepath)

        # Wait for the previous sentence to finish playing before starting the next.
        if play_process is not None:
            while play_process.poll() is None:
                time.sleep(0.1)
                print("still waiting")

        if i_sentence == len(sentences) - 1:
            print("last sentence")
            if blocking == 1:
                print("blocking")
                subprocess.call(["aplay", filepath])
            elif blocking == 0:
                print("non blocking")
                subprocess.Popen(["aplay", filepath])
        else:
            print("not last sentence - non blocking")
            print(time.time())
            # Non-blocking, so the next sentence can be generated during playback.
            play_process = subprocess.Popen(["aplay", filepath])


## Benchmark for block-tts vs. sequential-tts:

## German:
# test_text = "Dies ist ein allererster neuer Test für mich. Paris ist die Hauptstadt aller Herzen. Ich bin ein Computer Assistent und ich bin da, um dir zu helfen."

## English:
test_text = "Hello World, this is a test. Just stay put, while I compute all voice sequences. While I am at it, let me tell you something about myself: I am a robotic voice assistant and I wish to serve. Good bye."

# One-shot: generate the whole text first, then play it.
t1 = time.time()
tts.tts_to_file(test_text, file_path="/tmp/tts_one_shot.wav")
first_voice = time.time()
subprocess.call(["aplay", "/tmp/tts_one_shot.wav"])
t2 = time.time()
tts_say = t2 - t1
tts_first_voice = first_voice - t1

# input("Press Enter to continue...")

# Sequential: generate and play sentence by sentence.
t1 = time.time()
print("start")
print(time.time())
say(test_text, blocking=0)
t2 = time.time()
diff_say = t2 - t1

print("say non-blocking: " + str(diff_say))
print("time to first voice: " + str(tts_first_voice))
print("one-shot generation: " + str(tts_say))
```

kristiankielhofner · 2023-05-25T11:36:19Z

That's a vote for Coqui!

Yes, we will use caching but my preferred approach is to go about it a little differently.

For WIS itself, TTS output to the HTTP response should be a BytesIO() at a minimum, and this was a little painful the last time I looked at Coqui.
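
A minimal sketch of what in-memory output could look like, assuming the engine hands back raw 16-bit mono PCM samples. The function name `pcm_to_wav_bytes` and the 16 kHz sample rate are illustrative assumptions, not WIS or Coqui API. The standard-library `wave` module can write straight into an `io.BytesIO`, so nothing touches disk before the HTTP response is built.

```python
import io
import math
import struct
import wave


def pcm_to_wav_bytes(samples, sample_rate=16000):
    """Pack 16-bit mono PCM samples into a WAV container held in memory."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes per sample = 16-bit
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(struct.pack(f"<{len(samples)}h", *samples))
    buffer.seek(0)  # rewind so the caller can read from the start
    return buffer


# Stand-in for engine output: 0.1 s of a 440 Hz tone (hypothetical data).
tone = [int(12000 * math.sin(2 * math.pi * 440 * n / 16000)) for n in range(1600)]
wav_buffer = pcm_to_wav_bytes(tone)
body = wav_buffer.getvalue()  # bytes usable as an HTTP response body
```

The same buffer could be handed to whatever web framework serves the response; which one, and how it streams the bytes, is left open here.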

Our plan is to cache TTS output at the network/transport layer for architecture abstraction, scalability, and to do things like enable the use of a CDN for large-scale deployments like our hosted community WIS instance. If you're going to use caching, a cache-hit request shouldn't touch WIS at all.

We already have a CDN with tiered caching and reserve caching set up for the community-hosted WIS, and that will give users around the world ridiculously fast TTS response times.
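
To make the "cache hit never touches WIS" idea concrete, here is a rough sketch, assuming TTS requests are plain GETs whose query string fully determines the audio. The `/api/tts` path, the `text` and `speaker` parameter names, and both helper functions are made up for illustration and are not taken from the WIS API. Normalizing the URL means equivalent requests share one cache entry, and a long-lived `Cache-Control` header lets a shared cache answer repeats on its own.

```python
from urllib.parse import urlencode


def canonical_tts_url(base_url, text, speaker):
    """Build a deterministic URL so identical requests map to one cache entry."""
    params = {
        "speaker": speaker.strip(),
        "text": " ".join(text.split()),  # collapse runs of whitespace
    }
    # Sorting the parameters keeps the URL stable regardless of insertion order.
    return base_url + "?" + urlencode(sorted(params.items()))


def cache_headers(max_age_seconds=86400):
    """Response headers that allow shared caches (such as a CDN) to store the audio."""
    return {
        "Cache-Control": f"public, max-age={max_age_seconds}, immutable",
        "Content-Type": "audio/wav",
    }


# Two differently spaced requests for the same speech end up at the same URL.
url_a = canonical_tts_url("https://example.invalid/api/tts", "Hello   world", "Voice1")
url_b = canonical_tts_url("https://example.invalid/api/tts", " Hello world ", "Voice1 ")
```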

nikito · 2023-06-02T14:05:46Z

Discussed in #78 but also casting my vote here for Coqui 😄

JarbasAl · 2023-06-06T12:47:53Z

Have you considered the OpenVoiceOS plugin manager for this? It provides a lot of TTS options (and several other things such as STT, VAD, ...) behind a unified API.

https://github.com/OpenVoiceOS?q=tts&type=all&language=&sort=

satvikpendem · 2023-08-23T01:34:08Z

Try Bark, it's MIT licensed.

nikito · 2023-08-23T02:22:38Z

I think Bark was looked at, but unfortunately its performance isn't where it needs to be for a good voice assistant user experience (TTS generation on an enterprise GPU is real time at best, and on lesser GPUs it is slower than real time).
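
The "real time" comparison here is usually expressed as a real-time factor (RTF): synthesis time divided by the duration of the audio produced. A small sketch, using made-up timings rather than measured Bark numbers:

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """Return synthesis time per second of audio; below 1 is faster than playback."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


# Illustrative values only: 10 s of speech that took 10 s and 25 s to synthesize.
rtf_at_real_time = real_time_factor(10.0, 10.0)
rtf_too_slow = real_time_factor(25.0, 10.0)
```

For a voice assistant the RTF needs to be well below 1, since the user waits for synthesis before hearing anything.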

kristiankielhofner · 2023-08-24T13:48:58Z

@nikito is correct in terms of performance.

Bark also has a strange tendency to insert random "ummm" sounds in the audio output as documented in this issue.

satvikpendem · 2023-08-24T17:16:50Z

Yeah, that's true, because it's using a generative pre-trained transformer (GPT-style) model rather than true text to speech. It hallucinates similarly to Stable Diffusion and ChatGPT.

kristiankielhofner · 2023-08-24T17:21:06Z

We're on a constant hunt for a TTS implementation that provides better quality and more flexibility than SpeechT5 with comparable performance. It's at the top of my list in terms of ongoing WIS improvements but I've yet to find such a thing...

Nortonko · 2023-12-03T17:36:38Z

Hi, is there anything new about TTS? Thanks.

nikito · 2023-12-03T18:09:52Z

Yes, we have some new engines in process. They aren't in the main branch yet, but you can experiment with them in the feature/split_arch branch.

Nortonko · 2023-12-03T18:29:48Z

Thanks, I will try it.

ther3zz · 2024-01-17T16:00:21Z

Strong vote for Coqui, especially with their XTTS v2 model, since fine-tuning is super easy.

satvikpendem · 2024-01-18T12:34:38Z

Just saw this on Hacker News, it's the best I've heard so far:

https://github.com/collabora/WhisperSpeech

Napetc · 2024-04-02T20:56:27Z

Hello, will it be possible to add other languages? If possible, I would like to add one as an additional file.

ccsmart · 2024-06-04T20:14:53Z

I switched to the split_arch branch using Coqui and TTS seems much improved. Numbers are spoken out.
When building with XTTS ("utils.sh build-xtts"), multi-language responses are supported. The voice is downloaded as part of the install / build.
However, how do you switch voices with XTTS?

ssteo · 2024-10-22T20:47:25Z

I used Coqui earlier, but later found this to be more flexible for tweaking voice variants: https://github.com/2noise/ChatTTS

Update: Unfortunately, it is AGPL licensed

kristiankielhofner mentioned this issue Jun 1, 2023

TTS Does not handle numbers in text #78

Closed
