Open
Labels: documentation (Improvements or additions to documentation), help wanted (Extra attention is needed)
Description
Thank you for this great work. When I use zero-shot TTS, I find that the speaker similarity between the prompt audio (from which spk_smp is sampled) and the generated audio is low. My prompt audio, prompt_text, and generated audio are in audios.zip. What might be causing this, and do you have any advice for improving it? Thanks.
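import ChatTTS
import torch
import torchaudio
from tools.audio import load_audio  # load_audio helper from the ChatTTS repository

chat = ChatTTS.Chat()
chat.load(compile=False)

audio_file = 'sample.wav'
# Transcript of sample.wav. Double quotes are required because the text contains an apostrophe ("don't").
prompt_text = "I chance to leave him alone, but[uv_break] no[uv_break]. She just wanted to see him again[uv_break]. Anna[uv_break], you don't know how it feels to lose a sister[uv_break]."
# Sample the speaker from the reference audio, loaded at 24 kHz
spk_smp = chat.sample_audio_speaker(load_audio(audio_file, 24000))

params_infer_code = ChatTTS.Chat.InferCodeParams(
    spk_smp=spk_smp,      # speaker sample taken from the prompt audio
    txt_smp=prompt_text,  # text matching the prompt audio
    temperature=0.3,
    top_P=0.7,
    top_K=20,
)
params_refine_text = ChatTTS.Chat.RefineTextParams(
    prompt='[oral_5]',
)

text = "I do love books, but I think I like writing about them more than selling them."
wav = chat.infer(
    text,
    params_infer_code=params_infer_code,
    split_text=False,
    params_refine_text=params_refine_text,
)
# Add a channel dimension: torchaudio.save expects a (channels, samples) tensor by default
torchaudio.save("sample_generated.wav", torch.from_numpy(wav[0]).unsqueeze(0), 24000)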
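The report says similarity is "low" but gives no number, which makes it hard to compare settings such as different `temperature` values or prompt clips. One common way to put a number on it is the cosine similarity between speaker embeddings of `sample.wav` and `sample_generated.wav`. The sketch below assumes you already have such embeddings from some external speaker-verification model. That model is not part of ChatTTS and is not shown, and `embed_speaker` in the comments is a hypothetical placeholder. Because sampling is stochastic, averaging over several generations gives a steadier estimate than a single clip.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length embedding vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def mean_similarity(reference: Sequence[float],
                    candidates: List[Sequence[float]]) -> float:
    """Average similarity of several generated clips to one reference clip."""
    if not candidates:
        raise ValueError("need at least one candidate embedding")
    return sum(cosine_similarity(reference, c) for c in candidates) / len(candidates)


# Usage, assuming a hypothetical embed_speaker(path) -> list[float]
# built on a speaker-verification model of your choice:
#   ref = embed_speaker("sample.wav")
#   gens = [embed_speaker(f"sample_generated_{i}.wav") for i in range(5)]
#   print(mean_similarity(ref, gens))
```

Scores are only comparable when both clips are embedded by the same model, so keep the model fixed while you vary the prompt or sampling parameters.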