Low speaker similarity in zero-shot TTS #910

@LoganLiu66

Thank you for this great work. When I use zero-shot TTS, the speaker similarity between the prompt audio (used for spk_smp) and the generated audio is low. My prompt audio, prompt_text, and generated audio are in audios.zip. What might be causing this, and do you have any advice for improving it? Thanks.

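    import torch
    import torchaudio
    import ChatTTS
    from tools.audio import load_audio  # helper shipped in the ChatTTS repository

    # Setup not shown in the original post; assumed to follow the README.
    chat = ChatTTS.Chat()
    chat.load(compile=False)

    audio_file = 'sample.wav'
    # Transcript of the prompt audio. Double quotes are needed because the text contains "don't".
    prompt_text = "I chance to leave him alone, but[uv_break] no[uv_break]. She just wanted to see him again[uv_break]. Anna[uv_break], you don't know how it feels to lose a sister[uv_break]."
    # Encode the prompt audio (loaded at 24 kHz) as the speaker sample.
    spk_smp = chat.sample_audio_speaker(load_audio(audio_file, 24000))

    params_infer_code = ChatTTS.Chat.InferCodeParams(
        spk_smp=spk_smp,
        txt_smp=prompt_text,
        temperature=0.3,
        top_P=0.7,
        top_K=20
    )
    params_refine_text = ChatTTS.Chat.RefineTextParams(
        prompt='[oral_5]'
    )

    text = "I do love books, but I think I like writing about them more than selling them."
    wav = chat.infer(
        text,
        params_infer_code=params_infer_code,
        split_text=False,
        params_refine_text=params_refine_text
    )
    # Save the first generated waveform as a mono 24 kHz file.
    torchaudio.save("sample_generated.wav", torch.from_numpy(wav[0]).unsqueeze(0), 24000)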
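
In case it helps with reproducing this, the properties of the prompt file on disk can be reported with a small standard-library check. This is only a sketch: `describe_wav` is a hypothetical helper, not part of ChatTTS, and it assumes `sample.wav` is an uncompressed PCM WAV file. It does not measure speaker similarity. It only shows the sample rate, channel count, and duration that `load_audio` starts from before converting to 24000 Hz.

```python
import wave


def describe_wav(path):
    """Return basic properties of a PCM WAV file (hypothetical helper)."""
    with wave.open(path, "rb") as f:
        frames = f.getnframes()
        rate = f.getframerate()
        return {
            "sample_rate": rate,                      # Hz, as stored in the file header
            "channels": f.getnchannels(),             # 1 = mono, 2 = stereo
            "sample_width_bytes": f.getsampwidth(),   # 2 = 16-bit PCM
            "duration_s": frames / rate,              # length of the prompt in seconds
        }


# Example usage (file name taken from the snippet above):
# print(describe_wav("sample.wav"))
```

Running the same check on `sample_generated.wav` makes it easy to compare the two files side by side.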

Labels: documentation, help wanted