Low speaker similarity in zero-shot TTS #910

Thank you for this great work. When I use zero-shot TTS, the speaker similarity between the prompt audio (used for `spk_smp`) and the generated audio is low. My prompt audio, prompt text, and generated audio are in [audios.zip](https://github.com/user-attachments/files/19049691/default.zip). What might be causing this, and is there any advice for improving it? Thanks.

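```py
import torch
import torchaudio

import ChatTTS
from tools.audio import load_audio  # audio-loading helper from the ChatTTS repository

# Model setup (omitted from the original snippet; assumed to be the default load).
chat = ChatTTS.Chat()
chat.load()

audio_file = 'sample.wav'
# Transcript of the prompt audio. Double quotes are required because of the apostrophe in "don't".
prompt_text = "I chance to leave him alone, but[uv_break] no[uv_break]. She just wanted to see him again[uv_break]. Anna[uv_break], you don't know how it feels to lose a sister[uv_break]."
# Encode the prompt audio (loaded at 24 kHz) into a speaker sample.
spk_smp = chat.sample_audio_speaker(load_audio(audio_file, 24000))

params_infer_code = ChatTTS.Chat.InferCodeParams(
    spk_smp=spk_smp,       # speaker sample taken from the prompt audio
    txt_smp=prompt_text,   # transcript matching the prompt audio
    temperature=0.3,
    top_P=0.7,
    top_K=20,
)
params_refine_text = ChatTTS.Chat.RefineTextParams(
    prompt='[oral_5]',
)

text = "I do love books, but I think I like writing about them more than selling them."
wav = chat.infer(
    text,
    params_infer_code=params_infer_code,
    split_text=False,
    params_refine_text=params_refine_text,
)
# Save the first generated waveform as a mono 24 kHz file.
torchaudio.save("sample_generated.wav", torch.from_numpy(wav[0]).unsqueeze(0), 24000)
```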
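
To put a number on the similarity instead of judging it by ear, a rough sketch like the one below could be used. It assumes that speaker embeddings for the prompt audio and the generated audio have already been extracted with some speaker-verification model; that step is not shown and is not part of ChatTTS. The `cosine_similarity` helper and the example vectors are hypothetical and for illustration only.

```py
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# Hypothetical embeddings; real ones would come from a speaker-verification model.
prompt_embedding = [0.2, 0.7, -0.1, 0.4]
generated_embedding = [0.1, 0.6, 0.3, 0.2]
score = cosine_similarity(prompt_embedding, generated_embedding)
print(f"speaker similarity: {score:.3f}")
```

Values closer to 1 mean the two embeddings point in a similar direction; what counts as "similar enough" depends on the embedding model used.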
