
[Bug] tts_to_file gives TypeError: Invalid file: None #3067

Closed
perrylets opened this issue Oct 13, 2023 · 9 comments
Labels: bug Something isn't working

@perrylets

Describe the bug

When using the xtts_v1 model on Windows (Python 3.11.6), every call to the tts_to_file function fails with the error TypeError: Invalid file: None.

To Reproduce

On Windows with Python 3.11.6, with torch, torchaudio (possibly not needed, but installed to be safe), and TTS installed, run this snippet:

import torch
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v1").to("cuda" if torch.cuda.is_available() else "cpu")
# Any combination of parameters gives the same error.
tts.tts_to_file("Hello, world!", language="en")  # Expected error: "TypeError: Invalid file: None"

Expected behavior

The audio output should be written to output.wav, or the specified file name.

Logs

No response

Environment

- 🐸TTS Version: 0.17.8
- PyTorch Version: 2.1.0+cpu
- Python Version: 3.11.6
- OS: Windows 11
- CUDA/cuDNN version: null
- GPU models and configuration: AMD Ryzen 7 5700G with Radeon Graphics
- How you installed PyTorch: pip on a virtual environment

Additional context

No response

@perrylets perrylets added the bug Something isn't working label Oct 13, 2023
@erogol
Member

erogol commented Oct 16, 2023

@Aya-AlJafari can you check this?

@taha9881

speaker_wav="cloning/audio.wav"
file_path="output.wav"

Try adding these two arguments. Create the respective directory for speaker_wav and put a sample audio file in .wav format there.

@perrylets
Author

I already did that before making the issue.

@Aya-AlJafari
Contributor

Hi @perrylets, can you please post the full log after executing this command:

tts.tts_to_file("Hello, world!",file_path="output.wav", speaker_wav="path/to/wavefile", language="en")

because the missing speaker_wav in tts.tts_to_file("Hello, world!", language="en") is the source of the None error, and the alternative above should fix it. I'm curious to see the log if it's still not working on your side.
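A lightweight pre-flight check makes this failure mode much easier to diagnose. The helper below is a hypothetical wrapper (not part of the TTS API) that validates speaker_wav before handing off to any tts_to_file-like callable, turning the cryptic "TypeError: Invalid file: None" into an actionable error:

```python
from pathlib import Path

def checked_tts_to_file(tts_fn, text, speaker_wav, file_path="output.wav", language="en"):
    """Validate the reference audio, then delegate to a synthesis function.

    tts_fn is any callable with a tts_to_file-like keyword signature.
    """
    if speaker_wav is None:
        raise ValueError("XTTS voice cloning requires speaker_wav; got None")
    if not Path(speaker_wav).is_file():
        raise FileNotFoundError(f"speaker_wav does not exist: {speaker_wav}")
    return tts_fn(text, file_path=file_path, speaker_wav=speaker_wav, language=language)
```

Called as checked_tts_to_file(tts.tts_to_file, "Hello, world!", "path/to/wavefile"), it fails fast with a clear message instead of deep inside soundfile.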

@perrylets
Author

Where are the logs? Is it just the console output?

@Aya-AlJafari
Contributor

@perrylets Yes, the full console output.

@Mikerhinos

Mikerhinos commented Oct 21, 2023

I'm having the same error while using the tts.tts_with_vc_to_file() method, even after adding the speaker_wav path, with xtts_v1 or xtts_v1.1.
Full output:

 > Using model: xtts
 > Text splitted to sentences.
['Experience has shown that it is not because you think this process is critical for you, that it is for your project.']
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[12], line 32
     14 
   (...)
     27 #tts = TTS(model_name="voice_conversion_models/multilingual/vctk/freevc24", progress_bar=False).to("cuda")
     28 #tts.voice_conversion_to_file(source_wav="C:\\Users\\miker\\Downloads\\output_synth_"+now_string+".wav", target_wav="C:\\Users\\miker\\Downloads\\output_audio_"+now_string+".wav", file_path="C:\\Users\\miker\\Downloads\\output_cloned_"+now_string+".wav")
     31 tts = TTS("tts_models/multilingual/multi-dataset/xtts_v1", gpu=True).to(device)
---> 32 tts.tts_with_vc_to_file(
     33     text=translated_text,
     34     speaker_wav="C:\\Users\\miker\\Downloads\\output_audio_"+now_string+".wav",
     35     file_path="C:\\Users\\miker\\Downloads\\output_cloned_"+now_string+".wav",
     36     language='en'
     37 )
     39 # Display audio widget to play the generated audio
     40 audio_widget = Audio(filename="C:\\Users\\miker\\Downloads\\output_cloned_"+now_string+".wav", autoplay=False)

File ~\anaconda3\envs\colab\lib\site-packages\TTS\api.py:488, in TTS.tts_with_vc_to_file(self, text, language, speaker_wav, file_path)
    469 def tts_with_vc_to_file(
    470     self, text: str, language: str = None, speaker_wav: str = None, file_path: str = "output.wav"
    471 ):
    472     """Convert text to speech with voice conversion and save to file.
    473 
    474     Check `tts_with_vc` for more details.
   (...)
    486             Output file path. Defaults to "output.wav".
    487     """
--> 488     wav = self.tts_with_vc(text=text, language=language, speaker_wav=speaker_wav)
    489     save_wav(wav=wav, path=file_path, sample_rate=self.voice_converter.vc_config.audio.output_sample_rate)

File ~\anaconda3\envs\colab\lib\site-packages\TTS\api.py:463, in TTS.tts_with_vc(self, text, language, speaker_wav)
    444 """Convert text to speech with voice conversion.
    445 
    446 It combines tts with voice conversion to fake voice cloning.
   (...)
    459         Defaults to None.
    460 """
    461 with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as fp:
    462     # Lazy code... save it to a temp file to resample it while reading it for VC
--> 463     self.tts_to_file(text=text, speaker=None, language=language, file_path=fp.name)
    464 if self.voice_converter is None:
    465     self.load_vc_model_by_name("voice_conversion_models/multilingual/vctk/freevc24")

File ~\anaconda3\envs\colab\lib\site-packages\TTS\api.py:403, in TTS.tts_to_file(self, text, speaker, language, speaker_wav, emotion, speed, pipe_out, file_path, **kwargs)
    393 if self.csapi is not None:
    394     return self.tts_coqui_studio(
    395         text=text,
    396         speaker_name=speaker,
   (...)
    401         pipe_out=pipe_out,
    402     )
--> 403 wav = self.tts(text=text, speaker=speaker, language=language, speaker_wav=speaker_wav, **kwargs)
    404 self.synthesizer.save_wav(wav=wav, path=file_path, pipe_out=pipe_out)
    405 return file_path

File ~\anaconda3\envs\colab\lib\site-packages\TTS\api.py:341, in TTS.tts(self, text, speaker, language, speaker_wav, emotion, speed, **kwargs)
    337 if self.csapi is not None:
    338     return self.tts_coqui_studio(
    339         text=text, speaker_name=speaker, language=language, emotion=emotion, speed=speed
    340     )
--> 341 wav = self.synthesizer.tts(
    342     text=text,
    343     speaker_name=speaker,
    344     language_name=language,
    345     speaker_wav=speaker_wav,
    346     reference_wav=None,
    347     style_wav=None,
    348     style_text=None,
    349     reference_speaker_name=None,
    350     **kwargs,
    351 )
    352 return wav

File ~\anaconda3\envs\colab\lib\site-packages\TTS\utils\synthesizer.py:374, in Synthesizer.tts(self, text, speaker_name, language_name, speaker_wav, style_wav, style_text, reference_wav, reference_speaker_name, **kwargs)
    372 for sen in sens:
    373     if hasattr(self.tts_model, "synthesize"):
--> 374         outputs = self.tts_model.synthesize(
    375             text=sen,
    376             config=self.tts_config,
    377             speaker_id=speaker_name,
    378             voice_dirs=self.voice_dir,
    379             d_vector=speaker_embedding,
    380             speaker_wav=speaker_wav,
    381             language=language_name,
    382             **kwargs,
    383         )
    384     else:
    385         # synthesize voice
    386         outputs = synthesis(
    387             model=self.tts_model,
    388             text=sen,
   (...)
    396             language_id=language_id,
    397         )

File ~\anaconda3\envs\colab\lib\site-packages\TTS\tts\models\xtts.py:462, in Xtts.synthesize(self, text, config, speaker_wav, language, **kwargs)
    459 if isinstance(speaker_wav, list):
    460     speaker_wav = speaker_wav[0]
--> 462 return self.inference_with_config(text, config, ref_audio_path=speaker_wav, language=language, **kwargs)

File ~\anaconda3\envs\colab\lib\site-packages\TTS\tts\models\xtts.py:484, in Xtts.inference_with_config(self, text, config, ref_audio_path, language, **kwargs)
    472 settings = {
    473     "temperature": config.temperature,
    474     "length_penalty": config.length_penalty,
   (...)
    481     "decoder_sampler": config.decoder_sampler,
    482 }
    483 settings.update(kwargs)  # allow overriding of preset settings with kwargs
--> 484 return self.full_inference(text, ref_audio_path, language, **settings)

File ~\anaconda3\envs\colab\lib\site-packages\torch\utils\_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    112 @functools.wraps(func)
    113 def decorate_context(*args, **kwargs):
    114     with ctx_factory():
--> 115         return func(*args, **kwargs)

File ~\anaconda3\envs\colab\lib\site-packages\TTS\tts\models\xtts.py:570, in Xtts.full_inference(self, text, ref_audio_path, language, temperature, length_penalty, repetition_penalty, top_k, top_p, gpt_cond_len, do_sample, decoder_iterations, cond_free, cond_free_k, diffusion_temperature, decoder_sampler, decoder, **hf_generate_kwargs)
    486 @torch.inference_mode()
    487 def full_inference(
    488     self,
   (...)
    507     **hf_generate_kwargs,
    508 ):
    509     """
    510     This function produces an audio clip of the given text being spoken with the given reference voice.
    511 
   (...)
    564         Sample rate is 24kHz.
    565     """
    566     (
    567         gpt_cond_latent,
    568         diffusion_conditioning,
    569         speaker_embedding
--> 570     ) = self.get_conditioning_latents(audio_path=ref_audio_path, gpt_cond_len=gpt_cond_len)
    571     return self.inference(
    572         text,
    573         language,
   (...)
    589         **hf_generate_kwargs,
    590     )

File ~\anaconda3\envs\colab\lib\site-packages\TTS\tts\models\xtts.py:435, in Xtts.get_conditioning_latents(self, audio_path, gpt_cond_len)
    433 diffusion_cond_latents = None
    434 if self.args.use_hifigan:
--> 435     speaker_embedding = self.get_speaker_embedding(audio_path)
    436 else:
    437     diffusion_cond_latents = self.get_diffusion_cond_latents(audio_path)

File ~\anaconda3\envs\colab\lib\site-packages\torch\utils\_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    112 @functools.wraps(func)
    113 def decorate_context(*args, **kwargs):
    114     with ctx_factory():
--> 115         return func(*args, **kwargs)

File ~\anaconda3\envs\colab\lib\site-packages\TTS\tts\models\xtts.py:421, in Xtts.get_speaker_embedding(self, audio_path)
    416 @torch.inference_mode()
    417 def get_speaker_embedding(
    418     self,
    419     audio_path
    420 ):
--> 421     audio = load_audio(audio_path, self.hifigan_decoder.speaker_encoder_audio_config["sample_rate"])
    422     speaker_embedding = self.hifigan_decoder.speaker_encoder.forward(
    423         audio.to(self.device), l2_norm=True
    424     ).unsqueeze(-1).to(self.device)
    425     return speaker_embedding

File ~\anaconda3\envs\colab\lib\site-packages\TTS\tts\models\xtts.py:34, in load_audio(audiopath, sr)
     23 def load_audio(audiopath, sr=22050):
     24     """
     25     Load an audio file from disk and resample it to the specified sampling rate.
     26 
   (...)
     32         Tensor: Audio waveform tensor with shape (1, T), where T is the number of samples.
     33     """
---> 34     audio, sampling_rate = torchaudio.load(audiopath)
     36     if len(audio.shape) > 1:
     37         if audio.shape[0] < 5:

File ~\anaconda3\envs\colab\lib\site-packages\torchaudio\backend\soundfile_backend.py:221, in load(filepath, frame_offset, num_frames, normalize, channels_first, format)
    139 @_requires_soundfile
    140 def load(
    141     filepath: str,
   (...)
    146     format: Optional[str] = None,
    147 ) -> Tuple[torch.Tensor, int]:
    148     """Load audio data from file.
    149 
    150     Note:
   (...)
    219             `[channel, time]` else `[time, channel]`.
    220     """
--> 221     with soundfile.SoundFile(filepath, "r") as file_:
    222         if file_.format != "WAV" or normalize:
    223             dtype = "float32"

File ~\anaconda3\envs\colab\lib\site-packages\soundfile.py:658, in SoundFile.__init__(self, file, mode, samplerate, channels, subtype, endian, format, closefd)
    655 self._mode = mode
    656 self._info = _create_info_struct(file, mode, samplerate, channels,
    657                                  format, subtype, endian)
--> 658 self._file = self._open(file, mode_int, closefd)
    659 if set(mode).issuperset('r+') and self.seekable():
    660     # Move write position to 0 (like in Python file objects)
    661     self.seek(0)

File ~\anaconda3\envs\colab\lib\site-packages\soundfile.py:1212, in SoundFile._open(self, file, mode_int, closefd)
   1209     file_ptr = _snd.sf_open_virtual(self._init_virtual_io(file),
   1210                                     mode_int, self._info, _ffi.NULL)
   1211 else:
-> 1212     raise TypeError("Invalid file: {0!r}".format(self.name))
   1213 if file_ptr == _ffi.NULL:
   1214     # get the actual error code
   1215     err = _snd.sf_error(file_ptr)

TypeError: Invalid file: None

@gorip1

gorip1 commented Oct 22, 2023

Hi, same error for me using xtts_v1.1 & tts.tts_with_vc_to_file()

main.py:

from TTS.api import TTS
tts = TTS(model_name="tts_models/multilingual/multi-dataset/xtts_v1.1", progress_bar=True).to("cpu")
tts.tts_with_vc_to_file(
    text="Hi guys how are you ?",
    speaker_wav="TTS/real_audio_sample/me_speaking.wav",
    file_path="output.wav",
    language="en"
)

Full output:

 > tts_models/multilingual/multi-dataset/xtts_v1.1 is already downloaded.
 > Using model: xtts
/Users/XXX/PycharmProjects/coquiTTS/venv/lib/python3.9/site-packages/torch/nn/utils/weight_norm.py:30: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
  warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
Traceback (most recent call last):
  File "/Users/XXX/PycharmProjects/coquiTTS/main.py", line 9, in <module>
    tts.tts_with_vc_to_file(
  File "/Users/XXX/PycharmProjects/coquiTTS/venv/lib/python3.9/site-packages/TTS/api.py", line 488, in tts_with_vc_to_file
    wav = self.tts_with_vc(text=text, language=language, speaker_wav=speaker_wav)
  File "/Users/XXX/PycharmProjects/coquiTTS/venv/lib/python3.9/site-packages/TTS/api.py", line 463, in tts_with_vc
    self.tts_to_file(text=text, speaker=None, language=language, file_path=fp.name)
  File "/Users/XXX/PycharmProjects/coquiTTS/venv/lib/python3.9/site-packages/TTS/api.py", line 403, in tts_to_file
    wav = self.tts(text=text, speaker=speaker, language=language, speaker_wav=speaker_wav, **kwargs)
  File "/Users/XXX/PycharmProjects/coquiTTS/venv/lib/python3.9/site-packages/TTS/api.py", line 341, in tts
    wav = self.synthesizer.tts(
  File "/Users/XXX/PycharmProjects/coquiTTS/venv/lib/python3.9/site-packages/TTS/utils/synthesizer.py", line 374, in tts
    outputs = self.tts_model.synthesize(
  File "/Users/XXX/PycharmProjects/coquiTTS/venv/lib/python3.9/site-packages/TTS/tts/models/xtts.py", line 462, in synthesize
    return self.inference_with_config(text, config, ref_audio_path=speaker_wav, language=language, **kwargs)
  File "/Users/XXX/PycharmProjects/coquiTTS/venv/lib/python3.9/site-packages/TTS/tts/models/xtts.py", line 484, in inference_with_config
    return self.full_inference(text, ref_audio_path, language, **settings)
  File "/Users/XXX/PycharmProjects/coquiTTS/venv/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/Users/XXX/PycharmProjects/coquiTTS/venv/lib/python3.9/site-packages/TTS/tts/models/xtts.py", line 570, in full_inference
    ) = self.get_conditioning_latents(audio_path=ref_audio_path, gpt_cond_len=gpt_cond_len)
  File "/Users/XXX/PycharmProjects/coquiTTS/venv/lib/python3.9/site-packages/TTS/tts/models/xtts.py", line 435, in get_conditioning_latents
    speaker_embedding = self.get_speaker_embedding(audio_path)
  File "/Users/XXX/PycharmProjects/coquiTTS/venv/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/Users/XXX/PycharmProjects/coquiTTS/venv/lib/python3.9/site-packages/TTS/tts/models/xtts.py", line 421, in get_speaker_embedding
    audio = load_audio(audio_path, self.hifigan_decoder.speaker_encoder_audio_config["sample_rate"])
  File "/Users/XXX/PycharmProjects/coquiTTS/venv/lib/python3.9/site-packages/TTS/tts/models/xtts.py", line 34, in load_audio
    audio, sampling_rate = torchaudio.load(audiopath)
  File "/Users/XXX/PycharmProjects/coquiTTS/venv/lib/python3.9/site-packages/torchaudio/_backend/utils.py", line 203, in load
    return backend.load(uri, frame_offset, num_frames, normalize, channels_first, format, buffer_size)
  File "/Users/XXX/PycharmProjects/coquiTTS/venv/lib/python3.9/site-packages/torchaudio/_backend/soundfile.py", line 26, in load
    return soundfile_backend.load(uri, frame_offset, num_frames, normalize, channels_first, format)
  File "/Users/XXX/PycharmProjects/coquiTTS/venv/lib/python3.9/site-packages/torchaudio/_backend/soundfile_backend.py", line 221, in load
    with soundfile.SoundFile(filepath, "r") as file_:
  File "/Users/XXX/PycharmProjects/coquiTTS/venv/lib/python3.9/site-packages/soundfile.py", line 658, in __init__
    self._file = self._open(file, mode_int, closefd)
  File "/Users/XXX/PycharmProjects/coquiTTS/venv/lib/python3.9/site-packages/soundfile.py", line 1212, in _open
    raise TypeError("Invalid file: {0!r}".format(self.name))
TypeError: Invalid file: None

Hope it'll help 🤞

[EDIT]

It (kind of) worked when I put the file path directly into soundfile.SoundFile() in venv/lib/python3.9/site-packages/torchaudio/_backend/soundfile_backend.py.
So the error must be introduced somewhere in between!

Line 221 :

with soundfile.SoundFile(filepath, "r") as file_: ⤵️
with soundfile.SoundFile('MY_FILE_PATH.wav', "r") as file_:

Output :

 > tts_models/multilingual/multi-dataset/xtts_v1.1 is already downloaded.
 > Using model: xtts
/Users/XXX/PycharmProjects/coquiTTS/venv/lib/python3.9/site-packages/torch/nn/utils/weight_norm.py:30: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
  warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
 > Text splitted to sentences.
['Hi guys how are you ?']
 > Processing time: 43.66520690917969
 > Real-time factor: 1.7740076433982859
 > voice_conversion_models/multilingual/vctk/freevc24 is already downloaded.
 > Using model: freevc
 > Loading pretrained speaker encoder model ...
Loaded the voice encoder model on cpu in 0.01 seconds.
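The tracebacks above point at the likely plumbing bug: in the 0.17.x api.py shown earlier, tts_with_vc calls tts_to_file with speaker=None and never forwards speaker_wav, so XTTS receives no reference audio. A simplified stand-in (not the real library code; names only mirror it for illustration) reproduces the failure mode:

```python
def load_audio(audiopath):
    # Mimics soundfile's behavior when handed a non-path.
    if audiopath is None:
        raise TypeError(f"Invalid file: {audiopath!r}")
    return "waveform"

def tts_to_file(text, speaker_wav=None, file_path="output.wav"):
    # XTTS needs the reference audio at this layer.
    load_audio(speaker_wav)
    return file_path

def tts_with_vc_to_file(text, speaker_wav=None, file_path="output.wav"):
    # Bug sketch: speaker_wav is held back for the later VC stage and is
    # not forwarded to the underlying TTS call, so XTTS sees None.
    return tts_to_file(text, file_path=file_path)
```

With this shape, tts_to_file(text, speaker_wav="me.wav") succeeds while tts_with_vc_to_file(text, speaker_wav="me.wav") still raises, matching the reports above.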

@lucasjinreal

Hi, how do I load a model from a local path? I get:

p = os.fspath(p)

TypeError: expected str, bytes or os.PathLike object, not NoneType

eginhard added a commit to idiap/coqui-ai-TTS that referenced this issue Nov 20, 2023
This reverts commit 041b4b6.

Fixes coqui-ai#3143. The original issue (coqui-ai#3067) was people trying to use
tts.tts_with_vc_to_file() with XTTS and was "fixed" in coqui-ai#3109. But XTTS has
integrated VC and you can just do tts.tts_to_file(..., speaker_wav="..."), there
is no point in passing it through FreeVC afterwards. So, reverting this commit
because it breaks tts.tts_with_vc_to_file() for any model that doesn't have
integrated VC, i.e. all models this method is meant for.
erogol pushed a commit that referenced this issue Nov 24, 2023
* Revert "fix for issue 3067"

This reverts commit 041b4b6.

Fixes #3143. The original issue (#3067) was people trying to use
tts.tts_with_vc_to_file() with XTTS and was "fixed" in #3109. But XTTS has
integrated VC and you can just do tts.tts_to_file(..., speaker_wav="..."), there
is no point in passing it through FreeVC afterwards. So, reverting this commit
because it breaks tts.tts_with_vc_to_file() for any model that doesn't have
integrated VC, i.e. all models this method is meant for.

* fix: support multi-speaker models in tts_with_vc/tts_with_vc_to_file

* fix: only compute spk embeddings for models that support it

Fixes #1440. Passing a `speaker_wav` argument to regular Vits models failed
because they don't support voice cloning. Now that argument is simply ignored.
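The dispatch logic these two commit messages describe can be sketched as follows (a simplified illustration with made-up flag names, not the actual TTS code):

```python
def synthesize(text, speaker_wav=None, *, has_integrated_vc=False, supports_cloning=False):
    # After the revert and follow-up fixes:
    # - XTTS-style models (integrated VC) consume speaker_wav directly,
    #   so tts_with_vc_to_file is unnecessary for them.
    # - Models without cloning support simply ignore speaker_wav (#1440)
    #   instead of failing.
    if has_integrated_vc and speaker_wav is not None:
        return f"cloned({speaker_wav}): {text}"
    if supports_cloning and speaker_wav is not None:
        return f"vc({speaker_wav}): {text}"
    return f"plain: {text}"
```

A plain Vits model would hit the final branch even when speaker_wav is supplied, while XTTS takes the first.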
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

7 participants