Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Modify SpeechSynthesisDataset class, make it return text #1205

Merged
merged 8 commits into from
Nov 30, 2023

Conversation

yaozengwei
Copy link
Contributor

@yaozengwei yaozengwei commented Nov 6, 2023

This PR is required by the TTS recipe k2-fsa/icefall#1372 in icefall. In this TTS recipe, we convert the transcript text to phonemes in training.

Copy link
Collaborator

@pzelasko pzelasko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, but don't you prefer to have G2P inside SpeechSynthesisDataset to avoid spending the time running it inside the training loop?

}
"""

def __init__(
self,
cuts: CutSet,
return_cuts: bool = False,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is breaking, can you move return_cuts to some position towards the end of parameter list?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK.

@yaozengwei
Copy link
Contributor Author

yaozengwei commented Nov 10, 2023

Looks good, but don't you prefer to have G2P inside SpeechSynthesisDataset to avoid spending the time running it inside the training loop?

Hi, @pzelasko, I prefer to do the text normalization and tokenization in separate training recipes, since they usually depend on different packages. @csukuangfj suggests doing this in data preparation stage, with converted phonemes saved to manifests.

@pzelasko
Copy link
Collaborator

Looks good, but don't you prefer to have G2P inside SpeechSynthesisDataset to avoid spending the time running it inside the training loop?

Hi, @pzelasko, I prefer to do the text normalization and tokenization in separate training recipes, since they usually depend on different packages. @csukuangfj suggests doing this in data preparation stage, with converted phonemes saved to manifests.

OK cool. It looks like there are some conflicts after merging the other PR with multi-speaker support, could you resolve them?

@yaozengwei
Copy link
Contributor Author

Looks good, but don't you prefer to have G2P inside SpeechSynthesisDataset to avoid spending the time running it inside the training loop?

Hi, @pzelasko, I prefer to do the text normalization and tokenization in separate training recipes, since they usually depend on different packages. @csukuangfj suggests doing this in data preparation stage, with converted phonemes saved to manifests.

OK cool. It looks like there are some conflicts after merging the other PR with multi-speaker support, could you resolve them?

Ok. Thanks. I have some local changes and will resolve the conflicts later.

In addition, in current implementation, it will load the whole cuts set and generate a char-based vocabulary according to the given texts. But I think TTS recipes usually use phoneme tokens instead. Also, if might cause token-id-mapping mismatch bettwen training and test, when separately given two cuts sets for this class. Shall we remove this and make this calss return raw text and optionally pre-converted (phoneme) tokens?

self.token_collater = TokenCollater(cuts, add_eos=add_eos, add_bos=add_bos)

@pzelasko
Copy link
Collaborator

In addition, in current implementation, it will load the whole cuts set and generate a char-based vocabulary according to the given texts. But I think TTS recipes usually use phoneme tokens instead. Also, if might cause token-id-mapping mismatch bettwen training and test, when separately given two cuts sets for this class. Shall we remove this and make this calss return raw text and optionally pre-converted (phoneme) tokens?

I see that this class accepts cuts in constructor, it's an outdated design we don't use anymore, so thanks for updating it. Instead, it would make sense to pass a tokenizer object/callable that (optionally) converts raw text into some tokens. What you're suggesting also works.

@yaozengwei
Copy link
Contributor Author

@pzelasko Thanks. It is ready to be merged.

@yaozengwei
Copy link
Contributor Author

yaozengwei commented Nov 29, 2023

@pzelasko @desh2608 Would you mind giving a review if you have time? I have fixed the test case.

Copy link
Collaborator

@pzelasko pzelasko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, LGTM!

@pzelasko pzelasko merged commit b869488 into lhotse-speech:master Nov 30, 2023
9 of 10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants