Add steps for document of getting dataset 'SF Bilingual Speech' #7378

Merged on Sep 19, 2023 (2 commits)

Conversation

RobinDong
Contributor

@RobinDong RobinDong commented Sep 6, 2023

What does this PR do?

When following the documentation to get the SFSpeech Chinese/English Bilingual Speech dataset, the first command

python scripts/dataset_processing/tts/sfbilingual/get_data.py \
    --data-root <your_local_dataset_root> \
    --val-size 0.1 \
    --test-size 0.2 \
    --seed-for-ds-split 100

immediately raises an error:

Traceback (most recent call last):
  File "/home/xxx/NeMo/scripts/dataset_processing/tts/sfbilingual/get_data.py", line 122, in <module>
    main()
  File "/home/xxx/NeMo/scripts/dataset_processing/tts/sfbilingual/get_data.py", line 116, in main
    __process_data(
  File "/home/xxx/NeMo/scripts/dataset_processing/tts/sfbilingual/get_data.py", line 91, in __process_data
    entries = __process_transcript(dataset_path)
  File "/home/xxx/NeMo/scripts/dataset_processing/tts/sfbilingual/get_data.py", line 65, in __process_transcript
    with open(file_path / "text_SF.txt", encoding="utf-8") as fin:
FileNotFoundError: [Errno 2] No such file or directory: 'tryit/text_SF.txt'

The reason: scripts/dataset_processing/tts/sfbilingual/get_data.py does not actually download the dataset; it only processes files that already exist under the data root. The SFSpeech Chinese/English Bilingual Speech dataset can only be downloaded through NVIDIA NGC.
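To catch the missing download before the script fails, a small pre-flight check can help. The sketch below is not part of get_data.py; the helper name check_dataset_root is hypothetical, and it assumes the transcript text_SF.txt sits directly under the data root, as the path in the traceback suggests.

```python
from pathlib import Path

# File name taken from the traceback above.
TRANSCRIPT_NAME = "text_SF.txt"


def check_dataset_root(data_root):
    """Return the transcript path if it exists, otherwise raise with a hint.

    Hypothetical helper: assumes the transcript is located directly
    under the directory passed as --data-root.
    """
    transcript = Path(data_root) / TRANSCRIPT_NAME
    if not transcript.is_file():
        raise FileNotFoundError(
            f"{transcript} not found. Download the dataset from NVIDIA NGC "
            f"into {data_root} before running get_data.py."
        )
    return transcript
```

Calling check_dataset_root("<your_local_dataset_root>") before running the script turns the bare FileNotFoundError into a message that points at the missing download step.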

Changelog

  • Add steps to the SFSpeech dataset documentation for first downloading the dataset with the ngc-cli tool

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and follow Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

@blisc @okuchaiev @titu1994 @XuesongYang

@XuesongYang
Collaborator

Thanks for the fix. I made some changes on top of yours.

RobinDong and others added 2 commits September 19, 2023 19:19
added a link from a tutorial demonstrating detailed data prep steps.

Signed-off-by: Xuesong Yang <[email protected]>
@blisc blisc merged commit b5d4573 into NVIDIA:main Sep 19, 2023
11 checks passed

3 participants