Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

enabling diverse datasets in val / test #6306

Merged
merged 2 commits into from
Apr 1, 2023
Merged

enabling diverse datasets in val / test #6306

merged 2 commits into from
Apr 1, 2023

Conversation

bmwshop
Copy link
Collaborator

@bmwshop bmwshop commented Mar 27, 2023

What does this PR do ?

Allows to specify multiple datasets with different configs in validation and test

ALL

Changelog

When processing val / test dataset configs, if a dict is detected instead of a list, overload the entire dataset config, not just the manifest path.

Usage

Specify multiple, different dataset configs as shown below.

  validation_ds:
    ds_item:
    - name: en
      manifest_filepath:
      - ${d}/en/val_test/fisher/audio_manifest_dev_clean_en.json
      - ${d}/en/val_test/europarl/audio_manifest_dev_clean_en.json
      sample_rate: ${model.sample_rate}
      batch_size: 16
      shuffle: false
      num_workers: 8
      pin_memory: true
      max_duration: 20.0
      min_duration: 0.1
      use_start_end_token: false
    - name: es
      manifest_filepath:
      - ${d}/es/nemo_sp_asr_set_3pt0/dev/fisher/dev_fisher_manifest_es.json
      - ${d}/es/nemo_sp_asr_set_3pt0/dev/mcv12/dev_mcv12_manifest_es.json
      - ${d}/es/nemo_sp_asr_set_3pt0/dev/mls/dev_mls_manifest_es.json
      - ${d}/es/nemo_sp_asr_set_3pt0/dev/voxpopuli/dev_voxpopuli_manifest_es.json
      sample_rate: ${model.sample_rate}
      batch_size: 32
      shuffle: false
      num_workers: 8
      pin_memory: true
      max_duration: 10.0
      min_duration: 1.0
      use_start_end_token: false

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • [ x] New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

@bmwshop bmwshop requested a review from titu1994 March 27, 2023 23:04
Copy link
Collaborator

@titu1994 titu1994 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very neat way of handling this !
Two things -

  1. why sample dev and test set ? It will give non determinating results for comparison. How about disabling sampling as a default config example? This should be added to the PRs example.
  2. this capability needs to be fully documented in the ASR dataset / config docs.

Preferably, thete should be 2-3 model multilingual config with this commented into the dev/test section.

@bmwshop
Copy link
Collaborator Author

bmwshop commented Mar 28, 2023

Actually, concat should give deterministic results when shuffle is turned off and the seed is set! The reason to enable concat / sampling is to save time; in the overall multilingual example the validation dataset is over 400 hours long. However, I understand the point and will provide a simpler example.

@titu1994
Copy link
Collaborator

But is the subset of dataset replicable between two runs ? Ie seed is set to same per dataset and therefore the slice of samples is always exactly the same no matter if you call it the first run vs Nth / final chained run ?

@bmwshop
Copy link
Collaborator Author

bmwshop commented Apr 1, 2023

Yes, the result is the same across multiple validation runs if the config does not change - when concat_sampling is set to random but concat_sampling_seed is set and concat_shuffle is set to False.

However, I have removed the concat flags from this example for simplicity and because this PR does not touch the concat dataset code. It just deals with how validation and test dataset configs are processed.

But is the subset of dataset replicable between two runs ? Ie seed is set to same per dataset and therefore the slice of samples is always exactly the same no matter if you call it the first run vs Nth / final chained run ?

Copy link
Collaborator

@titu1994 titu1994 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, looks fine then

@bmwshop bmwshop merged commit 93f9a93 into main Apr 1, 2023
@bmwshop bmwshop deleted the nest_valtest_ds branch April 1, 2023 17:56
hsiehjackson pushed a commit to hsiehjackson/NeMo that referenced this pull request Jun 2, 2023
@orena1
Copy link
Contributor

orena1 commented Feb 21, 2024

Thanks @bmwshop, is it feasible to only append the names and paths of the new datasets while maintaining uniform configurations for all datasets?

E.g:

  validation_ds:
    batch_size: 16
    shuffle: false
    num_workers: 8
    pin_memory: true
    max_duration: 20.0
    min_duration: 0.1
    use_start_end_token: false
    sample_rate: ${model.sample_rate}
    ds_item:
    - name: en
      manifest_filepath:
      - ${d}/en/val_test/fisher/audio_manifest_dev_clean_en.json
    - name: es
      manifest_filepath:
      - ${d}/es/nemo_sp_asr_set_3pt0/dev/fisher/dev_fisher_manifest_es.json
      - ${d}/es/nemo_sp_asr_set_3pt0/dev/mcv12/dev_mcv12_manifest_es.json
      - ${d}/es/nemo_sp_asr_set_3pt0/dev/mls/dev_mls_manifest_es.json
      - ${d}/es/nemo_sp_asr_set_3pt0/dev/voxpopuli/dev_voxpopuli_manifest_es.json

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants