
Maximum sample-based training for Megatron NMT and Text Memmap based Seq2seq Pre-training #4396

Merged: 74 commits from megatron_nmt_sample_training into main on Jul 30, 2022

Conversation

MaximumEntropy (Contributor) commented Jun 17, 2022

What does this PR do ?

  1. Trains Megatron-based NMT models for a maximum number of samples.
  2. Adds support for text_memmap and csv_memmap datasets in Megatron encoder-decoder models (T5, BART, UL2).
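
Sample-based training caps a run by the number of training samples consumed rather than by epochs. A minimal sketch of the bookkeeping this implies (function and variable names here are illustrative, not NeMo's actual API): each optimizer step consumes one global batch, so a sample budget maps directly onto a step budget.

```python
# Illustrative sketch of sample-based training bookkeeping.
# These helpers are hypothetical, not NeMo's actual API.

def max_steps_from_samples(max_samples: int, global_batch_size: int) -> int:
    """Optimizer steps needed to consume at most `max_samples` samples.

    Each step consumes one global batch, so round down to stay
    within the sample budget.
    """
    return max_samples // global_batch_size

def consumed_samples(step: int, global_batch_size: int) -> int:
    """Samples consumed after `step` completed optimizer steps."""
    return step * global_batch_size
```

For example, a budget of 1,000,000 samples with a global batch size of 512 corresponds to 1953 optimizer steps.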

Collection: NLP

Usage

Add to command line

  model.data.data_impl=text_mmap \
  +model.data.data_impl_kwargs.newline_int=10 \
  +model.data.data_impl_kwargs.header_lines=0 \
  +model.data.data_impl_kwargs.workers=null \
  +model.data.data_impl_kwargs.sort_dataset_paths=False

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs in each area.

Additional Information

  • Related to # (issue)


lgtm-com bot commented Jun 17, 2022

This pull request introduces 5 alerts when merging 6c5a163 into 317739f - view on LGTM.com

new alerts:

  • 4 for Wrong number of arguments in a class instantiation
  • 1 for Wrong name for an argument in a class instantiation

Signed-off-by: MaximumEntropy <[email protected]>
Signed-off-by: MaximumEntropy <[email protected]>

lgtm-com bot commented Jun 17, 2022

This pull request introduces 3 alerts and fixes 1 when merging 5299acd into 317739f - view on LGTM.com

new alerts:

  • 3 for Wrong number of arguments in a class instantiation

fixed alerts:

  • 1 for Unused import

Signed-off-by: MaximumEntropy <[email protected]>

lgtm-com bot commented Jun 18, 2022

This pull request introduces 3 alerts and fixes 1 when merging 7a7ad85 into e542d7f - view on LGTM.com

new alerts:

  • 3 for Wrong number of arguments in a class instantiation

fixed alerts:

  • 1 for Unused import

Signed-off-by: MaximumEntropy <[email protected]>

lgtm-com bot commented Jun 20, 2022

This pull request introduces 3 alerts and fixes 1 when merging 734edd3 into e542d7f - view on LGTM.com

new alerts:

  • 3 for Wrong number of arguments in a class instantiation

fixed alerts:

  • 1 for Unused import


lgtm-com bot commented Jun 21, 2022

This pull request introduces 3 alerts and fixes 1 when merging 9131474 into e542d7f - view on LGTM.com

new alerts:

  • 3 for Wrong number of arguments in a class instantiation

fixed alerts:

  • 1 for Unused import

Signed-off-by: MaximumEntropy <[email protected]>

lgtm-com bot commented Jun 22, 2022

This pull request introduces 3 alerts and fixes 1 when merging 3a101bf into 41f27a5 - view on LGTM.com

new alerts:

  • 3 for Wrong number of arguments in a class instantiation

fixed alerts:

  • 1 for Unused import


lgtm-com bot commented Jun 23, 2022

This pull request introduces 3 alerts and fixes 1 when merging 245fc90 into 41f27a5 - view on LGTM.com

new alerts:

  • 3 for Wrong number of arguments in a class instantiation

fixed alerts:

  • 1 for Unused import

Signed-off-by: MaximumEntropy <[email protected]>

lgtm-com bot commented Jul 28, 2022

This pull request fixes 2 alerts when merging 18207e7 into 72d78d8 - view on LGTM.com

fixed alerts:

  • 1 for Unused local variable
  • 1 for Unused import

ericharper (Collaborator) previously approved these changes Jul 29, 2022 and left a comment:

Thanks!


lgtm-com bot commented Jul 29, 2022

This pull request fixes 2 alerts when merging 930be3e into 59d635c - view on LGTM.com

fixed alerts:

  • 1 for Unused local variable
  • 1 for Unused import

Signed-off-by: MaximumEntropy <[email protected]>

lgtm-com bot commented Jul 29, 2022

This pull request fixes 2 alerts when merging ef353c5 into 588c6ca - view on LGTM.com

fixed alerts:

  • 1 for Unused local variable
  • 1 for Unused import


lgtm-com bot commented Jul 29, 2022

This pull request introduces 1 alert and fixes 2 when merging 3b41977 into 2f85541 - view on LGTM.com

new alerts:

  • 1 for Unused import

fixed alerts:

  • 1 for Unused local variable
  • 1 for Unused import


lgtm-com bot commented Jul 29, 2022

This pull request introduces 1 alert and fixes 2 when merging 621dbf7 into 2f85541 - view on LGTM.com

new alerts:

  • 1 for Unused import

fixed alerts:

  • 1 for Unused local variable
  • 1 for Unused import

ericharper (Collaborator) left a comment:

LGTM. Thanks!


lgtm-com bot commented Jul 29, 2022

This pull request introduces 1 alert and fixes 2 when merging 82e6560 into 4fef5dd - view on LGTM.com

new alerts:

  • 1 for Unused import

fixed alerts:

  • 1 for Unused local variable
  • 1 for Unused import


lgtm-com bot commented Jul 30, 2022

This pull request introduces 1 alert and fixes 2 when merging b739756 into 1be2bda - view on LGTM.com

new alerts:

  • 1 for Unused import

fixed alerts:

  • 1 for Unused local variable
  • 1 for Unused import

Signed-off-by: MaximumEntropy <[email protected]>

lgtm-com bot commented Jul 30, 2022

This pull request introduces 1 alert and fixes 2 when merging 7a8b244 into 1be2bda - view on LGTM.com

new alerts:

  • 1 for Unused import

fixed alerts:

  • 1 for Unused local variable
  • 1 for Unused import


lgtm-com bot commented Jul 30, 2022

This pull request introduces 1 alert and fixes 2 when merging e4d5619 into 21cf961 - view on LGTM.com

new alerts:

  • 1 for Unused import

fixed alerts:

  • 1 for Unused local variable
  • 1 for Unused import

@MaximumEntropy merged commit 0b7df7a into main and deleted the megatron_nmt_sample_training branch on Jul 30, 2022 at 04:46.
paarthneekhara added a commit to paarthneekhara/NeMo that referenced this pull request Jul 31, 2022
* bug fix - sample rate was being ignored in vocoder dataset when not loading mel

Signed-off-by: Paarth Neekhara <[email protected]>

* handled n segments for a different sampling rate than original sampling rate

Signed-off-by: Paarth Neekhara <[email protected]>

* Added case for n_segments 0, warning for n_segments greater than file length

Signed-off-by: Paarth Neekhara <[email protected]>

* Fix metric setup for finetuning without a test set (NVIDIA#4585)

* Fix metric setup for finetuning without a test set

Signed-off-by: MaximumEntropy <[email protected]>

* Fix log key

Signed-off-by: MaximumEntropy <[email protected]>

* Remove pdb

Signed-off-by: MaximumEntropy <[email protected]>

* Minor

Signed-off-by: MaximumEntropy <[email protected]>

* Fix skip train ds building while finetuning

Signed-off-by: MaximumEntropy <[email protected]>

Co-authored-by: Oleksii Kuchaiev <[email protected]>

* r1.10.0 MegaMolBART Compatibility (NVIDIA#4603)

* 1. Added vocab_size property to RegExTokenizer.

Signed-off-by: Micha Livne <[email protected]>

* 1. Fixed passing hiddens directly.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Fixed style.

Signed-off-by: Micha Livne <[email protected]>

* 1. Added support in encoder outputs.

Signed-off-by: Micha Livne <[email protected]>

* 1. Added comments.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Added automatic mapping of kwargs to args in forward.

Signed-off-by: Micha Livne <[email protected]>

* 1. Added encode function.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Fixed style.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Fixed style.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Fixed style.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. PP and TP works (but not together)

Signed-off-by: Micha Livne <[email protected]>

* 1. Separated get_forward_output_only_func_encode and get_forward_output_only_func_decode.

Signed-off-by: Micha Livne <[email protected]>

* update branch

Signed-off-by: ericharper <[email protected]>

* Set headscale false (NVIDIA#4364)

Signed-off-by: MaximumEntropy <[email protected]>

* Add wandb as dependency (NVIDIA#4365)

Signed-off-by: smajumdar <[email protected]>

* Raise trainer error (NVIDIA#4356)

Signed-off-by: MaximumEntropy <[email protected]>

Co-authored-by: Micha Livne <[email protected]>

* Set headscale false (NVIDIA#4364) (NVIDIA#4366)

Signed-off-by: MaximumEntropy <[email protected]>
Signed-off-by: smajumdar <[email protected]>

* Finetuning changes for BART (NVIDIA#4003)

* Temp

Signed-off-by: MaximumEntropy <[email protected]>

* Checkpoint converter to nemo for bart

Signed-off-by: MaximumEntropy <[email protected]>

* Style

Signed-off-by: MaximumEntropy <[email protected]>

Co-authored-by: Micha Livne <[email protected]>

* Make position embedding expansion specific to a batch to avoid checkpoint size mismatches (NVIDIA#4357)

* Style

Signed-off-by: MaximumEntropy <[email protected]>

* Fix logging warning

Signed-off-by: MaximumEntropy <[email protected]>

Co-authored-by: Micha Livne <[email protected]>

* 1. Added return logits to validation.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Fixed unknown token during sampling.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Fixed RegExTokenizer loading.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Fixed ckpt file with samples int(0).

Signed-off-by: Micha Livne <[email protected]>

* 1. Fixed regex tokenizer.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Fixed allowing enc_tokens to be None.

Signed-off-by: Micha Livne <[email protected]>

* 1. Added ability to ignore tokens by id during decode.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Fixed regex tokenizer .nemo loading issue.

Signed-off-by: Micha Livne <[email protected]>

* 1. Fixed style.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Fixed style.

Signed-off-by: Micha Livne <[email protected]>

* 1. Fixed RegEx test.

Signed-off-by: Micha Livne <[email protected]>

* r1.10.0 untie embeddings weights (NVIDIA#4519)

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Added independent decoder embeddings, and independent decoder token_head.

Signed-off-by: Micha Livne <[email protected]>

* 1. Added support in yaml config.

Signed-off-by: Micha Livne <[email protected]>

* 1. Fixed initialization.

Signed-off-by: Micha Livne <[email protected]>

* 1. Added tests for untied embeddings and decoder token head.

Signed-off-by: Micha Livne <[email protected]>

* 1. Updated share_word_embeddings to share_token_embeddings.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Fixed style.
Signed-off-by: Micha Livne <[email protected]>

* 1. Fixed error in __del__ when TextMemMapDataset fails to build.

Signed-off-by: Micha Livne <[email protected]>

* 1. Fixed comments.

Signed-off-by: Micha Livne <[email protected]>

* 1.Made method private.

Signed-off-by: Micha Livne <[email protected]>

* 1. Fixed config names.

Signed-off-by: Micha Livne <[email protected]>

* 1. Fixed alerts and style.

Signed-off-by: Micha Livne <[email protected]>

* 1. Fixed PP, TP, PP+TP still fails.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

Co-authored-by: Micha Livne <[email protected]>
Co-authored-by: ericharper <[email protected]>
Co-authored-by: Sandeep Subramanian <[email protected]>
Co-authored-by: Somshubra Majumdar <[email protected]>

* Update megatron t5 interface to dialogue (NVIDIA#4626)

* G2P Aligner (NVIDIA#4604)

* Aligner inference notebook in progress. Preprocessing, forward, attn viz

Signed-off-by: Jocelyn Huang <[email protected]>

* Hard attn, duration extraction, distance matrix

Signed-off-by: Jocelyn Huang <[email protected]>

* Started: phoneme disambiguation using Aligner distance matrix

Signed-off-by: Jocelyn Huang <[email protected]>

* Decouple encode_from_g2p() from phoneme tokenizer encode() for disambiguation inference

Signed-off-by: Jocelyn Huang <[email protected]>

* Aligner G2P disambiguation using mean L2 embedding distance

Signed-off-by: Jocelyn Huang <[email protected]>

* Rename aligner inference notebook

Signed-off-by: Jocelyn Huang <[email protected]>

* Header text for Aligner notebook, formatting

Signed-off-by: Jocelyn Huang <[email protected]>

* Aligner notebook formatting, header, license updates

Signed-off-by: Jocelyn Huang <[email protected]>

* Aligner G2P disambiguation script draft

Signed-off-by: Jocelyn Huang <[email protected]>

* Aligner G2P disambiguation script finished

Signed-off-by: Jocelyn Huang <[email protected]>

* Remove normalization step to fix words with apostrophes (G2P)

Signed-off-by: Jocelyn Huang <[email protected]>

* Fix normalization args for G2P disambiguation

Signed-off-by: Jocelyn Huang <[email protected]>

* Allow str to be passed in for supp data, add 'text_normalized' as manifest option

Signed-off-by: Jocelyn Huang <[email protected]>

* Aligner G2P script fixes: normalization, tokenization, add brackets around tokens, etc.

Signed-off-by: Jocelyn Huang <[email protected]>

* Only disambiguate words in the given heteronyms list

Signed-off-by: Jocelyn Huang <[email protected]>

* Filtering option for disambiguation script

Signed-off-by: Jocelyn Huang <[email protected]>

* Add confidence thresholding, add PASTY to cmudict entries

Signed-off-by: Jocelyn Huang <[email protected]>

* TTS Aligner tutorial updates to generic path text

Signed-off-by: Jocelyn Huang <[email protected]>

* Add confidence to aligner_g2p.py run example

Signed-off-by: Jocelyn Huang <[email protected]>

* Move avg word distance function to Aligner encoder, add docstring, fix license

Signed-off-by: Jocelyn Huang <[email protected]>

* Aligner Inference notebook updates (link to sample, resources added)

Signed-off-by: Jocelyn Huang <[email protected]>

* Fix HF check for model card info (NVIDIA#4628)

Signed-off-by: smajumdar <[email protected]>

* Tiny VAD refactoring for postprocessing (NVIDIA#4625)

* binarization start index

Signed-off-by: fayejf <[email protected]>

* fix frame len

Signed-off-by: fayejf <[email protected]>

* style fix

Signed-off-by: fayejf <[email protected]>

* rename UNIT_FRAME_LEN

Signed-off-by: fayejf <[email protected]>

* update overlap script and fix lgtm

Signed-off-by: fayejf <[email protected]>

* style fix

Signed-off-by: fayejf <[email protected]>

* Fix ITN pt (NVIDIA#4623)

Signed-off-by: Guilherme Steinmann <[email protected]>

* [TN] bug fix "hundred" in Audio-based, added method to split text into sentences (NVIDIA#4610)

* fix duplex inference with grammars

Signed-off-by: ekmb <[email protected]>

* fix hundred TN audio bug, add split text

Signed-off-by: ekmb <[email protected]>

* fix header year

Signed-off-by: ekmb <[email protected]>

* style fix

Signed-off-by: ekmb <[email protected]>

* exclude I from roman-ordinal form

Signed-off-by: ekmb <[email protected]>

* fix graph_with_and

Signed-off-by: ekmb <[email protected]>

* fix tests

Signed-off-by: ekmb <[email protected]>

* fix split regex

Signed-off-by: ekmb <[email protected]>

* fix warning

Signed-off-by: ekmb <[email protected]>

* [Text Processing] G2P for OOV and heteronyms (NVIDIA#4624)

* add models

Signed-off-by: ekmb <[email protected]>

* fix header and t5 inference

Signed-off-by: ekmb <[email protected]>

* fix jenkins

Signed-off-by: ekmb <[email protected]>

* fix jenkins

Signed-off-by: ekmb <[email protected]>

* fix lgtm

Signed-off-by: ekmb <[email protected]>

* review fixes

Signed-off-by: ekmb <[email protected]>

* fix if/else and removed unused imports

Signed-off-by: ekmb <[email protected]>

* replace ModelPT with G2PModel

Signed-off-by: ekmb <[email protected]>

* black

Signed-off-by: ekmb <[email protected]>

* add missing headers

Signed-off-by: ekmb <[email protected]>

* jenkins

Signed-off-by: ekmb <[email protected]>

* jenkins

Signed-off-by: ekmb <[email protected]>

* fix TRANSFORMERS_OFFLINE flag

Signed-off-by: ekmb <[email protected]>

* jenkins

Signed-off-by: ekmb <[email protected]>

* jenkins

Signed-off-by: ekmb <[email protected]>

* jenkins

Signed-off-by: ekmb <[email protected]>

* Update README.rst

* Fp16 support for Conformer (NVIDIA#4571)

* adding auto-select best precision for mhsa

* cleanup

* moving mhsa32 check into mhsa

* switching to torch.cuda.is_bf16_supported()

* now using torch.is_autocast_enabled()

* added to non rel mhsa

* only forcing 32bit subsampling if using bf16

* removing unused imports

* moving contexts to utils

Signed-off-by: Dima Rekesh <[email protected]>

* formatting

Signed-off-by: Dima Rekesh <[email protected]>

* naming

Co-authored-by: Dima Rekesh <[email protected]>
Co-authored-by: Somshubra Majumdar <[email protected]>

* Maximum sample-based training for Megatron NMT and Text Memmap based Seq2seq Pre-training (NVIDIA#4396)

* Update blendable dataset, and refactor seq2seq data

Signed-off-by: MaximumEntropy <[email protected]>

* Blendable dataset with binarized mmap working

Signed-off-by: MaximumEntropy <[email protected]>

* Pass seed from cfg to dataset

Signed-off-by: MaximumEntropy <[email protected]>

* Fix multilingual setup

Signed-off-by: MaximumEntropy <[email protected]>

* Add on epoch start reconfiguration

Signed-off-by: MaximumEntropy <[email protected]>

* Style

Signed-off-by: MaximumEntropy <[email protected]>

* Update tokenizer creation for multilingual

Signed-off-by: MaximumEntropy <[email protected]>

* Tmp

Signed-off-by: MaximumEntropy <[email protected]>

* Update NMT script

Signed-off-by: MaximumEntropy <[email protected]>

* Remove unused import

Signed-off-by: MaximumEntropy <[email protected]>

* Update training script

Signed-off-by: MaximumEntropy <[email protected]>

* Log consumed samples

Signed-off-by: MaximumEntropy <[email protected]>

* Logging on val epoch end

Signed-off-by: MaximumEntropy <[email protected]>

* Style

Signed-off-by: MaximumEntropy <[email protected]>

* Remove redundant print

Signed-off-by: MaximumEntropy <[email protected]>

* Ckpt averaging for non model parallel megatron models

Signed-off-by: MaximumEntropy <[email protected]>

* Style

Signed-off-by: MaximumEntropy <[email protected]>

* Empty

Signed-off-by: MaximumEntropy <[email protected]>

* Update error message

Signed-off-by: MaximumEntropy <[email protected]>

* Style

Signed-off-by: MaximumEntropy <[email protected]>

* Remove check

Signed-off-by: MaximumEntropy <[email protected]>

* Restore fixes

Signed-off-by: MaximumEntropy <[email protected]>

* Remove ipdb

Signed-off-by: MaximumEntropy <[email protected]>

* Fixes

Signed-off-by: MaximumEntropy <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Testing a simple solution

Signed-off-by: Micha Livne <[email protected]>

* 1. Fixed. Seems to work. Need to validate.

Signed-off-by: Micha Livne <[email protected]>

* 1. Added support in CSV and text memmap to Megatron encoder-decoder

Signed-off-by: Micha Livne <[email protected]>

* 1. Added support in CSV.

Signed-off-by: Micha Livne <[email protected]>

* 1. Fixed style.

Signed-off-by: Micha Livne <[email protected]>

* 1. Fixed style.
2. Fixed bugs.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Fixed bugs.

Signed-off-by: Micha Livne <[email protected]>

* 1. Fixed style.

Signed-off-by: Micha Livne <[email protected]>

* 1. Updated yaml.

Signed-off-by: Micha Livne <[email protected]>

* 1. Fixed warnings.

Signed-off-by: Micha Livne <[email protected]>

* 1. Fixed style.

Signed-off-by: Micha Livne <[email protected]>

* 1. Fixed style.

Signed-off-by: Micha Livne <[email protected]>

* 1. Fixed a bug.

Signed-off-by: Micha Livne <[email protected]>

* 1. Added a test for text_memmap

Signed-off-by: Micha Livne <[email protected]>

* Fix retro

Signed-off-by: MaximumEntropy <[email protected]>

* add docstrings

Signed-off-by: MaximumEntropy <[email protected]>

* Minor

Signed-off-by: MaximumEntropy <[email protected]>

* Uncomment CI tests and fix existing gpt ci tests

Signed-off-by: MaximumEntropy <[email protected]>

* Tmp

Signed-off-by: MaximumEntropy <[email protected]>

* Remove max step hacking and move on_train_batch_end to base model

Signed-off-by: MaximumEntropy <[email protected]>

* Empty

Signed-off-by: MaximumEntropy <[email protected]>

Co-authored-by: Micha Livne <[email protected]>
Co-authored-by: Micha Livne <[email protected]>
Co-authored-by: Eric Harper <[email protected]>

* NeMo Megatron Doc updates1 (NVIDIA#4633)

* Work on NeMo Megatron OSS documentation

Signed-off-by: Oleksii Kuchaiev <[email protected]>

* NeMo Megatron doc updates

Signed-off-by: Oleksii Kuchaiev <[email protected]>

Co-authored-by: Xuesong Yang <[email protected]>
Co-authored-by: Sandeep Subramanian <[email protected]>
Co-authored-by: Oleksii Kuchaiev <[email protected]>
Co-authored-by: Micha Livne <[email protected]>
Co-authored-by: Micha Livne <[email protected]>
Co-authored-by: ericharper <[email protected]>
Co-authored-by: Somshubra Majumdar <[email protected]>
Co-authored-by: Zhilin Wang <[email protected]>
Co-authored-by: Jocelyn <[email protected]>
Co-authored-by: fayejf <[email protected]>
Co-authored-by: Guilherme Steinmann <[email protected]>
Co-authored-by: Evelina <[email protected]>
Co-authored-by: Dima Rekesh <[email protected]>
Co-authored-by: Dima Rekesh <[email protected]>
Co-authored-by: Micha Livne <[email protected]>
Davood-M pushed a commit to Davood-M/NeMo that referenced this pull request Aug 9, 2022
…Seq2seq Pre-training (NVIDIA#4396)

Signed-off-by: David Mosallanezhad <[email protected]>
piraka9011 pushed a commit to piraka9011/NeMo that referenced this pull request Aug 25, 2022
…Seq2seq Pre-training (NVIDIA#4396)

* Update blendable dataset, and refactor seq2seq data

Signed-off-by: MaximumEntropy <[email protected]>

* Blendable dataset with binarized mmap working

Signed-off-by: MaximumEntropy <[email protected]>

* Pass seed from cfg to dataset

Signed-off-by: MaximumEntropy <[email protected]>

* Fix multilingual setup

Signed-off-by: MaximumEntropy <[email protected]>

* Add on epoch start reconfiguration

Signed-off-by: MaximumEntropy <[email protected]>

* Style

Signed-off-by: MaximumEntropy <[email protected]>

* Update tokenizer creation for multilingual

Signed-off-by: MaximumEntropy <[email protected]>

* Tmp

Signed-off-by: MaximumEntropy <[email protected]>

* Update NMT script

Signed-off-by: MaximumEntropy <[email protected]>

* Remove unused import

Signed-off-by: MaximumEntropy <[email protected]>

* Update training script

Signed-off-by: MaximumEntropy <[email protected]>

* Log consumed samples

Signed-off-by: MaximumEntropy <[email protected]>

* Logging on val epoch end

Signed-off-by: MaximumEntropy <[email protected]>

* Style

Signed-off-by: MaximumEntropy <[email protected]>

* Remove redundant print

Signed-off-by: MaximumEntropy <[email protected]>

* Ckpt averaging for non model parallel megatron models

Signed-off-by: MaximumEntropy <[email protected]>

* Style

Signed-off-by: MaximumEntropy <[email protected]>

* Empty

Signed-off-by: MaximumEntropy <[email protected]>

* Update error message

Signed-off-by: MaximumEntropy <[email protected]>

* Style

Signed-off-by: MaximumEntropy <[email protected]>

* Remove check

Signed-off-by: MaximumEntropy <[email protected]>

* Restore fixes

Signed-off-by: MaximumEntropy <[email protected]>

* Remove ipdb

Signed-off-by: MaximumEntropy <[email protected]>

* Fixes

Signed-off-by: MaximumEntropy <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Testing a simple solution

Signed-off-by: Micha Livne <[email protected]>

* 1. Fixed. Seems to work. Need to validate.

Signed-off-by: Micha Livne <[email protected]>

* 1. Added support in CSV and text memmap toMEgatron encoder-decoder

Signed-off-by: Micha Livne <[email protected]>
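The idea behind a text memmap dataset is to avoid loading the corpus into RAM: the file is memory-mapped and indexed by the byte offsets of newline characters (byte value 10, matching the PR's `newline_int=10` default). A self-contained sketch of that technique, deliberately simplified and not NeMo's actual `TextMemMapDataset` class:

```python
import os
import tempfile

import numpy as np

class TextMemmapSketch:
    """Illustrative line-indexed view over a memory-mapped text file."""

    def __init__(self, path, newline_int=10, header_lines=0):
        # map the raw bytes of the file without reading it into memory
        self.mdata = np.memmap(path, dtype=np.uint8, mode="r")
        # byte offsets of every newline; line i spans (midx[i-1]+1, midx[i])
        self.midx = np.where(self.mdata == newline_int)[0]
        self.header_lines = header_lines

    def __len__(self):
        return len(self.midx) - self.header_lines

    def __getitem__(self, idx):
        i = idx + self.header_lines
        start = int(self.midx[i - 1]) + 1 if i > 0 else 0
        end = int(self.midx[i])
        return bytes(self.mdata[start:end]).decode("utf-8")

# usage: two tab-separated source/target pairs, one per line
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("src1\ttgt1\nsrc2\ttgt2\n")
    path = f.name

ds = TextMemmapSketch(path)
print(len(ds), ds[0])  # 2 src1	tgt1
os.unlink(path)
```

A CSV memmap variant works the same way, with `header_lines=1` to skip a header row and per-line parsing on top of the byte slices.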

* 1. Added support in CSV.

Signed-off-by: Micha Livne <[email protected]>

* 1. Fixed style.

Signed-off-by: Micha Livne <[email protected]>

* 1. Fixed style.
2. Fixed bugs.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Fixed bugs.

Signed-off-by: Micha Livne <[email protected]>

* 1. Fixed style.

Signed-off-by: Micha Livne <[email protected]>

* 1. Updated yaml.

Signed-off-by: Micha Livne <[email protected]>

* 1. Fixed warnings.

Signed-off-by: Micha Livne <[email protected]>

* 1. Fixed style.

Signed-off-by: Micha Livne <[email protected]>

* 1. Fixed style.

Signed-off-by: Micha Livne <[email protected]>

* 1. Fixed a bug.

Signed-off-by: Micha Livne <[email protected]>

* 1. Added a test for text_memmap

Signed-off-by: Micha Livne <[email protected]>

* Fix retro

Signed-off-by: MaximumEntropy <[email protected]>

* add docstrings

Signed-off-by: MaximumEntropy <[email protected]>

* Minor

Signed-off-by: MaximumEntropy <[email protected]>

* Uncomment CI tests and fix existing gpt ci tests

Signed-off-by: MaximumEntropy <[email protected]>

* Tmp

Signed-off-by: MaximumEntropy <[email protected]>

* Remove max step hacking and move on_train_batch_end to base model

Signed-off-by: MaximumEntropy <[email protected]>

* Empty

Signed-off-by: MaximumEntropy <[email protected]>

Co-authored-by: Micha Livne <[email protected]>
Co-authored-by: Micha Livne <[email protected]>
Co-authored-by: Eric Harper <[email protected]>
Signed-off-by: Anas Abou Allaban <[email protected]>
hainan-xv pushed a commit to hainan-xv/NeMo that referenced this pull request Nov 29, 2022
…Seq2seq Pre-training (NVIDIA#4396)

(commit message identical to the merged PR's commit list above)
Signed-off-by: Hainan Xu <[email protected]>