
Conversation

@adammoody
Owner

No description provided.

adammoody and others added 16 commits August 19, 2021 15:46
* fixes for pt-1.10

* switch to torch_assert_equal wrapper
* feat: expose glu activations as argument

* chore: rename activations -> glu_activations

* refactor: use lookup dict instead of `getattr()`

* refactor: mv lookup dict to `glu_activations.py`

* chore: rm unnecessary default arg

* test: add bf16 test; gelu in `test_training_all()`

* Update megatron/testing_utils.py

Co-authored-by: Stas Bekman <[email protected]>

* refactor: use `require_torch_bf16` decorator

* chore: comment out bf16 test

uncomment in the future when torch supports gelu kernels for bf16

* consistent style

* fix lookup table

* better grouping

* fix: replace hard coded options with `GLU_ACTIVATIONS`

Co-authored-by: Stas Bekman <[email protected]>
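
To make the lookup-dict refactor above concrete, here is a minimal sketch of what `glu_activations.py` might expose. `GLU_ACTIVATIONS` comes from the commit titles; `GEGLU` and `get_glu_activation` are illustrative names, not necessarily the repo's:

```python
import torch
from torch import nn

class GEGLU(nn.Module):
    """GLU variant: split the input in half and gate one half with GELU."""
    def forward(self, x):
        x, gate = x.chunk(2, dim=-1)
        return x * nn.functional.gelu(gate)

# Hypothetical lookup dict in the spirit of glu_activations.py; the real
# module may define more variants under different names.
GLU_ACTIVATIONS = {
    "geglu": GEGLU(),
}

def get_glu_activation(name):
    # Dict lookup replaces the earlier getattr()-based dispatch.
    return GLU_ACTIVATIONS[name]
```
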
* add codecarbon

* switch to offline

* rework to also restart the tracker at each checkpoint save, so that as little data as possible is lost

* adjust API to match bigscience-workshop/codecarbon#1

* fix logging

* new implementation based on mlco2/codecarbon#236

* add test

* update requirements
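
A hedged sketch of the restart-at-checkpoint idea from the codecarbon commits above, using the public `OfflineEmissionsTracker` API; the helper name and the `country_iso_code` value are assumptions:

```python
from codecarbon import OfflineEmissionsTracker  # offline mode: no network calls

def save_checkpoint_with_tracking(tracker, save_fn, output_dir):
    # Hypothetical helper: stop the tracker so its data is flushed to disk,
    # save the checkpoint, then start a fresh tracker, so that at most one
    # checkpoint interval of emissions data can be lost on a crash.
    tracker.stop()
    save_fn()
    new_tracker = OfflineEmissionsTracker(country_iso_code="USA",
                                          output_dir=output_dir)
    new_tracker.start()
    return new_tracker
```
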
* start workflow

* fix

* fix

* Update .github/workflows/main.yml

Co-authored-by: Philipp Schmid <[email protected]>

* Update .github/workflows/main.yml

Co-authored-by: Philipp Schmid <[email protected]>

Co-authored-by: Philipp Schmid <[email protected]>
…p#55)

* add parallel merge using mpi

* handle case where some ranks might have 0 items

* add inclusive scan prefix sum
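
For illustration, an inclusive prefix sum over ranks built on `torch.distributed` (which has no native scan collective); the function name is illustrative:

```python
import torch
import torch.distributed as dist

def inclusive_scan(count):
    # Inclusive prefix sum over ranks: rank r receives sum(count[0..r]).
    # Sketch using all_gather, since torch.distributed lacks a scan op.
    world = dist.get_world_size()
    rank = dist.get_rank()
    counts = [torch.zeros(1, dtype=torch.int64) for _ in range(world)]
    dist.all_gather(counts, torch.tensor([count], dtype=torch.int64))
    return int(sum(counts[: rank + 1]))
```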

* report more timing info

* Update megatron/data/indexed_dataset.py

Co-authored-by: Thomas Wang <[email protected]>

* Update megatron/data/indexed_dataset.py

Co-authored-by: Thomas Wang <[email protected]>

* rename total size variable for clarity

* move translation to bin/idx file names a level deeper

* parallel merge for cached dataset

* add alltrue function

* move collectives to new distdata class, add torch.distributed

* drop unused prefix_sum function

* allow ranks to pass a list of files to be merged

* check that input dataset files exist

* fix: using wrong doc_idx list for mmap

* move init dist and collectives to distdata class

* add --merge option, move parallel/serial to their own functions

* Update megatron/data/distdata.py

Co-authored-by: Thomas Wang <[email protected]>

* Update megatron/data/indexed_dataset.py

Co-authored-by: Thomas Wang <[email protected]>

* Update megatron/data/indexed_dataset.py

Co-authored-by: Thomas Wang <[email protected]>

* Update megatron/data/indexed_dataset.py

Co-authored-by: Thomas Wang <[email protected]>

* Update megatron/data/indexed_dataset.py

Co-authored-by: Thomas Wang <[email protected]>

* Update megatron/data/indexed_dataset.py

Co-authored-by: Thomas Wang <[email protected]>

* Update megatron/data/indexed_dataset.py

Co-authored-by: Thomas Wang <[email protected]>

* drop extraneous numpy tolist calls

* rename self.MPI to mpi4py

* handle case where no ranks have elements in their file

* rename tokenize_start to time_start

* drop unrelated comment in distdata.min

* add comment why pointers_shift is not None and add assert

* note why pointers uses sizes count and offset values

* can just rely on rank 0 for the leading 0 element

* add write_list function

* determine element size

* add checks for consistent element_size values

* check that at least one rank has a file to merge

* assert that torch backend is gloo or mpi

* add collectives for assert and raise

* rename to allassert and allraise_if
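
A minimal sketch of what `allraise_if` could look like; the real `distdata` method may differ:

```python
import torch
import torch.distributed as dist

def allraise_if(cond, msg):
    # Every rank contributes its local failure condition; if any rank
    # failed, all ranks raise together, avoiding hangs where only some
    # ranks exit a collective section.
    flag = torch.tensor([1 if cond else 0])
    dist.all_reduce(flag, op=dist.ReduceOp.MAX)
    if flag.item() != 0:
        raise ValueError(msg)
```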

* check dtype instead of element_size

* add uint32 to element_sizes table
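
An illustrative version of the `element_sizes` table with `uint32` added, so merged .bin files holding 4-byte token ids are sized correctly; the actual table in the repo may hold different entries:

```python
import numpy as np

# Hypothetical element_sizes table: bytes per element, keyed by dtype.
element_sizes = {
    np.uint8: 1,
    np.int8: 1,
    np.int16: 2,
    np.int32: 4,
    np.uint32: 4,  # added so 4-byte unsigned token ids are handled
    np.int64: 8,
    np.float32: 4,
    np.float64: 8,
}
```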

* infer dtype from files being merged

* add write_header function to indexed dataset classes

* call write_header internally from IndexedDataset classes

* return number of bytes written from write calls
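
A hedged sketch of a `write_header` that reports bytes written; the header layout here is a placeholder, not the real indexed-dataset format:

```python
import struct

def write_header(fout, dtype_code, num_sizes, num_docs):
    # Hypothetical index header: a magic string, a version number, the
    # dtype code, and the element counts. Returning the byte count lets a
    # parallel writer compute each rank's file offset past the header.
    nbytes = fout.write(b"MAGICHDR")                     # placeholder magic
    nbytes += fout.write(struct.pack("<Q", 1))           # version
    nbytes += fout.write(struct.pack("<B", dtype_code))  # dtype code
    nbytes += fout.write(struct.pack("<QQ", num_sizes, num_docs))
    return nbytes
```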

* move scatterv to distdata class

* add functions to format status and error messages

* defer merge_files_dist to future PR

* open files using with, refresh comments

* rely on default torch datatypes

* fix some status messages from preprocess script

* fix: exclusive scan computing pointers list

* fix: exclusive scan to compute mmap pointers list
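
The pointer fix amounts to an exclusive prefix sum of the entry sizes, e.g. as in this local-step sketch (the cross-rank scan is assumed to have run first):

```python
import numpy as np

def get_pointers(sizes, itemsize):
    # Pointers are byte offsets, i.e. an *exclusive* prefix sum of the
    # entry sizes: entry i starts after the bytes of entries 0..i-1.
    sizes = np.asarray(sizes, dtype=np.int64)
    pointers = np.cumsum(sizes) - sizes  # exclusive scan: starts at 0
    return pointers * itemsize
```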

* note about seek

* rename preprocess_dataset_mpi.py to preprocess_data_dist.py

* update usage comments at top of script

* restore commented print_rank_0 statements

* restore status message in mmap merge_file_

* drop mpi4py, sad :(

* add test case for parallel merge

* add preprocess_data_dist test for serial merge

* improve error handling

* refactor get_pointers code

* bug fix in exscan

* further refactor get_pointers

* move exscan collective for pointers outside of try block

* clarify some comments

* include string 1k in name of test files

* use temporary file for index

* fix: implement scatterv from torch.distributed.scatter

* switch to pad method in torch.nn.functional

* return data received in scatterv as new tensor
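
Putting the three commits above together, a sketch of `scatterv` built from `torch.distributed.scatter`, `torch.nn.functional.pad`, and a trimmed copy of the receive buffer; 1-D int64 data is assumed:

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F

def scatterv(chunks, src=0):
    # torch.distributed.scatter requires equal-sized tensors, so pad every
    # chunk to the longest length, scatter the true lengths alongside,
    # then trim the result and return it as a new tensor.
    rank = dist.get_rank()
    maxlen = torch.zeros(1, dtype=torch.int64)
    if rank == src:
        maxlen[0] = max(c.numel() for c in chunks)
    dist.broadcast(maxlen, src=src)

    if rank == src:
        lens = [torch.tensor([c.numel()]) for c in chunks]
        padded = [F.pad(c, (0, int(maxlen) - c.numel())) for c in chunks]
    else:
        lens, padded = None, None

    mylen = torch.zeros(1, dtype=torch.int64)
    dist.scatter(mylen, lens, src=src)
    recv = torch.zeros(int(maxlen), dtype=torch.int64)
    dist.scatter(recv, padded, src=src)
    return recv[: int(mylen)].clone()
```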

* raise exception if conflicting scratch and merge options

* use allraise method from distdata in preprocess_data_dist

Co-authored-by: Thomas Wang <[email protected]>
adammoody closed this Aug 26, 2021
adammoody pushed a commit that referenced this pull request Sep 20, 2021
* ICT zeroshot evaluation code

* made more generic, aligned with other tasks

* Fixed based on review recommendation

* fixed another issue

* implementing DPR

* implementation dpr

* adding dpr code

* removed comments

* removed comments

* removed comments

* DPR evaluation debugging

* DPR ongoing

* DPR finetune and evaluation

* fixing model evaluation of retriever

* added pre and post process

* added pre and post process

* evaluation works!

* debugging DPR

* fix copy-n-paste error 

remove erroneous arg.

* Typo fix in readme

* t5 fixes

* before cleaning the comments

* vit pipeline fixes

* cleaning the code

* additional cleaning

* renaming the folders

* Add temporary assert to finetuning until it can be fixed.

* Fixed issues with ICT pretraining

* updated the evaluation script for retriever

* updated the evaluation script for retriever

* updated the evaluation script for retriever

* updated the evaluation script for retriever

* added exit interval for finetuning

* updating the scripts

* updating no load rng

* updating script

* Update T5 scripts

* resolved hang issue

* fixed the tensor size mismatch issue

* fixed the evaluation hangs

* Adding readme

* Adding readme

* Adding readme

* Adding readme

* Adding readme

* Adding readme

* Adding readme

* Adding readme

* Clean up README.md a bit

* addressed comments

* updated readme

* updated readme

* updated readme

* updated readme

* Basic handling of prefix lm by updating the mask

* Add prefix option to gpt temporarily and prevent it to use custom kernel

* Add argument for prefix lm, in order to configure masking strategy
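
For illustration, a prefix-LM mask consistent with the commits here: prefix positions attend bidirectionally within the prefix and never to targets, while target positions attend causally. The function name is illustrative:

```python
import torch

def prefix_lm_mask(seq_len, prefix_len):
    # True = position (row) may attend to position (column).
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))  # causal
    mask[:prefix_len, :prefix_len] = True  # prefix fully visible within itself
    # Rows < prefix_len stay False for columns >= prefix_len, so the
    # prefix can only see the prefix, never the target.
    return mask
```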

* Woops

* Add loss_on_targets_only flag; assert that the current prefix implementation only works with reset_attention_mask set to True; attempt to fix an empty slice issue

* Format

* Reverse renaming

* Allow prefix on partial document at the end

* WIP: add prefix per row feature

* Document the use of None

* Woops

* Handle empty document better

* We might not be able to concat empty tensors

* Handle empty tensor separately

* Debug

* Test

* Add loss masking as script argument

* Turns out deepspeed integration of attention matrices prevented dynamic masks

* Add more asserts

* Prefix can only see the prefix, it cannot see target

* Remove prefix-lm argument as we split the pretrain script

* Iz PR review

* Make masking row dependent when using prefix

* Revert "Merge remote-tracking branch 'origin/master' into prefix_lm"

This reverts commit d49d6e5, reversing
changes made to 28a712d.

* Tests (#1)

* WIP: test

* Still trying to figure out deepspeed

* WIP

* Test test

* Test how to setup deepspeed in unit tests

* Test something else

* Empty strings might be problematic

* Remove unnecessary arguments

* Woops

* Remove global variables at the end of each test and init deepspeed
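
A hypothetical scaffold for the test-isolation approach described above; `reset_global_variables` is a stand-in for the real reset, not an actual API:

```python
import unittest
import deepspeed

def reset_global_variables():
    # Hypothetical stand-in: Megatron caches args, tokenizer, etc. in
    # module-level globals, which must be cleared between tests.
    pass

class TestPrefixLM(unittest.TestCase):
    def setUp(self):
        reset_global_variables()  # reset before, so leaks cannot carry over
        deepspeed.init_distributed(dist_backend="gloo")

    def tearDown(self):
        print("tearDown happens")  # the debug print mentioned in the commits
        reset_global_variables()
```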

* Woops

* Maybe adding classmethod

* Woops

* Add debug print to check that tear down happens

* Reset global variables before

* Let's test this

* Try something else

* WIP

* More fix

* More fix

* More stuff to fix

* We really want to compare vectors and not coordinates

* Reformat

* check something out

* fix test

* Remove prefix-lm flag as it's integrated

* Woops

* Add test for without reset attention mask

* Fix test for non reset attention mask

* Fix test

* Update code for prefix lm

Co-authored-by: Mostofa Patwary <[email protected]>
Co-authored-by: Stas Bekman <[email protected]>
Co-authored-by: Devrim <[email protected]>
Co-authored-by: Stas Bekman <[email protected]>
Co-authored-by: Vijay Korthikanti <[email protected]>
Co-authored-by: Jared Casper <[email protected]>
Co-authored-by: Mohammad Shoeybi <[email protected]>
Co-authored-by: Deepak Narayanan <[email protected]>
adammoody pushed a commit that referenced this pull request Oct 27, 2022
* Enable Megatron-LM workload on ROCm (#1)

* Enable Megatron workload on ROCm

* Added ds_pretrain_gpt_350M_dense_pipeclean.sh

* removed a file

* Removed an extra line

* Fix to resolve the below rsqrtf() error on ROCm

/root/Megatron-DeepSpeed/megatron/fused_kernels/layer_norm_hip_kernel.hip:298:10: error: no matching function for call to 'rsqrtf'
  return rsqrtf(v);
         ^~~~~~
/opt/rocm-5.2.0/llvm/lib/clang/14.0.0/include/__clang_hip_math.h:521:7: note: candidate function not viable: call to __device__ function from __host__ function
float rsqrtf(float __x) { return __ocml_rsqrt_f32(__x); }
      ^

* Simplified code

* Simplified the code

* Removed extra spaces