
Conversation

@conglongli (Contributor) commented on Oct 10, 2021

Fix the curriculum learning (CL) + pipeline parallelism (PP) case when pp >= 4.

The error that this PR fixes can be reproduced by changing pp_size to 4 and num-layers to 4 in https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/main/tests/test_training.py.
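For illustration, a minimal sketch of the reproduction setting (not the actual test_training.py diff; the flag names are taken from the launch command quoted further down in this thread): run the GPT pretraining test with a 4-stage pipeline and only 4 transformer layers, i.e. one layer per pipeline stage.

```python
# Hedged sketch of the reproduction setting, not the real test code:
# a 4-stage pipeline with only 4 layers, one transformer layer per stage.
repro_overrides = [
    "--pipeline-model-parallel-size", "4",  # pp_size = 4
    "--num-layers", "4",                    # one layer per pipeline stage
]
```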

@stas00 please test this on your side.

This PR also fixes backward compatibility for the new checkpoint keys introduced by curriculum learning.
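For context, a minimal sketch of the usual backward-compatibility pattern (the key name below is hypothetical and used only for illustration, not necessarily what this PR touches): checkpoints written before curriculum learning was enabled simply lack the new entries, so loading should fall back to a default instead of raising KeyError.

```python
# Hedged sketch of the backward-compatibility pattern; "curriculum_seqlen"
# is a hypothetical key name used only for illustration.
def load_curriculum_state(state_dict):
    # Pre-CL checkpoints do not contain the new keys, so fall back to a
    # default instead of failing with KeyError on load.
    return state_dict.get("curriculum_seqlen", None)
```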

@stas00 (Contributor) commented on Oct 18, 2021

OK, so testing this branch - the training hangs on startup:

[before the start of training step] datetime: 2021-10-18 04:28:57
[2021-10-18 04:28:57,799] [INFO] [checkpointing.py:547:forward] Activation Checkpointing Information
[2021-10-18 04:28:57,799] [INFO] [checkpointing.py:548:forward] ----Partition Activations False, CPU CHECKPOINTING False
[2021-10-18 04:28:57,799] [INFO] [checkpointing.py:551:forward] ----contiguous Memory Checkpointing False with 64 total layers
[2021-10-18 04:28:57,799] [INFO] [checkpointing.py:554:forward] ----Synchronization False
[2021-10-18 04:28:57,799] [INFO] [checkpointing.py:555:forward] ----Profiling time in checkpointing False

and py-spy dump gives:

Process 715627: /gpfswork/rech/six/commun/conda/cutting-edge/bin/python -u /gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed-tr8b-104B/pretrain_gpt.py --local_rank=0 --tensor-model-parallel-size 4 --pipeline-model-parallel-size 32 --num-layers 64 --hidden-size 11600 --num-attention-heads 80 --seq-length 2048 --max-position-embeddings 2048 --micro-batch-size 1 --global-batch-size 2048 --train-samples 600_000_000 --train-tokens 300_000_000_000 --vocab-file /gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed-tr8b-104B/data/gpt2-vocab.json --merge-file /gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed-tr8b-104B/data/gpt2-merges.txt --loss-scale 12 --fp16 --checkpoint-activations --no-masked-softmax-fusion --seed 43 --optimizer adam --adam-beta1 0.9 --adam-beta2 0.95 --adam-eps 1e-8 --lr 6e-5 --min-lr 6e-6 --lr-warmup-samples 216_320 --lr-decay-tokens 260000000000 --lr-decay-style cosine --clip-grad 1.0 --weight-decay 1e-1 --exit-duration-in-mins 55 --log-interval 1 --save-interval 300 --eval-interval 1000 --eval-iters 5 --tensorboard-dir /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/tr8b-104B-logs/tensorboard --tensorboard-queue-size 5 --log-timers-to-tensorboard --log-batch-size-to-tensorboard --log-validation-ppl-to-tensorboard --save /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints --load /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints --data-path /gpfswork/rech/six/commun/datasets-custom/oscar-en/meg-gpt2_text_document --data-impl mmap --split 949,50,1 --distributed-backend nccl --deepspeed --deepspeed_config ./ds_config.1587010.json --zero-stage 1 --deepspeed-activation-checkpointing
Python v3.8.11 (/gpfsssd/worksf/projects/rech/six/commun/conda/cutting-edge/bin/python3.8)

Thread 715627 (active): "MainThread"
    from_meta (deepspeed/runtime/utils.py:662)
    _exec_backward_pass (deepspeed/runtime/pipe/engine.py:707)
    _exec_schedule (deepspeed/runtime/pipe/engine.py:1313)
    train_batch (deepspeed/runtime/pipe/engine.py:329)
    train_step (megatron/training.py:405)
    train (megatron/training.py:735)
    pretrain (megatron/training.py:165)
    <module> (pretrain_gpt.py:237)
Thread 716013 (idle): "Thread-1"
    select (selectors.py:415)
    wait (multiprocessing/connection.py:931)
    _poll (multiprocessing/connection.py:424)
    poll (multiprocessing/connection.py:257)
    get (multiprocessing/queues.py:107)
    _pin_memory_loop (torch/utils/data/_utils/pin_memory.py:25)
    run (threading.py:870)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)
Thread 716172 (idle): "QueueFeederThread"
    wait (threading.py:302)
    _feed (multiprocessing/queues.py:227)
    run (threading.py:870)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)
Thread 716173 (idle): "QueueFeederThread"
    wait (threading.py:302)
    _feed (multiprocessing/queues.py:227)
    run (threading.py:870)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)
Thread 716180 (idle): "Thread-2"
    select (selectors.py:415)
    wait (multiprocessing/connection.py:931)
    _poll (multiprocessing/connection.py:424)
    poll (multiprocessing/connection.py:257)
    get (multiprocessing/queues.py:107)
    _pin_memory_loop (torch/utils/data/_utils/pin_memory.py:25)
    run (threading.py:870)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)
Thread 716181 (idle): "QueueFeederThread"
    wait (threading.py:302)
    _feed (multiprocessing/queues.py:227)
    run (threading.py:870)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)
Thread 716182 (idle): "QueueFeederThread"
    wait (threading.py:302)
    _feed (multiprocessing/queues.py:227)
    run (threading.py:870)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)
Thread 716189 (idle): "Thread-3"
    select (selectors.py:415)
    wait (multiprocessing/connection.py:931)
    _poll (multiprocessing/connection.py:424)
    poll (multiprocessing/connection.py:257)
    get (multiprocessing/queues.py:107)
    _pin_memory_loop (torch/utils/data/_utils/pin_memory.py:25)
    run (threading.py:870)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)
Thread 716190 (idle): "QueueFeederThread"
    wait (threading.py:302)
    _feed (multiprocessing/queues.py:227)
    run (threading.py:870)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)
Thread 716191 (idle): "QueueFeederThread"
    wait (threading.py:302)
    _feed (multiprocessing/queues.py:227)
    run (threading.py:870)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)

So it appears there may be a deadlock somewhere.

It hangs here:
https://github.com/microsoft/DeepSpeed/blob/1fc74cb9c81668b5ff0046446f8004d4cf8dc2d5/deepspeed/runtime/utils.py#L662

All GPUs spin at 100%.
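(For reference, a dump like the one above can be captured by attaching py-spy to the hung rank's PID; a minimal helper sketch, assuming py-spy is installed on the node:)

```python
# Hedged helper sketch: capture the Python stack of a hung rank with py-spy,
# equivalent to running `py-spy dump --pid <pid>` on the node.
import subprocess

def dump_rank_stack(pid: int) -> str:
    result = subprocess.run(
        ["py-spy", "dump", "--pid", str(pid)],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# Example: print(dump_rank_stack(715627))
```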

@stas00 (Contributor) commented on Oct 21, 2021

As confirmed in the discussion on Slack, once deepspeedai/DeepSpeed#1473 is merged we can merge this one as well. I was able to use these two PRs to launch a Meg-DS training without problems, and additionally @conglongli ran test_training.py on 8 GPUs!

@stas00 merged commit 8dc8af5 into bigscience-workshop:main on Oct 22, 2021
adammoody pushed a commit to adammoody/Megatron-DeepSpeed that referenced this pull request on Jun 21, 2023