
Conversation

@conglongli (Contributor) commented on Oct 10, 2021

Fix the curriculum learning (CL) + pipeline parallelism (PP) case when pp >= 4.

The error that this PR fixes can be reproduced by changing pp_size to 4 and num-layers to 4 in https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/main/tests/test_training.py.
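For illustration, a minimal sketch of the reproduction setting (not the actual test_training.py diff; the flag names are taken from the launch command quoted further down in this thread): run the GPT pretraining test with a 4-stage pipeline and only 4 transformer layers, i.e. one layer per pipeline stage.

```python
# Hedged sketch of the reproduction setting, not the real test code:
# a 4-stage pipeline with only 4 layers, one transformer layer per stage.
repro_overrides = [
    "--pipeline-model-parallel-size", "4",  # pp_size = 4
    "--num-layers", "4",                    # one layer per pipeline stage
]
```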

@stas00 please test this on your side.

This PR also fixes backward compatibility for the new checkpoint keys introduced by curriculum learning.
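For context, a minimal sketch of the usual backward-compatibility pattern (the key name below is hypothetical and used only for illustration, not necessarily what this PR touches): checkpoints written before curriculum learning was enabled simply lack the new entries, so loading should fall back to a default instead of raising KeyError.

```python
# Hedged sketch of the backward-compatibility pattern; "curriculum_seqlen"
# is a hypothetical key name used only for illustration.
def load_curriculum_state(state_dict):
    # Pre-CL checkpoints do not contain the new keys, so fall back to a
    # default instead of failing with KeyError on load.
    return state_dict.get("curriculum_seqlen", None)
```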

@stas00 (Contributor) commented on Oct 18, 2021

OK, so testing this branch - the training hangs on startup:

[before the start of training step] datetime: 2021-10-18 04:28:57
[2021-10-18 04:28:57,799] [INFO] [checkpointing.py:547:forward] Activation Checkpointing Information
[2021-10-18 04:28:57,799] [INFO] [checkpointing.py:548:forward] ----Partition Activations False, CPU CHECKPOINTING False
[2021-10-18 04:28:57,799] [INFO] [checkpointing.py:551:forward] ----contiguous Memory Checkpointing False with 64 total layers
[2021-10-18 04:28:57,799] [INFO] [checkpointing.py:554:forward] ----Synchronization False
[2021-10-18 04:28:57,799] [INFO] [checkpointing.py:555:forward] ----Profiling time in checkpointing False

and py-spy dump gives:

Process 715627: /gpfswork/rech/six/commun/conda/cutting-edge/bin/python -u /gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed-tr8b-104B/pretrain_gpt.py --local_rank=0 --tensor-model-parallel-size 4 --pipeline-model-parallel-size 32 --num-layers 64 --hidden-size 11600 --num-attention-heads 80 --seq-length 2048 --max-position-embeddings 2048 --micro-batch-size 1 --global-batch-size 2048 --train-samples 600_000_000 --train-tokens 300_000_000_000 --vocab-file /gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed-tr8b-104B/data/gpt2-vocab.json --merge-file /gpfswork/rech/six/commun/code/tr8b-104B/Megatron-DeepSpeed-tr8b-104B/data/gpt2-merges.txt --loss-scale 12 --fp16 --checkpoint-activations --no-masked-softmax-fusion --seed 43 --optimizer adam --adam-beta1 0.9 --adam-beta2 0.95 --adam-eps 1e-8 --lr 6e-5 --min-lr 6e-6 --lr-warmup-samples 216_320 --lr-decay-tokens 260000000000 --lr-decay-style cosine --clip-grad 1.0 --weight-decay 1e-1 --exit-duration-in-mins 55 --log-interval 1 --save-interval 300 --eval-interval 1000 --eval-iters 5 --tensorboard-dir /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/tr8b-104B-logs/tensorboard --tensorboard-queue-size 5 --log-timers-to-tensorboard --log-batch-size-to-tensorboard --log-validation-ppl-to-tensorboard --save /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints --load /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints --data-path /gpfswork/rech/six/commun/datasets-custom/oscar-en/meg-gpt2_text_document --data-impl mmap --split 949,50,1 --distributed-backend nccl --deepspeed --deepspeed_config ./ds_config.1587010.json --zero-stage 1 --deepspeed-activation-checkpointing
Python v3.8.11 (/gpfsssd/worksf/projects/rech/six/commun/conda/cutting-edge/bin/python3.8)

Thread 715627 (active): "MainThread"
    from_meta (deepspeed/runtime/utils.py:662)
    _exec_backward_pass (deepspeed/runtime/pipe/engine.py:707)
    _exec_schedule (deepspeed/runtime/pipe/engine.py:1313)
    train_batch (deepspeed/runtime/pipe/engine.py:329)
    train_step (megatron/training.py:405)
    train (megatron/training.py:735)
    pretrain (megatron/training.py:165)
    <module> (pretrain_gpt.py:237)
Thread 716013 (idle): "Thread-1"
    select (selectors.py:415)
    wait (multiprocessing/connection.py:931)
    _poll (multiprocessing/connection.py:424)
    poll (multiprocessing/connection.py:257)
    get (multiprocessing/queues.py:107)
    _pin_memory_loop (torch/utils/data/_utils/pin_memory.py:25)
    run (threading.py:870)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)
Thread 716172 (idle): "QueueFeederThread"
    wait (threading.py:302)
    _feed (multiprocessing/queues.py:227)
    run (threading.py:870)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)
Thread 716173 (idle): "QueueFeederThread"
    wait (threading.py:302)
    _feed (multiprocessing/queues.py:227)
    run (threading.py:870)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)
Thread 716180 (idle): "Thread-2"
    select (selectors.py:415)
    wait (multiprocessing/connection.py:931)
    _poll (multiprocessing/connection.py:424)
    poll (multiprocessing/connection.py:257)
    get (multiprocessing/queues.py:107)
    _pin_memory_loop (torch/utils/data/_utils/pin_memory.py:25)
    run (threading.py:870)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)
Thread 716181 (idle): "QueueFeederThread"
    wait (threading.py:302)
    _feed (multiprocessing/queues.py:227)
    run (threading.py:870)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)
Thread 716182 (idle): "QueueFeederThread"
    wait (threading.py:302)
    _feed (multiprocessing/queues.py:227)
    run (threading.py:870)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)
Thread 716189 (idle): "Thread-3"
    select (selectors.py:415)
    wait (multiprocessing/connection.py:931)
    _poll (multiprocessing/connection.py:424)
    poll (multiprocessing/connection.py:257)
    get (multiprocessing/queues.py:107)
    _pin_memory_loop (torch/utils/data/_utils/pin_memory.py:25)
    run (threading.py:870)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)
Thread 716190 (idle): "QueueFeederThread"
    wait (threading.py:302)
    _feed (multiprocessing/queues.py:227)
    run (threading.py:870)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)
Thread 716191 (idle): "QueueFeederThread"
    wait (threading.py:302)
    _feed (multiprocessing/queues.py:227)
    run (threading.py:870)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)

So it appears there may be a deadlock somewhere.

It hangs here:
https://github.com/microsoft/DeepSpeed/blob/1fc74cb9c81668b5ff0046446f8004d4cf8dc2d5/deepspeed/runtime/utils.py#L662

All GPUs spin at 100%.
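(For reference, a dump like the one above can be captured by attaching py-spy to the hung rank's PID; a minimal helper sketch, assuming py-spy is installed on the node:)

```python
# Hedged helper sketch: capture the Python stack of a hung rank with py-spy,
# equivalent to running `py-spy dump --pid <pid>` on the node.
import subprocess

def dump_rank_stack(pid: int) -> str:
    result = subprocess.run(
        ["py-spy", "dump", "--pid", str(pid)],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# Example: print(dump_rank_stack(715627))
```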

@stas00 (Contributor) commented on Oct 21, 2021

As confirmed in the discussion on Slack, once deepspeedai/DeepSpeed#1473 is merged we can merge this one as well. I was able to use these two PRs to launch a Meg-DS training without problems, and additionally @conglongli ran test_training.py on 8 GPUs!

@stas00 merged commit 8dc8af5 into bigscience-workshop:main on Oct 22, 2021
adammoody pushed a commit to adammoody/Megatron-DeepSpeed that referenced this pull request on Jun 21, 2023