
Loading HuBERT models with DeepSpeed ZeRO-3 causes program to hang #31797

Open
2 of 4 tasks
anferico opened this issue Jul 4, 2024 · 0 comments

System Info

  • transformers version: 4.42.3
  • Platform: Linux-5.14.0-362.24.1.el9_3.x86_64-x86_64-with-glibc2.34
  • Python version: 3.10.14
  • Huggingface_hub version: 0.23.4
  • Safetensors version: 0.4.3
  • Accelerate version: 0.31.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.3.1+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: yes
  • Using GPU in script?: no
  • GPU type: NVIDIA A100-SXM4-40GB

Who can help?

@sanchit-gandhi @muellerzr

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

hubert_mre.py:

from transformers import AutoConfig, HubertModel, TrainingArguments, HfArgumentParser

def main():
    # Parsing TrainingArguments with --deepspeed zero3.json activates the
    # DeepSpeed integration, so from_pretrained() below runs under the
    # ZeRO-3 parameter-partitioning init context.
    parser = HfArgumentParser(TrainingArguments)
    training_args = parser.parse_args_into_dataclasses()[0]

    config = AutoConfig.from_pretrained("facebook/hubert-large-ls960-ft")

    # The program hangs here when ZeRO-3 is enabled with more than one GPU.
    model = HubertModel.from_pretrained(
        "facebook/hubert-large-ls960-ft", config=config
    )

if __name__ == "__main__":
    main()

hubert_mre.sh:

export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL
export CUDA_LAUNCH_BLOCKING=1

OUTPUT_DIR=$HOME/hubert_mre

deepspeed \
    --num_gpus 2 \
    --master_port 60000 \
    ./hubert_mre.py \
    --output_dir $OUTPUT_DIR \
    --deepspeed zero3.json

zero3.json:

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "bf16": {
        "enabled": "auto"
    },
    "train_micro_batch_size_per_gpu": "auto",
    "train_batch_size": "auto",
    "gradient_accumulation_steps": "auto",
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    }
}

Run hubert_mre.sh; the script hangs indefinitely instead of completing.

Curiously, this seems to happen only with HuBERT models: if you replace HubertModel.from_pretrained("facebook/hubert-large-ls960-ft") with, for example, Wav2Vec2BertModel.from_pretrained("facebook/w2v-bert-2.0"), the script runs just fine.

The script also completes normally if you pass --num_gpus 1.
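As a diagnostic (not part of the repro itself), each rank can be made to dump its Python stack traces while hung, which shows where the workers are blocked; adding this at the top of hubert_mre.py works, and the 60-second interval is an arbitrary choice:

```python
# Diagnostic sketch: periodically dump all thread stacks to stderr so a
# hung rank shows where it is blocked (e.g. inside from_pretrained).
import faulthandler

# Dump every 60 s for as long as the hang persists.
faulthandler.dump_traceback_later(60, repeat=True)
```

With this in place, the stderr of each rank should reveal the frame in which the collective operation is stuck.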

Expected behavior

The script runs to completion without hanging indefinitely.
