
Loading HuBERT models with DeepSpeed ZeRO-3 causes program to hang #31797

Open
2 of 4 tasks
anferico opened this issue Jul 4, 2024 · 0 comments

System Info

  • transformers version: 4.42.3
  • Platform: Linux-5.14.0-362.24.1.el9_3.x86_64-x86_64-with-glibc2.34
  • Python version: 3.10.14
  • Huggingface_hub version: 0.23.4
  • Safetensors version: 0.4.3
  • Accelerate version: 0.31.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.3.1+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: yes
  • Using GPU in script?: no
  • GPU type: NVIDIA A100-SXM4-40GB

Who can help?

@sanchit-gandhi @muellerzr

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

hubert_mre.py:

from transformers import AutoConfig, HubertModel, TrainingArguments, HfArgumentParser

def main():
    # Parsing TrainingArguments with --deepspeed zero3.json activates the
    # DeepSpeed integration, so from_pretrained() below runs under the
    # ZeRO-3 parameter-partitioning init context.
    parser = HfArgumentParser(TrainingArguments)
    training_args = parser.parse_args_into_dataclasses()[0]

    config = AutoConfig.from_pretrained("facebook/hubert-large-ls960-ft")

    # The program hangs here when ZeRO-3 is enabled with more than one GPU.
    model = HubertModel.from_pretrained(
        "facebook/hubert-large-ls960-ft", config=config
    )

if __name__ == "__main__":
    main()

hubert_mre.sh:

export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL
export CUDA_LAUNCH_BLOCKING=1

OUTPUT_DIR=$HOME/hubert_mre

deepspeed \
    --num_gpus 2 \
    --master_port 60000 \
    ./hubert_mre.py \
    --output_dir $OUTPUT_DIR \
    --deepspeed zero3.json

zero3.json:

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "bf16": {
        "enabled": "auto"
    },
    "train_micro_batch_size_per_gpu": "auto",
    "train_batch_size": "auto",
    "gradient_accumulation_steps": "auto",
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    }
}

Run hubert_mre.sh; the script hangs indefinitely instead of completing.

Curiously, this seems to happen only with HuBERT models: if you replace HubertModel.from_pretrained("facebook/hubert-large-ls960-ft") with, for example, Wav2Vec2BertModel.from_pretrained("facebook/w2v-bert-2.0"), the script runs just fine.

The script also completes normally if you pass --num_gpus 1.
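As a diagnostic (not part of the repro itself), each rank can be made to dump its Python stack traces while hung, which shows where the workers are blocked; adding this at the top of hubert_mre.py works, and the 60-second interval is an arbitrary choice:

```python
# Diagnostic sketch: periodically dump all thread stacks to stderr so a
# hung rank shows where it is blocked (e.g. inside from_pretrained).
import faulthandler

# Dump every 60 s for as long as the hang persists.
faulthandler.dump_traceback_later(60, repeat=True)
```

With this in place, the stderr of each rank should reveal the frame in which the collective operation is stuck.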

Expected behavior

The script runs to completion without hanging indefinitely.
