Supervised Fine-Tuning of Llama 3 8B on one AWS Trainium instance throws error #761

Open · 1 of 4 tasks
ajayvohra2005 opened this issue Jan 11, 2025 · 0 comments
Labels: bug (Something isn't working)
ajayvohra2005 commented Jan 11, 2025

System Info

AMI Name: huggingface-neuron-2024-12-13T12-47-53Z-692efe1a-8d5c-4033-bcbc-5d99f2d4ae6a
AMI-ID: ami-0bede50341b2516c4

optimum-cli env

Copy-and-paste the text below in your GitHub issue:

Platform:

  • Platform: Linux-5.15.0-1031-aws-x86_64-with-glibc2.35
  • Python version: 3.10.12

Python packages:

  • optimum-neuron version: 0.0.27
  • neuron-sdk version: 2.20.2
  • optimum version: 1.22.0
  • transformers version: 4.43.2
  • huggingface_hub version: 0.26.5
  • torch version: 2.1.2+cu121
  • aws-neuronx-runtime-discovery version: 2.9
  • libneuronxla version: 2.0.5347.0
  • neuronx-cc version: 2.15.143.0+e39249ad
  • neuronx-distributed version: 0.9.0
  • neuronx-hwm version: NA
  • torch-neuronx version: 2.1.2.2.3.2
  • torch-xla version: 2.1.5
  • transformers-neuronx version: 0.12.313

Neuron Driver:

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

aws-neuronx-collectives/unknown,now 2.22.26.0-17a033bc8 amd64 [installed,upgradable to: 2.23.133.0-3e70920f2]
aws-neuronx-dkms/unknown,now 2.18.12.0 amd64 [installed,upgradable to: 2.19.64.0]
aws-neuronx-oci-hook/unknown,now 2.5.3.0 amd64 [installed,upgradable to: 2.6.36.0]
aws-neuronx-runtime-lib/unknown,now 2.22.14.0-6e27b8d5b amd64 [installed,upgradable to: 2.23.110.0-9b5179492]
aws-neuronx-tools/unknown,now 2.19.0.0 amd64 [installed,upgradable to: 2.20.204.0]

Who can help?

@michaelbenayoun

Running the tutorial Supervised Fine-Tuning of Llama 3 8B on one AWS Trainium instance gives the following error at the compile step (full log below; a minimal sketch of the failing call follows the log):

++ export NEURON_FUSE_SOFTMAX=1
++ NEURON_FUSE_SOFTMAX=1
++ export NEURON_RT_ASYNC_EXEC_MAX_INFLIGHT_REQUESTS=3
++ NEURON_RT_ASYNC_EXEC_MAX_INFLIGHT_REQUESTS=3
++ export MALLOC_ARENA_MAX=64
++ MALLOC_ARENA_MAX=64
++ export 'NEURON_CC_FLAGS=--model-type=transformer --distribution-strategy=llm-training --enable-saturate-infinity --cache_dir=/home/ubuntu/cache_dir_neuron/'
++ NEURON_CC_FLAGS='--model-type=transformer --distribution-strategy=llm-training --enable-saturate-infinity --cache_dir=/home/ubuntu/cache_dir_neuron/'
++ PROCESSES_PER_NODE=8
++ NUM_EPOCHS=1
++ TP_DEGREE=2
++ PP_DEGREE=1
++ BS=1
++ GRADIENT_ACCUMULATION_STEPS=8
++ LOGGING_STEPS=1
++ MODEL_NAME=meta-llama/Meta-Llama-3-8B
++ OUTPUT_DIR=output-
++ '[' '' = 1 ']'
++ MAX_STEPS=-1
++ XLA_USE_BF16=1
++ neuron_parallel_compile torchrun --nproc_per_node 8 docs/source/training_tutorials/sft_lora_finetune_llm.py --model_id meta-llama/Meta-Llama-3-8B --num_train_epochs 1 --do_train --learning_rate 5e-5 --warmup_ratio 0.03 --max_steps -1 --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --gradient_accumulation_steps 8 --gradient_checkpointing true --bf16 --zero_1 false --tensor_parallel_size 2 --pipeline_parallel_size 1 --logging_steps 1 --save_total_limit 1 --output_dir output- --lr_scheduler_type constant --overwrite_output_dir
2025-01-11 21:51:35.000701:  32302  INFO ||NEURON_PARALLEL_COMPILE||: Running trial run (add option to terminate trial run early; also ignore trial run's generated outputs, i.e. loss, checkpoints)
2025-01-11 21:51:35.000701:  32302  INFO ||NEURON_PARALLEL_COMPILE||: Running cmd: ['torchrun', '--nproc_per_node', '8', 'docs/source/training_tutorials/sft_lora_finetune_llm.py', '--model_id', 'meta-llama/Meta-Llama-3-8B', '--num_train_epochs', '1', '--do_train', '--learning_rate', '5e-5', '--warmup_ratio', '0.03', '--max_steps', '-1', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '1', '--gradient_accumulation_steps', '8', '--gradient_checkpointing', 'true', '--bf16', '--zero_1', 'false', '--tensor_parallel_size', '2', '--pipeline_parallel_size', '1', '--logging_steps', '1', '--save_total_limit', '1', '--output_dir', 'output-', '--lr_scheduler_type', 'constant', '--overwrite_output_dir']
[2025-01-11 21:51:36,842] torch.distributed.run: [WARNING] 
[2025-01-11 21:51:36,842] torch.distributed.run: [WARNING] *****************************************
[2025-01-11 21:51:36,842] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
[2025-01-11 21:51:36,842] torch.distributed.run: [WARNING] *****************************************
torch.distributed process group is initialized, but parallel_mode != ParallelMode.DISTRIBUTED. In order to use Torch DDP, launch your script with `python -m torch.distributed.launch
torch.distributed process group is initialized, but parallel_mode != ParallelMode.DISTRIBUTED. In order to use Torch DDP, launch your script with `python -m torch.distributed.launch
torch.distributed process group is initialized, but parallel_mode != ParallelMode.DISTRIBUTED. In order to use Torch DDP, launch your script with `python -m torch.distributed.launch
Downloading readme: 100%|██████████| 8.20k/8.20k [00:00<00:00, 48.2MB/s]
Downloading data: 100%|██████████| 13.1M/13.1M [00:00<00:00, 30.4MB/s]
Generating train split: 100%|██████████| 15011/15011 [00:00<00:00, 161040.86 examples/s]
2025-Jan-11 21:51:55.0550 32498:33576 [3] nccl_net_ofi_init:1415 CCOM WARN NET/OFI aws-ofi-nccl initialization failed
2025-Jan-11 21:51:55.0551 32502:33575 [7] nccl_net_ofi_init:1415 CCOM WARN NET/OFI aws-ofi-nccl initialization failed
2025-Jan-11 21:51:55.0552 32498:33576 [3] init.cc:149 CCOM WARN OFI plugin initNet() failed is EFA enabled?
2025-Jan-11 21:51:55.0554 32502:33575 [7] init.cc:149 CCOM WARN OFI plugin initNet() failed is EFA enabled?
2025-Jan-11 21:51:55.0559 32496:33577 [1] nccl_net_ofi_init:1415 CCOM WARN NET/OFI aws-ofi-nccl initialization failed
2025-Jan-11 21:51:55.0561 32496:33577 [1] init.cc:149 CCOM WARN OFI plugin initNet() failed is EFA enabled?
torch.distributed process group is initialized, but parallel_mode != ParallelMode.DISTRIBUTED. In order to use Torch DDP, launch your script with `python -m torch.distributed.launch
torch.distributed process group is initialized, but parallel_mode != ParallelMode.DISTRIBUTED. In order to use Torch DDP, launch your script with `python -m torch.distributed.launch
torch.distributed process group is initialized, but parallel_mode != ParallelMode.DISTRIBUTED. In order to use Torch DDP, launch your script with `python -m torch.distributed.launch
torch.distributed process group is initialized, but parallel_mode != ParallelMode.DISTRIBUTED. In order to use Torch DDP, launch your script with `python -m torch.distributed.launch
2025-Jan-11 21:51:59.0995 32499:33591 [4] nccl_net_ofi_init:1415 CCOM WARN NET/OFI aws-ofi-nccl initialization failed
2025-Jan-11 21:51:59.0997 32499:33591 [4] init.cc:149 CCOM WARN OFI plugin initNet() failed is EFA enabled?
2025-Jan-11 21:52:00.0910 32497:33594 [2] nccl_net_ofi_init:1415 CCOM WARN NET/OFI aws-ofi-nccl initialization failed
2025-Jan-11 21:52:00.0911 32497:33594 [2] init.cc:149 CCOM WARN OFI plugin initNet() failed is EFA enabled?
2025-Jan-11 21:52:01.0051 32501:33592 [6] nccl_net_ofi_init:1415 CCOM WARN NET/OFI aws-ofi-nccl initialization failed
2025-Jan-11 21:52:01.0053 32501:33592 [6] init.cc:149 CCOM WARN OFI plugin initNet() failed is EFA enabled?
2025-Jan-11 21:52:01.0191 32495:33593 [0] nccl_net_ofi_init:1415 CCOM WARN NET/OFI aws-ofi-nccl initialization failed
2025-Jan-11 21:52:01.0193 32495:33593 [0] init.cc:149 CCOM WARN OFI plugin initNet() failed is EFA enabled?
torch.distributed process group is initialized, but parallel_mode != ParallelMode.DISTRIBUTED. In order to use Torch DDP, launch your script with `python -m torch.distributed.launch
2025-Jan-11 21:52:08.0236 32500:33608 [5] nccl_net_ofi_init:1415 CCOM WARN NET/OFI aws-ofi-nccl initialization failed
2025-Jan-11 21:52:08.0238 32500:33608 [5] init.cc:149 CCOM WARN OFI plugin initNet() failed is EFA enabled?
Downloading shards: 100%|██████████| 4/4 [06:22<00:00, 95.65s/it] 
Downloading shards: 100%|██████████| 4/4 [06:22<00:00, 95.65s/it] 
Downloading shards: 100%|██████████| 4/4 [06:22<00:00, 95.65s/it] 
Downloading shards: 100%|██████████| 4/4 [06:22<00:00, 95.63s/it] 
Downloading shards: 100%|██████████| 4/4 [06:22<00:00, 95.65s/it]
Downloading shards: 100%|██████████| 4/4 [06:22<00:00, 95.68s/it] 
Downloading shards: 100%|██████████| 4/4 [06:22<00:00, 95.71s/it] 
Downloading shards: 100%|██████████| 4/4 [06:22<00:00, 95.69s/it] 
Traceback (most recent call last):
  File "/home/ubuntu/optimum-neuron/docs/source/training_tutorials/sft_lora_finetune_llm.py", line 87, in <module>
    main()
  File "/home/ubuntu/optimum-neuron/docs/source/training_tutorials/sft_lora_finetune_llm.py", line 83, in main
    training_function(script_args, training_args)
  File "/home/ubuntu/optimum-neuron/docs/source/training_tutorials/sft_lora_finetune_llm.py", line 46, in training_function
    sft_config = NeuronSFTConfig(
TypeError: NeuronSFTConfig.__init__() got an unexpected keyword argument 'max_seq_length'
Traceback (most recent call last):
  File "/home/ubuntu/optimum-neuron/docs/source/training_tutorials/sft_lora_finetune_llm.py", line 87, in <module>
    main()
  File "/home/ubuntu/optimum-neuron/docs/source/training_tutorials/sft_lora_finetune_llm.py", line 83, in main
    training_function(script_args, training_args)
  File "/home/ubuntu/optimum-neuron/docs/source/training_tutorials/sft_lora_finetune_llm.py", line 46, in training_function
    sft_config = NeuronSFTConfig(
TypeError: NeuronSFTConfig.__init__() got an unexpected keyword argument 'max_seq_length'
Traceback (most recent call last):
  File "/home/ubuntu/optimum-neuron/docs/source/training_tutorials/sft_lora_finetune_llm.py", line 87, in <module>
    main()
  File "/home/ubuntu/optimum-neuron/docs/source/training_tutorials/sft_lora_finetune_llm.py", line 83, in main
    training_function(script_args, training_args)
  File "/home/ubuntu/optimum-neuron/docs/source/training_tutorials/sft_lora_finetune_llm.py", line 46, in training_function
    sft_config = NeuronSFTConfig(
TypeError: NeuronSFTConfig.__init__() got an unexpected keyword argument 'max_seq_length'
Traceback (most recent call last):
  File "/home/ubuntu/optimum-neuron/docs/source/training_tutorials/sft_lora_finetune_llm.py", line 87, in <module>
    main()
  File "/home/ubuntu/optimum-neuron/docs/source/training_tutorials/sft_lora_finetune_llm.py", line 83, in main
    training_function(script_args, training_args)
  File "/home/ubuntu/optimum-neuron/docs/source/training_tutorials/sft_lora_finetune_llm.py", line 46, in training_function
    sft_config = NeuronSFTConfig(
TypeError: NeuronSFTConfig.__init__() got an unexpected keyword argument 'max_seq_length'
Traceback (most recent call last):
  File "/home/ubuntu/optimum-neuron/docs/source/training_tutorials/sft_lora_finetune_llm.py", line 87, in <module>
    main()
  File "/home/ubuntu/optimum-neuron/docs/source/training_tutorials/sft_lora_finetune_llm.py", line 83, in main
    training_function(script_args, training_args)
  File "/home/ubuntu/optimum-neuron/docs/source/training_tutorials/sft_lora_finetune_llm.py", line 46, in training_function
    sft_config = NeuronSFTConfig(
TypeError: NeuronSFTConfig.__init__() got an unexpected keyword argument 'max_seq_length'
Traceback (most recent call last):
  File "/home/ubuntu/optimum-neuron/docs/source/training_tutorials/sft_lora_finetune_llm.py", line 87, in <module>
    main()
  File "/home/ubuntu/optimum-neuron/docs/source/training_tutorials/sft_lora_finetune_llm.py", line 83, in main
    training_function(script_args, training_args)
  File "/home/ubuntu/optimum-neuron/docs/source/training_tutorials/sft_lora_finetune_llm.py", line 46, in training_function
    sft_config = NeuronSFTConfig(
TypeError: NeuronSFTConfig.__init__() got an unexpected keyword argument 'max_seq_length'
Traceback (most recent call last):
  File "/home/ubuntu/optimum-neuron/docs/source/training_tutorials/sft_lora_finetune_llm.py", line 87, in <module>
    main()
  File "/home/ubuntu/optimum-neuron/docs/source/training_tutorials/sft_lora_finetune_llm.py", line 83, in main
    training_function(script_args, training_args)
  File "/home/ubuntu/optimum-neuron/docs/source/training_tutorials/sft_lora_finetune_llm.py", line 46, in training_function
    sft_config = NeuronSFTConfig(
TypeError: NeuronSFTConfig.__init__() got an unexpected keyword argument 'max_seq_length'
Traceback (most recent call last):
  File "/home/ubuntu/optimum-neuron/docs/source/training_tutorials/sft_lora_finetune_llm.py", line 87, in <module>
    main()
  File "/home/ubuntu/optimum-neuron/docs/source/training_tutorials/sft_lora_finetune_llm.py", line 83, in main
    training_function(script_args, training_args)
  File "/home/ubuntu/optimum-neuron/docs/source/training_tutorials/sft_lora_finetune_llm.py", line 46, in training_function
    sft_config = NeuronSFTConfig(
TypeError: NeuronSFTConfig.__init__() got an unexpected keyword argument 'max_seq_length'
[2025-01-11 21:58:37,190] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 32495) of binary: /opt/aws_neuronx_venv_pytorch_2_1/bin/python3
Traceback (most recent call last):
  File "/opt/aws_neuronx_venv_pytorch_2_1/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/opt/aws_neuronx_venv_pytorch_2_1/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/aws_neuronx_venv_pytorch_2_1/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/opt/aws_neuronx_venv_pytorch_2_1/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/opt/aws_neuronx_venv_pytorch_2_1/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/aws_neuronx_venv_pytorch_2_1/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
docs/source/training_tutorials/sft_lora_finetune_llm.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2025-01-11_21:58:37
  host      : ip-172-31-113-192.us-west-2.compute.internal
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 32496)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2025-01-11_21:58:37
  host      : ip-172-31-113-192.us-west-2.compute.internal
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 32497)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2025-01-11_21:58:37
  host      : ip-172-31-113-192.us-west-2.compute.internal
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 32498)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
  time      : 2025-01-11_21:58:37
  host      : ip-172-31-113-192.us-west-2.compute.internal
  rank      : 4 (local_rank: 4)
  exitcode  : 1 (pid: 32499)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
  time      : 2025-01-11_21:58:37
  host      : ip-172-31-113-192.us-west-2.compute.internal
  rank      : 5 (local_rank: 5)
  exitcode  : 1 (pid: 32500)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]:
  time      : 2025-01-11_21:58:37
  host      : ip-172-31-113-192.us-west-2.compute.internal
  rank      : 6 (local_rank: 6)
  exitcode  : 1 (pid: 32501)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[7]:
  time      : 2025-01-11_21:58:37
  host      : ip-172-31-113-192.us-west-2.compute.internal
  rank      : 7 (local_rank: 7)
  exitcode  : 1 (pid: 32502)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-01-11_21:58:37
  host      : ip-172-31-113-192.us-west-2.compute.internal
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 32495)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
2025-01-11 21:58:37.000640:  32302  ERROR ||NEURON_PARALLEL_COMPILE||: There was an error in the training script.
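
For reference, the failure does not seem to require the full torchrun launch. Below is a minimal sketch of the failing pattern (keyword arguments other than max_seq_length are assumed for illustration; the traceback points at line 46 of sft_lora_finetune_llm.py), which with optimum-neuron 0.0.27 raises the same TypeError:

# Minimal sketch of the failing call; values are illustrative, not the tutorial's exact arguments.
from optimum.neuron import NeuronSFTConfig

sft_config = NeuronSFTConfig(
    output_dir="output-",
    max_seq_length=1024,  # rejected: TypeError ... unexpected keyword argument 'max_seq_length'
)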

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction (minimal, reproducible, runnable)

Execute this script from the tutorial:

#!/bin/bash

set -ex

export NEURON_FUSE_SOFTMAX=1
export NEURON_RT_ASYNC_EXEC_MAX_INFLIGHT_REQUESTS=3
export MALLOC_ARENA_MAX=64
export NEURON_CC_FLAGS="--model-type=transformer --distribution-strategy=llm-training --enable-saturate-infinity --cache_dir=/home/ubuntu/cache_dir_neuron/"

PROCESSES_PER_NODE=8

NUM_EPOCHS=1
TP_DEGREE=2
PP_DEGREE=1
BS=1
GRADIENT_ACCUMULATION_STEPS=8
LOGGING_STEPS=1
MODEL_NAME="meta-llama/Meta-Llama-3-8B"
OUTPUT_DIR=output-$SLURM_JOB_ID

if [ "$NEURON_EXTRACT_GRAPHS_ONLY" = "1" ]; then
    MAX_STEPS=$((LOGGING_STEPS + 5))
else
    MAX_STEPS=-1
fi


XLA_USE_BF16=1 neuron_parallel_compile torchrun --nproc_per_node $PROCESSES_PER_NODE docs/source/training_tutorials/sft_lora_finetune_llm.py \
  --model_id $MODEL_NAME \
  --num_train_epochs $NUM_EPOCHS \
  --do_train \
  --learning_rate 5e-5 \
  --warmup_ratio 0.03 \
  --max_steps $MAX_STEPS \
  --per_device_train_batch_size $BS \
  --per_device_eval_batch_size $BS \
  --gradient_accumulation_steps $GRADIENT_ACCUMULATION_STEPS \
  --gradient_checkpointing true \
  --bf16 \
  --zero_1 false \
  --tensor_parallel_size $TP_DEGREE \
  --pipeline_parallel_size $PP_DEGREE \
  --logging_steps $LOGGING_STEPS \
  --save_total_limit 1 \
  --output_dir $OUTPUT_DIR \
  --lr_scheduler_type "constant" \
  --overwrite_output_dir

Expected behavior

It should compile without error.
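
As a quick diagnostic (a sketch, not part of the tutorial), the keyword arguments accepted by the installed NeuronSFTConfig can be listed directly to confirm whether max_seq_length is exposed in this optimum-neuron version:

# Diagnostic sketch: inspect the parameters accepted by NeuronSFTConfig.__init__
# in the installed optimum-neuron (0.0.27 here).
import inspect

from optimum.neuron import NeuronSFTConfig

params = inspect.signature(NeuronSFTConfig.__init__).parameters
print("max_seq_length" in params)  # expected to print False, matching the TypeError in the log above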

ajayvohra2005 added the bug (Something isn't working) label on Jan 11, 2025