Supervised Fine-Tuning of Llama 3 8B on one AWS Trainium instance throws error #761

Open · 1 of 4 tasks
ajayvohra2005 opened this issue Jan 11, 2025 · 0 comments
Labels: bug (Something isn't working)
ajayvohra2005 commented Jan 11, 2025

System Info

AMI Name: huggingface-neuron-2024-12-13T12-47-53Z-692efe1a-8d5c-4033-bcbc-5d99f2d4ae6a
AMI-ID: ami-0bede50341b2516c4

optimum-cli env

Copy-and-paste the text below in your GitHub issue:

Platform:

  • Platform: Linux-5.15.0-1031-aws-x86_64-with-glibc2.35
  • Python version: 3.10.12

Python packages:

  • optimum-neuron version: 0.0.27
  • neuron-sdk version: 2.20.2
  • optimum version: 1.22.0
  • transformers version: 4.43.2
  • huggingface_hub version: 0.26.5
  • torch version: 2.1.2+cu121
  • aws-neuronx-runtime-discovery version: 2.9
  • libneuronxla version: 2.0.5347.0
  • neuronx-cc version: 2.15.143.0+e39249ad
  • neuronx-distributed version: 0.9.0
  • neuronx-hwm version: NA
  • torch-neuronx version: 2.1.2.2.3.2
  • torch-xla version: 2.1.5
  • transformers-neuronx version: 0.12.313

Neuron Driver:

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

aws-neuronx-collectives/unknown,now 2.22.26.0-17a033bc8 amd64 [installed,upgradable to: 2.23.133.0-3e70920f2]
aws-neuronx-dkms/unknown,now 2.18.12.0 amd64 [installed,upgradable to: 2.19.64.0]
aws-neuronx-oci-hook/unknown,now 2.5.3.0 amd64 [installed,upgradable to: 2.6.36.0]
aws-neuronx-runtime-lib/unknown,now 2.22.14.0-6e27b8d5b amd64 [installed,upgradable to: 2.23.110.0-9b5179492]
aws-neuronx-tools/unknown,now 2.19.0.0 amd64 [installed,upgradable to: 2.20.204.0]

Who can help?

@michaelbenayoun

Running the tutorial Supervised Fine-Tuning of Llama 3 8B on one AWS Trainium instance gives the following error at the compile step (full log below; a minimal sketch of the failing call follows the log):

++ export NEURON_FUSE_SOFTMAX=1
++ NEURON_FUSE_SOFTMAX=1
++ export NEURON_RT_ASYNC_EXEC_MAX_INFLIGHT_REQUESTS=3
++ NEURON_RT_ASYNC_EXEC_MAX_INFLIGHT_REQUESTS=3
++ export MALLOC_ARENA_MAX=64
++ MALLOC_ARENA_MAX=64
++ export 'NEURON_CC_FLAGS=--model-type=transformer --distribution-strategy=llm-training --enable-saturate-infinity --cache_dir=/home/ubuntu/cache_dir_neuron/'
++ NEURON_CC_FLAGS='--model-type=transformer --distribution-strategy=llm-training --enable-saturate-infinity --cache_dir=/home/ubuntu/cache_dir_neuron/'
++ PROCESSES_PER_NODE=8
++ NUM_EPOCHS=1
++ TP_DEGREE=2
++ PP_DEGREE=1
++ BS=1
++ GRADIENT_ACCUMULATION_STEPS=8
++ LOGGING_STEPS=1
++ MODEL_NAME=meta-llama/Meta-Llama-3-8B
++ OUTPUT_DIR=output-
++ '[' '' = 1 ']'
++ MAX_STEPS=-1
++ XLA_USE_BF16=1
++ neuron_parallel_compile torchrun --nproc_per_node 8 docs/source/training_tutorials/sft_lora_finetune_llm.py --model_id meta-llama/Meta-Llama-3-8B --num_train_epochs 1 --do_train --learning_rate 5e-5 --warmup_ratio 0.03 --max_steps -1 --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --gradient_accumulation_steps 8 --gradient_checkpointing true --bf16 --zero_1 false --tensor_parallel_size 2 --pipeline_parallel_size 1 --logging_steps 1 --save_total_limit 1 --output_dir output- --lr_scheduler_type constant --overwrite_output_dir
2025-01-11 21:51:35.000701:  32302  INFO ||NEURON_PARALLEL_COMPILE||: Running trial run (add option to terminate trial run early; also ignore trial run's generated outputs, i.e. loss, checkpoints)
2025-01-11 21:51:35.000701:  32302  INFO ||NEURON_PARALLEL_COMPILE||: Running cmd: ['torchrun', '--nproc_per_node', '8', 'docs/source/training_tutorials/sft_lora_finetune_llm.py', '--model_id', 'meta-llama/Meta-Llama-3-8B', '--num_train_epochs', '1', '--do_train', '--learning_rate', '5e-5', '--warmup_ratio', '0.03', '--max_steps', '-1', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '1', '--gradient_accumulation_steps', '8', '--gradient_checkpointing', 'true', '--bf16', '--zero_1', 'false', '--tensor_parallel_size', '2', '--pipeline_parallel_size', '1', '--logging_steps', '1', '--save_total_limit', '1', '--output_dir', 'output-', '--lr_scheduler_type', 'constant', '--overwrite_output_dir']
[2025-01-11 21:51:36,842] torch.distributed.run: [WARNING] 
[2025-01-11 21:51:36,842] torch.distributed.run: [WARNING] *****************************************
[2025-01-11 21:51:36,842] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
[2025-01-11 21:51:36,842] torch.distributed.run: [WARNING] *****************************************
torch.distributed process group is initialized, but parallel_mode != ParallelMode.DISTRIBUTED. In order to use Torch DDP, launch your script with `python -m torch.distributed.launch
torch.distributed process group is initialized, but parallel_mode != ParallelMode.DISTRIBUTED. In order to use Torch DDP, launch your script with `python -m torch.distributed.launch
torch.distributed process group is initialized, but parallel_mode != ParallelMode.DISTRIBUTED. In order to use Torch DDP, launch your script with `python -m torch.distributed.launch
Downloading readme: 100%|██████████| 8.20k/8.20k [00:00<00:00, 48.2MB/s]
Downloading data: 100%|██████████| 13.1M/13.1M [00:00<00:00, 30.4MB/s]
Generating train split: 100%|██████████| 15011/15011 [00:00<00:00, 161040.86 examples/s]
2025-Jan-11 21:51:55.0550 32498:33576 [3] nccl_net_ofi_init:1415 CCOM WARN NET/OFI aws-ofi-nccl initialization failed
2025-Jan-11 21:51:55.0551 32502:33575 [7] nccl_net_ofi_init:1415 CCOM WARN NET/OFI aws-ofi-nccl initialization failed
2025-Jan-11 21:51:55.0552 32498:33576 [3] init.cc:149 CCOM WARN OFI plugin initNet() failed is EFA enabled?
2025-Jan-11 21:51:55.0554 32502:33575 [7] init.cc:149 CCOM WARN OFI plugin initNet() failed is EFA enabled?
2025-Jan-11 21:51:55.0559 32496:33577 [1] nccl_net_ofi_init:1415 CCOM WARN NET/OFI aws-ofi-nccl initialization failed
2025-Jan-11 21:51:55.0561 32496:33577 [1] init.cc:149 CCOM WARN OFI plugin initNet() failed is EFA enabled?
torch.distributed process group is initialized, but parallel_mode != ParallelMode.DISTRIBUTED. In order to use Torch DDP, launch your script with `python -m torch.distributed.launch
torch.distributed process group is initialized, but parallel_mode != ParallelMode.DISTRIBUTED. In order to use Torch DDP, launch your script with `python -m torch.distributed.launch
torch.distributed process group is initialized, but parallel_mode != ParallelMode.DISTRIBUTED. In order to use Torch DDP, launch your script with `python -m torch.distributed.launch
torch.distributed process group is initialized, but parallel_mode != ParallelMode.DISTRIBUTED. In order to use Torch DDP, launch your script with `python -m torch.distributed.launch
2025-Jan-11 21:51:59.0995 32499:33591 [4] nccl_net_ofi_init:1415 CCOM WARN NET/OFI aws-ofi-nccl initialization failed
2025-Jan-11 21:51:59.0997 32499:33591 [4] init.cc:149 CCOM WARN OFI plugin initNet() failed is EFA enabled?
2025-Jan-11 21:52:00.0910 32497:33594 [2] nccl_net_ofi_init:1415 CCOM WARN NET/OFI aws-ofi-nccl initialization failed
2025-Jan-11 21:52:00.0911 32497:33594 [2] init.cc:149 CCOM WARN OFI plugin initNet() failed is EFA enabled?
2025-Jan-11 21:52:01.0051 32501:33592 [6] nccl_net_ofi_init:1415 CCOM WARN NET/OFI aws-ofi-nccl initialization failed
2025-Jan-11 21:52:01.0053 32501:33592 [6] init.cc:149 CCOM WARN OFI plugin initNet() failed is EFA enabled?
2025-Jan-11 21:52:01.0191 32495:33593 [0] nccl_net_ofi_init:1415 CCOM WARN NET/OFI aws-ofi-nccl initialization failed
2025-Jan-11 21:52:01.0193 32495:33593 [0] init.cc:149 CCOM WARN OFI plugin initNet() failed is EFA enabled?
torch.distributed process group is initialized, but parallel_mode != ParallelMode.DISTRIBUTED. In order to use Torch DDP, launch your script with `python -m torch.distributed.launch
2025-Jan-11 21:52:08.0236 32500:33608 [5] nccl_net_ofi_init:1415 CCOM WARN NET/OFI aws-ofi-nccl initialization failed
2025-Jan-11 21:52:08.0238 32500:33608 [5] init.cc:149 CCOM WARN OFI plugin initNet() failed is EFA enabled?
Downloading shards: 100%|██████████| 4/4 [06:22<00:00, 95.65s/it] 
Downloading shards: 100%|██████████| 4/4 [06:22<00:00, 95.65s/it] 
Downloading shards: 100%|██████████| 4/4 [06:22<00:00, 95.65s/it] 
Downloading shards: 100%|██████████| 4/4 [06:22<00:00, 95.63s/it] 
Downloading shards: 100%|██████████| 4/4 [06:22<00:00, 95.65s/it]
Downloading shards: 100%|██████████| 4/4 [06:22<00:00, 95.68s/it] 
Downloading shards: 100%|██████████| 4/4 [06:22<00:00, 95.71s/it] 
Downloading shards: 100%|██████████| 4/4 [06:22<00:00, 95.69s/it] 
Traceback (most recent call last):
  File "/home/ubuntu/optimum-neuron/docs/source/training_tutorials/sft_lora_finetune_llm.py", line 87, in <module>
    main()
  File "/home/ubuntu/optimum-neuron/docs/source/training_tutorials/sft_lora_finetune_llm.py", line 83, in main
    training_function(script_args, training_args)
  File "/home/ubuntu/optimum-neuron/docs/source/training_tutorials/sft_lora_finetune_llm.py", line 46, in training_function
    sft_config = NeuronSFTConfig(
TypeError: NeuronSFTConfig.__init__() got an unexpected keyword argument 'max_seq_length'
Traceback (most recent call last):
  File "/home/ubuntu/optimum-neuron/docs/source/training_tutorials/sft_lora_finetune_llm.py", line 87, in <module>
    main()
  File "/home/ubuntu/optimum-neuron/docs/source/training_tutorials/sft_lora_finetune_llm.py", line 83, in main
    training_function(script_args, training_args)
  File "/home/ubuntu/optimum-neuron/docs/source/training_tutorials/sft_lora_finetune_llm.py", line 46, in training_function
    sft_config = NeuronSFTConfig(
TypeError: NeuronSFTConfig.__init__() got an unexpected keyword argument 'max_seq_length'
Traceback (most recent call last):
  File "/home/ubuntu/optimum-neuron/docs/source/training_tutorials/sft_lora_finetune_llm.py", line 87, in <module>
    main()
  File "/home/ubuntu/optimum-neuron/docs/source/training_tutorials/sft_lora_finetune_llm.py", line 83, in main
    training_function(script_args, training_args)
  File "/home/ubuntu/optimum-neuron/docs/source/training_tutorials/sft_lora_finetune_llm.py", line 46, in training_function
    sft_config = NeuronSFTConfig(
TypeError: NeuronSFTConfig.__init__() got an unexpected keyword argument 'max_seq_length'
Traceback (most recent call last):
  File "/home/ubuntu/optimum-neuron/docs/source/training_tutorials/sft_lora_finetune_llm.py", line 87, in <module>
    main()
  File "/home/ubuntu/optimum-neuron/docs/source/training_tutorials/sft_lora_finetune_llm.py", line 83, in main
    training_function(script_args, training_args)
  File "/home/ubuntu/optimum-neuron/docs/source/training_tutorials/sft_lora_finetune_llm.py", line 46, in training_function
    sft_config = NeuronSFTConfig(
TypeError: NeuronSFTConfig.__init__() got an unexpected keyword argument 'max_seq_length'
Traceback (most recent call last):
  File "/home/ubuntu/optimum-neuron/docs/source/training_tutorials/sft_lora_finetune_llm.py", line 87, in <module>
    main()
  File "/home/ubuntu/optimum-neuron/docs/source/training_tutorials/sft_lora_finetune_llm.py", line 83, in main
    training_function(script_args, training_args)
  File "/home/ubuntu/optimum-neuron/docs/source/training_tutorials/sft_lora_finetune_llm.py", line 46, in training_function
    sft_config = NeuronSFTConfig(
TypeError: NeuronSFTConfig.__init__() got an unexpected keyword argument 'max_seq_length'
Traceback (most recent call last):
  File "/home/ubuntu/optimum-neuron/docs/source/training_tutorials/sft_lora_finetune_llm.py", line 87, in <module>
    main()
  File "/home/ubuntu/optimum-neuron/docs/source/training_tutorials/sft_lora_finetune_llm.py", line 83, in main
    training_function(script_args, training_args)
  File "/home/ubuntu/optimum-neuron/docs/source/training_tutorials/sft_lora_finetune_llm.py", line 46, in training_function
    sft_config = NeuronSFTConfig(
TypeError: NeuronSFTConfig.__init__() got an unexpected keyword argument 'max_seq_length'
Traceback (most recent call last):
  File "/home/ubuntu/optimum-neuron/docs/source/training_tutorials/sft_lora_finetune_llm.py", line 87, in <module>
    main()
  File "/home/ubuntu/optimum-neuron/docs/source/training_tutorials/sft_lora_finetune_llm.py", line 83, in main
    training_function(script_args, training_args)
  File "/home/ubuntu/optimum-neuron/docs/source/training_tutorials/sft_lora_finetune_llm.py", line 46, in training_function
    sft_config = NeuronSFTConfig(
TypeError: NeuronSFTConfig.__init__() got an unexpected keyword argument 'max_seq_length'
Traceback (most recent call last):
  File "/home/ubuntu/optimum-neuron/docs/source/training_tutorials/sft_lora_finetune_llm.py", line 87, in <module>
    main()
  File "/home/ubuntu/optimum-neuron/docs/source/training_tutorials/sft_lora_finetune_llm.py", line 83, in main
    training_function(script_args, training_args)
  File "/home/ubuntu/optimum-neuron/docs/source/training_tutorials/sft_lora_finetune_llm.py", line 46, in training_function
    sft_config = NeuronSFTConfig(
TypeError: NeuronSFTConfig.__init__() got an unexpected keyword argument 'max_seq_length'
[2025-01-11 21:58:37,190] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 32495) of binary: /opt/aws_neuronx_venv_pytorch_2_1/bin/python3
Traceback (most recent call last):
  File "/opt/aws_neuronx_venv_pytorch_2_1/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/opt/aws_neuronx_venv_pytorch_2_1/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/aws_neuronx_venv_pytorch_2_1/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/opt/aws_neuronx_venv_pytorch_2_1/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/opt/aws_neuronx_venv_pytorch_2_1/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/aws_neuronx_venv_pytorch_2_1/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
docs/source/training_tutorials/sft_lora_finetune_llm.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2025-01-11_21:58:37
  host      : ip-172-31-113-192.us-west-2.compute.internal
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 32496)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2025-01-11_21:58:37
  host      : ip-172-31-113-192.us-west-2.compute.internal
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 32497)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2025-01-11_21:58:37
  host      : ip-172-31-113-192.us-west-2.compute.internal
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 32498)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
  time      : 2025-01-11_21:58:37
  host      : ip-172-31-113-192.us-west-2.compute.internal
  rank      : 4 (local_rank: 4)
  exitcode  : 1 (pid: 32499)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
  time      : 2025-01-11_21:58:37
  host      : ip-172-31-113-192.us-west-2.compute.internal
  rank      : 5 (local_rank: 5)
  exitcode  : 1 (pid: 32500)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]:
  time      : 2025-01-11_21:58:37
  host      : ip-172-31-113-192.us-west-2.compute.internal
  rank      : 6 (local_rank: 6)
  exitcode  : 1 (pid: 32501)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[7]:
  time      : 2025-01-11_21:58:37
  host      : ip-172-31-113-192.us-west-2.compute.internal
  rank      : 7 (local_rank: 7)
  exitcode  : 1 (pid: 32502)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-01-11_21:58:37
  host      : ip-172-31-113-192.us-west-2.compute.internal
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 32495)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
2025-01-11 21:58:37.000640:  32302  ERROR ||NEURON_PARALLEL_COMPILE||: There was an error in the training script.
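
For reference, the failure does not seem to require the full torchrun launch. Below is a minimal sketch of the failing pattern (keyword arguments other than max_seq_length are assumed for illustration; the traceback points at line 46 of sft_lora_finetune_llm.py), which with optimum-neuron 0.0.27 raises the same TypeError:

# Minimal sketch of the failing call; values are illustrative, not the tutorial's exact arguments.
from optimum.neuron import NeuronSFTConfig

sft_config = NeuronSFTConfig(
    output_dir="output-",
    max_seq_length=1024,  # rejected: TypeError ... unexpected keyword argument 'max_seq_length'
)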

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction (minimal, reproducible, runnable)

Execute this script from the tutorial:

#!/bin/bash

set -ex

export NEURON_FUSE_SOFTMAX=1
export NEURON_RT_ASYNC_EXEC_MAX_INFLIGHT_REQUESTS=3
export MALLOC_ARENA_MAX=64
export NEURON_CC_FLAGS="--model-type=transformer --distribution-strategy=llm-training --enable-saturate-infinity --cache_dir=/home/ubuntu/cache_dir_neuron/"

PROCESSES_PER_NODE=8

NUM_EPOCHS=1
TP_DEGREE=2
PP_DEGREE=1
BS=1
GRADIENT_ACCUMULATION_STEPS=8
LOGGING_STEPS=1
MODEL_NAME="meta-llama/Meta-Llama-3-8B"
OUTPUT_DIR=output-$SLURM_JOB_ID

if [ "$NEURON_EXTRACT_GRAPHS_ONLY" = "1" ]; then
    MAX_STEPS=$((LOGGING_STEPS + 5))
else
    MAX_STEPS=-1
fi


XLA_USE_BF16=1 neuron_parallel_compile torchrun --nproc_per_node $PROCESSES_PER_NODE docs/source/training_tutorials/sft_lora_finetune_llm.py \
  --model_id $MODEL_NAME \
  --num_train_epochs $NUM_EPOCHS \
  --do_train \
  --learning_rate 5e-5 \
  --warmup_ratio 0.03 \
  --max_steps $MAX_STEPS \
  --per_device_train_batch_size $BS \
  --per_device_eval_batch_size $BS \
  --gradient_accumulation_steps $GRADIENT_ACCUMULATION_STEPS \
  --gradient_checkpointing true \
  --bf16 \
  --zero_1 false \
  --tensor_parallel_size $TP_DEGREE \
  --pipeline_parallel_size $PP_DEGREE \
  --logging_steps $LOGGING_STEPS \
  --save_total_limit 1 \
  --output_dir $OUTPUT_DIR \
  --lr_scheduler_type "constant" \
  --overwrite_output_dir

Expected behavior

It should compile without error.
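
As a quick diagnostic (a sketch, not part of the tutorial), the keyword arguments accepted by the installed NeuronSFTConfig can be listed directly to confirm whether max_seq_length is exposed in this optimum-neuron version:

# Diagnostic sketch: inspect the parameters accepted by NeuronSFTConfig.__init__
# in the installed optimum-neuron (0.0.27 here).
import inspect

from optimum.neuron import NeuronSFTConfig

params = inspect.signature(NeuronSFTConfig.__init__).parameters
print("max_seq_length" in params)  # expected to print False, matching the TypeError in the log above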

ajayvohra2005 added the bug (Something isn't working) label on Jan 11, 2025