System Info
- transformers version: 4.31.0
- Platform: Linux-5.19.0-42-generic-x86_64-with-glibc2.35
- Python version: 3.9.16
- Huggingface_hub version: 0.14.1
- Safetensors version: 0.3.1
- Accelerate version: 0.21.0
- Accelerate config:
  - compute_environment: LOCAL_MACHINE
  - distributed_type: NO
  - mixed_precision: fp16
  - use_cpu: False
  - num_processes: 1
  - machine_rank: 0
  - num_machines: 1
  - gpu_ids: all
  - rdzv_backend: static
  - same_network: True
  - main_training_function: main
  - downcast_bf16: no
  - tpu_use_cluster: False
  - tpu_use_sudo: False
  - tpu_env: []
- PyTorch version (GPU?): 2.0.0+cu117 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?:
- Using distributed or parallel set-up in script?:
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
I am training llama2-7b with the Hugging Face Trainer. NaNs appear in the forward pass when training llama2 with batch size > 1, but with batch size = 1 there is no error.
I dug into it and found that the NaN first occurs in layers.31.input_layernorm. It is caused by inf values produced in the layers.30.mlp forward after the post-attention layer norm, and these inf values appear to come from huge values in the hidden states. However, this doesn't explain why llama1, and llama2 with batch size = 1, train fine, since they also have huge outliers in the hidden states.
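To locate where this happens, here is a minimal debugging sketch (my own, not part of the original script) that registers forward hooks and prints each module whose output contains inf/NaN, in execution order. It assumes model is the loaded Llama model on GPU and batch is one collated training batch:

import torch

def make_hook(name):
    def hook(module, inputs, output):
        # Decoder layers return tuples; take the hidden-states tensor.
        out = output[0] if isinstance(output, tuple) else output
        if torch.is_tensor(out) and (~torch.isfinite(out)).any():
            print(f"non-finite values in output of: {name}")
    return hook

handles = [m.register_forward_hook(make_hook(n)) for n, m in model.named_modules()]
# Run one forward pass under fp16 autocast to mimic the fp16=True training setup.
with torch.no_grad(), torch.autocast("cuda", dtype=torch.float16):
    model(**batch)
for h in handles:
    h.remove()

The first module printed is the earliest point in the forward pass where non-finite values appear.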
The dataset format is: write a xxx. ###sentences: xxxx
I have tried both the meta-llama/Llama-2-7b-hf checkpoint and the original Meta checkpoint converted with the transformers conversion script.
The code I use is like this:
import pathlib

import transformers

trainer = transformers.Trainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=val_data,
    args=transformers.TrainingArguments(
        per_device_train_batch_size=args.micro_batch,
        gradient_accumulation_steps=args.gradient_accumulation_steps,
        warmup_ratio=args.warmup_ratio,
        num_train_epochs=args.num_epoch,
        learning_rate=3e-4,
        fp16=True,
        logging_steps=args.log_steps,
        logging_first_step=True,  # convenient
        evaluation_strategy="no",
        save_strategy=args.save_strategy,
        eval_steps=None,
        save_steps=args.save_steps,
        output_dir=args.output_path,
        load_best_model_at_end=False,
        ddp_find_unused_parameters=False if ddp else None,
        report_to="wandb" if args.wandb else [],
        ignore_data_skip=args.ignore_data_skip,
    ),
    data_collator=PROMPT.data_collator(),
)
model.config.use_cache = False

# Resume from the latest checkpoint if one exists in the output directory.
if list(pathlib.Path(args.output_path).glob("checkpoint-*")):
    trainer.train(resume_from_checkpoint=True)
else:
    trainer.train()

trainer.save_state()
model.save_pretrained(args.output_path)
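For context, the inf values described above are consistent with float16's limited dynamic range; a small illustration (mine, not part of the report) of how a large activation value overflows under fp16:

import torch

print(torch.finfo(torch.float16).max)  # 65504.0, the largest representable fp16 value
x = torch.tensor([70000.0])            # a "huge value" like the hidden-state outliers
print(x.half())                        # tensor([inf], dtype=torch.float16)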
Expected behavior
Training should produce no NaNs in the forward pass, so the loss remains normal in the backward pass.