
llama2 training has nan #25065

@LZY-the-boys


System Info

  • transformers version: 4.31.0
  • Platform: Linux-5.19.0-42-generic-x86_64-with-glibc2.35
  • Python version: 3.9.16
  • Huggingface_hub version: 0.14.1
  • Safetensors version: 0.3.1
  • Accelerate version: 0.21.0
  • Accelerate config: - compute_environment: LOCAL_MACHINE
    - distributed_type: NO
    - mixed_precision: fp16
    - use_cpu: False
    - num_processes: 1
    - machine_rank: 0
    - num_machines: 1
    - gpu_ids: all
    - rdzv_backend: static
    - same_network: True
    - main_training_function: main
    - downcast_bf16: no
    - tpu_use_cluster: False
    - tpu_use_sudo: False
    - tpu_env: []
  • PyTorch version (GPU?): 2.0.0+cu117 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I am training llama2-7b with the Hugging Face Trainer. With fp16 enabled, the forward pass produces NaN for Llama 2 when the batch size is greater than 1, but with batch size 1 there is no error.
Digging into it, I found that the NaN appears in layers.31.input_layernorm and is caused by an inf in the layers.30.mlp forward after the post-attention layer norm; that inf seems to come from huge values in the hidden states. However, this does not explain why Llama 1, and Llama 2 with batch size 1, train fine even though they also have huge outliers in the hidden states.
The code I use is below.
The dataset format is: write a xxx. ###sentences: xxxx
I tried both the meta-llama/Llama-2-7b-hf checkpoint and the original Meta checkpoint converted with the transformers conversion script.

import pathlib
import transformers

# model, train_data, val_data, PROMPT, args and ddp are defined earlier in the script
trainer = transformers.Trainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=val_data,
    args=transformers.TrainingArguments(
        per_device_train_batch_size=args.micro_batch,
        gradient_accumulation_steps=args.gradient_accumulation_steps,
        warmup_ratio=args.warmup_ratio,
        num_train_epochs=args.num_epoch,
        learning_rate=3e-4,
        fp16=True,
        logging_steps=args.log_steps,
        logging_first_step=True,  # convenient
        evaluation_strategy="no",
        save_strategy=args.save_strategy,
        eval_steps=None,
        save_steps=args.save_steps,
        output_dir=args.output_path,
        load_best_model_at_end=False,
        ddp_find_unused_parameters=False if ddp else None,
        report_to="wandb" if args.wandb else [],
        ignore_data_skip=args.ignore_data_skip,
    ),
    data_collator=PROMPT.data_collator(),
)
model.config.use_cache = False

# resume from the latest checkpoint if one exists in the output directory
if list(pathlib.Path(args.output_path).glob("checkpoint-*")):
    trainer.train(resume_from_checkpoint=True)
else:
    trainer.train()
trainer.save_state()
model.save_pretrained(args.output_path)
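
For reference, here is a minimal sketch of how the non-finite values can be localized with forward hooks. It is not part of the training script above; model is the loaded Llama 2 model and batch is assumed to be one collated batch that triggers the NaN.

import torch

def make_hook(name):
    def hook(module, inputs, output):
        # some modules return tuples or ModelOutput objects; only check plain tensors
        out = output[0] if isinstance(output, tuple) else output
        if isinstance(out, torch.Tensor) and not torch.isfinite(out).all():
            print(f"non-finite output in {name}, max |x| = {out.abs().max().item()}")
    return hook

# hook every submodule, run one fp16 forward pass on the failing batch, then clean up
handles = [m.register_forward_hook(make_hook(n)) for n, m in model.named_modules()]
with torch.no_grad(), torch.autocast(device_type="cuda", dtype=torch.float16):
    model(**batch)
for h in handles:
    h.remove()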

Expected behavior

Training should not produce NaN in the forward pass, so that the loss stays finite in the backward pass.
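
To catch this during a run, a hedged sketch using the public transformers.TrainerCallback API (the callback below is illustrative and not part of my script) stops training as soon as a non-finite loss is logged:

import math
import transformers

class StopOnNonFiniteLoss(transformers.TrainerCallback):
    def on_log(self, args, state, control, logs=None, **kwargs):
        # the Trainer reports the running loss under the "loss" key of the logs dict
        loss = (logs or {}).get("loss")
        if loss is not None and not math.isfinite(loss):
            print(f"non-finite loss at step {state.global_step}: {loss}")
            control.should_training_stop = True
        return control

This would be passed as callbacks=[StopOnNonFiniteLoss()] when constructing the Trainer.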
