
llama2 training has nan #25065

@LZY-the-boys


System Info

  • transformers version: 4.31.0
  • Platform: Linux-5.19.0-42-generic-x86_64-with-glibc2.35
  • Python version: 3.9.16
  • Huggingface_hub version: 0.14.1
  • Safetensors version: 0.3.1
  • Accelerate version: 0.21.0
  • Accelerate config: - compute_environment: LOCAL_MACHINE
    - distributed_type: NO
    - mixed_precision: fp16
    - use_cpu: False
    - num_processes: 1
    - machine_rank: 0
    - num_machines: 1
    - gpu_ids: all
    - rdzv_backend: static
    - same_network: True
    - main_training_function: main
    - downcast_bf16: no
    - tpu_use_cluster: False
    - tpu_use_sudo: False
    - tpu_env: []
  • PyTorch version (GPU?): 2.0.0+cu117 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I am training llama2-7b with the Hugging Face Trainer. With fp16 enabled, the forward pass produces NaN for Llama 2 when the batch size is greater than 1, but with batch size 1 there is no error.
Digging into it, I found that the NaN appears in layers.31.input_layernorm and is caused by an inf in the layers.30.mlp forward after the post-attention layer norm; that inf seems to come from huge values in the hidden states. However, this does not explain why Llama 1, and Llama 2 with batch size 1, train fine even though they also have huge outliers in the hidden states.
The code I use is below.
The dataset format is: write a xxx. ###sentences: xxxx
I tried both the meta-llama/Llama-2-7b-hf checkpoint and the original Meta checkpoint converted with the transformers conversion script.

import pathlib
import transformers

# model, train_data, val_data, PROMPT, args and ddp are defined earlier in the script
trainer = transformers.Trainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=val_data,
    args=transformers.TrainingArguments(
        per_device_train_batch_size=args.micro_batch,
        gradient_accumulation_steps=args.gradient_accumulation_steps,
        warmup_ratio=args.warmup_ratio,
        num_train_epochs=args.num_epoch,
        learning_rate=3e-4,
        fp16=True,
        logging_steps=args.log_steps,
        logging_first_step=True,  # convenient
        evaluation_strategy="no",
        save_strategy=args.save_strategy,
        eval_steps=None,
        save_steps=args.save_steps,
        output_dir=args.output_path,
        load_best_model_at_end=False,
        ddp_find_unused_parameters=False if ddp else None,
        report_to="wandb" if args.wandb else [],
        ignore_data_skip=args.ignore_data_skip,
    ),
    data_collator=PROMPT.data_collator(),
)
model.config.use_cache = False

# resume from the latest checkpoint if one exists in the output directory
if list(pathlib.Path(args.output_path).glob("checkpoint-*")):
    trainer.train(resume_from_checkpoint=True)
else:
    trainer.train()
trainer.save_state()
model.save_pretrained(args.output_path)
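
For reference, here is a minimal sketch of how the non-finite values can be localized with forward hooks. It is not part of the training script above; model is the loaded Llama 2 model and batch is assumed to be one collated batch that triggers the NaN.

import torch

def make_hook(name):
    def hook(module, inputs, output):
        # some modules return tuples or ModelOutput objects; only check plain tensors
        out = output[0] if isinstance(output, tuple) else output
        if isinstance(out, torch.Tensor) and not torch.isfinite(out).all():
            print(f"non-finite output in {name}, max |x| = {out.abs().max().item()}")
    return hook

# hook every submodule, run one fp16 forward pass on the failing batch, then clean up
handles = [m.register_forward_hook(make_hook(n)) for n, m in model.named_modules()]
with torch.no_grad(), torch.autocast(device_type="cuda", dtype=torch.float16):
    model(**batch)
for h in handles:
    h.remove()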

Expected behavior

Training should not produce NaN in the forward pass, so that the loss stays finite in the backward pass.
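
To catch this during a run, a hedged sketch using the public transformers.TrainerCallback API (the callback below is illustrative and not part of my script) stops training as soon as a non-finite loss is logged:

import math
import transformers

class StopOnNonFiniteLoss(transformers.TrainerCallback):
    def on_log(self, args, state, control, logs=None, **kwargs):
        # the Trainer reports the running loss under the "loss" key of the logs dict
        loss = (logs or {}).get("loss")
        if loss is not None and not math.isfinite(loss):
            print(f"non-finite loss at step {state.global_step}: {loss}")
            control.should_training_stop = True
        return control

This would be passed as callbacks=[StopOnNonFiniteLoss()] when constructing the Trainer.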
