Note
Please avoid creating issues regarding the following questions, as they might be closed without a response.
Tip
Chinese getting-started tutorial: https://zhuanlan.zhihu.com/p/695287607
Out-of-memory
The out-of-memory (OOM) error during training usually occurs because the current device does not have enough free memory to complete the computation. You can try the following methods to resolve this issue:
Reduce the per-device training batch size: per_device_train_batch_size: 1
Reduce the maximum sequence length: cutoff_len: 512
Set quantization_bit: 4 to quantize the model parameters (only compatible with LoRA tuning)
Use DeepSpeed ZeRO-3 or FSDP to shard model weights across multiple devices
Use the paged optimizer: optim: paged_adamw_8bit
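Combined, these mitigations might look like the following fragment of a training config; a minimal sketch assuming LoRA tuning (required for quantization_bit), not a tuned recommendation:

```yaml
# a sketch of the OOM mitigations above, assuming LoRA tuning
finetuning_type: lora
per_device_train_batch_size: 1   # smallest per-device batch
cutoff_len: 512                  # shorter maximum sequence length
quantization_bit: 4              # 4-bit quantization, LoRA only
optim: paged_adamw_8bit          # paged 8-bit optimizer
```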
Unsatisfying fine-tuning results
Unsatisfying fine-tuning results are usually caused by too few training samples, which leads to underfitting. You can try the following methods to resolve this issue:
Increase the size of the training dataset
Increase the number of epochs (num_train_epochs: 5.0) or steps (max_steps: 1000)
Use a larger learning rate (learning_rate: 2.0e-4)
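In config form, a minimal sketch of the knobs above (the values are the illustrative ones from this list, not tuned recommendations):

```yaml
# a sketch of the underfitting fixes above
num_train_epochs: 5.0    # more passes over the data (or set max_steps: 1000 instead)
learning_rate: 2.0e-4    # larger learning rate
```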
Corrupted or repeated model responses
If this issue occurs before training, it is usually caused by using an unaligned (base) model or a mismatched template. Please make sure an aligned (instruct/chat) model and the correct template are used.
If this issue occurs after training, please check that the template used for training and inference is the same, and also check whether overfitting has occurred.
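As a hedged illustration of the template point: the training and inference configs should name the same template for the same model. The model and template values below are assumptions for a Llama-3-style instruct model; pick the ones matching your model:

```yaml
# a sketch: use an aligned (instruct) model with its matching template,
# and keep the template identical across training and inference configs
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct  # assumed example model
template: llama3                                         # assumed matching template
```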
llamafactory-cli: command not found
Please install LLaMA Factory using pip according to the README. If errors persist after installation, try restarting the terminal.
Mixed precision training problems
The error when using bf16 occurs because some devices (e.g., NPUs) do not support bfloat16. Please replace bf16 with fp16.
If the error occurs on GPU devices, please use the following command to check whether the CUDA version of PyTorch is installed correctly:
python -c "import torch; print(torch.cuda.is_available())"
Mixed precision training is not supported on CPUs or Macs. Please remove the bf16 or fp16 parameter and set low_cpu_mem_usage: false.
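On a device without bfloat16 support (e.g., an NPU), the fix described above amounts to a one-line config change; a minimal sketch:

```yaml
# a sketch: replace bf16 with fp16 on devices that lack bfloat16 support
bf16: false   # or remove the bf16 line entirely
fp16: true
```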
LLaMA Board cannot display datasets
Please ensure that the working directory when launching the LLaMA Board is the same as the LLaMA-Factory directory.
How to shard model weights across multiple devices
During the training phase, please refer to the examples for how to use DeepSpeed ZeRO-3 (recommended) or FSDP.
During the inference phase, the model is automatically loaded onto all available devices.
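For training, enabling ZeRO-3 typically means pointing the config at a DeepSpeed JSON file; a minimal sketch, where the file path is an assumption based on the repository's examples directory (check your checkout):

```yaml
# a sketch: shard model weights with DeepSpeed ZeRO-3 during training
# the path below is an assumption; use the ZeRO-3 config from the examples
deepspeed: examples/deepspeed/ds_z3_config.json
```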
Distributed training stuck
Try setting the environment variable export NCCL_P2P_LEVEL=NVL.
How to use ORPO or SimPO
Change pref_loss in the example script to orpo or simpo.
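In config terms this is a one-line change; a minimal sketch, where stage: dpo is an assumption about the preference-training stage used by the example script:

```yaml
# a sketch: switch the preference loss to ORPO (or SimPO)
stage: dpo        # assumed preference-training stage from the example script
pref_loss: orpo   # or: simpo
```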
Tip
If the problems still exist with the latest code, please create an issue.