[NPU] feat: Support FSDP worker and vLLM Ascend #332
vermouth1992 merged 23 commits into verl-project:main
Conversation
Does this PR work on multi nodes?

I am currently testing on a single node only and will follow up with multi-node results.
transformers v4.51.4 starts to support ASCEND NPU to directly enable flash_attention_2.

Thank you for your suggestion. I will make the necessary changes in a follow-up.
docs/ascend/ascend_vllm0.7.3.rst
Outdated
> vLLM
> ------
>
> To ensure vLLM works properly with verl, the vLLM Ascend plugin (`vllm-ascend`) must be installed. For the vLLM versions supported on Huawei Ascend and their compatibility with vLLM Ascend, see the `installation guide <https://vllm-ascend.readthedocs.io/en/v0.7.1rc1/installation.html>`_.

The installation guide URL points to v0.7.1rc1; it should be updated to the v0.7.3 documentation.
docs/ascend/ascend_vllm0.7.3.rst
Outdated
> To ensure vLLM works properly with verl, the vLLM Ascend plugin (`vllm-ascend`) must be installed. For the vLLM versions supported on Huawei Ascend and their compatibility with vLLM Ascend, see the `installation guide <https://vllm-ascend.readthedocs.io/en/v0.7.1rc1/installation.html>`_.
>
> ------
> Ray

There seems to be no content under the Ray section; please confirm, and if it is indeed empty, remove the section heading.
docs/ascend/ascend_vllm0.7.3.rst
Outdated
> Accuracy comparison
> ------
>
> Empirically, for fine-tuning algorithms such as SFT, we expect the mean error between the loss on Huawei Ascend devices and the loss on NVIDIA GPUs under the same configuration to be less than 2%, computed as follows:

Suggest changing this to "mean absolute error less than or equal to 2%".
docs/ascend/ascend_vllm0.7.3.rst
Outdated
> where N is the number of training steps. For more information, see the [accuracy calculation guide](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/LMaccuracy_0001.html).
>
> Empirically, for reinforcement learning algorithms such as GRPO, we expect the mean absolute error between the reward on Huawei Ascend devices and on NVIDIA GPUs under the same configuration to be less than 4%, computed the same way as the loss.

Suggest changing this to "mean absolute error less than or equal to 4%".
requirements-npu.txt
Outdated
> pybind11
> pylatexenc
> ray
> tensordict<0.6
> import torch
> import torch.distributed
> from filelock import FileLock
> from verl.utils.device import is_cuda_available, is_npu_available

verl's internal modules should be imported after third-party libraries.
verl/utils/npu_patch.py
Outdated
> attn_output = self.proj(attn_output)
> return attn_output

Suggest documenting here why this patch is needed, and noting that it will be removed once the underlying issue is resolved.
> from torch import nn
> from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
>
> import verl.utils.torch_functional as verl_F

This line appears to have been deleted by mistake and needs to be restored; it is related to graph mode.
verl/workers/fsdp_workers.py
Outdated
> from verl.utils.import_utils import import_external_libs
> from verl.utils.model import compute_position_id_with_mask
> from verl.workers.sharding_manager.fsdp_ulysses import FSDPUlyssesShardingManager
> from verl.utils.device import get_device_name, is_cuda_available, get_torch_device, is_npu_available

Reorder the import to from verl.utils.device import get_device_name, get_torch_device, is_cuda_available, is_npu_available.
> world_size = torch.distributed.get_world_size()
> device = torch.cuda.current_device()  # used when fsdp2 set cpu_offload_policy
> loaded_params = model.load_weights(((name, param.to(device, non_blocking=True).full_tensor() if world_size != 1 and hasattr(param, "full_tensor") else param) for name, param in updated_params.items()))
> loaded_params = model.load_weights(((name, param.to(device, non_blocking=True).full_tensor() if isinstance(param, DTensor) else param) for name, param in updated_params.items()))

Please double-check this change; unless it is necessary, align it with the main branch.
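The difference between the two quoted lines is the dispatch condition. Below is a minimal sketch of why the isinstance-based variant is more robust, using a hypothetical DummyDTensor in place of torch's DTensor and a plain dict in place of model.load_weights:

```python
# Hypothetical stand-in for torch.distributed.tensor.DTensor; a real
# DTensor shards a weight across ranks and full_tensor() gathers it.
class DummyDTensor:
    def __init__(self, full):
        self._full = full

    def full_tensor(self):
        return self._full


def materialize(param):
    # The isinstance check gathers sharded params regardless of world
    # size; the `world_size != 1 and hasattr(param, "full_tensor")`
    # variant silently skips gathering when world_size == 1.
    return param.full_tensor() if isinstance(param, DummyDTensor) else param


updated_params = {"w": DummyDTensor([1.0, 2.0]), "b": [0.5]}
loaded = {name: materialize(p) for name, p in updated_params.items()}
print(loaded)  # {'w': [1.0, 2.0], 'b': [0.5]}
```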
verl/utils/npu_patch.py
Outdated
Since VL models are not supported yet, suggest leaving this patch out of the first PR and adding it later once vllm-ascend explicitly supports them.
verl/__init__.py
Outdated
> patch_hub()
>
> if is_npu_available:
>     from .utils import npu_patch
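The guarded import quoted above follows a common availability-gated pattern. A minimal sketch using importlib.util.find_spec; the torch_npu package name is an assumption about how Ascend support is shipped:

```python
import importlib.util


# Hypothetical availability probe, sketching the is_npu_available gate
# used in verl/__init__.py: a package counts as available if it can be
# found on the import path without actually importing it.
def package_available(name: str) -> bool:
    return importlib.util.find_spec(name) is not None


# Assumption: Ascend NPU support ships as the torch_npu package.
is_npu_available = package_available("torch_npu")

if is_npu_available:
    # In verl this is where `from .utils import npu_patch` runs,
    # imported purely for its patching side effects.
    pass

print(f"NPU available: {is_npu_available}")
```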
> @@ -0,0 +1,30 @@
> # Tested with 1 & 8 NPUs

Does "1 & 8 NPUs" mean 8 cards on a single machine? If so, suggest changing it to "Tested on 1 node with 8 NPUs"; the same applies to the shell script below.
tests/npu/run_qwen05_gsm8k_grpo.sh
Outdated
> @@ -0,0 +1,44 @@
> # Tested with 1 & 8 NPUs

Suggest a consistent file naming style; recommend renaming this to run_qwen2_5_05b_grpo.sh.
verl/__init__.py
Outdated
> patch_hub()
>
> if is_npu_available:
docs/ascend/ascend_vllm0.7.3.rst
Outdated
> 1. To use vLLM, follow the vllm-ascend installation guide <https://vllm-ascend.readthedocs.io/en/v0.7.3/installation.html>.
> 2. To enable flash_attention_2 on ASCEND NPU, the transformers version must be greater than or equal to 4.52.0.
> 3. GRPO training is currently supported for LLM models; GRPO training for VLM models is blocked by a vllm-ascend problem and will be supported later. The related issue is:

The text below says SFT is supported, but this passage describes only GRPO as supported; the two statements contradict each other, please confirm.
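The version requirement in item 2 of the quoted docs can be enforced up front. A minimal sketch, assuming plain numeric version strings; pick_attn_implementation and the "eager" fallback are illustrative, though the attn_implementation values follow the Hugging Face transformers convention:

```python
# Sketch: choose an attention implementation based on the installed
# transformers version (docs require >= 4.52.0 for flash_attention_2
# on Ascend NPU). Assumes plain numeric versions like "4.52.0";
# pre-release tags such as "4.52.0rc1" are not handled.
def pick_attn_implementation(transformers_version: str) -> str:
    def as_tuple(v: str):
        return tuple(int(x) for x in v.split(".")[:3])

    if as_tuple(transformers_version) >= (4, 52, 0):
        return "flash_attention_2"
    return "eager"  # safe fallback on older versions


print(pick_attn_implementation("4.52.0"))  # flash_attention_2
print(pick_attn_implementation("4.51.4"))  # eager
```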
For developers, you can follow the docs: docs/ascend/ascend.rst
This PR adds support for the Ascend NPU backend.
Co-authored-by: Chendong98 chendong136@huawei.com
Co-authored-by: zheliuyu 15750543867@163.com
Co-authored-by: celestialli celestialli@outlook.com
In this PR, we add the capability to detect the NPU device type, along with a new training script for NPU.
These are the changes:
Here is our roadmap:
Roadmap
News
[2025.03.31] Added SFT and GRPO results. Qwen2-7B-Instruct was tested on 2*8 devices, and many batch-size-related parameters had to be reduced, so this result is for reference only. We will publish reward results with the default parameters as soon as sleep mode is supported.
[2025.03.03] Modified the Ray adaptation method.
[2025.02.25] The PPO algorithm is supported for training on NPU with the FSDP backend.
[2025.02.23] The SFT algorithm is supported for training on NPU with the FSDP backend.
[2025.02.21] The GRPO algorithm is supported for training on NPU with the FSDP backend.
Requirements
We used this PR to test on both Ascend NPU and GPU to ensure the same code can run on different devices. The hardware is 8 Atlas 800T A2 and 8 A100. Other software information is shown in the following table.
About mean error

Due to differences in hardware architecture, we cannot guarantee that the loss on Ascend NPU exactly matches that on GPU. In our experience, a loss difference of less than 2% is acceptable; if the difference exceeds 2%, we will try to fix it. The calculation formula is as follows.
N represents the number of training steps. For more information, please refer to the calculation accuracy description.
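The check described above can be sketched as a short computation; the per-step relative error and the function name are assumptions, since the formula image itself is not reproduced here:

```python
# Hypothetical sketch of the mean-error check: average, over N training
# steps, the relative difference between the NPU loss and the GPU loss.
def mean_relative_error(npu_losses, gpu_losses):
    assert len(npu_losses) == len(gpu_losses)
    n = len(npu_losses)  # N: number of training steps
    return sum(abs(a - b) / abs(b) for a, b in zip(npu_losses, gpu_losses)) / n


# Illustrative loss curves (made-up numbers, not real results).
npu = [2.10, 1.52, 1.01]
gpu = [2.08, 1.50, 1.00]
err = mean_relative_error(npu, gpu)
print(f"mean error: {err:.2%}, acceptable: {err <= 0.02}")
```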