Merged

23 commits
2e88f88
support ASCEND NPU
sunyi0505 Feb 21, 2025
bdd0764
[fix] npu does not support torch.distributed.ReduceOp.AVG
zheliuyu Feb 22, 2025
63ac392
support ASCEND NPU
sunyi0505 Feb 22, 2025
c31e901
Merge branch 'main' into vllm-0.7-npu
sunyi0505 Mar 13, 2025
ee7d9fa
support vllm0.7.3 on ASCEND NPU
sunyi0505 Mar 18, 2025
7ea5091
support vllm0.7.3 on ASCEND NPU
sunyi0505 Mar 18, 2025
354fdd0
Merge remote-tracking branch 'origin/vllm-0.7.3-npu' into vllm-0.7.3-npu
sunyi0505 Mar 19, 2025
92b2de2
Support ASCEND NPU
sunyi0505 Mar 21, 2025
35576f2
support ASCEND NPU
sunyi0505 Apr 7, 2025
90028f1
support fa2 and use_remove_padding for ASCEND NPU
sunyi0505 Apr 19, 2025
196afcb
rebase code to verl main
sunyi0505 May 15, 2025
9498189
rebase code to verl main
sunyi0505 May 16, 2025
a98298a
change transformers version for ASCEND NPU
sunyi0505 May 16, 2025
d8417c3
change transformers version for ASCEND NPU
sunyi0505 May 16, 2025
528af81
support ASCEND NPU
sunyi0505 May 20, 2025
97d2fad
adjust import order
sunyi0505 May 21, 2025
63c00b9
Merge branch 'main' into vllm-0.7-npu
sunyi0505 May 22, 2025
2f22dca
add explanation of SFT support in the readme
sunyi0505 May 22, 2025
fbe16cf
modify ci docker timeout and license description
sunyi0505 May 22, 2025
ed22d71
modify CI file and remove device_name param
sunyi0505 May 22, 2025
1c071c6
modify default device_name to cuda
sunyi0505 May 23, 2025
e94006a
modify default device_name to cuda
sunyi0505 May 23, 2025
b58808d
modify default device_name to cuda
sunyi0505 May 23, 2025
51 changes: 47 additions & 4 deletions .github/workflows/e2e_ascend.yml
@@ -26,9 +26,9 @@ jobs:
test:
name: verl Ascend test (self-host)
runs-on: [self-hosted, npu-0]
timeout-minutes: 5 # Increase this timeout value as needed
timeout-minutes: 30 # Increase this timeout value as needed
container:
image: quay.io/ascend/cann:8.0.0-910b-ubuntu22.04-py3.10
image: quay.io/ascend/cann:8.1.rc1-910b-ubuntu22.04-py3.10
volumes:
- /usr/local/dcmi:/usr/local/dcmi
- /usr/local/bin/npu-smi:/usr/local/bin/npu-smi
@@ -42,13 +42,56 @@ jobs:
--device /dev/hisi_hdc
--privileged
--network "host"
--shm-size 2g
env:
HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
HF_ENDPOINT: "https://hf-mirror.com"
HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
steps:
- name: Check npu and CANN info
run: |
cat /usr/local/Ascend/ascend-toolkit/latest/"$(uname -i)"-linux/ascend_toolkit_install.info
npu-smi info
- name: Checkout volcengine/verl repo
uses: actions/checkout@v4
- name: Run test
- name: Install torch
run: |
lscpu
pip install torch==2.5.1+cpu --index-url https://download.pytorch.org/whl/cpu
pip install torch-npu==2.5.1
pip install /usr/local/Ascend/ascend-toolkit/latest/lib64/te-0.4.0-py3-none-any.whl
- name: Install vllm
run: |
apt-get update && apt-get install -y git
git clone -b v0.7.3 --depth 1 https://github.com/vllm-project/vllm.git vllm-npu
cd vllm-npu
pip install -r requirements-build.txt
VLLM_TARGET_DEVICE=empty pip install -e . --extra-index https://download.pytorch.org/whl/cpu/
- name: Install vllm-ascend
run: |
pip list
pip show torch
git clone -b v0.7.3 --depth 1 https://github.com/vllm-project/vllm-ascend.git
cd vllm-ascend
export COMPILE_CUSTOM_KERNELS=1
python setup.py install
- name: Install the current repository
run: |
pip3 install hf_transfer peft
pip3 install -r requirements-npu.txt
pip install -e .
- name: Prepare gsm8k dataset
run: |
ray stop --force
python3 examples/data_preprocess/gsm8k.py
- name: Running gsm8k e2e training tests with LoRA on ASCEND NPU
run: |
ray stop --force
bash tests/e2e/sft/run_sft.sh
rm -rf $HOME/ckpts
- name: Running gsm8k e2e training tests with GRPO on ASCEND NPU
run: |
ray stop --force
bash tests/npu/run_qwen2_5_05b_grpo.sh
rm -rf $HOME/ckpts
82 changes: 82 additions & 0 deletions docs/ascend/ascend_vllm073.rst
@@ -0,0 +1,82 @@
verl x Ascend
=============

We add support for Huawei Ascend devices to verl.

Hardware Support
================

* Atlas 800T A2

* Atlas 200T A2 Box16

Installation
============

Environment
-----------

+-----------+-------------+
| Software  | Version     |
+-----------+-------------+
| Python    | == 3.10     |
+-----------+-------------+
| torch     | == 2.5.1    |
+-----------+-------------+
| torch_npu | == 2.5.1rc1 |
+-----------+-------------+
| CANN      | == 8.1.RC1  |
+-----------+-------------+

1. To use vLLM, follow the `vllm-ascend installation guide <https://vllm-ascend.readthedocs.io/en/v0.7.3/installation.html>`_.
2. To enable flash_attention_2 on ASCEND NPU, transformers >= 4.52.0 is required.
3. SFT and GRPO training of LLM models are currently supported. GRPO training of VLM models is blocked by vllm-ascend problems and will be supported later; the related issues are:

   https://github.com/vllm-project/vllm-ascend/issues/809

   https://github.com/vllm-project/vllm-ascend/issues/825

Install from source
-------------------

.. code-block:: bash

    git clone https://github.com/volcengine/verl.git
    cd verl
    pip install -r requirements-npu.txt
    pip install -e .

vLLM
----

To use vLLM with verl, the vLLM Ascend plugin (`vllm-ascend`) must be installed. For the vLLM versions supported on Huawei Ascend and their compatibility with vLLM Ascend, see the `installation guide <https://vllm-ascend.readthedocs.io/en/v0.7.3/installation.html>`_.

Notes on other third-party libraries
------------------------------------

+--------------+---------------+
| Software     | Notes         |
+--------------+---------------+
| flash_attn   | not supported |
+--------------+---------------+
| liger-kernel | not supported |
+--------------+---------------+

Accuracy Comparison
-------------------

Empirically, for fine-tuning algorithms such as SFT, we expect the mean absolute error between the loss on Huawei Ascend devices and the loss on NVIDIA GPUs under the same configuration to be at most 2%, computed as follows:

.. image:: https://github.com/eric-haibin-lin/verl-community/tree/main/docs/loss_comparison.png
   :alt: loss_comparison

where N is the number of training steps. For more information, see the `accuracy comparison guide <https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/LMaccuracy_0001.html>`_.

Empirically, for reinforcement learning algorithms such as GRPO, we expect the mean absolute error between the reward on Huawei Ascend devices and the reward on NVIDIA GPUs under the same configuration to be at most 4%, computed in the same way as the loss comparison.

Progress
--------

+-----------+-----------+
| Algorithm | Status    |
+-----------+-----------+
| SFT       | supported |
+-----------+-----------+
| GRPO      | supported |
+-----------+-----------+
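The accuracy comparison described in the documentation above (mean absolute error between NPU and GPU loss curves over N steps, with a 2% threshold) can be sketched as follows. The exact formula lives in the linked Ascend guide, so the relative-error form used here is an assumption:

```python
def mean_abs_rel_error(loss_npu, loss_gpu):
    """Mean absolute relative error between two loss curves of length N."""
    assert len(loss_npu) == len(loss_gpu) and loss_gpu
    return sum(abs(a - b) / abs(b) for a, b in zip(loss_npu, loss_gpu)) / len(loss_gpu)

# Hypothetical loss curves from two otherwise identical runs
npu = [2.00, 1.50, 1.20]
gpu = [2.02, 1.47, 1.21]
print(mean_abs_rel_error(npu, gpu) <= 0.02)  # → True
```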
1 change: 1 addition & 0 deletions recipe/dapo/main_dapo.py
@@ -21,6 +21,7 @@
import ray

from .dapo_ray_trainer import RayDAPOTrainer
from verl.utils.device import is_cuda_available


def get_custom_reward_fn(config):
2 changes: 2 additions & 0 deletions recipe/sppo/main_sppo.py
@@ -25,6 +25,7 @@
from verl.trainer.ppo.reward import load_reward_manager

from .sppo_ray_trainer import RaySPPOTrainer
from verl.utils.device import is_cuda_available


@hydra.main(config_path="config", config_name="sppo_trainer", version_base=None)
@@ -140,6 +141,7 @@ def run(self, config):
ray_worker_group_cls=ray_worker_group_cls,
reward_fn=reward_fn,
val_reward_fn=val_reward_fn,
device_name="cuda" if is_cuda_available else "npu",
)
trainer.init_workers()
trainer.fit()
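The `device_name="cuda" if is_cuda_available else "npu"` expression above amounts to a small device-selection helper. A standalone sketch — the real availability flags come from `verl.utils.device`, which is assumed here:

```python
def pick_device_name(cuda_available: bool, npu_available: bool) -> str:
    """Prefer CUDA, then NPU, then CPU — mirrors the ternary in the trainer entrypoints."""
    if cuda_available:
        return "cuda"
    if npu_available:
        return "npu"
    return "cpu"

print(pick_device_name(False, True))  # → npu
```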
2 changes: 2 additions & 0 deletions recipe/sppo/sppo_ray_trainer.py
@@ -86,6 +86,7 @@ def __init__(
val_dataset: Optional[Dataset] = None,
collate_fn=None,
train_sampler: Optional[Sampler] = None,
device_name="cuda",
):
self.tokenizer = tokenizer
self.processor = processor
@@ -105,6 +106,7 @@ def __init__(
self.use_rm = Role.RewardModel in role_worker_mapping
self.ray_worker_group_cls = ray_worker_group_cls
self.validation_generations_logger = ValidationGenerationsLogger()
self.device_name = device_name

# define in-reward KL control
# kl loss control currently not suppoorted
20 changes: 20 additions & 0 deletions requirements-npu.txt
@@ -0,0 +1,20 @@
# requirements.txt records the full set of dependencies for development
accelerate
codetiming
datasets
dill
hydra-core
numpy
pandas
peft
pyarrow>=15.0.0
pybind11
pylatexenc
ray
tensordict<=0.6.2
transformers>=4.52.0
wandb
mathruler
torchdata
einops
qwen_vl_utils
42 changes: 42 additions & 0 deletions tests/npu/run_qwen2_5_05b_grpo.sh
@@ -0,0 +1,42 @@
set -x

export VLLM_ATTENTION_BACKEND=XFORMERS

python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=grpo \
data.train_files=$HOME/data/gsm8k/train.parquet \
data.val_files=$HOME/data/gsm8k/test.parquet \
data.train_batch_size=128 \
data.max_prompt_length=512 \
data.max_response_length=128 \
data.filter_overlong_prompts=True \
data.truncation='error' \
actor_rollout_ref.model.path=Qwen/Qwen2.5-0.5B-Instruct \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.model.use_remove_padding=False \
actor_rollout_ref.actor.ppo_mini_batch_size=64 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=20 \
actor_rollout_ref.actor.use_kl_loss=True \
actor_rollout_ref.actor.kl_loss_coef=0.001 \
actor_rollout_ref.actor.kl_loss_type=low_var_kl \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.fsdp_config.param_offload=False \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=40 \
actor_rollout_ref.rollout.enable_chunked_prefill=False \
actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
actor_rollout_ref.rollout.n=5 \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=40 \
actor_rollout_ref.ref.fsdp_config.param_offload=True \
algorithm.kl_ctrl.kl_coef=0.001 \
trainer.critic_warmup=0 \
trainer.logger=['console'] \
trainer.project_name='verl_grpo_example_gsm8k' \
trainer.experiment_name='qwen2_7b_function_rm' \
trainer.n_gpus_per_node=8 \
trainer.nnodes=1 \
trainer.save_freq=-1 \
trainer.test_freq=5 \
trainer.total_epochs=1 $@
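The scripts set `kl_loss_type=low_var_kl`. In verl this is commonly the k3 KL estimator (`exp(r) - r - 1` with `r` the log-ratio), which is nonnegative and lower-variance than the naive sample estimate. A per-token sketch, assuming that definition and the clamp used in verl's `core_algos`:

```python
import math

def low_var_kl(logprob: float, ref_logprob: float, clamp: float = 10.0) -> float:
    """k3 estimator of the KL penalty for one token.

    r = ref_logprob - logprob; exp(r) - r - 1 >= 0 for all r,
    with equality iff the two log-probs agree.
    """
    r = ref_logprob - logprob
    k3 = math.exp(r) - r - 1.0
    return max(-clamp, min(clamp, k3))  # clamp for numerical safety

print(low_var_kl(-1.3, -1.3))  # → 0.0
```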
43 changes: 43 additions & 0 deletions tests/npu/run_qwen2_5_32b_grpo.sh
@@ -0,0 +1,43 @@
set -x

# If you are using vllm<=0.6.3, you might need to set the following environment variable to avoid bugs:
# export VLLM_ATTENTION_BACKEND=XFORMERS

python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=grpo \
data.train_files=$HOME/data/gsm8k/train.parquet \
data.val_files=$HOME/data/gsm8k/test.parquet \
data.train_batch_size=1024 \
data.max_prompt_length=1024 \
data.max_response_length=1024 \
data.filter_overlong_prompts=True \
data.truncation='error' \
actor_rollout_ref.model.path=Qwen/Qwen2.5-32B-Instruct \
actor_rollout_ref.actor.optim.lr=1e-6\
actor_rollout_ref.model.use_remove_padding=False \
actor_rollout_ref.actor.ppo_mini_batch_size=128 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 \
actor_rollout_ref.actor.use_kl_loss=True \
actor_rollout_ref.actor.entropy_coeff=0 \
actor_rollout_ref.actor.kl_loss_coef=0.001 \
actor_rollout_ref.actor.kl_loss_type=low_var_kl \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.fsdp_config.param_offload=False \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=2 \
actor_rollout_ref.rollout.tensor_model_parallel_size=8 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
actor_rollout_ref.rollout.n=5 \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=2 \
actor_rollout_ref.ref.fsdp_config.param_offload=True \
algorithm.use_kl_in_reward=False \
trainer.critic_warmup=0 \
trainer.logger=['console'] \
trainer.project_name='verl_grpo_example_gsm8k' \
trainer.experiment_name='qwen2_5_32b_function_rm' \
trainer.n_gpus_per_node=16 \
trainer.nnodes=2 \
trainer.save_freq=-1 \
trainer.test_freq=10 \
trainer.total_epochs=15 $@
44 changes: 44 additions & 0 deletions tests/npu/run_qwen2_5_7b_grpo.sh
@@ -0,0 +1,44 @@
set -x

# If you are using vllm<=0.6.3, you might need to set the following environment variable to avoid bugs:
# export VLLM_ATTENTION_BACKEND=XFORMERS

python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=grpo \
data.train_files=$HOME/data/gsm8k/train.parquet \
data.val_files=$HOME/data/gsm8k/test.parquet \
data.train_batch_size=1024 \
data.max_prompt_length=1024 \
data.max_response_length=1024 \
data.filter_overlong_prompts=True \
data.truncation='error' \
actor_rollout_ref.model.path=Qwen/Qwen2.5-7B-Instruct \
actor_rollout_ref.actor.optim.lr=5e-8 \
actor_rollout_ref.model.use_remove_padding=False \
actor_rollout_ref.actor.ppo_mini_batch_size=32 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 \
actor_rollout_ref.actor.use_kl_loss=True \
actor_rollout_ref.actor.entropy_coeff=0 \
actor_rollout_ref.actor.kl_loss_coef=0.001 \
actor_rollout_ref.actor.kl_loss_type=low_var_kl \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.fsdp_config.param_offload=False \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=2 \
actor_rollout_ref.rollout.tensor_model_parallel_size=4 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.gpu_memory_utilization=0.3 \
actor_rollout_ref.rollout.n=5 \
actor_rollout_ref.rollout.enable_chunked_prefill=False \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=2 \
actor_rollout_ref.ref.fsdp_config.param_offload=True \
algorithm.use_kl_in_reward=False \
trainer.critic_warmup=0 \
trainer.logger=['console'] \
trainer.project_name='verl_grpo_example_gsm8k' \
trainer.experiment_name='qwen2_5_7b_function_rm' \
trainer.n_gpus_per_node=16 \
trainer.nnodes=1 \
trainer.save_freq=-1 \
trainer.test_freq=5 \
trainer.total_epochs=5 $@
18 changes: 18 additions & 0 deletions verl/__init__.py
@@ -14,9 +14,13 @@

import logging
import os
import pkg_resources

from pkg_resources import DistributionNotFound
from packaging.version import parse as parse_version
from .protocol import DataProto
from .utils.logging_utils import set_basic_config
from .utils.device import is_npu_available

version_folder = os.path.dirname(os.path.join(os.path.abspath(__file__)))

@@ -38,3 +42,17 @@
from modelscope.utils.hf_util import patch_hub

patch_hub()

if is_npu_available:
package_name = 'transformers'
required_version_spec = '4.51.0'
try:
installed_version = pkg_resources.get_distribution(package_name).version
installed = parse_version(installed_version)
required = parse_version(required_version_spec)

if not installed >= required:
raise ValueError(f"{package_name} version >= {required_version_spec} is required on ASCEND NPU, current version is {installed}.")
except DistributionNotFound:
raise ImportError(
f"package {package_name} is not installed, please run pip install {package_name}=={required_version_spec}")
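The version gate above can be expressed without the deprecated `pkg_resources` API. A sketch of the same check using `importlib.metadata` and `packaging` — an alternative formulation, not part of the patch:

```python
from importlib.metadata import PackageNotFoundError, version
from packaging.version import parse as parse_version

def require_min_version(package: str, minimum: str) -> None:
    """Raise if `package` is missing or older than `minimum`."""
    try:
        installed = parse_version(version(package))
    except PackageNotFoundError:
        raise ImportError(f"package {package} is not installed, please run pip install {package}>={minimum}")
    if installed < parse_version(minimum):
        raise ValueError(f"{package} version >= {minimum} is required on ASCEND NPU, current version is {installed}.")
```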
2 changes: 1 addition & 1 deletion verl/models/transformers/qwen2_vl.py
@@ -33,7 +33,7 @@
)

try:
from flash_attn import flash_attn_func, flash_attn_varlen_func
from transformers.modeling_flash_attention_utils import flash_attn_func, flash_attn_varlen_func

_flash_supports_window_size = "window_size" in list(inspect.signature(flash_attn_func).parameters)
except ImportError: