113 changes: 113 additions & 0 deletions docs/ascend_tutorial/ascend_sglang_quick_start.rst
@@ -0,0 +1,113 @@
verl x Ascend
===================================

Last updated: 09/25/2025.

verl now supports Huawei Ascend devices.

Supported hardware
-----------------------------------

Atlas 200T A2 Box16

Atlas 900 A2 PODc

Atlas 800T A3


Installation
-----------------------------------

Base environment setup
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

+-----------+-------------+
| software  | version     |
+-----------+-------------+
| Python    | == 3.11     |
+-----------+-------------+
| CANN      | == 8.3.RC1  |
+-----------+-------------+
| HDK       | == 25.3.RC1 |
+-----------+-------------+
| torch     | == 2.6.0    |
+-----------+-------------+
| torch_npu | == 2.6.0    |
+-----------+-------------+

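**The sglang NPU backend in verl currently supports only the HDK, CANN, and PTA (torch_npu) versions listed above. A commercially released version is expected in October 2025.**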

To use sglang with verl, install sglang, torch_memory_saver, and verl with the commands below.

Install sglang
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

    # sglang: swap in the alternate pyproject, then install with the NPU extra
    git clone https://github.com/sgl-project/sglang.git
    cd sglang
    mv python/pyproject.toml python/pyproject.toml.backup
    mv python/pyproject_other.toml python/pyproject.toml
    pip install -e "python[srt_npu]"

Install torch_memory_saver
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

    # torch_memory_saver is built from the sgl-kernel-npu repository
    git clone https://github.com/sgl-project/sgl-kernel-npu.git
    cd sgl-kernel-npu
    bash build.sh -a memory-saver
    pip install output/torch_memory_saver*.whl

Install verl
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

    # Install verl without its default dependencies, then add the NPU requirements
    git clone https://github.com/volcengine/verl.git
    cd verl
    pip install --no-deps -e .
    pip install -r requirements-npu.txt


Other third-party libraries
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

+---------------+----------+
| software      | version  |
+---------------+----------+
| transformers  | v4.56.1  |
+---------------+----------+
| triton_ascend | v3.2.0   |
+---------------+----------+

1. sglang depends on transformers v4.56.1.
2. sglang depends on triton_ascend v3.2.0.
3. Multimodal models are not supported yet; uninstall the related packages torchvision and timm.

.. code-block:: bash

    # Remove packages that are unsupported or replaced below
    pip uninstall torchvision
    pip uninstall timm
    pip uninstall triton

    # Install the versions sglang depends on
    pip install transformers==4.56.1
    pip install -i https://test.pypi.org/simple/ triton-ascend==3.2.0.dev20250925
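After the installation steps it can help to confirm that the pinned versions from the tables above are the ones actually installed, and that the packages to be uninstalled are gone. The following is a minimal sketch, not part of verl: the helper names are hypothetical, and the distribution names (for example ``torch-npu``) are assumed to match what pip reports.

```python
from importlib import metadata

# Pins copied from the tables above. The distribution names are an assumption.
PINNED = {"torch": "2.6.0", "torch-npu": "2.6.0", "transformers": "4.56.1"}
# Packages the tutorial asks to uninstall.
SHOULD_BE_ABSENT = ["torchvision", "timm", "triton"]


def matches_pin(installed: str, pinned: str) -> bool:
    """True if `installed` equals `pinned`, ignoring a local suffix such as '+cpu'."""
    return installed.split("+", 1)[0] == pinned


def check_pins(pins: dict[str, str]) -> dict[str, str]:
    """Return {package: problem} for every pin that is missing or mismatched."""
    problems = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems[name] = "not installed"
            continue
        if not matches_pin(installed, pinned):
            problems[name] = f"found {installed}, expected {pinned}"
    return problems


def leftover(names: list[str]) -> list[str]:
    """Return the packages from `names` that are still installed."""
    found = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            continue
        found.append(name)
    return found


if __name__ == "__main__":
    for name, problem in check_pins(PINNED).items():
        print(f"{name}: {problem}")
    for name in leftover(SHOULD_BE_ABSENT):
        print(f"{name}: still installed")
```

An empty output means every pin matched and none of the unwanted packages were found. CANN and HDK are not Python packages, so they are outside the scope of this check.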


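Quick start
-----------------------------------

Before production use, we recommend running a trial GRPO training on Qwen3-8B to verify that the environment setup and installation are correct.

1. Download the dataset and preprocess it into parquet format so that it contains the fields needed to compute RL rewards.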

.. code-block:: bash

    python3 examples/data_preprocess/gsm8k.py --local_save_dir ~/data/gsm8k
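The fields needed to compute RL rewards mainly amount to a ground-truth answer stored next to each prompt. As a rough sketch of the idea, assuming the GSM8K convention that the final answer follows a ``####`` marker, and using a hypothetical record layout (the function and field names below are illustrative and not necessarily those written by ``gsm8k.py``):

```python
import re


def extract_final_answer(solution: str) -> str:
    """Pull the number after '####' out of a GSM8K-style solution string."""
    match = re.search(r"####\s*(-?[0-9.,]+)", solution)
    if match is None:
        raise ValueError("no '#### <answer>' marker found")
    # Drop thousands separators so "1,234" and "1234" compare equal.
    return match.group(1).replace(",", "")


def to_record(question: str, solution: str) -> dict:
    """Build one training row: the prompt plus the answer the reward needs."""
    return {"prompt": question, "ground_truth": extract_final_answer(solution)}


def rule_reward(predicted_answer: str, record: dict) -> float:
    """Rule-based reward: 1.0 for an exact match with the ground truth, else 0.0."""
    return 1.0 if predicted_answer == record["ground_truth"] else 0.0


record = to_record(
    "Natalia sold 48 clips in April and half as many in May. How many in total?",
    "In May she sold 48 / 2 = 24 clips. In total 48 + 24 = 72 clips.\n#### 72",
)
print(record["ground_truth"])     # 72
print(rule_reward("72", record))  # 1.0
```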

2. Run the training.

.. code-block:: bash

    bash examples/grpo_trainer/run_qwen3_8b_grpo_sglang_1k_spmd_npu.sh
1 change: 1 addition & 0 deletions docs/index.rst
@@ -133,6 +133,7 @@ verl is fast with:
ascend_tutorial/ascend_quick_start.rst
ascend_tutorial/ascend_profiling_zh.rst
ascend_tutorial/ascend_profiling_en.rst
ascend_tutorial/ascend_sglang_quick_start.rst

.. toctree::
:maxdepth: 1
71 changes: 71 additions & 0 deletions examples/grpo_trainer/run_qwen3_8b_grpo_sglang_1k_spmd_npu.sh
@@ -0,0 +1,71 @@
set -x
export HCCL_CONNECT_TIMEOUT=1500
export HCCL_HOST_SOCKET_PORT_RANGE=60000-60050
export HCCL_NPU_SOCKET_PORT_RANGE=61000-61050

# WORKSPACE_HOME and DATA_HOME can be set to custom paths; both default to the current directory.
WORKSPACE_HOME=$PWD
DATA_HOME=$PWD

sp_size=4
num_npu=4
tp_size=4
train_prompt_bsz=16
train_prompt_mini_bsz=16

max_prompt_length=512
max_response_length=1024

CKPTS_DIR=$WORKSPACE_HOME/logs/ckpt/qwen3_8b
model_path=$DATA_HOME/models/Qwen3-8B
train_data=$DATA_HOME/datasets/processed_gsm8k/train.parquet
valid_data=$DATA_HOME/datasets/processed_gsm8k/test.parquet

python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=grpo \
data.train_files=$train_data \
data.val_files=$valid_data \
data.train_batch_size=$train_prompt_bsz \
data.max_prompt_length=$max_prompt_length \
data.max_response_length=$max_response_length \
data.filter_overlong_prompts=True \
data.truncation='error' \
actor_rollout_ref.model.path=$model_path \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.model.use_remove_padding=True \
actor_rollout_ref.actor.ppo_mini_batch_size=$train_prompt_mini_bsz \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=1 \
actor_rollout_ref.actor.use_kl_loss=True \
actor_rollout_ref.actor.entropy_coeff=0 \
actor_rollout_ref.actor.kl_loss_coef=0.001 \
actor_rollout_ref.actor.kl_loss_type=low_var_kl \
actor_rollout_ref.actor.use_torch_compile=False \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.fsdp_config.param_offload=True \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=2 \
actor_rollout_ref.rollout.tensor_model_parallel_size=$tp_size \
actor_rollout_ref.rollout.name=sglang \
actor_rollout_ref.rollout.gpu_memory_utilization=0.3 \
actor_rollout_ref.rollout.n=5 \
+actor_rollout_ref.rollout.engine_kwargs.sglang.attention_backend="ascend" \
actor_rollout_ref.ref.fsdp_config.param_offload=True \
actor_rollout_ref.rollout.enable_chunked_prefill=False \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=2 \
actor_rollout_ref.nccl_timeout=1800 \
algorithm.use_kl_in_reward=False \
trainer.critic_warmup=0 \
trainer.logger=console \
trainer.val_before_train=False \
trainer.project_name='verl_grpo_example_512_1024_gsm8k' \
trainer.experiment_name='qwen3_8b_function_rm' \
trainer.n_gpus_per_node=$num_npu \
trainer.nnodes=1 \
trainer.save_freq=1000 \
trainer.test_freq=10000 \
trainer.total_epochs=5 \
trainer.default_local_dir="${CKPTS_DIR}" \
actor_rollout_ref.actor.ulysses_sequence_parallel_size=${sp_size} \
actor_rollout_ref.ref.ulysses_sequence_parallel_size=${sp_size} \
trainer.device=npu "$@"
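The variables at the top of this script determine how much work each training step does. The arithmetic can be sketched as follows. The class and its checks are illustrative only; in particular, the requirement that the device count be a multiple of both the tensor-parallel and the sequence-parallel size is stated here as an assumption about how the devices are partitioned, not as something the script itself enforces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunShape:
    """Values copied from the variables at the top of the launch script."""

    num_devices: int
    tp_size: int
    sp_size: int
    train_batch_size: int
    rollout_n: int
    max_prompt_length: int
    max_response_length: int

    @property
    def responses_per_step(self) -> int:
        # GRPO samples rollout_n responses for every prompt in the batch.
        return self.train_batch_size * self.rollout_n

    @property
    def max_sequence_length(self) -> int:
        # Longest prompt plus longest response, in tokens.
        return self.max_prompt_length + self.max_response_length

    def problems(self) -> list[str]:
        # Assumed constraint: devices split evenly into TP and SP groups.
        issues = []
        if self.num_devices % self.tp_size:
            issues.append("num_devices is not a multiple of tp_size")
        if self.num_devices % self.sp_size:
            issues.append("num_devices is not a multiple of sp_size")
        return issues


one_k = RunShape(num_devices=4, tp_size=4, sp_size=4, train_batch_size=16,
                 rollout_n=5, max_prompt_length=512, max_response_length=1024)
print(one_k.responses_per_step)   # 80
print(one_k.max_sequence_length)  # 1536
print(one_k.problems())           # []
```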
71 changes: 71 additions & 0 deletions examples/grpo_trainer/run_qwen3_8b_grpo_sglang_32k_spmd_npu.sh
@@ -0,0 +1,71 @@
set -x
export HCCL_CONNECT_TIMEOUT=1500
export HCCL_HOST_SOCKET_PORT_RANGE=60000-60050
export HCCL_NPU_SOCKET_PORT_RANGE=61000-61050

# WORKSPACE_HOME and DATA_HOME can be set to custom paths; both default to the current directory.
WORKSPACE_HOME=$PWD
DATA_HOME=$PWD

sp_size=4
num_gpu=8
tp_size=4
train_prompt_bsz=16
train_prompt_mini_bsz=16

max_prompt_length=$((1024 * 2))
max_response_length=$((1024 * 32))

CKPTS_DIR=$WORKSPACE_HOME/logs/ckpt/qwen3_8b
model_path=$DATA_HOME/models/Qwen3-8B
train_data=$DATA_HOME/datasets/dapo/dapo-math-17k.parquet
valid_data=$DATA_HOME/datasets/dapo/aime-2024.parquet

python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=grpo \
data.train_files=$train_data \
data.val_files=$valid_data \
data.train_batch_size=$train_prompt_bsz \
data.max_prompt_length=$max_prompt_length \
data.max_response_length=$max_response_length \
data.filter_overlong_prompts=False \
data.truncation='error' \
actor_rollout_ref.model.path=$model_path \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.model.use_remove_padding=True \
actor_rollout_ref.actor.ppo_mini_batch_size=$train_prompt_mini_bsz \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=1 \
actor_rollout_ref.actor.use_kl_loss=True \
actor_rollout_ref.actor.entropy_coeff=0 \
actor_rollout_ref.actor.kl_loss_coef=0.001 \
actor_rollout_ref.actor.kl_loss_type=low_var_kl \
actor_rollout_ref.actor.use_torch_compile=False \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.fsdp_config.param_offload=True \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=1 \
actor_rollout_ref.rollout.tensor_model_parallel_size=$tp_size \
actor_rollout_ref.rollout.name=sglang \
actor_rollout_ref.rollout.gpu_memory_utilization=0.3 \
actor_rollout_ref.rollout.n=5 \
+actor_rollout_ref.rollout.engine_kwargs.sglang.attention_backend="ascend" \
actor_rollout_ref.ref.fsdp_config.param_offload=True \
actor_rollout_ref.rollout.enable_chunked_prefill=False \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=1 \
actor_rollout_ref.nccl_timeout=3600 \
algorithm.use_kl_in_reward=False \
trainer.critic_warmup=0 \
trainer.logger=console \
trainer.val_before_train=False \
trainer.project_name='verl_grpo_example_2k_32k' \
trainer.experiment_name='qwen3_8b_function_rm' \
trainer.n_gpus_per_node=$num_gpu \
trainer.nnodes=1 \
trainer.save_freq=1000 \
trainer.test_freq=10000 \
trainer.total_epochs=5 \
trainer.default_local_dir="${CKPTS_DIR}" \
actor_rollout_ref.actor.ulysses_sequence_parallel_size=${sp_size} \
actor_rollout_ref.ref.ulysses_sequence_parallel_size=${sp_size} \
trainer.device=npu "$@"
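Both launch scripts configure the trainer through dotted ``key=value`` overrides, and one override carries a leading ``+``, which in Hydra's override syntax adds a key that is not already present in the config. To make the mapping from dotted keys to nested config concrete, here is a toy parser. It is a deliberately simplified illustration and not Hydra's implementation: values stay strings, and the function name is hypothetical.

```python
def parse_overrides(args: list[str]) -> dict:
    """Turn ['a.b=1', '+a.c=x'] into {'a': {'b': '1', 'c': 'x'}}."""
    config: dict = {}
    for arg in args:
        key, _, raw_value = arg.partition("=")
        # A leading '+' marks a newly added key; the toy parser treats it
        # the same as a plain override.
        key = key.lstrip("+")
        *parents, leaf = key.split(".")
        node = config
        for part in parents:
            node = node.setdefault(part, {})
        # If a key appears twice, the later value wins.
        node[leaf] = raw_value
    return config


overrides = [
    "algorithm.adv_estimator=grpo",
    "actor_rollout_ref.rollout.name=sglang",
    "+actor_rollout_ref.rollout.engine_kwargs.sglang.attention_backend=ascend",
    "trainer.device=npu",
]
config = parse_overrides(overrides)
print(config["actor_rollout_ref"]["rollout"]["name"])  # sglang
print(config["trainer"]["device"])                     # npu
```

Because the scripts end with ``"$@"``, any extra overrides passed on the command line are appended after the ones listed in the script.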
15 changes: 1 addition & 14 deletions verl/utils/attention_utils.py
@@ -27,20 +27,7 @@ def _get_attention_functions() -> tuple[Callable, Callable, Callable, Callable]:
     if is_cuda_available:
         from flash_attn.bert_padding import index_first_axis, pad_input, rearrange, unpad_input
     elif is_npu_available:
-        try:
-            from transformers.integrations.npu_flash_attention import (
-                index_first_axis,
-                pad_input,
-                rearrange,
-                unpad_input,
-            )
-        except ImportError:
-            # Since transformers v4.55.1, index_first_axis, pad_input, and unpad_input
-            # have been consolidated into `transformers.modeling_flash_attention_utils`.
-            from einops import rearrange
-            from transformers.modeling_flash_attention_utils import _index_first_axis as index_first_axis
-            from transformers.modeling_flash_attention_utils import _pad_input as pad_input
-            from transformers.modeling_flash_attention_utils import _unpad_input as unpad_input
+        from verl.utils.npu_utils import index_first_axis, pad_input, rearrange, unpad_input
 
     _index_first_axis, _pad_input, _rearrange, _unpad_input = index_first_axis, pad_input, rearrange, unpad_input
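For readers unfamiliar with the helpers being re-exported here, the pair `unpad_input` / `pad_input` packs the non-padding tokens of a batch into one flat sequence and later scatters them back. The list-based functions below are a toy sketch of that idea only. They are not the code in `verl.utils.npu_utils`, `flash_attn`, or `transformers`; the names are hypothetical, and the real helpers operate on tensors and return additional values.

```python
def unpad_lists(batch, attention_mask):
    """Keep only positions where the mask is 1.

    Returns the kept tokens, their flat indices in the padded batch, and the
    cumulative sequence lengths (one boundary per sequence, starting at 0).
    """
    seq_len = len(batch[0]) if batch else 0
    tokens, indices, cu_seqlens = [], [], [0]
    for row, (values, mask) in enumerate(zip(batch, attention_mask)):
        kept = 0
        for col, (value, keep) in enumerate(zip(values, mask)):
            if keep:
                tokens.append(value)
                indices.append(row * seq_len + col)
                kept += 1
        cu_seqlens.append(cu_seqlens[-1] + kept)
    return tokens, indices, cu_seqlens


def pad_lists(tokens, indices, batch_size, seq_len, pad_value=0):
    """Scatter packed tokens back into a padded (batch_size x seq_len) layout."""
    flat = [pad_value] * (batch_size * seq_len)
    for value, index in zip(tokens, indices):
        flat[index] = value
    return [flat[r * seq_len:(r + 1) * seq_len] for r in range(batch_size)]


batch = [[11, 12, 0], [21, 0, 0]]
mask = [[1, 1, 0], [1, 0, 0]]
tokens, indices, cu_seqlens = unpad_lists(batch, mask)
print(tokens)      # [11, 12, 21]
print(indices)     # [0, 1, 3]
print(cu_seqlens)  # [0, 2, 3]
print(pad_lists(tokens, indices, batch_size=2, seq_len=3))  # [[11, 12, 0], [21, 0, 0]]
```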
