diff --git a/docs/ascend_tutorial/ascend_sglang_quick_start.rst b/docs/ascend_tutorial/ascend_sglang_quick_start.rst
index 8d6e187a05f..8b1661cbbe4 100644
--- a/docs/ascend_tutorial/ascend_sglang_quick_start.rst
+++ b/docs/ascend_tutorial/ascend_sglang_quick_start.rst
@@ -1,7 +1,7 @@
 Ascend Quickstart with SGLang Backend
 ===================================
 
-Last updated: 09/25/2025.
+Last updated: 01/27/2026.
 
 We add support for Huawei Ascend NPU devices in verl.
 
@@ -17,97 +17,137 @@ Atlas 800T A3
 Installation
 -----------------------------------
 
+Key supported versions
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-Basic environment preparation
++-----------+-----------------+
+| software  | version         |
++===========+=================+
+| Python    | == 3.11         |
++-----------+-----------------+
+| HDK       | >= 25.3.RC1     |
++-----------+-----------------+
+| CANN      | >= 8.3.RC1      |
++-----------+-----------------+
+| torch     | >= 2.7.1        |
++-----------+-----------------+
+| torch_npu | >= 2.7.1.post2  |
++-----------+-----------------+
+| sglang    | v0.5.8          |
++-----------+-----------------+
+
+Installation from a Docker image
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
+We provide a Dockerfile for building the image; see `dockerfile_build_guidance `_ and pick the build file that matches your device.
+
+Installation in a custom environment
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+**1. Install and activate the HDK & CANN dependencies**
+
+CANN (Compute Architecture for Neural Networks) is Ascend's heterogeneous computing architecture for AI workloads. So that the training and inference engines can take full advantage of the hardware, install the following `prerequisites `_:
 +-----------+-------------+
-| software  | version     |
-+-----------+-------------+
-| Python    | == 3.11     |
-+-----------+-------------+
-| CANN      | == 8.3.RC1  |
-+-----------+-------------+
-| HDK       | == 25.3.RC1 |
-+-----------+-------------+
-| torch     | == 2.6.0    |
+| HDK       | >= 25.3.RC1 |
 +-----------+-------------+
-| torch_npu | == 2.6.0    |
+| CANN      | >= 8.3.RC1  |
 +-----------+-------------+
+After installation, activate the environment:
 
-**The SGLang NPU backend in verl currently supports only the HDK, CANN, and PTA versions listed above; a commercially available release is expected in October 2025**
+.. code-block:: bash
 
-To use SGLang in verl, install sglang, torch_memory_saver, and verl with the following commands.
+    source /usr/local/Ascend/ascend-toolkit/set_env.sh
+    source /usr/local/Ascend/nnal/atb/set_env.sh
+
+**2. Create a conda environment**
 
-sglang
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 .. code-block:: bash
 
-    # sglang
-    git clone https://github.com/sgl-project/sglang.git
-    cd sglang
-    mv python/pyproject.toml python/pyproject.toml.backup
-    mv python/pyproject_other.toml python/pyproject.toml
-    pip install -e "python[srt_npu]"
-
-Install torch_memory_saver
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+    # create conda env
+    conda create -n verl-sglang python==3.11
+    conda activate verl-sglang
+
+**3. Run the script shipped with verl:** `install_sglang_mcore_npu.sh `_
+
+If this step fails, inspect the script and run its steps manually.
+
 .. code-block:: bash
-
-    # torch_memory_saver
-    git clone https://github.com/sgl-project/sgl-kernel-npu.git
-    cd sgl-kernel-npu
-    bash build.sh -a memory-saver
-    pip install output/torch_memory_saver*.whl
 
-Install verl
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+    git clone https://github.com/volcengine/verl.git
+    # Make sure you have activated verl conda env
+    # NPU_DEVICE=A3 or A2 depends on your device
+    NPU_DEVICE=A3 bash verl/scripts/install_sglang_mcore_npu.sh
+
+**4. Install verl**
 
 .. code-block:: bash
 
-    git clone https://github.com/volcengine/verl.git
     cd verl
    pip install --no-deps -e .
     pip install -r requirements-npu.txt
 
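+As a quick sanity check after installation (a minimal sketch that only assumes the packages installed above), verify that the stack imports and the NPU is visible:
+
+.. code-block:: bash
+
+    # torch_npu patches torch.npu, so NPU availability can be queried through torch
+    python -c "import torch, torch_npu, sglang, verl; print(torch.__version__, torch_npu.__version__); print('NPU available:', torch.npu.is_available())"
+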
-Notes on other third-party libraries
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Quick start
+-----------------------------------
 
-+--------------+---------------+
-| software     | description   |
-+--------------+---------------+
-| transformers | v4.56.1       |
-+--------------+---------------+
-| triton_ascend| v3.2.0        |
-+--------------+---------------+
+**1. Available NPU SGLang scripts**
 
-1. sglang depends on transformers v4.56.1
-2. sglang depends on triton_ascend v3.2.0
-3. Multimodal models are not supported yet; uninstall the related packages torchvision and timm
+.. _Qwen3-30B: https://github.com/verl-project/verl/blob/main/examples/grpo_trainer/run_qwen3moe-30b_sglang_megatron_npu.sh
+.. _Qwen2.5-32B: https://github.com/verl-project/verl/blob/main/examples/grpo_trainer/run_qwen2-32b_sglang_fsdp_npu.sh
+.. _Qwen3-8B-1k: https://github.com/verl-project/verl/blob/main/examples/grpo_trainer/run_qwen3_8b_grpo_sglang_1k_spmd_npu.sh
+.. _Qwen3-8B-32k: https://github.com/verl-project/verl/blob/main/examples/grpo_trainer/run_qwen3_8b_grpo_sglang_32k_spmd_npu.sh
 
-.. code-block:: bash
-
-    pip uninstall torchvision
-    pip uninstall timm
-    pip uninstall triton
-
-    pip install transformers==4.56.1
-    pip install -i https://test.pypi.org/simple/ triton-ascend==3.2.0.dev20250925
++-----------------+----------------+-------+-------------------+
+| model           | recommended NPU| nodes | backend           |
++=================+================+=======+===================+
+| `Qwen3-30B`_    | Atlas 800T A3  | 1     | SGLang + Megatron |
++-----------------+----------------+-------+-------------------+
+| `Qwen2.5-32B`_  | Atlas 900 A2   | 2     | SGLang + FSDP     |
++-----------------+----------------+-------+-------------------+
+| `Qwen3-8B-1k`_  | Atlas A3/A2    | 1     | SGLang + FSDP     |
++-----------------+----------------+-------+-------------------+
+| `Qwen3-8B-32k`_ | Atlas A3/A2    | 1     | SGLang + FSDP     |
++-----------------+----------------+-------+-------------------+
 
+**2. Best practice**
 
-Quick start
------------------------------------
 
-Before using verl in earnest, we recommend a trial GRPO training run of Qwen3-8B to verify that the environment is prepared and installed correctly.
+We provide a verl + SGLang `best practice `_ for `Qwen3-30B`_ and `Qwen2.5-32B`_ as a reference.
+
+**3. Environment variables and parameters**
 
-1. Download the dataset and preprocess it into parquet format so that it contains the fields needed to compute the RL reward
+The following environment variables must be set to use the SGLang backend on NPU:
 
 .. code-block:: bash
 
-    python3 examples/data_preprocess/gsm8k.py --local_save_dir ~/data/gsm8k
+    # Allow multiple processes per NPU card, see https://www.hiascend.com/document/detail/zh/canncommercial/850/commlib/hcclug/hcclug_000091.html
+    export HCCL_HOST_SOCKET_PORT_RANGE=60000-60050
+    export HCCL_NPU_SOCKET_PORT_RANGE=61000-61050
+    # Work around Ray being unable to detect device availability through the is_npu_available interface
+    export RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES=1
+    # Set according to your device and the number of cards required
+    export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
+    # Required when expert parallelism (EP) is enabled for rollout
+    export SGLANG_DEEPEP_BF16_DISPATCH=1
 
-2. Run the training
+verl already parses the common rollout parameters; see the ServerArgs initialization in `async_sglang_server.py `_ for details. All other `sglang arguments `_ can be passed through engine_kwargs.
+
+To convert a vLLM rollout script to SGLang, add or modify the following parameters:
 
 .. code-block:: bash
 
-    bash verl/examples/grpo_trainer/run_qwen3_8b_grpo_sglang_1k_npu.sh
\ No newline at end of file
+    # Required
+    actor_rollout_ref.rollout.name=sglang
+    +actor_rollout_ref.rollout.engine_kwargs.sglang.attention_backend="ascend"
+    # Optional
+    # Enable EP for rollout; see https://github.com/sgl-project/sgl-kernel-npu/blob/main/python/deep_ep/README_CN.md for usage details
+    ++actor_rollout_ref.rollout.engine_kwargs.sglang.deepep_mode="auto"
+    ++actor_rollout_ref.rollout.engine_kwargs.sglang.moe_a2a_backend="deepep"
+    # Must be set to True when a MoE model runs with multiple DP ranks
+    +actor_rollout_ref.rollout.engine_kwargs.sglang.enable_dp_attention=False
+    # chunked prefill is disabled by default
+    +actor_rollout_ref.rollout.engine_kwargs.sglang.chunked_prefill_size=-1
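+
+Any other ServerArgs field can be forwarded the same way. A sketch (``mem_fraction_static`` is only an illustrative ServerArgs option here, not a required setting):
+
+.. code-block:: bash
+
+    # Forward an arbitrary SGLang ServerArgs option through engine_kwargs
+    +actor_rollout_ref.rollout.engine_kwargs.sglang.mem_fraction_static=0.5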
diff --git a/docs/ascend_tutorial/examples/ascend_sglang_best_practices.rst b/docs/ascend_tutorial/examples/ascend_sglang_best_practices.rst
new file mode 100644
index 00000000000..91c6efd25b7
--- /dev/null
+++ b/docs/ascend_tutorial/examples/ascend_sglang_best_practices.rst
@@ -0,0 +1,296 @@
+Ascend SGLang Best Practice
+===================================
+
+Last updated: 01/27/2026.
+
+.. _Qwen3-30B: https://github.com/verl-project/verl/blob/main/examples/grpo_trainer/run_qwen3moe-30b_sglang_megatron_npu.sh
+.. _Qwen2.5-32B: https://github.com/verl-project/verl/blob/main/examples/grpo_trainer/run_qwen2-32b_sglang_fsdp_npu.sh
+
+Introduction
+----------------------------------
+
+SGLang is one of today's mainstream high-performance open-source inference engines, and Ascend now natively supports using it in verl. A short build process is all it takes to set up the environment. This document walks through two representative use cases that cover:
+
+1. Environment setup
+2. Model training and evaluation
+3. Performance profiling
+
+The scripts of the two use cases and the hardware they require are as follows:
+
++----------------+---------------+-------+-------------------+
+| model          | NPU           | nodes | backend           |
++================+===============+=======+===================+
+| `Qwen3-30B`_   | Atlas 800T A3 | 1     | SGLang + Megatron |
++----------------+---------------+-------+-------------------+
+| `Qwen2.5-32B`_ | Atlas 900 A2  | 2     | SGLang + FSDP     |
++----------------+---------------+-------+-------------------+
+
+Environment setup
+-----------------------------------
+
+The quickstart describes two ways to build the environment: 1. from the provided Dockerfile image, 2. in a custom conda environment.
+
+In this practice we additionally pin the verl commit id to avoid pulling in unrelated changes:
+
+.. code-block:: bash
+
+    cd verl
+    git checkout 772c224
+
+Model training and evaluation
+-----------------------------------
+
+1. Model and data preparation
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+`Qwen3-30B`_
+^^^^^^^^^^^^
+
+**Download the model weights**
+
+--local-dir: the local path to save the model
+
+.. code-block:: bash
+
+    export HF_ENDPOINT=https://hf-mirror.com
+    huggingface-cli download --resume-download Qwen/Qwen3-30B-A3B --local-dir /path/to/local_dir
+
+**Download the dataset**
+
+.. code-block:: bash
+
+    git clone https://www.modelscope.cn/datasets/AI-ModelScope/DAPO-Math-17k.git
+
+**Convert the weights from HuggingFace to Megatron (optional)**
+
+.. code-block:: bash
+
+    python scripts/converter_hf_to_mcore.py \
+        --hf_model_path Qwen/Qwen3-30B-A3B \
+        --output_path Qwen/Qwen3-30B-A3B-mcore \
+        --use_cpu_initialization # Only works for MoE models
+
+*Note: verl now supports mbridge for flexible weight conversion between hf and mcore; with the following parameters you can load hf weights directly:*
+
+.. code-block:: bash
+
+    actor_rollout_ref.actor.megatron.use_dist_checkpointing=False
+    actor_rollout_ref.actor.megatron.use_mbridge=True
+
+`Qwen2.5-32B`_
+^^^^^^^^^^^^^^
+
+**Download the model weights**
+
+--local-dir: the local path to save the model
+
+.. code-block:: bash
+
+    export HF_ENDPOINT=https://hf-mirror.com
+    huggingface-cli download --resume-download Qwen/Qwen2.5-32B --local-dir /path/to/local_dir
+
+**Download and process the dataset**
+
+.. code-block:: bash
+
+    wget https://huggingface.co/datasets/agentica-org/DeepScaleR-Preview-Dataset/resolve/main/deepscaler.json
+    python recipe/r1_ascend/json_to_parquet.py --output_dir ./data/deepscaler --json_path path/to/deepscaler.json --train_data_ratio 0.9
+
+2. Training
+^^^^^^^^^^^
+
+Adjust the following parameters in the training script to match your local paths:
+
+.. code-block:: bash
+
+    # Model Weights Paths
+    MODEL_PATH=Qwen/Qwen3-30B-A3B
+    MCORE_MODEL_PATH=Qwen/Qwen3-30B-A3B-mcore
+    RAY_DATA_HOME=${RAY_DATA_HOME:-"${HOME}/verl"}
+    CKPTS_DIR=${CKPTS_DIR:-"${RAY_DATA_HOME}/ckpts/${project_name}/${exp_name}"}
+
+    # File System Paths
+    TRAIN_FILE=$RAY_DATA_HOME/dataset/dapo-math-17k.parquet
+    TEST_FILE=$RAY_DATA_HOME/dataset/aime-2024.parquet
+
+    # Save frequency; -1 (the default) disables saving. Change it if you want to evaluate checkpoints
+    trainer.save_freq=-1
+
+For the single-node task `Qwen3-30B`_, you can directly run the example script from the verl repository:
+
+.. code-block:: bash
+
+    bash examples/grpo_trainer/run_qwen3moe-30b_sglang_megatron_npu.sh
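+
+To keep the full training log for later inspection, the same run can be wrapped as follows (a minimal sketch; the log path is arbitrary):
+
+.. code-block:: bash
+
+    mkdir -p logs
+    bash examples/grpo_trainer/run_qwen3moe-30b_sglang_megatron_npu.sh 2>&1 | tee logs/qwen3_30b_grpo.log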
+
+For the multi-node task `Qwen2.5-32B`_, we recommend the following script to launch large-scale multi-node training:
+
+.. code-block:: bash
+
+    pkill -9 python
+    ray stop --force
+    rm -rf /tmp/ray
+    export RAY_DEDUP_LOGS=0
+    export HYDRA_FULL_ERROR=1
+    # TASK_QUEUE_ENABLE controls dispatch optimization: set 1 for graph mode, 2 for non-graph mode
+    export TASK_QUEUE_ENABLE=1
+    export HCCL_ASYNC_ERROR_HANDLING=0
+    export HCCL_EXEC_TIMEOUT=3600
+    export HCCL_CONNECT_TIMEOUT=3600
+
+    export HCCL_HOST_SOCKET_PORT_RANGE=60000-60050
+    export HCCL_NPU_SOCKET_PORT_RANGE=61000-61050
+    export RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES=1
+    export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
+    # Change to the path of the script you want to run
+    DEFAULT_SH="./run_*.sh"
+    echo "Use $DEFAULT_SH"
+
+    ulimit -n 32768
+    mkdir -p logs
+
+    NNODES=2
+    NPUS_PER_NODE=8
+    # Change to the master node IP
+    MASTER_ADDR="IP FOR MASTER NODE"
+    # Change to the communication NIC of the current node
+    SOCKET_IFNAME="Your SOCKET IFNAME"
+    export HCCL_SOCKET_IFNAME="SOCKET IFNAME FOR CURRENT NODE"
+    export GLOO_SOCKET_IFNAME="SOCKET IFNAME FOR CURRENT NODE"
+    # Get the current node IP
+    CURRENT_IP=$(ifconfig $SOCKET_IFNAME | grep -Eo 'inet (addr:)?([0-9]{1,3}\.){3}[0-9]{1,3}' | awk '{print $NF}')
+    if [ "$MASTER_ADDR" = "$CURRENT_IP" ]; then
+        # Start the head node
+        ray start --head --port 6766 --dashboard-host=$MASTER_ADDR --node-ip-address=$CURRENT_IP --dashboard-port=8260 --resources='{"NPU": '$NPUS_PER_NODE'}'
+
+        while true; do
+            ray_status_output=$(ray status)
+            npu_count=$(echo "$ray_status_output" | grep -oP '(?<=/)\d+\.\d+(?=\s*NPU)' | head -n 1)
+            npu_count_int=$(echo "$npu_count" | awk '{print int($1)}')
+            device_count=$((npu_count_int / $NPUS_PER_NODE))
+
+            # Check whether device_count equals NNODES
+            if [ "$device_count" -eq "$NNODES" ]; then
+                echo "Ray cluster is ready with $device_count devices (from $npu_count NPU resources), starting Python script."
+                ray status
+                bash $DEFAULT_SH
+                break
+            else
+                echo "Waiting for Ray to allocate $NNODES devices. Current device count: $device_count"
+                sleep 5
+            fi
+        done
+    else
+        # Worker nodes keep trying to register with the head node until they succeed
+        while true; do
+            # Try to join the ray cluster
+            ray start --address="$MASTER_ADDR:6766" --resources='{"NPU": '$NPUS_PER_NODE'}' --node-ip-address=$CURRENT_IP
+
+            # Check whether the connection succeeded
+            ray status
+            if [ $? -eq 0 ]; then
+                echo "Successfully connected to the Ray cluster!"
+                break
+            else
+                echo "Failed to connect to the Ray cluster. Retrying in 5 seconds..."
+                sleep 5
+            fi
+        done
+    fi
+
+    sleep 600
+
+DEFAULT_SH: set to the training script to run. In this case, set it to the `Qwen2.5-32B`_ script path.
+
+NNODES and NPUS_PER_NODE: set to the number of nodes and the number of NPUs per node, here 2 and 8 respectively.
+
+MASTER_ADDR: set to the master node IP; it must be identical on all nodes.
+
+SOCKET_IFNAME, HCCL_SOCKET_IFNAME, GLOO_SOCKET_IFNAME: set to the corresponding communication NIC, which can be obtained with:
+
+.. code-block:: bash
+
+    ifconfig |grep "$(hostname -I |awk '{print $1}'|awk -F '.' '{print $0}')" -B 1|awk -F ':' '{print$1}' | head -1 | tail -1
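+
+Once the script is running on all nodes, you can confirm on the head node that every worker has registered before training starts (a quick check; "NPU" is the custom resource name the script above registers with Ray):
+
+.. code-block:: bash
+
+    # Expect NNODES * NPUS_PER_NODE NPU resources in the cluster total
+    ray status | grep NPU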
+
+3. Model evaluation
+^^^^^^^^^^^^^^^^^^^
+
+The steps are the same for every model; Qwen3-30B is used as the example here.
+
+We evaluate the model with AISBench, which supports evaluating multiple inference backends, including vLLM and SGLang.
+
+**Installation**
+
+.. code-block:: bash
+
+    git clone https://gitee.com/aisbench/benchmark.git
+    cd benchmark
+    pip install -e .
+
+**Download the evaluation dataset**
+
+.. code-block:: bash
+
+    cd path/to/benchmark/ais_bench/datasets
+    wget http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/math.zip
+    unzip math.zip
+    rm math.zip
+
+**Modify the AISBench configuration to enable SGLang evaluation**
+
+Open benchmark/ais_bench/benchmark/configs/models/vllm_api/vllm_api_stream_chat.py, the inference configuration file:
+
+.. code-block:: python
+
+    from ais_bench.benchmark.models import VLLMCustomAPIChatStream
+    from ais_bench.benchmark.utils.model_postprocessors import extract_non_reasoning_content
+    from ais_bench.benchmark.clients import OpenAIChatStreamClient, OpenAIChatStreamSglangClient
+
+    models = [
+        dict(
+            attr="service",
+            type=VLLMCustomAPIChatStream,
+            abbr='sgl-api-stream-chat',
+            path="/path/to/Qwen3-30B",  # change to the Qwen3-30B model path
+            model="qwen3-30b",
+            request_rate=0,
+            max_seq_len=2048,
+            retry=2,
+            host_ip="localhost",  # IP of the inference service
+            host_port=8005,  # port of the inference service
+            max_out_len=8192,  # maximum number of output tokens
+            batch_size=48,  # maximum inference concurrency
+            trust_remote_code=False,
+            custom_client=dict(type=OpenAIChatStreamSglangClient),  # use the SGLang client
+            generation_kwargs=dict(
+                temperature=0,
+                seed=1234,
+            ),
+            pred_postprocessor=dict(type=extract_non_reasoning_content)
+        )
+    ]
+
+**Launch the sglang server**
+
+.. code-block:: bash
+
+    python -m sglang.launch_server --model-path "/path/to/Qwen3-30B" --tp-size 4 --dp-size 1 --port 8005
+
+**Run the sglang client evaluation**
+
+.. code-block:: bash
+
+    ais_bench --models vllm_api_stream_chat --datasets math500_gen_0_shot_cot_chat_prompt
+
+**Evaluation results**
+
+After training, the model's score on MATH-500 rises markedly:
+
++------+----------------------+---------+----------+------+----------------------+
+| iter | dataset              | version | metric   | mode | sgl-api-stream-chat  |
++======+======================+=========+==========+======+======================+
+| 0    | math_prm800k_500     | c4b6f0  | accuracy | gen  | 84.4                 |
++------+----------------------+---------+----------+------+----------------------+
+| 150  | math_prm800k_500     | c4b6f0  | accuracy | gen  | 91.7                 |
++------+----------------------+---------+----------+------+----------------------+
+
+Performance profiling
+-----------------------------------
+
+For detailed documentation on NPU profiling, see `ascend_profiling_zh `_.
+
+The `Qwen3-30B`_ script ships with a basic profiling option block, PROF_CONFIG. Profiling is disabled by default via global_profiler.steps=null; adjust the parameters as needed.
+
+Once collected, the data can be parsed and visualized with `MindStudio Insight `_.
+
+Note: full profiling on the verl framework side produces a huge volume of repetitive operator records; following the documentation above, you can modify the code to profile only the key stages.
\ No newline at end of file
diff --git a/docs/ascend_tutorial/ascend_best_practice/dapo_multi_model_optimization_practice.md b/docs/ascend_tutorial/examples/dapo_multi_model_optimization_practice.md
similarity index 100%
rename from docs/ascend_tutorial/ascend_best_practice/dapo_multi_model_optimization_practice.md
rename to docs/ascend_tutorial/examples/dapo_multi_model_optimization_practice.md
diff --git a/docs/ascend_tutorial/ascend_optimization_pratice/gspo_optimization_practice.md b/docs/ascend_tutorial/examples/gspo_optimization_practice.md
similarity index 100%
rename from docs/ascend_tutorial/ascend_optimization_pratice/gspo_optimization_practice.md
rename to docs/ascend_tutorial/examples/gspo_optimization_practice.md
diff --git a/docs/index.rst b/docs/index.rst
index 1b3cdedda7e..2e1bc7a04e2 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -150,8 +150,9 @@ verl is fast with:
    ascend_tutorial/ascend_profiling_en.rst
    ascend_tutorial/dockerfile_build_guidance.rst
    ascend_tutorial/ascend_sglang_quick_start.rst
-   ascend_tutorial/ascend_optimization_pratice/gspo_optimization_practice.md
-   ascend_tutorial/ascend_best_practice/dapo_multi_model_optimization_practice.md
+   ascend_tutorial/examples/gspo_optimization_practice.md
+   ascend_tutorial/examples/dapo_multi_model_optimization_practice.md
+   ascend_tutorial/examples/ascend_sglang_best_practices.rst
 
 .. 
toctree:: :maxdepth: 1 diff --git a/examples/grpo_trainer/run_qwen3_8b_grpo_sglang_1k_spmd_npu.sh b/examples/grpo_trainer/run_qwen3_8b_grpo_sglang_1k_spmd_npu.sh index b2d259b4330..878b106f9f1 100644 --- a/examples/grpo_trainer/run_qwen3_8b_grpo_sglang_1k_spmd_npu.sh +++ b/examples/grpo_trainer/run_qwen3_8b_grpo_sglang_1k_spmd_npu.sh @@ -2,7 +2,8 @@ set -x export HCCL_CONNECT_TIMEOUT=1500 export HCCL_HOST_SOCKET_PORT_RANGE=60000-60050 export HCCL_NPU_SOCKET_PORT_RANGE=61000-61050 - +export RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES=1 +export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3 # WORKSPACE_HOME and DATA_HOME support custom path configuration. WORKSPACE_HOME=$pwd DATA_HOME=$pwd diff --git a/examples/grpo_trainer/run_qwen3_8b_grpo_sglang_32k_spmd_npu.sh b/examples/grpo_trainer/run_qwen3_8b_grpo_sglang_32k_spmd_npu.sh index 9076360bb6d..04b2f3a36e9 100644 --- a/examples/grpo_trainer/run_qwen3_8b_grpo_sglang_32k_spmd_npu.sh +++ b/examples/grpo_trainer/run_qwen3_8b_grpo_sglang_32k_spmd_npu.sh @@ -2,7 +2,8 @@ set -x export HCCL_CONNECT_TIMEOUT=1500 export HCCL_HOST_SOCKET_PORT_RANGE=60000-60050 export HCCL_NPU_SOCKET_PORT_RANGE=61000-61050 - +export RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES=1 +export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 # WORKSPACE_HOME and DATA_HOME support custom path configuration. WORKSPACE_HOME=$pwd DATA_HOME=$pwd diff --git a/examples/grpo_trainer/run_qwen3moe-30b_sglang_megatron_npu.sh b/examples/grpo_trainer/run_qwen3moe-30b_sglang_megatron_npu.sh new file mode 100644 index 00000000000..71e566c7dcd --- /dev/null +++ b/examples/grpo_trainer/run_qwen3moe-30b_sglang_megatron_npu.sh @@ -0,0 +1,236 @@ +#!/bin/bash +set -xeuo pipefail +# Project Configuration +project_name='DAPO-Qwen3-30b-A3B-BASE-MATH' +exp_name='DAPO-Qwen3-30B-A3B-BASE-Megatron-SGLang' + +# Necessary env +export HCCL_CONNECT_TIMEOUT=1500 +export HCCL_HOST_SOCKET_PORT_RANGE=60000-60050 +export HCCL_NPU_SOCKET_PORT_RANGE=61000-61050 + +export RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES=1 +export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 + +export DISABLE_L2_CACHE=1 +export TASK_QUEUE_ENABLE=1 + +# Node Info +NNODES=${NNODES:-1} +NPUS_PER_NODE=${NPUS_PER_NODE:-16} + +# Model Weights Paths +MODEL_PATH=Qwen/Qwen3-30B-A3B +MCORE_MODEL_PATH=Qwen/Qwen3-30B-A3B-mcore +RAY_DATA_HOME=${RAY_DATA_HOME:-"${HOME}/verl"} +CKPTS_DIR=${CKPTS_DIR:-"${RAY_DATA_HOME}/ckpts/${project_name}/${exp_name}"} + +# File System Paths +TRAIN_FILE=$RAY_DATA_HOME/dataset/dapo-math-17k.parquet +TEST_FILE=$RAY_DATA_HOME/dataset/aime-2024.parquet +# Data Length Configuration +max_prompt_length=$((1024 * 2)) +max_response_length=$((1024 * 8)) + +# Training Batch Configuration +train_prompt_bsz=16 +train_prompt_mini_bsz=16 +n_resp_per_prompt=8 + +# Algorithm Configuration +adv_estimator=grpo +use_kl_in_reward=False +kl_coef=0.0 +use_kl_loss=True +kl_loss_coef=0.001 + +# Performance and Memory Management Configuration +all_offload=True +use_dynamic_bsz=False +actor_ppo_max_token_len=$(((max_prompt_length + max_response_length))) +infer_ppo_max_token_len=$(((max_prompt_length + max_response_length))) + +# Megatron Parallelism Configuration +train_tp=4 +train_ep=4 +train_etp=4 +train_pp=1 +train_cp=1 + +# SGLang Generation Configuration +gen_tp=4 +gen_dp=1 +gen_ep=1 +gpu_memory_utilization=0.5 +max_model_len=$((max_prompt_length + max_response_length)) +max_num_batched_tokens=$(((max_prompt_length + max_response_length) * 1)) + +# Data Configuration +DATA_CONFIG=( + # File Paths + 
data.train_files="${TRAIN_FILE}" + data.val_files="${TEST_FILE}" + # Data Structure + data.prompt_key=prompt + # Batch and Length Configuration + data.train_batch_size=${train_prompt_bsz} + data.max_prompt_length=${max_prompt_length} + data.max_response_length=${max_response_length} + # Preprocessing + data.filter_overlong_prompts=False + data.truncation='left' +) + +# Model Configuration +MODEL_CONFIG=( + # Model Path + actor_rollout_ref.model.path="${MODEL_PATH}" + # Model Processing + actor_rollout_ref.model.use_remove_padding=True +) + +# Reinforcement Learning Algorithm Configuration +ALGORITHM_CONFIG=( + # Advantage Estimation + algorithm.adv_estimator=${adv_estimator} + # KL Divergence Control + algorithm.use_kl_in_reward=${use_kl_in_reward} + algorithm.kl_ctrl.kl_coef=${kl_coef} +) + +ACTOR_CONFIG=( + # Core Runtime Settings + actor_rollout_ref.actor.use_torch_compile=False + actor_rollout_ref.actor.use_dynamic_bsz=${use_dynamic_bsz} + # Loss Function Configuration + actor_rollout_ref.actor.use_kl_loss=${use_kl_loss} + actor_rollout_ref.actor.kl_loss_coef=${kl_loss_coef} + actor_rollout_ref.actor.entropy_coeff=0 + # PPO Training Parameters + actor_rollout_ref.actor.ppo_epochs=1 + actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=1 + actor_rollout_ref.actor.ppo_max_token_len_per_gpu=${actor_ppo_max_token_len} + actor_rollout_ref.actor.ppo_mini_batch_size=${train_prompt_mini_bsz} + # Optimizer Settings + actor_rollout_ref.actor.optim.lr=1e-6 + # Megatron Parallelism Strategy + actor_rollout_ref.actor.megatron.tensor_model_parallel_size=${train_tp} + actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=${train_pp} + actor_rollout_ref.actor.megatron.context_parallel_size=${train_cp} + actor_rollout_ref.actor.megatron.expert_model_parallel_size=${train_ep} + actor_rollout_ref.actor.megatron.expert_tensor_parallel_size=${train_etp} + # Memory Optimization + actor_rollout_ref.actor.megatron.param_offload=${all_offload} + actor_rollout_ref.actor.megatron.optimizer_offload=${all_offload} + actor_rollout_ref.actor.megatron.grad_offload=${all_offload} + # Model Weights Management + actor_rollout_ref.actor.megatron.dist_checkpointing_path=${MCORE_MODEL_PATH} + actor_rollout_ref.actor.megatron.use_dist_checkpointing=True + actor_rollout_ref.actor.megatron.use_mbridge=False + # Transformer Architecture Optimizations + +actor_rollout_ref.actor.megatron.override_transformer_config.use_flash_attn=True + +actor_rollout_ref.actor.megatron.override_transformer_config.recompute_method=uniform + +actor_rollout_ref.actor.megatron.override_transformer_config.recompute_granularity=full + +actor_rollout_ref.actor.megatron.override_transformer_config.recompute_num_layers=1 +) + +REF_CONFIG=( + # Core Runtime Settings + actor_rollout_ref.ref.use_torch_compile=False + # Log Probability Inference + actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=1 + actor_rollout_ref.ref.log_prob_use_dynamic_bsz=${use_dynamic_bsz} + actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} + # Megatron Parallelism Strategy + actor_rollout_ref.ref.megatron.tensor_model_parallel_size=${train_tp} + actor_rollout_ref.ref.megatron.pipeline_model_parallel_size=${train_pp} + actor_rollout_ref.ref.megatron.context_parallel_size=${train_cp} + actor_rollout_ref.ref.megatron.expert_model_parallel_size=${train_ep} + actor_rollout_ref.ref.megatron.expert_tensor_parallel_size=${train_etp} + # Memory Optimization + actor_rollout_ref.ref.megatron.param_offload=${all_offload} + # Model Weights 
Management + actor_rollout_ref.ref.megatron.dist_checkpointing_path=${MCORE_MODEL_PATH} + actor_rollout_ref.ref.megatron.use_dist_checkpointing=True + actor_rollout_ref.ref.megatron.use_mbridge=False +) + +ROLLOUT_CONFIG=( + # Rollout Engine + actor_rollout_ref.rollout.name=sglang + +actor_rollout_ref.rollout.engine_kwargs.sglang.attention_backend="ascend" + # Generation Parameters + actor_rollout_ref.rollout.n=${n_resp_per_prompt} + actor_rollout_ref.rollout.top_p=1.0 + actor_rollout_ref.rollout.top_k=-1 + actor_rollout_ref.rollout.temperature=1.0 + # Log Probability Inference + actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=1 + actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=${use_dynamic_bsz} + actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} + # Memory Management + actor_rollout_ref.rollout.gpu_memory_utilization=${gpu_memory_utilization} + # Parallelism Strategy + actor_rollout_ref.rollout.tensor_model_parallel_size=${gen_tp} + actor_rollout_ref.rollout.data_parallel_size=${gen_dp} + actor_rollout_ref.rollout.expert_parallel_size=${gen_ep} + +actor_rollout_ref.rollout.engine_kwargs.sglang.enable_dp_attention=False + # Performance Optimization + +actor_rollout_ref.rollout.engine_kwargs.sglang.chunked_prefill_size=-1 + actor_rollout_ref.rollout.enforce_eager=False + # Validation Generation + actor_rollout_ref.rollout.val_kwargs.n=1 + actor_rollout_ref.rollout.val_kwargs.do_sample=True + actor_rollout_ref.rollout.val_kwargs.top_p=1.0 + actor_rollout_ref.rollout.val_kwargs.top_k=-1 + actor_rollout_ref.rollout.val_kwargs.temperature=1.0 +) + +TRAINER_CONFIG=( + # Logger Configuration + trainer.logger='["console"]' + # Project Settings + trainer.project_name="${project_name}" + trainer.experiment_name="${exp_name}" + # Hardware Configuration + trainer.nnodes="${NNODES}" + trainer.n_gpus_per_node="${NPUS_PER_NODE}" + trainer.device='npu' + # Training Schedule + trainer.total_epochs=15 + trainer.val_before_train=False + trainer.test_freq=-1 + trainer.save_freq=-1 + # Checkpoint Directory + trainer.default_local_dir="${CKPTS_DIR}" +) + +# profiling configuration +PROF_CONFIG=( + global_profiler.tool=npu + global_profiler.steps=null + global_profiler.save_path=/profpath + actor_rollout_ref.actor.profiler.enable=True + actor_rollout_ref.actor.profiler.ranks="[0]" + actor_rollout_ref.actor.profiler.all_ranks=False + actor_rollout_ref.actor.profiler.tool_config.npu.discrete=True + actor_rollout_ref.actor.profiler.tool_config.npu.contents=['npu','cpu'] + actor_rollout_ref.actor.profiler.tool_config.npu.level=level0 + actor_rollout_ref.actor.profiler.tool_config.npu.analysis=True + actor_rollout_ref.rollout.profiler.enable=True + actor_rollout_ref.rollout.profiler.ranks="[0]" + actor_rollout_ref.rollout.profiler.all_ranks=False +) + +python3 -m verl.trainer.main_ppo \ + --config-path=config \ + --config-name='ppo_megatron_trainer.yaml' \ + "${DATA_CONFIG[@]}" \ + "${MODEL_CONFIG[@]}" \ + "${ACTOR_CONFIG[@]}" \ + "${REF_CONFIG[@]}" \ + "${ROLLOUT_CONFIG[@]}" \ + "${ALGORITHM_CONFIG[@]}" \ + "${TRAINER_CONFIG[@]}" \ + "${PROF_CONFIG[@]}" \ + "$@" diff --git a/scripts/install_sglang_mcore_npu.sh b/scripts/install_sglang_mcore_npu.sh new file mode 100644 index 00000000000..2975db3d1ed --- /dev/null +++ b/scripts/install_sglang_mcore_npu.sh @@ -0,0 +1,57 @@ +#!/bin/bash +set -e +NPU_DEVICE=${NPU_DEVICE:=A3} + +export MAX_JOBS=32 + +echo "1. 
install SGLang from source"
+git clone -b v0.5.8 https://github.com/sgl-project/sglang.git
+cd sglang
+mv python/pyproject_other.toml python/pyproject.toml
+pip install -e "python[srt_npu]"
+cd ..
+
+echo "2. install torch & torch_npu & triton_ascend & other basic packages"
+pip install torch==2.7.1 torch_npu==2.7.1.post2 torchvision==0.22.1
+pip install pybind11 click==8.2.1 mbridge "numpy<2.0.0" cachetools
+
+
+echo "3. install sgl-kernel-npu from source, detailed readme in https://github.com/sgl-project/sgl-kernel-npu/blob/main/python/deep_ep/README.md"
+git clone https://github.com/sgl-project/sgl-kernel-npu.git
+cd sgl-kernel-npu
+git checkout 46b73de
+# comment out line 101 of build.sh before building
+sed -i '101s/^/# /' build.sh
+if [ "$NPU_DEVICE" = "A3" ]; then
+    bash build.sh
+fi
+if [ "$NPU_DEVICE" = "A2" ]; then
+    bash build.sh -a deepep2
+fi
+pip install output/torch_memory_saver*.whl
+pip install output/sgl_kernel_npu*.whl
+pip install output/deep_ep*.whl
+# link the compiled deep_ep C++ extension into the installed package and verify that it imports
+cd "$(pip show deep-ep | grep -E '^Location:' | awk '{print $2}')" && ln -s deep_ep/deep_ep_cpp*.so && cd -
+python -c "import deep_ep; print(deep_ep.__path__)"
+cd ..
+# install sgl-kernel-npu from release whl
+# if [ "$NPU_DEVICE" = "A3" ]; then
+#     wget https://github.com/sgl-project/sgl-kernel-npu/releases/download/2026.01.21/sgl-kernel-npu_2026.01.21_8.5.0_a3.zip
+# fi
+# if [ "$NPU_DEVICE" = "A2" ]; then
+#     wget https://github.com/sgl-project/sgl-kernel-npu/releases/download/2026.01.21/sgl-kernel-npu_2026.01.21_8.5.0_910b.zip
+# fi
+# unzip sgl-kernel-npu*.zip
+# pip install output/torch_memory_saver*.whl
+# pip install output/sgl_kernel_npu*.whl
+# pip install output/deep_ep*.whl
+
+# USE_MEGATRON defaults to 1; the ${VAR:-default} guard keeps an unset variable from aborting the script under set -e
+if [ "${USE_MEGATRON:-1}" -eq 1 ]; then
+    echo "4. install Megatron and MindSpeed"
+    git clone -b 2.3.0_core_r0.12.1 https://gitcode.com/Ascend/MindSpeed.git
+    pip install -e MindSpeed
+    pip install git+https://github.com/NVIDIA/Megatron-LM.git@core_v0.12.1
+fi
+
+echo "5. May need to uninstall timm & triton"
+pip uninstall -y timm triton
+echo "Successfully installed all packages"
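+
+# Usage sketch (assumed invocation; NPU_DEVICE and USE_MEGATRON are the environment inputs this script reads):
+#   NPU_DEVICE=A3 USE_MEGATRON=1 bash scripts/install_sglang_mcore_npu.sh
+# Quick post-install import check:
+#   python -c "import sglang, deep_ep, torch_npu; print('ok')"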