diff --git a/docs/ascend_tutorial/ascend_sglang_quick_start.rst b/docs/ascend_tutorial/ascend_sglang_quick_start.rst
index 8d6e187a05f..8b1661cbbe4 100644
--- a/docs/ascend_tutorial/ascend_sglang_quick_start.rst
+++ b/docs/ascend_tutorial/ascend_sglang_quick_start.rst
@@ -1,7 +1,7 @@
 Ascend Quickstart with SGLang Backend
 ===================================
 
-Last updated: 09/25/2025.
+Last updated: 01/27/2026.
 
 We add support for Huawei Ascend NPU devices in verl.
 
@@ -17,97 +17,137 @@ Atlas 800T A3
 Installation
 -----------------------------------
 
+Key supported versions
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-Basic environment preparation
++-----------+-----------------+
+| software  | version         |
++===========+=================+
+| Python    | == 3.11         |
++-----------+-----------------+
+| HDK       | >= 25.3.RC1     |
++-----------+-----------------+
+| CANN      | >= 8.3.RC1      |
++-----------+-----------------+
+| torch     | >= 2.7.1        |
++-----------+-----------------+
+| torch_npu | >= 2.7.1.post2  |
++-----------+-----------------+
+| sglang    | v0.5.8          |
++-----------+-----------------+
+
+Installation from a Docker image
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
+We provide a Dockerfile for building the image; see `dockerfile_build_guidance `_ and pick the build file that matches your device.
+
+Installation in a custom environment
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+**1. Install and activate the HDK & CANN dependencies**
+
+CANN (Compute Architecture for Neural Networks) is Ascend's heterogeneous computing architecture for AI workloads. So that the training and inference engines can take full advantage of the hardware, install the following `prerequisites `_:
 +-----------+-------------+
-| software  | version     |
-+-----------+-------------+
-| Python    | == 3.11     |
-+-----------+-------------+
-| CANN      | == 8.3.RC1  |
-+-----------+-------------+
-| HDK       | == 25.3.RC1 |
-+-----------+-------------+
-| torch     | == 2.6.0    |
+| HDK       | >= 25.3.RC1 |
 +-----------+-------------+
-| torch_npu | == 2.6.0    |
+| CANN      | >= 8.3.RC1  |
 +-----------+-------------+
+After installation, activate the environment:
 
-**The SGLang NPU backend in verl currently supports only the HDK, CANN, and PTA versions listed above; a commercially available release is expected in October 2025**
+.. code-block:: bash
 
-To use SGLang in verl, install sglang, torch_memory_saver, and verl with the following commands.
+    source /usr/local/Ascend/ascend-toolkit/set_env.sh
+    source /usr/local/Ascend/nnal/atb/set_env.sh
+
+**2. Create a conda environment**
 
-sglang
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 .. code-block:: bash
 
-    # sglang
-    git clone https://github.com/sgl-project/sglang.git
-    cd sglang
-    mv python/pyproject.toml python/pyproject.toml.backup
-    mv python/pyproject_other.toml python/pyproject.toml
-    pip install -e "python[srt_npu]"
-
-Install torch_memory_saver
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+    # create conda env
+    conda create -n verl-sglang python==3.11
+    conda activate verl-sglang
+
+**3. Run the script shipped with verl:** `install_sglang_mcore_npu.sh `_
+
+If this step fails, inspect the script and run its steps manually.
+
 .. code-block:: bash
-
-    # torch_memory_saver
-    git clone https://github.com/sgl-project/sgl-kernel-npu.git
-    cd sgl-kernel-npu
-    bash build.sh -a memory-saver
-    pip install output/torch_memory_saver*.whl
 
-Install verl
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+    git clone https://github.com/volcengine/verl.git
+    # Make sure you have activated verl conda env
+    # NPU_DEVICE=A3 or A2 depends on your device
+    NPU_DEVICE=A3 bash verl/scripts/install_sglang_mcore_npu.sh
+
+**4. Install verl**
 
 .. code-block:: bash
 
-    git clone https://github.com/volcengine/verl.git
     cd verl
    pip install --no-deps -e .
     pip install -r requirements-npu.txt
 
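+As a quick sanity check after installation (a minimal sketch that only assumes the packages installed above), verify that the stack imports and the NPU is visible:
+
+.. code-block:: bash
+
+    # torch_npu patches torch.npu, so NPU availability can be queried through torch
+    python -c "import torch, torch_npu, sglang, verl; print(torch.__version__, torch_npu.__version__); print('NPU available:', torch.npu.is_available())"
+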
-Notes on other third-party libraries
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Quick start
+-----------------------------------
 
-+--------------+---------------+
-| software     | description   |
-+--------------+---------------+
-| transformers | v4.56.1       |
-+--------------+---------------+
-| triton_ascend| v3.2.0        |
-+--------------+---------------+
+**1. Available NPU SGLang scripts**
 
-1. sglang depends on transformers v4.56.1
-2. sglang depends on triton_ascend v3.2.0
-3. Multimodal models are not supported yet; uninstall the related packages torchvision and timm
+.. _Qwen3-30B: https://github.com/verl-project/verl/blob/main/examples/grpo_trainer/run_qwen3moe-30b_sglang_megatron_npu.sh
+.. _Qwen2.5-32B: https://github.com/verl-project/verl/blob/main/examples/grpo_trainer/run_qwen2-32b_sglang_fsdp_npu.sh
+.. _Qwen3-8B-1k: https://github.com/verl-project/verl/blob/main/examples/grpo_trainer/run_qwen3_8b_grpo_sglang_1k_spmd_npu.sh
+.. _Qwen3-8B-32k: https://github.com/verl-project/verl/blob/main/examples/grpo_trainer/run_qwen3_8b_grpo_sglang_32k_spmd_npu.sh
 
-.. code-block:: bash
-
-    pip uninstall torchvision
-    pip uninstall timm
-    pip uninstall triton
-
-    pip install transformers==4.56.1
-    pip install -i https://test.pypi.org/simple/ triton-ascend==3.2.0.dev20250925
++-----------------+----------------+-------+-------------------+
+| model           | recommended NPU| nodes | backend           |
++=================+================+=======+===================+
+| `Qwen3-30B`_    | Atlas 800T A3  | 1     | SGLang + Megatron |
++-----------------+----------------+-------+-------------------+
+| `Qwen2.5-32B`_  | Atlas 900 A2   | 2     | SGLang + FSDP     |
++-----------------+----------------+-------+-------------------+
+| `Qwen3-8B-1k`_  | Atlas A3/A2    | 1     | SGLang + FSDP     |
++-----------------+----------------+-------+-------------------+
+| `Qwen3-8B-32k`_ | Atlas A3/A2    | 1     | SGLang + FSDP     |
++-----------------+----------------+-------+-------------------+
 
+**2. Best practice**
 
-Quick start
------------------------------------
 
-Before using verl in earnest, we recommend a trial GRPO training run of Qwen3-8B to verify that the environment is prepared and installed correctly.
+We provide a verl + SGLang `best practice `_ for `Qwen3-30B`_ and `Qwen2.5-32B`_ as a reference.
+
+**3. Environment variables and parameters**
 
-1. Download the dataset and preprocess it into parquet format so that it contains the fields needed to compute the RL reward
+The following environment variables must be set to use the SGLang backend on NPU:
 
 .. code-block:: bash
 
-    python3 examples/data_preprocess/gsm8k.py --local_save_dir ~/data/gsm8k
+    # Allow multiple processes per NPU card, see https://www.hiascend.com/document/detail/zh/canncommercial/850/commlib/hcclug/hcclug_000091.html
+    export HCCL_HOST_SOCKET_PORT_RANGE=60000-60050
+    export HCCL_NPU_SOCKET_PORT_RANGE=61000-61050
+    # Work around Ray being unable to detect device availability through the is_npu_available interface
+    export RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES=1
+    # Set according to your device and the number of cards required
+    export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
+    # Required when expert parallelism (EP) is enabled for rollout
+    export SGLANG_DEEPEP_BF16_DISPATCH=1
 
-2. Run the training
+verl already parses the common rollout parameters; see the ServerArgs initialization in `async_sglang_server.py `_ for details. All other `sglang arguments `_ can be passed through engine_kwargs.
+
+To convert a vLLM rollout script to SGLang, add or modify the following parameters:
 
 .. code-block:: bash
 
-    bash verl/examples/grpo_trainer/run_qwen3_8b_grpo_sglang_1k_npu.sh
\ No newline at end of file
+    # Required
+    actor_rollout_ref.rollout.name=sglang
+    +actor_rollout_ref.rollout.engine_kwargs.sglang.attention_backend="ascend"
+    # Optional
+    # Enable EP for rollout; see https://github.com/sgl-project/sgl-kernel-npu/blob/main/python/deep_ep/README_CN.md for usage details
+    ++actor_rollout_ref.rollout.engine_kwargs.sglang.deepep_mode="auto"
+    ++actor_rollout_ref.rollout.engine_kwargs.sglang.moe_a2a_backend="deepep"
+    # Must be set to True when a MoE model runs with multiple DP ranks
+    +actor_rollout_ref.rollout.engine_kwargs.sglang.enable_dp_attention=False
+    # chunked prefill is disabled by default
+    +actor_rollout_ref.rollout.engine_kwargs.sglang.chunked_prefill_size=-1
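+
+Any other ServerArgs field can be forwarded the same way. A sketch (``mem_fraction_static`` is only an illustrative ServerArgs option here, not a required setting):
+
+.. code-block:: bash
+
+    # Forward an arbitrary SGLang ServerArgs option through engine_kwargs
+    +actor_rollout_ref.rollout.engine_kwargs.sglang.mem_fraction_static=0.5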
diff --git a/docs/ascend_tutorial/examples/ascend_sglang_best_practices.rst b/docs/ascend_tutorial/examples/ascend_sglang_best_practices.rst
new file mode 100644
index 00000000000..91c6efd25b7
--- /dev/null
+++ b/docs/ascend_tutorial/examples/ascend_sglang_best_practices.rst
@@ -0,0 +1,296 @@
+Ascend SGLang Best Practice
+===================================
+
+Last updated: 01/27/2026.
+
+.. _Qwen3-30B: https://github.com/verl-project/verl/blob/main/examples/grpo_trainer/run_qwen3moe-30b_sglang_megatron_npu.sh
+.. _Qwen2.5-32B: https://github.com/verl-project/verl/blob/main/examples/grpo_trainer/run_qwen2-32b_sglang_fsdp_npu.sh
+
+Introduction
+----------------------------------
+
+SGLang is one of today's mainstream high-performance open-source inference engines, and Ascend now natively supports using it in verl. A short build process is all it takes to set up the environment. This document walks through two representative use cases that cover:
+
+1. Environment setup
+2. Model training and evaluation
+3. Performance profiling
+
+The scripts of the two use cases and the hardware they require are as follows:
+
++----------------+---------------+-------+-------------------+
+| model          | NPU           | nodes | backend           |
++================+===============+=======+===================+
+| `Qwen3-30B`_   | Atlas 800T A3 | 1     | SGLang + Megatron |
++----------------+---------------+-------+-------------------+
+| `Qwen2.5-32B`_ | Atlas 900 A2  | 2     | SGLang + FSDP     |
++----------------+---------------+-------+-------------------+
+
+Environment setup
+-----------------------------------
+
+The quickstart describes two ways to build the environment: 1. from the provided Dockerfile image, 2. in a custom conda environment.
+
+In this practice we additionally pin the verl commit id to avoid pulling in unrelated changes:
+
+.. code-block:: bash
+
+    cd verl
+    git checkout 772c224
+
+Model training and evaluation
+-----------------------------------
+
+1. Model and data preparation
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+`Qwen3-30B`_
+^^^^^^^^^^^^
+
+**Download the model weights**
+
+--local-dir: the local path to save the model
+
+.. code-block:: bash
+
+    export HF_ENDPOINT=https://hf-mirror.com
+    huggingface-cli download --resume-download Qwen/Qwen3-30B-A3B --local-dir /path/to/local_dir
+
+**Download the dataset**
+
+.. code-block:: bash
+
+    git clone https://www.modelscope.cn/datasets/AI-ModelScope/DAPO-Math-17k.git
+
+**Convert the weights from HuggingFace to Megatron (optional)**
+
+.. code-block:: bash
+
+    python scripts/converter_hf_to_mcore.py \
+        --hf_model_path Qwen/Qwen3-30B-A3B \
+        --output_path Qwen/Qwen3-30B-A3B-mcore \
+        --use_cpu_initialization # Only works for MoE models
+
+*Note: verl now supports mbridge for flexible weight conversion between hf and mcore; with the following parameters you can load hf weights directly:*
+
+.. code-block:: bash
+
+    actor_rollout_ref.actor.megatron.use_dist_checkpointing=False
+    actor_rollout_ref.actor.megatron.use_mbridge=True
+
+`Qwen2.5-32B`_
+^^^^^^^^^^^^^^
+
+**Download the model weights**
+
+--local-dir: the local path to save the model
+
+.. code-block:: bash
+
+    export HF_ENDPOINT=https://hf-mirror.com
+    huggingface-cli download --resume-download Qwen/Qwen2.5-32B --local-dir /path/to/local_dir
+
+**Download and process the dataset**
+
+.. code-block:: bash
+
+    wget https://huggingface.co/datasets/agentica-org/DeepScaleR-Preview-Dataset/resolve/main/deepscaler.json
+    python recipe/r1_ascend/json_to_parquet.py --output_dir ./data/deepscaler --json_path path/to/deepscaler.json --train_data_ratio 0.9
+
+2. Training
+^^^^^^^^^^^
+
+Adjust the following parameters in the training script to match your local paths:
+
+.. code-block:: bash
+
+    # Model Weights Paths
+    MODEL_PATH=Qwen/Qwen3-30B-A3B
+    MCORE_MODEL_PATH=Qwen/Qwen3-30B-A3B-mcore
+    RAY_DATA_HOME=${RAY_DATA_HOME:-"${HOME}/verl"}
+    CKPTS_DIR=${CKPTS_DIR:-"${RAY_DATA_HOME}/ckpts/${project_name}/${exp_name}"}
+
+    # File System Paths
+    TRAIN_FILE=$RAY_DATA_HOME/dataset/dapo-math-17k.parquet
+    TEST_FILE=$RAY_DATA_HOME/dataset/aime-2024.parquet
+
+    # Save frequency; -1 (the default) disables saving. Change it if you want to evaluate checkpoints
+    trainer.save_freq=-1
+
+For the single-node task `Qwen3-30B`_, you can directly run the example script from the verl repository:
+
+.. code-block:: bash
+
+    bash examples/grpo_trainer/run_qwen3moe-30b_sglang_megatron_npu.sh
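+
+To keep the full training log for later inspection, the same run can be wrapped as follows (a minimal sketch; the log path is arbitrary):
+
+.. code-block:: bash
+
+    mkdir -p logs
+    bash examples/grpo_trainer/run_qwen3moe-30b_sglang_megatron_npu.sh 2>&1 | tee logs/qwen3_30b_grpo.log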
+
+For the multi-node task `Qwen2.5-32B`_, we recommend the following script to launch large-scale multi-node training:
+
+.. code-block:: bash
+
+    pkill -9 python
+    ray stop --force
+    rm -rf /tmp/ray
+    export RAY_DEDUP_LOGS=0
+    export HYDRA_FULL_ERROR=1
+    # TASK_QUEUE_ENABLE controls dispatch optimization: set 1 for graph mode, 2 for non-graph mode
+    export TASK_QUEUE_ENABLE=1
+    export HCCL_ASYNC_ERROR_HANDLING=0
+    export HCCL_EXEC_TIMEOUT=3600
+    export HCCL_CONNECT_TIMEOUT=3600
+
+    export HCCL_HOST_SOCKET_PORT_RANGE=60000-60050
+    export HCCL_NPU_SOCKET_PORT_RANGE=61000-61050
+    export RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES=1
+    export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
+    # Change to the path of the script you want to run
+    DEFAULT_SH="./run_*.sh"
+    echo "Use $DEFAULT_SH"
+
+    ulimit -n 32768
+    mkdir -p logs
+
+    NNODES=2
+    NPUS_PER_NODE=8
+    # Change to the master node IP
+    MASTER_ADDR="IP FOR MASTER NODE"
+    # Change to the communication NIC of the current node
+    SOCKET_IFNAME="Your SOCKET IFNAME"
+    export HCCL_SOCKET_IFNAME="SOCKET IFNAME FOR CURRENT NODE"
+    export GLOO_SOCKET_IFNAME="SOCKET IFNAME FOR CURRENT NODE"
+    # Get the current node IP
+    CURRENT_IP=$(ifconfig $SOCKET_IFNAME | grep -Eo 'inet (addr:)?([0-9]{1,3}\.){3}[0-9]{1,3}' | awk '{print $NF}')
+    if [ "$MASTER_ADDR" = "$CURRENT_IP" ]; then
+        # Start the head node
+        ray start --head --port 6766 --dashboard-host=$MASTER_ADDR --node-ip-address=$CURRENT_IP --dashboard-port=8260 --resources='{"NPU": '$NPUS_PER_NODE'}'
+
+        while true; do
+            ray_status_output=$(ray status)
+            npu_count=$(echo "$ray_status_output" | grep -oP '(?<=/)\d+\.\d+(?=\s*NPU)' | head -n 1)
+            npu_count_int=$(echo "$npu_count" | awk '{print int($1)}')
+            device_count=$((npu_count_int / $NPUS_PER_NODE))
+
+            # Check whether device_count equals NNODES
+            if [ "$device_count" -eq "$NNODES" ]; then
+                echo "Ray cluster is ready with $device_count devices (from $npu_count NPU resources), starting Python script."
+                ray status
+                bash $DEFAULT_SH
+                break
+            else
+                echo "Waiting for Ray to allocate $NNODES devices. Current device count: $device_count"
+                sleep 5
+            fi
+        done
+    else
+        # Worker nodes keep trying to register with the head node until they succeed
+        while true; do
+            # Try to join the ray cluster
+            ray start --address="$MASTER_ADDR:6766" --resources='{"NPU": '$NPUS_PER_NODE'}' --node-ip-address=$CURRENT_IP
+
+            # Check whether the connection succeeded
+            ray status
+            if [ $? -eq 0 ]; then
+                echo "Successfully connected to the Ray cluster!"
+                break
+            else
+                echo "Failed to connect to the Ray cluster. Retrying in 5 seconds..."
+                sleep 5
+            fi
+        done
+    fi
+
+    sleep 600
+
+DEFAULT_SH: set to the training script to run. In this case, set it to the `Qwen2.5-32B`_ script path.
+
+NNODES and NPUS_PER_NODE: set to the number of nodes and the number of NPUs per node, here 2 and 8 respectively.
+
+MASTER_ADDR: set to the master node IP; it must be identical on all nodes.
+
+SOCKET_IFNAME, HCCL_SOCKET_IFNAME, GLOO_SOCKET_IFNAME: set to the corresponding communication NIC, which can be obtained with:
+
+.. code-block:: bash
+
+    ifconfig |grep "$(hostname -I |awk '{print $1}'|awk -F '.' '{print $0}')" -B 1|awk -F ':' '{print$1}' | head -1 | tail -1
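+
+Once the script is running on all nodes, you can confirm on the head node that every worker has registered before training starts (a quick check; "NPU" is the custom resource name the script above registers with Ray):
+
+.. code-block:: bash
+
+    # Expect NNODES * NPUS_PER_NODE NPU resources in the cluster total
+    ray status | grep NPU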
+
+3. Model evaluation
+^^^^^^^^^^^^^^^^^^^
+
+The steps are the same for every model; Qwen3-30B is used as the example here.
+
+We evaluate the model with AISBench, which supports evaluating multiple inference backends, including vLLM and SGLang.
+
+**Installation**
+
+.. code-block:: bash
+
+    git clone https://gitee.com/aisbench/benchmark.git
+    cd benchmark
+    pip install -e .
+
+**Download the evaluation dataset**
+
+.. code-block:: bash
+
+    cd path/to/benchmark/ais_bench/datasets
+    wget http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/math.zip
+    unzip math.zip
+    rm math.zip
+
+**Modify the AISBench configuration to enable SGLang evaluation**
+
+Open benchmark/ais_bench/benchmark/configs/models/vllm_api/vllm_api_stream_chat.py, the inference configuration file:
+
+.. code-block:: python
+
+    from ais_bench.benchmark.models import VLLMCustomAPIChatStream
+    from ais_bench.benchmark.utils.model_postprocessors import extract_non_reasoning_content
+    from ais_bench.benchmark.clients import OpenAIChatStreamClient, OpenAIChatStreamSglangClient
+
+    models = [
+        dict(
+            attr="service",
+            type=VLLMCustomAPIChatStream,
+            abbr='sgl-api-stream-chat',
+            path="/path/to/Qwen3-30B",  # change to the Qwen3-30B model path
+            model="qwen3-30b",
+            request_rate=0,
+            max_seq_len=2048,
+            retry=2,
+            host_ip="localhost",  # IP of the inference service
+            host_port=8005,  # port of the inference service
+            max_out_len=8192,  # maximum number of output tokens
+            batch_size=48,  # maximum inference concurrency
+            trust_remote_code=False,
+            custom_client=dict(type=OpenAIChatStreamSglangClient),  # use the SGLang client
+            generation_kwargs=dict(
+                temperature=0,
+                seed=1234,
+            ),
+            pred_postprocessor=dict(type=extract_non_reasoning_content)
+        )
+    ]
+
+**Launch the sglang server**
+
+.. code-block:: bash
+
+    python -m sglang.launch_server --model-path "/path/to/Qwen3-30B" --tp-size 4 --dp-size 1 --port 8005
+
+**Run the sglang client evaluation**
+
+.. code-block:: bash
+
+    ais_bench --models vllm_api_stream_chat --datasets math500_gen_0_shot_cot_chat_prompt
+
+**Evaluation results**
+
+After training, the model's score on MATH-500 rises markedly:
+
++------+----------------------+---------+----------+------+----------------------+
+| iter | dataset              | version | metric   | mode | sgl-api-stream-chat  |
++======+======================+=========+==========+======+======================+
+| 0    | math_prm800k_500     | c4b6f0  | accuracy | gen  | 84.4                 |
++------+----------------------+---------+----------+------+----------------------+
+| 150  | math_prm800k_500     | c4b6f0  | accuracy | gen  | 91.7                 |
++------+----------------------+---------+----------+------+----------------------+
+
+Performance profiling
+-----------------------------------
+
+For detailed documentation on NPU profiling, see `ascend_profiling_zh `_.
+
+The `Qwen3-30B`_ script ships with a basic profiling option block, PROF_CONFIG. Profiling is disabled by default via global_profiler.steps=null; adjust the parameters as needed.
+
+Once collected, the data can be parsed and visualized with `MindStudio Insight `_.
+
+Note: full profiling on the verl framework side produces a huge volume of repetitive operator records; following the documentation above, you can modify the code to profile only the key stages.
\ No newline at end of file
diff --git a/docs/ascend_tutorial/ascend_best_practice/dapo_multi_model_optimization_practice.md b/docs/ascend_tutorial/examples/dapo_multi_model_optimization_practice.md
similarity index 100%
rename from docs/ascend_tutorial/ascend_best_practice/dapo_multi_model_optimization_practice.md
rename to docs/ascend_tutorial/examples/dapo_multi_model_optimization_practice.md
diff --git a/docs/ascend_tutorial/ascend_optimization_pratice/gspo_optimization_practice.md b/docs/ascend_tutorial/examples/gspo_optimization_practice.md
similarity index 100%
rename from docs/ascend_tutorial/ascend_optimization_pratice/gspo_optimization_practice.md
rename to docs/ascend_tutorial/examples/gspo_optimization_practice.md
diff --git a/docs/index.rst b/docs/index.rst
index 1b3cdedda7e..2e1bc7a04e2 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -150,8 +150,9 @@ verl is fast with:
    ascend_tutorial/ascend_profiling_en.rst
    ascend_tutorial/dockerfile_build_guidance.rst
    ascend_tutorial/ascend_sglang_quick_start.rst
-   ascend_tutorial/ascend_optimization_pratice/gspo_optimization_practice.md
-   ascend_tutorial/ascend_best_practice/dapo_multi_model_optimization_practice.md
+   ascend_tutorial/examples/gspo_optimization_practice.md
+   ascend_tutorial/examples/dapo_multi_model_optimization_practice.md
+   ascend_tutorial/examples/ascend_sglang_best_practices.rst
 
 .. 
toctree:: :maxdepth: 1 diff --git a/examples/grpo_trainer/run_qwen3_8b_grpo_sglang_1k_spmd_npu.sh b/examples/grpo_trainer/run_qwen3_8b_grpo_sglang_1k_spmd_npu.sh index b2d259b4330..878b106f9f1 100644 --- a/examples/grpo_trainer/run_qwen3_8b_grpo_sglang_1k_spmd_npu.sh +++ b/examples/grpo_trainer/run_qwen3_8b_grpo_sglang_1k_spmd_npu.sh @@ -2,7 +2,8 @@ set -x export HCCL_CONNECT_TIMEOUT=1500 export HCCL_HOST_SOCKET_PORT_RANGE=60000-60050 export HCCL_NPU_SOCKET_PORT_RANGE=61000-61050 - +export RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES=1 +export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3 # WORKSPACE_HOME and DATA_HOME support custom path configuration. WORKSPACE_HOME=$pwd DATA_HOME=$pwd diff --git a/examples/grpo_trainer/run_qwen3_8b_grpo_sglang_32k_spmd_npu.sh b/examples/grpo_trainer/run_qwen3_8b_grpo_sglang_32k_spmd_npu.sh index 9076360bb6d..04b2f3a36e9 100644 --- a/examples/grpo_trainer/run_qwen3_8b_grpo_sglang_32k_spmd_npu.sh +++ b/examples/grpo_trainer/run_qwen3_8b_grpo_sglang_32k_spmd_npu.sh @@ -2,7 +2,8 @@ set -x export HCCL_CONNECT_TIMEOUT=1500 export HCCL_HOST_SOCKET_PORT_RANGE=60000-60050 export HCCL_NPU_SOCKET_PORT_RANGE=61000-61050 - +export RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES=1 +export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 # WORKSPACE_HOME and DATA_HOME support custom path configuration. WORKSPACE_HOME=$pwd DATA_HOME=$pwd diff --git a/examples/grpo_trainer/run_qwen3moe-30b_sglang_megatron_npu.sh b/examples/grpo_trainer/run_qwen3moe-30b_sglang_megatron_npu.sh new file mode 100644 index 00000000000..71e566c7dcd --- /dev/null +++ b/examples/grpo_trainer/run_qwen3moe-30b_sglang_megatron_npu.sh @@ -0,0 +1,236 @@ +#!/bin/bash +set -xeuo pipefail +# Project Configuration +project_name='DAPO-Qwen3-30b-A3B-BASE-MATH' +exp_name='DAPO-Qwen3-30B-A3B-BASE-Megatron-SGLang' + +# Necessary env +export HCCL_CONNECT_TIMEOUT=1500 +export HCCL_HOST_SOCKET_PORT_RANGE=60000-60050 +export HCCL_NPU_SOCKET_PORT_RANGE=61000-61050 + +export RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES=1 +export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 + +export DISABLE_L2_CACHE=1 +export TASK_QUEUE_ENABLE=1 + +# Node Info +NNODES=${NNODES:-1} +NPUS_PER_NODE=${NPUS_PER_NODE:-16} + +# Model Weights Paths +MODEL_PATH=Qwen/Qwen3-30B-A3B +MCORE_MODEL_PATH=Qwen/Qwen3-30B-A3B-mcore +RAY_DATA_HOME=${RAY_DATA_HOME:-"${HOME}/verl"} +CKPTS_DIR=${CKPTS_DIR:-"${RAY_DATA_HOME}/ckpts/${project_name}/${exp_name}"} + +# File System Paths +TRAIN_FILE=$RAY_DATA_HOME/dataset/dapo-math-17k.parquet +TEST_FILE=$RAY_DATA_HOME/dataset/aime-2024.parquet +# Data Length Configuration +max_prompt_length=$((1024 * 2)) +max_response_length=$((1024 * 8)) + +# Training Batch Configuration +train_prompt_bsz=16 +train_prompt_mini_bsz=16 +n_resp_per_prompt=8 + +# Algorithm Configuration +adv_estimator=grpo +use_kl_in_reward=False +kl_coef=0.0 +use_kl_loss=True +kl_loss_coef=0.001 + +# Performance and Memory Management Configuration +all_offload=True +use_dynamic_bsz=False +actor_ppo_max_token_len=$(((max_prompt_length + max_response_length))) +infer_ppo_max_token_len=$(((max_prompt_length + max_response_length))) + +# Megatron Parallelism Configuration +train_tp=4 +train_ep=4 +train_etp=4 +train_pp=1 +train_cp=1 + +# SGLang Generation Configuration +gen_tp=4 +gen_dp=1 +gen_ep=1 +gpu_memory_utilization=0.5 +max_model_len=$((max_prompt_length + max_response_length)) +max_num_batched_tokens=$(((max_prompt_length + max_response_length) * 1)) + +# Data Configuration +DATA_CONFIG=( + # File Paths + 
data.train_files="${TRAIN_FILE}" + data.val_files="${TEST_FILE}" + # Data Structure + data.prompt_key=prompt + # Batch and Length Configuration + data.train_batch_size=${train_prompt_bsz} + data.max_prompt_length=${max_prompt_length} + data.max_response_length=${max_response_length} + # Preprocessing + data.filter_overlong_prompts=False + data.truncation='left' +) + +# Model Configuration +MODEL_CONFIG=( + # Model Path + actor_rollout_ref.model.path="${MODEL_PATH}" + # Model Processing + actor_rollout_ref.model.use_remove_padding=True +) + +# Reinforcement Learning Algorithm Configuration +ALGORITHM_CONFIG=( + # Advantage Estimation + algorithm.adv_estimator=${adv_estimator} + # KL Divergence Control + algorithm.use_kl_in_reward=${use_kl_in_reward} + algorithm.kl_ctrl.kl_coef=${kl_coef} +) + +ACTOR_CONFIG=( + # Core Runtime Settings + actor_rollout_ref.actor.use_torch_compile=False + actor_rollout_ref.actor.use_dynamic_bsz=${use_dynamic_bsz} + # Loss Function Configuration + actor_rollout_ref.actor.use_kl_loss=${use_kl_loss} + actor_rollout_ref.actor.kl_loss_coef=${kl_loss_coef} + actor_rollout_ref.actor.entropy_coeff=0 + # PPO Training Parameters + actor_rollout_ref.actor.ppo_epochs=1 + actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=1 + actor_rollout_ref.actor.ppo_max_token_len_per_gpu=${actor_ppo_max_token_len} + actor_rollout_ref.actor.ppo_mini_batch_size=${train_prompt_mini_bsz} + # Optimizer Settings + actor_rollout_ref.actor.optim.lr=1e-6 + # Megatron Parallelism Strategy + actor_rollout_ref.actor.megatron.tensor_model_parallel_size=${train_tp} + actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=${train_pp} + actor_rollout_ref.actor.megatron.context_parallel_size=${train_cp} + actor_rollout_ref.actor.megatron.expert_model_parallel_size=${train_ep} + actor_rollout_ref.actor.megatron.expert_tensor_parallel_size=${train_etp} + # Memory Optimization + actor_rollout_ref.actor.megatron.param_offload=${all_offload} + actor_rollout_ref.actor.megatron.optimizer_offload=${all_offload} + actor_rollout_ref.actor.megatron.grad_offload=${all_offload} + # Model Weights Management + actor_rollout_ref.actor.megatron.dist_checkpointing_path=${MCORE_MODEL_PATH} + actor_rollout_ref.actor.megatron.use_dist_checkpointing=True + actor_rollout_ref.actor.megatron.use_mbridge=False + # Transformer Architecture Optimizations + +actor_rollout_ref.actor.megatron.override_transformer_config.use_flash_attn=True + +actor_rollout_ref.actor.megatron.override_transformer_config.recompute_method=uniform + +actor_rollout_ref.actor.megatron.override_transformer_config.recompute_granularity=full + +actor_rollout_ref.actor.megatron.override_transformer_config.recompute_num_layers=1 +) + +REF_CONFIG=( + # Core Runtime Settings + actor_rollout_ref.ref.use_torch_compile=False + # Log Probability Inference + actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=1 + actor_rollout_ref.ref.log_prob_use_dynamic_bsz=${use_dynamic_bsz} + actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} + # Megatron Parallelism Strategy + actor_rollout_ref.ref.megatron.tensor_model_parallel_size=${train_tp} + actor_rollout_ref.ref.megatron.pipeline_model_parallel_size=${train_pp} + actor_rollout_ref.ref.megatron.context_parallel_size=${train_cp} + actor_rollout_ref.ref.megatron.expert_model_parallel_size=${train_ep} + actor_rollout_ref.ref.megatron.expert_tensor_parallel_size=${train_etp} + # Memory Optimization + actor_rollout_ref.ref.megatron.param_offload=${all_offload} + # Model Weights 
Management + actor_rollout_ref.ref.megatron.dist_checkpointing_path=${MCORE_MODEL_PATH} + actor_rollout_ref.ref.megatron.use_dist_checkpointing=True + actor_rollout_ref.ref.megatron.use_mbridge=False +) + +ROLLOUT_CONFIG=( + # Rollout Engine + actor_rollout_ref.rollout.name=sglang + +actor_rollout_ref.rollout.engine_kwargs.sglang.attention_backend="ascend" + # Generation Parameters + actor_rollout_ref.rollout.n=${n_resp_per_prompt} + actor_rollout_ref.rollout.top_p=1.0 + actor_rollout_ref.rollout.top_k=-1 + actor_rollout_ref.rollout.temperature=1.0 + # Log Probability Inference + actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=1 + actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=${use_dynamic_bsz} + actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} + # Memory Management + actor_rollout_ref.rollout.gpu_memory_utilization=${gpu_memory_utilization} + # Parallelism Strategy + actor_rollout_ref.rollout.tensor_model_parallel_size=${gen_tp} + actor_rollout_ref.rollout.data_parallel_size=${gen_dp} + actor_rollout_ref.rollout.expert_parallel_size=${gen_ep} + +actor_rollout_ref.rollout.engine_kwargs.sglang.enable_dp_attention=False + # Performance Optimization + +actor_rollout_ref.rollout.engine_kwargs.sglang.chunked_prefill_size=-1 + actor_rollout_ref.rollout.enforce_eager=False + # Validation Generation + actor_rollout_ref.rollout.val_kwargs.n=1 + actor_rollout_ref.rollout.val_kwargs.do_sample=True + actor_rollout_ref.rollout.val_kwargs.top_p=1.0 + actor_rollout_ref.rollout.val_kwargs.top_k=-1 + actor_rollout_ref.rollout.val_kwargs.temperature=1.0 +) + +TRAINER_CONFIG=( + # Logger Configuration + trainer.logger='["console"]' + # Project Settings + trainer.project_name="${project_name}" + trainer.experiment_name="${exp_name}" + # Hardware Configuration + trainer.nnodes="${NNODES}" + trainer.n_gpus_per_node="${NPUS_PER_NODE}" + trainer.device='npu' + # Training Schedule + trainer.total_epochs=15 + trainer.val_before_train=False + trainer.test_freq=-1 + trainer.save_freq=-1 + # Checkpoint Directory + trainer.default_local_dir="${CKPTS_DIR}" +) + +# profiling configuration +PROF_CONFIG=( + global_profiler.tool=npu + global_profiler.steps=null + global_profiler.save_path=/profpath + actor_rollout_ref.actor.profiler.enable=True + actor_rollout_ref.actor.profiler.ranks="[0]" + actor_rollout_ref.actor.profiler.all_ranks=False + actor_rollout_ref.actor.profiler.tool_config.npu.discrete=True + actor_rollout_ref.actor.profiler.tool_config.npu.contents=['npu','cpu'] + actor_rollout_ref.actor.profiler.tool_config.npu.level=level0 + actor_rollout_ref.actor.profiler.tool_config.npu.analysis=True + actor_rollout_ref.rollout.profiler.enable=True + actor_rollout_ref.rollout.profiler.ranks="[0]" + actor_rollout_ref.rollout.profiler.all_ranks=False +) + +python3 -m verl.trainer.main_ppo \ + --config-path=config \ + --config-name='ppo_megatron_trainer.yaml' \ + "${DATA_CONFIG[@]}" \ + "${MODEL_CONFIG[@]}" \ + "${ACTOR_CONFIG[@]}" \ + "${REF_CONFIG[@]}" \ + "${ROLLOUT_CONFIG[@]}" \ + "${ALGORITHM_CONFIG[@]}" \ + "${TRAINER_CONFIG[@]}" \ + "${PROF_CONFIG[@]}" \ + "$@" diff --git a/scripts/install_sglang_mcore_npu.sh b/scripts/install_sglang_mcore_npu.sh new file mode 100644 index 00000000000..2975db3d1ed --- /dev/null +++ b/scripts/install_sglang_mcore_npu.sh @@ -0,0 +1,57 @@ +#!/bin/bash +set -e +NPU_DEVICE=${NPU_DEVICE:=A3} + +export MAX_JOBS=32 + +echo "1. 
install SGLang from source"
+git clone -b v0.5.8 https://github.com/sgl-project/sglang.git
+cd sglang
+mv python/pyproject_other.toml python/pyproject.toml
+pip install -e "python[srt_npu]"
+cd ..
+
+echo "2. install torch & torch_npu & triton_ascend & other basic packages"
+pip install torch==2.7.1 torch_npu==2.7.1.post2 torchvision==0.22.1
+pip install pybind11 click==8.2.1 mbridge "numpy<2.0.0" cachetools
+
+
+echo "3. install sgl-kernel-npu from source, detailed readme in https://github.com/sgl-project/sgl-kernel-npu/blob/main/python/deep_ep/README.md"
+git clone https://github.com/sgl-project/sgl-kernel-npu.git
+cd sgl-kernel-npu
+git checkout 46b73de
+# comment out line 101 of build.sh before building
+sed -i '101s/^/# /' build.sh
+if [ "$NPU_DEVICE" = "A3" ]; then
+    bash build.sh
+fi
+if [ "$NPU_DEVICE" = "A2" ]; then
+    bash build.sh -a deepep2
+fi
+pip install output/torch_memory_saver*.whl
+pip install output/sgl_kernel_npu*.whl
+pip install output/deep_ep*.whl
+# link the compiled deep_ep C++ extension into the installed package and verify that it imports
+cd "$(pip show deep-ep | grep -E '^Location:' | awk '{print $2}')" && ln -s deep_ep/deep_ep_cpp*.so && cd -
+python -c "import deep_ep; print(deep_ep.__path__)"
+cd ..
+# install sgl-kernel-npu from release whl
+# if [ "$NPU_DEVICE" = "A3" ]; then
+#     wget https://github.com/sgl-project/sgl-kernel-npu/releases/download/2026.01.21/sgl-kernel-npu_2026.01.21_8.5.0_a3.zip
+# fi
+# if [ "$NPU_DEVICE" = "A2" ]; then
+#     wget https://github.com/sgl-project/sgl-kernel-npu/releases/download/2026.01.21/sgl-kernel-npu_2026.01.21_8.5.0_910b.zip
+# fi
+# unzip sgl-kernel-npu*.zip
+# pip install output/torch_memory_saver*.whl
+# pip install output/sgl_kernel_npu*.whl
+# pip install output/deep_ep*.whl
+
+# USE_MEGATRON defaults to 1; the ${VAR:-default} guard keeps an unset variable from aborting the script under set -e
+if [ "${USE_MEGATRON:-1}" -eq 1 ]; then
+    echo "4. install Megatron and MindSpeed"
+    git clone -b 2.3.0_core_r0.12.1 https://gitcode.com/Ascend/MindSpeed.git
+    pip install -e MindSpeed
+    pip install git+https://github.com/NVIDIA/Megatron-LM.git@core_v0.12.1
+fi
+
+echo "5. May need to uninstall timm & triton"
+pip uninstall -y timm triton
+echo "Successfully installed all packages"
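+
+# Usage sketch (assumed invocation; NPU_DEVICE and USE_MEGATRON are the environment inputs this script reads):
+#   NPU_DEVICE=A3 USE_MEGATRON=1 bash scripts/install_sglang_mcore_npu.sh
+# Quick post-install import check:
+#   python -c "import sglang, deep_ep, torch_npu; print('ok')"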