# GLM-5

## Introduction

[GLM-5](https://huggingface.co/zai-org/GLM-5) uses a Mixture-of-Experts (MoE) architecture and targets complex systems engineering and long-horizon agentic tasks.

This document walks through the main verification steps for the model, including supported features, feature configuration, environment preparation, single-node and multi-node deployment, and accuracy and performance evaluation.

## Supported Features

Refer to [supported features](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/support_matrix/supported_models.html) for the model's supported-feature matrix.

Refer to the [feature guide](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/support_matrix/supported_features.html) for each feature's configuration.

## Environment Preparation

### Model Weight

- `GLM-5` (BF16 version): [download model weight](https://www.modelscope.cn/models/ZhipuAI/GLM-5).
- `GLM-5-w4a8` (quantized version without MTP): [download model weight](https://modelers.cn/models/Eco-Tech/GLM-5-w4a8).
- You can also use [msmodelslim](https://gitcode.com/Ascend/msmodelslim) to quantize the model yourself.

It is recommended to download the model weight to a directory shared across nodes, such as `/root/.cache/`, as sketched below.

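For example, a minimal download sketch using the ModelScope CLI (assuming `modelscope` is installed; the target directory is an assumption chosen to match the paths used in the serve commands later in this document, so adjust it to your layout):

```shell
# Download the BF16 weights into the shared cache directory.
modelscope download --model ZhipuAI/GLM-5 \
  --local_dir /root/.cache/modelscope/hub/models/vllm-ascend/GLM5-bf16
```
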
### Installation

vLLM and vLLM Ascend only support GLM-5 on their main branches. You can use our official docker images and then upgrade vLLM and vLLM Ascend for inference.

```{code-block} bash
# Update --device according to your device (Atlas A3: /dev/davinci[0-15]).
# Update the vllm-ascend image according to your environment; glm5-a3 can be replaced by glm5, glm5-openeuler, or glm5-a3-openeuler.
# Note: you should download the weight to /root/.cache in advance.
export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:glm5-a3
export NAME=vllm-ascend

# Run the container using the defined variables.
# Note: if you are running a bridge network with docker, expose the ports needed for multi-node communication in advance.
docker run --rm \
--name $NAME \
--net=host \
--shm-size=1g \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-it $IMAGE bash
```

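Once inside the container, a quick sanity check (a suggested step, not part of the original instructions) is to confirm that the NPUs are visible:

```shell
# Should list all mapped davinci devices; if it fails, re-check the --device flags above.
npu-smi info
```
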
If you don't want to use the docker image above, you can instead build everything from source:

- Install `vllm-ascend` from source; refer to [installation](https://docs.vllm.ai/projects/ascend/en/latest/installation.html).

To run inference with `GLM-5`, upgrade vllm, vllm-ascend, and transformers to their main branches:

```shell
# upgrade vllm
git clone https://github.com/vllm-project/vllm.git
cd vllm
git checkout 978a37c82387ce4a40aaadddcdbaf4a06fc4d590
VLLM_TARGET_DEVICE=empty pip install -v .

# upgrade vllm-ascend
git clone https://github.com/vllm-project/vllm-ascend.git
cd vllm-ascend
git checkout ff3a50d011dcbea08f87ebed69ff1bf156dbb01e
git submodule update --init --recursive
pip install -v .

# reinstall transformers
pip install git+https://github.com/huggingface/transformers.git
```

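A quick way to confirm the upgrade took effect (a suggested check, not part of the original instructions):

```shell
# vllm and vllm-ascend should report development builds rather than release versions.
pip show vllm vllm-ascend transformers | grep -E '^(Name|Version)'
```
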
If you want to deploy a multi-node environment, you need to set up the environment on each node.

## Deployment

### Single-node Deployment

**A2 series**

Not tested yet.

**A3 series**

- The quantized model `glm-5-w4a8` can be deployed on 1 Atlas 800 A3 (64G × 16).

Run the following script to start online inference.

```shell
export HCCL_OP_EXPANSION_MODE="AIV"
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=200
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_ASCEND_BALANCE_SCHEDULING=1

vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/GLM5-w4a8 \
--host 0.0.0.0 \
--port 8077 \
--data-parallel-size 1 \
--tensor-parallel-size 16 \
--enable-expert-parallel \
--seed 1024 \
--served-model-name glm-5 \
--max-num-seqs 8 \
--max-model-len 66600 \
--max-num-batched-tokens 4096 \
--trust-remote-code \
--gpu-memory-utilization 0.95 \
--quantization ascend \
--enable-chunked-prefill \
--enable-prefix-caching \
--async-scheduling \
--additional-config '{"multistream_overlap_shared_expert":true}' \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
--speculative-config '{"num_speculative_tokens": 3, "method": "deepseek_mtp"}'
```

**Notice:**
The parameters are explained as follows:

- For single-node deployment, we recommend using `dp1tp16` and turning off expert parallelism in low-latency scenarios.
- `--async-scheduling`: asynchronous scheduling optimizes inference efficiency by allowing non-blocking task scheduling, which improves concurrency and throughput, especially when serving large models.

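Once the server is up, you can send a request to the OpenAI-compatible endpoint to verify the deployment (the prompt and sampling parameters below are illustrative):

```shell
curl http://localhost:8077/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm-5",
    "messages": [{"role": "user", "content": "Hello, who are you?"}],
    "max_tokens": 64
  }'
```
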
### Multi-node Deployment

**A2 series**

Not tested yet.

**A3 series**

- `glm-5-bf16`: requires at least 2 Atlas 800 A3 (64G × 16).

Run the following scripts on the two nodes respectively.

**node 0**

```shell
# Obtain these values via ifconfig:
# nic_name is the network interface name corresponding to local_ip on the current node.
nic_name="xxx"
local_ip="xxx"

# node0_ip must match the local_ip set on node 0 (the master node).
node0_ip="xxxx"

export HCCL_OP_EXPANSION_MODE="AIV"

export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=200
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True

vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/GLM5-bf16 \
--host 0.0.0.0 \
--port 8077 \
--data-parallel-size 2 \
--data-parallel-size-local 1 \
--data-parallel-address $node0_ip \
--data-parallel-rpc-port 12890 \
--tensor-parallel-size 16 \
--quantization ascend \
--seed 1024 \
--served-model-name glm-5 \
--enable-expert-parallel \
--max-num-seqs 16 \
--max-model-len 8192 \
--max-num-batched-tokens 4096 \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.95 \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
--speculative-config '{"num_speculative_tokens": 3, "method": "deepseek_mtp"}'
```

**node 1**

```shell
# Obtain these values via ifconfig:
# nic_name is the network interface name corresponding to local_ip on the current node.
nic_name="xxx"
local_ip="xxx"

# node0_ip must match the local_ip set on node 0 (the master node).
node0_ip="xxxx"

export HCCL_OP_EXPANSION_MODE="AIV"

export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=200
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True

vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/GLM5-bf16 \
--host 0.0.0.0 \
--port 8077 \
--headless \
--data-parallel-size 2 \
--data-parallel-size-local 1 \
--data-parallel-start-rank 1 \
--data-parallel-address $node0_ip \
--data-parallel-rpc-port 12890 \
--tensor-parallel-size 16 \
--quantization ascend \
--seed 1024 \
--served-model-name glm-5 \
--enable-expert-parallel \
--max-num-seqs 16 \
--max-model-len 8192 \
--max-num-batched-tokens 4096 \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.95 \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
--speculative-config '{"num_speculative_tokens": 3, "method": "deepseek_mtp"}'
```

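After both nodes are up, you can verify the deployment against the master node's API server (an illustrative check):

```shell
# Should return a model list containing "glm-5".
curl http://$node0_ip:8077/v1/models
```
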
### Prefill-Decode Disaggregation

Not tested yet.

## Accuracy Evaluation

Here are two accuracy evaluation methods.

### Using AISBench

1. Refer to [Using AISBench](https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/evaluation/using_ais_bench.html) for details.

2. After execution, you can collect the results from the AISBench output.

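As a rough sketch only (the task and dataset identifiers below are assumptions; the exact names and configuration are given in the AISBench guide linked above), an evaluation run follows this pattern:

```shell
# Illustrative invocation: evaluate the served model against a generation dataset.
ais_bench --models vllm_api_general_chat --datasets gsm8k_gen --debug
```
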
### Using Language Model Evaluation Harness

Not tested yet.

## Performance

### Using AISBench

Refer to [Using AISBench for performance evaluation](https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/evaluation/using_ais_bench.html#execute-performance-evaluation) for details.

### Using vLLM Benchmark

Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.

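As a minimal sketch (assuming the single-node server above is running on port 8077; the dataset shape and request count are illustrative, and flag availability may vary across vLLM versions):

```shell
# Drive the OpenAI-compatible server with random prompts and report throughput/latency.
vllm bench serve \
  --backend vllm \
  --model glm-5 \
  --tokenizer /root/.cache/modelscope/hub/models/vllm-ascend/GLM5-w4a8 \
  --base-url http://localhost:8077 \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 512 \
  --num-prompts 32
```
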
The tutorial index is updated to include the new page:

```diff
@@ -26,6 +26,7 @@ DeepSeek-V3.1.md
 DeepSeek-V3.2.md
 DeepSeek-R1.md
 GLM4.x.md
+GLM5.md
 Kimi-K2-Thinking.md
 PaddleOCR-VL.md
 :::
```