Merged
2 changes: 1 addition & 1 deletion 3rdparty/amd/profiling/PROFILING.md
@@ -356,7 +356,7 @@ client.sh
# Start profiling via API
curl http://localhost:30000/start_profile -H "Content-Type: application/json"

# Benchmark serving using sglang with random dataset and tokenizer
# Benchmark serving using SGLang with a random dataset and tokenizer
# Define the log file with a timestamp
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
LOGFILE="sglang_client_log_$TIMESTAMP.json"
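For orientation, a minimal end-to-end client flow around this hunk might look like the sketch below. The `/stop_profile` endpoint and the `sglang.bench_serving` flags are assumptions based on common SGLang usage, not taken from this file.

```bash
# Hypothetical client.sh sketch: start profiling, run the random-dataset benchmark,
# write results to the timestamped log file, then stop profiling.
curl http://localhost:30000/start_profile -H "Content-Type: application/json"

TIMESTAMP=$(date +%Y%m%d_%H%M%S)
LOGFILE="sglang_client_log_$TIMESTAMP.json"
# Flag names below are assumptions, not quoted from PROFILING.md.
python3 -m sglang.bench_serving --backend sglang --dataset-name random \
    --num-prompts 500 --output-file "$LOGFILE"

curl http://localhost:30000/stop_profile -H "Content-Type: application/json"
```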
14 changes: 7 additions & 7 deletions 3rdparty/amd/tuning/TUNING.md
@@ -93,21 +93,21 @@ TORCHINDUCTOR_MAX_AUTOTUNE=1 TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1 TORCHINDU
#Inference with large improvement on AMD GPU
TORCHINDUCTOR_FREEZING=1 your_script.sh
```
## 4. Fused MOE kernel
To maximize moe kernel efficiency, need to use below scripts to find out the best launch configuration
## 4. Fused MoE kernel
To maximize MoE kernel efficiency, use the scripts below to find the best launch configuration.

### Key parameters:
- **--model**: what moe model type to do tuning, it will automatically decide the size of d_model, model_intermediate_size, num_layers
- **--model**: the MoE model type to tune; the script automatically determines d_model, model_intermediate_size, and num_layers from it
- **--tp-size**: simulate the whole model run configuration to set the dimension size using tp correctly
- **--batch**: M dimension size of moe kernel, for prefill moe kernel the value is batch*input_len, for decode moe kernel the value is batch
- **--batch**: the M dimension of the MoE kernel; for the prefill MoE kernel the value is batch*input_len, and for the decode MoE kernel the value is batch
- **--dtype**: computation type

```bash
#Tuning
#for example, we have one case like this "python3 -m sglang.bench_latency --model dummy_grok1/ --load-format dummy --tokenizer-path Xenova/grok-1-tokenizer --tp 8 --batch-size 32 --input 1024 --output 8 --attention-backend triton --sampling-backend pytorch --quantization fp8" to run, it defined batch-size 32 input length 1024 and output length 8, from "--batch" in moe view point, the prefill batch is 32*1024 = 32768, the decode batch is 32*1(only one output token generated in each run).
#so we can tune decode moe use below command
#for example, suppose we run "python3 -m sglang.bench_latency --model dummy_grok1/ --load-format dummy --tokenizer-path Xenova/grok-1-tokenizer --tp 8 --batch-size 32 --input 1024 --output 8 --attention-backend triton --sampling-backend pytorch --quantization fp8", i.e. batch-size 32, input length 1024, and output length 8. From the "--batch" point of view of the MoE kernel, the prefill batch is 32*1024 = 32768 and the decode batch is 32*1 = 32 (only one output token is generated per decode step).
#so we can tune the decode MoE with the command below
python benchmark_moe_rocm.py --model grok1 --tp-size 8 --dtype float8 --batch "32"
# and use this command to tune prefill moe
# and use this command to tune prefill MoE
python benchmark_moe_rocm.py --model grok1 --tp-size 8 --dtype float8 --batch "32768"
```
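To make the `--batch` bookkeeping concrete, here is a small illustrative calculation using the numbers from the bench_latency example above (the tuning script itself is not involved):

```bash
# Derive the MoE "--batch" (M dimension) values from batch-size 32 and input length 1024.
BATCH_SIZE=32
INPUT_LEN=1024
PREFILL_M=$((BATCH_SIZE * INPUT_LEN))   # 32768: every prompt token passes through the MoE kernel during prefill
DECODE_M=$BATCH_SIZE                    # 32: one new token per sequence per decode step
echo "tune prefill with --batch \"$PREFILL_M\" and decode with --batch \"$DECODE_M\""
```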

2 changes: 1 addition & 1 deletion README.md
@@ -44,7 +44,7 @@ SGLang is a fast serving framework for large language models and vision language
It makes your interaction with models faster and more controllable by co-designing the backend runtime and frontend language.
The core features include:

- **Fast Backend Runtime**: Provides efficient serving with RadixAttention for prefix caching, zero-overhead CPU scheduler, continuous batching, token attention (paged attention), speculative decoding, tensor parallelism, chunked prefill, structured outputs, quantization (FP8/INT4/AWQ/GPTQ), and multi-lora batching.
- **Fast Backend Runtime**: Provides efficient serving with RadixAttention for prefix caching, zero-overhead CPU scheduler, continuous batching, token attention (PagedAttention), speculative decoding, tensor parallelism, chunked prefill, structured outputs, quantization (FP8/INT4/AWQ/GPTQ), and multi-LoRA batching.
- **Flexible Frontend Language**: Offers an intuitive interface for programming LLM applications, including chained generation calls, advanced prompting, control flow, multi-modal inputs, parallelism, and external interactions.
- **Extensive Model Support**: Supports a wide range of generative models (Llama, Gemma, Mistral, Qwen, DeepSeek, LLaVA, etc.), embedding models (e5-mistral, gte, mcdse) and reward models (Skywork), with easy extensibility for integrating new models.
- **Active Community**: SGLang is open-source and backed by an active community with industry adoption.
6 changes: 3 additions & 3 deletions benchmark/benchmark_vllm_060/README.md
@@ -1,6 +1,6 @@
## How to reproduce the benchmark results for SGLang v0.3.0 compared to vLLM v0.6.0

In short, with multi step enabled, in online scenarios that we benchmarked, the Median TTFT of vLLM is **3 times** that of SGLang, and the Median ITL is **10 times** that of SGLang. Lower Median TTFT and ITL are better. vLLM's multi-step optimization did not improve throughput while ensuring lower Median TTFT and ITL. Also, under maximum throughput benchmark, if vLLM does not set gpu util to 0.95 separately and uses the default configuration instead, its maximum throughput is **lower** than that of SGLang.
In short, with multi-step enabled, in the online scenarios we benchmarked, the Median TTFT of vLLM is **3 times** that of SGLang, and the Median ITL is **10 times** that of SGLang (lower Median TTFT and ITL are better). vLLM's multi-step optimization did not improve throughput while keeping Median TTFT and ITL low. Also, under the maximum-throughput benchmark, if vLLM uses its default configuration instead of setting GPU utilization to 0.95 separately, its maximum throughput is **lower** than that of SGLang.

## Online benchmark results

@@ -41,12 +41,12 @@ In short, with multi step enabled, the
## Installation

```bash
# install sglang v0.3.0
# install SGLang v0.3.0
pip install --upgrade pip
pip install "sglang[all]"==0.3.0
pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/

# install vllm v0.6.0
# install vLLM v0.6.0
pip install vllm==0.6.0
```
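A rough sketch of how such an online comparison can be driven is shown below. The exact benchmark scripts and flags behind the published numbers are not part of this diff, so treat the commands (model choice, ports, request rate) as assumptions.

```bash
# Launch the two servers; vLLM gets its GPU utilization raised to 0.95 as discussed above.
python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000
python3 -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-8B-Instruct \
    --gpu-memory-utilization 0.95 --disable-log-requests --port 21000

# Drive the same fixed-rate workload against each endpoint; TTFT and ITL are reported per run.
python3 -m sglang.bench_serving --backend sglang --dataset-name sharegpt \
    --num-prompts 1200 --request-rate 4 --port 30000
```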

16 changes: 8 additions & 8 deletions benchmark/deepseek_v3/README.md
@@ -45,10 +45,10 @@ Add [performance optimization options](#performance-optimization-options) as nee

### Performance Optimization Options

[MLA optimizations](https://lmsys.org/blog/2024-09-04-sglang-v0-3/#deepseek-multi-head-latent-attention-mla-throughput-optimizations) are enabled by default. Here are some optional optimizations can be enabled as needed.
[MLA optimizations](https://lmsys.org/blog/2024-09-04-sglang-v0-3/#deepseek-multi-head-latent-attention-mla-throughput-optimizations) are enabled by default. Here are some optional optimizations that can be enabled as needed.

- [Data Parallelism Attention](https://lmsys.org/blog/2024-12-04-sglang-v0-4/#data-parallelism-attention-for-deepseek-models): For high QPS scenarios, add the `--enable-dp-attention` argument to boost throughput.
- [Torch.compile Optimization](https://lmsys.org/blog/2024-09-04-sglang-v0-3/#torchcompile-latency-optimizations): Add `--enable-torch-compile` argument to enable it. This will take some time while server starts. The maximum batch size for torch.compile optimization can be controlled with `--torch-compile-max-bs`. It's recommended to set it between `1` and `8`. (e.g., `--torch-compile-max-bs 8`)
- [Torch.compile Optimization](https://lmsys.org/blog/2024-09-04-sglang-v0-3/#torchcompile-latency-optimizations): Add the `--enable-torch-compile` argument to enable it. This will take some time while the server starts. The maximum batch size for torch.compile optimization can be controlled with `--torch-compile-max-bs`; it's recommended to set it between `1` and `8` (e.g., `--torch-compile-max-bs 8`). A combined launch command is sketched below.
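As a sketch only (the model path, TP size, and flag combination are assumptions, not taken from this README), a launch command enabling both optimizations could look like:

```bash
# Hypothetical launch combining DP attention and torch.compile on one 8-GPU node.
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code \
    --enable-dp-attention --enable-torch-compile --torch-compile-max-bs 8
```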

### Example: Sending requests with OpenAI API

@@ -90,7 +90,7 @@ If you have two H100 nodes, the usage is similar to the aforementioned H20.

> **Note that the launch command here does not enable Data Parallelism Attention or `torch.compile` Optimization**. For optimal performance, please refer to the command options in [Performance Optimization Options](#option_args).

### Example: Serving with two H200\*8 nodes and docker
### Example: Serving with two H200\*8 nodes and Docker

There are two H200 nodes, each with 8 GPUs. The first node's IP is `192.168.114.10`, and the second node's IP is `192.168.114.11`. Configure the endpoint to expose it to another Docker container using `--host 0.0.0.0` and `--port 40000`, and set up communications with `--dist-init-addr 192.168.114.10:20000`.
A single H200 node with 8 devices can run DeepSeek V3; the dual-H200 setup here just demonstrates multi-node usage.
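A hedged sketch of the two launch commands implied by this description is given below; the actual README wraps them in the `docker run` invocation shown in the next hunk, and the model path here is a placeholder.

```bash
# Node 0 (192.168.114.10)
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --trust-remote-code \
    --dist-init-addr 192.168.114.10:20000 --nnodes 2 --node-rank 0 --host 0.0.0.0 --port 40000

# Node 1 (192.168.114.11)
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --trust-remote-code \
    --dist-init-addr 192.168.114.10:20000 --nnodes 2 --node-rank 1 --host 0.0.0.0 --port 40000
```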
@@ -147,7 +147,7 @@ docker run --gpus all \

To serve DeepSeek-V3 with A100 GPUs, we need to convert the [FP8 model checkpoints](https://huggingface.co/deepseek-ai/DeepSeek-V3) to BF16 with [script](https://github.com/deepseek-ai/DeepSeek-V3/blob/main/inference/fp8_cast_bf16.py) mentioned [here](https://github.com/deepseek-ai/DeepSeek-V3/blob/main/inference/fp8_cast_bf16.py) first.

Since the BF16 model is over 1.3 TB, we need to prepare four A100 nodes, each with 8 80GB GPUs. Assume the first node's IP is `10.0.0.1`, and the converted model path is `/path/to/DeepSeek-V3-BF16`, we can have following commands to launch the server.
Since the BF16 model is over 1.3 TB, we need to prepare four A100 nodes, each with 8 80GB GPUs. Assuming the first node's IP is `10.0.0.1`, and the converted model path is `/path/to/DeepSeek-V3-BF16`, we can run the following commands to launch the server.

```bash
# node 1
@@ -178,7 +178,7 @@ python3 -m sglang.bench_one_batch_server --model None --base-url http://10.0.0.1

### Example: Serving with 8 A100/A800 with AWQ Quantization

Add `--quantization moe_wna16` flag to enable moe wna16 kernel for better performance.
Add the `--quantization moe_wna16` flag to enable the MoE wna16 kernel for better performance.
One example is as follows:

```bash
@@ -188,12 +188,12 @@ python3 -m sglang.launch_server --model cognitivecomputations/DeepSeek-R1-AWQ --

### Example: Serving with 16 A100/A800 with int8 Quantization

There are block-wise and per-channel quantization methods, and the quantization parameters have already been uploaded to Huggingface. One example is as follows:
There are block-wise and per-channel quantization methods, and the quantization parameters have already been uploaded to HuggingFace. One example is as follows:

- [meituan/DeepSeek-R1-Block-INT8](https://huggingface.co/meituan/DeepSeek-R1-Block-INT8)
- [meituan/DeepSeek-R1-Channel-INT8](https://huggingface.co/meituan/DeepSeek-R1-Channel-INT8)

Assuming that master node IP is `MASTER_IP`, checkpoint path is `/path/to/DeepSeek-R1-INT8` and port=5000, we can have following commands to launch the server:
Assuming that the master node IP is `MASTER_IP`, the checkpoint path is `/path/to/DeepSeek-R1-INT8`, and the port is 5000, we can run the following commands to launch the server:
```bash
#master
python3 -m sglang.launch_server \
@@ -225,7 +225,7 @@ Running with per-channel quantization model:

- [meituan/DeepSeek-R1-Channel-INT8](https://huggingface.co/meituan/DeepSeek-R1-Channel-INT8)

Assuming that master node IP is `MASTER_IP`, checkpoint path is `/path/to/DeepSeek-R1-Channel-INT8` and port=5000, we can have following commands to launch the server:
Assuming that the master node IP is `MASTER_IP`, the checkpoint path is `/path/to/DeepSeek-R1-Channel-INT8`, and the port is 5000, we can run the following commands to launch the server:

```bash
#master
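# Hedged sketch of the master/worker pair implied by the INT8 sections above; the
# --quantization w8a8_int8 flag and the exact argument set are assumptions, so check the full README.
# Master node (node-rank 0):
python3 -m sglang.launch_server --model-path /path/to/DeepSeek-R1-Channel-INT8 --tp 16 \
    --quantization w8a8_int8 --dist-init-addr MASTER_IP:5000 --nnodes 2 --node-rank 0 --trust-remote-code
# Worker node: the same command with --node-rank 1.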
12 changes: 6 additions & 6 deletions benchmark/gsm8k/README.md
@@ -1,6 +1,6 @@
## Run benchmark
## Run Benchmark

### Benchmark sglang
### Benchmark SGLang
```
python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000
```
@@ -10,7 +10,7 @@ python3 bench_sglang.py --num-questions 200
```


### Benchmark vllm
### Benchmark vLLM
```
python3 -m vllm.entrypoints.api_server --tokenizer-mode auto --model meta-llama/Llama-2-7b-chat-hf --disable-log-requests --port 21000
```
@@ -20,7 +20,7 @@ python3 bench_other.py --num-questions 200 --backend vllm
```


### Benchmark lightllm
### Benchmark LightLLM
```
# A10G
python -m lightllm.server.api_server --tokenizer_mode auto --model_dir ~/model_weights/llama-2-7b-chat-hf --max_total_token_num 16000 --port 22000
@@ -31,13 +31,13 @@ python3 bench_other.py --num-questions 200 --backend lightllm
```


### Benchmark guidance
### Benchmark Guidance
```
python3 bench_other.py --num-questions 200 --backend guidance --parallel 1 --n-ctx 4096 --model-path path/to/gguf
```


### Benchmark lmql
### Benchmark LMQL
```
CUDA_VISIBLE_DEVICES=0,1 lmql serve-model meta-llama/Llama-2-7b-chat-hf --cuda --port 23000
```
10 changes: 5 additions & 5 deletions benchmark/hellaswag/README.md
@@ -1,6 +1,6 @@
## Run benchmark

### Benchmark sglang
### Benchmark SGLang
```
python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000
```
@@ -10,7 +10,7 @@ python3 bench_sglang.py --num-questions 200
```


### Benchmark vllm
### Benchmark vLLM
```
python3 -m vllm.entrypoints.api_server --tokenizer-mode auto --model meta-llama/Llama-2-7b-chat-hf --disable-log-requests --port 21000
```
@@ -20,7 +20,7 @@ python3 bench_other.py --num-questions 200 --backend vllm
```


### Benchmark lightllm
### Benchmark LightLLM
```
# A10G
python -m lightllm.server.api_server --tokenizer_mode auto --model_dir ~/model_weights/llama-2-7b-chat-hf --max_total_token_num 16000 --port 22000
@@ -31,13 +31,13 @@ python3 bench_other.py --num-questions 200 --backend lightllm
```


### Benchmark guidance
### Benchmark Guidance
```
CUDA_VISIBLE_DEVICES=0,1 python3 bench_other.py --num-questions 200 --backend guidance --parallel 1 --n-ctx 4096 --model-path path/to/gguf
```


### Benchmark lmql
### Benchmark LMQL
```
lmql serve-model meta-llama/Llama-2-7b-chat-hf --cuda --port 23000
```
4 changes: 2 additions & 2 deletions benchmark/kernels/fused_moe_triton/README.md
@@ -4,7 +4,7 @@ This directory contains benchmarking tools for MoE (Mixture of Experts) kernels.

### Tuning Tool

- `tuning_fused_moe_triton.py`: A tool for tuning the `fused_moe_triton` kernel. Adapted from [vllm's benchmark_moe.py](https://github.com/vllm-project/vllm/blob/main/benchmarks/kernels/benchmark_moe.py), with added support for various model architectures.
- `tuning_fused_moe_triton.py`: A tool for tuning the `fused_moe_triton` kernel. Adapted from [vLLM's benchmark_moe.py](https://github.com/vllm-project/vllm/blob/main/benchmarks/kernels/benchmark_moe.py), with added support for various model architectures.

Example usage:
```bash
@@ -48,7 +48,7 @@ After tuning, a configuration file (e.g., `E=64,N=640,device_name=NVIDIA_GeForce

### Performance Comparison Tool

- `benchmark_vllm_vs_sglang_fused_moe_triton.py`: A tool for comparing the performance of fused MoE kernels between vllm and sglang implementations. Supports various model architectures and data types.
- `benchmark_vllm_vs_sglang_fused_moe_triton.py`: A tool for comparing the performance of fused MoE kernels between vLLM and SGLang implementations. Supports various model architectures and data types.

Example usage:
```bash
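# Both example-usage blocks above are truncated in this view, so the commands below are a hedged
# sketch of representative invocations. Flag names follow vLLM's benchmark_moe.py interface, which
# tuning_fused_moe_triton.py is adapted from; the model and TP values are placeholders.
python3 tuning_fused_moe_triton.py --model Qwen/Qwen2-57B-A14B-Instruct --tp-size 4 --dtype fp8_w8a8 --tune
python3 benchmark_vllm_vs_sglang_fused_moe_triton.py --model Qwen/Qwen2-57B-A14B-Instruct --tp-size 4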
14 changes: 7 additions & 7 deletions benchmark/mmlu/README.md
@@ -1,11 +1,11 @@
## Download data
## Download Data
```
bash download_data.sh
```

## Run benchmark
## Run Benchmark

### Benchmark sglang
### Benchmark SGLang
```
python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000
```
@@ -19,7 +19,7 @@ python3 bench_sglang.py --nsub 10
python3 bench_sglang.py --backend gpt-3.5-turbo --parallel 8
```

### Benchmark vllm
### Benchmark vLLM
```
python3 -m vllm.entrypoints.api_server --tokenizer-mode auto --model meta-llama/Llama-2-7b-chat-hf --disable-log-requests --port 21000
```
@@ -29,7 +29,7 @@ python3 bench_other.py --nsub 10 --backend vllm
```


### Benchmark lightllm
### Benchmark LightLLM
```
# A10G
python -m lightllm.server.api_server --tokenizer_mode auto --model_dir ~/model_weights/llama-2-7b-chat-hf --max_total_token_num 16000 --port 22000
@@ -43,13 +43,13 @@ python3 bench_other.py --nsub 10 --backend lightllm
```


### Benchmark guidance
### Benchmark Guidance
```
python3 bench_other.py --nsub 10 --backend guidance --parallel 1 --n-ctx 4096 --model-path path/to/gguf
```


### Benchmark lmql
### Benchmark LMQL
```
CUDA_VISIBLE_DEVICES=0,1 lmql serve-model meta-llama/Llama-2-7b-chat-hf --cuda --port 23000
```
10 changes: 5 additions & 5 deletions benchmark/mtbench/README.md
@@ -4,9 +4,9 @@
wget -O question.jsonl https://raw.githubusercontent.com/lm-sys/FastChat/main/fastchat/llm_judge/data/mt_bench/question.jsonl
```

## Run benchmark
## Run Benchmark

### Benchmark sglang
### Benchmark SGLang
```
python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000
```
@@ -15,7 +15,7 @@ python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port
python3 bench_sglang.py --num-questions 80
```

### Benchmark sglang EAGLE
### Benchmark SGLang EAGLE
```
python3 -m sglang.launch_server --model meta-llama/Meta-Llama-3-8B-Instruct --speculative-algo EAGLE \
--speculative-draft lmsys/sglang-EAGLE-LLaMA3-Instruct-8B --speculative-num-steps 5 \
@@ -27,7 +27,7 @@ python3 bench_sglang_eagle.py --num-questions 80 --parallel 1
```


### Benchmark vllm
### Benchmark vLLM
```
python3 -m vllm.entrypoints.api_server --tokenizer-mode auto --model meta-llama/Llama-2-7b-chat-hf --disable-log-requests --port 21000
```
@@ -37,7 +37,7 @@ python3 bench_other.py --num-questions 80 --backend vllm
```


### Benchmark lightllm
### Benchmark LightLLM
```
# A10G
python -m lightllm.server.api_server --tokenizer_mode auto --model_dir ~/model_weights/llama-2-7b-chat-hf --max_total_token_num 16000 --port 22000
14 changes: 7 additions & 7 deletions benchmark/multi_chain_reasoning/README.md
@@ -1,11 +1,11 @@
## Download data
## Download Data
```
wget https://raw.githubusercontent.com/openai/grade-school-math/master/grade_school_math/data/test.jsonl
```

## Run benchmark
## Run Benchmark

### Benchmark sglang
### Benchmark SGLang
```
python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --schedule-conservativeness 1.3
```
@@ -16,7 +16,7 @@ python3 bench_sglang.py --num-questions 32 --parallel 1
```


### Benchmark vllm
### Benchmark vLLM
```
python3 -m vllm.entrypoints.api_server --tokenizer-mode auto --model meta-llama/Llama-2-7b-chat-hf --disable-log-requests --port 21000
```
@@ -26,7 +26,7 @@ python3 bench_other.py --num-questions 64 --backend vllm
```


### Benchmark lightllm
### Benchmark LightLLM
```
# A10G
python -m lightllm.server.api_server --tokenizer_mode auto --model_dir ~/model_weights/llama-2-7b-chat-hf --max_total_token_num 16000 --port 22000
@@ -37,12 +37,12 @@ python3 bench_other.py --num-questions 64 --backend lightllm
```


### Benchmark guidance
### Benchmark Guidance
```
python3 bench_other.py --num-questions 8 --backend guidance --parallel 1 --n-ctx 4096 --model-path path/to/gguf
```

### Benchmark lmql
### Benchmark LMQL

```
python3 bench_other.py --num-questions 64 --backend lmql --parallel 1