Skip to content

feat: add sglang + openai gpt-oss serve and benchmark on modal#183

Merged
weedge merged 4 commits intomainfrom
feat/llm
Aug 10, 2025
Merged

feat: add sglang + openai gpt-oss serve and benchmark on modal#183
weedge merged 4 commits intomainfrom
feat/llm

Conversation

@weedge
Copy link
Copy Markdown
Collaborator

@weedge weedge commented Aug 10, 2025

Tip

Open Compute Project (OCP) MXFP4:


feat:

  • add sglang + openai gpt-oss serve and benchmark on modal
# 0. download model weight(safetensors), tokenizer and config
modal run src/download_models.py --repo-ids "lmsys/gpt-oss-20b-bf16"
modal run src/download_models.py --repo-ids "lmsys/gpt-oss-120b-bf16"
modal run src/download_models.py --repo-ids "openai/gpt-oss-20b"
modal run src/download_models.py --repo-ids "openai/gpt-oss-120b" --ignore-patterns "*.pt|*.bin|*original*|*metal*"


# 1. run server and test with urllib request raw http api (dev)
# fp8/bf16
LLM_MODEL=lmsys/gpt-oss-20b-bf16 SERVE_IMAGE_GPU=A100-80GB TP=1 modal run src/llm/sglang/openai_gpt_oss.py::url_request
# mxfp4 
LLM_MODEL=openai/gpt-oss-20b SERVE_IMAGE_GPU=A100-80GB TP=1 modal run src/llm/sglang/openai_gpt_oss.py::url_request


# 2. run server and test with openai client sdk (dev/test)
# fp8/bf16
LLM_MODEL=lmsys/gpt-oss-20b-bf16 SERVE_IMAGE_GPU=A100-80GB TP=1 modal run src/llm/sglang/openai_gpt_oss.py::main --task local_api_completions
LLM_MODEL=lmsys/gpt-oss-20b-bf16 SERVE_IMAGE_GPU=H100 TP=1 modal run src/llm/sglang/openai_gpt_oss.py::main --task local_api_completions
LLM_MODEL=lmsys/gpt-oss-20b-bf16 SERVE_IMAGE_GPU=L40s:2 TP=2 SERVE_ARGS="--cuda-graph-max-bs 4" modal run src/llm/sglang/openai_gpt_oss.py::main --task local_api_completions 
LLM_MODEL=lmsys/gpt-oss-120b-bf16 SERVE_IMAGE_GPU=H100:4 TP=4 SERVE_ARGS="--cuda-graph-max-bs 16" modal run src/llm/sglang/openai_gpt_oss.py::main --task local_api_completions 
# mxfp4 
LLM_MODEL=openai/gpt-oss-20b SERVE_IMAGE_GPU=A100-80GB TP=1 modal run src/llm/sglang/openai_gpt_oss.py::main --task local_api_tool_completions
LLM_MODEL=openai/gpt-oss-20b SERVE_IMAGE_GPU=H100 TP=1 modal run src/llm/sglang/openai_gpt_oss.py::main --task local_api_tool_completions
LLM_MODEL=openai/gpt-oss-20b SERVE_IMAGE_GPU=L40s:2 TP=2 SERVE_ARGS="--cuda-graph-max-bs 4" modal run src/llm/sglang/openai_gpt_oss.py::main --task local_api_completions 


# 3. benchmark
LLM_MODEL=lmsys/gpt-oss-20b-bf16 SERVE_IMAGE_GPU=L40s:2 TP=2 SERVE_ARGS="--cuda-graph-max-bs 4" modal run src/llm/sglang/openai_gpt_oss.py::main --task benchmark
LLM_MODEL=openai/gpt-oss-20b SERVE_IMAGE_GPU=L40s:2 TP=2 SERVE_ARGS="--cuda-graph-max-bs 4" modal run src/llm/sglang/openai_gpt_oss.py::main --task benchmark
LLM_MODEL=openai/gpt-oss-20b SERVE_IMAGE_GPU=A100-80GB TP=1 modal run src/llm/sglang/openai_gpt_oss.py::main --task benchmark
LLM_MODEL=openai/gpt-oss-20b SERVE_IMAGE_GPU=A100-80GB TP=1 modal run src/llm/sglang/openai_gpt_oss.py::main --task benchmark --num-prompts 20 --max-concurrency 4
LLM_MODEL=openai/gpt-oss-20b SERVE_IMAGE_GPU=H100 TP=1 modal run src/llm/sglang/openai_gpt_oss.py::main --task benchmark
LLM_MODEL=openai/gpt-oss-20b SERVE_IMAGE_GPU=H100 TP=1 modal run src/llm/sglang/openai_gpt_oss.py::main --task benchmark --num-prompts 20 --max-concurrency 4
LLM_MODEL=openai/gpt-oss-20b SERVE_IMAGE_GPU=H100 TP=1 modal run src/llm/sglang/openai_gpt_oss.py::main --task benchmark --num-prompts 40 --max-concurrency 8
LLM_MODEL=openai/gpt-oss-20b SERVE_IMAGE_GPU=H100 TP=1 modal run src/llm/sglang/openai_gpt_oss.py::main --task benchmark --num-prompts 80 --max-concurrency 16
LLM_MODEL=openai/gpt-oss-20b SERVE_IMAGE_GPU=H100 TP=1 modal run src/llm/sglang/openai_gpt_oss.py::main --task benchmark --num-prompts 160 --max-concurrency 32
LLM_MODEL=openai/gpt-oss-20b SERVE_IMAGE_GPU=H100 TP=1 modal run src/llm/sglang/openai_gpt_oss.py::main --task benchmark --num-prompts 320 --max-concurrency 64
LLM_MODEL=openai/gpt-oss-20b SERVE_IMAGE_GPU=H200 TP=1 modal run src/llm/sglang/openai_gpt_oss.py::main --task benchmark
LLM_MODEL=openai/gpt-oss-20b SERVE_IMAGE_GPU=H200 TP=1 modal run src/llm/sglang/openai_gpt_oss.py::main --task benchmark --num-prompts 20 --max-concurrency 4
LLM_MODEL=openai/gpt-oss-20b SERVE_IMAGE_GPU=H200 TP=1 modal run src/llm/sglang/openai_gpt_oss.py::main --task benchmark --num-prompts 80 --max-concurrency 16
LLM_MODEL=openai/gpt-oss-20b SERVE_IMAGE_GPU=H200 TP=1 modal run src/llm/sglang/openai_gpt_oss.py::main --task benchmark --num-prompts 160 --max-concurrency 32
LLM_MODEL=openai/gpt-oss-20b SERVE_IMAGE_GPU=B200 TP=1 modal run src/llm/sglang/openai_gpt_oss.py::main --task benchmark
LLM_MODEL=openai/gpt-oss-20b SERVE_IMAGE_GPU=B200 TP=1 modal run src/llm/sglang/openai_gpt_oss.py::main --task benchmark --num-prompts 20 --max-concurrency 4
LLM_MODEL=openai/gpt-oss-20b SERVE_IMAGE_GPU=B200 TP=1 modal run src/llm/sglang/openai_gpt_oss.py::main --task benchmark --num-prompts 80 --max-concurrency 16
LLM_MODEL=openai/gpt-oss-20b SERVE_IMAGE_GPU=B200 TP=1 modal run src/llm/sglang/openai_gpt_oss.py::main --task benchmark --num-prompts 160 --max-concurrency 32
LLM_MODEL=lmsys/gpt-oss-120b-bf16 SERVE_IMAGE_GPU=H100:4 SERVE_ARGS="--cuda-graph-max-bs 16 --mem-fraction-static 0.5" TP=4 modal run src/llm/sglang/openai_gpt_oss.py::main --task benchmark --num-prompts 320 --max-concurrency 64

# 4. run server (gray)
# fp8/bf16
LLM_MODEL=lmsys/gpt-oss-20b-bf16 SERVE_IMAGE_GPU=H100 TP=1 modal serve src/llm/sglang/openai_gpt_oss.py
LLM_MODEL=lmsys/gpt-oss-20b-bf16 SERVE_IMAGE_GPU=L40s:2 TP=2 SERVE_ARGS="--cuda-graph-max-bs 4" modal serve src/llm/sglang/openai_gpt_oss.py
# mxfp4
LLM_MODEL=openai/gpt-oss-20b SERVE_IMAGE_GPU=H100 TP=1 modal serve src/llm/sglang/openai_gpt_oss.py
LLM_MODEL=openai/gpt-oss-20b SERVE_IMAGE_GPU=L40s:2 TP=2 SERVE_ARGS="--cuda-graph-max-bs 4" modal serve src/llm/sglang/openai_gpt_oss.py

# 5. deploy (online)
# fp8/bf16
LLM_MODEL=lmsys/gpt-oss-20b-bf16 SERVE_IMAGE_GPU=H100 TP=1 SERVE_MAX_CONTAINERS=10 modal deploy src/llm/sglang/openai_gpt_oss.py
LLM_MODEL=lmsys/gpt-oss-20b-bf16 SERVE_IMAGE_GPU=L40s:2 TP=2 SERVE_MAX_CONTAINERS=10 SERVE_ARGS="--cuda-graph-max-bs 4" modal serve src/llm/sglang/openai_gpt_oss.py
# mxfp4
LLM_MODEL=openai/gpt-oss-20b SERVE_IMAGE_GPU=H100 TP=1 SERVE_MAX_CONTAINERS=10 modal deploy src/llm/sglang/openai_gpt_oss.py
LLM_MODEL=openai/gpt-oss-20b SERVE_IMAGE_GPU=L40s:2 TP=2 SERVE_MAX_CONTAINERS=10 SERVE_ARGS="--cuda-graph-max-bs 4" modal deploy src/llm/sglang/openai_gpt_oss.py

===================

# 6. generate (DIY)
LLM_MODEL=openai/gpt-oss-20b RUN_IMAGE_GPU=A100-80GB TP=1 modal run src/llm/sglang/openai_gpt_oss.py::main --task generate
LLM_MODEL=openai/gpt-oss-20b RUN_IMAGE_GPU=A100-80GB TP=1 modal run src/llm/sglang/openai_gpt_oss.py::main --task generate_stream
LLM_MODEL=openai/gpt-oss-20b RUN_IMAGE_GPU=A100-80GB TP=1 modal run src/llm/sglang/openai_gpt_oss.py::main --task batch_generate_stream
LLM_MODEL=openai/gpt-oss-20b RUN_IMAGE_GPU=A100-80GB TP=1 modal run src/llm/sglang/openai_gpt_oss.py::main --task harmony_generate 
LLM_MODEL=openai/gpt-oss-20b RUN_IMAGE_GPU=A100-80GB TP=1 modal run src/llm/sglang/openai_gpt_oss.py::main --task chat_stream

# local input --- queue --> remote to loop chat

## use browser tool(find,open,search), need env EXA_API_KEY from https://exa.ai
RUN_IMAGE_GPU=A100-80GB modal run src/llm/sglang/openai_gpt_oss.py::main --task chat_tool_stream --build-in-tool browser
RUN_IMAGE_GPU=A100-80GB modal run src/llm/sglang/openai_gpt_oss.py::main --task chat_tool_stream \
    --max-tokens 2048 --temperature=1.0 --top-p=1.0 \
    --build-in-tool browser --show-browser-results --model-identity "你是一名聊天助手,请用中文回复。"
RUN_IMAGE_GPU=A100-80GB modal run src/llm/sglang/openai_gpt_oss.py::main --task chat_tool_stream --build-in-tool browser --is-apply-patch --show-browser-results
RUN_IMAGE_GPU=A100-80GB modal run src/llm/sglang/openai_gpt_oss.py::main --task chat_tool_stream --build-in-tool browser --raw --is-apply-patch --show-browser-results

## need python tool to run script on python docker, u can change python tools to run local env python or use serverless function 
RUN_IMAGE_GPU=A100-80GB modal run src/llm/sglang/openai_gpt_oss.py::main --task chat_tool_stream --build-in-tool python
RUN_IMAGE_GPU=A100-80GB modal run src/llm/sglang/openai_gpt_oss.py::main --task chat_tool_stream --build-in-tool python --is-apply-patch 
RUN_IMAGE_GPU=A100-80GB modal run src/llm/sglang/openai_gpt_oss.py::main --task chat_tool_stream --build-in-tool python --raw --is-apply-patch

reference

Signed-off-by: weedge <weege007@gmail.com>
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary of Changes

Hello @weedge, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

I've implemented a new feature to enable serving and benchmarking of OpenAI GPT-OSS models using SGLang on Modal. This involves setting up a dedicated Modal application for SGLang, complete with environment configuration, model serving capabilities, and integrated benchmarking tools. Additionally, I've made minor adjustments to existing evaluation scripts to improve clarity and ensure proper directory setup, and updated the VLLM serving configuration to use a different port and adjusted concurrency settings.

Highlights

  • New SGLang Integration: I've added a new feature that integrates SGLang for serving and benchmarking OpenAI GPT-OSS models on the Modal platform, providing a new pathway for deploying and evaluating these models.
  • New File for SGLang Logic: I've introduced a new Python file, deploy/modal/src/llm/sglang/openai_gpt_oss.py, which encapsulates all the necessary logic for setting up the SGLang environment, serving models, and running benchmarks.
  • Modal App Renaming for Clarity: I've updated the Modal application names in evalscope.py and lighteval.py to be more descriptive, changing them from openai_gpt_oss_trl to openai_gpt_oss_evalscope and openai_gpt_oss_lighteval respectively, improving clarity and organization.
  • Automated Directory Creation: I've ensured that the evaluation output directories are automatically created in evalscope.py and lighteval.py by adding os.makedirs(eval_out_dir, exist_ok=True), preventing potential errors during evaluation runs.
  • VLLM Serving Port and Concurrency Adjustment: I've adjusted the default serving port for VLLM in openai_gpt_oss.py from 8801 to 8000 and modified the target_inputs for modal.concurrent from 10 to 4, which might impact concurrency and resource allocation for VLLM deployments.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in issue comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments or fill out our survey to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request adds support for serving and benchmarking OpenAI's GPT-OSS models using sglang on Modal. It introduces a new script for this purpose and makes minor adjustments to existing evaluation scripts. My review focuses on the new sglang script, where I've identified a critical issue with a Python package dependency that would break the build, as well as several high and medium severity issues related to correctness, code quality, and maintainability. I've provided suggestions to fix these issues.

"pip3 install torch==2.8.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu129"
)
.run_commands(
"pip3 install https://github.com/sgl-project/whl/releases/download/v0.3.3/sgl_kernel-0.3.3-cp39-abi3-manylinux2014_x86_64.whl --force-reinstall"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The Docker image is configured to use Python 3.10 (add_python="3.10" on line 23), but this command attempts to install a wheel for Python 3.9 (cp39). This will cause the image build to fail. You should use the wheel compatible with Python 3.10, which is available.

Suggested change
"pip3 install https://github.com/sgl-project/whl/releases/download/v0.3.3/sgl_kernel-0.3.3-cp39-abi3-manylinux2014_x86_64.whl --force-reinstall"
"pip3 install https://github.com/sgl-project/whl/releases/download/v0.3.3/sgl_kernel-0.3.3-cp310-abi3-manylinux2014_x86_64.whl --force-reinstall"

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wheels from https://github.com/sgl-project/whl/releases/tag/v0.3.3 , no x86_64, but use cp39 is ok for python3.10

@weedge weedge added modal MoE sglang GPT openai GPT model reasoning reasoning model bench labels Aug 10, 2025
weedge and others added 2 commits August 10, 2025 18:59
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@weedge
Copy link
Copy Markdown
Collaborator Author

weedge commented Aug 10, 2025

python3 -m sglang.launch_server --help

usage: launch_server.py [-h] --model-path MODEL_PATH
                        [--tokenizer-path TOKENIZER_PATH]
                        [--tokenizer-mode {auto,slow}] [--skip-tokenizer-init]
                        [--load-format {auto,pt,safetensors,npcache,dummy,sharded_state,gguf,bitsandbytes,layered,remote}]
                        [--model-loader-extra-config MODEL_LOADER_EXTRA_CONFIG]
                        [--trust-remote-code]
                        [--context-length CONTEXT_LENGTH] [--is-embedding]
                        [--enable-multimodal] [--revision REVISION]
                        [--model-impl MODEL_IMPL] [--host HOST] [--port PORT]
                        [--skip-server-warmup] [--warmups WARMUPS]
                        [--nccl-port NCCL_PORT]
                        [--dtype {auto,half,float16,bfloat16,float,float32}]
                        [--quantization {awq,fp8,gptq,marlin,gptq_marlin,awq_marlin,bitsandbytes,gguf,modelopt,modelopt_fp4,petit_nvfp4,w8a8_int8,w8a8_fp8,moe_wna16,qoq,w4afp8,mxfp4}]
                        [--quantization-param-path QUANTIZATION_PARAM_PATH]
                        [--kv-cache-dtype {auto,fp8_e5m2,fp8_e4m3}]
                        [--mem-fraction-static MEM_FRACTION_STATIC]
                        [--max-running-requests MAX_RUNNING_REQUESTS]
                        [--max-queued-requests MAX_QUEUED_REQUESTS]
                        [--max-total-tokens MAX_TOTAL_TOKENS]
                        [--chunked-prefill-size CHUNKED_PREFILL_SIZE]
                        [--max-prefill-tokens MAX_PREFILL_TOKENS]
                        [--schedule-policy {lpm,random,fcfs,dfs-weight,lof}]
                        [--schedule-conservativeness SCHEDULE_CONSERVATIVENESS]
                        [--cpu-offload-gb CPU_OFFLOAD_GB]
                        [--page-size PAGE_SIZE]
                        [--hybrid-kvcache-ratio [HYBRID_KVCACHE_RATIO]]
                        [--swa-full-tokens-ratio SWA_FULL_TOKENS_RATIO]
                        [--disable-hybrid-swa-memory] [--device DEVICE]
                        [--tensor-parallel-size TENSOR_PARALLEL_SIZE]
                        [--pipeline-parallel-size PIPELINE_PARALLEL_SIZE]
                        [--max-micro-batch-size MAX_MICRO_BATCH_SIZE]
                        [--stream-interval STREAM_INTERVAL] [--stream-output]
                        [--random-seed RANDOM_SEED]
                        [--constrained-json-whitespace-pattern CONSTRAINED_JSON_WHITESPACE_PATTERN]
                        [--watchdog-timeout WATCHDOG_TIMEOUT]
                        [--dist-timeout DIST_TIMEOUT]
                        [--download-dir DOWNLOAD_DIR]
                        [--base-gpu-id BASE_GPU_ID]
                        [--gpu-id-step GPU_ID_STEP] [--sleep-on-idle]
                        [--log-level LOG_LEVEL]
                        [--log-level-http LOG_LEVEL_HTTP] [--log-requests]
                        [--log-requests-level {0,1,2,3}]
                        [--crash-dump-folder CRASH_DUMP_FOLDER]
                        [--show-time-cost] [--enable-metrics]
                        [--enable-metrics-for-all-schedulers]
                        [--bucket-time-to-first-token BUCKET_TIME_TO_FIRST_TOKEN [BUCKET_TIME_TO_FIRST_TOKEN ...]]
                        [--bucket-inter-token-latency BUCKET_INTER_TOKEN_LATENCY [BUCKET_INTER_TOKEN_LATENCY ...]]
                        [--bucket-e2e-request-latency BUCKET_E2E_REQUEST_LATENCY [BUCKET_E2E_REQUEST_LATENCY ...]]
                        [--collect-tokens-histogram]
                        [--decode-log-interval DECODE_LOG_INTERVAL]
                        [--enable-request-time-stats-logging]
                        [--kv-events-config KV_EVENTS_CONFIG]
                        [--api-key API_KEY]
                        [--served-model-name SERVED_MODEL_NAME]
                        [--chat-template CHAT_TEMPLATE]
                        [--completion-template COMPLETION_TEMPLATE]
                        [--file-storage-path FILE_STORAGE_PATH]
                        [--enable-cache-report]
                        [--reasoning-parser {deepseek-r1,qwen3,qwen3-thinking,glm45,kimi,step3}]
                        [--tool-call-parser {qwen25,mistral,llama3,deepseekv3,pythonic,kimi_k2,qwen3_coder,glm45,step3}]
                        [--tool-server TOOL_SERVER]
                        [--data-parallel-size DATA_PARALLEL_SIZE]
                        [--load-balance-method {round_robin,shortest_queue,minimum_tokens}]
                        [--dist-init-addr DIST_INIT_ADDR] [--nnodes NNODES]
                        [--node-rank NODE_RANK]
                        [--json-model-override-args JSON_MODEL_OVERRIDE_ARGS]
                        [--preferred-sampling-params PREFERRED_SAMPLING_PARAMS]
                        [--enable-lora] [--max-lora-rank MAX_LORA_RANK]
                        [--lora-target-modules [{q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj,all} ...]]
                        [--lora-paths [LORA_PATHS ...]]
                        [--max-loras-per-batch MAX_LORAS_PER_BATCH]
                        [--max-loaded-loras MAX_LOADED_LORAS]
                        [--lora-backend LORA_BACKEND]
                        [--attention-backend {aiter,cutlass_mla,fa3,flashinfer,flashmla,intel_amx,torch_native,ascend,triton,trtllm_mla,trtllm_mha,dual_chunk_flash_attn}]
                        [--prefill-attention-backend {aiter,cutlass_mla,fa3,flashinfer,flashmla,intel_amx,torch_native,ascend,triton,trtllm_mla,trtllm_mha,dual_chunk_flash_attn}]
                        [--decode-attention-backend {aiter,cutlass_mla,fa3,flashinfer,flashmla,intel_amx,torch_native,ascend,triton,trtllm_mla,trtllm_mha,dual_chunk_flash_attn}]
                        [--sampling-backend {flashinfer,pytorch}]
                        [--grammar-backend {xgrammar,outlines,llguidance,none}]
                        [--mm-attention-backend {sdpa,fa3,triton_attn}]
                        [--speculative-algorithm {EAGLE,EAGLE3,NEXTN}]
                        [--speculative-draft-model-path SPECULATIVE_DRAFT_MODEL_PATH]
                        [--speculative-num-steps SPECULATIVE_NUM_STEPS]
                        [--speculative-eagle-topk SPECULATIVE_EAGLE_TOPK]
                        [--speculative-num-draft-tokens SPECULATIVE_NUM_DRAFT_TOKENS]
                        [--speculative-accept-threshold-single SPECULATIVE_ACCEPT_THRESHOLD_SINGLE]
                        [--speculative-accept-threshold-acc SPECULATIVE_ACCEPT_THRESHOLD_ACC]
                        [--speculative-token-map SPECULATIVE_TOKEN_MAP]
                        [--expert-parallel-size EXPERT_PARALLEL_SIZE]
                        [--moe-a2a-backend {deepep}]
                        [--enable-flashinfer-cutlass-moe]
                        [--enable-flashinfer-trtllm-moe]
                        [--enable-flashinfer-allreduce-fusion]
                        [--deepep-mode {normal,low_latency,auto}]
                        [--ep-num-redundant-experts EP_NUM_REDUNDANT_EXPERTS]
                        [--ep-dispatch-algorithm EP_DISPATCH_ALGORITHM]
                        [--init-expert-location INIT_EXPERT_LOCATION]
                        [--enable-eplb] [--eplb-algorithm EPLB_ALGORITHM]
                        [--eplb-rebalance-num-iterations EPLB_REBALANCE_NUM_ITERATIONS]
                        [--eplb-rebalance-layers-per-chunk EPLB_REBALANCE_LAYERS_PER_CHUNK]
                        [--expert-distribution-recorder-mode EXPERT_DISTRIBUTION_RECORDER_MODE]
                        [--expert-distribution-recorder-buffer-size EXPERT_DISTRIBUTION_RECORDER_BUFFER_SIZE]
                        [--enable-expert-distribution-metrics]
                        [--deepep-config DEEPEP_CONFIG]
                        [--moe-dense-tp-size MOE_DENSE_TP_SIZE]
                        [--enable-hierarchical-cache]
                        [--hicache-ratio HICACHE_RATIO]
                        [--hicache-size HICACHE_SIZE]
                        [--hicache-write-policy {write_back,write_through,write_through_selective}]
                        [--hicache-io-backend {direct,kernel}]
                        [--hicache-mem-layout {layer_first,page_first}]
                        [--hicache-storage-backend {file,mooncake,hf3fs,nixl}]
                        [--hicache-storage-prefetch-policy {best_effort,wait_complete,timeout}]
                        [--enable-double-sparsity]
                        [--ds-channel-config-path DS_CHANNEL_CONFIG_PATH]
                        [--ds-heavy-channel-num DS_HEAVY_CHANNEL_NUM]
                        [--ds-heavy-token-num DS_HEAVY_TOKEN_NUM]
                        [--ds-heavy-channel-type DS_HEAVY_CHANNEL_TYPE]
                        [--ds-sparse-decode-threshold DS_SPARSE_DECODE_THRESHOLD]
                        [--disable-radix-cache]
                        [--cuda-graph-max-bs CUDA_GRAPH_MAX_BS]
                        [--cuda-graph-bs CUDA_GRAPH_BS [CUDA_GRAPH_BS ...]]
                        [--disable-cuda-graph] [--disable-cuda-graph-padding]
                        [--enable-profile-cuda-graph] [--enable-cudagraph-gc]
                        [--enable-nccl-nvls] [--enable-symm-mem]
                        [--enable-tokenizer-batch-encode]
                        [--disable-outlines-disk-cache]
                        [--disable-custom-all-reduce] [--enable-mscclpp]
                        [--disable-overlap-schedule] [--enable-mixed-chunk]
                        [--enable-dp-attention] [--enable-dp-lm-head]
                        [--enable-two-batch-overlap]
                        [--tbo-token-distribution-threshold TBO_TOKEN_DISTRIBUTION_THRESHOLD]
                        [--enable-torch-compile]
                        [--torch-compile-max-bs TORCH_COMPILE_MAX_BS]
                        [--torchao-config TORCHAO_CONFIG]
                        [--enable-nan-detection] [--enable-p2p-check]
                        [--triton-attention-reduce-in-fp32]
                        [--triton-attention-num-kv-splits TRITON_ATTENTION_NUM_KV_SPLITS]
                        [--num-continuous-decode-steps NUM_CONTINUOUS_DECODE_STEPS]
                        [--delete-ckpt-after-loading] [--enable-memory-saver]
                        [--allow-auto-truncate]
                        [--enable-custom-logit-processor]
                        [--flashinfer-mla-disable-ragged]
                        [--disable-shared-experts-fusion]
                        [--disable-chunked-prefix-cache]
                        [--disable-fast-image-processor]
                        [--enable-return-hidden-states]
                        [--enable-triton-kernel-moe]
                        [--enable-flashinfer-mxfp4-moe]
                        [--scheduler-recv-interval SCHEDULER_RECV_INTERVAL]
                        [--debug-tensor-dump-output-folder DEBUG_TENSOR_DUMP_OUTPUT_FOLDER]
                        [--debug-tensor-dump-input-file DEBUG_TENSOR_DUMP_INPUT_FILE]
                        [--debug-tensor-dump-inject DEBUG_TENSOR_DUMP_INJECT]
                        [--debug-tensor-dump-prefill-only]
                        [--disaggregation-mode {null,prefill,decode}]
                        [--disaggregation-transfer-backend {mooncake,nixl,ascend}]
                        [--disaggregation-bootstrap-port DISAGGREGATION_BOOTSTRAP_PORT]
                        [--disaggregation-decode-tp DISAGGREGATION_DECODE_TP]
                        [--disaggregation-decode-dp DISAGGREGATION_DECODE_DP]
                        [--disaggregation-prefill-pp DISAGGREGATION_PREFILL_PP]
                        [--disaggregation-ib-device DISAGGREGATION_IB_DEVICE]
                        [--num-reserved-decode-tokens NUM_RESERVED_DECODE_TOKENS]
                        [--pdlb-url PDLB_URL]
                        [--custom-weight-loader [CUSTOM_WEIGHT_LOADER ...]]
                        [--enable-pdmux] [--sm-group-num SM_GROUP_NUM]
                        [--weight-loader-disable-mmap] [--enable-ep-moe]
                        [--enable-deepep-moe]

options:
  -h, --help            show this help message and exit
  --model-path MODEL_PATH, --model MODEL_PATH
                        The path of the model weights. This can be a local
                        folder or a Hugging Face repo ID.
  --tokenizer-path TOKENIZER_PATH
                        The path of the tokenizer.
  --tokenizer-mode {auto,slow}
                        Tokenizer mode. 'auto' will use the fast tokenizer if
                        available, and 'slow' will always use the slow
                        tokenizer.
  --skip-tokenizer-init
                        If set, skip init tokenizer and pass input_ids in
                        generate request.
  --load-format {auto,pt,safetensors,npcache,dummy,sharded_state,gguf,bitsandbytes,layered,remote}
                        The format of the model weights to load. "auto" will
                        try to load the weights in the safetensors format and
                        fall back to the pytorch bin format if safetensors
                        format is not available. "pt" will load the weights in
                        the pytorch bin format. "safetensors" will load the
                        weights in the safetensors format. "npcache" will load
                        the weights in pytorch format and store a numpy cache
                        to speed up the loading. "dummy" will initialize the
                        weights with random values, which is mainly for
                        profiling."gguf" will load the weights in the gguf
                        format. "bitsandbytes" will load the weights using
                        bitsandbytes quantization."layered" loads weights
                        layer by layer so that one can quantize a layer before
                        loading another to make the peak memory envelope
                        smaller.
  --model-loader-extra-config MODEL_LOADER_EXTRA_CONFIG
                        Extra config for model loader. This will be passed to
                        the model loader corresponding to the chosen
                        load_format.
  --trust-remote-code   Whether or not to allow for custom models defined on
                        the Hub in their own modeling files.
  --context-length CONTEXT_LENGTH
                        The model's maximum context length. Defaults to None
                        (will use the value from the model's config.json
                        instead).
  --is-embedding        Whether to use a CausalLM as an embedding model.
  --enable-multimodal   Enable the multimodal functionality for the served
                        model. If the model being served is not multimodal,
                        nothing will happen
  --revision REVISION   The specific model version to use. It can be a branch
                        name, a tag name, or a commit id. If unspecified, will
                        use the default version.
  --model-impl MODEL_IMPL
                        Which implementation of the model to use. * "auto"
                        will try to use the SGLang implementation if it exists
                        and fall back to the Transformers implementation if no
                        SGLang implementation is available. * "sglang" will
                        use the SGLang model implementation. * "transformers"
                        will use the Transformers model implementation.
  --host HOST           The host of the HTTP server.
  --port PORT           The port of the HTTP server.
  --skip-server-warmup  If set, skip warmup.
  --warmups WARMUPS     Specify custom warmup functions (csv) to run before
                        server starts eg. --warmups=warmup_name1,warmup_name2
                        will run the functions `warmup_name1` and
                        `warmup_name2` specified in warmup.py before the
                        server starts listening for requests
  --nccl-port NCCL_PORT
                        The port for NCCL distributed environment setup.
                        Defaults to a random port.
  --dtype {auto,half,float16,bfloat16,float,float32}
                        Data type for model weights and activations. * "auto"
                        will use FP16 precision for FP32 and FP16 models, and
                        BF16 precision for BF16 models. * "half" for FP16.
                        Recommended for AWQ quantization. * "float16" is the
                        same as "half". * "bfloat16" for a balance between
                        precision and range. * "float" is shorthand for FP32
                        precision. * "float32" for FP32 precision.
  --quantization {awq,fp8,gptq,marlin,gptq_marlin,awq_marlin,bitsandbytes,gguf,modelopt,modelopt_fp4,petit_nvfp4,w8a8_int8,w8a8_fp8,moe_wna16,qoq,w4afp8,mxfp4}
                        The quantization method.
  --quantization-param-path QUANTIZATION_PARAM_PATH
                        Path to the JSON file containing the KV cache scaling
                        factors. This should generally be supplied, when KV
                        cache dtype is FP8. Otherwise, KV cache scaling
                        factors default to 1.0, which may cause accuracy
                        issues.
  --kv-cache-dtype {auto,fp8_e5m2,fp8_e4m3}
                        Data type for KV cache storage. "auto" will use the
                        model data type. "fp8_e5m2" and "fp8_e4m3" are
                        supported for CUDA 11.8+.
  --mem-fraction-static MEM_FRACTION_STATIC
                        The fraction of the memory used for static allocation
                        (model weights and KV cache memory pool). Use a
                        smaller value if you see out-of-memory errors.
  --max-running-requests MAX_RUNNING_REQUESTS
                        The maximum number of running requests.
  --max-queued-requests MAX_QUEUED_REQUESTS
                        The maximum number of queued requests. This option is
                        ignored when using disaggregation-mode.
  --max-total-tokens MAX_TOTAL_TOKENS
                        The maximum number of tokens in the memory pool. If
                        not specified, it will be automatically calculated
                        based on the memory usage fraction. This option is
                        typically used for development and debugging purposes.
  --chunked-prefill-size CHUNKED_PREFILL_SIZE
                        The maximum number of tokens in a chunk for the
                        chunked prefill. Setting this to -1 means disabling
                        chunked prefill.
  --max-prefill-tokens MAX_PREFILL_TOKENS
                        The maximum number of tokens in a prefill batch. The
                        real bound will be the maximum of this value and the
                        model's maximum context length.
  --schedule-policy {lpm,random,fcfs,dfs-weight,lof}
                        The scheduling policy of the requests.
  --schedule-conservativeness SCHEDULE_CONSERVATIVENESS
                        How conservative the schedule policy is. A larger
                        value means more conservative scheduling. Use a larger
                        value if you see requests being retracted frequently.
  --cpu-offload-gb CPU_OFFLOAD_GB
                        How many GBs of RAM to reserve for CPU offloading.
  --page-size PAGE_SIZE
                        The number of tokens in a page.
  --hybrid-kvcache-ratio [HYBRID_KVCACHE_RATIO]
                        Mix ratio in [0,1] between uniform and hybrid KV
                        buffers (0.0 = pure uniform: swa_size / full_size = 1;
                        1.0 = pure hybrid: swa_size / full_size =
                        local_attention_size / context_length).
  --swa-full-tokens-ratio SWA_FULL_TOKENS_RATIO
                        The ratio of SWA layer KV tokens / full layer KV
                        tokens, regardless of the number of swa:full layers.
                        It should be between 0 and 1. E.g. 0.5 means if each
                        swa layer has 50 tokens, then each full layer has 100
                        tokens.
  --disable-hybrid-swa-memory
                        Disable the hybrid SWA memory.
  --device DEVICE       The device to use ('cuda', 'xpu', 'hpu', 'npu',
                        'cpu'). Defaults to auto-detection if not specified.
  --tensor-parallel-size TENSOR_PARALLEL_SIZE, --tp-size TENSOR_PARALLEL_SIZE
                        The tensor parallelism size.
  --pipeline-parallel-size PIPELINE_PARALLEL_SIZE, --pp-size PIPELINE_PARALLEL_SIZE
                        The pipeline parallelism size.
  --max-micro-batch-size MAX_MICRO_BATCH_SIZE
                        The maximum micro batch size in pipeline parallelism.
  --stream-interval STREAM_INTERVAL
                        The interval (or buffer size) for streaming in terms
                        of token length. A smaller value makes streaming
                        smoother, while a larger value yields higher
                        throughput.
  --stream-output       Whether to output as a sequence of disjoint segments.
  --random-seed RANDOM_SEED
                        The random seed.
  --constrained-json-whitespace-pattern CONSTRAINED_JSON_WHITESPACE_PATTERN
                        (outlines backend only) Regex pattern for syntactic
                        whitespaces allowed in JSON constrained output. For
                        example, to allow the model generate consecutive
                        whitespaces, set the pattern to [ ]*
  --watchdog-timeout WATCHDOG_TIMEOUT
                        Set watchdog timeout in seconds. If a forward batch
                        takes longer than this, the server will crash to
                        prevent hanging.
  --dist-timeout DIST_TIMEOUT
                        Set timeout for torch.distributed initialization.
  --download-dir DOWNLOAD_DIR
                        Model download directory for huggingface.
  --base-gpu-id BASE_GPU_ID
                        The base GPU ID to start allocating GPUs from. Useful
                        when running multiple instances on the same machine.
  --gpu-id-step GPU_ID_STEP
                        The delta between consecutive GPU IDs that are used.
                        For example, setting it to 2 will use GPU 0,2,4,...
  --sleep-on-idle       Reduce CPU usage when sglang is idle.
  --log-level LOG_LEVEL
                        The logging level of all loggers.
  --log-level-http LOG_LEVEL_HTTP
                        The logging level of HTTP server. If not set, reuse
                        --log-level by default.
  --log-requests        Log metadata, inputs, outputs of all requests. The
                        verbosity is decided by --log-requests-level
  --log-requests-level {0,1,2,3}
                        0: Log metadata (no sampling parameters). 1: Log
                        metadata and sampling parameters. 2: Log metadata,
                        sampling parameters and partial input/output. 3: Log
                        every input/output.
  --crash-dump-folder CRASH_DUMP_FOLDER
                        Folder path to dump requests from the last 5 min
                        before a crash (if any). If not specified, crash
                        dumping is disabled.
  --show-time-cost      Show time cost of custom marks.
  --enable-metrics      Enable log prometheus metrics.
  --enable-metrics-for-all-schedulers
                        Enable --enable-metrics-for-all-schedulers when you
                        want schedulers on all TP ranks (not just TP 0) to
                        record request metrics separately. This is especially
                        useful when dp_attention is enabled, as otherwise all
                        metrics appear to come from TP 0.
  --bucket-time-to-first-token BUCKET_TIME_TO_FIRST_TOKEN [BUCKET_TIME_TO_FIRST_TOKEN ...]
                        The buckets of time to first token, specified as a
                        list of floats.
  --bucket-inter-token-latency BUCKET_INTER_TOKEN_LATENCY [BUCKET_INTER_TOKEN_LATENCY ...]
                        The buckets of inter-token latency, specified as a
                        list of floats.
  --bucket-e2e-request-latency BUCKET_E2E_REQUEST_LATENCY [BUCKET_E2E_REQUEST_LATENCY ...]
                        The buckets of end-to-end request latency, specified
                        as a list of floats.
  --collect-tokens-histogram
                        Collect prompt/generation tokens histogram.
  --decode-log-interval DECODE_LOG_INTERVAL
                        The log interval of decode batch.
  --enable-request-time-stats-logging
                        Enable per-request time stats logging.
  --kv-events-config KV_EVENTS_CONFIG
                        Config in json format for NVIDIA dynamo KV event
                        publishing. Publishing will be enabled if this flag is
                        used.
  --api-key API_KEY     Set API key of the server. It is also used in the
                        OpenAI API compatible server.
  --served-model-name SERVED_MODEL_NAME
                        Override the model name returned by the v1/models
                        endpoint in OpenAI API server.
  --chat-template CHAT_TEMPLATE
                        The built-in chat template name or the path of the
                        chat template file. This is only used for the OpenAI-
                        compatible API server.
  --completion-template COMPLETION_TEMPLATE
                        The built-in completion template name or the path of
                        the completion template file. This is only used for
                        the OpenAI-compatible API server, currently only for
                        code completion.
  --file-storage-path FILE_STORAGE_PATH
                        The path of the file storage in backend.
  --enable-cache-report
                        Return number of cached tokens in
                        usage.prompt_tokens_details for each openai request.
  --reasoning-parser {deepseek-r1,qwen3,qwen3-thinking,glm45,kimi,step3}
                        Specify the parser for reasoning models, supported
                        parsers are: ['deepseek-r1', 'qwen3',
                        'qwen3-thinking', 'glm45', 'kimi', 'step3'].
  --tool-call-parser {qwen25,mistral,llama3,deepseekv3,pythonic,kimi_k2,qwen3_coder,glm45,step3}
                        Specify the parser for handling tool-call
                        interactions. Options include: 'qwen25', 'mistral',
                        'llama3', 'deepseekv3', 'pythonic', 'kimi_k2',
                        'qwen3_coder', 'glm45', and 'step3'.
  --tool-server TOOL_SERVER
                        Either 'demo' or a comma-separated list of tool server
                        urls to use for the model. If not specified, no tool
                        server will be used.
  --data-parallel-size DATA_PARALLEL_SIZE, --dp-size DATA_PARALLEL_SIZE
                        The data parallelism size.
  --load-balance-method {round_robin,shortest_queue,minimum_tokens}
                        The load balancing strategy for data parallelism.
  --dist-init-addr DIST_INIT_ADDR, --nccl-init-addr DIST_INIT_ADDR
                        The host address for initializing distributed backend
                        (e.g., `192.168.0.2:25000`).
  --nnodes NNODES       The number of nodes.
  --node-rank NODE_RANK
                        The node rank.
  --json-model-override-args JSON_MODEL_OVERRIDE_ARGS
                        A dictionary in JSON string format used to override
                        default model configurations.
  --preferred-sampling-params PREFERRED_SAMPLING_PARAMS
                        json-formatted sampling settings that will be returned
                        in /get_model_info
  --enable-lora         Enable LoRA support for the model. This argument is
                        automatically set to True if `--lora-paths` is
                        provided for backward compatibility.
  --max-lora-rank MAX_LORA_RANK
                        The maximum rank of LoRA adapters. If not specified,
                        it will be automatically inferred from the adapters
                        provided in --lora-paths.
  --lora-target-modules [{q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj,all} ...]
                        The union set of all target modules where LoRA should
                        be applied. If not specified, it will be automatically
                        inferred from the adapters provided in --lora-paths.
                        If 'all' is specified, all supported modules will be
                        targeted.
  --lora-paths [LORA_PATHS ...]
                        The list of LoRA adapters. You can provide a list of
                        either path in str or renamed path in the format
                        {name}={path}.
  --max-loras-per-batch MAX_LORAS_PER_BATCH
                        Maximum number of adapters for a running batch,
                        including base-only requests.
  --max-loaded-loras MAX_LOADED_LORAS
                        If specified, it limits the maximum number of LoRA
                        adapters loaded in CPU memory at a time. The value
                        must be greater than or equal to `--max-loras-per-
                        batch`.
  --lora-backend LORA_BACKEND
                        Choose the kernel backend for multi-LoRA serving.
  --attention-backend {aiter,cutlass_mla,fa3,flashinfer,flashmla,intel_amx,torch_native,ascend,triton,trtllm_mla,trtllm_mha,dual_chunk_flash_attn}
                        Choose the kernels for attention layers.
  --prefill-attention-backend {aiter,cutlass_mla,fa3,flashinfer,flashmla,intel_amx,torch_native,ascend,triton,trtllm_mla,trtllm_mha,dual_chunk_flash_attn}
                        Choose the kernels for prefill attention layers (have
                        priority over --attention-backend).
  --decode-attention-backend {aiter,cutlass_mla,fa3,flashinfer,flashmla,intel_amx,torch_native,ascend,triton,trtllm_mla,trtllm_mha,dual_chunk_flash_attn}
                        Choose the kernels for decode attention layers (have
                        priority over --attention-backend).
  --sampling-backend {flashinfer,pytorch}
                        Choose the kernels for sampling layers.
  --grammar-backend {xgrammar,outlines,llguidance,none}
                        Choose the backend for grammar-guided decoding.
  --mm-attention-backend {sdpa,fa3,triton_attn}
                        Set multimodal attention backend.
  --speculative-algorithm {EAGLE,EAGLE3,NEXTN}
                        Speculative algorithm.
  --speculative-draft-model-path SPECULATIVE_DRAFT_MODEL_PATH
                        The path of the draft model weights. This can be a
                        local folder or a Hugging Face repo ID.
  --speculative-num-steps SPECULATIVE_NUM_STEPS
                        The number of steps sampled from draft model in
                        Speculative Decoding.
  --speculative-eagle-topk SPECULATIVE_EAGLE_TOPK
                        The number of tokens sampled from the draft model in
                        eagle2 each step.
  --speculative-num-draft-tokens SPECULATIVE_NUM_DRAFT_TOKENS
                        The number of tokens sampled from the draft model in
                        Speculative Decoding.
  --speculative-accept-threshold-single SPECULATIVE_ACCEPT_THRESHOLD_SINGLE
                        Accept a draft token if its probability in the target
                        model is greater than this threshold.
  --speculative-accept-threshold-acc SPECULATIVE_ACCEPT_THRESHOLD_ACC
                        The accept probability of a draft token is raised from
                        its target probability p to min(1, p / threshold_acc).
  --speculative-token-map SPECULATIVE_TOKEN_MAP
                        The path of the draft model's small vocab table.
  --expert-parallel-size EXPERT_PARALLEL_SIZE, --ep-size EXPERT_PARALLEL_SIZE, --ep EXPERT_PARALLEL_SIZE
                        The expert parallelism size.
  --moe-a2a-backend {deepep}
                        Choose the backend for MoE A2A.
  --enable-flashinfer-cutlass-moe
                        Enable FlashInfer CUTLASS MoE backend for modelopt_fp4
                        quant on Blackwell. Supports MoE-EP
  --enable-flashinfer-trtllm-moe
                        Enable FlashInfer TRTLLM MoE backend on Blackwell.
                        Supports BlockScale FP8 MoE-EP
  --enable-flashinfer-allreduce-fusion
                        Enable FlashInfer allreduce fusion for Add_RMSNorm.
  --deepep-mode {normal,low_latency,auto}
                        Select the mode when DeepEP MoE is enabled; can be
                        `normal`, `low_latency`, or `auto`. The default is
                        `auto`, which means `low_latency` for decode batches
                        and `normal` for prefill batches.
  --ep-num-redundant-experts EP_NUM_REDUNDANT_EXPERTS
                        Allocate this number of redundant experts in expert
                        parallel.
  --ep-dispatch-algorithm EP_DISPATCH_ALGORITHM
                        The algorithm to choose ranks for redundant experts in
                        expert parallel.
  --init-expert-location INIT_EXPERT_LOCATION
                        Initial location of EP experts.
  --enable-eplb         Enable EPLB algorithm
  --eplb-algorithm EPLB_ALGORITHM
                        Chosen EPLB algorithm
  --eplb-rebalance-num-iterations EPLB_REBALANCE_NUM_ITERATIONS
                        Number of iterations to automatically trigger an EPLB
                        re-balance.
  --eplb-rebalance-layers-per-chunk EPLB_REBALANCE_LAYERS_PER_CHUNK
                        Number of layers to rebalance per forward pass.
  --expert-distribution-recorder-mode EXPERT_DISTRIBUTION_RECORDER_MODE
                        Mode of expert distribution recorder.
  --expert-distribution-recorder-buffer-size EXPERT_DISTRIBUTION_RECORDER_BUFFER_SIZE
                        Circular buffer size of expert distribution recorder.
                        Set to -1 to denote infinite buffer.
  --enable-expert-distribution-metrics
                        Enable logging metrics for expert balancedness
  --deepep-config DEEPEP_CONFIG
                        Tuned DeepEP config suitable for your own cluster. It
                        can be either a string with JSON content or a file
                        path.
  --moe-dense-tp-size MOE_DENSE_TP_SIZE
                        TP size for MoE dense MLP layers. This flag is useful
                        when, with large TP size, there are errors caused by
                        weights in MLP layers having dimension smaller than
                        the min dimension GEMM supports.
  --enable-hierarchical-cache
                        Enable hierarchical cache
  --hicache-ratio HICACHE_RATIO
                        The ratio of the size of host KV cache memory pool to
                        the size of device pool.
  --hicache-size HICACHE_SIZE
                        The size of host KV cache memory pool in gigabytes,
                        which will override the hicache_ratio if set.
  --hicache-write-policy {write_back,write_through,write_through_selective}
                        The write policy of hierarchical cache.
  --hicache-io-backend {direct,kernel}
                        The IO backend for KV cache transfer between CPU and
                        GPU
  --hicache-mem-layout {layer_first,page_first}
                        The layout of host memory pool for hierarchical cache.
  --hicache-storage-backend {file,mooncake,hf3fs,nixl}
                        The storage backend for hierarchical KV cache.
  --hicache-storage-prefetch-policy {best_effort,wait_complete,timeout}
                        Control when prefetching from the storage backend
                        should stop.
  --enable-double-sparsity
                        Enable double sparsity attention
  --ds-channel-config-path DS_CHANNEL_CONFIG_PATH
                        The path of the double sparsity channel config
  --ds-heavy-channel-num DS_HEAVY_CHANNEL_NUM
                        The number of heavy channels in double sparsity
                        attention
  --ds-heavy-token-num DS_HEAVY_TOKEN_NUM
                        The number of heavy tokens in double sparsity
                        attention
  --ds-heavy-channel-type DS_HEAVY_CHANNEL_TYPE
                        The type of heavy channels in double sparsity
                        attention
  --ds-sparse-decode-threshold DS_SPARSE_DECODE_THRESHOLD
                        The sequence-length threshold for sparse decoding in
                        double sparsity attention
  --disable-radix-cache
                        Disable RadixAttention for prefix caching.
  --cuda-graph-max-bs CUDA_GRAPH_MAX_BS
                        Set the maximum batch size for cuda graph. It will
                        extend the cuda graph capture batch size to this
                        value.
  --cuda-graph-bs CUDA_GRAPH_BS [CUDA_GRAPH_BS ...]
                        Set the list of batch sizes for cuda graph.
  --disable-cuda-graph  Disable cuda graph.
  --disable-cuda-graph-padding
                        Disable cuda graph when padding is needed. Still uses
                        cuda graph when padding is not needed.
  --enable-profile-cuda-graph
                        Enable profiling of cuda graph capture.
  --enable-cudagraph-gc
                        Enable garbage collection during CUDA graph capture.
                        If disabled (default), GC is frozen during capture to
                        speed up the process.
  --enable-nccl-nvls    Enable NCCL NVLS for prefill heavy requests when
                        available.
  --enable-symm-mem     Enable NCCL symmetric memory for fast collectives.
  --enable-tokenizer-batch-encode
                        Enable batch tokenization for improved performance
                        when processing multiple text inputs. Do not use with
                        image inputs, pre-tokenized input_ids, or
                        input_embeds.
  --disable-outlines-disk-cache
                        Disable disk cache of outlines to avoid possible
                        crashes related to file system or high concurrency.
  --disable-custom-all-reduce
                        Disable the custom all-reduce kernel and fall back to
                        NCCL.
  --enable-mscclpp      Enable mscclpp for small-message all-reduce kernels,
                        falling back to NCCL otherwise.
  --disable-overlap-schedule
                        Disable the overlap scheduler, which overlaps the CPU
                        scheduler with GPU model worker.
  --enable-mixed-chunk  Enable mixing prefill and decode in a batch when
                        using chunked prefill.
  --enable-dp-attention
                        Enable data parallelism for attention and tensor
                        parallelism for FFN. The dp size should be equal to
                        the tp size. Currently DeepSeek-V2 and Qwen 2/3 MoE
                        models are supported.
  --enable-dp-lm-head   Enable vocabulary parallel across the attention TP
                        group to avoid all-gather across DP groups, optimizing
                        performance under DP attention.
  --enable-two-batch-overlap
                        Enable overlapping of two micro batches.
  --tbo-token-distribution-threshold TBO_TOKEN_DISTRIBUTION_THRESHOLD
                        The threshold of token distribution between two
                        batches in micro-batch overlap; determines whether to
                        use two-batch-overlap or two-chunk-overlap. Set to 0
                        to disable two-chunk-overlap.
  --enable-torch-compile
                        Optimize the model with torch.compile. Experimental
                        feature.
  --torch-compile-max-bs TORCH_COMPILE_MAX_BS
                        Set the maximum batch size when using torch compile.
  --torchao-config TORCHAO_CONFIG
                        Optimize the model with torchao. Experimental feature.
                        Current choices are: int8dq, int8wo,
                        int4wo-<group_size>, fp8wo, fp8dq-per_tensor, fp8dq-
                        per_row
  --enable-nan-detection
                        Enable the NaN detection for debugging purposes.
  --enable-p2p-check    Enable P2P check for GPU access, otherwise the p2p
                        access is allowed by default.
  --triton-attention-reduce-in-fp32
                        Cast the intermediate attention results to fp32 to
                        avoid possible crashes related to fp16. This only
                        affects Triton attention kernels.
  --triton-attention-num-kv-splits TRITON_ATTENTION_NUM_KV_SPLITS
                        The number of KV splits in flash decoding Triton
                        kernel. Larger value is better in longer context
                        scenarios. The default value is 8.
  --num-continuous-decode-steps NUM_CONTINUOUS_DECODE_STEPS
                        Run multiple continuous decoding steps to reduce
                        scheduling overhead. This can potentially increase
                        throughput but may also increase time-to-first-token
                        latency. The default value is 1, meaning only run one
                        decoding step at a time.
  --delete-ckpt-after-loading
                        Delete the model checkpoint after loading the model.
  --enable-memory-saver
                        Allow saving memory using release_memory_occupation
                        and resume_memory_occupation
  --allow-auto-truncate
                        Allow automatically truncating requests that exceed
                        the maximum input length instead of returning an
                        error.
  --enable-custom-logit-processor
                        Enable users to pass custom logit processors to the
                        server (disabled by default for security)
  --flashinfer-mla-disable-ragged
                        Do not use the ragged prefill wrapper when running
                        FlashInfer MLA.
  --disable-shared-experts-fusion
                        Disable shared experts fusion optimization for
                        deepseek v3/r1.
  --disable-chunked-prefix-cache
                        Disable chunked prefix cache feature for deepseek,
                        which should save overhead for short sequences.
  --disable-fast-image-processor
                        Adopt base image processor instead of fast image
                        processor.
  --enable-return-hidden-states
                        Enable returning hidden states with responses.
  --enable-triton-kernel-moe
                        Use triton moe grouped gemm kernel.
  --enable-flashinfer-mxfp4-moe
                        Enable FlashInfer MXFP4 MoE backend for modelopt_fp4
                        quant on Blackwell.
  --scheduler-recv-interval SCHEDULER_RECV_INTERVAL
                        The interval to poll requests in scheduler. Can be set
                        to >1 to reduce the overhead of this.
  --debug-tensor-dump-output-folder DEBUG_TENSOR_DUMP_OUTPUT_FOLDER
                        The output folder for dumping tensors.
  --debug-tensor-dump-input-file DEBUG_TENSOR_DUMP_INPUT_FILE
                        The input filename for dumping tensors
  --debug-tensor-dump-inject DEBUG_TENSOR_DUMP_INJECT
                        Inject the outputs from jax as the input of every
                        layer.
  --debug-tensor-dump-prefill-only
                        Only dump the tensors for prefill requests (i.e. batch
                        size > 1).
  --disaggregation-mode {null,prefill,decode}
                        Only used for PD disaggregation. "prefill" for
                        prefill-only server, and "decode" for decode-only
                        server. If not specified, it is not PD disaggregated
  --disaggregation-transfer-backend {mooncake,nixl,ascend}
                        The backend for disaggregation transfer. Default is
                        mooncake.
  --disaggregation-bootstrap-port DISAGGREGATION_BOOTSTRAP_PORT
                        Bootstrap server port on the prefill server. Default
                        is 8998.
  --disaggregation-decode-tp DISAGGREGATION_DECODE_TP
                        Decode tp size. If not set, it matches the tp size of
                        the current engine. This is only set on the prefill
                        server.
  --disaggregation-decode-dp DISAGGREGATION_DECODE_DP
                        Decode dp size. If not set, it matches the dp size of
                        the current engine. This is only set on the prefill
                        server.
  --disaggregation-prefill-pp DISAGGREGATION_PREFILL_PP
                        Prefill pp size. If not set, it is default to 1. This
                        is only set on the decode server.
  --disaggregation-ib-device DISAGGREGATION_IB_DEVICE
                        The InfiniBand devices for disaggregation transfer,
                        accepts single device (e.g., --disaggregation-ib-
                        device mlx5_0) or multiple comma-separated devices
                        (e.g., --disaggregation-ib-device mlx5_0,mlx5_1).
                        Default is None, which triggers automatic device
                        detection when mooncake backend is enabled.
  --num-reserved-decode-tokens NUM_RESERVED_DECODE_TOKENS
                        Number of decode tokens that will have memory reserved
                        when adding new request to the running batch.
  --pdlb-url PDLB_URL   The URL of the PD disaggregation load balancer. If
                        set, the prefill/decode server will register with the
                        load balancer.
  --custom-weight-loader [CUSTOM_WEIGHT_LOADER ...]
                        The custom weight loader used to update the model.
                        Should be set to a valid import path, such as
                        my_package.weight_load_func
  --enable-pdmux        Enable PD-Multiplexing, PD running on greenctx stream.
  --sm-group-num SM_GROUP_NUM
                        Number of sm partition groups.
  --weight-loader-disable-mmap
                        Disable mmap while loading weight using safetensors.
  --enable-ep-moe       (Deprecated) Enable expert parallelism for MoE. The
                        ep size is equal to the tp size.
  --enable-deepep-moe   (Deprecated) Enable the DeepEP MoE implementation for
                        EP MoE.

    python3 -m sglang.launch_server --model /root/.achatbot/models/openai/gpt-oss-20b --host 0.0.0.0 --port 30000 \
        --tp 1 
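With the server listening on port 30000, the OpenAI-compatible chat endpoint can be exercised directly. A minimal sketch, with the model path and port taken from the launch command above; the curl line is commented out since it needs the running server:

```shell
# Chat-completions payload for the gpt-oss-20b server launched above.
# Model path and port mirror the launch command; adjust to your deployment.
PAYLOAD='{
  "model": "/root/.achatbot/models/openai/gpt-oss-20b",
  "messages": [{"role": "user", "content": "Say hello in one sentence."}],
  "max_tokens": 64,
  "stream": false
}'

# Sanity-check the JSON before sending.
echo "$PAYLOAD" | python3 -m json.tool

# With the server running:
# curl -s http://localhost:30000/v1/chat/completions \
#     -H "Content-Type: application/json" -d "$PAYLOAD"
```

The same payload works against the Modal-deployed endpoint by swapping `localhost:30000` for the deployed URL.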

WARNING:sglang.srt.configs.model_config:mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
WARNING:sglang.srt.server_args:Detected GPT-OSS model, enabling triton_kernels MOE kernel.
[2025-08-10 11:05:45] server_args=ServerArgs(model_path='/root/.achatbot/models/openai/gpt-oss-20b', tokenizer_path='/root/.achatbot/models/openai/gpt-oss-20b', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=False, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='0.0.0.0', port=30000, skip_server_warmup=False, warmups=None, nccl_port=None, dtype='bfloat16', quantization=None, quantization_param_path=None, kv_cache_dtype='auto', mem_fraction_static=0.874, max_running_requests=None, max_queued_requests=9223372036854775807, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, cpu_offload_gb=0, page_size=1, hybrid_kvcache_ratio=None, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=True, device='cuda', tp_size=1, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=184604551, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, log_level='info', log_level_http=None, log_requests=False, log_requests_level=2, crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, api_key=None, served_model_name='/root/.achatbot/models/openai/gpt-oss-20b', chat_template=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, tool_server=None, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, 
enable_lora=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_backend='triton', attention_backend='triton', decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, speculative_algorithm=None, speculative_draft_model_path=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, ep_size=1, moe_a2a_backend=None, enable_flashinfer_cutlass_moe=False, enable_flashinfer_trtllm_moe=False, enable_flashinfer_allreduce_fusion=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm='static', init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through_selective', hicache_io_backend='kernel', hicache_mem_layout='layer_first', hicache_storage_backend=None, hicache_storage_prefetch_policy='best_effort', enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, cuda_graph_max_bs=None, cuda_graph_bs=None, disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_nccl_nvls=False, enable_symm_mem=False, enable_tokenizer_batch_encode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, disable_overlap_schedule=False, enable_mixed_chunk=False, 
enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, torch_compile_max_bs=32, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, enable_return_hidden_states=False, enable_triton_kernel_moe=True, enable_flashinfer_mxfp4_moe=False, scheduler_recv_interval=1, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, debug_tensor_dump_prefill_only=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, num_reserved_decode_tokens=512, pdlb_url=None, custom_weight_loader=[], weight_loader_disable_mmap=False, enable_pdmux=False, sm_group_num=3, enable_ep_moe=False, enable_deepep_moe=False)
[2025-08-10 11:05:45] Downcasting torch.float32 to torch.bfloat16.
[2025-08-10 11:05:45] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-08-10 11:05:46] Using default HuggingFace chat template with detected content format: string
[2025-08-10 11:05:53] Downcasting torch.float32 to torch.bfloat16.
[2025-08-10 11:05:53] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
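Once the server above is listening on port 30000, the raw HTTP API can be exercised the same way the `url_request` dev task does, via `urllib`. A minimal sketch, assuming the OpenAI-compatible `/v1/chat/completions` route exposed by sglang; the model name and base URL are illustrative, and the actual send (commented out) requires the server to be running:

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str, max_tokens: int = 128):
    """Build an OpenAI-compatible chat completion request for the sglang server."""
    url = f"{base_url}/v1/chat/completions"
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "stream": False,
    }
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )


# Usage, with the server above running locally:
# req = build_chat_request("http://127.0.0.1:30000", "openai/gpt-oss-20b", "Hello")
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```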

@weedge

weedge commented Aug 10, 2025

python3 -m sglang.bench_serving --help

usage: bench_serving.py [-h]
                        [--backend {sglang,sglang-native,sglang-oai,sglang-oai-chat,vllm,vllm-chat,lmdeploy,lmdeploy-chat,trt,gserver,truss}]
                        [--base-url BASE_URL] [--host HOST] [--port PORT]
                        [--dataset-name {sharegpt,random,random-ids,generated-shared-prefix,mmmu}]
                        [--dataset-path DATASET_PATH] [--model MODEL]
                        [--tokenizer TOKENIZER] [--num-prompts NUM_PROMPTS]
                        [--sharegpt-output-len SHAREGPT_OUTPUT_LEN]
                        [--sharegpt-context-len SHAREGPT_CONTEXT_LEN]
                        [--random-input-len RANDOM_INPUT_LEN]
                        [--random-output-len RANDOM_OUTPUT_LEN]
                        [--random-range-ratio RANDOM_RANGE_RATIO]
                        [--request-rate REQUEST_RATE]
                        [--max-concurrency MAX_CONCURRENCY]
                        [--output-file OUTPUT_FILE] [--output-details]
                        [--disable-tqdm] [--disable-stream] [--return-logprob]
                        [--seed SEED] [--disable-ignore-eos]
                        [--extra-request-body {"key1": "value1", "key2": "value2"}]
                        [--apply-chat-template] [--profile]
                        [--lora-name [LORA_NAME ...]]
                        [--prompt-suffix PROMPT_SUFFIX] [--pd-separated]
                        [--flush-cache] [--warmup-requests WARMUP_REQUESTS]
                        [--tokenize-prompt] [--gsp-num-groups GSP_NUM_GROUPS]
                        [--gsp-prompts-per-group GSP_PROMPTS_PER_GROUP]
                        [--gsp-system-prompt-len GSP_SYSTEM_PROMPT_LEN]
                        [--gsp-question-len GSP_QUESTION_LEN]
                        [--gsp-output-len GSP_OUTPUT_LEN]

Benchmark the online serving throughput.

options:
  -h, --help            show this help message and exit
  --backend {sglang,sglang-native,sglang-oai,sglang-oai-chat,vllm,vllm-chat,lmdeploy,lmdeploy-chat,trt,gserver,truss}
                        Must specify a backend, depending on the LLM Inference
                        Engine.
  --base-url BASE_URL   Server or API base url if not using http host and
                        port.
  --host HOST           Default host is 0.0.0.0.
  --port PORT           If not set, the default port is configured according
                        to its default value for different LLM Inference
                        Engines.
  --dataset-name {sharegpt,random,random-ids,generated-shared-prefix,mmmu}
                        Name of the dataset to benchmark on.
  --dataset-path DATASET_PATH
                        Path to the dataset.
  --model MODEL         Name or path of the model. If not set, the default
                        model will request /v1/models for conf.
  --tokenizer TOKENIZER
                        Name or path of the tokenizer. If not set, using the
                        model conf.
  --num-prompts NUM_PROMPTS
                        Number of prompts to process. Default is 1000.
  --sharegpt-output-len SHAREGPT_OUTPUT_LEN
                        Output length for each request. Overrides the output
                        length from the ShareGPT dataset.
  --sharegpt-context-len SHAREGPT_CONTEXT_LEN
                        The context length of the model for the ShareGPT
                        dataset. Requests longer than the context length will
                        be dropped.
  --random-input-len RANDOM_INPUT_LEN
                        Number of input tokens per request, used only for
                        random dataset.
  --random-output-len RANDOM_OUTPUT_LEN
                        Number of output tokens per request, used only for
                        random dataset.
  --random-range-ratio RANDOM_RANGE_RATIO
                        Range of sampled ratio of input/output length, used
                        only for random dataset.
  --request-rate REQUEST_RATE
                        Number of requests per second. If this is inf, then
                        all the requests are sent at time 0. Otherwise, we use
                        Poisson process to synthesize the request arrival
                        times. Default is inf.
  --max-concurrency MAX_CONCURRENCY
                        Maximum number of concurrent requests. This can be
                        used to help simulate an environment where a higher
                        level component is enforcing a maximum number of
                        concurrent requests. While the --request-rate argument
                        controls the rate at which requests are initiated,
                        this argument will control how many are actually
                        allowed to execute at a time. This means that when
                        used in combination, the actual request rate may be
                        lower than specified with --request-rate, if the
                        server is not processing requests fast enough to keep
                        up.
  --output-file OUTPUT_FILE
                        Output JSONL file name.
  --output-details      Output details of benchmarking.
  --disable-tqdm        Specify to disable tqdm progress bar.
  --disable-stream      Disable streaming mode.
  --return-logprob      Return logprob.
  --seed SEED           The random seed.
  --disable-ignore-eos  Disable ignoring EOS.
  --extra-request-body {"key1": "value1", "key2": "value2"}
                        Append given JSON object to the request payload. You
                        can use this to specify additional generate params like
                        sampling params.
  --apply-chat-template
                        Apply chat template
  --profile             Use Torch Profiler. The endpoint must be launched with
                        SGLANG_TORCH_PROFILER_DIR to enable profiler.
  --lora-name [LORA_NAME ...]
                        The names of LoRA adapters. You can provide a list of
                        names in the format {name} {name} {name}...
  --prompt-suffix PROMPT_SUFFIX
                        Suffix applied to the end of all user prompts,
                        followed by assistant prompt suffix.
  --pd-separated        Benchmark PD disaggregation server
  --flush-cache         Flush the cache before running the benchmark
  --warmup-requests WARMUP_REQUESTS
                        Number of warmup requests to run before the benchmark
  --tokenize-prompt     Use integer ids instead of string for inputs. Useful
                        to control prompt lengths accurately

generated-shared-prefix dataset arguments:
  --gsp-num-groups GSP_NUM_GROUPS
                        Number of system prompt groups for generated-shared-
                        prefix dataset
  --gsp-prompts-per-group GSP_PROMPTS_PER_GROUP
                        Number of prompts per system prompt group for
                        generated-shared-prefix dataset
  --gsp-system-prompt-len GSP_SYSTEM_PROMPT_LEN
                        Target length in tokens for system prompts in
                        generated-shared-prefix dataset
  --gsp-question-len GSP_QUESTION_LEN
                        Target length in tokens for questions in generated-
                        shared-prefix dataset
  --gsp-output-len GSP_OUTPUT_LEN
                        Target length in tokens for outputs in generated-
                        shared-prefix dataset
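The `benchmark` task presumably shells out to `bench_serving` with a subset of the flags documented above. A hedged sketch of how such an invocation could be assembled; the flag names come from the help text, but the backend/dataset choices and the 512-input/1024-output token lengths are assumptions chosen to match the reported runs (e.g. 20 prompts, 10240 input and 20480 generated tokens), not necessarily what the Modal task actually does:

```python
def bench_serving_cmd(base_url: str, num_prompts: int, max_concurrency: int,
                      input_len: int = 512, output_len: int = 1024) -> list:
    """Assemble a bench_serving command line from the flags in the help text."""
    return [
        "python3", "-m", "sglang.bench_serving",
        "--backend", "sglang-oai",            # OpenAI-compatible sglang endpoint
        "--base-url", base_url,
        "--dataset-name", "random",           # synthetic prompts of controlled length
        "--random-input-len", str(input_len),
        "--random-output-len", str(output_len),
        "--num-prompts", str(num_prompts),
        "--max-concurrency", str(max_concurrency),
    ]
```

The list can then be run with `subprocess.run(bench_serving_cmd(...), check=True)` against a live server.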

@weedge

weedge commented Aug 10, 2025

LLM_MODEL=openai/gpt-oss-20b SERVE_IMAGE_GPU=H100 TP=1 modal run src/llm/sglang/openai_gpt_oss.py::main --task benchmark --num-prompts 20 --max-concurrency 4
============ Serving Benchmark Result ============
Backend:                                 sglang-oai
Traffic request rate:                    inf       
Max request concurrency:                 4         
Successful requests:                     20        
Benchmark duration (s):                  42.49     
Total input tokens:                      10240     
Total generated tokens:                  20480     
Total generated tokens (retokenized):    19627     
Request throughput (req/s):              0.47      
Input token throughput (tok/s):          240.98    
Output token throughput (tok/s):         481.95    
Total token throughput (tok/s):          722.93    
Concurrency:                             3.97      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   8432.23   
Median E2E Latency (ms):                 8560.30   
---------------Time to First Token----------------
Mean TTFT (ms):                          2731.00   
Median TTFT (ms):                        2465.51   
P99 TTFT (ms):                           6884.07   
---------------Inter-Token Latency----------------
Mean ITL (ms):                           5.60      
Median ITL (ms):                         0.02      
P95 ITL (ms):                            32.63     
P99 ITL (ms):                            39.79     
Max ITL (ms):                            2999.22   
==================================================

@weedge

weedge commented Aug 10, 2025

LLM_MODEL=openai/gpt-oss-20b SERVE_IMAGE_GPU=H100 TP=1 modal run src/llm/sglang/openai_gpt_oss.py::main --task benchmark --num-prompts 40 --max-concurrency 8
============ Serving Benchmark Result ============
Backend:                                 sglang-oai
Traffic request rate:                    inf       
Max request concurrency:                 8         
Successful requests:                     40        
Benchmark duration (s):                  51.22     
Total input tokens:                      20480     
Total generated tokens:                  40960     
Total generated tokens (retokenized):    38888     
Request throughput (req/s):              0.78      
Input token throughput (tok/s):          399.84    
Output token throughput (tok/s):         799.67    
Total token throughput (tok/s):          1199.51   
Concurrency:                             7.97      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   10208.51  
Median E2E Latency (ms):                 11467.30  
---------------Time to First Token----------------
Mean TTFT (ms):                          3808.40   
Median TTFT (ms):                        4905.19   
P99 TTFT (ms):                           7524.85   
---------------Inter-Token Latency----------------
Mean ITL (ms):                           6.27      
Median ITL (ms):                         0.02      
P95 ITL (ms):                            43.42     
P99 ITL (ms):                            85.97     
Max ITL (ms):                            5036.40   
==================================================

@weedge

weedge commented Aug 10, 2025

LLM_MODEL=openai/gpt-oss-20b SERVE_IMAGE_GPU=H100 TP=1 modal run src/llm/sglang/openai_gpt_oss.py::main --task benchmark --num-prompts 80 --max-concurrency 16
============ Serving Benchmark Result ============
Backend:                                 sglang-oai
Traffic request rate:                    inf       
Max request concurrency:                 16        
Successful requests:                     80        
Benchmark duration (s):                  51.98     
Total input tokens:                      40960     
Total generated tokens:                  81920     
Total generated tokens (retokenized):    78033     
Request throughput (req/s):              1.54      
Input token throughput (tok/s):          787.96    
Output token throughput (tok/s):         1575.92   
Total token throughput (tok/s):          2363.88   
Concurrency:                             15.94     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   10356.59  
Median E2E Latency (ms):                 9502.49   
---------------Time to First Token----------------
Mean TTFT (ms):                          3505.50   
Median TTFT (ms):                        2855.67   
P99 TTFT (ms):                           9155.96   
---------------Inter-Token Latency----------------
Mean ITL (ms):                           6.72      
Median ITL (ms):                         0.02      
P95 ITL (ms):                            34.40     
P99 ITL (ms):                            45.11     
Max ITL (ms):                            5738.54   
==================================================

@weedge

weedge commented Aug 10, 2025

LLM_MODEL=openai/gpt-oss-20b SERVE_IMAGE_GPU=H100 TP=1 modal run src/llm/sglang/openai_gpt_oss.py::main --task benchmark --num-prompts 160 --max-concurrency 32
============ Serving Benchmark Result ============
Backend:                                 sglang-oai
Traffic request rate:                    inf       
Max request concurrency:                 32        
Successful requests:                     160       
Benchmark duration (s):                  59.25     
Total input tokens:                      81920     
Total generated tokens:                  163840    
Total generated tokens (retokenized):    156876    
Request throughput (req/s):              2.70      
Input token throughput (tok/s):          1382.70   
Output token throughput (tok/s):         2765.40   
Total token throughput (tok/s):          4148.10   
Concurrency:                             31.78     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   11769.33  
Median E2E Latency (ms):                 9837.10   
---------------Time to First Token----------------
Mean TTFT (ms):                          3721.89   
Median TTFT (ms):                        2106.69   
P99 TTFT (ms):                           7713.94   
---------------Inter-Token Latency----------------
Mean ITL (ms):                           7.88      
Median ITL (ms):                         0.01      
P95 ITL (ms):                            50.20     
P99 ITL (ms):                            61.79     
Max ITL (ms):                            6138.13   
==================================================

@weedge

weedge commented Aug 10, 2025

LLM_MODEL=openai/gpt-oss-20b SERVE_IMAGE_GPU=H100 TP=1 modal run src/llm/sglang/openai_gpt_oss.py::main --task benchmark --num-prompts 320 --max-concurrency 64
============ Serving Benchmark Result ============
Backend:                                 sglang-oai
Traffic request rate:                    inf       
Max request concurrency:                 64        
Successful requests:                     320       
Benchmark duration (s):                  109.66    
Total input tokens:                      163840    
Total generated tokens:                  327680    
Total generated tokens (retokenized):    313361    
Request throughput (req/s):              2.92      
Input token throughput (tok/s):          1494.05   
Output token throughput (tok/s):         2988.10   
Total token throughput (tok/s):          4482.15   
Concurrency:                             43.10     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   14768.59  
Median E2E Latency (ms):                 13006.53  
---------------Time to First Token----------------
Mean TTFT (ms):                          3893.74   
Median TTFT (ms):                        2715.66   
P99 TTFT (ms):                           11028.79  
---------------Inter-Token Latency----------------
Mean ITL (ms):                           10.65     
Median ITL (ms):                         0.01      
P95 ITL (ms):                            79.58     
P99 ITL (ms):                            96.64     
Max ITL (ms):                            55488.42  
==================================================

PS: As the number of concurrent requests grows, requests start queueing (the measured Concurrency drops noticeably below the configured maximum) and TTFT latency rises. On a single H100 GPU, sglang serving the gpt-oss-20b model (/v1/completions inference) has reached saturation in the number of concurrent requests it can handle.
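The saturation point is visible in the numbers themselves. A quick check of how total token throughput scales each time concurrency doubles, with the values copied from the result tables above:

```python
# (max_concurrency, total token throughput tok/s) copied from the runs above
runs = [(4, 722.93), (8, 1199.51), (16, 2363.88), (32, 4148.10), (64, 4482.15)]

# Throughput multiplier for each doubling of concurrency.
speedups = [(c1, round(t1 / t0, 2)) for (c0, t0), (c1, t1) in zip(runs, runs[1:])]

for c, s in speedups:
    print(f"doubling concurrency to {c}: {s:.2f}x total throughput")
# The 16 -> 32 step still scales (~1.75x), but 32 -> 64 yields only ~1.08x:
# the server saturates well before concurrency 64.
```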

Signed-off-by: weedge <weege007@gmail.com>
@weedge weedge merged commit 1c93536 into main Aug 10, 2025
@ghostplant

@weedge Which version supports running "openai/gpt-oss-120b" on 1 A100 or 1 H100?

@weedge

weedge commented Aug 11, 2025

@weedge Which version supports running "openai/gpt-oss-120b" on 1 A100 or 1 H100?

@ghostplant See PR #179; you should find the answer there.
