Conversation
Signed-off-by: weedge <weege007@gmail.com>
Summary of Changes
Hello @weedge, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
I've implemented a new feature to enable serving and benchmarking of OpenAI GPT-OSS models using SGLang on Modal. This involves setting up a dedicated Modal application for SGLang, complete with environment configuration, model serving capabilities, and integrated benchmarking tools. Additionally, I've made minor adjustments to existing evaluation scripts to improve clarity and ensure proper directory setup, and updated the VLLM serving configuration to use a different port and adjusted concurrency settings.
Highlights
- New SGLang Integration: I've added a new feature that integrates SGLang for serving and benchmarking OpenAI GPT-OSS models on the Modal platform, providing a new pathway for deploying and evaluating these models.
- New File for SGLang Logic: I've introduced a new Python file, `deploy/modal/src/llm/sglang/openai_gpt_oss.py`, which encapsulates all the necessary logic for setting up the SGLang environment, serving models, and running benchmarks.
- Modal App Renaming for Clarity: I've updated the Modal application names in `evalscope.py` and `lighteval.py` to be more descriptive, changing them from `openai_gpt_oss_trl` to `openai_gpt_oss_evalscope` and `openai_gpt_oss_lighteval` respectively, improving clarity and organization.
- Automated Directory Creation: I've ensured that the evaluation output directories are automatically created in `evalscope.py` and `lighteval.py` by adding `os.makedirs(eval_out_dir, exist_ok=True)`, preventing potential errors during evaluation runs.
- VLLM Serving Port and Concurrency Adjustment: I've adjusted the default serving port for VLLM in `openai_gpt_oss.py` from 8801 to 8000 and changed the `target_inputs` for `modal.concurrent` from 10 to 4, which might impact concurrency and resource allocation for VLLM deployments.
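The directory-creation change in the eval scripts uses the standard `os.makedirs(..., exist_ok=True)` idiom, which is idempotent, so repeated evaluation runs can't fail on an existing directory. A minimal standalone sketch (the path here is illustrative, not taken from the PR):

```python
import os
import tempfile

# Create the eval output directory up front; exist_ok=True turns the
# call into a no-op when the directory already exists, so repeated
# runs don't raise FileExistsError.
eval_out_dir = os.path.join(tempfile.gettempdir(), "eval_out", "run1")
os.makedirs(eval_out_dir, exist_ok=True)
os.makedirs(eval_out_dir, exist_ok=True)  # safe to call again
print(os.path.isdir(eval_out_dir))
```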
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments or fill out our survey to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩
Code Review
This pull request adds support for serving and benchmarking OpenAI's GPT-OSS models using sglang on Modal. It introduces a new script for this purpose and makes minor adjustments to existing evaluation scripts. My review focuses on the new sglang script, where I've identified a critical issue with a Python package dependency that would break the build, as well as several high and medium severity issues related to correctness, code quality, and maintainability. I've provided suggestions to fix these issues.
| "pip3 install torch==2.8.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu129" | ||
| ) | ||
| .run_commands( | ||
| "pip3 install https://github.com/sgl-project/whl/releases/download/v0.3.3/sgl_kernel-0.3.3-cp39-abi3-manylinux2014_x86_64.whl --force-reinstall" |
The Docker image is configured to use Python 3.10 (add_python="3.10" on line 23), but this command attempts to install a wheel for Python 3.9 (cp39). This will cause the image build to fail. You should use the wheel compatible with Python 3.10, which is available.
| "pip3 install https://github.com/sgl-project/whl/releases/download/v0.3.3/sgl_kernel-0.3.3-cp39-abi3-manylinux2014_x86_64.whl --force-reinstall" | |
| "pip3 install https://github.com/sgl-project/whl/releases/download/v0.3.3/sgl_kernel-0.3.3-cp310-abi3-manylinux2014_x86_64.whl --force-reinstall" |
The wheels at https://github.com/sgl-project/whl/releases/tag/v0.3.3 include no cp310 x86_64 build, but using the cp39 wheel is fine for Python 3.10 (the abi3 tag makes it forward-compatible).
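The reason the cp39 wheel works on Python 3.10 is the `abi3` tag: a `cpXY-abi3` wheel targets CPython's stable ABI introduced in 3.Y and therefore runs on any later CPython as well. A stdlib-only sketch of that compatibility rule (the helper name is illustrative, not a real packaging API):

```python
import sys

def abi3_compatible(tag_minor: int) -> bool:
    """Return True if a cp3{tag_minor}-abi3 wheel can run on this interpreter.

    abi3 wheels use CPython's stable ABI, so a wheel built for 3.Y is
    forward-compatible with every CPython >= 3.Y.
    """
    return (
        sys.implementation.name == "cpython"
        and sys.version_info[:2] >= (3, tag_minor)
    )

# e.g. the sgl_kernel cp39-abi3 wheel on a Python 3.10 image:
print(abi3_compatible(9))
```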
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
python3 -m sglang.launch_server --help

usage: launch_server.py [-h] --model-path MODEL_PATH
[--tokenizer-path TOKENIZER_PATH]
[--tokenizer-mode {auto,slow}] [--skip-tokenizer-init]
[--load-format {auto,pt,safetensors,npcache,dummy,sharded_state,gguf,bitsandbytes,layered,remote}]
[--model-loader-extra-config MODEL_LOADER_EXTRA_CONFIG]
[--trust-remote-code]
[--context-length CONTEXT_LENGTH] [--is-embedding]
[--enable-multimodal] [--revision REVISION]
[--model-impl MODEL_IMPL] [--host HOST] [--port PORT]
[--skip-server-warmup] [--warmups WARMUPS]
[--nccl-port NCCL_PORT]
[--dtype {auto,half,float16,bfloat16,float,float32}]
[--quantization {awq,fp8,gptq,marlin,gptq_marlin,awq_marlin,bitsandbytes,gguf,modelopt,modelopt_fp4,petit_nvfp4,w8a8_int8,w8a8_fp8,moe_wna16,qoq,w4afp8,mxfp4}]
[--quantization-param-path QUANTIZATION_PARAM_PATH]
[--kv-cache-dtype {auto,fp8_e5m2,fp8_e4m3}]
[--mem-fraction-static MEM_FRACTION_STATIC]
[--max-running-requests MAX_RUNNING_REQUESTS]
[--max-queued-requests MAX_QUEUED_REQUESTS]
[--max-total-tokens MAX_TOTAL_TOKENS]
[--chunked-prefill-size CHUNKED_PREFILL_SIZE]
[--max-prefill-tokens MAX_PREFILL_TOKENS]
[--schedule-policy {lpm,random,fcfs,dfs-weight,lof}]
[--schedule-conservativeness SCHEDULE_CONSERVATIVENESS]
[--cpu-offload-gb CPU_OFFLOAD_GB]
[--page-size PAGE_SIZE]
[--hybrid-kvcache-ratio [HYBRID_KVCACHE_RATIO]]
[--swa-full-tokens-ratio SWA_FULL_TOKENS_RATIO]
[--disable-hybrid-swa-memory] [--device DEVICE]
[--tensor-parallel-size TENSOR_PARALLEL_SIZE]
[--pipeline-parallel-size PIPELINE_PARALLEL_SIZE]
[--max-micro-batch-size MAX_MICRO_BATCH_SIZE]
[--stream-interval STREAM_INTERVAL] [--stream-output]
[--random-seed RANDOM_SEED]
[--constrained-json-whitespace-pattern CONSTRAINED_JSON_WHITESPACE_PATTERN]
[--watchdog-timeout WATCHDOG_TIMEOUT]
[--dist-timeout DIST_TIMEOUT]
[--download-dir DOWNLOAD_DIR]
[--base-gpu-id BASE_GPU_ID]
[--gpu-id-step GPU_ID_STEP] [--sleep-on-idle]
[--log-level LOG_LEVEL]
[--log-level-http LOG_LEVEL_HTTP] [--log-requests]
[--log-requests-level {0,1,2,3}]
[--crash-dump-folder CRASH_DUMP_FOLDER]
[--show-time-cost] [--enable-metrics]
[--enable-metrics-for-all-schedulers]
[--bucket-time-to-first-token BUCKET_TIME_TO_FIRST_TOKEN [BUCKET_TIME_TO_FIRST_TOKEN ...]]
[--bucket-inter-token-latency BUCKET_INTER_TOKEN_LATENCY [BUCKET_INTER_TOKEN_LATENCY ...]]
[--bucket-e2e-request-latency BUCKET_E2E_REQUEST_LATENCY [BUCKET_E2E_REQUEST_LATENCY ...]]
[--collect-tokens-histogram]
[--decode-log-interval DECODE_LOG_INTERVAL]
[--enable-request-time-stats-logging]
[--kv-events-config KV_EVENTS_CONFIG]
[--api-key API_KEY]
[--served-model-name SERVED_MODEL_NAME]
[--chat-template CHAT_TEMPLATE]
[--completion-template COMPLETION_TEMPLATE]
[--file-storage-path FILE_STORAGE_PATH]
[--enable-cache-report]
[--reasoning-parser {deepseek-r1,qwen3,qwen3-thinking,glm45,kimi,step3}]
[--tool-call-parser {qwen25,mistral,llama3,deepseekv3,pythonic,kimi_k2,qwen3_coder,glm45,step3}]
[--tool-server TOOL_SERVER]
[--data-parallel-size DATA_PARALLEL_SIZE]
[--load-balance-method {round_robin,shortest_queue,minimum_tokens}]
[--dist-init-addr DIST_INIT_ADDR] [--nnodes NNODES]
[--node-rank NODE_RANK]
[--json-model-override-args JSON_MODEL_OVERRIDE_ARGS]
[--preferred-sampling-params PREFERRED_SAMPLING_PARAMS]
[--enable-lora] [--max-lora-rank MAX_LORA_RANK]
[--lora-target-modules [{q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj,all} ...]]
[--lora-paths [LORA_PATHS ...]]
[--max-loras-per-batch MAX_LORAS_PER_BATCH]
[--max-loaded-loras MAX_LOADED_LORAS]
[--lora-backend LORA_BACKEND]
[--attention-backend {aiter,cutlass_mla,fa3,flashinfer,flashmla,intel_amx,torch_native,ascend,triton,trtllm_mla,trtllm_mha,dual_chunk_flash_attn}]
[--prefill-attention-backend {aiter,cutlass_mla,fa3,flashinfer,flashmla,intel_amx,torch_native,ascend,triton,trtllm_mla,trtllm_mha,dual_chunk_flash_attn}]
[--decode-attention-backend {aiter,cutlass_mla,fa3,flashinfer,flashmla,intel_amx,torch_native,ascend,triton,trtllm_mla,trtllm_mha,dual_chunk_flash_attn}]
[--sampling-backend {flashinfer,pytorch}]
[--grammar-backend {xgrammar,outlines,llguidance,none}]
[--mm-attention-backend {sdpa,fa3,triton_attn}]
[--speculative-algorithm {EAGLE,EAGLE3,NEXTN}]
[--speculative-draft-model-path SPECULATIVE_DRAFT_MODEL_PATH]
[--speculative-num-steps SPECULATIVE_NUM_STEPS]
[--speculative-eagle-topk SPECULATIVE_EAGLE_TOPK]
[--speculative-num-draft-tokens SPECULATIVE_NUM_DRAFT_TOKENS]
[--speculative-accept-threshold-single SPECULATIVE_ACCEPT_THRESHOLD_SINGLE]
[--speculative-accept-threshold-acc SPECULATIVE_ACCEPT_THRESHOLD_ACC]
[--speculative-token-map SPECULATIVE_TOKEN_MAP]
[--expert-parallel-size EXPERT_PARALLEL_SIZE]
[--moe-a2a-backend {deepep}]
[--enable-flashinfer-cutlass-moe]
[--enable-flashinfer-trtllm-moe]
[--enable-flashinfer-allreduce-fusion]
[--deepep-mode {normal,low_latency,auto}]
[--ep-num-redundant-experts EP_NUM_REDUNDANT_EXPERTS]
[--ep-dispatch-algorithm EP_DISPATCH_ALGORITHM]
[--init-expert-location INIT_EXPERT_LOCATION]
[--enable-eplb] [--eplb-algorithm EPLB_ALGORITHM]
[--eplb-rebalance-num-iterations EPLB_REBALANCE_NUM_ITERATIONS]
[--eplb-rebalance-layers-per-chunk EPLB_REBALANCE_LAYERS_PER_CHUNK]
[--expert-distribution-recorder-mode EXPERT_DISTRIBUTION_RECORDER_MODE]
[--expert-distribution-recorder-buffer-size EXPERT_DISTRIBUTION_RECORDER_BUFFER_SIZE]
[--enable-expert-distribution-metrics]
[--deepep-config DEEPEP_CONFIG]
[--moe-dense-tp-size MOE_DENSE_TP_SIZE]
[--enable-hierarchical-cache]
[--hicache-ratio HICACHE_RATIO]
[--hicache-size HICACHE_SIZE]
[--hicache-write-policy {write_back,write_through,write_through_selective}]
[--hicache-io-backend {direct,kernel}]
[--hicache-mem-layout {layer_first,page_first}]
[--hicache-storage-backend {file,mooncake,hf3fs,nixl}]
[--hicache-storage-prefetch-policy {best_effort,wait_complete,timeout}]
[--enable-double-sparsity]
[--ds-channel-config-path DS_CHANNEL_CONFIG_PATH]
[--ds-heavy-channel-num DS_HEAVY_CHANNEL_NUM]
[--ds-heavy-token-num DS_HEAVY_TOKEN_NUM]
[--ds-heavy-channel-type DS_HEAVY_CHANNEL_TYPE]
[--ds-sparse-decode-threshold DS_SPARSE_DECODE_THRESHOLD]
[--disable-radix-cache]
[--cuda-graph-max-bs CUDA_GRAPH_MAX_BS]
[--cuda-graph-bs CUDA_GRAPH_BS [CUDA_GRAPH_BS ...]]
[--disable-cuda-graph] [--disable-cuda-graph-padding]
[--enable-profile-cuda-graph] [--enable-cudagraph-gc]
[--enable-nccl-nvls] [--enable-symm-mem]
[--enable-tokenizer-batch-encode]
[--disable-outlines-disk-cache]
[--disable-custom-all-reduce] [--enable-mscclpp]
[--disable-overlap-schedule] [--enable-mixed-chunk]
[--enable-dp-attention] [--enable-dp-lm-head]
[--enable-two-batch-overlap]
[--tbo-token-distribution-threshold TBO_TOKEN_DISTRIBUTION_THRESHOLD]
[--enable-torch-compile]
[--torch-compile-max-bs TORCH_COMPILE_MAX_BS]
[--torchao-config TORCHAO_CONFIG]
[--enable-nan-detection] [--enable-p2p-check]
[--triton-attention-reduce-in-fp32]
[--triton-attention-num-kv-splits TRITON_ATTENTION_NUM_KV_SPLITS]
[--num-continuous-decode-steps NUM_CONTINUOUS_DECODE_STEPS]
[--delete-ckpt-after-loading] [--enable-memory-saver]
[--allow-auto-truncate]
[--enable-custom-logit-processor]
[--flashinfer-mla-disable-ragged]
[--disable-shared-experts-fusion]
[--disable-chunked-prefix-cache]
[--disable-fast-image-processor]
[--enable-return-hidden-states]
[--enable-triton-kernel-moe]
[--enable-flashinfer-mxfp4-moe]
[--scheduler-recv-interval SCHEDULER_RECV_INTERVAL]
[--debug-tensor-dump-output-folder DEBUG_TENSOR_DUMP_OUTPUT_FOLDER]
[--debug-tensor-dump-input-file DEBUG_TENSOR_DUMP_INPUT_FILE]
[--debug-tensor-dump-inject DEBUG_TENSOR_DUMP_INJECT]
[--debug-tensor-dump-prefill-only]
[--disaggregation-mode {null,prefill,decode}]
[--disaggregation-transfer-backend {mooncake,nixl,ascend}]
[--disaggregation-bootstrap-port DISAGGREGATION_BOOTSTRAP_PORT]
[--disaggregation-decode-tp DISAGGREGATION_DECODE_TP]
[--disaggregation-decode-dp DISAGGREGATION_DECODE_DP]
[--disaggregation-prefill-pp DISAGGREGATION_PREFILL_PP]
[--disaggregation-ib-device DISAGGREGATION_IB_DEVICE]
[--num-reserved-decode-tokens NUM_RESERVED_DECODE_TOKENS]
[--pdlb-url PDLB_URL]
[--custom-weight-loader [CUSTOM_WEIGHT_LOADER ...]]
[--enable-pdmux] [--sm-group-num SM_GROUP_NUM]
[--weight-loader-disable-mmap] [--enable-ep-moe]
[--enable-deepep-moe]
options:
-h, --help show this help message and exit
--model-path MODEL_PATH, --model MODEL_PATH
The path of the model weights. This can be a local
folder or a Hugging Face repo ID.
--tokenizer-path TOKENIZER_PATH
The path of the tokenizer.
--tokenizer-mode {auto,slow}
Tokenizer mode. 'auto' will use the fast tokenizer if
available, and 'slow' will always use the slow
tokenizer.
--skip-tokenizer-init
If set, skip init tokenizer and pass input_ids in
generate request.
--load-format {auto,pt,safetensors,npcache,dummy,sharded_state,gguf,bitsandbytes,layered,remote}
The format of the model weights to load. "auto" will
try to load the weights in the safetensors format and
fall back to the pytorch bin format if safetensors
format is not available. "pt" will load the weights in
the pytorch bin format. "safetensors" will load the
weights in the safetensors format. "npcache" will load
the weights in pytorch format and store a numpy cache
to speed up the loading. "dummy" will initialize the
weights with random values, which is mainly for
profiling."gguf" will load the weights in the gguf
format. "bitsandbytes" will load the weights using
bitsandbytes quantization."layered" loads weights
layer by layer so that one can quantize a layer before
loading another to make the peak memory envelope
smaller.
--model-loader-extra-config MODEL_LOADER_EXTRA_CONFIG
Extra config for model loader. This will be passed to
the model loader corresponding to the chosen
load_format.
--trust-remote-code Whether or not to allow for custom models defined on
the Hub in their own modeling files.
--context-length CONTEXT_LENGTH
The model's maximum context length. Defaults to None
(will use the value from the model's config.json
instead).
--is-embedding Whether to use a CausalLM as an embedding model.
--enable-multimodal Enable the multimodal functionality for the served
model. If the model being served is not multimodal,
nothing will happen
--revision REVISION The specific model version to use. It can be a branch
name, a tag name, or a commit id. If unspecified, will
use the default version.
--model-impl MODEL_IMPL
Which implementation of the model to use. * "auto"
will try to use the SGLang implementation if it exists
and fall back to the Transformers implementation if no
SGLang implementation is available. * "sglang" will
use the SGLang model implementation. * "transformers"
will use the Transformers model implementation.
--host HOST The host of the HTTP server.
--port PORT The port of the HTTP server.
--skip-server-warmup If set, skip warmup.
--warmups WARMUPS Specify custom warmup functions (csv) to run before
server starts eg. --warmups=warmup_name1,warmup_name2
will run the functions `warmup_name1` and
`warmup_name2` specified in warmup.py before the
server starts listening for requests
--nccl-port NCCL_PORT
The port for NCCL distributed environment setup.
Defaults to a random port.
--dtype {auto,half,float16,bfloat16,float,float32}
Data type for model weights and activations. * "auto"
will use FP16 precision for FP32 and FP16 models, and
BF16 precision for BF16 models. * "half" for FP16.
Recommended for AWQ quantization. * "float16" is the
same as "half". * "bfloat16" for a balance between
precision and range. * "float" is shorthand for FP32
precision. * "float32" for FP32 precision.
--quantization {awq,fp8,gptq,marlin,gptq_marlin,awq_marlin,bitsandbytes,gguf,modelopt,modelopt_fp4,petit_nvfp4,w8a8_int8,w8a8_fp8,moe_wna16,qoq,w4afp8,mxfp4}
The quantization method.
--quantization-param-path QUANTIZATION_PARAM_PATH
Path to the JSON file containing the KV cache scaling
factors. This should generally be supplied, when KV
cache dtype is FP8. Otherwise, KV cache scaling
factors default to 1.0, which may cause accuracy
issues.
--kv-cache-dtype {auto,fp8_e5m2,fp8_e4m3}
Data type for kv cache storage. "auto" will use model
data type. "fp8_e5m2" and "fp8_e4m3" is supported for
CUDA 11.8+.
--mem-fraction-static MEM_FRACTION_STATIC
The fraction of the memory used for static allocation
(model weights and KV cache memory pool). Use a
smaller value if you see out-of-memory errors.
--max-running-requests MAX_RUNNING_REQUESTS
The maximum number of running requests.
--max-queued-requests MAX_QUEUED_REQUESTS
The maximum number of queued requests. This option is
ignored when using disaggregation-mode.
--max-total-tokens MAX_TOTAL_TOKENS
The maximum number of tokens in the memory pool. If
not specified, it will be automatically calculated
based on the memory usage fraction. This option is
typically used for development and debugging purposes.
--chunked-prefill-size CHUNKED_PREFILL_SIZE
The maximum number of tokens in a chunk for the
chunked prefill. Setting this to -1 means disabling
chunked prefill.
--max-prefill-tokens MAX_PREFILL_TOKENS
The maximum number of tokens in a prefill batch. The
real bound will be the maximum of this value and the
model's maximum context length.
--schedule-policy {lpm,random,fcfs,dfs-weight,lof}
The scheduling policy of the requests.
--schedule-conservativeness SCHEDULE_CONSERVATIVENESS
How conservative the schedule policy is. A larger
value means more conservative scheduling. Use a larger
value if you see requests being retracted frequently.
--cpu-offload-gb CPU_OFFLOAD_GB
How many GBs of RAM to reserve for CPU offloading.
--page-size PAGE_SIZE
The number of tokens in a page.
--hybrid-kvcache-ratio [HYBRID_KVCACHE_RATIO]
Mix ratio in [0,1] between uniform and hybrid kv
buffers (0.0 = pure uniform: swa_size / full_size =
1)(1.0 = pure hybrid: swa_size / full_size =
local_attention_size / context_length)
--swa-full-tokens-ratio SWA_FULL_TOKENS_RATIO
The ratio of SWA layer KV tokens / full layer KV
tokens, regardless of the number of swa:full layers.
It should be between 0 and 1. E.g. 0.5 means if each
swa layer has 50 tokens, then each full layer has 100
tokens.
--disable-hybrid-swa-memory
Disable the hybrid SWA memory.
--device DEVICE The device to use ('cuda', 'xpu', 'hpu', 'npu',
'cpu'). Defaults to auto-detection if not specified.
--tensor-parallel-size TENSOR_PARALLEL_SIZE, --tp-size TENSOR_PARALLEL_SIZE
The tensor parallelism size.
--pipeline-parallel-size PIPELINE_PARALLEL_SIZE, --pp-size PIPELINE_PARALLEL_SIZE
The pipeline parallelism size.
--max-micro-batch-size MAX_MICRO_BATCH_SIZE
The maximum micro batch size in pipeline parallelism.
--stream-interval STREAM_INTERVAL
The interval (or buffer size) for streaming in terms
of the token length. A smaller value makes streaming
smoother, while a larger value makes the throughput
higher
--stream-output Whether to output as a sequence of disjoint segments.
--random-seed RANDOM_SEED
The random seed.
--constrained-json-whitespace-pattern CONSTRAINED_JSON_WHITESPACE_PATTERN
(outlines backend only) Regex pattern for syntactic
whitespaces allowed in JSON constrained output. For
example, to allow the model generate consecutive
whitespaces, set the pattern to [ ]*
--watchdog-timeout WATCHDOG_TIMEOUT
Set watchdog timeout in seconds. If a forward batch
takes longer than this, the server will crash to
prevent hanging.
--dist-timeout DIST_TIMEOUT
Set timeout for torch.distributed initialization.
--download-dir DOWNLOAD_DIR
Model download directory for huggingface.
--base-gpu-id BASE_GPU_ID
The base GPU ID to start allocating GPUs from. Useful
when running multiple instances on the same machine.
--gpu-id-step GPU_ID_STEP
The delta between consecutive GPU IDs that are used.
For example, setting it to 2 will use GPU 0,2,4,...
--sleep-on-idle Reduce CPU usage when sglang is idle.
--log-level LOG_LEVEL
The logging level of all loggers.
--log-level-http LOG_LEVEL_HTTP
The logging level of HTTP server. If not set, reuse
--log-level by default.
--log-requests Log metadata, inputs, outputs of all requests. The
verbosity is decided by --log-requests-level
--log-requests-level {0,1,2,3}
0: Log metadata (no sampling parameters). 1: Log
metadata and sampling parameters. 2: Log metadata,
sampling parameters and partial input/output. 3: Log
every input/output.
--crash-dump-folder CRASH_DUMP_FOLDER
Folder path to dump requests from the last 5 min
before a crash (if any). If not specified, crash
dumping is disabled.
--show-time-cost Show time cost of custom marks.
--enable-metrics Enable log prometheus metrics.
--enable-metrics-for-all-schedulers
Enable --enable-metrics-for-all-schedulers when you
want schedulers on all TP ranks (not just TP 0) to
record request metrics separately. This is especially
useful when dp_attention is enabled, as otherwise all
metrics appear to come from TP 0.
--bucket-time-to-first-token BUCKET_TIME_TO_FIRST_TOKEN [BUCKET_TIME_TO_FIRST_TOKEN ...]
The buckets of time to first token, specified as a
list of floats.
--bucket-inter-token-latency BUCKET_INTER_TOKEN_LATENCY [BUCKET_INTER_TOKEN_LATENCY ...]
The buckets of inter-token latency, specified as a
list of floats.
--bucket-e2e-request-latency BUCKET_E2E_REQUEST_LATENCY [BUCKET_E2E_REQUEST_LATENCY ...]
The buckets of end-to-end request latency, specified
as a list of floats.
--collect-tokens-histogram
Collect prompt/generation tokens histogram.
--decode-log-interval DECODE_LOG_INTERVAL
The log interval of decode batch.
--enable-request-time-stats-logging
Enable per request time stats logging
--kv-events-config KV_EVENTS_CONFIG
Config in json format for NVIDIA dynamo KV event
publishing. Publishing will be enabled if this flag is
used.
--api-key API_KEY Set API key of the server. It is also used in the
OpenAI API compatible server.
--served-model-name SERVED_MODEL_NAME
Override the model name returned by the v1/models
endpoint in OpenAI API server.
--chat-template CHAT_TEMPLATE
The builtin chat template name or the path of the
chat template file. This is only used for OpenAI-
compatible API server.
--completion-template COMPLETION_TEMPLATE
The builtin completion template name or the path of
the completion template file. This is only used for
the OpenAI-compatible API server, and only for code
completion currently.
--file-storage-path FILE_STORAGE_PATH
The path of the file storage in backend.
--enable-cache-report
Return number of cached tokens in
usage.prompt_tokens_details for each openai request.
--reasoning-parser {deepseek-r1,qwen3,qwen3-thinking,glm45,kimi,step3}
Specify the parser for reasoning models, supported
parsers are: ['deepseek-r1', 'qwen3',
'qwen3-thinking', 'glm45', 'kimi', 'step3'].
--tool-call-parser {qwen25,mistral,llama3,deepseekv3,pythonic,kimi_k2,qwen3_coder,glm45,step3}
Specify the parser for handling tool-call
interactions. Options include: 'qwen25', 'mistral',
'llama3', 'deepseekv3', 'pythonic', 'kimi_k2',
'qwen3_coder', 'glm45', and 'step3'.
--tool-server TOOL_SERVER
Either 'demo' or a comma-separated list of tool server
urls to use for the model. If not specified, no tool
server will be used.
--data-parallel-size DATA_PARALLEL_SIZE, --dp-size DATA_PARALLEL_SIZE
The data parallelism size.
--load-balance-method {round_robin,shortest_queue,minimum_tokens}
The load balancing strategy for data parallelism.
--dist-init-addr DIST_INIT_ADDR, --nccl-init-addr DIST_INIT_ADDR
The host address for initializing distributed backend
(e.g., `192.168.0.2:25000`).
--nnodes NNODES The number of nodes.
--node-rank NODE_RANK
The node rank.
--json-model-override-args JSON_MODEL_OVERRIDE_ARGS
A dictionary in JSON string format used to override
default model configurations.
--preferred-sampling-params PREFERRED_SAMPLING_PARAMS
json-formatted sampling settings that will be returned
in /get_model_info
--enable-lora Enable LoRA support for the model. This argument is
automatically set to True if `--lora-paths` is
provided for backward compatibility.
--max-lora-rank MAX_LORA_RANK
The maximum rank of LoRA adapters. If not specified,
it will be automatically inferred from the adapters
provided in --lora-paths.
--lora-target-modules [{q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj,all} ...]
The union set of all target modules where LoRA should
be applied. If not specified, it will be automatically
inferred from the adapters provided in --lora-paths.
If 'all' is specified, all supported modules will be
targeted.
--lora-paths [LORA_PATHS ...]
The list of LoRA adapters. You can provide a list of
either path in str or renamed path in the format
{name}={path}.
--max-loras-per-batch MAX_LORAS_PER_BATCH
Maximum number of adapters for a running batch,
include base-only request.
--max-loaded-loras MAX_LOADED_LORAS
If specified, it limits the maximum number of LoRA
adapters loaded in CPU memory at a time. The value
must be greater than or equal to `--max-loras-per-
batch`.
--lora-backend LORA_BACKEND
Choose the kernel backend for multi-LoRA serving.
--attention-backend {aiter,cutlass_mla,fa3,flashinfer,flashmla,intel_amx,torch_native,ascend,triton,trtllm_mla,trtllm_mha,dual_chunk_flash_attn}
Choose the kernels for attention layers.
--prefill-attention-backend {aiter,cutlass_mla,fa3,flashinfer,flashmla,intel_amx,torch_native,ascend,triton,trtllm_mla,trtllm_mha,dual_chunk_flash_attn}
Choose the kernels for prefill attention layers (have
priority over --attention-backend).
--decode-attention-backend {aiter,cutlass_mla,fa3,flashinfer,flashmla,intel_amx,torch_native,ascend,triton,trtllm_mla,trtllm_mha,dual_chunk_flash_attn}
Choose the kernels for decode attention layers (have
priority over --attention-backend).
--sampling-backend {flashinfer,pytorch}
Choose the kernels for sampling layers.
--grammar-backend {xgrammar,outlines,llguidance,none}
Choose the backend for grammar-guided decoding.
--mm-attention-backend {sdpa,fa3,triton_attn}
Set multimodal attention backend.
--speculative-algorithm {EAGLE,EAGLE3,NEXTN}
Speculative algorithm.
--speculative-draft-model-path SPECULATIVE_DRAFT_MODEL_PATH
The path of the draft model weights. This can be a
local folder or a Hugging Face repo ID.
--speculative-num-steps SPECULATIVE_NUM_STEPS
The number of steps sampled from draft model in
Speculative Decoding.
--speculative-eagle-topk SPECULATIVE_EAGLE_TOPK
The number of tokens sampled from the draft model in
eagle2 each step.
--speculative-num-draft-tokens SPECULATIVE_NUM_DRAFT_TOKENS
The number of tokens sampled from the draft model in
Speculative Decoding.
--speculative-accept-threshold-single SPECULATIVE_ACCEPT_THRESHOLD_SINGLE
Accept a draft token if its probability in the target
model is greater than this threshold.
--speculative-accept-threshold-acc SPECULATIVE_ACCEPT_THRESHOLD_ACC
The accept probability of a draft token is raised from
its target probability p to min(1, p / threshold_acc).
--speculative-token-map SPECULATIVE_TOKEN_MAP
The path of the draft model's small vocab table.
--expert-parallel-size EXPERT_PARALLEL_SIZE, --ep-size EXPERT_PARALLEL_SIZE, --ep EXPERT_PARALLEL_SIZE
The expert parallelism size.
--moe-a2a-backend {deepep}
Choose the backend for MoE A2A.
--enable-flashinfer-cutlass-moe
Enable FlashInfer CUTLASS MoE backend for modelopt_fp4
quant on Blackwell. Supports MoE-EP
--enable-flashinfer-trtllm-moe
Enable FlashInfer TRTLLM MoE backend on Blackwell.
Supports BlockScale FP8 MoE-EP
--enable-flashinfer-allreduce-fusion
Enable FlashInfer allreduce fusion for Add_RMSNorm.
--deepep-mode {normal,low_latency,auto}
Select the mode when DeepEP MoE is enabled; can be
`normal`, `low_latency` or `auto`. Default is `auto`,
which means `low_latency` for decode batch and
`normal` for prefill batch.
--ep-num-redundant-experts EP_NUM_REDUNDANT_EXPERTS
Allocate this number of redundant experts in expert
parallel.
--ep-dispatch-algorithm EP_DISPATCH_ALGORITHM
The algorithm to choose ranks for redundant experts in
expert parallel.
--init-expert-location INIT_EXPERT_LOCATION
Initial location of EP experts.
--enable-eplb Enable EPLB algorithm
--eplb-algorithm EPLB_ALGORITHM
Chosen EPLB algorithm
--eplb-rebalance-num-iterations EPLB_REBALANCE_NUM_ITERATIONS
Number of iterations to automatically trigger a EPLB
re-balance.
--eplb-rebalance-layers-per-chunk EPLB_REBALANCE_LAYERS_PER_CHUNK
Number of layers to rebalance per forward pass.
--expert-distribution-recorder-mode EXPERT_DISTRIBUTION_RECORDER_MODE
Mode of expert distribution recorder.
--expert-distribution-recorder-buffer-size EXPERT_DISTRIBUTION_RECORDER_BUFFER_SIZE
Circular buffer size of expert distribution recorder.
Set to -1 to denote infinite buffer.
--enable-expert-distribution-metrics
Enable logging metrics for expert balancedness
--deepep-config DEEPEP_CONFIG
Tuned DeepEP config suitable for your own cluster. It
can be either a string with JSON content or a file
path.
--moe-dense-tp-size MOE_DENSE_TP_SIZE
TP size for MoE dense MLP layers. This flag is useful
when, with large TP size, there are errors caused by
weights in MLP layers having dimension smaller than
the min dimension GEMM supports.
--enable-hierarchical-cache
Enable hierarchical cache
--hicache-ratio HICACHE_RATIO
The ratio of the size of host KV cache memory pool to
the size of device pool.
--hicache-size HICACHE_SIZE
The size of host KV cache memory pool in gigabytes,
which will override the hicache_ratio if set.
--hicache-write-policy {write_back,write_through,write_through_selective}
The write policy of hierarchical cache.
--hicache-io-backend {direct,kernel}
The IO backend for KV cache transfer between CPU and
GPU
--hicache-mem-layout {layer_first,page_first}
The layout of host memory pool for hierarchical cache.
--hicache-storage-backend {file,mooncake,hf3fs,nixl}
The storage backend for hierarchical KV cache.
--hicache-storage-prefetch-policy {best_effort,wait_complete,timeout}
Control when prefetching from the storage backend
should stop.
--enable-double-sparsity
Enable double sparsity attention
--ds-channel-config-path DS_CHANNEL_CONFIG_PATH
The path of the double sparsity channel config
--ds-heavy-channel-num DS_HEAVY_CHANNEL_NUM
The number of heavy channels in double sparsity
attention
--ds-heavy-token-num DS_HEAVY_TOKEN_NUM
The number of heavy tokens in double sparsity
attention
--ds-heavy-channel-type DS_HEAVY_CHANNEL_TYPE
The type of heavy channels in double sparsity
attention
--ds-sparse-decode-threshold DS_SPARSE_DECODE_THRESHOLD
The decode threshold for sparse attention in double
sparsity attention
--disable-radix-cache
Disable RadixAttention for prefix caching.
--cuda-graph-max-bs CUDA_GRAPH_MAX_BS
Set the maximum batch size for cuda graph. It will
extend the cuda graph capture batch size to this
value.
--cuda-graph-bs CUDA_GRAPH_BS [CUDA_GRAPH_BS ...]
Set the list of batch sizes for cuda graph.
--disable-cuda-graph Disable cuda graph.
--disable-cuda-graph-padding
Disable cuda graph when padding is needed. Still uses
cuda graph when padding is not needed.
--enable-profile-cuda-graph
Enable profiling of cuda graph capture.
--enable-cudagraph-gc
Enable garbage collection during CUDA graph capture.
If disabled (default), GC is frozen during capture to
speed up the process.
--enable-nccl-nvls Enable NCCL NVLS for prefill heavy requests when
available.
--enable-symm-mem Enable NCCL symmetric memory for fast collectives.
--enable-tokenizer-batch-encode
Enable batch tokenization for improved performance
when processing multiple text inputs. Do not use with
image inputs, pre-tokenized input_ids, or
input_embeds.
--disable-outlines-disk-cache
Disable disk cache of outlines to avoid possible
crashes related to file system or high concurrency.
--disable-custom-all-reduce
Disable the custom all-reduce kernel and fall back to
NCCL.
--enable-mscclpp Enable using mscclpp for small messages for all-reduce
kernel and fall back to NCCL.
--disable-overlap-schedule
Disable the overlap scheduler, which overlaps the CPU
scheduler with GPU model worker.
--enable-mixed-chunk Enabling mixing prefill and decode in a batch when
using chunked prefill.
--enable-dp-attention
Enabling data parallelism for attention and tensor
parallelism for FFN. The dp size should be equal to
the tp size. Currently DeepSeek-V2 and Qwen 2/3 MoE
models are supported.
--enable-dp-lm-head Enable vocabulary parallel across the attention TP
group to avoid all-gather across DP groups, optimizing
performance under DP attention.
--enable-two-batch-overlap
Enabling two micro batches to overlap.
--tbo-token-distribution-threshold TBO_TOKEN_DISTRIBUTION_THRESHOLD
The threshold of token distribution between two
batches in micro-batch-overlap, determines whether to
two-batch-overlap or two-chunk-overlap. Set to 0
denote disable two-chunk-overlap.
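As a rough illustration of the decision this threshold controls (the exact balance metric is an assumption, not SGLang's implementation; the threshold-0 behavior follows the help text above):

```python
def choose_overlap_mode(tokens_a: int, tokens_b: int, threshold: float) -> str:
    """Pick two-batch-overlap vs two-chunk-overlap from token balance.

    Sketch only: assumes the threshold is compared against the smaller
    micro-batch's share of total tokens; threshold 0 disables
    two-chunk-overlap, as stated in the help text.
    """
    if threshold == 0:
        return "two-batch-overlap"
    total = tokens_a + tokens_b
    balance = min(tokens_a, tokens_b) / total if total else 1.0
    return "two-batch-overlap" if balance >= threshold else "two-chunk-overlap"
```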
--enable-torch-compile
Optimize the model with torch.compile. Experimental
feature.
--torch-compile-max-bs TORCH_COMPILE_MAX_BS
Set the maximum batch size when using torch compile.
--torchao-config TORCHAO_CONFIG
Optimize the model with torchao. Experimental feature.
Current choices are: int8dq, int8wo,
int4wo-<group_size>, fp8wo, fp8dq-per_tensor, fp8dq-
per_row
--enable-nan-detection
Enable the NaN detection for debugging purposes.
--enable-p2p-check Enable P2P check for GPU access, otherwise the p2p
access is allowed by default.
--triton-attention-reduce-in-fp32
Cast the intermediate attention results to fp32 to
avoid possible crashes related to fp16. This only
affects Triton attention kernels.
--triton-attention-num-kv-splits TRITON_ATTENTION_NUM_KV_SPLITS
The number of KV splits in flash decoding Triton
kernel. Larger value is better in longer context
scenarios. The default value is 8.
--num-continuous-decode-steps NUM_CONTINUOUS_DECODE_STEPS
Run multiple continuous decoding steps to reduce
scheduling overhead. This can potentially increase
throughput but may also increase time-to-first-token
latency. The default value is 1, meaning only run one
decoding step at a time.
--delete-ckpt-after-loading
Delete the model checkpoint after loading the model.
--enable-memory-saver
Allow saving memory using release_memory_occupation
and resume_memory_occupation
--allow-auto-truncate
Allow automatically truncating requests that exceed
the maximum input length instead of returning an
error.
--enable-custom-logit-processor
Enable users to pass custom logit processors to the
server (disabled by default for security)
--flashinfer-mla-disable-ragged
Not using ragged prefill wrapper when running
flashinfer mla
--disable-shared-experts-fusion
Disable shared experts fusion optimization for
deepseek v3/r1.
--disable-chunked-prefix-cache
Disable chunked prefix cache feature for deepseek,
which should save overhead for short sequences.
--disable-fast-image-processor
Adopt base image processor instead of fast image
processor.
--enable-return-hidden-states
Enable returning hidden states with responses.
--enable-triton-kernel-moe
Use triton moe grouped gemm kernel.
--enable-flashinfer-mxfp4-moe
Enable FlashInfer MXFP4 MoE backend for modelopt_fp4
quant on Blackwell.
--scheduler-recv-interval SCHEDULER_RECV_INTERVAL
The interval to poll requests in scheduler. Can be set
to >1 to reduce the overhead of this.
--debug-tensor-dump-output-folder DEBUG_TENSOR_DUMP_OUTPUT_FOLDER
The output folder for dumping tensors.
--debug-tensor-dump-input-file DEBUG_TENSOR_DUMP_INPUT_FILE
The input filename for dumping tensors
--debug-tensor-dump-inject DEBUG_TENSOR_DUMP_INJECT
Inject the outputs from jax as the input of every
layer.
--debug-tensor-dump-prefill-only
Only dump the tensors for prefill requests (i.e. batch
size > 1).
--disaggregation-mode {null,prefill,decode}
Only used for PD disaggregation. "prefill" for
prefill-only server, and "decode" for decode-only
server. If not specified, it is not PD disaggregated
--disaggregation-transfer-backend {mooncake,nixl,ascend}
The backend for disaggregation transfer. Default is
mooncake.
--disaggregation-bootstrap-port DISAGGREGATION_BOOTSTRAP_PORT
Bootstrap server port on the prefill server. Default
is 8998.
--disaggregation-decode-tp DISAGGREGATION_DECODE_TP
Decode tp size. If not set, it matches the tp size of
the current engine. This is only set on the prefill
server.
--disaggregation-decode-dp DISAGGREGATION_DECODE_DP
Decode dp size. If not set, it matches the dp size of
the current engine. This is only set on the prefill
server.
--disaggregation-prefill-pp DISAGGREGATION_PREFILL_PP
Prefill pp size. If not set, it is default to 1. This
is only set on the decode server.
--disaggregation-ib-device DISAGGREGATION_IB_DEVICE
The InfiniBand devices for disaggregation transfer,
accepts single device (e.g., --disaggregation-ib-
device mlx5_0) or multiple comma-separated devices
(e.g., --disaggregation-ib-device mlx5_0,mlx5_1).
Default is None, which triggers automatic device
detection when mooncake backend is enabled.
--num-reserved-decode-tokens NUM_RESERVED_DECODE_TOKENS
Number of decode tokens that will have memory reserved
when adding new request to the running batch.
--pdlb-url PDLB_URL The URL of the PD disaggregation load balancer. If
set, the prefill/decode server will register with the
load balancer.
--custom-weight-loader [CUSTOM_WEIGHT_LOADER ...]
The custom dataloader which used to update the model.
Should be set with a valid import path, such as
my_package.weight_load_func
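A "valid import path" of this dotted form can be resolved with `importlib`; a minimal sketch, using a stdlib function as a stand-in for a real weight-load function:

```python
import importlib

def resolve_import_path(path: str):
    """Resolve a dotted import path such as 'pkg.func' into the callable
    it names. Sketch of the resolution step only, not SGLang's loader."""
    module_name, _, attr = path.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, attr)

# Resolve a stdlib function the same way a custom loader path would be.
sqrt = resolve_import_path("math.sqrt")
```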
--enable-pdmux Enable PD-Multiplexing, PD running on greenctx stream.
--sm-group-num SM_GROUP_NUM
Number of sm partition groups.
--weight-loader-disable-mmap
Disable mmap while loading weight using safetensors.
--enable-ep-moe (Deprecated) Enabling expert parallelism for moe. The
ep size is equal to the tp size.
--enable-deepep-moe (Deprecated) Enabling DeepEP MoE implementation for EP
MoE.
python3 -m sglang.launch_server --model /root/.achatbot/models/openai/gpt-oss-20b --host 0.0.0.0 --port 30000 \
    --tp 1

WARNING:sglang.srt.configs.model_config:mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
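Once the server above is up, it exposes an OpenAI-compatible API on port 30000. A minimal smoke-test payload (standard `/v1/chat/completions` shape, model path taken from the launch command) can be built like this:

```python
import json

# Minimal OpenAI-compatible chat payload for the server launched above.
payload = {
    "model": "/root/.achatbot/models/openai/gpt-oss-20b",
    "messages": [{"role": "user", "content": "Say hello in one word."}],
    "max_tokens": 16,
}
body = json.dumps(payload)
# POST this body to http://0.0.0.0:30000/v1/chat/completions with
# Content-Type: application/json (e.g. via curl or requests).
```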
python3 -m sglang.bench_serving --help

usage: bench_serving.py [-h]
[--backend {sglang,sglang-native,sglang-oai,sglang-oai-chat,vllm,vllm-chat,lmdeploy,lmdeploy-chat,trt,gserver,truss}]
[--base-url BASE_URL] [--host HOST] [--port PORT]
[--dataset-name {sharegpt,random,random-ids,generated-shared-prefix,mmmu}]
[--dataset-path DATASET_PATH] [--model MODEL]
[--tokenizer TOKENIZER] [--num-prompts NUM_PROMPTS]
[--sharegpt-output-len SHAREGPT_OUTPUT_LEN]
[--sharegpt-context-len SHAREGPT_CONTEXT_LEN]
[--random-input-len RANDOM_INPUT_LEN]
[--random-output-len RANDOM_OUTPUT_LEN]
[--random-range-ratio RANDOM_RANGE_RATIO]
[--request-rate REQUEST_RATE]
[--max-concurrency MAX_CONCURRENCY]
[--output-file OUTPUT_FILE] [--output-details]
[--disable-tqdm] [--disable-stream] [--return-logprob]
[--seed SEED] [--disable-ignore-eos]
[--extra-request-body {"key1": "value1", "key2": "value2"}]
[--apply-chat-template] [--profile]
[--lora-name [LORA_NAME ...]]
[--prompt-suffix PROMPT_SUFFIX] [--pd-separated]
[--flush-cache] [--warmup-requests WARMUP_REQUESTS]
[--tokenize-prompt] [--gsp-num-groups GSP_NUM_GROUPS]
[--gsp-prompts-per-group GSP_PROMPTS_PER_GROUP]
[--gsp-system-prompt-len GSP_SYSTEM_PROMPT_LEN]
[--gsp-question-len GSP_QUESTION_LEN]
[--gsp-output-len GSP_OUTPUT_LEN]
Benchmark the online serving throughput.
options:
-h, --help show this help message and exit
--backend {sglang,sglang-native,sglang-oai,sglang-oai-chat,vllm,vllm-chat,lmdeploy,lmdeploy-chat,trt,gserver,truss}
Must specify a backend, depending on the LLM Inference
Engine.
--base-url BASE_URL Server or API base url if not using http host and
port.
--host HOST Default host is 0.0.0.0.
--port PORT If not set, the default port is configured according
to its default value for different LLM Inference
Engines.
--dataset-name {sharegpt,random,random-ids,generated-shared-prefix,mmmu}
Name of the dataset to benchmark on.
--dataset-path DATASET_PATH
Path to the dataset.
--model MODEL Name or path of the model. If not set, the default
model will request /v1/models for conf.
--tokenizer TOKENIZER
Name or path of the tokenizer. If not set, using the
model conf.
--num-prompts NUM_PROMPTS
Number of prompts to process. Default is 1000.
--sharegpt-output-len SHAREGPT_OUTPUT_LEN
Output length for each request. Overrides the output
length from the ShareGPT dataset.
--sharegpt-context-len SHAREGPT_CONTEXT_LEN
The context length of the model for the ShareGPT
dataset. Requests longer than the context length will
be dropped.
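The drop rule above amounts to a simple filter. A sketch (the `(prompt_len, output_len)` tuple layout is an assumption, not bench_serving's actual data structure):

```python
def filter_by_context_len(requests, context_len):
    """Keep only requests whose prompt + output token counts fit within
    the model context length; longer requests are dropped, as documented."""
    return [(p, o) for p, o in requests if p + o <= context_len]
```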
--random-input-len RANDOM_INPUT_LEN
Number of input tokens per request, used only for
random dataset.
--random-output-len RANDOM_OUTPUT_LEN
Number of output tokens per request, used only for
random dataset.
--random-range-ratio RANDOM_RANGE_RATIO
Range of sampled ratio of input/output length, used
only for random dataset.
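One plausible reading of the range ratio (a sketch, not the benchmark's exact sampling code): each request's length is drawn uniformly between `ratio * L` and `L`:

```python
import random

def sample_len(base_len: int, range_ratio: float, rng: random.Random) -> int:
    """Sample a per-request token count in [range_ratio * base_len, base_len]."""
    low = int(base_len * range_ratio)
    return rng.randint(low, base_len)
```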
--request-rate REQUEST_RATE
Number of requests per second. If this is inf, then
all the requests are sent at time 0. Otherwise, we use
Poisson process to synthesize the request arrival
times. Default is inf.
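The arrival-time rule above is easy to sketch (illustrative only; bench_serving's actual generator may differ in details):

```python
import math
import random

def poisson_arrival_times(num_requests: int, request_rate: float,
                          seed: int = 0) -> list[float]:
    """Synthesize request send times: all at t=0 when the rate is inf,
    otherwise exponential inter-arrival gaps (a Poisson process)."""
    if math.isinf(request_rate):
        return [0.0] * num_requests
    rng = random.Random(seed)
    t, times = 0.0, []
    for _ in range(num_requests):
        t += rng.expovariate(request_rate)  # mean gap = 1 / rate seconds
        times.append(t)
    return times
```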
--max-concurrency MAX_CONCURRENCY
Maximum number of concurrent requests. This can be
used to help simulate an environment where a higher
level component is enforcing a maximum number of
concurrent requests. While the --request-rate argument
controls the rate at which requests are initiated,
this argument will control how many are actually
allowed to execute at a time. This means that when
used in combination, the actual request rate may be
lower than specified with --request-rate, if the
server is not processing requests fast enough to keep
up.
--output-file OUTPUT_FILE
Output JSONL file name.
--output-details Output details of benchmarking.
--disable-tqdm Specify to disable tqdm progress bar.
--disable-stream Disable streaming mode.
--return-logprob Return logprob.
--seed SEED The random seed.
--disable-ignore-eos Disable ignoring EOS.
--extra-request-body {"key1": "value1", "key2": "value2"}
Append the given JSON object to the request payload.
You can use this to specify additional generate params
like sampling params.
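Applying the extra body is a plain JSON merge into each request payload, with the extra keys overriding defaults; a sketch (function name is ours):

```python
import json

def apply_extra_request_body(payload: dict, extra_json: str) -> dict:
    """Merge the --extra-request-body JSON object into a request payload,
    letting its keys (e.g. sampling params) override the defaults."""
    merged = dict(payload)
    merged.update(json.loads(extra_json))
    return merged
```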
--apply-chat-template
Apply chat template
--profile Use Torch Profiler. The endpoint must be launched with
SGLANG_TORCH_PROFILER_DIR to enable profiler.
--lora-name [LORA_NAME ...]
The names of LoRA adapters. You can provide a list of
names in the format {name} {name} {name}...
--prompt-suffix PROMPT_SUFFIX
Suffix applied to the end of all user prompts,
followed by assistant prompt suffix.
--pd-separated Benchmark PD disaggregation server
--flush-cache Flush the cache before running the benchmark
--warmup-requests WARMUP_REQUESTS
Number of warmup requests to run before the benchmark
--tokenize-prompt Use integer ids instead of string for inputs. Useful
to control prompt lengths accurately
generated-shared-prefix dataset arguments:
--gsp-num-groups GSP_NUM_GROUPS
Number of system prompt groups for generated-shared-
prefix dataset
--gsp-prompts-per-group GSP_PROMPTS_PER_GROUP
Number of prompts per system prompt group for
generated-shared-prefix dataset
--gsp-system-prompt-len GSP_SYSTEM_PROMPT_LEN
Target length in tokens for system prompts in
generated-shared-prefix dataset
--gsp-question-len GSP_QUESTION_LEN
Target length in tokens for questions in generated-
shared-prefix dataset
--gsp-output-len GSP_OUTPUT_LEN
Target length in tokens for outputs in generated-
shared-prefix dataset
@weedge Which version supports running "openai/gpt-oss-120b" on 1 A100 or 1 H100?
@ghostplant See PR #179; you may find the answer there.
Tip
MXFP4 needs `quantization=mxfp4` (plus `quantization_param_path=QUANTIZATION_PARAM_PATH` if the weights are not in the default Open Compute Project (OCP) MXFP4 format):
feat:
reference