This repository was archived by the owner on Apr 20, 2026. It is now read-only.
Merged
27 changes: 17 additions & 10 deletions recipes/gb200-fp4/1k1k/low-latency.yaml
@@ -1,8 +1,17 @@
name: "gb200-fp4-1p2d"
name: "gb200-fp4-1k1k-low-latency"

dynamo:
version: 0.8.1

frontend:
type: dynamo
enable_multiple_frontends: true
num_additional_frontends: 3
nginx_container: nginx

model:
path: "dsfp4"
container: "0.5.5.post2"
container: "lmsysorg/sglang:v0.5.8-cu130-runtime"
precision: "fp4"

resources:
@@ -25,14 +34,11 @@ backend:
SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000"
#SGLANG_NVFP4_CKPT_FP8_GEMM_IN_ATTN: "1"
#SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: "1"
MC_FORCE_MNNVL: "1"
NCCL_MNNVL_ENABLE: "1"
NCCL_CUMEM_ENABLE: "1"
SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
SGLANG_ENABLE_JIT_DEEPGEMM: "false"
SGLANG_ENABLE_FLASHINFER_GEMM: "true"

decode_environment:
TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
@@ -44,14 +50,11 @@ backend:
SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000"
# SGLANG_NVFP4_CKPT_FP8_GEMM_IN_ATTN: "1"
# SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: "1"
MC_FORCE_MNNVL: "1"
NCCL_MNNVL_ENABLE: "1"
NCCL_CUMEM_ENABLE: "1"
SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
SGLANG_ENABLE_JIT_DEEPGEMM: "false"
SGLANG_ENABLE_FLASHINFER_GEMM: "true"

sglang_config:
prefill:
@@ -76,6 +79,8 @@ backend:
moe-dense-tp-size: 1
load-balance-method: "round_robin"
disaggregation-bootstrap-port: 30001
disaggregation-transfer-backend: nixl
fp4-gemm-backend: "flashinfer_trtllm"
Comment on lines +82 to +83

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

sglang v0.5.8 flashinfer_trtllm backend configuration

💡 Result:

In SGLang v0.5.8, flashinfer_trtllm is configured as the MoE runner backend via --moe-runner-backend (it’s not the attention backend flag). The valid options list includes flashinfer_trtllm. [1]

CLI example (as shown in the SGLang docs)

python3 -m sglang.launch_server \
  --model-path nvidia/DeepSeek-R1-FP4 \
  --tp 8 \
  --attention-backend trtllm_mla \
  --moe-runner-backend flashinfer_trtllm \
  --quantization modelopt_fp4 \
  --prefill-attention-backend fa4

This is an example of “hybrid attention” (different prefill vs decode backends) while using flashinfer_trtllm for the MoE runner. [2]

YAML config equivalent

model-path: nvidia/DeepSeek-R1-FP4
tp: 8
attention-backend: trtllm_mla
moe-runner-backend: flashinfer_trtllm
quantization: modelopt_fp4
prefill-attention-backend: fa4

Launch with:

python3 -m sglang.launch_server --config config.yaml

(Config-file support is documented; CLI args override YAML.) [1]

Sources: Server arguments (MoE + config support) [1]; Attention backend doc showing flashinfer_trtllm usage in an example command [2].


🌐 Web query:

sglang v0.5.8 disaggregation-transfer-backend nixl options

💡 Result:

In SGLang v0.5.8, the NIXL “options” for PD disaggregation are mainly (a) selecting NIXL as the transfer backend, plus (b) the general PD-disaggregation knobs (there aren’t extra SGLang CLI flags that are NIXL-specific in the docs).

1) Enable NIXL transfer backend (PD disaggregation)

Use:

  • --disaggregation-transfer-backend nixl (supported values: mooncake, nixl, ascend, fake) [1]
    and run separate servers:
  • --disaggregation-mode prefill
  • --disaggregation-mode decode [1][2]

Docs example (single node) uses exactly these flags with NIXL. [2]

2) PD-disaggregation knobs you can still use with NIXL

These are the PD-disaggregation CLI arguments listed by SGLang (apply regardless of backend selection): [1]

  • --disaggregation-bootstrap-port (prefill bootstrap server port)
  • --disaggregation-decode-tp (set on prefill server; decode TP size)
  • --disaggregation-decode-dp (set on prefill server; decode DP size)
  • --disaggregation-prefill-pp (set on decode server; prefill PP size)
  • --disaggregation-decode-enable-offload-kvcache
  • --disaggregation-decode-enable-fake-auto
  • --num-reserved-decode-tokens
  • --disaggregation-decode-polling-interval
  • --disaggregation-ib-device exists, but its description is tied to auto-detection for the mooncake backend (not described as a NIXL control). [1]

3) Install/bring NIXL (dependency)

SGLang’s PD-disaggregation doc says to install NIXL via pip (or build from source, e.g., if you already have UCX installed). [2]

Sources:
[1] SGLang “Server Arguments” (PD disaggregation section). (docs.sglang.io)
[2] SGLang “PD Disaggregation” (NIXL requirements + usage examples). (docs.sglang.io)


🌐 Web query:

sglang 0.5.8 release notes backend options

💡 Result:

SGLang v0.5.8 was released Jan 23, 2026 (PyPI upload date). [1]

v0.5.8 release notes (backend-related items)

  • SGLang-Diffusion: can run sglang-diffusion with the diffusers backend (plus multi-LoRA inference, “SLA attention backends”, CLI warmup switch, ComfyUI plugin, and broad perf improvements). [2]
  • Flash Attention 4: adds Flash Attention 4 decoding kernels. [2]

“Backend options” you can configure (runtime/kernel backends)

From SGLang server CLI args (see python -m sglang.launch_server --help), the main backend-selector knobs are: [3]

  • --model-impl: auto / sglang / transformers (model implementation choice). [3]
  • --attention-backend: attention kernel backend. [3]
  • prefill_attention_backend and decode_attention_backend (experimental; override attention backend separately for prefill vs decode). [3]
  • --sampling-backend: sampling kernel backend. [3]
  • --grammar-backend: backend for grammar-guided decoding. [3]
  • --mm-attention-backend: multimodal attention backend. [3]

Sources: PyPI sglang 0.5.8 release entry [1]; GitHub release notes for v0.5.8 [2]; SGLang “Server Arguments” docs (kernel backend flags) [3].


Use moe-runner-backend: flashinfer_trtllm instead of fp4-gemm-backend: flashinfer_trtllm.

In SGLang v0.5.8, flashinfer_trtllm is a valid backend value but must be configured via the moe-runner-backend option, not fp4-gemm-backend. The config key fp4-gemm-backend does not exist in the official SGLang configuration schema. The disaggregation-transfer-backend: nixl is valid and correct.

🤖 Prompt for AI Agents
In `@recipes/gb200-fp4/1k1k/low-latency.yaml` around lines 81 - 82, replace the
invalid config key fp4-gemm-backend with the supported key moe-runner-backend,
retaining the flashinfer_trtllm value, and keep
disaggregation-transfer-backend: nixl as-is.
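If the reviewer's finding holds, the fix is a one-key rename in both the prefill and decode blocks. A minimal sketch of the corrected decode fragment (all keys except moe-runner-backend are taken from the diff above; treat the exact schema as an assumption to verify against the SGLang v0.5.8 server-arguments docs):

```yaml
decode:
  # ... existing keys unchanged ...
  moe-dense-tp-size: 1
  load-balance-method: "round_robin"
  disaggregation-bootstrap-port: 30001
  disaggregation-transfer-backend: nixl
  # was: fp4-gemm-backend: "flashinfer_trtllm" (key not in the SGLang schema)
  moe-runner-backend: "flashinfer_trtllm"
```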

data-parallel-size: 1
tensor-parallel-size: 4
expert-parallel-size: 1
@@ -100,12 +105,14 @@ backend:
scheduler-recv-interval: 10
enable-symm-mem: true
moe-dense-tp-size: 1
disaggregation-transfer-backend: nixl
fp4-gemm-backend: "flashinfer_trtllm"
tensor-parallel-size: 4
expert-parallel-size: 1

benchmark:
type: "sa-bench"
isl: 1024
osl: 1024
concurrencies: "4x8x32x64x112x128x256"
req_rate: "inf"
concurrencies: "4x8x32"
req_rate: "inf"
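The x-separated concurrencies string ("4x8x32") presumably expands to a sweep of concurrency levels, with req_rate "inf" meaning unthrottled load. A hypothetical helper showing the intended parsing (the function names are assumptions for illustration, not part of sa-bench):

```python
def parse_concurrencies(spec: str) -> list[int]:
    """Split an 'AxBxC' sweep spec into integer concurrency levels."""
    return [int(tok) for tok in spec.split("x") if tok]

def parse_req_rate(spec: str) -> float:
    """Request rate is either a number or 'inf' for unthrottled load."""
    return float(spec)  # float("inf") handles the "inf" case

print(parse_concurrencies("4x8x32"))  # [4, 8, 32]
```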
25 changes: 16 additions & 9 deletions recipes/gb200-fp4/1k1k/max-tpt.yaml
@@ -1,10 +1,17 @@
# 4P1D, with 12 Decode Nodes. Uses single batch overlap
name: "gb200-fp4-1k1k-max-tpt"

name: "gb200-fp4-max-tpt"
dynamo:
version: 0.8.1

frontend:
type: dynamo
enable_multiple_frontends: true
num_additional_frontends: 9
nginx_container: nginx

model:
path: "dsfp4"
container: "0.5.5.post2"
container: "lmsysorg/sglang:v0.5.8-cu130-runtime"
precision: "fp4"

resources:
@@ -27,7 +34,6 @@ backend:
SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
SGLANG_HACK_SEQ_BOOTSTRAP_ROOM: "1"
MC_TE_METRIC: "true"
MC_FORCE_MNNVL: "1"
NCCL_MNNVL_ENABLE: "1"
@@ -46,7 +52,6 @@ backend:
SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
SGLANG_HACK_SEQ_BOOTSTRAP_ROOM: "1"
MC_TE_METRIC: "true"
MC_FORCE_MNNVL: "1"
NCCL_MNNVL_ENABLE: "1"
@@ -56,8 +61,6 @@ backend:
SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "1024"
SGLANG_MOE_NVFP4_DISPATCH: "1"
SGLANG_CUTEDSL_MOE_NVFP4_DISPATCH: "1" # Used in older sglang versions
SGLANG_FLASHINFER_FP4_GEMM_BACKEND: "cutlass"

sglang_config:
prefill:
@@ -103,6 +106,8 @@ backend:
# Performance optimizations
disable-cuda-graph: true
enable-dp-attention: true
fp4-gemm-backend: "flashinfer_cutlass"
disaggregation-transfer-backend: nixl

# Parallelism
tp-size: 4
@@ -162,6 +167,8 @@ backend:
enable-dp-lm-head: true
prefill-round-robin-balance: true
enable-dp-attention: true
disaggregation-transfer-backend: nixl
fp4-gemm-backend: "flashinfer_cutlass"

# Parallelism
tp-size: 48
@@ -172,5 +179,5 @@ benchmark:
type: "sa-bench"
isl: 1024
osl: 1024
concurrencies: "1x128x512x2048x4096x8192x12000x15000"
req_rate: "inf"
concurrencies: "512x2048x4096"
req_rate: "inf"
26 changes: 16 additions & 10 deletions recipes/gb200-fp4/1k1k/mid-curve.yaml
@@ -1,11 +1,17 @@
# 4P1D, with 8 Decode Nodes. Does not use single batch overlap but allows us to currently drive higher
# per gpu throughput
name: "gb200-fp4-1k1k-mid-curve"

name: "gb200-fp4-max-tpt-2"
dynamo:
version: 0.8.1

frontend:
type: dynamo
enable_multiple_frontends: true
num_additional_frontends: 9
nginx_container: nginx

model:
path: "dsfp4"
container: "0.5.5.post2"
container: "lmsysorg/sglang:v0.5.8-cu130-runtime"
precision: "fp4"

resources:
@@ -28,7 +34,6 @@ backend:
SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
SGLANG_HACK_SEQ_BOOTSTRAP_ROOM: "1"
MC_TE_METRIC: "true"
MC_FORCE_MNNVL: "1"
NCCL_MNNVL_ENABLE: "1"
@@ -47,7 +52,6 @@ backend:
SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
SGLANG_HACK_SEQ_BOOTSTRAP_ROOM: "1"
MC_TE_METRIC: "true"
MC_FORCE_MNNVL: "1"
NCCL_MNNVL_ENABLE: "1"
@@ -57,8 +61,6 @@ backend:
SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "1024"
SGLANG_MOE_NVFP4_DISPATCH: "1"
SGLANG_CUTEDSL_MOE_NVFP4_DISPATCH: "1" # Used in older sglang versions
SGLANG_FLASHINFER_FP4_GEMM_BACKEND: "cutlass"

sglang_config:
prefill:
@@ -104,6 +106,8 @@ backend:
# Performance optimizations
disable-cuda-graph: true
enable-dp-attention: true
fp4-gemm-backend: "flashinfer_cutlass"
disaggregation-transfer-backend: nixl

# Parallelism
tp-size: 4
@@ -162,6 +166,8 @@ backend:
enable-dp-lm-head: true
prefill-round-robin-balance: true
enable-dp-attention: true
disaggregation-transfer-backend: nixl
fp4-gemm-backend: "flashinfer_cutlass"

# Parallelism
tp-size: 32
@@ -172,5 +178,5 @@ benchmark:
type: "sa-bench"
isl: 1024
osl: 1024
concurrencies: "1x128x512x2048x4096x8192x12000x15000"
req_rate: "inf"
concurrencies: "512x2048x4096x8192x15000"
req_rate: "inf"
25 changes: 14 additions & 11 deletions recipes/gb200-fp4/1k8k/low-latency.yaml
@@ -1,16 +1,17 @@
name: "gb200-fp4-1p2d"
name: "gb200-fp4-1k8k-low-latency"

dynamo:
version: 0.7.0
version: 0.8.1
Comment on lines +1 to +4

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if dynamo 0.8.1 exists on PyPI
curl -s "https://pypi.org/pypi/dynamo/0.8.1/json" | head -c 200

Repository: ishandhanani/srt-slurm

Length of output: 90


🏁 Script executed:

# Search for other dynamo version references in the repository
rg "dynamo:" --type yaml -A 1 | head -50

Repository: ishandhanani/srt-slurm

Length of output: 3082


🏁 Script executed:

cat recipes/gb200-fp4/1k8k/low-latency.yaml | head -20

Repository: ishandhanani/srt-slurm

Length of output: 405


🏁 Script executed:

# Search for all dynamo version specifications with "version:" pattern
rg "dynamo:" -A 2 -B 0 | rg "version:|hash:|install:" | sort -u

Repository: ishandhanani/srt-slurm

Length of output: 28548


Dynamo version 0.8.1 does not exist on PyPI and must be corrected.

While the experiment name follows the standardized convention, the dynamo configuration is invalid. The version 0.8.1 is not a valid release on PyPI (PyPI returned "Not Found"). Although this version is consistently used across multiple GB200 recipes in this PR, the underlying version must be corrected to an actual release (e.g., 0.8.0 as referenced in docs/config-reference.md) before these configurations can be used.

🤖 Prompt for AI Agents
In `@recipes/gb200-fp4/1k8k/low-latency.yaml` around lines 1 - 4, the dynamo
version specified under the "dynamo" block (dynamo.version: 0.8.1) is invalid on
PyPI; update it to a real release (e.g., 0.8.0) in this recipe (name:
"gb200-fp4-1k8k-low-latency") and apply the same correction to the other GB200
recipes that reuse the erroneous 0.8.1 value, so the configuration matches an
actual PyPI release.
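The reviewer's existence check can be reproduced locally. A small sketch using only the standard library (the package name "dynamo" on PyPI is an assumption; if the project installs dynamo from a different index or wheel, the check must target that source instead):

```python
import urllib.error
import urllib.request

def pypi_release_url(package: str, version: str) -> str:
    """JSON API endpoint for one exact release on PyPI."""
    return f"https://pypi.org/pypi/{package}/{version}/json"

def pypi_release_exists(package: str, version: str) -> bool:
    """True if PyPI knows this exact release; False on a 404."""
    try:
        with urllib.request.urlopen(pypi_release_url(package, version), timeout=10) as resp:
            return resp.status == 200
    except urllib.error.HTTPError:
        return False

# Example: pypi_release_exists("dynamo", "0.8.1") returned False at review time.
```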


frontend:
type: dynamo
frontend:
type: dynamo
enable_multiple_frontends: true
num_additional_frontends: 4
num_additional_frontends: 3
nginx_container: nginx

model:
path: "dsr1"
container: "lmsysorg/sglang:v0.5.5.post2"
path: "dsfp4"
container: "lmsysorg/sglang:v0.5.8-cu130-runtime"
precision: "fp4"

resources:
@@ -37,7 +38,6 @@ backend:
NCCL_CUMEM_ENABLE: "1"
SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
SGLANG_ENABLE_JIT_DEEPGEMM: "false"
SGLANG_ENABLE_FLASHINFER_GEMM: "true"

decode_environment:
TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
@@ -54,12 +54,11 @@ backend:
NCCL_CUMEM_ENABLE: "1"
SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
SGLANG_ENABLE_JIT_DEEPGEMM: "false"
SGLANG_ENABLE_FLASHINFER_GEMM: "true"

sglang_config:
prefill:
disaggregation-mode: "prefill"
served-model-name: "deepseek-ai/DeepSeek-R1"
disaggregation-mode: "prefill"
trust-remote-code: true
disable-radix-cache: true
kv-cache-dtype: "fp8_e4m3"
@@ -81,10 +80,12 @@ backend:
data-parallel-size: 1
tensor-parallel-size: 4
expert-parallel-size: 1
fp4-gemm-backend: "flashinfer_trtllm"
disaggregation-transfer-backend: nixl

decode:
disaggregation-mode: "decode"
served-model-name: "deepseek-ai/DeepSeek-R1"
disaggregation-mode: "decode"
prefill-round-robin-balance: true
trust-remote-code: true
disable-radix-cache: true
@@ -103,6 +104,8 @@ backend:
moe-dense-tp-size: 1
tensor-parallel-size: 4
expert-parallel-size: 1
fp4-gemm-backend: "flashinfer_trtllm"
disaggregation-transfer-backend: nixl

benchmark:
type: "sa-bench"
18 changes: 10 additions & 8 deletions recipes/gb200-fp4/1k8k/max-tpt.yaml
@@ -1,16 +1,17 @@
name: "gb200-fp4-max-tpt"
name: "gb200-fp4-1k8k-max-tpt"

dynamo:
version: 0.7.0
version: 0.8.1

frontend:
type: dynamo
enable_multiple_frontends: true
num_additional_frontends: 9
nginx_container: nginx

model:
path: "dsr1"
container: "lmsysorg/sglang:v0.5.5.post2"
path: "dsfp4"
container: "lmsysorg/sglang:v0.5.8-cu130-runtime"
precision: "fp4"

resources:
@@ -32,7 +33,6 @@ backend:
SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
SGLANG_HACK_SEQ_BOOTSTRAP_ROOM: "1"
MC_TE_METRIC: "true"
MC_FORCE_MNNVL: "1"
NCCL_MNNVL_ENABLE: "1"
@@ -51,7 +51,6 @@ backend:
SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
SGLANG_HACK_SEQ_BOOTSTRAP_ROOM: "1"
MC_TE_METRIC: "true"
MC_FORCE_MNNVL: "1"
NCCL_MNNVL_ENABLE: "1"
@@ -61,14 +60,14 @@ backend:
SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "1024"
SGLANG_MOE_NVFP4_DISPATCH: "1"
SGLANG_CUTEDSL_MOE_NVFP4_DISPATCH: "1" # Used in older sglang versions
SGLANG_FLASHINFER_FP4_GEMM_BACKEND: "cutlass"
Comment on lines 61 to 63

⚠️ Potential issue | 🟡 Minor

Potential redundancy: SGLANG_FLASHINFER_FP4_GEMM_BACKEND env var vs fp4-gemm-backend config.

The decode_environment still contains SGLANG_FLASHINFER_FP4_GEMM_BACKEND: "cutlass" (line 63), while the new fp4-gemm-backend: "flashinfer_cutlass" is added in sglang_config for both prefill (line 110) and decode (line 237).

This creates:

  1. Potential redundancy between env var and config option
  2. Asymmetry: prefill_environment lacks this env var, but decode has it

If fp4-gemm-backend in sglang_config is the new preferred approach, consider removing the legacy env var from decode_environment to avoid confusion about which setting takes precedence.

🤖 Prompt for AI Agents
In `@recipes/gb200-fp4/1k8k/max-tpt.yaml` around lines 61 - 63, remove the
legacy env var SGLANG_FLASHINFER_FP4_GEMM_BACKEND from decode_environment so
that only the fp4-gemm-backend key in sglang_config controls the GEMM backend,
and keep prefill_environment and decode_environment symmetric (either both omit
the legacy env var, or both include it only if backward compatibility is
intended).
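Under the reviewer's suggestion, the decode side would drop the legacy env var and both phases would rely on the config key alone. A sketch of the symmetric result (assuming fp4-gemm-backend is the key this recipe schema actually consumes; a separate comment above suggests SGLang itself may expect moe-runner-backend, which should be settled first):

```yaml
decode_environment:
  # SGLANG_FLASHINFER_FP4_GEMM_BACKEND removed; the sglang_config key below
  # is now the single source of truth for the GEMM backend.
  SGLANG_MOE_NVFP4_DISPATCH: "1"

sglang_config:
  prefill:
    fp4-gemm-backend: "flashinfer_cutlass"
  decode:
    fp4-gemm-backend: "flashinfer_cutlass"
```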


sglang_config:
prefill:
# Model configuration
served-model-name: "deepseek-ai/DeepSeek-R1"
trust-remote-code: true
disaggregation-transfer-backend: nixl

# KV cache and attention
kv-cache-dtype: "fp8_e4m3"
@@ -108,6 +107,7 @@ backend:
# Performance optimizations
disable-cuda-graph: true
enable-dp-attention: true
fp4-gemm-backend: "flashinfer_cutlass"

# Parallelism
tp-size: 4
@@ -118,6 +118,7 @@ backend:
# Model configuration
served-model-name: "deepseek-ai/DeepSeek-R1"
trust-remote-code: true
disaggregation-transfer-backend: nixl

# KV cache and attention
kv-cache-dtype: "fp8_e4m3"
@@ -233,6 +234,7 @@ backend:
enable-dp-lm-head: true
prefill-round-robin-balance: true
enable-dp-attention: true
fp4-gemm-backend: "flashinfer_cutlass"

# Parallelism
tp-size: 48
@@ -243,5 +245,5 @@ benchmark:
type: "sa-bench"
isl: 1024
osl: 8192
concurrencies: "256x512x1024x2048x8192"
concurrencies: "256x512x1024x2048"
req_rate: "inf"