Add openai/gpt-oss-20b recipe and Laguna-XS.2 DFlash spec decoding #447
Conversation
- Add standalone openai/gpt-oss-20b recipe (21B/3.6B-A MoE, MXFP4, 16GB VRAM, single_node_tp with tp=1), with hardware tuning ported from the 120b sibling for the shared gpt-oss kernel paths.
- Remove the now-redundant "20b" variant from gpt-oss-120b.yaml so the 20b page is the single source of truth.
- Add spec_decoding feature to poolside/Laguna-XS.2 using the Laguna-XS.2-speculator.dflash draft model (DFlash, 7 tokens, greedy); document the VLLM_USE_DEEP_GEMM=0 requirement and PR #41880 dependency in the guide.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: yasong.wang <yasong.wang@inferact.ai>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1bafcf0e1f
```yaml
  spec_decoding:
    description: "DFlash speculative decoding with the Laguna-XS.2 draft model (7 tokens, greedy)"
    args:
      - "--speculative-config"
      - '{"model":"poolside/Laguna-XS.2-speculator.dflash","num_speculative_tokens":7,"method":"dflash"}'
```
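Since the `--speculative-config` value is passed as a single inline JSON string, a quick standalone sanity check (plain Python, not part of the recipe tooling) confirms it parses to the fields the feature description promises:

```python
import json

# The --speculative-config value from the feature's args, verbatim.
raw = '{"model":"poolside/Laguna-XS.2-speculator.dflash","num_speculative_tokens":7,"method":"dflash"}'

spec = json.loads(raw)
print(spec["model"])                   # poolside/Laguna-XS.2-speculator.dflash
print(spec["num_speculative_tokens"])  # 7
print(spec["method"])                  # dflash
```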
Include required DeepGEMM env guard for spec decoding
This spec_decoding feature only adds --speculative-config, but the same recipe explicitly documents that DFlash requires VLLM_USE_DEEP_GEMM=0 to work. In this codebase, feature toggles are rendered as CLI args only, so users who enable this toggle via the command builder will get a command that is missing the required environment setting and can fail at runtime unless they manually edit it. Please wire the required env guard into the generated configuration path, not only the guide text.
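One possible shape for that wiring, assuming the feature schema supported an `env` key alongside `args` (hypothetical; the actual command-builder schema in this repo may spell this differently):

```yaml
  spec_decoding:
    description: "DFlash speculative decoding with the Laguna-XS.2 draft model (7 tokens, greedy)"
    # Hypothetical `env` key: the command builder would render these as
    # environment-variable prefixes on the generated vllm serve command.
    env:
      VLLM_USE_DEEP_GEMM: "0"
    args:
      - "--speculative-config"
      - '{"model":"poolside/Laguna-XS.2-speculator.dflash","num_speculative_tokens":7,"method":"dflash"}'
```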
Code Review
This pull request refactors the OpenAI model configurations by moving the GPT-OSS 20B variant into its own dedicated file and updates the Poolside Laguna-XS.2 configuration to support DFlash speculative decoding. Feedback indicates that the newly added strategies for Laguna-XS.2 are missing their definition files, which will prevent them from appearing in the UI. Additionally, there is an inconsistency in the --max-model-len value provided in the Laguna-XS.2 guide that requires clarification or correction.
```yaml
      - single_node_tep
      - single_node_dep
```
The strategies single_node_tep and single_node_dep have been added to compatible_strategies, but the corresponding strategy definition files (e.g., strategies/single_node_tep.yaml) appear to be missing from this pull request. Without these files, these strategies will be filtered out by the CommandBuilder logic and will not be available for selection in the UI.
```bash
VLLM_USE_DEEP_GEMM=0 vllm serve poolside/Laguna-XS.2 \
  --trust-remote-code \
  --max-model-len 16384 \
```
There is an inconsistency in the --max-model-len value between the base launch command (131072 on line 103) and the speculative decoding example (16384 on line 159). If this reduction is a technical requirement for DFlash or due to memory constraints when VLLM_USE_DEEP_GEMM=0 is set, it should be explicitly noted in the guide to avoid confusing users who expect the full 128K context.
Summary
- `openai/gpt-oss-20b` — 21B / 3.6B-active MoE with native MXFP4, fits in 16GB VRAM. Single-node TP=1 default, with Hopper / Blackwell / AMD tuning ported from the `gpt-oss-120b` sibling (shared kernel paths). Removes the now-redundant `"20b"` variant from `gpt-oss-120b.yaml` so each HF id maps to one recipe page.
- `poolside/Laguna-XS.2` spec decoding — adds a `spec_decoding` feature wiring `--speculative-config` to `poolside/Laguna-XS.2-speculator.dflash` (DFlash method, `num_speculative_tokens=7`). The guide documents the `VLLM_USE_DEEP_GEMM=0` requirement and the vLLM PR #41880 dependency.

Test plan

- `node scripts/build-recipes-api.mjs` → `✓ JSON API: 91 models, 8 strategies` (was 90 before).
- `public/openai/gpt-oss-20b.json` recommended command matches the YAML on H200 default.
- `public/poolside/Laguna-XS.2.json` shows `spec_decoding` in features, listed under `opt_in_features`.

🤖 Generated with Claude Code