
Add openai/gpt-oss-20b recipe and Laguna-XS.2 DFlash spec decoding #447

Merged
esmeetu merged 1 commit into main from add-gpt-oss-20b-and-laguna-spec-decoding on May 8, 2026

Conversation


@esmeetu esmeetu commented May 8, 2026

Summary

  • New recipe openai/gpt-oss-20b: a 21B-parameter / 3.6B-active MoE with native MXFP4 that fits in 16GB of VRAM. Defaults to single-node TP=1, with Hopper / Blackwell / AMD tuning ported from the gpt-oss-120b sibling (the two models share kernel paths). Removes the now-redundant "20b" variant from gpt-oss-120b.yaml so each HF id maps to exactly one recipe page.
  • poolside/Laguna-XS.2 spec decoding: adds a spec_decoding feature that wires --speculative-config to poolside/Laguna-XS.2-speculator.dflash (DFlash method, num_speculative_tokens=7); see the command sketch after this list. The guide documents the VLLM_USE_DEEP_GEMM=0 requirement and the dependency on vLLM PR #41880.
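
The two items above imply launch commands along these lines. This is a sketch assembled from the flags named in this PR, not copied verbatim from the generated recipe pages:

```bash
# gpt-oss-20b with the single-node TP=1 default described above:
vllm serve openai/gpt-oss-20b --tensor-parallel-size 1

# Laguna-XS.2 with the spec_decoding opt-in (DFlash draft, 7 speculative
# tokens); VLLM_USE_DEEP_GEMM=0 is required per the recipe guide:
VLLM_USE_DEEP_GEMM=0 vllm serve poolside/Laguna-XS.2 \
  --trust-remote-code \
  --speculative-config '{"model":"poolside/Laguna-XS.2-speculator.dflash","num_speculative_tokens":7,"method":"dflash"}'
```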

Test plan

  • node scripts/build-recipes-api.mjs passes ✓. The JSON API now reports 91 models and 8 strategies (was 90 models before).
  • The recommended command in public/openai/gpt-oss-20b.json matches the YAML for the H200 default.
  • public/poolside/Laguna-XS.2.json shows spec_decoding in features and lists it under opt_in_features.
  • Reviewer to eyeball the Hopper/Blackwell tuning copied to gpt-oss-20b and confirm it still holds on the smaller model.

🤖 Generated with Claude Code

- Add standalone openai/gpt-oss-20b recipe (21B/3.6B-A MoE, MXFP4,
  16GB VRAM, single_node_tp with tp=1), with hardware tuning ported
  from the 120b sibling for the shared gpt-oss kernel paths.
- Remove the now-redundant "20b" variant from gpt-oss-120b.yaml so
  the 20b page is the single source of truth.
- Add spec_decoding feature to poolside/Laguna-XS.2 using the
  Laguna-XS.2-speculator.dflash draft model (DFlash, 7 tokens,
  greedy); document the VLLM_USE_DEEP_GEMM=0 requirement and
  PR #41880 dependency in the guide.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: yasong.wang <yasong.wang@inferact.ai>

@esmeetu esmeetu merged commit 56922ce into main May 8, 2026
4 checks passed
@esmeetu esmeetu deleted the add-gpt-oss-20b-and-laguna-spec-decoding branch May 8, 2026 11:24

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1bafcf0e1f

Comment on lines +40 to +44
```yaml
spec_decoding:
  description: "DFlash speculative decoding with the Laguna-XS.2 draft model (7 tokens, greedy)"
  args:
    - "--speculative-config"
    - '{"model":"poolside/Laguna-XS.2-speculator.dflash","num_speculative_tokens":7,"method":"dflash"}'
```

P1: Include the required DeepGEMM env guard for spec decoding

This spec_decoding feature only adds --speculative-config, but the same recipe explicitly documents that DFlash requires VLLM_USE_DEEP_GEMM=0 to work. In this codebase, feature toggles are rendered as CLI args only, so users who enable this toggle via the command builder will get a command that is missing the required environment setting and can fail at runtime unless they manually edit it. Please wire the required env guard into the generated configuration path, not only the guide text.
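
One way the fix could look, as a purely hypothetical sketch: the env key below is an assumption about the feature schema, not a field this repo is known to support, so the real fix may need to live in the command builder instead:

```yaml
spec_decoding:
  description: "DFlash speculative decoding with the Laguna-XS.2 draft model (7 tokens, greedy)"
  # Hypothetical field: assumes feature toggles could carry environment
  # variables that the command builder renders as a VAR=value prefix.
  env:
    VLLM_USE_DEEP_GEMM: "0"
  args:
    - "--speculative-config"
    - '{"model":"poolside/Laguna-XS.2-speculator.dflash","num_speculative_tokens":7,"method":"dflash"}'
```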

@gemini-code-assist (bot) left a comment

Code Review

This pull request refactors the OpenAI model configurations by moving the GPT-OSS 20B variant into its own dedicated file and updates the Poolside Laguna-XS.2 configuration to support DFlash speculative decoding. Feedback indicates that the newly added strategies for Laguna-XS.2 are missing their definition files, which will prevent them from appearing in the UI. Additionally, there is an inconsistency in the --max-model-len value provided in the Laguna-XS.2 guide that requires clarification or correction.

Comment on lines +57 to +58
```yaml
- single_node_tep
- single_node_dep
```

medium

The strategies single_node_tep and single_node_dep have been added to compatible_strategies, but the corresponding strategy definition files (e.g., strategies/single_node_tep.yaml) appear to be missing from this pull request. Without these files, these strategies will be filtered out by the CommandBuilder logic and will not be available for selection in the UI.
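
For reference, a strategy definition might look roughly like the hypothetical sketch below; every field here is guessed from the strategy name alone, since the repo's actual schema is not shown in this PR:

```yaml
# Hypothetical strategies/single_node_tep.yaml. The field names, and the
# reading of "tep" as tensor + expert parallelism, are assumptions.
name: single_node_tep
description: "Single node, tensor parallelism with expert parallelism for MoE layers"
args:
  - "--enable-expert-parallel"
```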

```bash
VLLM_USE_DEEP_GEMM=0 vllm serve poolside/Laguna-XS.2 \
  --trust-remote-code \
  --max-model-len 16384 \
```

medium

There is an inconsistency in the --max-model-len value between the base launch command (131072 on line 103) and the speculative decoding example (16384 on line 159). If this reduction is a technical requirement for DFlash or due to memory constraints when VLLM_USE_DEEP_GEMM=0 is set, it should be explicitly noted in the guide to avoid confusing users who expect the full 128K context.

