
[Spec][Ngram] Add output-as-corpus accept length benchmark for external SAM #22199

Merged
hnyls2002 merged 1 commit into main from lsyin/ngram-sam-accept-length-benchmark on Apr 7, 2026

Conversation

@hnyls2002 (Collaborator) commented Apr 6, 2026

Motivation

Part of the Ngram refactoring series #21052, following #22203.

There is currently no end-to-end test proving that the external SAM actually improves speculative decoding accept length.

Modifications

Add test_output_as_corpus_boosts_accept_length to TestNgramSpeculativeDecodingFlashinfer:

  1. Generate outputs with temperature=0 (baseline, trie only), record avg_spec_accept_length
  2. POST /add_external_corpus with generated outputs as corpus
  3. Regenerate same prompts, assert SAM accept length >= 2x baseline

The test uses a single server launch, reusing the existing flashinfer test class with --speculative-ngram-external-sam-budget 8. It also removes the now-redundant TestNgramExternalSamSmoke, since SAM functionality is covered by this benchmark plus the unit tests.
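The three steps above can be sketched roughly as follows. This is a minimal illustration, not the actual test code: the `client` interface (`generate`, `post`, `accept_length`) and the `"texts"` payload key are hypothetical stand-ins for the test's real HTTP helpers.

```python
# Hedged sketch of the benchmark flow. `client` is a hypothetical
# stand-in for the test's HTTP helpers; the real test uses
# _generate_batch, POST /add_external_corpus, and /get_server_info.

def run_accept_length_benchmark(client, prompts, max_new_tokens=64):
    # 1. Baseline: greedy decoding with the trie-only drafter,
    #    recording avg_spec_accept_length afterwards.
    outputs = client.generate(prompts, max_new_tokens=max_new_tokens, temperature=0)
    baseline = client.accept_length()

    # 2. Feed the model's own outputs back as an external SAM corpus.
    #    ("texts" is an assumed payload key, not the documented API.)
    client.post("/add_external_corpus", {"texts": outputs})

    # 3. Regenerate the same prompts; with its own outputs in the corpus,
    #    the SAM can propose long exact continuations, so the accept
    #    length is asserted to at least double.
    client.generate(prompts, max_new_tokens=max_new_tokens, temperature=0)
    boosted = client.accept_length()
    assert boosted >= 2 * baseline, (baseline, boosted)
    return baseline, boosted
```

Because temperature=0 makes the regeneration byte-identical to the corpus, the drafter's proposals match exactly, which is what drives the 2x acceptance bound.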

@gemini-code-assist gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request adds the TestNgramExternalSamAcceptLength test class to verify that N-gram speculative decoding with an external corpus significantly increases acceptance length. Feedback includes improving the robustness of server statistics retrieval for multi-GPU environments and a minor refactor of the baseline output generation loop.

Comment on lines +213 to +215:

    def _get_accept_length(self, base_url):
        server_info = requests.get(base_url + "/get_server_info").json()
        return server_info["internal_states"][0]["avg_spec_accept_length"]

Severity: medium

The _get_accept_length method assumes that internal_states is non-empty and that the first entry contains the avg_spec_accept_length key. While this is likely true for a single-GPU setup, it could fail with a KeyError or IndexError in multi-GPU (DP/PP) environments where stats might be distributed or not yet reported. Consider making this more robust by checking for the key's existence or averaging across all ranks.

Suggested change:

    def _get_accept_length(self, base_url):
        server_info = requests.get(base_url + "/get_server_info").json()
        states = server_info.get("internal_states", [])
        accept_lengths = [
            s["avg_spec_accept_length"]
            for s in states
            if "avg_spec_accept_length" in s
        ]
        return sum(accept_lengths) / len(accept_lengths) if accept_lengths else 0.0

Comment on lines +228 to +233:

    generated_outputs = []
    for _ in range(self.num_rounds):
        outputs = self._generate_batch(
            self.base_url, self.prompts, self.max_new_tokens
        )
        generated_outputs = outputs

Severity: medium

In the baseline generation loop, generated_outputs is overwritten in each round. While temperature=0 should produce deterministic results, it is generally safer to either accumulate all outputs or explicitly use the last round's results to ensure the corpus is fully populated as intended.

Suggested change:

    generated_outputs = []
    for _ in range(self.num_rounds):
        generated_outputs = self._generate_batch(
            self.base_url, self.prompts, self.max_new_tokens
        )
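The review also mentions accumulating all rounds' outputs as an alternative to keeping only the last round. A sketch of that variant, where `generate_batch` is an illustrative stand-in for the test's `_generate_batch` helper:

```python
# Alternative to last-round overwrite: accumulate every round's outputs
# so the corpus covers all generated text. `generate_batch` here is an
# illustrative stand-in, not the actual test helper.
def collect_corpus(generate_batch, prompts, num_rounds, max_new_tokens):
    corpus = []
    for _ in range(num_rounds):
        corpus.extend(generate_batch(prompts, max_new_tokens))
    return corpus
```

With temperature=0 the rounds are deterministic, so accumulation only duplicates entries; its value is making the intent explicit rather than changing the corpus contents.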

@hnyls2002 hnyls2002 force-pushed the lsyin/ngram-sam-accept-length-benchmark branch 2 times, most recently from 6f726fa to fa62815 (April 7, 2026 01:53), then from fa62815 to 921b8c2 (April 7, 2026 01:56)
@hnyls2002 (Collaborator, Author) commented:

/rerun-test registered/spec/test_ngram_speculative_decoding.py

@github-actions github-actions bot commented Apr 7, 2026

1-gpu-h100 (1 test): View workflow run

cd test/ && python3 registered/spec/test_ngram_speculative_decoding.py

@hnyls2002 hnyls2002 merged commit be0277f into main Apr 7, 2026
62 of 68 checks passed
@hnyls2002 hnyls2002 deleted the lsyin/ngram-sam-accept-length-benchmark branch April 7, 2026 02:09
carlosfundora pushed a commit to carlosfundora/sglang-1-bit-turbo that referenced this pull request Apr 8, 2026
…h benchmark for external SAM (sgl-project#22199)

Upstream SHA: be0277f
Cherry-picked from sgl-project/sglang

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
