[Spec][Ngram] Support multiple SAMs with dynamic HTTP API#22203
Conversation
Code Review
This pull request introduces support for multiple external corpora in the ngram speculative decoding system, transitioning from a single Suffix Automaton (SAM) to a map of named SAMs. It adds functionality to dynamically load, remove, and list corpora through new HTTP endpoints and internal API updates. The batchMatch logic has been modified to distribute the draft token budget across all active SAMs. Review feedback highlights a bug in the exception handling that inadvertently clears all corpora, suggests improvements to the budget distribution calculation, recommends restoring documentation for mutex protection, and identifies a performance optimization for string concatenation in the FFI layer.
```python
            chunk_count += 1
        self.finish_external_corpus_load()  # type: ignore
    except Exception:
        self.clear_external_corpus()  # type: ignore
```
Calling self.clear_external_corpus() on an exception will clear all existing corpora, not just the one that failed during loading. This is likely not the intended behavior when multiple corpora are supported.
To clear only the staging area without a dedicated C++ function, you can start a new load with the same ID and immediately finish it. This will effectively discard the partially loaded corpus.
Suggested change (replacing the `self.clear_external_corpus()` call):

```python
self.start_external_corpus_load_named(corpus_id)  # type: ignore
self.finish_external_corpus_load()  # type: ignore
```
```cpp
mutable std::mutex mutex_;
mutable std::condition_variable sync_cv_;
```
The comments explaining what mutex_ protects have been removed. With the addition of new members (sams_, staging_corpus_id_, staging_sam_) that are also protected by this mutex, it would be beneficial for maintainability to restore and update the comment.
Suggested change:

```cpp
// NOTE: protects trie_, sams_, staging_corpus_id_, staging_sam_, and pending_count_.
// Ensures batchMatch never reads trie_/sams_ while they are being modified.
// After synchronize(), no pending inserts remain so mutex_ contention on trie_ is effectively zero.
mutable std::mutex mutex_;
mutable std::condition_variable sync_cv_;
```
```cpp
std::string result;
for (size_t i = 0; i < ids.size(); ++i) {
  if (i > 0) result += ",";
  result += ids[i];
}
```
Repeatedly concatenating to a std::string in a loop can be inefficient due to potential reallocations. For better performance, especially if the number of corpora could grow, consider pre-calculating the total string size and reserving capacity.
Suggested change:

```cpp
if (ids.empty()) {
  return "";
}
size_t total_size = ids.size() - 1;  // For commas
for (const auto& id : ids) {
  total_size += id.length();
}
std::string result;
result.reserve(total_size);
result += ids[0];
for (size_t i = 1; i < ids.size(); ++i) {
  result += ',';
  result += ids[i];
}
```
/tag-and-rerun-ci

/rerun-test registered/unit/spec/test_ngram_corpus.py registered/unit/server_args/test_server_args.py

✅

/rerun-test registered/spec/test_ngram_speculative_decoding.py

✅

/rerun-test registered/unit/spec/test_ngram_corpus.py registered/unit/server_args/test_server_args.py registered/spec/test_ngram_speculative_decoding.py

✅ ✅
get_hash_str and hash_str_to_int64 are pure-Python hash functions that don't need CUDA. Moving them to a lightweight hash_utils module breaks the import chain: radix_cache → hicache_storage → memory_pool_host → sgl_kernel → libcuda.so.1. This allows io_struct.py (via schedule_batch → radix_cache) to be imported in CPU-only environments.
- C++: replace single `sam_` with `map<string, shared_ptr<SAM>>` `sams_`
- Budget splitting: equal division across all active SAMs
- FFI: add `start_external_corpus_load_named`, remove, list methods
- HTTP endpoints: `POST /add_external_corpus`, `POST /remove_external_corpus`, `GET /list_external_corpora`
- Full request chain: HTTP → TokenizerManager (tokenize) → Scheduler → NGRAMWorker → C++
- Backward compatible: startup `--speculative-ngram-external-corpus-path` uses `"__default__"` corpus_id
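The "equal division" budget split can be sketched as follows. This is a hedged illustration, not the PR's C++ code: the function name and the choice to give the remainder to the first corpora are assumptions; the PR only states that the total budget is divided equally across active SAMs.

```python
# Illustrative sketch of splitting the external SAM draft-token budget
# equally across active corpora. Remainder handling is an assumption.
def split_budget(total_budget: int, corpus_ids: list[str]) -> dict[str, int]:
    """Divide total_budget equally over corpus_ids; leftover tokens go to
    the first corpora so the whole budget is always used."""
    if not corpus_ids:
        return {}
    base, rem = divmod(total_budget, len(corpus_ids))
    return {cid: base + (1 if i < rem else 0) for i, cid in enumerate(corpus_ids)}
```

Each SAM then builds candidates independently within its share, and the per-corpus results are merged afterwards.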
Force-pushed 5eecd74 to 1f412bb (compare)
/rerun-test registered/unit/spec/test_ngram_corpus.py registered/unit/server_args/test_server_args.py registered/spec/test_ngram_speculative_decoding.py

✅ ✅

/rerun-test registered/unit/spec/test_ngram_corpus.py registered/unit/server_args/test_server_args.py registered/spec/test_ngram_speculative_decoding.py

✅ ✅

/rerun-test registered/unit/spec/test_ngram_corpus.py registered/unit/server_args/test_server_args.py registered/spec/test_ngram_speculative_decoding.py registered/unit/mem_cache/test_radix_cache_unit.py registered/hicache/test_hicache_storage.py registered/radix_cache/test_radix_cache_hit.py

✅ ✅ ✅
Motivation
Part of Ngram refactoring series #21052
Following #21425
The single external SAM loaded at startup via `--speculative-ngram-external-corpus-path` is not flexible enough. Users may want to add/remove corpora at runtime without restarting the server.

Modifications
Multi-SAM storage: Replace the single `sam_` pointer with `map<string, unique_ptr<SAM>>` keyed by `corpus_id`. The total `external_sam_budget` is split equally across all active SAMs; each builds candidates independently, and results are merged via `combineRootResults_`.

HTTP API:

- `POST /add_external_corpus` — accepts `{corpus_id?, file_path}` or `{corpus_id?, documents: [...]}`. `corpus_id` is optional (an auto-generated UUID if omitted). Documents exceeding `max_tokens` are automatically truncated, with a note in the response.
- `POST /remove_external_corpus` — accepts `{corpus_id}`
- `GET /list_external_corpora` — returns active corpus IDs

Non-blocking loading: SAM construction runs in a background thread managed by `ExternalCorpusManager`. The scheduler event loop continues processing inference requests during corpus loading. Uses the same deferred-response pattern as `flush_cache`.

Tokenization: Happens in TokenizerManager. Token chunks are forwarded to scheduler → NGRAMWorker → C++ via ZMQ (same pattern as `flush_cache`/`update_weights`).

Backward compatible: startup `--speculative-ngram-external-corpus-path` still works, loading as a corpus with the file path as ID.