[Spec][Ngram] Support multiple SAMs with dynamic HTTP API #22203

Merged
hnyls2002 merged 29 commits into main from lsyin/multi-sam-http-api
Apr 7, 2026

@hnyls2002
Collaborator

@hnyls2002 hnyls2002 commented Apr 6, 2026

Motivation

Part of Ngram refactoring series #21052
Following #21425

The single external SAM loaded at startup via --speculative-ngram-external-corpus-path is not flexible enough. Users may want to add/remove corpora at runtime without restarting the server.

Modifications

Multi-SAM storage: Replace the single sam_ pointer with a map<string, unique_ptr<SAM>> keyed by corpus_id. The total external_sam_budget is split equally across all active SAMs; each SAM builds its candidates independently, and the results are merged via combineRootResults_.
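
The equal split and merge can be sketched roughly as follows. This is a minimal Python illustration, not the C++ implementation: `split_budget`, `merge_candidates`, and the toy matcher callables are hypothetical names, the budget is treated as a simple candidate count, and giving leftover units to the first corpora is an assumption, since the description only says the budget is split equally.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for one SAM: (context, budget) -> list of draft candidates.
Matcher = Callable[[List[int], int], List[List[int]]]


def split_budget(total_budget: int, corpus_ids: List[str]) -> Dict[str, int]:
    """Divide total_budget evenly; leftover units go to the first corpora (assumption)."""
    if not corpus_ids:
        return {}
    base, extra = divmod(total_budget, len(corpus_ids))
    return {cid: base + (1 if i < extra else 0) for i, cid in enumerate(corpus_ids)}


def merge_candidates(
    sams: Dict[str, Matcher], context: List[int], total_budget: int
) -> List[List[int]]:
    """Query each corpus with its own share, then concatenate without duplicates.

    The concatenation is only a placeholder for the real merge step.
    """
    shares = split_budget(total_budget, sorted(sams))
    merged: List[List[int]] = []
    for cid, share in shares.items():
        for candidate in sams[cid](context, share)[:share]:
            if candidate not in merged:
                merged.append(candidate)
    return merged
```

With three corpora and a budget of 8, this sketch hands out shares of 3, 3, and 2. A corpus that finds no match simply leaves its share unused here.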

HTTP API:

  • POST /add_external_corpus — accepts {corpus_id?, file_path} or {corpus_id?, documents: [...]}. corpus_id is optional (auto-generated UUID if omitted). Documents exceeding max_tokens are automatically truncated with a note in the response.
  • POST /remove_external_corpus — accepts {corpus_id}
  • GET /list_external_corpora — returns active corpus IDs
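
A client-side sketch of the two POST endpoints, using only the Python standard library. The server address, the helper names, and the rule that exactly one of `file_path` / `documents` is sent per request are assumptions made for illustration. The response format is not specified in this description, so the sketch builds requests without parsing replies.

```python
import json
import urllib.request
from typing import Any, Dict, List, Optional

BASE_URL = "http://localhost:30000"  # hypothetical server address


def _post(path: str, payload: Dict[str, Any]) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request."""
    return urllib.request.Request(
        f"{BASE_URL}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_add_corpus_request(
    file_path: Optional[str] = None,
    documents: Optional[List[str]] = None,
    corpus_id: Optional[str] = None,
) -> urllib.request.Request:
    """Request for POST /add_external_corpus."""
    # Assumption: a request carries either file_path or documents, not both.
    if (file_path is None) == (documents is None):
        raise ValueError("pass exactly one of file_path or documents")
    payload: Dict[str, Any] = (
        {"file_path": file_path} if file_path is not None else {"documents": documents}
    )
    if corpus_id is not None:  # optional; a UUID is generated server-side if omitted
        payload["corpus_id"] = corpus_id
    return _post("/add_external_corpus", payload)


def build_remove_corpus_request(corpus_id: str) -> urllib.request.Request:
    """Request for POST /remove_external_corpus."""
    return _post("/remove_external_corpus", {"corpus_id": corpus_id})
```

Passing a built request to `urllib.request.urlopen` would send it. A plain GET on `/list_external_corpora` needs no body.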

Non-blocking loading: SAM construction runs in a background thread managed by ExternalCorpusManager. The scheduler event loop continues processing inference requests during corpus loading. Uses the same deferred response pattern as flush_cache.
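
The background-build idea can be illustrated with a toy Python model. This is not the actual ExternalCorpusManager; `BackgroundLoader`, `submit`, and `poll_finished` are hypothetical names. It only shows the shape of the pattern: the build runs on a worker thread, and the event loop polls for finished jobs so it can send the deferred response later.

```python
import threading
from typing import Callable, Dict, List, Optional, Tuple


class BackgroundLoader:
    """Toy model: build a corpus off-thread and let an event loop poll for results."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        # Finished jobs as (corpus_id, error message or None).
        self._done: List[Tuple[str, Optional[str]]] = []
        self.corpora: Dict[str, object] = {}

    def submit(self, corpus_id: str, build: Callable[[], object]) -> threading.Thread:
        """Start building in the background and return immediately."""

        def worker() -> None:
            try:
                result, error = build(), None
            except Exception as exc:  # report the failure instead of crashing
                result, error = None, str(exc)
            with self._lock:
                if error is None:
                    self.corpora[corpus_id] = result
                self._done.append((corpus_id, error))

        thread = threading.Thread(target=worker, daemon=True)
        thread.start()
        return thread

    def poll_finished(self) -> List[Tuple[str, Optional[str]]]:
        """Called once per event-loop iteration; does not wait for running builds."""
        with self._lock:
            finished, self._done = self._done, []
        return finished
```

In this model the event loop would call `poll_finished()` between inference steps and reply to the waiting HTTP request once its corpus ID appears in the returned list.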

Tokenization: Happens in TokenizerManager. Token chunks are forwarded to scheduler → NGRAMWorker → C++ via ZMQ (same pattern as flush_cache / update_weights).
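
Splitting a tokenized document into chunks before forwarding can be sketched as below. The helper name and the chunk size are hypothetical; the actual chunk size and message types used in the PR are not stated here.

```python
from typing import Iterator, List


def iter_token_chunks(token_ids: List[int], chunk_size: int) -> Iterator[List[int]]:
    """Yield consecutive slices of at most chunk_size token IDs."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(token_ids), chunk_size):
        yield token_ids[start : start + chunk_size]
```

Each yielded chunk would then be sent as one message, which keeps individual messages small regardless of corpus size.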

Backward compatible: the startup flag --speculative-ngram-external-corpus-path still works; the file is loaded as a corpus with the file path as its ID.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces support for multiple external corpora in the ngram speculative decoding system, transitioning from a single Suffix Automaton (SAM) to a map of named SAMs. It adds functionality to dynamically load, remove, and list corpora through new HTTP endpoints and internal API updates. The batchMatch logic has been modified to distribute the draft token budget across all active SAMs. Review feedback highlights a bug in the exception handling that inadvertently clears all corpora, suggests improvements to the budget distribution calculation, recommends restoring documentation for mutex protection, and identifies a performance optimization for string concatenation in the FFI layer.

        chunk_count += 1
    self.finish_external_corpus_load()  # type: ignore
except Exception:
    self.clear_external_corpus()  # type: ignore

high

Calling self.clear_external_corpus() on an exception will clear all existing corpora, not just the one that failed during loading. This is likely not the intended behavior when multiple corpora are supported.

To clear only the staging area without a dedicated C++ function, you can start a new load with the same ID and immediately finish it. This will effectively discard the partially loaded corpus.

Suggested change
- self.clear_external_corpus() # type: ignore
+ self.start_external_corpus_load_named(corpus_id) # type: ignore
+ self.finish_external_corpus_load() # type: ignore
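
The difference between clearing everything and discarding only the failed load can be shown with a toy model. `ToyCorpusStore` and all of its methods are hypothetical stand-ins, not the real FFI surface. In particular, `discard_staging` is an invented helper used here in place of the start-then-finish workaround suggested above.

```python
from typing import Dict, Iterable, List, Optional


class ToyCorpusStore:
    """Toy stand-in for the native store: committed corpora plus one staging slot."""

    def __init__(self) -> None:
        self.committed: Dict[str, List[int]] = {}
        self.staging_id: Optional[str] = None
        self.staging_tokens: List[int] = []

    def start_load(self, corpus_id: str) -> None:
        self.staging_id, self.staging_tokens = corpus_id, []

    def append(self, tokens: List[int]) -> None:
        self.staging_tokens.extend(tokens)

    def finish_load(self) -> None:
        assert self.staging_id is not None
        self.committed[self.staging_id] = self.staging_tokens
        self.discard_staging()

    def discard_staging(self) -> None:
        """Drop only the partially loaded corpus (hypothetical helper)."""
        self.staging_id, self.staging_tokens = None, []


def load_corpus(store: ToyCorpusStore, corpus_id: str, chunks: Iterable[List[int]]) -> None:
    """Load one corpus; on failure, leave already committed corpora untouched."""
    store.start_load(corpus_id)
    try:
        for chunk in chunks:
            store.append(chunk)
        store.finish_load()
    except Exception:
        store.discard_staging()  # not a clear-all: other corpora survive
        raise
```

If the failure path instead wiped `committed`, a single bad upload would remove every corpus that had been loaded earlier, which is the behavior the review comment warns about.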

Comment on lines 27 to 28
mutable std::mutex mutex_;
mutable std::condition_variable sync_cv_;

medium

The comments explaining what mutex_ protects have been removed. With the addition of new members (sams_, staging_corpus_id_, staging_sam_) that are also protected by this mutex, it would be beneficial for maintainability to restore and update the comment.

Suggested change
- mutable std::mutex mutex_;
- mutable std::condition_variable sync_cv_;
+ // NOTE: protects trie_, sams_, staging_corpus_id_, staging_sam_, and pending_count_.
+ // Ensures batchMatch never reads trie_/sams_ while they are being modified.
+ // After synchronize(), no pending inserts remain so mutex_ contention on trie_ is effectively zero.
+ mutable std::mutex mutex_;
+ mutable std::condition_variable sync_cv_;

Comment on lines +133 to +137
std::string result;
for (size_t i = 0; i < ids.size(); ++i) {
if (i > 0) result += ",";
result += ids[i];
}

medium

Repeatedly concatenating to a std::string in a loop can be inefficient due to potential reallocations. For better performance, especially if the number of corpora could grow, consider pre-calculating the total string size and reserving capacity.

Suggested change
- std::string result;
- for (size_t i = 0; i < ids.size(); ++i) {
-   if (i > 0) result += ",";
-   result += ids[i];
- }
+ if (ids.empty()) {
+   return "";
+ }
+ size_t total_size = ids.size() - 1; // For commas
+ for (const auto& id : ids) {
+   total_size += id.length();
+ }
+ std::string result;
+ result.reserve(total_size);
+ result += ids[0];
+ for (size_t i = 1; i < ids.size(); ++i) {
+   result += ',';
+   result += ids[i];
+ }

@hnyls2002
Collaborator Author

/tag-and-rerun-ci

@hnyls2002
Collaborator Author

/rerun-test registered/unit/spec/test_ngram_corpus.py registered/unit/server_args/test_server_args.py

@github-actions github-actions bot added the run-ci label Apr 6, 2026
@github-actions
Contributor

github-actions bot commented Apr 6, 2026

ubuntu-latest (2 tests): View workflow run

cd test/ && python3 registered/unit/spec/test_ngram_corpus.py
cd test/ && python3 registered/unit/server_args/test_server_args.py

@hnyls2002
Collaborator Author

/rerun-test registered/spec/test_ngram_speculative_decoding.py

@github-actions
Contributor

github-actions bot commented Apr 6, 2026

1-gpu-h100 (1 test): View workflow run

cd test/ && python3 registered/spec/test_ngram_speculative_decoding.py

@hnyls2002
Collaborator Author

/rerun-test registered/unit/spec/test_ngram_corpus.py registered/unit/server_args/test_server_args.py registered/spec/test_ngram_speculative_decoding.py

@github-actions
Contributor

github-actions bot commented Apr 6, 2026

ubuntu-latest (2 tests): View workflow run

cd test/ && python3 registered/unit/spec/test_ngram_corpus.py
cd test/ && python3 registered/unit/server_args/test_server_args.py

1-gpu-h100 (1 test): View workflow run

cd test/ && python3 registered/spec/test_ngram_speculative_decoding.py

get_hash_str and hash_str_to_int64 are pure-Python hash functions that
don't need CUDA. Moving them to a lightweight hash_utils module breaks
the import chain: radix_cache → hicache_storage → memory_pool_host →
sgl_kernel → libcuda.so.1. This allows io_struct.py (via schedule_batch
→ radix_cache) to be imported in CPU-only environments.

- C++: replace single sam_ with map<string, shared_ptr<SAM>> sams_
- Budget splitting: equal division across all active SAMs
- FFI: add start_external_corpus_load_named, remove, list methods
- HTTP endpoints: POST /add_external_corpus, POST /remove_external_corpus, GET /list_external_corpora
- Full request chain: HTTP → TokenizerManager (tokenize) → Scheduler → NGRAMWorker → C++
- Backward compatible: startup --speculative-ngram-external-corpus-path uses "__default__" corpus_id

@hnyls2002 hnyls2002 force-pushed the lsyin/multi-sam-http-api branch from 5eecd74 to 1f412bb Compare April 6, 2026 23:39
@hnyls2002 hnyls2002 changed the base branch from main to lsyin/move-hash-utils-to-mem-cache April 6, 2026 23:39
@hnyls2002
Collaborator Author

/rerun-test registered/unit/spec/test_ngram_corpus.py registered/unit/server_args/test_server_args.py registered/spec/test_ngram_speculative_decoding.py

@github-actions
Contributor

github-actions bot commented Apr 6, 2026

ubuntu-latest (2 tests): View workflow run

cd test/ && python3 registered/unit/spec/test_ngram_corpus.py
cd test/ && python3 registered/unit/server_args/test_server_args.py

1-gpu-h100 (1 test): View workflow run

cd test/ && python3 registered/spec/test_ngram_speculative_decoding.py

@kpham-sgl kpham-sgl self-assigned this Apr 6, 2026
@hnyls2002
Collaborator Author

/rerun-test registered/unit/spec/test_ngram_corpus.py registered/unit/server_args/test_server_args.py registered/spec/test_ngram_speculative_decoding.py

@github-actions
Contributor

github-actions bot commented Apr 6, 2026

ubuntu-latest (2 tests): View workflow run

cd test/ && python3 registered/unit/spec/test_ngram_corpus.py
cd test/ && python3 registered/unit/server_args/test_server_args.py

1-gpu-h100 (1 test): View workflow run

cd test/ && python3 registered/spec/test_ngram_speculative_decoding.py

Base automatically changed from lsyin/move-hash-utils-to-mem-cache to main April 7, 2026 01:16
@hnyls2002
Collaborator Author

/rerun-test registered/unit/spec/test_ngram_corpus.py registered/unit/server_args/test_server_args.py registered/spec/test_ngram_speculative_decoding.py registered/unit/mem_cache/test_radix_cache_unit.py registered/hicache/test_hicache_storage.py registered/radix_cache/test_radix_cache_hit.py

@github-actions
Contributor

github-actions bot commented Apr 7, 2026

ubuntu-latest (2 tests): View workflow run

cd test/ && python3 registered/unit/spec/test_ngram_corpus.py
cd test/ && python3 registered/unit/server_args/test_server_args.py

1-gpu-h100 (1 test): View workflow run

cd test/ && python3 registered/spec/test_ngram_speculative_decoding.py

1-gpu-5090 (3 tests): View workflow run

cd test/ && python3 registered/unit/mem_cache/test_radix_cache_unit.py
cd test/ && python3 registered/hicache/test_hicache_storage.py
cd test/ && python3 registered/radix_cache/test_radix_cache_hit.py

@hnyls2002 hnyls2002 merged commit e4b1366 into main Apr 7, 2026
56 of 105 checks passed
@hnyls2002 hnyls2002 deleted the lsyin/multi-sam-http-api branch April 7, 2026 01:49

2 participants