[Spec][Ngram] Support multiple SAMs with dynamic HTTP API#22203
Conversation
Code Review
This pull request introduces support for multiple external corpora in the ngram speculative decoding system, transitioning from a single Suffix Automaton (SAM) to a map of named SAMs. It adds functionality to dynamically load, remove, and list corpora through new HTTP endpoints and internal API updates. The batchMatch logic has been modified to distribute the draft token budget across all active SAMs. Review feedback highlights a bug in the exception handling that inadvertently clears all corpora, suggests improvements to the budget distribution calculation, recommends restoring documentation for mutex protection, and identifies a performance optimization for string concatenation in the FFI layer.
```python
            chunk_count += 1
        self.finish_external_corpus_load()  # type: ignore
    except Exception:
        self.clear_external_corpus()  # type: ignore
```
Calling self.clear_external_corpus() on an exception will clear all existing corpora, not just the one that failed during loading. This is likely not the intended behavior when multiple corpora are supported.
To clear only the staging area without a dedicated C++ function, you can start a new load with the same ID and immediately finish it. This will effectively discard the partially loaded corpus.
Suggested change (replacing the `self.clear_external_corpus()` call):

```python
self.start_external_corpus_load_named(corpus_id)  # type: ignore
self.finish_external_corpus_load()  # type: ignore
```
```cpp
mutable std::mutex mutex_;
mutable std::condition_variable sync_cv_;
```
The comments explaining what mutex_ protects have been removed. With the addition of new members (sams_, staging_corpus_id_, staging_sam_) that are also protected by this mutex, it would be beneficial for maintainability to restore and update the comment.
Suggested change:

```cpp
// NOTE: protects trie_, sams_, staging_corpus_id_, staging_sam_, and pending_count_.
// Ensures batchMatch never reads trie_/sams_ while they are being modified.
// After synchronize(), no pending inserts remain so mutex_ contention on trie_ is effectively zero.
mutable std::mutex mutex_;
mutable std::condition_variable sync_cv_;
```
```cpp
std::string result;
for (size_t i = 0; i < ids.size(); ++i) {
  if (i > 0) result += ",";
  result += ids[i];
}
```
Repeatedly concatenating to a std::string in a loop can be inefficient due to potential reallocations. For better performance, especially if the number of corpora could grow, consider pre-calculating the total string size and reserving capacity.
Suggested change:

```cpp
if (ids.empty()) {
  return "";
}
size_t total_size = ids.size() - 1;  // For commas
for (const auto& id : ids) {
  total_size += id.length();
}
std::string result;
result.reserve(total_size);
result += ids[0];
for (size_t i = 1; i < ids.size(); ++i) {
  result += ',';
  result += ids[i];
}
```
/tag-and-rerun-ci

/rerun-test registered/unit/spec/test_ngram_corpus.py registered/unit/server_args/test_server_args.py

✅

/rerun-test registered/spec/test_ngram_speculative_decoding.py

✅

/rerun-test registered/unit/spec/test_ngram_corpus.py registered/unit/server_args/test_server_args.py registered/spec/test_ngram_speculative_decoding.py

✅ ✅
get_hash_str and hash_str_to_int64 are pure-Python hash functions that don't need CUDA. Moving them to a lightweight hash_utils module breaks the import chain: radix_cache → hicache_storage → memory_pool_host → sgl_kernel → libcuda.so.1. This allows io_struct.py (via schedule_batch → radix_cache) to be imported in CPU-only environments.
- C++: replace single `sam_` with `map<string, shared_ptr<SAM>>` `sams_`
- Budget splitting: equal division across all active SAMs
- FFI: add `start_external_corpus_load_named`, remove, list methods
- HTTP endpoints: `POST /add_external_corpus`, `POST /remove_external_corpus`, `GET /list_external_corpora`
- Full request chain: HTTP → TokenizerManager (tokenize) → Scheduler → NGRAMWorker → C++
- Backward compatible: startup `--speculative-ngram-external-corpus-path` uses `"__default__"` corpus_id
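The "equal division" budget split can be sketched as follows. This is a hedged illustration, not the PR's C++ code: the function name and the choice to give the remainder to the first corpora are assumptions; the PR only states that the total budget is divided equally across active SAMs.

```python
# Illustrative sketch of splitting the external SAM draft-token budget
# equally across active corpora. Remainder handling is an assumption.
def split_budget(total_budget: int, corpus_ids: list[str]) -> dict[str, int]:
    """Divide total_budget equally over corpus_ids; leftover tokens go to
    the first corpora so the whole budget is always used."""
    if not corpus_ids:
        return {}
    base, rem = divmod(total_budget, len(corpus_ids))
    return {cid: base + (1 if i < rem else 0) for i, cid in enumerate(corpus_ids)}
```

Each SAM then builds candidates independently within its share, and the per-corpus results are merged afterwards.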
Force-pushed 5eecd74 to 1f412bb (compare)
/rerun-test registered/unit/spec/test_ngram_corpus.py registered/unit/server_args/test_server_args.py registered/spec/test_ngram_speculative_decoding.py

✅ ✅

/rerun-test registered/unit/spec/test_ngram_corpus.py registered/unit/server_args/test_server_args.py registered/spec/test_ngram_speculative_decoding.py

✅ ✅

/rerun-test registered/unit/spec/test_ngram_corpus.py registered/unit/server_args/test_server_args.py registered/spec/test_ngram_speculative_decoding.py registered/unit/mem_cache/test_radix_cache_unit.py registered/hicache/test_hicache_storage.py registered/radix_cache/test_radix_cache_hit.py

✅ ✅ ✅
Motivation
Part of Ngram refactoring series #21052
Following #21425
The single external SAM loaded at startup via `--speculative-ngram-external-corpus-path` is not flexible enough. Users may want to add/remove corpora at runtime without restarting the server.

Modifications
Multi-SAM storage: Replace the single `sam_` pointer with `map<string, unique_ptr<SAM>>` keyed by `corpus_id`. The total `external_sam_budget` is split equally across all active SAMs; each builds candidates independently, and results are merged via `combineRootResults_`.

HTTP API:

- `POST /add_external_corpus` — accepts `{corpus_id?, file_path}` or `{corpus_id?, documents: [...]}`. `corpus_id` is optional (an auto-generated UUID if omitted). Documents exceeding `max_tokens` are automatically truncated, with a note in the response.
- `POST /remove_external_corpus` — accepts `{corpus_id}`
- `GET /list_external_corpora` — returns active corpus IDs

Non-blocking loading: SAM construction runs in a background thread managed by `ExternalCorpusManager`. The scheduler event loop continues processing inference requests during corpus loading. Uses the same deferred-response pattern as `flush_cache`.

Tokenization: Happens in TokenizerManager. Token chunks are forwarded to scheduler → NGRAMWorker → C++ via ZMQ (same pattern as `flush_cache`/`update_weights`).

Backward compatible: startup `--speculative-ngram-external-corpus-path` still works, loading as a corpus with the file path as ID.