[core] Introduce MemoryPoolConfigurator class hierarchy #22389
Conversation
85165a2 to d235d6d
/rerun-test test_swa_unittest.py test_mimo_models.py test_deepseek_v3_mtp.py test_dsa_models_mtp.py test_qwen3_next_models_mtp.py test_qwen35_models.py test_triton_sliding_window.py test_mamba_unittest.py test_mamba2_mixer.py test_nvidia_nemotron_nano_v2.py test_nvidia_nemotron_3_super_bf16.py test_mla_deepseek_v3.py test_generation_models.py
Summary
- `MemoryPoolConfigurator` base class with a unified coeff+bias interface (`calculate_pool_sizes` / `calculate_pool_sizes_from_max_tokens`)
- `DefaultPoolConfigurator` for MHA/MLA/NSA/FP4: absorbs `get_cell_size_per_token`, num_layers deduction, DFLASH scaling
- `HybridSWAPoolConfigurator` for Gemma2/Command-R/MiMo: absorbs `resolve_hybrid_swa_tokens` with full/swa pool splitting
- `create_memory_pool_configurator()` factory
- `_resolve_memory_pool_config` now goes through the configurator flow
- `profile_max_num_token`, `_resolve_hybrid_swa_tokens`
- `MemoryPoolConfig` moved from `model_runner_kv_cache_mixin.py` to `pool_configurator.py`
- `_apply_token_constraints`

Follows up on #22384. The Mamba configurator is a separate follow-up.
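A minimal sketch of how the hierarchy described above might fit together. Class and method names (`MemoryPoolConfigurator`, `DefaultPoolConfigurator`, `create_memory_pool_configurator`, `calculate_pool_sizes`, `calculate_pool_sizes_from_max_tokens`, `MemoryPoolConfig`) come from this PR's summary; the signatures, fields, and method bodies are illustrative assumptions, not the actual implementation:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional


@dataclass
class MemoryPoolConfig:
    # Per the behavioral-changes note, max_running_requests is now
    # Optional and filled in by the consumer after the configurator runs.
    full_tokens: int
    max_running_requests: Optional[int] = None


class MemoryPoolConfigurator(ABC):
    """Base class: unified coeff+bias interface for sizing KV-cache pools."""

    @abstractmethod
    def cell_size_per_token(self) -> int:
        """Bytes of KV-cache storage per token (illustrative signature)."""

    def calculate_pool_sizes_from_max_tokens(self, max_tokens: int) -> MemoryPoolConfig:
        # Subclasses with multiple pools (e.g. hybrid SWA) would split
        # the token budget here; the default is a single full pool.
        return MemoryPoolConfig(full_tokens=max_tokens)

    def calculate_pool_sizes(self, memory_budget_bytes: int) -> MemoryPoolConfig:
        # Convert a byte budget into a token budget, then delegate.
        tokens = memory_budget_bytes // self.cell_size_per_token()
        return self.calculate_pool_sizes_from_max_tokens(tokens)


class DefaultPoolConfigurator(MemoryPoolConfigurator):
    """MHA/MLA/NSA/FP4-style models: one full-attention pool."""

    def __init__(self, num_layers: int, bytes_per_layer_token: int):
        self.num_layers = num_layers
        self.bytes_per_layer_token = bytes_per_layer_token

    def cell_size_per_token(self) -> int:
        return self.num_layers * self.bytes_per_layer_token


def create_memory_pool_configurator(arch: str, **kwargs) -> MemoryPoolConfigurator:
    # Factory: a hybrid-SWA arch (Gemma2/Command-R/MiMo) would return a
    # HybridSWAPoolConfigurator here; this sketch covers the default only.
    return DefaultPoolConfigurator(**kwargs)
```

The point of the coeff+bias split is that callers can either hand the configurator a memory budget or a token cap and get the same `MemoryPoolConfig` shape back.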
Behavioral changes
- `_cell_size` now uses the ratio-weighted formula (F*nf + r*S*ns), so `--max-total-tokens` correctly constrains `full_tokens` rather than inflating it through a memory-budget round-trip (a pre-existing issue in the old code)
- `MemoryPoolConfig` is now used directly; the `max_running_requests` default changed from a required `int` to `Optional[int] = None` (filled in by the consumer after the configurator runs)

Test plan
/rerun-stage stage-a-test-1
/rerun-stage stage-b-test-small-1-gpu
/rerun-stage stage-b-test-large-1-gpu
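The ratio-weighted cell-size formula from the Behavioral changes section can be worked through numerically. My reading of the symbols, which is an assumption rather than something the PR spells out: F and S are per-token byte costs of a full-attention layer and a sliding-window layer, nf and ns are the respective layer counts, and r is the swa-to-full token ratio. Under that reading:

```python
def cell_size(F: int, nf: int, S: int, ns: int, r: float) -> float:
    # Ratio-weighted per-full-token cost: each full-attention token
    # carries r sliding-window tokens' worth of SWA-layer storage.
    return F * nf + r * S * ns


def full_tokens_for_budget(budget_bytes: int, F: int, nf: int,
                           S: int, ns: int, r: float) -> int:
    # Dividing the memory budget by the weighted cell size yields
    # full_tokens directly, so a --max-total-tokens cap can clamp it
    # without the memory-budget round-trip the old code needed.
    return int(budget_bytes // cell_size(F, nf, S, ns, r))


# Hypothetical model: 4 full layers at 100 B/token, 28 SWA layers at
# 50 B/token, with the SWA pool sized at a quarter of the full pool.
print(cell_size(100, 4, 50, 28, 0.25))            # weighted bytes per full token
print(full_tokens_for_budget(7_500_000, 100, 4, 50, 28, 0.25))
```

Because the weighted divisor already accounts for the SWA pool, capping `full_tokens` is a single clamp rather than re-deriving a byte budget and converting back.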