Add vLLM-based runtime statistics for subblock latency measurement#1358
Conversation
Signed-off-by: Grzegorz Karch <gkarch@nvidia.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds vLLM-backed subblock runtime benchmarking and wiring into Puzzletron (configs, runtime, and MIP), modernizes subblock helpers' types/docs, adds model export/vLLM runner utilities, includes a GPU integration test, and documents runtime-based latency optimization. ChangesRuntime-Based Latency Optimization for NAS
Sequence Diagram(s)sequenceDiagram
participant CalcRuntime as calc_runtime_for_subblocks
participant Builder as create_benchmark_model
participant Export as save_model_as_anymodel
participant VLLM as run_vllm_latency_benchmark
CalcRuntime->>Builder: build model with repeated subblock config
Builder->>Export: export model + tokenizer to temp dir
Export->>VLLM: invoke `vllm bench latency` subprocess
VLLM-->>CalcRuntime: return avg_latency_ms from JSON
CalcRuntime->>CalcRuntime: normalize vs baseline and return runtimes
Estimated code review effort: Suggested reviewers:
Caution Pre-merge checks failedPlease resolve all errors before merging. Addressing warnings is optional.
❌ Failed checks (1 error)
✅ Passed checks (5 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #1358 +/- ##
==========================================
- Coverage 77.51% 76.74% -0.77%
==========================================
Files 489 493 +4
Lines 54498 54687 +189
==========================================
- Hits 42242 41971 -271
- Misses 12256 12716 +460
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
Signed-off-by: Grzegorz Karch <gkarch@nvidia.com>
Signed-off-by: Grzegorz Karch <gkarch@nvidia.com>
Signed-off-by: Grzegorz Karch <gkarch@nvidia.com>
Signed-off-by: Grzegorz Karch <gkarch@nvidia.com>
Signed-off-by: Grzegorz Karch <gkarch@nvidia.com>
Signed-off-by: Grzegorz Karch <gkarch@nvidia.com>
Signed-off-by: Grzegorz Karch <gkarch@nvidia.com>
Signed-off-by: Grzegorz Karch <gkarch@nvidia.com>
There was a problem hiding this comment.
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
examples/puzzletron/main.py (1)
141-141:⚠️ Potential issue | 🟠 Major | ⚡ Quick win
--mip-onlyignoresdist_timeout_minutesoverride.
run_full_puzzletron()honors config-driven timeout, butrun_mip_only()is still hardcoded to 10 minutes. That breaks the documented behavior and can cause distributed init failures in long-startup environments.Suggested fix
+def _resolve_dist_timeout(hydra_config_path: str) -> timedelta: + from omegaconf import OmegaConf + + cfg = OmegaConf.load(str(Path(hydra_config_path).resolve())) + return timedelta(minutes=cfg.dist_timeout_minutes) if hasattr(cfg, "dist_timeout_minutes") else timedelta(minutes=10) + def run_full_puzzletron(hydra_config_path: str): @@ - from omegaconf import OmegaConf - - # Resolve absolute path for Hydra config - hydra_config_path = Path(hydra_config_path).resolve() - hydra_config = OmegaConf.load(str(hydra_config_path)) - - # Default timeout: 10 minutes, or extended to dist_timeout_minutes if set in config - if hasattr(hydra_config, "dist_timeout_minutes"): - timeout_minutes = timedelta(minutes=hydra_config.dist_timeout_minutes) - else: - timeout_minutes = timedelta(minutes=10) + timeout_minutes = _resolve_dist_timeout(hydra_config_path) mtpz.tools.mprint(f"Puzzletron Progress 1/8: Timeout minutes: {timeout_minutes}") dist.setup(timeout=timeout_minutes) @@ def run_mip_only(hydra_config_path: str): @@ - dist.setup(timeout=timedelta(minutes=10)) + dist.setup(timeout=_resolve_dist_timeout(hydra_config_path))🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/puzzletron/main.py` at line 141, run_mip_only currently hardcodes the distributed setup timeout to timedelta(minutes=10) while run_full_puzzletron uses the config value dist_timeout_minutes; change run_mip_only to read and use the same config-driven timeout (dist_timeout_minutes) when calling dist.setup() so the --mip-only path honors the override (update references to dist.setup(...) in run_mip_only to construct the timeout from dist_timeout_minutes rather than using a fixed 10-minute timedelta).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@examples/puzzletron/configs/llama-3_1-8B_pruneffn_runtime/Llama-3_1-8B.yaml`:
- Line 100: Replace the incorrect config key name `nccl_timeout_minutes` with
`dist_timeout_minutes` so the runtime reads the timeout (the pipeline code
checks for `dist_timeout_minutes` and defaults to 10 minutes otherwise); update
the YAML key to `dist_timeout_minutes: ${timedelta_minutes:10}` so the value is
picked up by the logic in main.py that reads `dist_timeout_minutes`.
In
`@examples/puzzletron/configs/llama-3_1-8B_pruneffn_runtime/validate_solutions_defaults.yaml`:
- Around line 5-10: The YAML keys intended as children of solutions_to_validate
are currently top-level; edit the block so skip_validation, save_models,
bigger_is_better, sort_solutions_by, and calculate_full_score_ablations are
indented under solutions_to_validate (use consistent indentation, e.g., two
spaces) so the config shape is correct and consumers reading
solutions_to_validate.{skip_validation,save_models,bigger_is_better,sort_solutions_by,calculate_full_score_ablations}
get the expected nested values.
In `@examples/puzzletron/README.md`:
- Line 14: The in-page fragment reference '`#attention-pruning-kv-head-reduction`'
in the Note line is broken; either replace that fragment with the actual heading
anchor present in this README (match the exact heading text for the "Attention
Pruning" section) or add a heading whose slug matches
'attention-pruning-kv-head-reduction' so the link resolves; update the text near
the configs reference where the fragment appears and verify the target heading
name (or add a new H2/H3 titled "Attention Pruning — KV-head reduction") so the
anchor and link match exactly.
In `@modelopt/torch/nas/subblock_stats/runtime_vllm.py`:
- Around line 44-50: The subprocess invocation in runtime_vllm.py is fragile:
avoid mutating os.environ and running subprocess.run(cmd) without timeout or
error checking; instead create a local env dict (copy os.environ and set
"VLLM_ENABLE_V1_MULTIPROCESSING" = "0") and pass it to subprocess.run, call
subprocess.run(cmd, env=env, check=True, capture_output=True, text=True,
timeout=some_reasonable_seconds) so failures raise CalledProcessError or
TimeoutExpired instead of hanging silently, and only open output_json_path after
a successful run; catch and handle subprocess.CalledProcessError and
subprocess.TimeoutExpired to log stderr/stdout (from the completed process) and
re-raise or return a clear error so downstream json.load doesn't attempt to
parse missing/partial output.
- Around line 23-35: The benchmark currently hardcodes the CLI flag
"--batch-size" to "1" in runtime_vllm.py which ignores RuntimeConfig.batch_size;
update the argument construction (where the list includes "--batch-size","1") to
use str(runtime_config.batch_size) instead, ensuring you cast the configured
integer to a string and validate it's >0 (or fall back to "1") before inserting;
keep the rest of the command-building logic the same so the runtime actually
measures the configured batch size.
In `@modelopt/torch/puzzletron/subblock_stats/calc_subblock_stats.py`:
- Around line 130-137: The call to calc_runtime_for_subblocks hardcodes
num_key_value_heads=8 which will misrepresent models with different KV head
counts; change the argument to pass the actual KV-head count from the
model/config (e.g., use the existing variable that represents KV heads such as
n_kv, num_kv_heads, or derive it from the model config) instead of 8, and if
that variable does not exist in this scope add/propagate a parameter (or compute
it from n_head and model-specific kv ratio) so calc_runtime_for_subblocks
receives the correct num_key_value_heads value.
---
Outside diff comments:
In `@examples/puzzletron/main.py`:
- Line 141: run_mip_only currently hardcodes the distributed setup timeout to
timedelta(minutes=10) while run_full_puzzletron uses the config value
dist_timeout_minutes; change run_mip_only to read and use the same config-driven
timeout (dist_timeout_minutes) when calling dist.setup() so the --mip-only path
honors the override (update references to dist.setup(...) in run_mip_only to
construct the timeout from dist_timeout_minutes rather than using a fixed
10-minute timedelta).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 9d164286-9a12-4cc6-b029-2f48f9feb22c
📒 Files selected for processing (17)
examples/puzzletron/README.mdexamples/puzzletron/configs/llama-3_1-8B_pruneffn_runtime/Llama-3_1-8B.yamlexamples/puzzletron/configs/llama-3_1-8B_pruneffn_runtime/llama-3_1-8B_pruneattn_runtime.yamlexamples/puzzletron/configs/llama-3_1-8B_pruneffn_runtime/pruning/attn_pruning.yamlexamples/puzzletron/configs/llama-3_1-8B_pruneffn_runtime/pruning/ffn_pruning.yamlexamples/puzzletron/configs/llama-3_1-8B_pruneffn_runtime/pruning/hidden_dim_pruning.yamlexamples/puzzletron/configs/llama-3_1-8B_pruneffn_runtime/pruning/pruning_defaults.yamlexamples/puzzletron/configs/llama-3_1-8B_pruneffn_runtime/validate_model_defaults.yamlexamples/puzzletron/configs/llama-3_1-8B_pruneffn_runtime/validate_solutions_defaults.yamlexamples/puzzletron/main.pymodelopt/torch/nas/subblock_stats/__init__.pymodelopt/torch/nas/subblock_stats/calc_runtime_stats.pymodelopt/torch/nas/subblock_stats/calc_subblock_params_and_memory.pymodelopt/torch/nas/subblock_stats/runtime_utils.pymodelopt/torch/nas/subblock_stats/runtime_vllm.pymodelopt/torch/puzzletron/subblock_stats/__init__.pymodelopt/torch/puzzletron/subblock_stats/calc_subblock_stats.py
💤 Files with no reviewable changes (1)
- modelopt/torch/puzzletron/subblock_stats/init.py
Signed-off-by: Grzegorz Karch <gkarch@nvidia.com>
There was a problem hiding this comment.
♻️ Duplicate comments (2)
modelopt/torch/nas/subblock_stats/runtime_vllm.py (2)
24-29:⚠️ Potential issue | 🟠 Major | ⚡ Quick winUse configured batch size for benchmark arguments.
Line 24 and Line 29 hardcode
1, so runtime measurements can ignore the requested workload inRuntimeConfig.batch_size.Suggested fix
- args_ns.batch_size = 1 + batch_size = max(1, int(runtime_config.batch_size)) + args_ns.batch_size = batch_size @@ - args_ns.max_num_seqs = 1 + args_ns.max_num_seqs = batch_size🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@modelopt/torch/nas/subblock_stats/runtime_vllm.py` around lines 24 - 29, The benchmark currently hardcodes args_ns.batch_size = 1 and args_ns.max_num_seqs = 1, ignoring the configured RuntimeConfig.batch_size; update the assignment to use runtime_config.batch_size (and ensure args_ns.max_num_seqs is set appropriately, e.g., to runtime_config.batch_size or computed from it) so that the variables args_ns.batch_size and args_ns.max_num_seqs reflect RuntimeConfig.batch_size when preparing runtime arguments.
39-40:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAvoid leaking process-wide environment changes.
Line 39 mutates
os.environglobally and never restores it. That can affect later benchmark calls in the same process.Suggested fix
- os.environ["VLLM_ENABLE_V1_MULTIPROCESSING"] = "0" - vllm_latency_main(args_ns) + prev = os.environ.get("VLLM_ENABLE_V1_MULTIPROCESSING") + os.environ["VLLM_ENABLE_V1_MULTIPROCESSING"] = "0" + try: + vllm_latency_main(args_ns) + finally: + if prev is None: + os.environ.pop("VLLM_ENABLE_V1_MULTIPROCESSING", None) + else: + os.environ["VLLM_ENABLE_V1_MULTIPROCESSING"] = prev🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@modelopt/torch/nas/subblock_stats/runtime_vllm.py` around lines 39 - 40, The code currently sets os.environ["VLLM_ENABLE_V1_MULTIPROCESSING"] = "0" before calling vllm_latency_main(args_ns) and never restores it, leaking a process-wide env change; wrap the mutation in a safe restore pattern (save the previous value or presence, set the env var, call vllm_latency_main, then restore the original value or delete the key) using try/finally (or a small context manager) so that runtime_vllm.py does not leave VLLM_ENABLE_V1_MULTIPROCESSING changed for subsequent benchmarks.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@modelopt/torch/nas/subblock_stats/runtime_vllm.py`:
- Around line 24-29: The benchmark currently hardcodes args_ns.batch_size = 1
and args_ns.max_num_seqs = 1, ignoring the configured RuntimeConfig.batch_size;
update the assignment to use runtime_config.batch_size (and ensure
args_ns.max_num_seqs is set appropriately, e.g., to runtime_config.batch_size or
computed from it) so that the variables args_ns.batch_size and
args_ns.max_num_seqs reflect RuntimeConfig.batch_size when preparing runtime
arguments.
- Around line 39-40: The code currently sets
os.environ["VLLM_ENABLE_V1_MULTIPROCESSING"] = "0" before calling
vllm_latency_main(args_ns) and never restores it, leaking a process-wide env
change; wrap the mutation in a safe restore pattern (save the previous value or
presence, set the env var, call vllm_latency_main, then restore the original
value or delete the key) using try/finally (or a small context manager) so that
runtime_vllm.py does not leave VLLM_ENABLE_V1_MULTIPROCESSING changed for
subsequent benchmarks.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 93e0ea08-ca66-4466-bc13-5f7509a10f2c
📒 Files selected for processing (1)
modelopt/torch/nas/subblock_stats/runtime_vllm.py
|
@grzegorz-k-karch can you address coderabbit / claude comments and mark resolved. Also vllm import needs to be guarded else CI fails as its not a required dependency |
|
/claude review |
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
d9dff48 to
105c736
Compare
|
/ok to test 105c736 |
What does this PR do?
Type of change: ?
Usage
# Add a code snippet demonstrating how to use thisTesting
Before your PR is "Ready for review"
Make sure you read and follow Contributor guidelines and your commits are signed (
git commit -s -S).Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded
trust_remote_code=True,torch.load(..., weights_only=False),pickle, etc.).CONTRIBUTING.md: ✅ / ❌ / N/AAdditional Information
Summary by CodeRabbit
New Features
Configuration
Documentation
Tests