Add vLLM-based runtime statistics for subblock latency measurement #1358

Merged
kevalmorabia97 merged 43 commits into main from gkarch/runtime_opt
Jun 8, 2026

Conversation

@grzegorz-k-karch

@grzegorz-k-karch grzegorz-k-karch commented Apr 28, 2026

Contributor

What does this PR do?

Type of change: ?

Usage

# Add a code snippet demonstrating how to use this

Testing

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅ / ❌ / N/A
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅ / ❌ / N/A
  • Did you write any new necessary tests?: ✅ / ❌ / N/A
  • Did you update Changelog?: ✅ / ❌ / N/A

Additional Information

Summary by CodeRabbit

  • New Features

    • Runtime-based latency optimization: collect vLLM-measured inference latency to constrain optimization.
  • Configuration

    • New runtime config/template for Llama-3.1-8B pruning (runtime stats enabled, NCCL timeout templating, MIP target-latency).
    • Validation sample defaults adjusted (one flow: 128 → 8; runtime flow uses 128).
    • Human constraint key renamed to target_latency_seconds.
  • Documentation

    • README section describing runtime-based latency optimization setup and usage.
  • Tests

    • Added GPU end-to-end test for runtime stats collection.

Review Change Stack

Signed-off-by: Grzegorz Karch <gkarch@nvidia.com>
@copy-pr-bot

copy-pr-bot Bot commented Apr 28, 2026


This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Apr 28, 2026

Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Adds vLLM-backed subblock runtime benchmarking and wiring into Puzzletron (configs, runtime, and MIP), modernizes subblock helpers' types/docs, adds model export/vLLM runner utilities, includes a GPU integration test, and documents runtime-based latency optimization.

Changes

Runtime-Based Latency Optimization for NAS

| Layer / File(s) | Summary |
| --- | --- |
| Documentation and distributed timeout: `examples/puzzletron/README.md`, `examples/puzzletron/main.py`, `examples/puzzletron/configs/llama-3_1-8B_pruneffn_memory/*` | Add runtime-latency README content; compute distributed dist.setup timeout from Hydra nccl_timeout_minutes instead of hardcoding 10 minutes; update validation dataset default and eval sample counts. |
| Puzzletron runtime configs and constraints: `examples/puzzletron/configs/llama-3_1-8B_pruneffn_runtime/*` | Add runtime-focused Puzzletron configs that enable calc_subblock_stats.runtime_stats, template NCCL timeout, wire scoring/realize/mip stages, set target_latency_seconds, and define FFN candidate list. |
| Package export switch: `modelopt/torch/puzzletron/subblock_stats/__init__.py` | Switch package exports to re-export calc_subblock_stats symbols. |
| calc_subblock_stats refactor & gating: `modelopt/torch/puzzletron/subblock_stats/calc_subblock_stats.py` | Replace benchmark_iterations with runtime_stats_enabled flag; conditionally compute runtime via calc_runtime_for_subblocks; thread flag through launch/sweep; restrict runtime collection to BF16; remove int8 runtime-scaling helpers. |
| Runtime measurement infra: `modelopt/torch/puzzletron/subblock_stats/calc_runtime_stats.py` | Add cached vLLM-backed benchmarking helpers to build small repeated-block Llama models, run latency benchmarks, and compute per-subblock and no-block runtimes. |
| Model export & vLLM wrapper: `modelopt/torch/puzzletron/subblock_stats/runtime_utils.py`, `modelopt/torch/puzzletron/subblock_stats/runtime_vllm.py` | Add RuntimeConfig, model export/save helpers (HF/AnyModel with tokenizer copy), and run_vllm_latency_benchmark to invoke vLLM CLI and parse JSON latency. |
| Params/memory helpers: annotations & docs: `modelopt/torch/puzzletron/subblock_stats/calc_subblock_params_and_memory.py` | Modernize descriptor parameter annotations to type[ModelDescriptor], reorder `__all__`, and expand docstrings for memory/parameter and MoE helpers. |
| GPU integration test: `tests/gpu/torch/puzzletron/test_calc_runtime_stats.py` | Add GPU-gated pytest exercising calc_runtime_for_subblocks with a minimal tokenizer and subblock set, verifying coverage, zero runtime for no-op configs, finite runtimes for others, and positive no-block overhead. |
| MIP human-constraint key update: `modelopt/torch/puzzletron/mip/run_puzzle.py` | Replace human constraint key target_latency with target_latency_seconds and read it when converting human constraints to MIP runtime constraints. |

Sequence Diagram(s)

sequenceDiagram
  participant CalcRuntime as calc_runtime_for_subblocks
  participant Builder as create_benchmark_model
  participant Export as save_model_as_anymodel
  participant VLLM as run_vllm_latency_benchmark
  CalcRuntime->>Builder: build model with repeated subblock config
  Builder->>Export: export model + tokenizer to temp dir
  Export->>VLLM: invoke `vllm bench latency` subprocess
  VLLM-->>CalcRuntime: return avg_latency_ms from JSON
  CalcRuntime->>CalcRuntime: normalize vs baseline and return runtimes
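
The last step of the diagram ("normalize vs baseline") is not spelled out in the walkthrough. One plausible reading, given that the benchmark model repeats a subblock several times and a separate no-block overhead is measured, is sketched below. The function name, the argument names, and the formula itself are assumptions for illustration and are not taken from calc_runtime_stats.py.

```python
def per_subblock_runtime_ms(
    total_latency_ms: float,
    no_block_latency_ms: float,
    num_repeats: int,
) -> float:
    """Estimate one subblock's latency from a repeated-block benchmark.

    Assumption: the benchmark model stacks `num_repeats` copies of the same
    subblock, and `no_block_latency_ms` is the latency of the same model
    with no such subblocks (embeddings, head, framework overhead).
    """
    if num_repeats <= 0:
        raise ValueError("num_repeats must be positive")
    # Subtract the shared overhead, then split the remainder evenly.
    delta_ms = total_latency_ms - no_block_latency_ms
    # Measurement noise can push the difference slightly below zero.
    return max(0.0, delta_ms / num_repeats)


if __name__ == "__main__":
    print(per_subblock_runtime_ms(50.0, 10.0, 8))
```

Under this reading, a no-op subblock whose benchmark latency equals the baseline gets a runtime of zero, which matches the property the GPU test is described as checking.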

Estimated code review effort:
🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers:

  • kevalmorabia97
  • meenchen

Caution

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Security Anti-Patterns | ❌ Error | runtime_vllm.py uses # nosec comments (lines 30, 88) which violates SECURITY.md policy: # nosec is not allowed for bypassing Bandit checks without codeowner review. | Remove nosec comments and request @NVIDIA/modelopt-setup-codeowners review with PR justification for subprocess usage. |

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately describes the main change: adding vLLM-based runtime statistics for subblock latency measurement, which is the core objective of the PR. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 86.79% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

Comment @coderabbitai help to get the list of available commands and usage tips.

@grzegorz-k-karch grzegorz-k-karch self-assigned this Apr 28, 2026
@github-actions

github-actions Bot commented Apr 28, 2026

Contributor
PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-06-08 19:23 UTC

@codecov

codecov Bot commented Apr 28, 2026


Codecov Report

❌ Patch coverage is 29.11392% with 168 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.74%. Comparing base (01415c2) to head (105c736).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| modelopt/torch/puzzletron/utils/vllm_adapter.py | 10.58% | 76 Missing ⚠️ |
| ...ch/puzzletron/subblock_stats/calc_runtime_stats.py | 29.33% | 53 Missing ⚠️ |
| ...pt/torch/puzzletron/subblock_stats/runtime_vllm.py | 25.92% | 20 Missing ⚠️ |
| ...t/torch/puzzletron/subblock_stats/runtime_utils.py | 62.85% | 13 Missing ⚠️ |
| ...h/puzzletron/subblock_stats/calc_subblock_stats.py | 70.00% | 3 Missing ⚠️ |
| .../subblock_stats/calc_subblock_params_and_memory.py | 33.33% | 2 Missing ⚠️ |
| modelopt/torch/puzzletron/mip/run_puzzle.py | 50.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1358      +/-   ##
==========================================
- Coverage   77.51%   76.74%   -0.77%     
==========================================
  Files         489      493       +4     
  Lines       54498    54687     +189     
==========================================
- Hits        42242    41971     -271     
- Misses      12256    12716     +460     
| Flag | Coverage Δ | |
| --- | --- | --- |
| examples | 42.64% <0.42%> (-0.28%) | ⬇️ |
| gpu | 58.30% <29.11%> (-0.75%) | ⬇️ |
| regression | 14.83% <0.00%> (+0.02%) | ⬆️ |
| unit | 53.86% <0.42%> (-0.20%) | ⬇️ |


@grzegorz-k-karch grzegorz-k-karch changed the title from "enabling runtime optimization" to "Enable runtime optimization" Apr 28, 2026
grzegorz-k-karch and others added 9 commits April 28, 2026 14:49
Signed-off-by: Grzegorz Karch <gkarch@nvidia.com>
Signed-off-by: Grzegorz Karch <gkarch@nvidia.com>
Signed-off-by: Grzegorz Karch <gkarch@nvidia.com>
Signed-off-by: Grzegorz Karch <gkarch@nvidia.com>
Signed-off-by: Grzegorz Karch <gkarch@nvidia.com>
Signed-off-by: Grzegorz Karch <gkarch@nvidia.com>
Signed-off-by: Grzegorz Karch <gkarch@nvidia.com>
Signed-off-by: Grzegorz Karch <gkarch@nvidia.com>
@grzegorz-k-karch grzegorz-k-karch marked this pull request as ready for review May 18, 2026 10:57
@grzegorz-k-karch grzegorz-k-karch requested review from a team as code owners May 18, 2026 10:57
@grzegorz-k-karch grzegorz-k-karch requested a review from realAsma May 18, 2026 10:57

@coderabbitai coderabbitai Bot left a comment

Contributor


Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
examples/puzzletron/main.py (1)

141-141: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

--mip-only ignores dist_timeout_minutes override.

run_full_puzzletron() honors config-driven timeout, but run_mip_only() is still hardcoded to 10 minutes. That breaks the documented behavior and can cause distributed init failures in long-startup environments.

Suggested fix
+def _resolve_dist_timeout(hydra_config_path: str) -> timedelta:
+    from omegaconf import OmegaConf
+
+    cfg = OmegaConf.load(str(Path(hydra_config_path).resolve()))
+    return timedelta(minutes=cfg.dist_timeout_minutes) if hasattr(cfg, "dist_timeout_minutes") else timedelta(minutes=10)
+
 def run_full_puzzletron(hydra_config_path: str):
@@
-    from omegaconf import OmegaConf
-
-    # Resolve absolute path for Hydra config
-    hydra_config_path = Path(hydra_config_path).resolve()
-    hydra_config = OmegaConf.load(str(hydra_config_path))
-
-    # Default timeout: 10 minutes, or extended to dist_timeout_minutes if set in config
-    if hasattr(hydra_config, "dist_timeout_minutes"):
-        timeout_minutes = timedelta(minutes=hydra_config.dist_timeout_minutes)
-    else:
-        timeout_minutes = timedelta(minutes=10)
+    timeout_minutes = _resolve_dist_timeout(hydra_config_path)
     mtpz.tools.mprint(f"Puzzletron Progress 1/8: Timeout minutes: {timeout_minutes}")
     dist.setup(timeout=timeout_minutes)
@@
 def run_mip_only(hydra_config_path: str):
@@
-    dist.setup(timeout=timedelta(minutes=10))
+    dist.setup(timeout=_resolve_dist_timeout(hydra_config_path))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/puzzletron/main.py` at line 141, run_mip_only currently hardcodes
the distributed setup timeout to timedelta(minutes=10) while run_full_puzzletron
uses the config value dist_timeout_minutes; change run_mip_only to read and use
the same config-driven timeout (dist_timeout_minutes) when calling dist.setup()
so the --mip-only path honors the override (update references to dist.setup(...)
in run_mip_only to construct the timeout from dist_timeout_minutes rather than
using a fixed 10-minute timedelta).
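
A minimal standalone sketch of the shared timeout lookup, so that both entry points resolve the value the same way. It uses a plain mapping in place of the loaded Hydra/OmegaConf config to stay self-contained; the helper name and the `DEFAULT_TIMEOUT_MINUTES` constant are hypothetical, while the `dist_timeout_minutes` key and the 10-minute default come from the comment above.

```python
from collections.abc import Mapping
from datetime import timedelta

# Fallback named in the review comment; the constant name is illustrative.
DEFAULT_TIMEOUT_MINUTES = 10


def resolve_dist_timeout(cfg: Mapping) -> timedelta:
    """Return the distributed-setup timeout, honoring a config override."""
    minutes = cfg.get("dist_timeout_minutes", DEFAULT_TIMEOUT_MINUTES)
    return timedelta(minutes=minutes)


if __name__ == "__main__":
    # Both the full pipeline and the --mip-only path would call the same helper.
    print(resolve_dist_timeout({}))
    print(resolve_dist_timeout({"dist_timeout_minutes": 45}))
```

Keeping the lookup in one helper means a later change to the key or the default cannot leave one code path behind.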
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/puzzletron/configs/llama-3_1-8B_pruneffn_runtime/Llama-3_1-8B.yaml`:
- Line 100: Replace the incorrect config key name `nccl_timeout_minutes` with
`dist_timeout_minutes` so the runtime reads the timeout (the pipeline code
checks for `dist_timeout_minutes` and defaults to 10 minutes otherwise); update
the YAML key to `dist_timeout_minutes: ${timedelta_minutes:10}` so the value is
picked up by the logic in main.py that reads `dist_timeout_minutes`.

In
`@examples/puzzletron/configs/llama-3_1-8B_pruneffn_runtime/validate_solutions_defaults.yaml`:
- Around line 5-10: The YAML keys intended as children of solutions_to_validate
are currently top-level; edit the block so skip_validation, save_models,
bigger_is_better, sort_solutions_by, and calculate_full_score_ablations are
indented under solutions_to_validate (use consistent indentation, e.g., two
spaces) so the config shape is correct and consumers reading
solutions_to_validate.{skip_validation,save_models,bigger_is_better,sort_solutions_by,calculate_full_score_ablations}
get the expected nested values.

In `@examples/puzzletron/README.md`:
- Line 14: The in-page fragment reference '`#attention-pruning-kv-head-reduction`'
in the Note line is broken; either replace that fragment with the actual heading
anchor present in this README (match the exact heading text for the "Attention
Pruning" section) or add a heading whose slug matches
'attention-pruning-kv-head-reduction' so the link resolves; update the text near
the configs reference where the fragment appears and verify the target heading
name (or add a new H2/H3 titled "Attention Pruning — KV-head reduction") so the
anchor and link match exactly.

In `@modelopt/torch/nas/subblock_stats/runtime_vllm.py`:
- Around line 44-50: The subprocess invocation in runtime_vllm.py is fragile:
avoid mutating os.environ and running subprocess.run(cmd) without timeout or
error checking; instead create a local env dict (copy os.environ and set
"VLLM_ENABLE_V1_MULTIPROCESSING" = "0") and pass it to subprocess.run, call
subprocess.run(cmd, env=env, check=True, capture_output=True, text=True,
timeout=some_reasonable_seconds) so failures raise CalledProcessError or
TimeoutExpired instead of hanging silently, and only open output_json_path after
a successful run; catch and handle subprocess.CalledProcessError and
subprocess.TimeoutExpired to log stderr/stdout (from the completed process) and
re-raise or return a clear error so downstream json.load doesn't attempt to
parse missing/partial output.
- Around line 23-35: The benchmark currently hardcodes the CLI flag
"--batch-size" to "1" in runtime_vllm.py which ignores RuntimeConfig.batch_size;
update the argument construction (where the list includes "--batch-size","1") to
use str(runtime_config.batch_size) instead, ensuring you cast the configured
integer to a string and validate it's >0 (or fall back to "1") before inserting;
keep the rest of the command-building logic the same so the runtime actually
measures the configured batch size.
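
The subprocess hardening described in the first item can be sketched roughly as follows. This is an illustration under assumptions, not the code in runtime_vllm.py: the function name, its parameters, and the default timeout are hypothetical, and the command is passed in by the caller rather than built here.

```python
import json
import os
import subprocess


def run_and_load_json(cmd, output_json_path, timeout_s=600.0, extra_env=None):
    """Run a benchmark command and parse the JSON file it writes.

    The environment override is applied to a copy, so os.environ in the
    calling process is never mutated.
    """
    env = os.environ.copy()
    env.update(extra_env or {})
    try:
        subprocess.run(
            cmd,
            env=env,
            check=True,           # non-zero exit raises CalledProcessError
            capture_output=True,  # keep stdout/stderr for diagnostics
            text=True,
            timeout=timeout_s,    # a hung benchmark raises TimeoutExpired
        )
    except subprocess.CalledProcessError as err:
        raise RuntimeError(
            f"benchmark exited with code {err.returncode}: {err.stderr}"
        ) from err
    except subprocess.TimeoutExpired as err:
        raise RuntimeError(f"benchmark timed out after {timeout_s}s") from err
    # Only read the output once the process is known to have succeeded.
    with open(output_json_path) as f:
        return json.load(f)
```

A caller would pass something like `extra_env={"VLLM_ENABLE_V1_MULTIPROCESSING": "0"}` so the setting applies to the child process only.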

In `@modelopt/torch/puzzletron/subblock_stats/calc_subblock_stats.py`:
- Around line 130-137: The call to calc_runtime_for_subblocks hardcodes
num_key_value_heads=8 which will misrepresent models with different KV head
counts; change the argument to pass the actual KV-head count from the
model/config (e.g., use the existing variable that represents KV heads such as
n_kv, num_kv_heads, or derive it from the model config) instead of 8, and if
that variable does not exist in this scope add/propagate a parameter (or compute
it from n_head and model-specific kv ratio) so calc_runtime_for_subblocks
receives the correct num_key_value_heads value.
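
One way to avoid the hardcoded value is to read the KV-head count from the model config. The sketch below assumes a Hugging Face style config object exposing `num_key_value_heads` and `num_attention_heads`; the helper name is hypothetical, and how the config reaches this call site in calc_subblock_stats.py is not shown in the review.

```python
from types import SimpleNamespace


def resolve_num_key_value_heads(config) -> int:
    """Return the KV-head count for a config, without assuming 8.

    Falls back to num_attention_heads when num_key_value_heads is absent
    or None, which corresponds to plain multi-head attention.
    """
    num_kv = getattr(config, "num_key_value_heads", None)
    if num_kv is None:
        num_kv = config.num_attention_heads
    return int(num_kv)


if __name__ == "__main__":
    # Stand-in configs for illustration only.
    gqa_cfg = SimpleNamespace(num_attention_heads=32, num_key_value_heads=8)
    mha_cfg = SimpleNamespace(num_attention_heads=16)
    print(resolve_num_key_value_heads(gqa_cfg), resolve_num_key_value_heads(mha_cfg))
```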


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 9d164286-9a12-4cc6-b029-2f48f9feb22c

📥 Commits

Reviewing files that changed from the base of the PR and between 9d2e608 and ab925b9.

📒 Files selected for processing (17)
  • examples/puzzletron/README.md
  • examples/puzzletron/configs/llama-3_1-8B_pruneffn_runtime/Llama-3_1-8B.yaml
  • examples/puzzletron/configs/llama-3_1-8B_pruneffn_runtime/llama-3_1-8B_pruneattn_runtime.yaml
  • examples/puzzletron/configs/llama-3_1-8B_pruneffn_runtime/pruning/attn_pruning.yaml
  • examples/puzzletron/configs/llama-3_1-8B_pruneffn_runtime/pruning/ffn_pruning.yaml
  • examples/puzzletron/configs/llama-3_1-8B_pruneffn_runtime/pruning/hidden_dim_pruning.yaml
  • examples/puzzletron/configs/llama-3_1-8B_pruneffn_runtime/pruning/pruning_defaults.yaml
  • examples/puzzletron/configs/llama-3_1-8B_pruneffn_runtime/validate_model_defaults.yaml
  • examples/puzzletron/configs/llama-3_1-8B_pruneffn_runtime/validate_solutions_defaults.yaml
  • examples/puzzletron/main.py
  • modelopt/torch/nas/subblock_stats/__init__.py
  • modelopt/torch/nas/subblock_stats/calc_runtime_stats.py
  • modelopt/torch/nas/subblock_stats/calc_subblock_params_and_memory.py
  • modelopt/torch/nas/subblock_stats/runtime_utils.py
  • modelopt/torch/nas/subblock_stats/runtime_vllm.py
  • modelopt/torch/puzzletron/subblock_stats/__init__.py
  • modelopt/torch/puzzletron/subblock_stats/calc_subblock_stats.py
💤 Files with no reviewable changes (1)
  • modelopt/torch/puzzletron/subblock_stats/__init__.py

Comment thread examples/puzzletron/configs/llama-3_1-8B_pruneffn_runtime/Llama-3_1-8B.yaml Outdated
Comment thread examples/puzzletron/README.md Outdated
Comment thread modelopt/torch/nas/subblock_stats/runtime_vllm.py Outdated
Comment thread modelopt/torch/nas/subblock_stats/runtime_vllm.py Outdated
Comment thread modelopt/torch/puzzletron/subblock_stats/calc_subblock_stats.py Outdated
Signed-off-by: Grzegorz Karch <gkarch@nvidia.com>

@coderabbitai coderabbitai Bot left a comment

Contributor


♻️ Duplicate comments (2)
modelopt/torch/nas/subblock_stats/runtime_vllm.py (2)

24-29: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use configured batch size for benchmark arguments.

Line 24 and Line 29 hardcode 1, so runtime measurements can ignore the requested workload in RuntimeConfig.batch_size.

Suggested fix
-    args_ns.batch_size = 1
+    batch_size = max(1, int(runtime_config.batch_size))
+    args_ns.batch_size = batch_size
@@
-    args_ns.max_num_seqs = 1
+    args_ns.max_num_seqs = batch_size
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/torch/nas/subblock_stats/runtime_vllm.py` around lines 24 - 29, The
benchmark currently hardcodes args_ns.batch_size = 1 and args_ns.max_num_seqs =
1, ignoring the configured RuntimeConfig.batch_size; update the assignment to
use runtime_config.batch_size (and ensure args_ns.max_num_seqs is set
appropriately, e.g., to runtime_config.batch_size or computed from it) so that
the variables args_ns.batch_size and args_ns.max_num_seqs reflect
RuntimeConfig.batch_size when preparing runtime arguments.

39-40: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid leaking process-wide environment changes.

Line 39 mutates os.environ globally and never restores it. That can affect later benchmark calls in the same process.

Suggested fix
-    os.environ["VLLM_ENABLE_V1_MULTIPROCESSING"] = "0"
-    vllm_latency_main(args_ns)
+    prev = os.environ.get("VLLM_ENABLE_V1_MULTIPROCESSING")
+    os.environ["VLLM_ENABLE_V1_MULTIPROCESSING"] = "0"
+    try:
+        vllm_latency_main(args_ns)
+    finally:
+        if prev is None:
+            os.environ.pop("VLLM_ENABLE_V1_MULTIPROCESSING", None)
+        else:
+            os.environ["VLLM_ENABLE_V1_MULTIPROCESSING"] = prev
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/torch/nas/subblock_stats/runtime_vllm.py` around lines 39 - 40, The
code currently sets os.environ["VLLM_ENABLE_V1_MULTIPROCESSING"] = "0" before
calling vllm_latency_main(args_ns) and never restores it, leaking a process-wide
env change; wrap the mutation in a safe restore pattern (save the previous value
or presence, set the env var, call vllm_latency_main, then restore the original
value or delete the key) using try/finally (or a small context manager) so that
runtime_vllm.py does not leave VLLM_ENABLE_V1_MULTIPROCESSING changed for
subsequent benchmarks.
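
The "small context manager" option mentioned above could look like the sketch below. The name `temporary_env` is hypothetical; the save-and-restore logic mirrors the suggested try/finally fix.

```python
import os
from contextlib import contextmanager


@contextmanager
def temporary_env(name: str, value: str):
    """Set an environment variable for the duration of a with-block."""
    previous = os.environ.get(name)  # None means the variable was unset
    os.environ[name] = value
    try:
        yield
    finally:
        # Restore the exact prior state, even if the body raised.
        if previous is None:
            os.environ.pop(name, None)
        else:
            os.environ[name] = previous
```

With this helper the call site would read `with temporary_env("VLLM_ENABLE_V1_MULTIPROCESSING", "0"): vllm_latency_main(args_ns)`, and later benchmarks in the same process see the original environment.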

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 93e0ea08-ca66-4466-bc13-5f7509a10f2c

📥 Commits

Reviewing files that changed from the base of the PR and between ab925b9 and 58f17e4.

📒 Files selected for processing (1)
  • modelopt/torch/nas/subblock_stats/runtime_vllm.py

@kevalmorabia97 kevalmorabia97 requested review from Separius, j-rausch and kevalmorabia97 and removed request for realAsma May 18, 2026 18:54
@kevalmorabia97

kevalmorabia97 commented May 18, 2026

Collaborator

@grzegorz-k-karch can you address the CodeRabbit / Claude comments and mark them resolved? Also, the vllm import needs to be guarded, otherwise CI fails, as it's not a required dependency.
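
A common way to guard an optional dependency is to check for it without importing it at module load time. The sketch below is one such pattern; the helper names are hypothetical, and the repository may already provide its own utility for this.

```python
import importlib.util


def is_package_available(name: str) -> bool:
    """Return True if a top-level package can be found, without importing it."""
    return importlib.util.find_spec(name) is not None


def require_vllm() -> None:
    """Fail with a clear message only when the vLLM code path is actually used."""
    if not is_package_available("vllm"):
        raise ImportError(
            "vLLM is required for runtime statistics; install it to enable this path."
        )
```

Modules that need vLLM would call `require_vllm()` inside the function that runs the benchmark and import vLLM there, so importing the package in a CI environment without vLLM does not fail.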

@kevalmorabia97

Collaborator

/claude review

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
@kevalmorabia97

Collaborator

/ok to test 105c736

@kevalmorabia97 kevalmorabia97 enabled auto-merge (squash) June 8, 2026 18:17
@kevalmorabia97 kevalmorabia97 merged commit b98a595 into main Jun 8, 2026
51 checks passed
@kevalmorabia97 kevalmorabia97 deleted the gkarch/runtime_opt branch June 8, 2026 19:22

4 participants