Add vLLM-based runtime statistics for subblock latency measurement by grzegorz-k-karch · Pull Request #1358 · NVIDIA/Model-Optimizer

grzegorz-k-karch · 2026-04-28T07:29:41Z

What does this PR do?

Type of change: ?

Usage

# Add a code snippet demonstrating how to use this

Testing

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

Is this change backward compatible?: ✅ / ❌ / N/A
If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅ / ❌ / N/A
Did you write any new necessary tests?: ✅ / ❌ / N/A
Did you update Changelog?: ✅ / ❌ / N/A

Additional Information

Summary by CodeRabbit

New Features
- Runtime-based latency optimization: collect vLLM-measured inference latency to constrain optimization.
Configuration
- New runtime config/template for Llama-3.1-8B pruning (runtime stats enabled, NCCL timeout templating, MIP target-latency).
- Validation sample defaults adjusted (one flow: 128 → 8; runtime flow uses 128).
- Human constraint key renamed to target_latency_seconds.
Documentation
- README section describing runtime-based latency optimization setup and usage.
Tests
- Added GPU end-to-end test for runtime stats collection.

Signed-off-by: Grzegorz Karch <gkarch@nvidia.com>

copy-pr-bot · 2026-04-28T07:29:45Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2026-04-28T07:29:49Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

📝 Walkthrough

Walkthrough

Adds vLLM-backed subblock runtime benchmarking and wiring into Puzzletron (configs, runtime, and MIP), modernizes subblock helpers' types/docs, adds model export/vLLM runner utilities, includes a GPU integration test, and documents runtime-based latency optimization.

Changes

Runtime-Based Latency Optimization for NAS

Layer / File(s)	Summary
Documentation and distributed timeout `examples/puzzletron/README.md`, `examples/puzzletron/main.py`, `examples/puzzletron/configs/llama-3_1-8B_pruneffn_memory/*`	Add runtime-latency README content; compute distributed `dist.setup` timeout from Hydra `nccl_timeout_minutes` instead of hardcoding 10 minutes; update validation dataset default and eval sample counts.
Puzzletron runtime configs and constraints `examples/puzzletron/configs/llama-3_1-8B_pruneffn_runtime/*`	Add runtime-focused Puzzletron configs that enable `calc_subblock_stats.runtime_stats`, template NCCL timeout, wire scoring/realize/mip stages, set `target_latency_seconds`, and define FFN candidate list.
Package export switch `modelopt/torch/puzzletron/subblock_stats/__init__.py`	Switch package exports to re-export `calc_subblock_stats` symbols.
calc_subblock_stats refactor & gating `modelopt/torch/puzzletron/subblock_stats/calc_subblock_stats.py`	Replace `benchmark_iterations` with `runtime_stats_enabled` flag; conditionally compute runtime via `calc_runtime_for_subblocks`; thread flag through launch/sweep; restrict runtime collection to BF16; remove int8 runtime-scaling helpers.
Runtime measurement infra `modelopt/torch/puzzletron/subblock_stats/calc_runtime_stats.py`	Add cached vLLM-backed benchmarking helpers to build small repeated-block Llama models, run latency benchmarks, and compute per-subblock and no-block runtimes.
Model export & vLLM wrapper `modelopt/torch/puzzletron/subblock_stats/runtime_utils.py`, `modelopt/torch/puzzletron/subblock_stats/runtime_vllm.py`	Add `RuntimeConfig`, model export/save helpers (HF/AnyModel with tokenizer copy), and `run_vllm_latency_benchmark` to invoke vLLM CLI and parse JSON latency.
Params/memory helpers: annotations & docs `modelopt/torch/puzzletron/subblock_stats/calc_subblock_params_and_memory.py`	Modernize `descriptor` parameter annotations to `type[ModelDescriptor]`, reorder `__all__`, and expand docstrings for memory/parameter and MoE helpers.
GPU integration test `tests/gpu/torch/puzzletron/test_calc_runtime_stats.py`	Add GPU-gated pytest exercising `calc_runtime_for_subblocks` with a minimal tokenizer and subblock set, verifying coverage, zero runtime for no-op configs, finite runtimes for others, and positive no-block overhead.
MIP human-constraint key update `modelopt/torch/puzzletron/mip/run_puzzle.py`	Replace human constraint key `target_latency` with `target_latency_seconds` and read it when converting human constraints to MIP runtime constraints.

Sequence Diagram(s)

sequenceDiagram
  participant CalcRuntime as calc_runtime_for_subblocks
  participant Builder as create_benchmark_model
  participant Export as save_model_as_anymodel
  participant VLLM as run_vllm_latency_benchmark
  CalcRuntime->>Builder: build model with repeated subblock config
  Builder->>Export: export model + tokenizer to temp dir
  Export->>VLLM: invoke `vllm bench latency` subprocess
  VLLM-->>CalcRuntime: return avg_latency_ms from JSON
  CalcRuntime->>CalcRuntime: normalize vs baseline and return runtimes
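
The last step of the diagram, normalizing against a baseline, can be illustrated with a minimal sketch. It assumes one plausible reading of the walkthrough: the benchmark model repeats a subblock `num_repeats` times, and a separate "no-block" run captures the fixed overhead. The function name, the clamping to zero, and the numbers are hypothetical and are not taken from `calc_runtime_stats.py`.

```python
import math


def per_subblock_runtime_ms(
    total_latency_ms: float, no_block_latency_ms: float, num_repeats: int
) -> float:
    """Estimate the latency of a single subblock (hypothetical helper).

    Assumption: total latency is roughly fixed overhead plus
    num_repeats * per-subblock latency.
    """
    if num_repeats <= 0:
        raise ValueError("num_repeats must be positive")
    # Subtract the fixed overhead measured with no blocks at all.
    delta = total_latency_ms - no_block_latency_ms
    # Clamp at zero so measurement noise never yields a negative runtime
    # (an illustrative choice, not confirmed by the PR).
    return max(0.0, delta) / num_repeats


# Made-up numbers: 4 ms overhead, 24 ms with 10 repeated subblocks.
runtime_ms = per_subblock_runtime_ms(24.0, 4.0, num_repeats=10)
assert math.isclose(runtime_ms, 2.0)
```

Under this reading, a no-op subblock adds nothing over the baseline and comes out at zero, which matches what the GPU test is described as checking.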

Estimated code review effort:
🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers:

kevalmorabia97
meenchen

Caution

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

Ignore

❌ Failed checks (1 error)

Check name	Status	Explanation	Resolution
Security Anti-Patterns	❌ Error	runtime_vllm.py uses # nosec comments (lines 30, 88) which violates SECURITY.md policy—# nosec is not allowed for bypassing Bandit checks without codeowner review.	Remove nosec comments and request `@NVIDIA/modelopt-setup-codeowners` review with PR justification for subprocess usage.

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately describes the main change: adding vLLM-based runtime statistics for subblock latency measurement, which is the core objective of the PR.
Docstring Coverage	✅ Passed	Docstring coverage is 86.79% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

Comment `@coderabbitai help` to get the list of available commands and usage tips.

github-actions · 2026-04-28T07:34:00Z

PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-06-08 19:23 UTC

codecov · 2026-04-28T07:43:14Z

Codecov Report

❌ Patch coverage is 29.11392% with 168 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.74%. Comparing base (01415c2) to head (105c736).

Files with missing lines	Patch %	Lines
modelopt/torch/puzzletron/utils/vllm_adapter.py	10.58%	76 Missing ⚠️
...ch/puzzletron/subblock_stats/calc_runtime_stats.py	29.33%	53 Missing ⚠️
...pt/torch/puzzletron/subblock_stats/runtime_vllm.py	25.92%	20 Missing ⚠️
...t/torch/puzzletron/subblock_stats/runtime_utils.py	62.85%	13 Missing ⚠️
...h/puzzletron/subblock_stats/calc_subblock_stats.py	70.00%	3 Missing ⚠️
.../subblock_stats/calc_subblock_params_and_memory.py	33.33%	2 Missing ⚠️
modelopt/torch/puzzletron/mip/run_puzzle.py	50.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1358      +/-   ##
==========================================
- Coverage   77.51%   76.74%   -0.77%     
==========================================
  Files         489      493       +4     
  Lines       54498    54687     +189     
==========================================
- Hits        42242    41971     -271     
- Misses      12256    12716     +460

Flag	Coverage Δ
examples	`42.64% <0.42%> (-0.28%)`	⬇️
gpu	`58.30% <29.11%> (-0.75%)`	⬇️
regression	`14.83% <0.00%> (+0.02%)`	⬆️
unit	`53.86% <0.42%> (-0.20%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Signed-off-by: Grzegorz Karch <gkarch@nvidia.com>

coderabbitai

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

examples/puzzletron/main.py (1)

141-141: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

--mip-only ignores dist_timeout_minutes override.

run_full_puzzletron() honors config-driven timeout, but run_mip_only() is still hardcoded to 10 minutes. That breaks the documented behavior and can cause distributed init failures in long-startup environments.

Suggested fix

+def _resolve_dist_timeout(hydra_config_path: str) -> timedelta:
+    from omegaconf import OmegaConf
+
+    cfg = OmegaConf.load(str(Path(hydra_config_path).resolve()))
+    return timedelta(minutes=cfg.dist_timeout_minutes) if hasattr(cfg, "dist_timeout_minutes") else timedelta(minutes=10)
+
 def run_full_puzzletron(hydra_config_path: str):
@@
-    from omegaconf import OmegaConf
-
-    # Resolve absolute path for Hydra config
-    hydra_config_path = Path(hydra_config_path).resolve()
-    hydra_config = OmegaConf.load(str(hydra_config_path))
-
-    # Default timeout: 10 minutes, or extended to dist_timeout_minutes if set in config
-    if hasattr(hydra_config, "dist_timeout_minutes"):
-        timeout_minutes = timedelta(minutes=hydra_config.dist_timeout_minutes)
-    else:
-        timeout_minutes = timedelta(minutes=10)
+    timeout_minutes = _resolve_dist_timeout(hydra_config_path)
     mtpz.tools.mprint(f"Puzzletron Progress 1/8: Timeout minutes: {timeout_minutes}")
     dist.setup(timeout=timeout_minutes)
@@
 def run_mip_only(hydra_config_path: str):
@@
-    dist.setup(timeout=timedelta(minutes=10))
+    dist.setup(timeout=_resolve_dist_timeout(hydra_config_path))

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/puzzletron/main.py` at line 141, run_mip_only currently hardcodes
the distributed setup timeout to timedelta(minutes=10) while run_full_puzzletron
uses the config value dist_timeout_minutes; change run_mip_only to read and use
the same config-driven timeout (dist_timeout_minutes) when calling dist.setup()
so the --mip-only path honors the override (update references to dist.setup(...)
in run_mip_only to construct the timeout from dist_timeout_minutes rather than
using a fixed 10-minute timedelta).
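
The shared-helper idea in the suggested fix can be sketched without Hydra or OmegaConf. This is a minimal illustration, assuming the config behaves like a mapping with an optional `dist_timeout_minutes` key and a 10-minute default as described above; `resolve_dist_timeout` is a hypothetical name and the plain dicts stand in for the loaded config.

```python
from collections.abc import Mapping
from datetime import timedelta

DEFAULT_TIMEOUT_MINUTES = 10  # default stated in the review comment


def resolve_dist_timeout(cfg: Mapping) -> timedelta:
    """Return the distributed-setup timeout, falling back to the default."""
    minutes = cfg.get("dist_timeout_minutes", DEFAULT_TIMEOUT_MINUTES)
    return timedelta(minutes=minutes)


# Both entry points would call the same helper, so --mip-only cannot drift.
full_run_timeout = resolve_dist_timeout({"dist_timeout_minutes": 30})
mip_only_timeout = resolve_dist_timeout({})
assert full_run_timeout == timedelta(minutes=30)
assert mip_only_timeout == timedelta(minutes=10)
```

Routing both `run_full_puzzletron` and `run_mip_only` through one resolver keeps the default in a single place.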

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/puzzletron/configs/llama-3_1-8B_pruneffn_runtime/Llama-3_1-8B.yaml`:
- Line 100: Replace the incorrect config key name `nccl_timeout_minutes` with
`dist_timeout_minutes` so the runtime reads the timeout (the pipeline code
checks for `dist_timeout_minutes` and defaults to 10 minutes otherwise); update
the YAML key to `dist_timeout_minutes: ${timedelta_minutes:10}` so the value is
picked up by the logic in main.py that reads `dist_timeout_minutes`.

In
`@examples/puzzletron/configs/llama-3_1-8B_pruneffn_runtime/validate_solutions_defaults.yaml`:
- Around line 5-10: The YAML keys intended as children of solutions_to_validate
are currently top-level; edit the block so skip_validation, save_models,
bigger_is_better, sort_solutions_by, and calculate_full_score_ablations are
indented under solutions_to_validate (use consistent indentation, e.g., two
spaces) so the config shape is correct and consumers reading
solutions_to_validate.{skip_validation,save_models,bigger_is_better,sort_solutions_by,calculate_full_score_ablations}
get the expected nested values.

In `@examples/puzzletron/README.md`:
- Line 14: The in-page fragment reference '`#attention-pruning-kv-head-reduction`'
in the Note line is broken; either replace that fragment with the actual heading
anchor present in this README (match the exact heading text for the "Attention
Pruning" section) or add a heading whose slug matches
'attention-pruning-kv-head-reduction' so the link resolves; update the text near
the configs reference where the fragment appears and verify the target heading
name (or add a new H2/H3 titled "Attention Pruning — KV-head reduction") so the
anchor and link match exactly.

In `@modelopt/torch/nas/subblock_stats/runtime_vllm.py`:
- Around line 44-50: The subprocess invocation in runtime_vllm.py is fragile:
avoid mutating os.environ and running subprocess.run(cmd) without timeout or
error checking; instead create a local env dict (copy os.environ and set
"VLLM_ENABLE_V1_MULTIPROCESSING" = "0") and pass it to subprocess.run, call
subprocess.run(cmd, env=env, check=True, capture_output=True, text=True,
timeout=some_reasonable_seconds) so failures raise CalledProcessError or
TimeoutExpired instead of hanging silently, and only open output_json_path after
a successful run; catch and handle subprocess.CalledProcessError and
subprocess.TimeoutExpired to log stderr/stdout (from the completed process) and
re-raise or return a clear error so downstream json.load doesn't attempt to
parse missing/partial output.
- Around line 23-35: The benchmark currently hardcodes the CLI flag
"--batch-size" to "1" in runtime_vllm.py which ignores RuntimeConfig.batch_size;
update the argument construction (where the list includes "--batch-size","1") to
use str(runtime_config.batch_size) instead, ensuring you cast the configured
integer to a string and validate it's >0 (or fall back to "1") before inserting;
keep the rest of the command-building logic the same so the runtime actually
measures the configured batch size.

In `@modelopt/torch/puzzletron/subblock_stats/calc_subblock_stats.py`:
- Around line 130-137: The call to calc_runtime_for_subblocks hardcodes
num_key_value_heads=8 which will misrepresent models with different KV head
counts; change the argument to pass the actual KV-head count from the
model/config (e.g., use the existing variable that represents KV heads such as
n_kv, num_kv_heads, or derive it from the model config) instead of 8, and if
that variable does not exist in this scope add/propagate a parameter (or compute
it from n_head and model-specific kv ratio) so calc_runtime_for_subblocks
receives the correct num_key_value_heads value.
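
The subprocess hardening requested for `runtime_vllm.py` (lines 44-50) can be sketched as follows. This is an illustration only: `run_and_load_json` is a hypothetical helper, the stand-in command is a tiny Python one-liner rather than the real vLLM CLI, and the environment variable name and timeout are made up.

```python
import json
import os
import subprocess
import sys
import tempfile
from pathlib import Path


def run_and_load_json(cmd, output_json_path, timeout_s=60.0, extra_env=None):
    """Run cmd with a private environment, then parse its JSON output file."""
    env = os.environ.copy()  # local copy, so os.environ is never mutated
    env.update(extra_env or {})
    try:
        subprocess.run(
            cmd, env=env, check=True, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.CalledProcessError as err:
        raise RuntimeError(
            f"benchmark command failed (exit {err.returncode}): {err.stderr}"
        ) from err
    except subprocess.TimeoutExpired as err:
        raise RuntimeError(f"benchmark command timed out after {timeout_s}s") from err
    # Only reached after a successful run, so the file should be complete.
    with open(output_json_path) as f:
        return json.load(f)


# Stand-in for the benchmark: writes a small JSON file to the given path.
script = (
    "import json, os, pathlib, sys; "
    "pathlib.Path(sys.argv[1]).write_text("
    "json.dumps({'avg_latency': 0.5, 'flag': os.environ['PUZZLE_DEMO_FLAG']}))"
)
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "latency.json"
    result = run_and_load_json(
        [sys.executable, "-c", script, str(out)], out, extra_env={"PUZZLE_DEMO_FLAG": "0"}
    )
```

Passing `env=` to the child keeps the flag scoped to that one process, which also sidesteps the environment-leak concern raised in the later review round.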

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 9d164286-9a12-4cc6-b029-2f48f9feb22c

📥 Commits

Reviewing files that changed from the base of the PR and between 9d2e608 and ab925b9.

📒 Files selected for processing (17)

examples/puzzletron/README.md
examples/puzzletron/configs/llama-3_1-8B_pruneffn_runtime/Llama-3_1-8B.yaml
examples/puzzletron/configs/llama-3_1-8B_pruneffn_runtime/llama-3_1-8B_pruneattn_runtime.yaml
examples/puzzletron/configs/llama-3_1-8B_pruneffn_runtime/pruning/attn_pruning.yaml
examples/puzzletron/configs/llama-3_1-8B_pruneffn_runtime/pruning/ffn_pruning.yaml
examples/puzzletron/configs/llama-3_1-8B_pruneffn_runtime/pruning/hidden_dim_pruning.yaml
examples/puzzletron/configs/llama-3_1-8B_pruneffn_runtime/pruning/pruning_defaults.yaml
examples/puzzletron/configs/llama-3_1-8B_pruneffn_runtime/validate_model_defaults.yaml
examples/puzzletron/configs/llama-3_1-8B_pruneffn_runtime/validate_solutions_defaults.yaml
examples/puzzletron/main.py
modelopt/torch/nas/subblock_stats/__init__.py
modelopt/torch/nas/subblock_stats/calc_runtime_stats.py
modelopt/torch/nas/subblock_stats/calc_subblock_params_and_memory.py
modelopt/torch/nas/subblock_stats/runtime_utils.py
modelopt/torch/nas/subblock_stats/runtime_vllm.py
modelopt/torch/puzzletron/subblock_stats/__init__.py
modelopt/torch/puzzletron/subblock_stats/calc_subblock_stats.py

💤 Files with no reviewable changes (1)

modelopt/torch/puzzletron/subblock_stats/__init__.py

Signed-off-by: Grzegorz Karch <gkarch@nvidia.com>

coderabbitai

♻️ Duplicate comments (2)

modelopt/torch/nas/subblock_stats/runtime_vllm.py (2)

24-29: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use configured batch size for benchmark arguments.

Line 24 and Line 29 hardcode 1, so runtime measurements can ignore the requested workload in RuntimeConfig.batch_size.

Suggested fix

-    args_ns.batch_size = 1
+    batch_size = max(1, int(runtime_config.batch_size))
+    args_ns.batch_size = batch_size
@@
-    args_ns.max_num_seqs = 1
+    args_ns.max_num_seqs = batch_size

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/torch/nas/subblock_stats/runtime_vllm.py` around lines 24 - 29, The
benchmark currently hardcodes args_ns.batch_size = 1 and args_ns.max_num_seqs =
1, ignoring the configured RuntimeConfig.batch_size; update the assignment to
use runtime_config.batch_size (and ensure args_ns.max_num_seqs is set
appropriately, e.g., to runtime_config.batch_size or computed from it) so that
the variables args_ns.batch_size and args_ns.max_num_seqs reflect
RuntimeConfig.batch_size when preparing runtime arguments.
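
A self-contained sketch of the batch-size fix, for illustration. The `RuntimeConfig` dataclass here is a hypothetical stand-in with only the one field the comment mentions, and `apply_batch_size` is an invented helper name; the real benchmark arguments carry many more fields.

```python
from argparse import Namespace
from dataclasses import dataclass


@dataclass
class RuntimeConfig:
    """Hypothetical stand-in; only batch_size is taken from the review."""

    batch_size: int = 1


def apply_batch_size(args_ns: Namespace, runtime_config: RuntimeConfig) -> Namespace:
    """Copy the configured batch size into the benchmark arguments."""
    # Clamp to at least 1 so a zero or negative config cannot reach the benchmark.
    batch_size = max(1, int(runtime_config.batch_size))
    args_ns.batch_size = batch_size
    args_ns.max_num_seqs = batch_size
    return args_ns


args = apply_batch_size(Namespace(), RuntimeConfig(batch_size=8))
assert (args.batch_size, args.max_num_seqs) == (8, 8)
```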

39-40: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid leaking process-wide environment changes.

Line 39 mutates os.environ globally and never restores it. That can affect later benchmark calls in the same process.

Suggested fix

-    os.environ["VLLM_ENABLE_V1_MULTIPROCESSING"] = "0"
-    vllm_latency_main(args_ns)
+    prev = os.environ.get("VLLM_ENABLE_V1_MULTIPROCESSING")
+    os.environ["VLLM_ENABLE_V1_MULTIPROCESSING"] = "0"
+    try:
+        vllm_latency_main(args_ns)
+    finally:
+        if prev is None:
+            os.environ.pop("VLLM_ENABLE_V1_MULTIPROCESSING", None)
+        else:
+            os.environ["VLLM_ENABLE_V1_MULTIPROCESSING"] = prev

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/torch/nas/subblock_stats/runtime_vllm.py` around lines 39 - 40, The
code currently sets os.environ["VLLM_ENABLE_V1_MULTIPROCESSING"] = "0" before
calling vllm_latency_main(args_ns) and never restores it, leaking a process-wide
env change; wrap the mutation in a safe restore pattern (save the previous value
or presence, set the env var, call vllm_latency_main, then restore the original
value or delete the key) using try/finally (or a small context manager) so that
runtime_vllm.py does not leave VLLM_ENABLE_V1_MULTIPROCESSING changed for
subsequent benchmarks.
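
The "small context manager" alternative mentioned above could look like this minimal sketch. `temporary_env` is a hypothetical name, and the demo variable below is made up so the example does not touch any real vLLM setting.

```python
import os
from contextlib import contextmanager


@contextmanager
def temporary_env(name: str, value: str):
    """Set an environment variable for the duration of a with-block."""
    prev = os.environ.get(name)  # None means the variable was not set before
    os.environ[name] = value
    try:
        yield
    finally:
        # Restore the exact prior state, even if the body raised.
        if prev is None:
            os.environ.pop(name, None)
        else:
            os.environ[name] = prev


os.environ.pop("PUZZLE_DEMO_VAR", None)
with temporary_env("PUZZLE_DEMO_VAR", "0"):
    inside = os.environ["PUZZLE_DEMO_VAR"]
assert inside == "0"
assert "PUZZLE_DEMO_VAR" not in os.environ
```

In the PR's setting, the call to `vllm_latency_main(args_ns)` would sit inside the `with` block.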

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 93e0ea08-ca66-4466-bc13-5f7509a10f2c

📥 Commits

Reviewing files that changed from the base of the PR and between ab925b9 and 58f17e4.

📒 Files selected for processing (1)

modelopt/torch/nas/subblock_stats/runtime_vllm.py

kevalmorabia97 · 2026-05-18T19:56:22Z

@grzegorz-k-karch can you address the CodeRabbit / Claude comments and mark them resolved? Also, the vllm import needs to be guarded, otherwise CI fails because it's not a required dependency.
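
One common way to guard an optional dependency is sketched below. This is not the fix that landed in the PR; `optional_import` and `require_module` are hypothetical helpers, and the module names used in the demo are stand-ins so the example runs without vLLM installed.

```python
import importlib
from types import ModuleType
from typing import Optional


def optional_import(name: str) -> Optional[ModuleType]:
    """Return the module if it is installed, otherwise None."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


def require_module(module: Optional[ModuleType], name: str, feature: str) -> ModuleType:
    """Fail with a clear message only when the optional feature is actually used."""
    if module is None:
        raise ImportError(f"{feature} requires the optional dependency '{name}'")
    return module


# Importing at module level stays safe even when the package is missing.
json_mod = optional_import("json")  # stdlib, always present
missing = optional_import("surely_not_an_installed_package_xyz")
assert json_mod is not None
assert missing is None
```

With this pattern, importing the package never fails in CI; the error surfaces only when runtime stats are requested without the dependency installed.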

kevalmorabia97 · 2026-05-18T19:56:28Z

/claude review

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

kevalmorabia97 · 2026-06-08T18:17:43Z

/ok to test 105c736

enabling runtime optimization

816ddfa

Signed-off-by: Grzegorz Karch <gkarch@nvidia.com>

grzegorz-k-karch self-assigned this Apr 28, 2026

Merge branch 'main' into gkarch/runtime_opt

7aa5fe7

grzegorz-k-karch changed the title ~~enabling runtime optimization~~ Enable runtime optimization Apr 28, 2026

grzegorz-k-karch and others added 9 commits April 28, 2026 14:49

done ruff formatting and docstrings

3041dc2

Signed-off-by: Grzegorz Karch <gkarch@nvidia.com>

distributed timeout is configurable

a363750

Signed-off-by: Grzegorz Karch <gkarch@nvidia.com>

Merge branch 'main' into gkarch/runtime_opt

8739fa0

added example config for attn pruning and runtime constraint

53a2caf

Signed-off-by: Grzegorz Karch <gkarch@nvidia.com>

renamed configs

dfb905c

Signed-off-by: Grzegorz Karch <gkarch@nvidia.com>

working on readme

e165171

Signed-off-by: Grzegorz Karch <gkarch@nvidia.com>

working on refactoring

d47b69c

Signed-off-by: Grzegorz Karch <gkarch@nvidia.com>

working on fix

12ed46b

Signed-off-by: Grzegorz Karch <gkarch@nvidia.com>

runtime accuracy improved

ab925b9

Signed-off-by: Grzegorz Karch <gkarch@nvidia.com>

grzegorz-k-karch marked this pull request as ready for review May 18, 2026 10:57

grzegorz-k-karch requested review from a team as code owners May 18, 2026 10:57

grzegorz-k-karch requested a review from realAsma May 18, 2026 10:57

coderabbitai Bot reviewed May 18, 2026

using vllm api instead of subprocess

58f17e4

Signed-off-by: Grzegorz Karch <gkarch@nvidia.com>

coderabbitai Bot reviewed May 18, 2026

kevalmorabia97 requested review from Separius, j-rausch and kevalmorabia97 and removed request for realAsma May 18, 2026 18:54

kevalmorabia97 requested review from a team as code owners June 8, 2026 18:15

kevalmorabia97 requested review from ChenhanYu, h-guo18, jenchen13, realAsma and sugunav14 June 8, 2026 18:15

kevalmorabia97 added 2 commits June 8, 2026 11:16

Fix CI failures

d6e1c6b

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

Merge branch 'main' into gkarch/runtime_opt

105c736

kevalmorabia97 force-pushed the gkarch/runtime_opt branch from d9dff48 to 105c736 Compare June 8, 2026 18:16

kevalmorabia97 removed request for a team, ChenhanYu, h-guo18, jenchen13, realAsma and sugunav14 June 8, 2026 18:17

kevalmorabia97 enabled auto-merge (squash) June 8, 2026 18:17

kevalmorabia97 merged commit b98a595 into main Jun 8, 2026
51 checks passed

kevalmorabia97 deleted the gkarch/runtime_opt branch June 8, 2026 19:22

4 participants
