fix: async llm engine didnt have get_metrics() by terrykong · Pull Request #1943 · NVIDIA-NeMo/RL

terrykong · 2026-02-13T21:56:08Z

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Issues

List issues that this PR closes (syntax):

closes #1934

Usage

You can potentially add a usage example below

# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

...

Summary by CodeRabbit

Bug Fixes
- Improved vLLM metrics collection compatibility with automatic fallback to alternative metric sources when the standard method is unavailable.
Tests
- Enabled asynchronous engine mode for generation tests to enhance test coverage.

Signed-off-by: Terry Kong <terryk@nvidia.com>

coderabbitai · 2026-02-13T22:00:21Z

📝 Walkthrough

Walkthrough

This PR adds a compatibility layer to the vLLM worker module for retrieving metrics with a fallback mechanism when get_metrics is unavailable, and enables asynchronous engine mode in a GRPO test configuration.

Changes

Cohort / File(s)	Summary
vLLM Metrics Compatibility `nemo_rl/models/generation/vllm/vllm_worker.py`	Introduces conditional metrics retrieval: uses `get_metrics()` if available on the LLM instance, otherwise falls back to `get_metrics_snapshot()` from `vllm.v1.metrics.reader`. Adjusts metric processing pathway while maintaining existing return structure.
Test Configuration Update `tests/functional/grpo_non_colocated.sh`	Adds `policy.generation.vllm_cfg.async_engine=true` flag to GRPO test invocation to enable asynchronous engine mode for vLLM-based generation.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

feat: per-worker active/idle timeline + IFB size logging #1534: Also modifies vllm_worker metrics handling—this PR implements metrics logging while the current PR adds a compatibility fallback for metrics retrieval.

Suggested labels

enhancement

Suggested reviewers

chtruong814

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Merge Conflict Detection	⚠️ Warning	❌ Merge conflicts detected (24 files): ⚔️ `.github/workflows/cicd-main.yml` (content) ⚔️ `.gitmodules` (content) ⚔️ `3rdparty/Megatron-Bridge-workspace/Megatron-Bridge` (content) ⚔️ `3rdparty/Megatron-Bridge-workspace/setup.py` (content) ⚔️ `3rdparty/Megatron-LM-workspace/Megatron-LM` (content) ⚔️ `3rdparty/Megatron-LM-workspace/setup.py` (content) ⚔️ `docs/index.md` (content) ⚔️ `examples/configs/grpo_math_1B_megatron.yaml` (content) ⚔️ `nemo_rl/models/generation/vllm/vllm_worker.py` (content) ⚔️ `nemo_rl/models/megatron/common.py` (content) ⚔️ `nemo_rl/models/megatron/config.py` (content) ⚔️ `nemo_rl/models/megatron/data.py` (content) ⚔️ `nemo_rl/models/megatron/setup.py` (content) ⚔️ `nemo_rl/models/policy/workers/megatron_policy_worker.py` (content) ⚔️ `pyproject.toml` (content) ⚔️ `tests/functional/L1_Functional_Tests_GPU.sh` (content) ⚔️ `tests/functional/grpo_non_colocated.sh` (content) ⚔️ `tests/unit/L0_Unit_Tests_Generation.sh` (content) ⚔️ `tests/unit/L0_Unit_Tests_Other.sh` (content) ⚔️ `tests/unit/L0_Unit_Tests_Policy.sh` (content) ⚔️ `tests/unit/algorithms/test_sequence_packing_gradients.py` (content) ⚔️ `tests/unit/models/megatron/test_megatron_setup.py` (content) ⚔️ `tests/unit/models/policy/test_megatron_worker.py` (content) ⚔️ `uv.lock` (content) These conflicts must be resolved before merging into `main`.	Resolve conflicts locally and push changes to this branch.
Test Results For Major Changes	⚠️ Warning	Critical bug identified in code review (incorrect loop variable breaking metrics collection) combined with PR template requirements for testing confirmation but no test results provided in PR description.	Fix the metrics loop variable bug, run the modified grpo_non_colocated.sh test with async_engine=true, and document test results demonstrating successful metrics collection without regressions.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and specifically identifies the main change: adding a compatibility path for get_metrics() when the async LLM engine lacks this method.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch tk/sync-metrics

⚔️ Resolve merge conflicts (beta)

Auto-commit resolved conflicts to branch tk/sync-metrics
Create stacked PR with resolved conflicts
Post resolved changes as copyable diffs in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

nemo_rl/models/generation/vllm/vllm_worker.py (1)
542-554: ⚠️ Potential issue | 🔴 Critical

Bug: iterating over empty metrics dict instead of vllm_prom_metrics.

Line 549 iterates for metric in metrics, but metrics is the empty dict initialized on line 540. This means vllm_prom_metrics is computed but never read, and the method always returns {}. This should iterate over vllm_prom_metrics.
🐛 Proposed fix
             if hasattr(self.llm, "get_metrics"):
                 vllm_prom_metrics = self.llm.get_metrics()
             else:
                 # The AsyncLLM API does not implement get_metrics so we need to call the prometheus API ourselves
                 from vllm.v1.metrics.reader import get_metrics_snapshot
 
                 vllm_prom_metrics = get_metrics_snapshot()
-            for metric in metrics:
+            for metric in vllm_prom_metrics:
                 if hasattr(metric, "values"):
                     metrics[metric.name] = metric.values
                 elif hasattr(metric, "value"):
                     metrics[metric.name] = metric.value

🤖 Fix all issues with AI agents

In `@nemo_rl/models/generation/vllm/vllm_worker.py`:
- Around line 544-548: The loop is iterating the wrong variable: replace the
iteration over the accumulating dict `metrics` with the Prometheus snapshot list
`vllm_prom_metrics` so the metric objects from get_metrics_snapshot() are
processed; find the loop in vllm_worker.py (inside the block that imports
get_metrics_snapshot) and change `for metric in metrics:` to `for metric in
vllm_prom_metrics:` and ensure the loop reads metric.name and metric.value /
metric.values as currently expected to populate the `metrics` dict.

nemo_rl/models/generation/vllm/vllm_worker.py

Signed-off-by: Terry Kong <terryk@nvidia.com>

Signed-off-by: Terry Kong <terryk@nvidia.com> Signed-off-by: yuanhangs <yuanhangs@nvidia.com>

Signed-off-by: Terry Kong <terryk@nvidia.com>

fix: async llm engine didnt have get_metrics()

9945ffc

Signed-off-by: Terry Kong <terryk@nvidia.com>

terrykong requested review from a team as code owners February 13, 2026 21:56

terrykong added the CI:L1 Run doctests, unit tests, and functional tests label Feb 13, 2026

terrykong temporarily deployed to nemo-ci February 13, 2026 21:56 — with GitHub Actions Inactive

terrykong requested review from chtruong814 and yuki-97 and removed request for a team February 13, 2026 21:56

terrykong enabled auto-merge (squash) February 13, 2026 21:56

bxyu-nvidia previously approved these changes Feb 13, 2026

View reviewed changes

coderabbitai bot reviewed Feb 13, 2026

View reviewed changes

nemo_rl/models/generation/vllm/vllm_worker.py Show resolved Hide resolved

terrykong temporarily deployed to nemo-ci February 13, 2026 22:00 — with GitHub Actions Inactive

super silly mistake

be2df01

Signed-off-by: Terry Kong <terryk@nvidia.com>

terrykong dismissed bxyu-nvidia’s stale review via be2df01 February 13, 2026 22:19

terrykong added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Feb 13, 2026

terrykong temporarily deployed to nemo-ci February 13, 2026 22:24 — with GitHub Actions Inactive

chtruong814 approved these changes Feb 13, 2026

View reviewed changes

terrykong temporarily deployed to nemo-ci February 13, 2026 22:29 — with GitHub Actions Inactive

terrykong temporarily deployed to nemo-ci February 14, 2026 01:30 — with GitHub Actions Inactive

terrykong linked an issue Feb 14, 2026 that may be closed by this pull request

[Gym] Single node tutorial from docs crashes on AttributeError #1934

Closed

terrykong merged commit 0e0edcf into main Feb 14, 2026
43 of 44 checks passed

terrykong deleted the tk/sync-metrics branch February 14, 2026 04:23

yuanhangsu1986 pushed a commit to yuanhangsu1986/RL-Nemontron-Edge-Omni that referenced this pull request Feb 21, 2026

fix: async llm engine didnt have get_metrics() (NVIDIA-NeMo#1943)

c784627

Signed-off-by: Terry Kong <terryk@nvidia.com> Signed-off-by: yuanhangs <yuanhangs@nvidia.com>

seonjinn pushed a commit that referenced this pull request Mar 8, 2026

fix: async llm engine didnt have get_metrics() (#1943)

9ef58f0

Signed-off-by: Terry Kong <terryk@nvidia.com>

seonjinn pushed a commit that referenced this pull request Mar 8, 2026

fix: async llm engine didnt have get_metrics() (#1943)

48da505

Signed-off-by: Terry Kong <terryk@nvidia.com>

seonjinn pushed a commit that referenced this pull request Mar 9, 2026

fix: async llm engine didnt have get_metrics() (#1943)

b283694

Signed-off-by: Terry Kong <terryk@nvidia.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: async llm engine didnt have get_metrics()#1943

fix: async llm engine didnt have get_metrics()#1943
terrykong merged 2 commits intomainfrom
tk/sync-metrics

terrykong commented Feb 13, 2026 •

edited

Loading

Uh oh!

coderabbitai bot commented Feb 13, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

terrykong commented Feb 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do ?

Issues

Usage

Before your PR is "Ready for review"

Additional Information

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Feb 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

terrykong commented Feb 13, 2026 •

edited

Loading

coderabbitai bot commented Feb 13, 2026 •

edited

Loading