Skip to content

fix: async llm engine didnt have get_metrics()#1943

Merged
terrykong merged 2 commits intomainfrom
tk/sync-metrics
Feb 14, 2026
Merged

fix: async llm engine didnt have get_metrics()#1943
terrykong merged 2 commits intomainfrom
tk/sync-metrics

Conversation

@terrykong
Copy link
Collaborator

@terrykong terrykong commented Feb 13, 2026

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Issues

List issues that this PR closes (syntax):

closes #1934

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

  • Bug Fixes

    • Improved vLLM metrics collection compatibility with automatic fallback to alternative metric sources when the standard method is unavailable.
  • Tests

    • Enabled asynchronous engine mode for generation tests to enhance test coverage.

Signed-off-by: Terry Kong <terryk@nvidia.com>
@terrykong terrykong requested review from a team as code owners February 13, 2026 21:56
@terrykong terrykong added the CI:L1 Run doctests, unit tests, and functional tests label Feb 13, 2026
@terrykong terrykong requested review from chtruong814 and yuki-97 and removed request for a team February 13, 2026 21:56
@terrykong terrykong enabled auto-merge (squash) February 13, 2026 21:56
bxyu-nvidia
bxyu-nvidia previously approved these changes Feb 13, 2026
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Feb 13, 2026

📝 Walkthrough

Walkthrough

This PR adds a compatibility layer to the vLLM worker module for retrieving metrics with a fallback mechanism when get_metrics is unavailable, and enables asynchronous engine mode in a GRPO test configuration.

Changes

Cohort / File(s) Summary
vLLM Metrics Compatibility
nemo_rl/models/generation/vllm/vllm_worker.py
Introduces conditional metrics retrieval: uses get_metrics() if available on the LLM instance, otherwise falls back to get_metrics_snapshot() from vllm.v1.metrics.reader. Adjusts metric processing pathway while maintaining existing return structure.
Test Configuration Update
tests/functional/grpo_non_colocated.sh
Adds policy.generation.vllm_cfg.async_engine=true flag to GRPO test invocation to enable asynchronous engine mode for vLLM-based generation.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

Suggested labels

enhancement

Suggested reviewers

  • chtruong814
🚥 Pre-merge checks | ✅ 3 | ❌ 2
❌ Failed checks (2 warnings)
Check name Status Explanation Resolution
Merge Conflict Detection ⚠️ Warning ❌ Merge conflicts detected (24 files):

⚔️ .github/workflows/cicd-main.yml (content)
⚔️ .gitmodules (content)
⚔️ 3rdparty/Megatron-Bridge-workspace/Megatron-Bridge (content)
⚔️ 3rdparty/Megatron-Bridge-workspace/setup.py (content)
⚔️ 3rdparty/Megatron-LM-workspace/Megatron-LM (content)
⚔️ 3rdparty/Megatron-LM-workspace/setup.py (content)
⚔️ docs/index.md (content)
⚔️ examples/configs/grpo_math_1B_megatron.yaml (content)
⚔️ nemo_rl/models/generation/vllm/vllm_worker.py (content)
⚔️ nemo_rl/models/megatron/common.py (content)
⚔️ nemo_rl/models/megatron/config.py (content)
⚔️ nemo_rl/models/megatron/data.py (content)
⚔️ nemo_rl/models/megatron/setup.py (content)
⚔️ nemo_rl/models/policy/workers/megatron_policy_worker.py (content)
⚔️ pyproject.toml (content)
⚔️ tests/functional/L1_Functional_Tests_GPU.sh (content)
⚔️ tests/functional/grpo_non_colocated.sh (content)
⚔️ tests/unit/L0_Unit_Tests_Generation.sh (content)
⚔️ tests/unit/L0_Unit_Tests_Other.sh (content)
⚔️ tests/unit/L0_Unit_Tests_Policy.sh (content)
⚔️ tests/unit/algorithms/test_sequence_packing_gradients.py (content)
⚔️ tests/unit/models/megatron/test_megatron_setup.py (content)
⚔️ tests/unit/models/policy/test_megatron_worker.py (content)
⚔️ uv.lock (content)

These conflicts must be resolved before merging into main.
Resolve conflicts locally and push changes to this branch.
Test Results For Major Changes ⚠️ Warning Critical bug identified in code review (incorrect loop variable breaking metrics collection) combined with PR template requirements for testing confirmation but no test results provided in PR description. Fix the metrics loop variable bug, run the modified grpo_non_colocated.sh test with async_engine=true, and document test results demonstrating successful metrics collection without regressions.
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically identifies the main change: adding a compatibility path for get_metrics() when the async LLM engine lacks this method.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch tk/sync-metrics
⚔️ Resolve merge conflicts (beta)
  • Auto-commit resolved conflicts to branch tk/sync-metrics
  • Create stacked PR with resolved conflicts
  • Post resolved changes as copyable diffs in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
nemo_rl/models/generation/vllm/vllm_worker.py (1)

542-554: ⚠️ Potential issue | 🔴 Critical

Bug: iterating over empty metrics dict instead of vllm_prom_metrics.

Line 549 iterates for metric in metrics, but metrics is the empty dict initialized on line 540. This means vllm_prom_metrics is computed but never read, and the method always returns {}. This should iterate over vllm_prom_metrics.

🐛 Proposed fix
             if hasattr(self.llm, "get_metrics"):
                 vllm_prom_metrics = self.llm.get_metrics()
             else:
                 # The AsyncLLM API does not implement get_metrics so we need to call the prometheus API ourselves
                 from vllm.v1.metrics.reader import get_metrics_snapshot
 
                 vllm_prom_metrics = get_metrics_snapshot()
-            for metric in metrics:
+            for metric in vllm_prom_metrics:
                 if hasattr(metric, "values"):
                     metrics[metric.name] = metric.values
                 elif hasattr(metric, "value"):
                     metrics[metric.name] = metric.value
🤖 Fix all issues with AI agents
In `@nemo_rl/models/generation/vllm/vllm_worker.py`:
- Around line 544-548: The loop is iterating the wrong variable: replace the
iteration over the accumulating dict `metrics` with the Prometheus snapshot list
`vllm_prom_metrics` so the metric objects from get_metrics_snapshot() are
processed; find the loop in vllm_worker.py (inside the block that imports
get_metrics_snapshot) and change `for metric in metrics:` to `for metric in
vllm_prom_metrics:` and ensure the loop reads metric.name and metric.value /
metric.values as currently expected to populate the `metrics` dict.

Signed-off-by: Terry Kong <terryk@nvidia.com>
@terrykong terrykong added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Feb 13, 2026
@terrykong terrykong linked an issue Feb 14, 2026 that may be closed by this pull request
@terrykong terrykong merged commit 0e0edcf into main Feb 14, 2026
43 of 44 checks passed
@terrykong terrykong deleted the tk/sync-metrics branch February 14, 2026 04:23
yuanhangsu1986 pushed a commit to yuanhangsu1986/RL-Nemontron-Edge-Omni that referenced this pull request Feb 21, 2026
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
seonjinn pushed a commit that referenced this pull request Mar 8, 2026
Signed-off-by: Terry Kong <terryk@nvidia.com>
seonjinn pushed a commit that referenced this pull request Mar 8, 2026
Signed-off-by: Terry Kong <terryk@nvidia.com>
seonjinn pushed a commit that referenced this pull request Mar 9, 2026
Signed-off-by: Terry Kong <terryk@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CI:L1 Run doctests, unit tests, and functional tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Gym] Single node tutorial from docs crashes on AttributeError

3 participants