fix: fix some trtllm tests #3084

richardhuo-nv · 2025-09-17T07:07:07Z

Overview:

https://gitlab-master.nvidia.com/dl/ai-dynamo/dynamo/-/jobs/210682524
Fix some failing trtllm tests

Details:

TRTLLM is not using ZmqKvEventListenser, so no ZMQ listener logs will be present.
Add some time sleep to wait for the kv event to arrive, since there could be delay
Add the expected response in the logits_processor test

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

closes GitHub issue: #xxx

Summary by CodeRabbit

Tests
- Refined log expectations to remove an unstable batch log pattern, reducing flakiness.
- Made the aggregated chat test deterministic by asserting a specific response when the logits processor is enabled.
- Added a brief delay before validating logs to ensure events are captured reliably, improving CI stability.
Chores
- No changes to public APIs or runtime behavior; end-user functionality remains unaffected.

richardhuo-nv · 2025-09-17T07:08:03Z

/ok to test c35169d

coderabbitai · 2025-09-17T07:11:37Z

Walkthrough

Updates test expectations for aggregated router logs and chat response in TRT-LLM serve tests, and adds a brief delay before validating expected logs in the engine process to allow KV events to appear.

Changes

Cohort / File(s)	Summary
Serve tests: aggregated router expectations `tests/serve/test_trtllm.py`	Adjusted expected logs by removing the ZMQ listener batch pattern; refined assertions to specific worker selection and event processing. Updated a test to pass an explicit expected_response for chat when the test logits processor is enabled.
Test utility: log validation timing `tests/utils/engine_process.py`	Imported time and added a 0.5s sleep before validate_expected_logs in EngineProcess.check_response when payload.expected_log is set to account for delayed KV event logs.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant T as Test
  participant EP as EngineProcess.check_response
  participant L as Logs
  participant V as validate_expected_logs

  T->>EP: check_response(payload with expected_log)
  EP->>EP: If expected_log set
  Note over EP: Delay to allow KV event logs
  EP-->>EP: sleep(0.5)
  EP->>V: validate_expected_logs(expected_log, L)
  V-->>EP: validation result
  EP-->>T: return (assertions proceed)

  rect rgb(230,245,255)
  note right of T: Separate change
  T->>T: chat_payload_default(expected_response=["Hello world!"])
  end

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

I twitch my whiskers, scan the loggy night,
A half-second pause—now signals are in sight.
Workers chosen, events neatly told,
“Hello world!” replies, predictably bold.
I thump with cheer in QA’s warren bright—
Tests hop smoother, everything just right. 🐇✨

Pre-merge checks

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title Check	✅ Passed	The title "fix: fix some trtllm tests" correctly indicates the PR’s purpose of fixing TRTLLM-related tests and aligns with the changes in tests/serve/test_trtllm.py and tests/utils/engine_process.py, but the wording is slightly redundant and could be more specific about the nature of the fixes (removing ZMQ listener log expectations, adding a short kv-event wait, and asserting an expected logits response).
Description Check	✅ Passed	The PR description follows the repository template and includes a concise Overview and a Details section that map to the file-level changes (removing ZMQ log expectations, adding a sleep for KV events, and adding an expected response), so it is mostly complete; however the "Where should the reviewer start?" section is left empty and the "Related Issues" entry uses a placeholder ("#xxx"), so reviewer guidance and issue linkage are missing.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

Tip

👮 Agentic pre-merge checks are now available in preview!

Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.

Built-in checks – Quickly apply ready-made checks to enforce title conventions, require pull request descriptions that follow templates, validate linked issues for compliance, and more.
Custom agentic checks – Define your own rules using CodeRabbit’s advanced agentic capabilities to enforce organization-specific policies and workflows. For example, you can instruct CodeRabbit’s agent to verify that API documentation is updated whenever API schema files are modified in a PR. Note: Upto 5 custom checks are currently allowed during the preview period. Pricing for this feature will be announced in a few weeks.

Please see the documentation for more information.

Example:

reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).

Please share your feedback with us on this Discord post.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (3)

tests/utils/engine_process.py (2)

112-118: Replace fixed sleep with bounded polling to de-flake log assertions.

A hardcoded time.sleep(0.5) is brittle (can be too short on loaded CI and unnecessarily slow when logs are already present). Recommend polling the logs until patterns appear or a bounded timeout elapses, with an env‑configurable deadline.

Apply this diff to the call site:

-        if payload.expected_log:
-            time.sleep(
-                0.5
-            )  # The kv event sometimes needs extra time to arrive and be reflected in the log.
-            self.validate_expected_logs(payload.expected_log)
+        if payload.expected_log:
+            self.validate_expected_logs(
+                payload.expected_log,
+                max_wait_secs=float(os.getenv("DYNAMO_LOG_WAIT_SECS", "3.0")),
+                poll_interval=0.1,
+            )

Then update validate_expected_logs to support polling (see snippet below).

def validate_expected_logs(self, patterns: Any, max_wait_secs: float = 0.0, poll_interval: float = 0.1) -> None:
    """Validate that all regex patterns are present in the current logs, optionally waiting up to max_wait_secs."""
    import re
    import time

    deadline = time.time() + max(0.0, max_wait_secs)
    compiled = [re.compile(p) for p in patterns]

    while True:
        content = self.read_logs() or ""
        if content:
            missing = [p for p, rx in zip(patterns, compiled) if not rx.search(content)]
            if not missing:
                logger.info(f"SUCCESS: All expected log patterns: {patterns} found")
                return
        # Exit if no waiting requested or deadline reached
        if max_wait_secs <= 0.0 or time.time() >= deadline:
            sample = content[-1000:] if content and len(content) > 1000 else content
            if not content:
                raise EngineLogError(f"Log file not available or empty at path: {self.log_path}")
            raise EngineLogError(f"Missing expected log patterns: {missing}\n\nLog sample:\n{sample}")
        time.sleep(poll_interval)

7-7: Drop top‑level time import if polling uses local import.

To stay consistent with the existing “local import to keep module load minimal” style, remove the module‑level import time once the polling logic imports it locally.

-import time

tests/serve/test_trtllm.py (1)

111-113: Align expectation with logits processor output; avoid brittle punctuation.

Docstring says “Hello world”, while the assertion requires “Hello world!”. To reduce brittleness, accept both.
-            chat_payload_default(expected_response=["Hello world!"]),
+            chat_payload_default(expected_response=["Hello world", "Hello world!"]),
Also applies to: 125-126

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 08cb08c and c35169d.

📒 Files selected for processing (2)

tests/serve/test_trtllm.py (1 hunks)
tests/utils/engine_process.py (2 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

tests/serve/test_trtllm.py (1)

tests/utils/payload_builder.py (1)

chat_payload_default (13-36)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Build and Test - dynamo

🔇 Additional comments (1)

tests/serve/test_trtllm.py (1)

60-66: LGTM: ZMQ listener pattern removed; remaining patterns match TRT‑LLM path.

The expected log patterns now reflect the router/scheduler flow without assuming ZMQ listener logs.

alec-flowers · 2025-09-17T07:36:01Z

Thanks for fixing this. Was my mistakes when I did the refactor. @nv-tusharma was also fixing these separately in his trtllm container refactor PR. #3009

richardhuo-nv · 2025-09-17T15:36:15Z

Thanks for fixing this. Was my mistakes when I did the refactor. @nv-tusharma was also fixing these separately in his trtllm container refactor PR. #3009

Thanks @alec-flowers @nv-tusharma ! I need to merge this since the CI code merge gate is blocked on this at the moment.

Signed-off-by: Kristen Kelleher <[email protected]>

fix trtllm tests

c35169d

richardhuo-nv requested review from a team as code owners September 17, 2025 07:07

pull-request-size bot added the size/XS label Sep 17, 2025

github-actions bot added the fix label Sep 17, 2025

coderabbitai bot reviewed Sep 17, 2025

View reviewed changes

alec-flowers approved these changes Sep 17, 2025

View reviewed changes

richardhuo-nv merged commit 002617a into main Sep 17, 2025
13 of 14 checks passed

richardhuo-nv deleted the rihuo/fix_trtllm_tests branch September 17, 2025 15:35

kmkelle-nv pushed a commit that referenced this pull request Sep 17, 2025

fix: fix some trtllm tests (#3084)

b0a5f6e

Signed-off-by: Kristen Kelleher <[email protected]>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: fix some trtllm tests #3084

fix: fix some trtllm tests #3084

Uh oh!

richardhuo-nv commented Sep 17, 2025 •

edited by coderabbitai bot

Loading

Uh oh!

richardhuo-nv commented Sep 17, 2025

Uh oh!

coderabbitai bot commented Sep 17, 2025

Uh oh!

coderabbitai bot left a comment

Uh oh!

alec-flowers commented Sep 17, 2025

Uh oh!

Uh oh!

richardhuo-nv commented Sep 17, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

fix: fix some trtllm tests #3084

fix: fix some trtllm tests #3084

Uh oh!

Conversation

richardhuo-nv commented Sep 17, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview:

Details:

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

Summary by CodeRabbit

Uh oh!

richardhuo-nv commented Sep 17, 2025

Uh oh!

coderabbitai bot commented Sep 17, 2025

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Poem

Pre-merge checks

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

alec-flowers commented Sep 17, 2025

Uh oh!

Uh oh!

richardhuo-nv commented Sep 17, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

richardhuo-nv commented Sep 17, 2025 •

edited by coderabbitai bot

Loading