Skip to content

Conversation

@richardhuo-nv
Copy link
Contributor

@richardhuo-nv richardhuo-nv commented Sep 17, 2025

Overview:

https://gitlab-master.nvidia.com/dl/ai-dynamo/dynamo/-/jobs/210682524
Fix some failing trtllm tests

Details:

  1. TRTLLM is not using ZmqKvEventListenser, so no ZMQ listener logs will be present.
  2. Add some time sleep to wait for the kv event to arrive, since there could be delay
  3. Add the expected response in the logits_processor test

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

  • Tests
    • Refined log expectations to remove an unstable batch log pattern, reducing flakiness.
    • Made the aggregated chat test deterministic by asserting a specific response when the logits processor is enabled.
    • Added a brief delay before validating logs to ensure events are captured reliably, improving CI stability.
  • Chores
    • No changes to public APIs or runtime behavior; end-user functionality remains unaffected.

@richardhuo-nv richardhuo-nv requested review from a team as code owners September 17, 2025 07:07
@github-actions github-actions bot added the fix label Sep 17, 2025
@richardhuo-nv
Copy link
Contributor Author

/ok to test c35169d

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Sep 17, 2025

Walkthrough

Updates test expectations for aggregated router logs and chat response in TRT-LLM serve tests, and adds a brief delay before validating expected logs in the engine process to allow KV events to appear.

Changes

Cohort / File(s) Summary
Serve tests: aggregated router expectations
tests/serve/test_trtllm.py
Adjusted expected logs by removing the ZMQ listener batch pattern; refined assertions to specific worker selection and event processing. Updated a test to pass an explicit expected_response for chat when the test logits processor is enabled.
Test utility: log validation timing
tests/utils/engine_process.py
Imported time and added a 0.5s sleep before validate_expected_logs in EngineProcess.check_response when payload.expected_log is set to account for delayed KV event logs.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant T as Test
  participant EP as EngineProcess.check_response
  participant L as Logs
  participant V as validate_expected_logs

  T->>EP: check_response(payload with expected_log)
  EP->>EP: If expected_log set
  Note over EP: Delay to allow KV event logs
  EP-->>EP: sleep(0.5)
  EP->>V: validate_expected_logs(expected_log, L)
  V-->>EP: validation result
  EP-->>T: return (assertions proceed)

  rect rgb(230,245,255)
  note right of T: Separate change
  T->>T: chat_payload_default(expected_response=["Hello world!"])
  end
Loading

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

I twitch my whiskers, scan the loggy night,
A half-second pause—now signals are in sight.
Workers chosen, events neatly told,
“Hello world!” replies, predictably bold.
I thump with cheer in QA’s warren bright—
Tests hop smoother, everything just right. 🐇✨

Pre-merge checks

✅ Passed checks (3 passed)
Check name Status Explanation
Title Check ✅ Passed The title "fix: fix some trtllm tests" correctly indicates the PR’s purpose of fixing TRTLLM-related tests and aligns with the changes in tests/serve/test_trtllm.py and tests/utils/engine_process.py, but the wording is slightly redundant and could be more specific about the nature of the fixes (removing ZMQ listener log expectations, adding a short kv-event wait, and asserting an expected logits response).
Description Check ✅ Passed The PR description follows the repository template and includes a concise Overview and a Details section that map to the file-level changes (removing ZMQ log expectations, adding a sleep for KV events, and adding an expected response), so it is mostly complete; however the "Where should the reviewer start?" section is left empty and the "Related Issues" entry uses a placeholder ("#xxx"), so reviewer guidance and issue linkage are missing.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

Tip

👮 Agentic pre-merge checks are now available in preview!

Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.

  • Built-in checks – Quickly apply ready-made checks to enforce title conventions, require pull request descriptions that follow templates, validate linked issues for compliance, and more.
  • Custom agentic checks – Define your own rules using CodeRabbit’s advanced agentic capabilities to enforce organization-specific policies and workflows. For example, you can instruct CodeRabbit’s agent to verify that API documentation is updated whenever API schema files are modified in a PR. Note: Upto 5 custom checks are currently allowed during the preview period. Pricing for this feature will be announced in a few weeks.

Please see the documentation for more information.

Example:

reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).

Please share your feedback with us on this Discord post.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (3)
tests/utils/engine_process.py (2)

112-118: Replace fixed sleep with bounded polling to de-flake log assertions.

A hardcoded time.sleep(0.5) is brittle (can be too short on loaded CI and unnecessarily slow when logs are already present). Recommend polling the logs until patterns appear or a bounded timeout elapses, with an env‑configurable deadline.

Apply this diff to the call site:

-        if payload.expected_log:
-            time.sleep(
-                0.5
-            )  # The kv event sometimes needs extra time to arrive and be reflected in the log.
-            self.validate_expected_logs(payload.expected_log)
+        if payload.expected_log:
+            self.validate_expected_logs(
+                payload.expected_log,
+                max_wait_secs=float(os.getenv("DYNAMO_LOG_WAIT_SECS", "3.0")),
+                poll_interval=0.1,
+            )

Then update validate_expected_logs to support polling (see snippet below).

def validate_expected_logs(self, patterns: Any, max_wait_secs: float = 0.0, poll_interval: float = 0.1) -> None:
    """Validate that all regex patterns are present in the current logs, optionally waiting up to max_wait_secs."""
    import re
    import time

    deadline = time.time() + max(0.0, max_wait_secs)
    compiled = [re.compile(p) for p in patterns]

    while True:
        content = self.read_logs() or ""
        if content:
            missing = [p for p, rx in zip(patterns, compiled) if not rx.search(content)]
            if not missing:
                logger.info(f"SUCCESS: All expected log patterns: {patterns} found")
                return
        # Exit if no waiting requested or deadline reached
        if max_wait_secs <= 0.0 or time.time() >= deadline:
            sample = content[-1000:] if content and len(content) > 1000 else content
            if not content:
                raise EngineLogError(f"Log file not available or empty at path: {self.log_path}")
            raise EngineLogError(f"Missing expected log patterns: {missing}\n\nLog sample:\n{sample}")
        time.sleep(poll_interval)

7-7: Drop top‑level time import if polling uses local import.

To stay consistent with the existing “local import to keep module load minimal” style, remove the module‑level import time once the polling logic imports it locally.

-import time
tests/serve/test_trtllm.py (1)

111-113: Align expectation with logits processor output; avoid brittle punctuation.

Docstring says “Hello world”, while the assertion requires “Hello world!”. To reduce brittleness, accept both.

-            chat_payload_default(expected_response=["Hello world!"]),
+            chat_payload_default(expected_response=["Hello world", "Hello world!"]),

Also applies to: 125-126

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 08cb08c and c35169d.

📒 Files selected for processing (2)
  • tests/serve/test_trtllm.py (1 hunks)
  • tests/utils/engine_process.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/serve/test_trtllm.py (1)
tests/utils/payload_builder.py (1)
  • chat_payload_default (13-36)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (1)
tests/serve/test_trtllm.py (1)

60-66: LGTM: ZMQ listener pattern removed; remaining patterns match TRT‑LLM path.

The expected log patterns now reflect the router/scheduler flow without assuming ZMQ listener logs.

@alec-flowers
Copy link
Contributor

Thanks for fixing this. Was my mistakes when I did the refactor. @nv-tusharma was also fixing these separately in his trtllm container refactor PR. #3009

@richardhuo-nv richardhuo-nv merged commit 002617a into main Sep 17, 2025
13 of 14 checks passed
@richardhuo-nv richardhuo-nv deleted the rihuo/fix_trtllm_tests branch September 17, 2025 15:35
@richardhuo-nv
Copy link
Contributor Author

Thanks for fixing this. Was my mistakes when I did the refactor. @nv-tusharma was also fixing these separately in his trtllm container refactor PR. #3009

Thanks @alec-flowers @nv-tusharma ! I need to merge this since the CI code merge gate is blocked on this at the moment.

kmkelle-nv pushed a commit that referenced this pull request Sep 17, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants