
Conversation

Contributor

@zhanghaotong commented on Jul 10, 2025

[None][feat] Add opentelemetry tracing

Description

This PR implements request-level OpenTelemetry tracing with the following attributes and events:

SpanAttributes:
GEN_AI_USAGE_COMPLETION_TOKENS = "gen_ai.usage.completion_tokens"
GEN_AI_USAGE_PROMPT_TOKENS = "gen_ai.usage.prompt_tokens"
GEN_AI_REQUEST_MAX_TOKENS = "gen_ai.request.max_tokens"
GEN_AI_REQUEST_TOP_P = "gen_ai.request.top_p"
GEN_AI_REQUEST_TOP_K = "gen_ai.request.top_k"
GEN_AI_REQUEST_TEMPERATURE = "gen_ai.request.temperature"
GEN_AI_REQUEST_ID = "gen_ai.request.id"
GEN_AI_REQUEST_N = "gen_ai.request.n"
GEN_AI_LATENCY_TIME_TO_FIRST_TOKEN = "gen_ai.latency.time_to_first_token"
GEN_AI_LATENCY_E2E = "gen_ai.latency.e2e"
GEN_AI_LATENCY_TIME_IN_QUEUE = "gen_ai.latency.time_in_queue"
GEN_AI_LATENCY_KV_CACHE_TRANSFER_TIME = "gen_ai.latency.kv_cache_transfer_time"
GEN_AI_RESPONSE_FINISH_REASONS = "gen_ai.response.finish_reasons"

SpanEvents:
KV_CACHE_TRANSFER_START = "kv_cache_transfer_start"
KV_CACHE_TRANSFER_END = "kv_cache_transfer_end"
CTX_SERVER_SELECTED = "ctx_server.selected"
GEN_SERVER_SELECTED = "gen_server.selected"

These trace attributes integrate with the API server's monitoring system through OpenTelemetry, enabling granular observability of generative AI services at the request level.
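
For illustration, a minimal sketch (assuming the public opentelemetry-api; not the PR's actual implementation) of how attributes and events like the ones above can be attached to a request-level span:

from opentelemetry import trace

tracer = trace.get_tracer("trt.llm")


def emit_request_span(request_id: str, e2e_latency_s: float,
                      completion_tokens: int) -> None:
    # Emit one request-level span carrying GenAI attributes like those listed above.
    with tracer.start_as_current_span("llm_request",
                                      kind=trace.SpanKind.SERVER) as span:
        span.set_attribute("gen_ai.request.id", request_id)
        span.set_attribute("gen_ai.latency.e2e", e2e_latency_s)
        span.set_attribute("gen_ai.usage.completion_tokens", completion_tokens)
        # Events such as KV-cache transfer start/end are attached the same way.
        span.add_event("kv_cache_transfer_start")
        span.add_event("kv_cache_transfer_end")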

Test Coverage

For a standalone instance:
(screenshot)

For P/D separation scenarios, the invocation chain is traced automatically across context and generation servers, and kv_cache-related events are generated.
(screenshots)

Summary by CodeRabbit

  • New Features

    • Optional OpenTelemetry tracing: configure an OTLP traces endpoint via CLI/config; trace headers can be supplied to generation APIs and propagated end-to-end.
    • Observability option for disaggregated deployments to enable tracing across services.
  • Improvements

    • Generation results emit spans with latency, token-usage and KV-cache timing attributes (including arrival timestamp) when enabled.
    • Safer disaggregated shutdown with graceful child-process cleanup.
    • One-time diagnostics warnings for tracing.
  • Chores

    • Added OpenTelemetry dependencies to requirements.

@svc-trtllm-gh-bot added the "Community want to contribute" label (PRs initiated from Community) on Jul 10, 2025
@Shunkangz self-requested a review on July 12, 2025 06:16
@zhanghaotong requested a review from a team as a code owner on August 11, 2025 03:05
Contributor

coderabbitai bot commented Aug 11, 2025

Caution

Review failed

Failed to post review comments.

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 25442aa and ec7b020.

📒 Files selected for processing (2)
  • tensorrt_llm/executor/result.py (7 hunks)
  • tensorrt_llm/llmapi/tracing.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/executor/result.py
  • tensorrt_llm/llmapi/tracing.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/executor/result.py
  • tensorrt_llm/llmapi/tracing.py
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tensorrt_llm/executor/result.py
🧬 Code graph analysis (2)
tensorrt_llm/executor/result.py (2)
tensorrt_llm/llmapi/tracing.py (7)
  • global_otlp_tracer (120-122)
  • insufficient_request_metrics_warning (180-182)
  • extract_trace_context (92-98)
  • SpanKind (51-52)
  • SpanAttributes (136-149)
  • add_event (164-170)
  • SpanEvents (152-156)
tensorrt_llm/metrics/enums.py (2)
  • MetricNames (4-9)
  • RequestEventTiming (12-19)
tensorrt_llm/llmapi/tracing.py (1)
tensorrt_llm/logger.py (1)
  • warning (131-132)
📝 Walkthrough

Adds end-to-end OpenTelemetry tracing and trace-header propagation: new tracing facade and utilities, OTLP deps and CLI flag, tracer init on LLM startup, trace headers carried HTTP → LLM → executor → results with span emission and attributes, KV-cache timing metrics, a run-once utility, and signal-based child cleanup for disaggregated serve mode.

Changes

Cohort / File(s) Summary of changes
Dependencies (OTel)
requirements.txt
Adds OpenTelemetry packages and comment: # observation.tracing, opentelemetry-sdk, opentelemetry-api, opentelemetry-exporter-otlp, opentelemetry-semantic-conventions-ai.
Tracing core & facade
tensorrt_llm/llmapi/tracing.py
New tracing module: OTEL availability detection, exporter selection, tracer init/set, context/header extractors, SpanAttributes/SpanKind, one-time warnings, trace_span decorator and graceful no-OTEL degradation.
CLI / args / startup
tensorrt_llm/commands/serve.py, tensorrt_llm/llmapi/llm_args.py, tensorrt_llm/llmapi/llm.py, tensorrt_llm/llmapi/disagg_utils.py
Adds otlp_traces_endpoint to BaseLlmArgs and serve CLI; threads endpoint into get_llm_args and LLM startup; wires observability_cfg into disagg config and OpenAIDisaggServer; installs SIGINT/SIGTERM child-cleanup handlers in disaggregated leader flow.
Executor / Request / Result / Worker / Metrics
tensorrt_llm/executor/executor.py, tensorrt_llm/executor/request.py, tensorrt_llm/executor/result.py, tensorrt_llm/executor/worker.py, tensorrt_llm/metrics/enums.py
Propagates trace_headers through GenerationRequest → GenerationExecutor → GenerationResult; adds do_tracing to emit server spans with attributes/events and time-correction; merges metrics updates; adds KV cache timing metrics and ARRIVAL_TIMESTAMP.
OpenAI server (non-disagg)
tensorrt_llm/serve/openai_server.py
Extracts trace headers from incoming HTTP requests and forwards trace_headers to LLM.generate_async across chat/completion/streaming flows.
OpenAI disaggregated server
tensorrt_llm/serve/openai_disagg_server.py
OpenAIDisaggServer accepts observability_cfg, initializes tracer when endpoint provided; decorates endpoints with tracing; threads trace_headers through dispatch, generator creation, downstream HTTP calls, and streaming merge flows.
Utilities
tensorrt_llm/_utils.py
Adds typed P = ParamSpec("P") and a run_once decorator for one-time warnings/logs.
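
A run-once decorator along these lines might look like the following minimal sketch (typed with ParamSpec as described above); the actual helper added to tensorrt_llm/_utils.py may differ:

import functools
from typing import Callable, Optional, TypeVar

from typing_extensions import ParamSpec  # typing.ParamSpec on Python 3.10+

P = ParamSpec("P")
R = TypeVar("R")


def run_once(func: Callable[P, R]) -> Callable[P, Optional[R]]:
    """Invoke the wrapped function only on its first call; later calls are no-ops."""

    @functools.wraps(func)
    def wrapper(*args: P.args, **kwargs: P.kwargs) -> Optional[R]:
        if not wrapper._has_run:  # type: ignore[attr-defined]
            wrapper._has_run = True  # type: ignore[attr-defined]
            return func(*args, **kwargs)
        return None

    wrapper._has_run = False  # type: ignore[attr-defined]
    return wrapper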

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant OpenAI_Server as "OpenAI Server"
  participant LLM
  participant Executor
  participant Result
  participant OTLP as "OTLP Exporter"

  Client->>OpenAI_Server: HTTP request (may include trace headers)
  OpenAI_Server->>OpenAI_Server: extract_trace_headers()
  OpenAI_Server->>LLM: generate_async(..., trace_headers)
  LLM->>Executor: generate_async(..., trace_headers)
  Executor->>Result: construct GenerationResult(trace_headers)
  Result->>Result: do_tracing(output, metrics)  %% # subtle: creates server span
  Result->>OTLP: Export span "llm_request" with attributes/events
  Result-->>Client: response
sequenceDiagram
  participant Leader as "Serve Leader"
  participant Child as "Child Process"
  participant OS as "OS Signals"

  Leader->>Leader: install SIGINT/SIGTERM handlers
  Leader->>Child: launch child process
  OS-->>Leader: SIGINT/SIGTERM received
  Leader->>Child: send terminate (graceful), wait 10s
  alt still alive
    Leader->>Child: kill
  end
  Leader->>Leader: restore handlers and exit (128+sig)
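
As a rough illustration of the first sequence diagram, incoming W3C trace headers can be turned into an OpenTelemetry Context that parents the request span. This is a hypothetical sketch, not the PR's tracing module (which provides its own extract_trace_context/extract_trace_headers helpers):

from typing import Mapping, Optional

from opentelemetry.context.context import Context
from opentelemetry.trace.propagation.tracecontext import \
    TraceContextTextMapPropagator

_propagator = TraceContextTextMapPropagator()


def extract_trace_context(
        headers: Optional[Mapping[str, str]]) -> Optional[Context]:
    # Parse traceparent/tracestate from the incoming request headers, if present.
    if not headers:
        return None
    return _propagator.extract(carrier=dict(headers))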

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • Superjomn
  • Shunkangz
  • nv-guomingz


@zhanghaotong changed the title from "feat: OTLP Trace Implementation" to "[None][feat] Add OpenTelemetry tracing" on Aug 11, 2025
@zhanghaotong marked this pull request as draft on August 11, 2025 03:09
Contributor

coderabbitai bot left a comment

Actionable comments posted: 4

🔭 Outside diff range comments (1)
tensorrt_llm/executor/result.py (1)

153-166: Track whether a trace span was emitted to avoid duplicates

In multi-output scenarios (n > 1 or beam search), do_tracing can be called multiple times. Add a guard to emit one request-level span.

Apply this diff:

         self.metrics_dict = {}
-        self.trace_headers = None
+        self.trace_headers = None
+        self._trace_emitted = False  # Prevent duplicate request-level spans
🧹 Nitpick comments (5)
requirements-dev.txt (1)

36-39: Consider pinning OpenTelemetry packages for reproducibility

All new OpenTelemetry deps are unpinned. Given the wide API surface and rapid releases, unpinned versions can lead to non-reproducible dev environments or breakages.

  • If you have a known-good set (e.g., opentelemetry-api/opentelemetry-sdk/opentelemetry-exporter-otlp at compatible major/minors), please pin them here.
  • Otherwise, at least add upper bounds (e.g., <2) to avoid auto-upgrades to breaking majors.
tensorrt_llm/serve/openai_server.py (2)

14-14: Fix long import line and isort reordering

The import exceeds the line-length and isort flagged import reformatting.

Apply this diff:

-from tensorrt_llm.llmapi.otel_tracing import contains_trace_headers, extract_trace_headers, is_tracing_enabled, log_tracing_disabled_warning
+from tensorrt_llm.llmapi.otel_tracing import (
+    contains_trace_headers,
+    extract_trace_headers,
+    is_tracing_enabled,
+    log_tracing_disabled_warning,
+)

Afterwards, run isort/ruff to normalize the entire file.


525-533: Make helper synchronous or document it; avoid unnecessary await

_get_trace_headers is async but does not await anything. Consider making it sync to reduce overhead, or at least document the behavior.

If you prefer to keep it async for future awaits, add a short docstring:

 async def _get_trace_headers(
     self,
     headers: Headers,
 ) -> Optional[Mapping[str, str]]:
+    """
+    Extract W3C trace headers from incoming request when tracing is enabled.
+    Returns None when tracing is disabled; logs a warning if headers carry trace context.
+    """
tensorrt_llm/llmapi/llm.py (1)

332-345: Document the new trace_headers parameter in generate_async docstring

Public API now accepts trace_headers; include it in the docstring for clarity and Sphinx docs.

Apply this diff snippet inside the Args section:

             scheduling_params (tensorrt_llm.scheduling_params.SchedulingParams, optional): Scheduling parameters. Defaults to None.
+            trace_headers (Optional[Mapping[str, str]], optional): W3C trace context headers (e.g., traceparent, tracestate, baggage) to propagate request-level tracing. Defaults to None.
tensorrt_llm/commands/serve.py (1)

35-43: Avoid logging in signal handlers; use print or defer

Comment claims "Using print for safety" but the code uses logger.info in a signal handler. Logging is generally not async-signal-safe.

Consider using print(...) within the handler, or set a flag and perform logging outside of the signal handler.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2326b0b and eada6ed.

📒 Files selected for processing (12)
  • requirements-dev.txt (1 hunks)
  • requirements.txt (1 hunks)
  • tensorrt_llm/_utils.py (2 hunks)
  • tensorrt_llm/commands/serve.py (5 hunks)
  • tensorrt_llm/executor/executor.py (3 hunks)
  • tensorrt_llm/executor/request.py (3 hunks)
  • tensorrt_llm/executor/result.py (6 hunks)
  • tensorrt_llm/llmapi/llm.py (5 hunks)
  • tensorrt_llm/llmapi/llm_args.py (1 hunks)
  • tensorrt_llm/llmapi/otel_tracing.py (1 hunks)
  • tensorrt_llm/serve/openai_server.py (5 hunks)
  • tensorrt_llm/utils/utils.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (7)
  • tensorrt_llm/executor/request.py
  • tensorrt_llm/_utils.py
  • tensorrt_llm/executor/executor.py
  • tensorrt_llm/llmapi/llm_args.py
  • requirements.txt
  • tensorrt_llm/utils/utils.py
  • tensorrt_llm/llmapi/otel_tracing.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the class docstring.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Files:

  • tensorrt_llm/llmapi/llm.py
  • tensorrt_llm/executor/result.py
  • tensorrt_llm/commands/serve.py
  • tensorrt_llm/serve/openai_server.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • tensorrt_llm/llmapi/llm.py
  • tensorrt_llm/executor/result.py
  • tensorrt_llm/commands/serve.py
  • tensorrt_llm/serve/openai_server.py
🧬 Code Graph Analysis (4)
tensorrt_llm/llmapi/llm.py (3)
tensorrt_llm/mapping.py (1)
  • Mapping (20-453)
tensorrt_llm/llmapi/otel_tracing.py (1)
  • init_tracer (48-63)
tensorrt_llm/logger.py (1)
  • error (125-126)
tensorrt_llm/executor/result.py (2)
tensorrt_llm/llmapi/otel_tracing.py (4)
  • global_otlp_tracer (94-96)
  • extract_trace_context (81-86)
  • SpanKind (37-38)
  • SpanAttributes (110-120)
tensorrt_llm/metrics/enums.py (1)
  • RequestEventTiming (11-15)
tensorrt_llm/commands/serve.py (1)
tensorrt_llm/builder.py (1)
  • default (50-58)
tensorrt_llm/serve/openai_server.py (3)
tensorrt_llm/llmapi/otel_tracing.py (4)
  • contains_trace_headers (123-124)
  • extract_trace_headers (89-91)
  • is_tracing_enabled (106-107)
  • log_tracing_disabled_warning (128-129)
tensorrt_llm/executor/executor.py (1)
  • generate_async (113-156)
tensorrt_llm/llmapi/llm.py (1)
  • generate_async (316-450)
🪛 GitHub Actions: Release Checks
tensorrt_llm/llmapi/llm.py

[error] 218-218: codespell: 'endpont' ==> 'endpoint'.


[error] 1-1: isort: imports reformatted in this file.


[error] 218-218: codespell: 'endpont' ==> 'endpoint'.

tensorrt_llm/executor/result.py

[error] 1-1: isort: imports reformatted in this file.

tensorrt_llm/commands/serve.py

[error] 1-1: isort: imports reformatted in this file.

tensorrt_llm/serve/openai_server.py

[error] 1-1: isort: imports reformatted in this file.

🪛 Ruff (0.12.2)
tensorrt_llm/executor/result.py

420-420: Undefined name SupportedMetricNames

(F821)


446-446: Undefined name SupportedMetricNames

(F821)


453-453: Undefined name SupportedMetricNames

(F821)

tensorrt_llm/serve/openai_server.py

14-14: Line too long (140 > 120)

(E501)

🔇 Additional comments (8)
tensorrt_llm/serve/openai_server.py (2)

342-352: LGTM: Trace headers propagated to generation path

Headers are extracted once per request and forwarded via trace_headers, aligning with updated LLM.generate_async.


479-488: LGTM: Trace headers propagation in completions path

Consistent extraction/forwarding for each prompt. Good parity with chat path.

tensorrt_llm/llmapi/llm.py (1)

1-25: Import order fixed in tensorrt_llm/llmapi/llm.py

All import ordering warnings have been resolved by running ruff --select I … --fix (2 issues fixed, 0 remaining).

tensorrt_llm/executor/result.py (2)

410-418: Verify start_time units for OpenTelemetry spans

OpenTelemetry Python expects start_time in time_ns (nanoseconds) in current SDKs. Here we pass an int from RequestEventTiming; ensure it’s in the expected unit.

Please confirm RequestEventTiming.* values are in nanoseconds. If they are in seconds, convert with int(seconds * 1e9).


1-20: Import ordering normalized
The ruff --fix run addressed the import sorting issue; no remaining import-order violations are detected. Great job!

tensorrt_llm/commands/serve.py (3)

290-296: LGTM: Exposes --otlp_traces_endpoint CLI option

The option, help text, and default are reasonable.


307-309: LGTM: Plumbs otlp_traces_endpoint through to get_llm_args

End-to-end wiring from CLI to llm_args is correct.


1-20: Imports have been auto-formatted by Ruff
Ruff fixed the import ordering and grouping issues—no remaining import lint errors.

Contributor

coderabbitai bot left a comment

Actionable comments posted: 5

♻️ Duplicate comments (1)
tensorrt_llm/executor/result.py (1)

397-461: Emit a single request-level span; add duplicate guard

do_tracing runs per sequence; without a guard, you may emit multiple request-level spans (e.g., best_of > 1 or beams). Add a one-shot flag.

     def do_tracing(
         self,
         output: CompletionOutput,
         req_perf_metrics_dict: Optional[dict[str, float]] = None,
     ):
-        if not global_otlp_tracer() or not req_perf_metrics_dict():
+        # Avoid duplicate request-level spans
+        if getattr(self, "_trace_emitted", False):
+            return
+        if not otel_tracing.global_otlp_tracer() or not req_perf_metrics_dict:
             return
@@
-        with global_otlp_tracer().start_as_current_span(
+        with otel_tracing.global_otlp_tracer().start_as_current_span(
                 "llm_request",
-                kind=SpanKind.SERVER,
+                kind=otel_tracing.SpanKind.SERVER,
                 context=trace_context,
                 start_time=int(
                     req_perf_metrics_dict.get(RequestEventTiming.ARRIVAL_TIME,
                                               0)),
         ) as span:
@@
-            safe_set_attr(span, SpanAttributes.GEN_AI_REQUEST_TEMPERATURE,
+            safe_set_attr(span, otel_tracing.SpanAttributes.GEN_AI_REQUEST_TEMPERATURE,
                           sampling_params.temperature)
-            safe_set_attr(span, SpanAttributes.GEN_AI_REQUEST_TOP_P,
+            safe_set_attr(span, otel_tracing.SpanAttributes.GEN_AI_REQUEST_TOP_P,
                           sampling_params.top_p)
-            safe_set_attr(span, SpanAttributes.GEN_AI_REQUEST_MAX_TOKENS,
+            safe_set_attr(span, otel_tracing.SpanAttributes.GEN_AI_REQUEST_MAX_TOKENS,
                           sampling_params.max_tokens)
-            safe_set_attr(span, SpanAttributes.GEN_AI_REQUEST_N, sampling_params.n)
+            safe_set_attr(span, otel_tracing.SpanAttributes.GEN_AI_REQUEST_N, sampling_params.n)
             safe_set_attr(
                 span,
-                SpanAttributes.GEN_AI_USAGE_PROMPT_TOKENS,
+                otel_tracing.SpanAttributes.GEN_AI_USAGE_PROMPT_TOKENS,
                 getattr(self.postproc_params.postproc_args, "num_prompt_tokens", None)
                 if self.postproc_params and self.postproc_params.postproc_args
                 else None,
             )
-            safe_set_attr(span, SpanAttributes.GEN_AI_USAGE_COMPLETION_TOKENS, output.length)
+            safe_set_attr(span, otel_tracing.SpanAttributes.GEN_AI_USAGE_COMPLETION_TOKENS, output.length)
             safe_set_attr(
                 span,
-                SpanAttributes.GEN_AI_LATENCY_TIME_TO_FIRST_TOKEN,
+                otel_tracing.SpanAttributes.GEN_AI_LATENCY_TIME_TO_FIRST_TOKEN,
                 metrics_dict.get(MetricNames.TTFT, -1),
             )
-            safe_set_attr(span, SpanAttributes.GEN_AI_LATENCY_E2E, e2e_time)
-            safe_set_attr(span, SpanAttributes.GEN_AI_REQUEST_ID, self.id)
+            safe_set_attr(span, otel_tracing.SpanAttributes.GEN_AI_LATENCY_E2E, e2e_time)
+            safe_set_attr(span, otel_tracing.SpanAttributes.GEN_AI_REQUEST_ID, self.id)
             safe_set_attr(
                 span,
-                SpanAttributes.GEN_AI_LATENCY_TIME_IN_QUEUE,
+                otel_tracing.SpanAttributes.GEN_AI_LATENCY_TIME_IN_QUEUE,
                 metrics_dict.get(MetricNames.REQUEST_QUEUE_TIME, -1),
             )
+
+        self._trace_emitted = True
🧹 Nitpick comments (6)
tensorrt_llm/llmapi/llm.py (3)

19-19: Follow namespace import guideline for OTEL utilities

Per project guidelines, prefer namespaced imports. Import the module and reference its members.

-from tensorrt_llm.llmapi.otel_tracing import init_tracer
+from tensorrt_llm.llmapi import otel_tracing

Also update the call site:

-                init_tracer("trt.llm", self.args.otlp_traces_endpoint)
+                otel_tracing.init_tracer("trt.llm", self.args.otlp_traces_endpoint)

214-222: Tracer initialization nit: include exception traceback for easier diagnosis

Initialization is correctly guarded. Include the traceback in the error log for root-cause analysis.

-        except Exception as e:
-            logger.error(f"Failed to initialize OTLP tracer: {e}")
+        except Exception as e:
+            logger.error(f"Failed to initialize OTLP tracer: {e}", e)
+            # If logger supports keyword, prefer: logger.error("...", exc_info=True)

317-347: Document the new trace_headers parameter in generate_async docstring

Add a brief description for trace_headers to keep the public API docs complete.

Example to add under Args:

  • trace_headers (Mapping[str, str], optional): Trace-context headers (e.g., W3C traceparent/tracestate) to propagate for request-level spans.
tensorrt_llm/executor/result.py (1)

13-15: Adopt namespaced import for OTEL helpers to match project guideline

Import the module once and reference its members for clarity and consistency.

-from tensorrt_llm.llmapi.otel_tracing import (SpanAttributes, SpanKind,
-                                              extract_trace_context,
-                                              global_otlp_tracer)
+from tensorrt_llm.llmapi import otel_tracing

Then update usages accordingly (examples):

-        if not global_otlp_tracer() or not req_perf_metrics_dict():
+        if not otel_tracing.global_otlp_tracer() or not req_perf_metrics_dict:

-        trace_context = extract_trace_context(self.trace_headers)
+        trace_context = otel_tracing.extract_trace_context(self.trace_headers)

-        with global_otlp_tracer().start_as_current_span(
+        with otel_tracing.global_otlp_tracer().start_as_current_span(
             "llm_request",
-            kind=SpanKind.SERVER,
+            kind=otel_tracing.SpanKind.SERVER,
             context=trace_context,
             ...
         ) as span:
-            safe_set_attr(span, SpanAttributes.GEN_AI_REQUEST_TEMPERATURE, ...)
+            safe_set_attr(span, otel_tracing.SpanAttributes.GEN_AI_REQUEST_TEMPERATURE, ...)
tensorrt_llm/llmapi/otel_tracing.py (2)

96-101: Make set_global_otlp_tracer idempotent or warn on re-init

Hard assert may fire in multi-init scenarios (e.g., tests, multiple LLM instances). Prefer warn-and-return or allow replace.

-def set_global_otlp_tracer(tracer: Tracer):
+def set_global_otlp_tracer(tracer: Tracer):
     """Set the global OTLP Tracer instance in the current process."""
     global _global_tracer_
-    assert _global_tracer_ is None
-    _global_tracer_ = tracer
+    if _global_tracer_ is not None:
+        logger.warning("Global OTLP tracer already set; ignoring duplicate initialization.")
+        return
+    _global_tracer_ = tracer

5-7: Namespace import for run_once decorator

Prefer namespaced import per guidelines.

-from tensorrt_llm._utils import run_once
+from tensorrt_llm import _utils as tllm_utils

And update the decorator:

-@run_once
+@tllm_utils.run_once
 def log_tracing_disabled_warning() -> None:
     logger.warning(
         "Received a request with trace context but tracing is disabled")

Also applies to: 124-127

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between eada6ed and b395ee5.

📒 Files selected for processing (12)
  • requirements-dev.txt (1 hunks)
  • requirements.txt (1 hunks)
  • tensorrt_llm/_utils.py (2 hunks)
  • tensorrt_llm/commands/serve.py (5 hunks)
  • tensorrt_llm/executor/executor.py (3 hunks)
  • tensorrt_llm/executor/request.py (3 hunks)
  • tensorrt_llm/executor/result.py (5 hunks)
  • tensorrt_llm/llmapi/llm.py (5 hunks)
  • tensorrt_llm/llmapi/llm_args.py (1 hunks)
  • tensorrt_llm/llmapi/otel_tracing.py (1 hunks)
  • tensorrt_llm/serve/openai_server.py (6 hunks)
  • tensorrt_llm/utils/utils.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (7)
  • tensorrt_llm/_utils.py
  • requirements.txt
  • tensorrt_llm/executor/executor.py
  • tensorrt_llm/executor/request.py
  • tensorrt_llm/commands/serve.py
  • tensorrt_llm/serve/openai_server.py
  • tensorrt_llm/utils/utils.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the class docstring.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Files:

  • tensorrt_llm/llmapi/llm_args.py
  • tensorrt_llm/executor/result.py
  • tensorrt_llm/llmapi/llm.py
  • tensorrt_llm/llmapi/otel_tracing.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • tensorrt_llm/llmapi/llm_args.py
  • tensorrt_llm/executor/result.py
  • tensorrt_llm/llmapi/llm.py
  • tensorrt_llm/llmapi/otel_tracing.py
🧬 Code Graph Analysis (4)
tensorrt_llm/llmapi/llm_args.py (1)
tensorrt_llm/builder.py (1)
  • default (50-58)
tensorrt_llm/executor/result.py (2)
tensorrt_llm/llmapi/otel_tracing.py (4)
  • SpanAttributes (107-117)
  • SpanKind (36-37)
  • extract_trace_context (77-83)
  • global_otlp_tracer (91-93)
tensorrt_llm/metrics/enums.py (2)
  • RequestEventTiming (11-15)
  • MetricNames (4-8)
tensorrt_llm/llmapi/llm.py (2)
tensorrt_llm/llmapi/otel_tracing.py (1)
  • init_tracer (47-60)
tensorrt_llm/logger.py (1)
  • error (125-126)
tensorrt_llm/llmapi/otel_tracing.py (2)
tensorrt_llm/_utils.py (1)
  • run_once (1137-1145)
tensorrt_llm/logger.py (1)
  • warning (131-132)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (2)
tensorrt_llm/llmapi/llm_args.py (1)

1314-1318: New tracing endpoint arg LGTM

Optional field with clear description and alias. No issues from a typing or defaulting perspective.

tensorrt_llm/executor/result.py (1)

416-419: Confirm start_time units are compatible with OpenTelemetry

OpenTelemetry Python expects start_time in nanoseconds since Unix epoch. Verify req_perf_metrics_dict timestamps and convert as needed (e.g., seconds->ns or ms->ns) before passing.

If timestamps are ms, convert via int(ms_value * 1e6); if seconds, int(sec_value * 1e9). Please confirm the units of RequestEventTiming in the producing component and adjust here.
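
For reference, a small sketch of the nanosecond conversion the comment describes (the function name is illustrative, not from the PR):

def to_epoch_nanoseconds(value: float, unit: str = "s") -> int:
    """Convert a timestamp to integer nanoseconds, as OpenTelemetry's start_time expects."""
    scale = {"s": 1e9, "ms": 1e6, "us": 1e3, "ns": 1.0}[unit]
    return int(value * scale)

# e.g. start_time=to_epoch_nanoseconds(arrival_time_seconds, "s")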

Contributor

coderabbitai bot left a comment

Actionable comments posted: 0

🔭 Outside diff range comments (2)
tensorrt_llm/commands/serve.py (2)

35-70: Consider signal-safety of logging operations.

While the signal handler implementation is well-structured with proper graceful termination and fallback to kill, using logger.info within a signal handler may not be signal-safe. In Python, most operations including logging are not guaranteed to be async-signal-safe, which could lead to deadlocks or undefined behavior if the signal interrupts certain operations.

Consider using simpler, signal-safe alternatives like os.write to stderr for critical messages, or defer the logging to after the signal handler returns.

Example alternative for critical messages:

import os
import sys

def _signal_handler_cleanup_child(signum, frame):
    """Signal handler to clean up the child process."""
    global _child_p_global
    if _child_p_global and _child_p_global.poll() is None:
        # Use os.write for signal-safe output
        msg = f"Parent process (PID {os.getpid()}) received signal {signum}. Terminating child process (PID {_child_p_global.pid}).\n"
        os.write(sys.stderr.fileno(), msg.encode())
        
        _child_p_global.terminate()
        # ... rest of the handler

606-607: Consider replacing assertions with proper error handling.

Using assert statements for runtime validation can be problematic as they are removed when Python runs with optimization (-O flag). Consider replacing with explicit error handling:

-            assert _child_p_global.poll(
-            ) is not None, f"the subprocess should be terminated"
+            if _child_p_global.poll() is None:
+                logger.error(f"Failed to terminate subprocess despite cleanup attempts")
+                raise RuntimeError(f"The subprocess should be terminated but is still running")

Similarly for line 613-614:

-        assert final_status is not None, \
-            f"The subprocess (PID {_child_p_global.pid}) should be terminated, but its status is {final_status}"
+        if final_status is None:
+            logger.error(f"The subprocess (PID {_child_p_global.pid}) should be terminated, but its status is {final_status}")
+            raise RuntimeError(f"Subprocess termination verification failed")

Also applies to: 613-614

🧹 Nitpick comments (1)
tensorrt_llm/commands/serve.py (1)

3-3: Remove the unnecessary comment.

The inline comment # Added import is not needed and should be removed for cleaner code.

-import signal  # Added import
+import signal
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b395ee5 and e87b643.

📒 Files selected for processing (11)
  • requirements.txt (1 hunks)
  • tensorrt_llm/_utils.py (2 hunks)
  • tensorrt_llm/commands/serve.py (5 hunks)
  • tensorrt_llm/executor/executor.py (3 hunks)
  • tensorrt_llm/executor/request.py (3 hunks)
  • tensorrt_llm/executor/result.py (5 hunks)
  • tensorrt_llm/llmapi/llm.py (5 hunks)
  • tensorrt_llm/llmapi/llm_args.py (1 hunks)
  • tensorrt_llm/llmapi/otel_tracing.py (1 hunks)
  • tensorrt_llm/serve/openai_server.py (6 hunks)
  • tensorrt_llm/utils/utils.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (6)
  • tensorrt_llm/llmapi/llm_args.py
  • requirements.txt
  • tensorrt_llm/serve/openai_server.py
  • tensorrt_llm/executor/executor.py
  • tensorrt_llm/_utils.py
  • tensorrt_llm/utils/utils.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the class docstring.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Files:

  • tensorrt_llm/executor/request.py
  • tensorrt_llm/executor/result.py
  • tensorrt_llm/llmapi/otel_tracing.py
  • tensorrt_llm/llmapi/llm.py
  • tensorrt_llm/commands/serve.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • tensorrt_llm/executor/request.py
  • tensorrt_llm/executor/result.py
  • tensorrt_llm/llmapi/otel_tracing.py
  • tensorrt_llm/llmapi/llm.py
  • tensorrt_llm/commands/serve.py
🧠 Learnings (2)
📚 Learning: 2025-08-06T21:22:55.018Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-06T21:22:55.018Z
Learning: Applies to **/*.{cpp,h,hpp,cc,cxx,cu,py} : All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Applied to files:

  • tensorrt_llm/llmapi/otel_tracing.py
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • tensorrt_llm/llmapi/otel_tracing.py
🧬 Code Graph Analysis (4)
tensorrt_llm/executor/request.py (1)
tensorrt_llm/mapping.py (1)
  • Mapping (20-453)
tensorrt_llm/executor/result.py (2)
tensorrt_llm/llmapi/otel_tracing.py (4)
  • SpanAttributes (110-120)
  • SpanKind (38-39)
  • extract_trace_context (79-85)
  • global_otlp_tracer (94-96)
tensorrt_llm/metrics/enums.py (2)
  • RequestEventTiming (11-15)
  • MetricNames (4-8)
tensorrt_llm/llmapi/otel_tracing.py (2)
tensorrt_llm/_utils.py (1)
  • run_once (1137-1145)
tensorrt_llm/logger.py (1)
  • warning (131-132)
tensorrt_llm/llmapi/llm.py (2)
tensorrt_llm/mapping.py (1)
  • Mapping (20-453)
tensorrt_llm/llmapi/otel_tracing.py (1)
  • init_tracer (49-62)
🪛 GitHub Actions: Release Checks
tensorrt_llm/llmapi/otel_tracing.py

[warning] 118-118: Bandit B105: hardcoded_password_string. Possible hardcoded password: 'gen_ai.latency.time_to_first_token'. Location: tensorrt_llm/llmapi/otel_tracing.py:118:41. Failing command: bandit --configfile scripts/bandit.yaml -r tensorrt_llm.

🔇 Additional comments (15)
tensorrt_llm/executor/request.py (1)

2-2: LGTM! Clean implementation of trace headers propagation.

The addition of trace_headers parameter and its storage follows proper patterns:

  • Correct use of Mapping[str, str] type annotation for immutable header collections
  • Backward-compatible placement of the optional parameter
  • Simple pass-through storage without unnecessary validation

Also applies to: 98-98, 126-126

tensorrt_llm/llmapi/llm.py (3)

9-9: LGTM! Necessary imports added correctly.

The imports are properly placed and follow the existing import organization pattern.

Also applies to: 19-19


214-222: LGTM! Robust tracer initialization with proper error handling.

The implementation correctly:

  • Guards initialization with a try/except to prevent startup failures
  • Only initializes when endpoint is configured
  • Provides clear logging for both success and failure cases

327-327: LGTM! Clean trace headers propagation through the generation pipeline.

The trace_headers parameter is properly:

  • Added with correct type annotation Optional[Mapping[str, str]]
  • Positioned to maintain backward compatibility
  • Forwarded correctly to the executor

Also applies to: 446-446

tensorrt_llm/executor/result.py (3)

13-16: LGTM! Correct OpenTelemetry imports.

All necessary tracing components are imported properly.


167-167: LGTM! Proper trace headers initialization.

Trace headers are correctly:

  • Initialized to None in GenerationResultBase
  • Copied from generation_request in GenerationResult

Also applies to: 572-572


296-296: LGTM! Well-implemented tracing with proper guards and error handling.

The implementation correctly:

  • Guards against missing tracer and metrics
  • Safely accesses postproc_params with proper None checks
  • Uses MetricNames enum correctly (not the undefined SupportedMetricNames)
  • Sets span attributes only when values are non-None
  • Is called exactly once when the request completes

Also applies to: 397-461

tensorrt_llm/llmapi/otel_tracing.py (5)

1-2: LGTM! Copyright header present as required.


16-43: LGTM! Excellent fallback handling for optional OpenTelemetry.

The implementation gracefully handles missing OpenTelemetry by:

  • Capturing the import error for informative messages
  • Providing placeholder types to satisfy type checking
  • Allowing the codebase to function without OTEL installed

88-91: LGTM! Case-insensitive header handling implemented correctly.

Both functions now properly handle HTTP headers in a case-insensitive manner:

  • extract_trace_headers: Creates lowercase map and returns normalized keys
  • contains_trace_headers: Compares against lowercase header keys

Also applies to: 123-125


118-118: False positive from Bandit - this is a metric name, not a password.

The string "gen_ai.latency.time_to_first_token" is a standardized OpenTelemetry span attribute name for GenAI metrics, not a password or secret. This is a known false positive pattern with security scanners when they encounter strings containing "token".


45-131: LGTM! Complete and well-structured OpenTelemetry integration.

The module provides a comprehensive tracing integration with:

  • Flexible span exporter configuration (grpc/http)
  • Global tracer management
  • Trace context propagation from HTTP headers
  • GenAI-specific span attributes following OpenTelemetry semantic conventions
  • One-time warning for disabled tracing using the @run_once decorator
tensorrt_llm/commands/serve.py (3)

89-89: LGTM! OpenTelemetry endpoint parameter properly integrated.

The otlp_traces_endpoint parameter is correctly added with appropriate typing and default value, and properly included in the returned llm_args dictionary.

Also applies to: 134-134


272-275: LGTM! CLI option for OpenTelemetry tracing properly wired.

The --otlp_traces_endpoint CLI option is correctly implemented with:

  • Appropriate type, default value, and help text
  • Proper integration into the serve function signature
  • Correct forwarding to get_llm_args

Also applies to: 287-287, 313-314


554-584: Excellent signal handling implementation with proper cleanup.

The signal handling logic is well-designed with:

  • Proper storage and restoration of original handlers
  • Use of finally block to ensure cleanup
  • Creation of new session for child process isolation
  • Comprehensive error handling

This ensures robust process management in the disaggregated mode.

@zhanghaotong marked this pull request as ready for review on August 11, 2025 09:12
Contributor

coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (3)
tensorrt_llm/llmapi/tracing.py (1)

31-58: Consider using a more robust approach for placeholder types.

The placeholder types defined when OpenTelemetry is not available may cause type checking issues. Consider using typing.TYPE_CHECKING to handle imports more cleanly.

 try:
     from opentelemetry.context.context import Context
     from opentelemetry.sdk.environment_variables import \
         OTEL_EXPORTER_OTLP_TRACES_PROTOCOL
     from opentelemetry.sdk.trace import TracerProvider
     from opentelemetry.sdk.trace.export import BatchSpanProcessor
+    from opentelemetry.trace import SpanKind
     from opentelemetry.trace import Tracer, set_tracer_provider
     from opentelemetry.trace.propagation.tracecontext import \
         TraceContextTextMapPropagator
 
     _is_otel_imported = True
 except ImportError:
     import traceback
 
     otel_import_error_traceback = traceback.format_exc()
 
     class Context:  # type: ignore
         pass
 
-    class BaseSpanAttributes:  # type: ignore
-        pass
-
     class SpanKind:  # type: ignore
-        pass
+        SERVER = None
 
     class Tracer:  # type: ignore
         pass
tensorrt_llm/executor/result.py (2)

414-419: Consider using a more accurate arrival time calculation.

The arrival time calculation assumes that the request arrived E2E seconds ago, which might not be accurate if there's processing delay. Consider using the actual arrival time from metrics if available.

-    # TODO: Add request arrival time
-    arrival_time = time.time() - metrics_dict.get(MetricNames.E2E, -1)
+    # Use actual arrival time from metrics if available, otherwise estimate
+    arrival_time_ns = req_perf_metrics_dict.get(RequestEventTiming.ARRIVAL_TIME, 0)
+    if arrival_time_ns > 0:
+        start_time = int(arrival_time_ns)
+    else:
+        # Fallback: estimate arrival time from E2E latency
+        e2e_seconds = metrics_dict.get(MetricNames.E2E, 0)
+        if e2e_seconds > 0:
+            start_time = int((time.time() - e2e_seconds) * 1e9)
+        else:
+            start_time = None
+
     with tracing.global_otlp_tracer().start_as_current_span(
             "llm_request",
             kind=tracing.SpanKind.SERVER,
             context=trace_context,
-            start_time=int(arrival_time * 1e9),
+            start_time=start_time,
     ) as span:

442-448: Add check for empty postproc_params before accessing nested attributes.

When accessing postproc_params.postproc_args.num_prompt_tokens, the nested attribute access could fail if postproc_args is None.

             safe_set_attr(
                 span,
                 tracing.SpanAttributes.GEN_AI_USAGE_PROMPT_TOKENS,
-                getattr(self.postproc_params.postproc_args, "num_prompt_tokens",
-                        None) if self.postproc_params
-                and self.postproc_params.postproc_args else None,
+                getattr(getattr(self.postproc_params, "postproc_args", None),
+                        "num_prompt_tokens", None) if self.postproc_params else None,
             )
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 3eccbe7 and 9f51b6d.

📒 Files selected for processing (4)
  • tensorrt_llm/executor/result.py (6 hunks)
  • tensorrt_llm/llmapi/llm.py (5 hunks)
  • tensorrt_llm/llmapi/tracing.py (1 hunks)
  • tensorrt_llm/serve/openai_server.py (6 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tensorrt_llm/serve/openai_server.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/llmapi/tracing.py
  • tensorrt_llm/llmapi/llm.py
  • tensorrt_llm/executor/result.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/llmapi/tracing.py
  • tensorrt_llm/llmapi/llm.py
  • tensorrt_llm/executor/result.py
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tensorrt_llm/executor/result.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (10)
tensorrt_llm/llmapi/tracing.py (3)

1-2: LGTM!

Copyright header correctly follows the NVIDIA copyright format for 2025.


125-135: LGTM! Well-structured span attributes.

The SpanAttributes class follows OpenTelemetry semantic conventions for generative AI observability with clear, descriptive attribute names.


80-91: Improve error handling and environment variable access.

The function accesses the environment variable OTEL_EXPORTER_OTLP_TRACES_PROTOCOL without a guard. This could fail when OpenTelemetry is not imported.

Apply this diff to add proper guards:

 def get_span_exporter(endpoint):
+    if not is_otel_available():
+        raise ValueError(
+            "OpenTelemetry is not available. Unable to get span exporter.")
     protocol = os.environ.get(OTEL_EXPORTER_OTLP_TRACES_PROTOCOL, "grpc")
     if protocol == "grpc":
         from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import \
             OTLPSpanExporter
     elif protocol == "http/protobuf":
         from opentelemetry.exporter.otlp.proto.http.trace_exporter import \
             OTLPSpanExporter  # type: ignore
     else:
         raise ValueError(
             f"Unsupported OTLP protocol '{protocol}' is configured")
     return OTLPSpanExporter(endpoint=endpoint)

Likely an incorrect or invalid review comment.

tensorrt_llm/llmapi/llm.py (4)

9-9: LGTM! Proper import of Mapping type.

The import of Mapping from collections.abc is correct and follows best practices for type hints.


214-222: LGTM! Proper error handling for tracer initialization.

The tracer initialization is correctly placed after model building and includes appropriate error handling without breaking the startup flow.


327-327: LGTM! Well-typed trace headers parameter.

The trace_headers parameter is properly typed as Optional[Mapping[str, str]] following Python typing best practices.


446-446: LGTM! Correct propagation of trace headers.

The trace headers are properly forwarded to the executor's generate_async method.

tensorrt_llm/executor/result.py (3)

3-3: LGTM! Proper imports for tracing functionality.

The imports for time and tracing are correctly added to support the new tracing capabilities.

Also applies to: 14-14


166-166: LGTM! Proper initialization of trace headers.

The trace_headers attribute is correctly initialized in both base classes to support trace context propagation.

Also applies to: 577-577


295-295: LGTM! Correct placement of tracing call.

The do_tracing method is properly called after record_stats to ensure metrics are available for the trace.

Contributor

coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (2)
tensorrt_llm/executor/result.py (2)

417-417: Confirm SpanKind is exported by tracing module.

The code uses tracing.SpanKind.SERVER. Ensure SpanKind is re-exported from tensorrt_llm.llmapi.tracing; otherwise this will fail at runtime.

Run this to verify exports:

#!/bin/bash
# Verify SpanKind, SpanAttributes, helpers, and global_otlp_tracer exist
rg -n -C2 -P 'class\s+SpanAttributes\b|def\s+global_otlp_tracer\b|def\s+extract_trace_context\b|def\s+insufficient_request_metrics_warning\b' tensorrt_llm/llmapi
rg -n -C2 -P '\bSpanKind\b' tensorrt_llm/llmapi

404-409: Emit a single request-level span (dedupe across streaming/beams).

do_tracing() is called per-sequence and possibly multiple times during streaming. Without a guard, you’ll emit duplicate request spans. Add a one-shot flag.

Apply this diff:

         metrics_dict = self.metrics_dict
         if not metrics_dict or not req_perf_metrics_dict:
             # Insufficient request metrics available; trace generation aborted.
             tracing.insufficient_request_metrics_warning()
             return

+        # Avoid emitting duplicate request-level spans
+        if getattr(self, "_trace_emitted", False):
+            return

And set the flag after span emission:

@@
                 json.dumps([output.finish_reason])
                 if output.finish_reason else None)
 
+        self._trace_emitted = True
🧹 Nitpick comments (6)
tensorrt_llm/executor/result.py (6)

432-434: Guard optional GEN_AI_REQUEST_TOP_K attribute (may not exist).

SpanAttributes may not define GEN_AI_REQUEST_TOP_K (it’s not listed in the PR objectives). Unconditional access risks AttributeError at runtime.

Apply this diff to set it only if available:

-            safe_set_attr(span, tracing.SpanAttributes.GEN_AI_REQUEST_TOP_K,
-                          sampling_params.top_k)
+            if hasattr(tracing.SpanAttributes, "GEN_AI_REQUEST_TOP_K"):
+                safe_set_attr(
+                    span,
+                    tracing.SpanAttributes.GEN_AI_REQUEST_TOP_K,
+                    getattr(sampling_params, "top_k", None),
+                )

396-401: Add a concise docstring to do_tracing for maintainability.

A short docstring clarifies preconditions (tracer present, metrics required) and what’s emitted.

Apply this diff:

 def do_tracing(
     self,
     output: CompletionOutput,
     req_perf_metrics_dict: Optional[dict[str, float]] = None,
 ):
+    """
+    Emit a single request-level OpenTelemetry span for this generation request.
+    Preconditions:
+      - tracing.global_otlp_tracer() is configured,
+      - req_perf_metrics_dict and self.metrics_dict contain timing metrics.
+    The span is linked to upstream context via self.trace_headers and is guarded
+    to avoid duplicate emissions across streaming/beam updates.
+    """

401-403: Minor: combine early-return condition to reduce overhead and log spam.

You can short-circuit both the tracer check and missing metrics in one condition.

Apply this diff:

-        if not tracing.global_otlp_tracer():
-            return
-
-        metrics_dict = self.metrics_dict
-        if not metrics_dict or not req_perf_metrics_dict:
-            # Insufficient request metrics available; trace generation aborted.
-            tracing.insufficient_request_metrics_warning()
-            return
+        metrics_dict = self.metrics_dict
+        if (not tracing.global_otlp_tracer() or not metrics_dict
+                or not req_perf_metrics_dict):
+            if tracing.global_otlp_tracer():
+                # Tracer is configured but request metrics are missing.
+                tracing.insufficient_request_metrics_warning()
+            return

Optionally, consider rate-limiting insufficient_request_metrics_warning() to avoid noisy logs during streaming.


295-295: Call site is fine; rely on one-shot guard inside do_tracing.

Given the dedupe guard in do_tracing(), keeping this unconditional call here is acceptable. Optionally, you could gate on self._done to skip earlier attempts during streaming.

Example:

-            self.do_tracing(output, req_perf_metrics_dict)
+            if self._done:
+                self.do_tracing(output, req_perf_metrics_dict)

417-419: Consider SpanKind.INTERNAL for in-process request-level span.

If an upstream HTTP layer already emits a SERVER span, this nested span is better modeled as INTERNAL to avoid misclassification.

Apply this diff:

-                kind=tracing.SpanKind.SERVER,
+                kind=tracing.SpanKind.INTERNAL,

3-3: Remove unused time import after start_time fix.

Once start_time uses ARRIVAL_TIME (epoch ns), time is no longer used.

Apply this diff:

-import time
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 9f51b6d and f185383.

📒 Files selected for processing (2)
  • tensorrt_llm/executor/result.py (6 hunks)
  • tensorrt_llm/llmapi/tracing.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tensorrt_llm/llmapi/tracing.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in __init__
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/executor/result.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/executor/result.py
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tensorrt_llm/executor/result.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (6)
tensorrt_llm/executor/result.py (6)

441-447: LGTM: setting request id and prompt/usage tokens.

Using self.id and prompt_token_ids length for span attributes is consistent and low-risk.


456-458: LGTM: queue time attribute wired from processed metrics.

This aligns with RequestEventTiming → MetricNames mapping.


459-461: LGTM: finish reasons captured as a JSON array.

Compact and interoperable with downstream consumers.


166-167: LGTM: initialize trace_headers on base class.

Prevents attribute errors when tracing context is absent.


573-574: LGTM: propagate trace headers from request to result.

Ensures context extraction works for downstream spans.


14-15: All required tracing exports verified
The following symbols are already listed in __all__ in tensorrt_llm/llmapi/tracing.py:

  • SpanAttributes
  • SpanKind
  • extract_trace_context
  • global_otlp_tracer
  • insufficient_request_metrics_warning

No changes needed.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
tensorrt_llm/llmapi/disagg_utils.py (1)

1-1: Add NVIDIA copyright header.

Per the coding guidelines, prepend the NVIDIA copyright header to all source files.

Apply this patch at the top of the file:

+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
+
 import logging
tensorrt_llm/commands/serve.py (1)

1-1: Add NVIDIA copyright header.

Please prepend the standard NVIDIA header.

+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
+
 import asyncio
tensorrt_llm/serve/openai_disagg_server.py (1)

1-1: Insert NVIDIA copyright header (keep shebang on line 1).

Add the header immediately after the shebang to comply with guidelines.

 #!/usr/bin/env python
+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
♻️ Duplicate comments (1)
tensorrt_llm/llmapi/tracing.py (1)

118-123: Use explicit runtime check instead of assert in set_global_otlp_tracer.

Asserts can be disabled with -O; keep this safety check active in production.

 def set_global_otlp_tracer(tracer: Tracer):
     """Set the global OTLP Tracer instance in the current process."""
     global _global_tracer_
-    assert _global_tracer_ is None
+    if _global_tracer_ is not None:
+        raise RuntimeError("Global OTLP tracer has already been set")
     _global_tracer_ = tracer
🧹 Nitpick comments (7)
tensorrt_llm/llmapi/disagg_utils.py (2)

45-49: Move field docstring to a proper class docstring (current string literal won’t be recognized).

The string literal after the field assignment is not a valid class docstring (must be the first statement). Move it to the class docstring so tooling can pick it up.

-@dataclass
-class ObservabilityConfig():
-    otlp_traces_endpoint: Optional[str] = None
-    """Target URL to which OpenTelemetry traces will be sent."""
+@dataclass
+class ObservabilityConfig:
+    """Observability-related options.
+
+    Attributes:
+        otlp_traces_endpoint: Target URL to which OpenTelemetry traces will be sent.
+    """
+    otlp_traces_endpoint: Optional[str] = None

33-38: Type annotation for RouterConfig.server_role should be Optional.

Default is None but annotation is ServerRole. Make it Optional[ServerRole] to avoid type checker warnings.

 @dataclass
 class RouterConfig():
     type: str = "round_robin"
     args: dict = field(default_factory=dict)
-    server_role: ServerRole = None
+    server_role: Optional[ServerRole] = None
tensorrt_llm/commands/serve.py (1)

35-70: Minimize work inside signal handlers; avoid logger usage there.

Signal handlers should do the least possible work; logger may not be async-signal-safe. Consider setting a flag or writing directly to stderr with os.write, then performing cleanup in normal control flow.

Example minimal adjustment:

 def _signal_handler_cleanup_child(signum, frame):
     """Signal handler to clean up the child process."""
     global _child_p_global
-    if _child_p_global and _child_p_global.poll() is None:
-        # Using print for safety in signal handlers
-        logger.info(
-            f"Parent process (PID {os.getpid()}) received signal {signal.Signals(signum).name}. Terminating child process (PID {_child_p_global.pid})."
-        )
+    if _child_p_global and _child_p_global.poll() is None:
+        try:
+            os.write(2, f"Signal {signal.Signals(signum).name}: terminating child PID {_child_p_global.pid}\n".encode())
+        except Exception:
+            pass
         _child_p_global.terminate()
         try:
             _child_p_global.wait(
                 timeout=10)  # Allow 10 seconds for graceful termination
         except subprocess.TimeoutExpired:
-            logger.info(
-                f"Child process (PID {_child_p_global.pid}) did not terminate gracefully after signal. Killing."
-            )
+            try:
+                os.write(2, f"Child {_child_p_global.pid} not terminated; killing.\n".encode())
+            except Exception:
+                pass
             _child_p_global.kill()
             try:
                 _child_p_global.wait(timeout=10)  # Allow 10 seconds for kill
             except subprocess.TimeoutExpired:
-                logger.info(
-                    f"Child process (PID {_child_p_global.pid}) failed to die even after kill command from signal handler."
-                )
+                try:
+                    os.write(2, f"Child {_child_p_global.pid} failed to die after kill.\n".encode())
+                except Exception:
+                    pass
tensorrt_llm/serve/openai_disagg_server.py (4)

348-356: Align return type hints with possible StreamingResponse; wrap overly long lines.

  • send_chat_request can return a StreamingResponse when request.stream is True; adjust annotation accordingly.
  • A few lines exceed 120 chars; wrap to satisfy Ruff E501.
-async def create_chat_generator(self, url: str, request: ChatCompletionRequest,
-                                trace_headers: Optional[Mapping[str, str]] = None):
-    async for chunk in self.create_generator(url, request, "/v1/chat/completions", trace_headers):
+async def create_chat_generator(
+    self,
+    url: str,
+    request: ChatCompletionRequest,
+    trace_headers: Optional[Mapping[str, str]] = None,
+):
+    async for chunk in self.create_generator(
+        url, request, "/v1/chat/completions", trace_headers
+    ):
         yield chunk
-async def send_chat_request(self, url: str, request: ChatCompletionRequest,
-                            trace_headers: Optional[Mapping[str, str]] = None) -> ChatCompletionResponse:
-    return await self.send_request(url, request, "/v1/chat/completions", ChatCompletionResponse, self.create_chat_generator, trace_headers)
+async def send_chat_request(
+    self,
+    url: str,
+    request: ChatCompletionRequest,
+    trace_headers: Optional[Mapping[str, str]] = None,
+) -> Union[ChatCompletionResponse, StreamingResponse]:
+    return await self.send_request(
+        url,
+        request,
+        "/v1/chat/completions",
+        ChatCompletionResponse,
+        self.create_chat_generator,
+        trace_headers,
+    )

Also applies to: 394-401


36-36: Typo in yapf directive.

Comment says “yapf: enale”. Should be “enable” or remove the directive.

-# yapf: enale
+# yapf: enable

7-7: Avoid confusion with project’s Mapping class by aliasing collections.abc.Mapping.

The repo defines tensorrt_llm.mapping.Mapping; aliasing here reduces ambiguity when reading code.

-from collections.abc import Mapping
+from collections.abc import Mapping as HeadersMapping
...
-                                        trace_headers: Optional[Mapping[str, str]] = None):
+                                        trace_headers: Optional[HeadersMapping[str, str]] = None):

280-282: Optional: enrich event attributes for better observability.

Attach key attributes (e.g., server chosen, stream flag, router) to the event for richer traces.

-            tracing.add_event('picking generation server')
+            tracing.add_event(
+                "disagg.pick_gen_server",
+                {
+                    "request.stream": bool(req.stream),
+                    "ctx_present": bool(ctx_response is not None),
+                },
+            )
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between f185383 and 6e9fe8b.

📒 Files selected for processing (4)
  • tensorrt_llm/commands/serve.py (6 hunks)
  • tensorrt_llm/llmapi/disagg_utils.py (4 hunks)
  • tensorrt_llm/llmapi/tracing.py (1 hunks)
  • tensorrt_llm/serve/openai_disagg_server.py (16 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in __init__
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/llmapi/disagg_utils.py
  • tensorrt_llm/commands/serve.py
  • tensorrt_llm/llmapi/tracing.py
  • tensorrt_llm/serve/openai_disagg_server.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/llmapi/disagg_utils.py
  • tensorrt_llm/commands/serve.py
  • tensorrt_llm/llmapi/tracing.py
  • tensorrt_llm/serve/openai_disagg_server.py
🧬 Code graph analysis (2)
tensorrt_llm/llmapi/tracing.py (1)
tensorrt_llm/logger.py (1)
  • warning (131-132)
tensorrt_llm/serve/openai_disagg_server.py (3)
tensorrt_llm/mapping.py (1)
  • Mapping (20-453)
tensorrt_llm/llmapi/disagg_utils.py (4)
  • ConditionalDisaggConfig (41-42)
  • MetadataServerConfig (64-69)
  • ObservabilityConfig (46-48)
  • RouterConfig (34-37)
tensorrt_llm/llmapi/tracing.py (4)
  • init_tracer (62-75)
  • trace_span (167-198)
  • inject_trace_headers (107-110)
  • add_event (149-152)
🪛 Ruff (0.12.2)
tensorrt_llm/serve/openai_disagg_server.py

363-363: Line too long (150 > 120)

(E501)


395-395: Line too long (137 > 120)

(E501)


396-396: Line too long (140 > 120)

(E501)


400-400: Line too long (143 > 120)

(E501)

🔇 Additional comments (3)
tensorrt_llm/llmapi/disagg_utils.py (1)

59-60: Observability config plumbed correctly.

  • New Optional[ObservabilityConfig] field on DisaggServerConfig looks good.
  • extract_disagg_cfg accepts a dict and instantiates ObservabilityConfig safely.
  • Properly passed into DisaggServerConfig constructor.

No functional concerns here.

Also applies to: 89-90, 123-125, 126-130
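
As an illustration of how this block is consumed, a minimal sketch follows; the class and field names come from the diffs in this review, while the endpoint URL and the way the dataclass is handed to the server are assumptions.

from tensorrt_llm.llmapi.disagg_utils import ObservabilityConfig

# Hedged sketch: a disaggregated config dict parsed by extract_disagg_cfg()
# would carry an observability block; here the dataclass is built directly.
observability_cfg = ObservabilityConfig(
    otlp_traces_endpoint="http://otel-collector:4317")  # assumed endpoint URL

# Per the review above, the disagg server then receives it via its constructor:
# OpenAIDisaggServer(..., observability_cfg=observability_cfg)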

tensorrt_llm/llmapi/tracing.py (1)

3-9: Export surface looks complete (including set_global_otlp_tracer).

The __all__ list now includes set_global_otlp_tracer as expected. Good catch and fix.

tensorrt_llm/commands/serve.py (1)

89-90: OTLP endpoint wiring into CLI and LLM args looks correct.

The CLI option, serve() signature, get_llm_args() parameter, and propagation into llm_args are consistent. No issues spotted.

Also applies to: 113-135, 272-276, 286-288, 312-315

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (4)
tensorrt_llm/llmapi/tracing.py (3)

125-130: Consider using a more robust check instead of assert.

Using assert for runtime state validation can be disabled with Python's -O flag. Consider using an explicit check with an exception for production safety.

 def set_global_otlp_tracer(tracer: Tracer):
     """Set the global OTLP Tracer instance in the current process."""
     global _global_tracer_
-    assert _global_tracer_ is None
+    if _global_tracer_ is not None:
+        raise RuntimeError("Global OTLP tracer has already been set")
     _global_tracer_ = tracer

164-168: Guard against missing OpenTelemetry and fix type annotation.

The function will fail when OpenTelemetry is not installed because get_current_span is undefined. Additionally, the types.Attributes type annotation will cause import errors.

 def add_event(name: str,
-              attributes: types.Attributes = None,
+              attributes: typing.Optional[dict] = None,
               timestamp: typing.Optional[int] = None) -> None:
+    """Add an event to the current span if tracing is available."""
+    if not is_otel_available():
+        return
     get_current_span().add_event(name, attributes, timestamp)

112-118: Fix inverted condition in inject_trace_headers.

The ternary condition is inverted - it creates an empty dict when headers are provided and tries to extract from headers when they're None. Additionally, the function should guard against missing OpenTelemetry.

 def inject_trace_headers(headers: Mapping[str, str]) -> Mapping[str, str]:
+    """Return a dict containing normalized trace headers with current context injected.
+    Returns empty dict if OpenTelemetry is unavailable or tracing is disabled.
+    """
+    if not is_otel_available() or not is_tracing_enabled():
+        return {}
-    if is_tracing_enabled():
-        trace_headers = extract_trace_headers(headers) if not headers else {}
-        TraceContextTextMapPropagator().inject(trace_headers)
-        return trace_headers
-    return None
+    trace_headers = extract_trace_headers(headers) if headers else {}
+    TraceContextTextMapPropagator().inject(trace_headers)
+    return trace_headers
tensorrt_llm/serve/openai_disagg_server.py (1)

363-364: Fix streaming path parameter mismatch.

The streaming branch incorrectly passes headers as the third argument to create_generator, but the method signature expects end_point as the third parameter and trace_headers as the fourth.

🧹 Nitpick comments (2)
tensorrt_llm/metrics/enums.py (1)

9-9: Consider using consistent quoting style for enum values.

The new ARRIVAL_TIMESTAMP value uses single quotes while other values in the enum use double quotes. Consider maintaining consistency.

-    ARRIVAL_TIMESTAMP = 'arrival_timestamp'
+    ARRIVAL_TIMESTAMP = "arrival_timestamp"
tensorrt_llm/serve/openai_disagg_server.py (1)

364-364: Fix line length violations.

Several lines exceed the 120-character limit as flagged by the linter.

Apply these formatting fixes:

-                           trace_headers: Optional[Mapping[str, str]] = None) -> Union[CompletionResponse, ChatCompletionResponse, StreamingResponse]:
+                           trace_headers: Optional[Mapping[str, str]] = None) -> Union[
+                               CompletionResponse, ChatCompletionResponse, StreamingResponse]:

-    async def send_completion_request(self, url: str, request: CompletionRequest,
-                                      trace_headers: Optional[Mapping[str, str]] = None) -> Union[CompletionResponse, StreamingResponse]:
-        return await self.send_request(url, request, "/v1/completions", CompletionResponse, self.create_completion_generator, trace_headers)
+    async def send_completion_request(self, url: str, request: CompletionRequest,
+                                      trace_headers: Optional[Mapping[str, str]] = None) -> Union[
+                                          CompletionResponse, StreamingResponse]:
+        return await self.send_request(url, request, "/v1/completions", CompletionResponse,
+                                       self.create_completion_generator, trace_headers)

-    async def send_chat_request(self, url: str, request: ChatCompletionRequest,
-                                trace_headers: Optional[Mapping[str, str]] = None) -> ChatCompletionResponse:
-        return await self.send_request(url, request, "/v1/chat/completions", ChatCompletionResponse, self.create_chat_generator, trace_headers)
+    async def send_chat_request(self, url: str, request: ChatCompletionRequest,
+                                trace_headers: Optional[Mapping[str, str]] = None) -> ChatCompletionResponse:
+        return await self.send_request(url, request, "/v1/chat/completions", ChatCompletionResponse,
+                                       self.create_chat_generator, trace_headers)

Also applies to: 396-396, 397-397, 401-401

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 6e9fe8b and b459a6e.

📒 Files selected for processing (7)
  • tensorrt_llm/executor/result.py (7 hunks)
  • tensorrt_llm/executor/worker.py (1 hunks)
  • tensorrt_llm/llmapi/llm.py (5 hunks)
  • tensorrt_llm/llmapi/tracing.py (1 hunks)
  • tensorrt_llm/metrics/enums.py (1 hunks)
  • tensorrt_llm/serve/openai_disagg_server.py (16 hunks)
  • tensorrt_llm/serve/openai_server.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tensorrt_llm/serve/openai_server.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in __init__
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/executor/worker.py
  • tensorrt_llm/metrics/enums.py
  • tensorrt_llm/serve/openai_disagg_server.py
  • tensorrt_llm/llmapi/llm.py
  • tensorrt_llm/executor/result.py
  • tensorrt_llm/llmapi/tracing.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/executor/worker.py
  • tensorrt_llm/metrics/enums.py
  • tensorrt_llm/serve/openai_disagg_server.py
  • tensorrt_llm/llmapi/llm.py
  • tensorrt_llm/executor/result.py
  • tensorrt_llm/llmapi/tracing.py
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tensorrt_llm/executor/result.py
🧬 Code graph analysis (5)
tensorrt_llm/executor/worker.py (1)
tensorrt_llm/metrics/enums.py (1)
  • RequestEventTiming (12-19)
tensorrt_llm/serve/openai_disagg_server.py (4)
tensorrt_llm/llmapi/disagg_utils.py (4)
  • ConditionalDisaggConfig (41-42)
  • MetadataServerConfig (64-69)
  • ObservabilityConfig (46-48)
  • RouterConfig (34-37)
tensorrt_llm/serve/router.py (1)
  • create_router (615-653)
tensorrt_llm/llmapi/tracing.py (5)
  • init_tracer (62-75)
  • trace_span (182-213)
  • inject_trace_headers (112-117)
  • add_event (164-167)
  • SpanEvents (152-156)
tensorrt_llm/serve/openai_server.py (1)
  • openai_completion (371-511)
tensorrt_llm/llmapi/llm.py (3)
cpp/include/tensorrt_llm/batch_manager/llmRequest.h (1)
  • tensorrt_llm (37-265)
tensorrt_llm/metrics/enums.py (1)
  • MetricNames (4-9)
tensorrt_llm/llmapi/tracing.py (1)
  • init_tracer (62-75)
tensorrt_llm/executor/result.py (2)
tensorrt_llm/llmapi/tracing.py (7)
  • global_otlp_tracer (120-122)
  • insufficient_request_metrics_warning (177-179)
  • extract_trace_context (92-98)
  • SpanKind (51-52)
  • SpanAttributes (136-149)
  • add_event (164-167)
  • SpanEvents (152-156)
tensorrt_llm/metrics/enums.py (2)
  • MetricNames (4-9)
  • RequestEventTiming (12-19)
tensorrt_llm/llmapi/tracing.py (1)
tensorrt_llm/logger.py (1)
  • warning (131-132)
🪛 Ruff (0.12.2)
tensorrt_llm/serve/openai_disagg_server.py

364-364: Line too long (150 > 120)

(E501)


396-396: Line too long (137 > 120)

(E501)


397-397: Line too long (140 > 120)

(E501)


401-401: Line too long (143 > 120)

(E501)

🔇 Additional comments (31)
tensorrt_llm/llmapi/llm.py (3)

215-223: LGTM! Clean OTLP tracer initialization with proper error handling.

The implementation correctly initializes the OTLP tracer when configured and gracefully handles initialization failures without disrupting the LLM startup.
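
A minimal sketch of that guarded-initialization pattern is shown below; the init_tracer argument list and the logger import path are assumptions, since the exact signatures are not reproduced in this review.

from tensorrt_llm.llmapi import tracing
from tensorrt_llm.logger import logger


def maybe_init_tracer(otlp_traces_endpoint):
    # Tracing is strictly optional: any failure is logged and swallowed so a
    # misconfigured collector can never block LLM startup.
    if not otlp_traces_endpoint:
        return
    try:
        tracing.init_tracer("trt.llm", otlp_traces_endpoint)  # assumed args
    except Exception as e:
        logger.warning(f"Failed to initialize OTLP tracer: {e}")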


328-328: Good addition for trace context propagation.

The trace_headers parameter enables proper distributed tracing by allowing trace context to be propagated through the request pipeline.


453-456: LGTM! Arrival timestamp tracking for accurate latency measurements.

The code correctly captures the arrival timestamp when performance metrics are requested, which is essential for calculating accurate end-to-end latencies in the tracing implementation.

tensorrt_llm/executor/result.py (4)

396-409: Early return when metrics are insufficient.

The implementation correctly guards against missing metrics before attempting to generate traces. The early returns prevent errors when either the global tracer or required metrics are unavailable.


432-478: Well-structured span attributes with proper null safety.

The safe_set_attr helper function elegantly handles optional values, ensuring that only non-null attributes are added to the span. The comprehensive set of attributes aligns well with the PR objectives for OpenTelemetry tracing.
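
For reference, a null-safe setter along these lines might look like the sketch below; the in-tree helper may differ.

def safe_set_attr(span, key, value):
    # Skip attributes without a real value so spans stay compact and the
    # exporter never receives None payloads.
    if value is not None:
        span.set_attribute(key, value)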


479-493: KV cache transfer events properly recorded with timestamps.

The implementation correctly adds KV cache transfer events only when both start and end timestamps are available, preventing partial or invalid event data.
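
Conceptually, the guarded event emission amounts to the sketch below; the timestamps are assumed to be epoch seconds and are converted to the nanoseconds that OpenTelemetry event timestamps expect.

from tensorrt_llm.llmapi import tracing
from tensorrt_llm.metrics.enums import RequestEventTiming


def add_kv_cache_events(req_perf_metrics_dict):
    kv_start = req_perf_metrics_dict.get(
        RequestEventTiming.KV_CACHE_TRANSFER_START, 0.0)
    kv_end = req_perf_metrics_dict.get(
        RequestEventTiming.KV_CACHE_TRANSFER_END, 0.0)
    if kv_start and kv_end:
        # Emit both events or neither, so a trace never shows a dangling transfer.
        tracing.add_event(tracing.SpanEvents.KV_CACHE_TRANSFER_START,
                          timestamp=int(kv_start * 1e9))
        tracing.add_event(tracing.SpanEvents.KV_CACHE_TRANSFER_END,
                          timestamp=int(kv_end * 1e9))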


413-424: Verify E2E metric units and default in do_tracing time correction

Before merging, please confirm the following in tensorrt_llm/executor/result.py (around the do_tracing method):

  • Ensure that both req_perf_metrics_dict[RequestEventTiming.LAST_TOKEN_TIME] and req_perf_metrics_dict[RequestEventTiming.ARRIVAL_TIME] originate from the same clock source in the C++ runtime and are reported in seconds.
  • Check whether the default -1 for metrics_dict.get(MetricNames.E2E, -1) could skew the fallback branch (time.time() - e2e - arrival_time), and consider using 0 or guarding negative values if the metric is missing.
  • Validate that mixing Python’s time.time() (system epoch) with the C++ timestamps does not introduce a constant offset; if the C++ runtime uses a steady clock, you may need to convert or switch to time.monotonic() to align origins.
tensorrt_llm/llmapi/tracing.py (2)

1-2: LGTM! Proper copyright header.


58-76: Well-designed tracer initialization with proper error handling.

The init_tracer function properly checks for OpenTelemetry availability and provides clear error messages with the original traceback when dependencies are missing. The implementation follows best practices for setting up the TracerProvider with a BatchSpanProcessor.
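
In generic OpenTelemetry SDK terms, the setup described above corresponds roughly to the sketch below; it is not the exact in-tree code, and get_span_exporter refers to the helper reviewed earlier in this thread.

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

from tensorrt_llm.llmapi.tracing import get_span_exporter  # helper shown above


def init_tracer_sketch(module_name: str, otlp_traces_endpoint: str):
    provider = TracerProvider(resource=Resource.create())
    # BatchSpanProcessor buffers spans and exports them asynchronously,
    # keeping span export off the request hot path.
    provider.add_span_processor(
        BatchSpanProcessor(get_span_exporter(otlp_traces_endpoint)))
    return trace.get_tracer(module_name, tracer_provider=provider)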

tensorrt_llm/serve/openai_disagg_server.py (22)

7-7: Import looks good.

The Mapping import from collections.abc is appropriately added to support the new trace headers functionality.


14-14: Essential imports added for observability.

The imports for Request, tracing module, and ObservabilityConfig are correctly added to support the new OpenTelemetry tracing functionality.

Also applies to: 21-21, 24-24


50-51: Constructor signature updated correctly.

The addition of observability_cfg parameter with proper Optional typing aligns with the observability integration requirements.


59-68: OTLP tracer initialization with proper error handling.

The tracer initialization logic correctly checks for the OTLP endpoint configuration and handles exceptions gracefully. The logging provides good visibility into initialization success/failure.


144-145: Function signature updated for trace propagation.

The merge_streaming_responses method correctly adds the trace_headers parameter to support trace context propagation in streaming scenarios.


157-157: Trace headers properly propagated to downstream calls.

The trace headers are correctly passed to both completion and chat request methods in the streaming path.

Also applies to: 159-159


169-170: Proper tracing decorator and signature update.

The @tracing.trace_span("llm_request") decorator correctly instruments the completion endpoint, and the addition of raw_request: Request parameter enables trace header extraction.
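
A hedged sketch of the decorated endpoint shape is shown below; the route registration, request model, and dispatch helper are simplified placeholders, not the merged code.

from fastapi import Request

from tensorrt_llm.llmapi import tracing


class DisaggServerSketch:

    @tracing.trace_span("llm_request")
    async def openai_completion(self, req, raw_request: Request):
        # The decorator opens a span named "llm_request" around the call;
        # raw_request is forwarded so downstream code can read the inbound
        # traceparent/tracestate headers and propagate them.
        return await self._send_disagg_request(req, raw_request)

    async def _send_disagg_request(self, req, raw_request):
        # Placeholder for the real dispatch logic reviewed above.
        return None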


179-179: Consistent parameter passing for disaggregated requests.

The raw_request parameter is properly passed to _send_disagg_request to maintain trace context.


184-185: Chat completion endpoint properly instrumented.

The chat completion endpoint follows the same pattern as the completion endpoint with proper tracing decoration and parameter updates.

Also applies to: 188-188


202-203: Context request method updated for trace propagation.

The _send_context_request method signature correctly includes the trace_headers parameter for downstream trace propagation.


212-212: Trace headers passed to context server calls.

The trace headers are properly propagated to both chat and completion requests sent to context servers.

Also applies to: 215-215


228-228: Disaggregated request method enhanced for tracing.

The method signature correctly adds raw_request parameter, and trace header injection is properly implemented using the tracing utility.

Also applies to: 231-231


260-262: Server selection events properly instrumented.

The context server selection event is correctly added with appropriate attributes for observability.


264-264: Context request includes trace headers.

The trace headers are properly passed to the context request to maintain trace continuity.


287-287: Generation server selection instrumented.

The generation server selection event is correctly added with server information for observability.


297-297: Trace headers propagated to generation requests.

The trace headers are correctly passed to both completion and chat generation requests.

Also applies to: 300-300


309-309: Streaming response includes trace context.

The trace headers are properly passed to the streaming response merger to maintain trace context throughout the streaming flow.


326-332: Generator method enhanced with trace header support.

The create_generator method correctly adds trace header parameter and merges them with the Content-Type header for downstream requests.
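
In aiohttp terms the header merge is roughly the sketch below; session handling and payload serialization are simplified assumptions.

import aiohttp


async def post_with_trace_headers(session: aiohttp.ClientSession, url: str,
                                  payload: dict, trace_headers=None):
    # Merge W3C trace headers (traceparent/tracestate) with the JSON content
    # type so the downstream worker joins the same trace.
    headers = {"Content-Type": "application/json", **(trace_headers or {})}
    async with session.post(url, json=payload, headers=headers) as response:
        response.raise_for_status()
        return await response.json()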


333-334: HTTP request includes trace headers.

The trace headers are properly included in the HTTP POST request to ensure trace propagation to downstream services.


349-351: Completion and chat generators updated consistently.

Both generator methods are consistently updated to accept and pass trace headers to the base generator method.

Also applies to: 354-356


367-375: Non-streaming path correctly handles trace headers.

The trace headers are properly merged with Content-Type headers and included in the HTTP POST request for non-streaming scenarios.


395-397: Request methods consistently updated with trace headers.

Both send_completion_request and send_chat_request methods are consistently updated to accept and pass trace headers to the base send_request method.

Also applies to: 399-401

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

♻️ Duplicate comments (3)
tensorrt_llm/executor/result.py (3)

3-3: Remove unused time import after fixing start_time computation.

Once do_tracing uses absolute timestamps directly (see below), time becomes unnecessary.

-import time

401-409: Emit the request-level span once and only when inputs are sufficient.

Add a duplicate-emission guard and keep the existing sufficiency checks. Suggested diff:

-        if not tracing.global_otlp_tracer():
+        if not tracing.global_otlp_tracer():
             return

-        metrics_dict = self.metrics_dict
+        # Avoid emitting duplicate request-level spans (e.g., beam/n>1)
+        if getattr(self, "_trace_emitted", False):
+            return
+
+        metrics_dict = self.metrics_dict
         if not metrics_dict or not req_perf_metrics_dict:
             # Insufficient request metrics available; trace generation aborted.
             tracing.insufficient_request_metrics_warning()
             return

413-431: Fix span start_time computation; avoid mixing epochs and durations.

Current logic derives a "time_correction" using time.time() and E2E which can produce invalid start times (misordered spans, drops). Use absolute arrival timestamp when available, otherwise ARRIVAL_TIME, both in epoch seconds, and convert to ns.

Apply this diff:

-        # Since arrival_time and other timing metrics are based on different time origins,
-        # we need to apply corrections to align them with absolute timestamps
-        time_correction = 0
-        arrival_timestamp = metrics_dict.get(MetricNames.ARRIVAL_TIMESTAMP, 0)
-        arrival_time = req_perf_metrics_dict.get(
-            RequestEventTiming.ARRIVAL_TIME, 0)
-        if arrival_timestamp > 0:
-            time_correction = arrival_timestamp - arrival_time
-        else:
-            time_correction = time.time() - metrics_dict.get(
-                MetricNames.E2E, -1) - arrival_time
-
-        with tracing.global_otlp_tracer().start_as_current_span(
+        # Compute absolute span start time (epoch-ns). Prefer ARRIVAL_TIMESTAMP, else ARRIVAL_TIME.
+        arrival_time_s = float(req_perf_metrics_dict.get(RequestEventTiming.ARRIVAL_TIME, 0.0))
+        arrival_timestamp_s = float(metrics_dict.get(MetricNames.ARRIVAL_TIMESTAMP, 0.0))
+        start_time_ns = (
+            int(arrival_timestamp_s * 1e9) if arrival_timestamp_s > 0.0
+            else int(arrival_time_s * 1e9)
+        )
+
+        with tracing.global_otlp_tracer().start_as_current_span(
                 "llm_request",
                 kind=tracing.SpanKind.SERVER,
                 context=trace_context,
-                start_time=int((arrival_time + time_correction) * 1e9),
+                start_time=start_time_ns,
         ) as span:
🧹 Nitpick comments (3)
tensorrt_llm/executor/result.py (3)

468-470: Set finish reasons as a list attribute, not a JSON string.

OTel attributes support sequences; avoid json.dumps to keep type fidelity.

-            safe_set_attr(
-                span, tracing.SpanAttributes.GEN_AI_RESPONSE_FINISH_REASONS,
-                json.dumps([output.finish_reason])
-                if output.finish_reason else None)
+            safe_set_attr(
+                span,
+                tracing.SpanAttributes.GEN_AI_RESPONSE_FINISH_REASONS,
+                [output.finish_reason] if output.finish_reason else None,
+            )

472-477: Guard KV cache transfer time computation.

If either start or end is missing, the subtraction yields misleading negatives or meaningless values. Compute only when both are present.

-            safe_set_attr(
-                span,
-                tracing.SpanAttributes.GEN_AI_LATENCY_KV_CACHE_TRANSFER_TIME,
-                req_perf_metrics_dict.get(
-                    RequestEventTiming.KV_CACHE_TRANSFER_END, 0.0) -
-                req_perf_metrics_dict.get(
-                    RequestEventTiming.KV_CACHE_TRANSFER_START, 0.0))
+            kv_start_s = float(req_perf_metrics_dict.get(RequestEventTiming.KV_CACHE_TRANSFER_START, 0.0))
+            kv_end_s = float(req_perf_metrics_dict.get(RequestEventTiming.KV_CACHE_TRANSFER_END, 0.0))
+            kv_duration_s = kv_end_s - kv_start_s if (kv_start_s and kv_end_s) else None
+            safe_set_attr(
+                span,
+                tracing.SpanAttributes.GEN_AI_LATENCY_KV_CACHE_TRANSFER_TIME,
+                kv_duration_s,
+            )

396-400: Optional: Factor out a tiny helper to set attributes safely.

safe_set_attr appears inline; consider a small local lambda or reuse a shared helper from tracing if available to reduce repetition across the codebase. Not blocking.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between b459a6e and 25442aa.

📒 Files selected for processing (1)
  • tensorrt_llm/executor/result.py (7 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in __init__
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/executor/result.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/executor/result.py
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tensorrt_llm/executor/result.py
🧬 Code graph analysis (1)
tensorrt_llm/executor/result.py (2)
tensorrt_llm/llmapi/tracing.py (7)
  • global_otlp_tracer (120-122)
  • insufficient_request_metrics_warning (177-179)
  • extract_trace_context (92-98)
  • SpanKind (51-52)
  • SpanAttributes (136-149)
  • add_event (164-167)
  • SpanEvents (152-156)
tensorrt_llm/metrics/enums.py (2)
  • MetricNames (4-9)
  • RequestEventTiming (12-19)
🔇 Additional comments (5)
tensorrt_llm/executor/result.py (5)

14-15: Tracing module import is appropriate and consistent with module-namespace guideline.

Using from tensorrt_llm.llmapi import tracing keeps the module namespace and matches our guidelines. No action needed.


319-319: Metrics merge semantics LGTM.

Using self.metrics_dict.update(response.metrics) prevents clobbering previously recorded stats and is the right call.


394-395: Metrics accumulation LGTM.

Switching to update preserves earlier labels (e.g., finish reason) while adding runtime-derived metrics.


609-609: Trace-context propagation LGTM.

Assigning self.trace_headers = generation_request.trace_headers ensures downstream spans join the inbound trace context.
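
For context, extracting an OTel context from those forwarded headers typically looks like the sketch below; the in-tree extract_trace_context may add header normalization on top of this.

from typing import Mapping, Optional

from opentelemetry.context import Context
from opentelemetry.trace.propagation.tracecontext import \
    TraceContextTextMapPropagator


def extract_trace_context_sketch(
        headers: Optional[Mapping[str, str]]) -> Optional[Context]:
    # A valid traceparent header yields a Context carrying the remote span;
    # returning None lets the tracer start a fresh root span instead.
    if not headers:
        return None
    return TraceContextTextMapPropagator().extract(headers)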


848-873: Units sanity confirmed: all latency metrics are handled in seconds

  • The _process_req_perf_metrics function returns TTFT, E2E, REQUEST_QUEUE_TIME, and TPOT as floating-point seconds, derived from time.time() differences.
  • Those values are directly set on OpenTelemetry spans in tensorrt_llm/executor/result.py (via safe_set_attr(...)) and logged as histograms in tensorrt_llm/metrics/collector.py, both expecting seconds.
  • The benchmark script (tensorrt_llm/serve/scripts/benchmark_serving.py) deliberately converts seconds to milliseconds when printing _ms-suffixed metrics for human-readable reports; this is separate and does not affect the OTel data.

No code changes required; dashboards and alerts should assume these OTel spans are in seconds.

@zhanghaotong zhanghaotong force-pushed the otlp-trace branch 2 times, most recently from 67dc753 to 1b42813 August 25, 2025 08:16
@zhanghaotong zhanghaotong requested a review from a team as a code owner August 25, 2025 13:35
@zhanghaotong zhanghaotong force-pushed the otlp-trace branch 3 times, most recently from 63afd96 to 30b1340 August 26, 2025 02:51
@Shunkangz
Copy link
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Collaborator

PR_Github #22041 [ run ] triggered by Bot. Commit: e570f0b

@Shunkangz
Copy link
Collaborator

/bot run --disable-fail-fast

@Shunkangz
Copy link
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Collaborator

PR_Github #22118 [ run ] triggered by Bot. Commit: 0859050

@tensorrt-cicd
Copy link
Collaborator

PR_Github #22041 [ run ] completed with state ABORTED. Commit: e570f0b
LLM/main/L0_MergeRequest_PR #16619 (Blue Ocean) completed with status: ABORTED

@tensorrt-cicd
Copy link
Collaborator

PR_Github #22118 [ run ] completed with state SUCCESS. Commit: 0859050
/LLM/main/L0_MergeRequest_PR pipeline #16679 completed with status: 'FAILURE'

@Shunkangz
Copy link
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Collaborator

PR_Github #22258 [ run ] triggered by Bot. Commit: cfa9293

@tensorrt-cicd
Copy link
Collaborator

PR_Github #22258 [ run ] completed with state SUCCESS. Commit: cfa9293
/LLM/main/L0_MergeRequest_PR pipeline #16781 completed with status: 'FAILURE'

@Shunkangz
Copy link
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Collaborator

PR_Github #22364 [ run ] triggered by Bot. Commit: cfa9293

@tensorrt-cicd
Copy link
Collaborator

PR_Github #22364 [ run ] completed with state SUCCESS. Commit: cfa9293
/LLM/main/L0_MergeRequest_PR pipeline #16859 completed with status: 'FAILURE'

@Shunkangz
Copy link
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Collaborator

PR_Github #22566 [ run ] triggered by Bot. Commit: 7bacc81

@tensorrt-cicd
Copy link
Collaborator

PR_Github #22566 [ run ] completed with state SUCCESS. Commit: 7bacc81
/LLM/main/L0_MergeRequest_PR pipeline #17011 completed with status: 'FAILURE'

@Shixiaowei02
Copy link
Collaborator

/bot skip --comment "Pipelines 17011 and 16859 have passed respectively"

Copy link
Collaborator

@Shixiaowei02 Shixiaowei02 left a comment


Approve the code related to dist-serving.

@tensorrt-cicd
Copy link
Collaborator

PR_Github #22617 [ skip ] triggered by Bot. Commit: 7bacc81

@tensorrt-cicd
Copy link
Collaborator

PR_Github #22617 [ skip ] completed with state SUCCESS. Commit: 7bacc81
Skipping testing for commit 7bacc81

@Shunkangz Shunkangz merged commit 1026069 into NVIDIA:main Oct 27, 2025
5 checks passed
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Nov 1, 2025
Signed-off-by: Zhang Haotong <[email protected]>
Signed-off-by: zhanghaotong <[email protected]>
Signed-off-by: Shunkang <[email protected]>
Co-authored-by: Zhang Haotong <[email protected]>
Co-authored-by: Shunkang <[email protected]>
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Nov 3, 2025
Signed-off-by: Zhang Haotong <[email protected]>
Signed-off-by: zhanghaotong <[email protected]>
Signed-off-by: Shunkang <[email protected]>
Co-authored-by: Zhang Haotong <[email protected]>
Co-authored-by: Shunkang <[email protected]>
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Nov 3, 2025
Signed-off-by: Zhang Haotong <[email protected]>
Signed-off-by: zhanghaotong <[email protected]>
Signed-off-by: Shunkang <[email protected]>
Co-authored-by: Zhang Haotong <[email protected]>
Co-authored-by: Shunkang <[email protected]>
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Nov 3, 2025
Signed-off-by: Zhang Haotong <[email protected]>
Signed-off-by: zhanghaotong <[email protected]>
Signed-off-by: Shunkang <[email protected]>
Co-authored-by: Zhang Haotong <[email protected]>
Co-authored-by: Shunkang <[email protected]>