[None][feat] Add opentelemetry tracing #5897
Conversation
21205b8 to 2326b0b (compare)
Caution: Review failed. Failed to post review comments. (Configuration used: .coderabbit.yaml; review profile: CHILL; plan: Pro.)
📝 Walkthrough
Adds end-to-end OpenTelemetry tracing and trace-header propagation: new tracing facade and utilities, OTLP deps and CLI flag, tracer init on LLM startup, trace headers carried HTTP → LLM → executor → results with span emission and attributes, KV-cache timing metrics, a run-once utility, and signal-based child cleanup for disaggregated serve mode.
Sequence Diagram(s)
sequenceDiagram
participant Client
participant OpenAI_Server as "OpenAI Server"
participant LLM
participant Executor
participant Result
participant OTLP as "OTLP Exporter"
Client->>OpenAI_Server: HTTP request (may include trace headers)
OpenAI_Server->>OpenAI_Server: extract_trace_headers()
OpenAI_Server->>LLM: generate_async(..., trace_headers)
LLM->>Executor: generate_async(..., trace_headers)
Executor->>Result: construct GenerationResult(trace_headers)
Result->>Result: do_tracing(output, metrics) %% # subtle: creates server span
Result->>OTLP: Export span "llm_request" with attributes/events
Result-->>Client: response
sequenceDiagram
participant Leader as "Serve Leader"
participant Child as "Child Process"
participant OS as "OS Signals"
Leader->>Leader: install SIGINT/SIGTERM handlers
Leader->>Child: launch child process
OS-->>Leader: SIGINT/SIGTERM received
Leader->>Child: send terminate (graceful), wait 10s
alt still alive
Leader->>Child: kill
end
Leader->>Leader: restore handlers and exit (128+sig)
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
2326b0b to eada6ed (compare)
Actionable comments posted: 4
🔭 Outside diff range comments (1)
tensorrt_llm/executor/result.py (1)
153-166: Track whether a trace span was emitted to avoid duplicates. In multi-output scenarios (n > 1 or beam search), do_tracing can be called multiple times. Add a guard so that only one request-level span is emitted (an illustrative sketch follows the diff).
Apply this diff:
     self.metrics_dict = {}
-    self.trace_headers = None
+    self.trace_headers = None
+    self._trace_emitted = False  # Prevent duplicate request-level spans
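For illustration only, the corresponding guard inside do_tracing could look like the sketch below; it assumes the _trace_emitted flag from the diff above and elides the actual span construction.

def do_tracing(self, output, req_perf_metrics_dict=None):
    # One-shot guard: only the first completed sequence emits the request-level span
    if getattr(self, "_trace_emitted", False):
        return
    if not global_otlp_tracer() or not req_perf_metrics_dict:
        return
    # ... build and export the "llm_request" span here ...
    self._trace_emitted = True  # later sequences/beams return early above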
🧹 Nitpick comments (5)
requirements-dev.txt (1)
36-39: Consider pinning OpenTelemetry packages for reproducibility. All new OpenTelemetry deps are unpinned. Given the wide API surface and rapid releases, unpinned versions can lead to non-reproducible dev environments or breakages.
- If you have a known-good set (e.g., opentelemetry-api/opentelemetry-sdk/opentelemetry-exporter-otlp at compatible major/minors), please pin them here.
- Otherwise, at least add upper bounds (e.g., <2) to avoid auto-upgrades to breaking majors.
tensorrt_llm/serve/openai_server.py (2)
14-14: Fix the long import line and isort reordering. The import exceeds the line-length limit, and isort flagged import reformatting.
Apply this diff:
-from tensorrt_llm.llmapi.otel_tracing import contains_trace_headers, extract_trace_headers, is_tracing_enabled, log_tracing_disabled_warning
+from tensorrt_llm.llmapi.otel_tracing import (
+    contains_trace_headers,
+    extract_trace_headers,
+    is_tracing_enabled,
+    log_tracing_disabled_warning,
+)
Afterwards, run isort/ruff to normalize the entire file.
525-533: Make the helper synchronous or document it; avoid the unnecessary await. _get_trace_headers is async but does not await anything. Consider making it sync to reduce overhead, or at least document the behavior.
If you prefer to keep it async for future awaits, add a short docstring:
 async def _get_trace_headers(
     self,
     headers: Headers,
 ) -> Optional[Mapping[str, str]]:
+    """
+    Extract W3C trace headers from the incoming request when tracing is enabled.
+    Returns None when tracing is disabled; logs a warning if headers carry trace context.
+    """
tensorrt_llm/llmapi/llm.py (1)
332-345: Document the new trace_headers parameter in the generate_async docstring. The public API now accepts trace_headers; include it in the docstring for clarity and Sphinx docs.
Apply this diff snippet inside the Args section:
     scheduling_params (tensorrt_llm.scheduling_params.SchedulingParams, optional): Scheduling parameters. Defaults to None.
+    trace_headers (Optional[Mapping[str, str]], optional): W3C trace context headers (e.g., traceparent, tracestate, baggage) to propagate request-level tracing. Defaults to None.
tensorrt_llm/commands/serve.py (1)
35-43: Avoid logging in signal handlers; use print or defer. The comment claims "Using print for safety", but the code uses logger.info in a signal handler. Logging is generally not async-signal-safe.
Consider using print(...) within the handler, or set a flag and perform logging outside of the signal handler.
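A minimal sketch of the defer-the-logging variant, assuming a module-level flag; the names below are illustrative and _child_p_global/logger are taken from the surrounding code, not reproduced verbatim.

G_RECEIVED_SIGNUM = None  # set inside the handler, consumed after it returns

def _signal_handler_cleanup_child(signum, frame):
    global G_RECEIVED_SIGNUM
    G_RECEIVED_SIGNUM = signum  # record only; avoid logger calls here
    if _child_p_global and _child_p_global.poll() is None:
        _child_p_global.terminate()

# later, in normal control flow:
if G_RECEIVED_SIGNUM is not None:
    logger.info(f"Received signal {G_RECEIVED_SIGNUM}; child process was asked to terminate.")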
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (12)
requirements-dev.txt (1 hunks)
requirements.txt (1 hunks)
tensorrt_llm/_utils.py (2 hunks)
tensorrt_llm/commands/serve.py (5 hunks)
tensorrt_llm/executor/executor.py (3 hunks)
tensorrt_llm/executor/request.py (3 hunks)
tensorrt_llm/executor/result.py (6 hunks)
tensorrt_llm/llmapi/llm.py (5 hunks)
tensorrt_llm/llmapi/llm_args.py (1 hunks)
tensorrt_llm/llmapi/otel_tracing.py (1 hunks)
tensorrt_llm/serve/openai_server.py (5 hunks)
tensorrt_llm/utils/utils.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (7)
- tensorrt_llm/executor/request.py
- tensorrt_llm/_utils.py
- tensorrt_llm/executor/executor.py
- tensorrt_llm/llmapi/llm_args.py
- requirements.txt
- tensorrt_llm/utils/utils.py
- tensorrt_llm/llmapi/otel_tracing.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the class docstring.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
tensorrt_llm/llmapi/llm.py, tensorrt_llm/executor/result.py, tensorrt_llm/commands/serve.py, tensorrt_llm/serve/openai_server.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
tensorrt_llm/llmapi/llm.py, tensorrt_llm/executor/result.py, tensorrt_llm/commands/serve.py, tensorrt_llm/serve/openai_server.py
🧬 Code Graph Analysis (4)
tensorrt_llm/llmapi/llm.py (3)
  tensorrt_llm/mapping.py (1): Mapping (20-453)
  tensorrt_llm/llmapi/otel_tracing.py (1): init_tracer (48-63)
  tensorrt_llm/logger.py (1): error (125-126)
tensorrt_llm/executor/result.py (2)
  tensorrt_llm/llmapi/otel_tracing.py (4): global_otlp_tracer (94-96), extract_trace_context (81-86), SpanKind (37-38), SpanAttributes (110-120)
  tensorrt_llm/metrics/enums.py (1): RequestEventTiming (11-15)
tensorrt_llm/commands/serve.py (1)
  tensorrt_llm/builder.py (1): default (50-58)
tensorrt_llm/serve/openai_server.py (3)
  tensorrt_llm/llmapi/otel_tracing.py (4): contains_trace_headers (123-124), extract_trace_headers (89-91), is_tracing_enabled (106-107), log_tracing_disabled_warning (128-129)
  tensorrt_llm/executor/executor.py (1): generate_async (113-156)
  tensorrt_llm/llmapi/llm.py (1): generate_async (316-450)
🪛 GitHub Actions: Release Checks
tensorrt_llm/llmapi/llm.py
[error] 218-218: codespell: 'endpont' ==> 'endpoint'.
[error] 1-1: isort: imports reformatted in this file.
[error] 218-218: codespell: 'endpont' ==> 'endpoint'.
tensorrt_llm/executor/result.py
[error] 1-1: isort: imports reformatted in this file.
tensorrt_llm/commands/serve.py
[error] 1-1: isort: imports reformatted in this file.
tensorrt_llm/serve/openai_server.py
[error] 1-1: isort: imports reformatted in this file.
🪛 Ruff (0.12.2)
tensorrt_llm/executor/result.py
420-420: Undefined name SupportedMetricNames
(F821)
446-446: Undefined name SupportedMetricNames
(F821)
453-453: Undefined name SupportedMetricNames
(F821)
tensorrt_llm/serve/openai_server.py
14-14: Line too long (140 > 120)
(E501)
🔇 Additional comments (8)
tensorrt_llm/serve/openai_server.py (2)
342-352: LGTM: Trace headers propagated to the generation path. Headers are extracted once per request and forwarded via trace_headers, aligning with the updated LLM.generate_async.
479-488: LGTM: Trace headers propagation in the completions path. Consistent extraction/forwarding for each prompt; good parity with the chat path.
tensorrt_llm/llmapi/llm.py (1)
1-25: Import order fixed in tensorrt_llm/llmapi/llm.py. All import ordering warnings have been resolved by running ruff --select I … --fix (2 issues fixed, 0 remaining).
tensorrt_llm/executor/result.py (2)
410-418: Verify start_time units for OpenTelemetry spans. OpenTelemetry Python expects start_time in nanoseconds (time_ns) in current SDKs. Here we pass an int from RequestEventTiming; ensure it is in the expected unit.
Please confirm RequestEventTiming.* values are in nanoseconds. If they are in seconds, convert with int(seconds * 1e9).
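For example, if the timing value turns out to be seconds since the epoch, the conversion is a one-liner; the sketch below is illustrative only (the tracer here is assumed to come from global_otlp_tracer()).

arrival_s = req_perf_metrics_dict.get(RequestEventTiming.ARRIVAL_TIME, 0)
start_time_ns = int(arrival_s * 1e9)  # OpenTelemetry expects nanoseconds since the Unix epoch
span = tracer.start_span("llm_request", start_time=start_time_ns)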
1-20: Import ordering normalized
The ruff --fix run addressed the import sorting issue; no remaining import-order violations are detected. Great job!
tensorrt_llm/commands/serve.py (3)
290-296: LGTM: Exposes the --otlp_traces_endpoint CLI option. The option, help text, and default are reasonable.
307-309: LGTM: Plumbs otlp_traces_endpoint through to get_llm_args. End-to-end wiring from the CLI to llm_args is correct.
1-20: Imports have been auto-formatted by Ruff
Ruff fixed the import ordering and grouping issues—no remaining import lint errors.
eada6ed to b395ee5 (compare)
Actionable comments posted: 5
♻️ Duplicate comments (1)
tensorrt_llm/executor/result.py (1)
397-461: Emit a single request-level span; add a duplicate guard. do_tracing runs per sequence; without a guard, you may emit multiple request-level spans (e.g., best_of > 1 or beams). Add a one-shot flag.
def do_tracing( self, output: CompletionOutput, req_perf_metrics_dict: Optional[dict[str, float]] = None, ): - if not global_otlp_tracer() or not req_perf_metrics_dict(): + # Avoid duplicate request-level spans + if getattr(self, "_trace_emitted", False): + return + if not otel_tracing.global_otlp_tracer() or not req_perf_metrics_dict: return @@ - with global_otlp_tracer().start_as_current_span( + with otel_tracing.global_otlp_tracer().start_as_current_span( "llm_request", - kind=SpanKind.SERVER, + kind=otel_tracing.SpanKind.SERVER, context=trace_context, start_time=int( req_perf_metrics_dict.get(RequestEventTiming.ARRIVAL_TIME, 0)), ) as span: @@ - safe_set_attr(span, SpanAttributes.GEN_AI_REQUEST_TEMPERATURE, + safe_set_attr(span, otel_tracing.SpanAttributes.GEN_AI_REQUEST_TEMPERATURE, sampling_params.temperature) - safe_set_attr(span, SpanAttributes.GEN_AI_REQUEST_TOP_P, + safe_set_attr(span, otel_tracing.SpanAttributes.GEN_AI_REQUEST_TOP_P, sampling_params.top_p) - safe_set_attr(span, SpanAttributes.GEN_AI_REQUEST_MAX_TOKENS, + safe_set_attr(span, otel_tracing.SpanAttributes.GEN_AI_REQUEST_MAX_TOKENS, sampling_params.max_tokens) - safe_set_attr(span, SpanAttributes.GEN_AI_REQUEST_N, sampling_params.n) + safe_set_attr(span, otel_tracing.SpanAttributes.GEN_AI_REQUEST_N, sampling_params.n) safe_set_attr( span, - SpanAttributes.GEN_AI_USAGE_PROMPT_TOKENS, + otel_tracing.SpanAttributes.GEN_AI_USAGE_PROMPT_TOKENS, getattr(self.postproc_params.postproc_args, "num_prompt_tokens", None) if self.postproc_params and self.postproc_params.postproc_args else None, ) - safe_set_attr(span, SpanAttributes.GEN_AI_USAGE_COMPLETION_TOKENS, output.length) + safe_set_attr(span, otel_tracing.SpanAttributes.GEN_AI_USAGE_COMPLETION_TOKENS, output.length) safe_set_attr( span, - SpanAttributes.GEN_AI_LATENCY_TIME_TO_FIRST_TOKEN, + otel_tracing.SpanAttributes.GEN_AI_LATENCY_TIME_TO_FIRST_TOKEN, metrics_dict.get(MetricNames.TTFT, -1), ) - safe_set_attr(span, SpanAttributes.GEN_AI_LATENCY_E2E, e2e_time) - safe_set_attr(span, SpanAttributes.GEN_AI_REQUEST_ID, self.id) + safe_set_attr(span, otel_tracing.SpanAttributes.GEN_AI_LATENCY_E2E, e2e_time) + safe_set_attr(span, otel_tracing.SpanAttributes.GEN_AI_REQUEST_ID, self.id) safe_set_attr( span, - SpanAttributes.GEN_AI_LATENCY_TIME_IN_QUEUE, + otel_tracing.SpanAttributes.GEN_AI_LATENCY_TIME_IN_QUEUE, metrics_dict.get(MetricNames.REQUEST_QUEUE_TIME, -1), ) + + self._trace_emitted = True
🧹 Nitpick comments (6)
tensorrt_llm/llmapi/llm.py (3)
19-19: Follow the namespace import guideline for OTEL utilities. Per project guidelines, prefer namespaced imports: import the module and reference its members.
-from tensorrt_llm.llmapi.otel_tracing import init_tracer
+from tensorrt_llm.llmapi import otel_tracing
Also update the call site:
-    init_tracer("trt.llm", self.args.otlp_traces_endpoint)
+    otel_tracing.init_tracer("trt.llm", self.args.otlp_traces_endpoint)
214-222: Tracer initialization nit: include the exception traceback for easier diagnosis. Initialization is correctly guarded; include the traceback in the error log for root-cause analysis.
- except Exception as e:
-     logger.error(f"Failed to initialize OTLP tracer: {e}")
+ except Exception as e:
+     logger.error(f"Failed to initialize OTLP tracer: {e}", e)
+     # If logger supports keyword, prefer: logger.error("...", exc_info=True)
317-347: Document the new trace_headers parameter in the generate_async docstring. Add a brief description for trace_headers to keep the public API docs complete; a usage sketch follows the example below.
Example to add under Args:
- trace_headers (Mapping[str, str], optional): Trace-context headers (e.g., W3C traceparent/tracestate) to propagate for request-level spans.
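As a usage sketch, a caller could forward W3C trace context like this; the header values are the standard W3C examples and purely illustrative, and generate_async's other arguments are elided.

trace_headers = {
    "traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
    "tracestate": "vendor=value",
}
result = llm.generate_async(prompt,
                            sampling_params=sampling_params,
                            trace_headers=trace_headers)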
tensorrt_llm/executor/result.py (1)
13-15: Adopt a namespaced import for OTEL helpers to match the project guideline. Import the module once and reference its members for clarity and consistency.
-from tensorrt_llm.llmapi.otel_tracing import (SpanAttributes, SpanKind,
-                                               extract_trace_context,
-                                               global_otlp_tracer)
+from tensorrt_llm.llmapi import otel_tracing
- if not global_otlp_tracer() or not req_perf_metrics_dict(): + if not otel_tracing.global_otlp_tracer() or not req_perf_metrics_dict: - trace_context = extract_trace_context(self.trace_headers) + trace_context = otel_tracing.extract_trace_context(self.trace_headers) - with global_otlp_tracer().start_as_current_span( + with otel_tracing.global_otlp_tracer().start_as_current_span( "llm_request", - kind=SpanKind.SERVER, + kind=otel_tracing.SpanKind.SERVER, context=trace_context, ... ) as span: - safe_set_attr(span, SpanAttributes.GEN_AI_REQUEST_TEMPERATURE, ...) + safe_set_attr(span, otel_tracing.SpanAttributes.GEN_AI_REQUEST_TEMPERATURE, ...)tensorrt_llm/llmapi/otel_tracing.py (2)
96-101: Make set_global_otlp_tracer idempotent or warn on re-init. A hard assert may fire in multi-init scenarios (e.g., tests, multiple LLM instances). Prefer warn-and-return, or allow replacement.
-def set_global_otlp_tracer(tracer: Tracer):
+def set_global_otlp_tracer(tracer: Tracer):
     """Set the global OTLP Tracer instance in the current process."""
     global _global_tracer_
-    assert _global_tracer_ is None
-    _global_tracer_ = tracer
+    if _global_tracer_ is not None:
+        logger.warning("Global OTLP tracer already set; ignoring duplicate initialization.")
+        return
+    _global_tracer_ = tracer
5-7: Use a namespaced import for the run_once decorator. Prefer namespaced imports per the guidelines; an illustrative run_once sketch follows the diff.
-from tensorrt_llm._utils import run_once +from tensorrt_llm import _utils as tllm_utilsAnd update the decorator:
-@run_once
+@tllm_utils.run_once
 def log_tracing_disabled_warning() -> None:
     logger.warning(
         "Received a request with trace context but tracing is disabled")
Also applies to: 124-127
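For reference, a run_once decorator is typically just a wrapper with a has-run flag; the real implementation lives in tensorrt_llm/_utils.py, so the sketch below is only illustrative.

import functools

def run_once(func):
    """Call the wrapped function at most once; subsequent calls are no-ops."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if not wrapper.has_run:
            wrapper.has_run = True
            return func(*args, **kwargs)
    wrapper.has_run = False
    return wrapper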
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (12)
requirements-dev.txt (1 hunks)
requirements.txt (1 hunks)
tensorrt_llm/_utils.py (2 hunks)
tensorrt_llm/commands/serve.py (5 hunks)
tensorrt_llm/executor/executor.py (3 hunks)
tensorrt_llm/executor/request.py (3 hunks)
tensorrt_llm/executor/result.py (5 hunks)
tensorrt_llm/llmapi/llm.py (5 hunks)
tensorrt_llm/llmapi/llm_args.py (1 hunks)
tensorrt_llm/llmapi/otel_tracing.py (1 hunks)
tensorrt_llm/serve/openai_server.py (6 hunks)
tensorrt_llm/utils/utils.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (7)
- tensorrt_llm/_utils.py
- requirements.txt
- tensorrt_llm/executor/executor.py
- tensorrt_llm/executor/request.py
- tensorrt_llm/commands/serve.py
- tensorrt_llm/serve/openai_server.py
- tensorrt_llm/utils/utils.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the class docstring.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
tensorrt_llm/llmapi/llm_args.py, tensorrt_llm/executor/result.py, tensorrt_llm/llmapi/llm.py, tensorrt_llm/llmapi/otel_tracing.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
tensorrt_llm/llmapi/llm_args.py, tensorrt_llm/executor/result.py, tensorrt_llm/llmapi/llm.py, tensorrt_llm/llmapi/otel_tracing.py
🧬 Code Graph Analysis (4)
tensorrt_llm/llmapi/llm_args.py (1)
  tensorrt_llm/builder.py (1): default (50-58)
tensorrt_llm/executor/result.py (2)
  tensorrt_llm/llmapi/otel_tracing.py (4): SpanAttributes (107-117), SpanKind (36-37), extract_trace_context (77-83), global_otlp_tracer (91-93)
  tensorrt_llm/metrics/enums.py (2): RequestEventTiming (11-15), MetricNames (4-8)
tensorrt_llm/llmapi/llm.py (2)
  tensorrt_llm/llmapi/otel_tracing.py (1): init_tracer (47-60)
  tensorrt_llm/logger.py (1): error (125-126)
tensorrt_llm/llmapi/otel_tracing.py (2)
  tensorrt_llm/_utils.py (1): run_once (1137-1145)
  tensorrt_llm/logger.py (1): warning (131-132)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (2)
tensorrt_llm/llmapi/llm_args.py (1)
1314-1318: New tracing endpoint arg LGTM. Optional field with a clear description and alias; no issues from a typing or defaulting perspective.
tensorrt_llm/executor/result.py (1)
416-419: Confirm start_time units are compatible with OpenTelemetry. OpenTelemetry Python expects start_time in nanoseconds since the Unix epoch. Verify req_perf_metrics_dict timestamps and convert as needed (e.g., seconds->ns or ms->ns) before passing.
If timestamps are ms, convert via int(ms_value * 1e6); if seconds, int(sec_value * 1e9). Please confirm the units of RequestEventTiming in the producing component and adjust here.
290cd96 to e87b643 (compare)
Actionable comments posted: 0
🔭 Outside diff range comments (2)
tensorrt_llm/commands/serve.py (2)
35-70: Consider the signal-safety of logging operations. While the signal handler implementation is well-structured with proper graceful termination and fallback to kill, using logger.info within a signal handler may not be signal-safe. In Python, most operations, including logging, are not guaranteed to be async-signal-safe, which could lead to deadlocks or undefined behavior if the signal interrupts certain operations.
Consider using simpler, signal-safe alternatives like os.write to stderr for critical messages, or defer the logging until after the signal handler returns.
Example alternative for critical messages:
import os
import sys

def _signal_handler_cleanup_child(signum, frame):
    """Signal handler to clean up the child process."""
    global _child_p_global
    if _child_p_global and _child_p_global.poll() is None:
        # Use os.write for signal-safe output
        msg = f"Parent process (PID {os.getpid()}) received signal {signum}. Terminating child process (PID {_child_p_global.pid}).\n"
        os.write(sys.stderr.fileno(), msg.encode())
        _child_p_global.terminate()
        # ... rest of the handler
606-607: Consider replacing assertions with proper error handling. Using assert statements for runtime validation can be problematic, as they are removed when Python runs with optimization (-O flag). Consider replacing them with explicit error handling:
-    assert _child_p_global.poll(
-    ) is not None, f"the subprocess should be terminated"
+    if _child_p_global.poll() is None:
+        logger.error("Failed to terminate subprocess despite cleanup attempts")
+        raise RuntimeError("The subprocess should be terminated but is still running")
Similarly for line 613-614:
-    assert final_status is not None, \
-        f"The subprocess (PID {_child_p_global.pid}) should be terminated, but its status is {final_status}"
+    if final_status is None:
+        logger.error(f"The subprocess (PID {_child_p_global.pid}) should be terminated, but its status is {final_status}")
+        raise RuntimeError("Subprocess termination verification failed")
Also applies to: 613-614
🧹 Nitpick comments (1)
tensorrt_llm/commands/serve.py (1)
3-3: Remove the unnecessary comment. The inline comment # Added import is not needed and should be removed for cleaner code.
-import signal  # Added import
+import signal
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
requirements.txt (1 hunks)
tensorrt_llm/_utils.py (2 hunks)
tensorrt_llm/commands/serve.py (5 hunks)
tensorrt_llm/executor/executor.py (3 hunks)
tensorrt_llm/executor/request.py (3 hunks)
tensorrt_llm/executor/result.py (5 hunks)
tensorrt_llm/llmapi/llm.py (5 hunks)
tensorrt_llm/llmapi/llm_args.py (1 hunks)
tensorrt_llm/llmapi/otel_tracing.py (1 hunks)
tensorrt_llm/serve/openai_server.py (6 hunks)
tensorrt_llm/utils/utils.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (6)
- tensorrt_llm/llmapi/llm_args.py
- requirements.txt
- tensorrt_llm/serve/openai_server.py
- tensorrt_llm/executor/executor.py
- tensorrt_llm/_utils.py
- tensorrt_llm/utils/utils.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the class docstring.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
tensorrt_llm/executor/request.py, tensorrt_llm/executor/result.py, tensorrt_llm/llmapi/otel_tracing.py, tensorrt_llm/llmapi/llm.py, tensorrt_llm/commands/serve.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
tensorrt_llm/executor/request.py, tensorrt_llm/executor/result.py, tensorrt_llm/llmapi/otel_tracing.py, tensorrt_llm/llmapi/llm.py, tensorrt_llm/commands/serve.py
🧠 Learnings (2)
📚 Learning: 2025-08-06T21:22:55.018Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-06T21:22:55.018Z
Learning: Applies to **/*.{cpp,h,hpp,cc,cxx,cu,py} : All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Applied to files:
tensorrt_llm/llmapi/otel_tracing.py
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Applied to files:
tensorrt_llm/llmapi/otel_tracing.py
🧬 Code Graph Analysis (4)
tensorrt_llm/executor/request.py (1)
  tensorrt_llm/mapping.py (1): Mapping (20-453)
tensorrt_llm/executor/result.py (2)
  tensorrt_llm/llmapi/otel_tracing.py (4): SpanAttributes (110-120), SpanKind (38-39), extract_trace_context (79-85), global_otlp_tracer (94-96)
  tensorrt_llm/metrics/enums.py (2): RequestEventTiming (11-15), MetricNames (4-8)
tensorrt_llm/llmapi/otel_tracing.py (2)
  tensorrt_llm/_utils.py (1): run_once (1137-1145)
  tensorrt_llm/logger.py (1): warning (131-132)
tensorrt_llm/llmapi/llm.py (2)
  tensorrt_llm/mapping.py (1): Mapping (20-453)
  tensorrt_llm/llmapi/otel_tracing.py (1): init_tracer (49-62)
🪛 GitHub Actions: Release Checks
tensorrt_llm/llmapi/otel_tracing.py
[warning] 118-118: Bandit B105: hardcoded_password_string. Possible hardcoded password: 'gen_ai.latency.time_to_first_token'. Location: tensorrt_llm/llmapi/otel_tracing.py:118:41. Failing command: bandit --configfile scripts/bandit.yaml -r tensorrt_llm.
🔇 Additional comments (15)
tensorrt_llm/executor/request.py (1)
2-2: LGTM! Clean implementation of trace headers propagation. The addition of the trace_headers parameter and its storage follows proper patterns:
- Correct use of the Mapping[str, str] type annotation for immutable header collections
- Backward-compatible placement of the optional parameter
- Simple pass-through storage without unnecessary validation
Also applies to: 98-98, 126-126
tensorrt_llm/llmapi/llm.py (3)
9-9: LGTM! Necessary imports added correctly.The imports are properly placed and follow the existing import organization pattern.
Also applies to: 19-19
214-222: LGTM! Robust tracer initialization with proper error handling.The implementation correctly:
- Guards initialization with a try/except to prevent startup failures
- Only initializes when endpoint is configured
- Provides clear logging for both success and failure cases
327-327: LGTM! Clean trace headers propagation through the generation pipeline. The trace_headers parameter is properly:
- Added with the correct type annotation Optional[Mapping[str, str]]
- Positioned to maintain backward compatibility
- Forwarded correctly to the executor
Also applies to: 446-446
tensorrt_llm/executor/result.py (3)
13-16: LGTM! Correct OpenTelemetry imports.All necessary tracing components are imported properly.
167-167: LGTM! Proper trace headers initialization.Trace headers are correctly:
- Initialized to None in GenerationResultBase
- Copied from generation_request in GenerationResult
Also applies to: 572-572
296-296: LGTM! Well-implemented tracing with proper guards and error handling.The implementation correctly:
- Guards against missing tracer and metrics
- Safely accesses postproc_params with proper None checks
- Uses MetricNames enum correctly (not the undefined SupportedMetricNames)
- Sets span attributes only when values are non-None (see the safe_set_attr sketch after this list)
- Is called exactly once when the request completes
Also applies to: 397-461
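The safe_set_attr helper referenced in the diffs presumably amounts to a None-guarded set_attribute call, roughly like this; this is an assumption for illustration, not the verbatim implementation.

def safe_set_attr(span, attr, value):
    # Attach the attribute only when a real value is available
    if value is not None:
        span.set_attribute(attr, value)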
tensorrt_llm/llmapi/otel_tracing.py (5)
1-2: LGTM! Copyright header present as required.
16-43: LGTM! Excellent fallback handling for optional OpenTelemetry. The implementation gracefully handles missing OpenTelemetry by (a minimal sketch follows this list):
- Capturing the import error for informative messages
- Providing placeholder types to satisfy type checking
- Allowing the codebase to function without OTEL installed
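A minimal sketch of that import-fallback pattern, using names that appear in the module but trimmed to two symbols; the real file imports more and may differ in detail.

try:
    from opentelemetry.trace import SpanKind, Tracer
    _is_otel_imported = True
except ImportError:
    import traceback
    otel_import_error_traceback = traceback.format_exc()  # surfaced later in error messages
    _is_otel_imported = False

    class SpanKind:  # placeholder so attribute access still resolves
        SERVER = None

    class Tracer:  # placeholder type for annotations
        pass

def is_otel_available() -> bool:
    return _is_otel_imported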
88-91: LGTM! Case-insensitive header handling implemented correctly. Both functions now properly handle HTTP headers in a case-insensitive manner (a sketch follows):
- extract_trace_headers: creates a lowercase map and returns normalized keys
- contains_trace_headers: compares against lowercase header keys
Also applies to: 123-125
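A hedged sketch of the normalization; the exact header list and return shape in the module may differ.

TRACE_HEADERS = ["traceparent", "tracestate"]

def extract_trace_headers(headers):
    # Lowercase incoming header names before matching the W3C trace-context keys
    lower = {k.lower(): v for k, v in headers.items()}
    return {h: lower[h] for h in TRACE_HEADERS if h in lower}

def contains_trace_headers(headers) -> bool:
    lower_keys = {k.lower() for k in headers}
    return any(h in lower_keys for h in TRACE_HEADERS)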
118-118: False positive from Bandit - this is a metric name, not a password.The string "gen_ai.latency.time_to_first_token" is a standardized OpenTelemetry span attribute name for GenAI metrics, not a password or secret. This is a known false positive pattern with security scanners when they encounter strings containing "token".
45-131: LGTM! Complete and well-structured OpenTelemetry integration. The module provides a comprehensive tracing integration with (an initialization sketch follows this list):
- Flexible span exporter configuration (grpc/http)
- Global tracer management
- Trace context propagation from HTTP headers
- GenAI-specific span attributes following OpenTelemetry semantic conventions
- One-time warning for disabled tracing using the @run_once decorator
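As a usage sketch based on the call site shown earlier in this review; only init_tracer's name and two-argument signature are taken from the PR, while the endpoint URL and the logger import path are assumptions.

from tensorrt_llm.logger import logger
from tensorrt_llm.llmapi import tracing

try:
    # Only initialize when an OTLP endpoint was configured (e.g. via --otlp_traces_endpoint)
    tracing.init_tracer("trt.llm", "http://localhost:4317")
except Exception as e:
    logger.error(f"Failed to initialize OTLP tracer: {e}")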
tensorrt_llm/commands/serve.py (3)
89-89: LGTM! OpenTelemetry endpoint parameter properly integrated. The otlp_traces_endpoint parameter is correctly added with appropriate typing and default value, and properly included in the returned llm_args dictionary.
Also applies to: 134-134
272-275: LGTM! CLI option for OpenTelemetry tracing properly wired. The --otlp_traces_endpoint CLI option is correctly implemented with:
- Appropriate type, default value, and help text
- Proper integration into the serve function signature
- Correct forwarding to get_llm_args
Also applies to: 287-287, 313-314
554-584: Excellent signal handling implementation with proper cleanup.The signal handling logic is well-designed with:
- Proper storage and restoration of original handlers
- Use of finally block to ensure cleanup
- Creation of new session for child process isolation
- Comprehensive error handling
This ensures robust process management in the disaggregated mode.
Actionable comments posted: 3
🧹 Nitpick comments (3)
tensorrt_llm/llmapi/tracing.py (1)
31-58: Consider using a more robust approach for placeholder types.The placeholder types defined when OpenTelemetry is not available may cause type checking issues. Consider using
typing.TYPE_CHECKINGto handle imports more cleanly.try: from opentelemetry.context.context import Context from opentelemetry.sdk.environment_variables import \ OTEL_EXPORTER_OTLP_TRACES_PROTOCOL from opentelemetry.sdk.trace import TracerProvider from opentelemetry.sdk.trace.export import BatchSpanProcessor + from opentelemetry.trace import SpanKind from opentelemetry.trace import Tracer, set_tracer_provider from opentelemetry.trace.propagation.tracecontext import \ TraceContextTextMapPropagator _is_otel_imported = True except ImportError: import traceback otel_import_error_traceback = traceback.format_exc() class Context: # type: ignore pass - class BaseSpanAttributes: # type: ignore - pass - class SpanKind: # type: ignore - pass + SERVER = None class Tracer: # type: ignore passtensorrt_llm/executor/result.py (2)
414-419: Consider using a more accurate arrival time calculation.The arrival time calculation assumes that the request arrived
E2Eseconds ago, which might not be accurate if there's processing delay. Consider using the actual arrival time from metrics if available.- # TODO: Add request arrival time - arrival_time = time.time() - metrics_dict.get(MetricNames.E2E, -1) + # Use actual arrival time from metrics if available, otherwise estimate + arrival_time_ns = req_perf_metrics_dict.get(RequestEventTiming.ARRIVAL_TIME, 0) + if arrival_time_ns > 0: + start_time = int(arrival_time_ns) + else: + # Fallback: estimate arrival time from E2E latency + e2e_seconds = metrics_dict.get(MetricNames.E2E, 0) + if e2e_seconds > 0: + start_time = int((time.time() - e2e_seconds) * 1e9) + else: + start_time = None + with tracing.global_otlp_tracer().start_as_current_span( "llm_request", kind=tracing.SpanKind.SERVER, context=trace_context, - start_time=int(arrival_time * 1e9), + start_time=start_time, ) as span:
442-448: Add check for empty postproc_params before accessing nested attributes.When accessing
postproc_params.postproc_args.num_prompt_tokens, the nested attribute access could fail ifpostproc_argsis None.safe_set_attr( span, tracing.SpanAttributes.GEN_AI_USAGE_PROMPT_TOKENS, - getattr(self.postproc_params.postproc_args, "num_prompt_tokens", - None) if self.postproc_params - and self.postproc_params.postproc_args else None, + getattr(getattr(self.postproc_params, "postproc_args", None), + "num_prompt_tokens", None) if self.postproc_params else None, )
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (4)
tensorrt_llm/executor/result.py (6 hunks)
tensorrt_llm/llmapi/llm.py (5 hunks)
tensorrt_llm/llmapi/tracing.py (1 hunks)
tensorrt_llm/serve/openai_server.py (6 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tensorrt_llm/serve/openai_server.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tensorrt_llm/llmapi/tracing.py, tensorrt_llm/llmapi/llm.py, tensorrt_llm/executor/result.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tensorrt_llm/llmapi/tracing.py, tensorrt_llm/llmapi/llm.py, tensorrt_llm/executor/result.py
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tensorrt_llm/executor/result.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (10)
tensorrt_llm/llmapi/tracing.py (3)
1-2: LGTM!Copyright header correctly follows the NVIDIA copyright format for 2025.
125-135: LGTM! Well-structured span attributes. The SpanAttributes class follows OpenTelemetry semantic conventions for generative AI observability with clear, descriptive attribute names.
80-91: Improve error handling and environment variable access. The function accesses the environment variable OTEL_EXPORTER_OTLP_TRACES_PROTOCOL without a guard; this could fail when OpenTelemetry is not imported.
Apply this diff to add proper guards:
def get_span_exporter(endpoint): + if not is_otel_available(): + raise ValueError( + "OpenTelemetry is not available. Unable to get span exporter.") protocol = os.environ.get(OTEL_EXPORTER_OTLP_TRACES_PROTOCOL, "grpc") if protocol == "grpc": from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import \ OTLPSpanExporter elif protocol == "http/protobuf": from opentelemetry.exporter.otlp.proto.http.trace_exporter import \ OTLPSpanExporter # type: ignore else: raise ValueError( f"Unsupported OTLP protocol '{protocol}' is configured") return OTLPSpanExporter(endpoint=endpoint)Likely an incorrect or invalid review comment.
tensorrt_llm/llmapi/llm.py (4)
9-9: LGTM! Proper import of the Mapping type. The import of Mapping from collections.abc is correct and follows best practices for type hints.
214-222: LGTM! Proper error handling for tracer initialization.The tracer initialization is correctly placed after model building and includes appropriate error handling without breaking the startup flow.
327-327: LGTM! Well-typed trace headers parameter. The trace_headers parameter is properly typed as Optional[Mapping[str, str]], following Python typing best practices.
446-446: LGTM! Correct propagation of trace headers. The trace headers are properly forwarded to the executor's generate_async method.
tensorrt_llm/executor/result.py (3)
3-3: LGTM! Proper imports for tracing functionality. The imports for time and tracing are correctly added to support the new tracing capabilities.
Also applies to: 14-14
166-166: LGTM! Proper initialization of trace headers. The trace_headers attribute is correctly initialized in both base classes to support trace context propagation.
Also applies to: 577-577
295-295: LGTM! Correct placement of the tracing call. The do_tracing method is properly called after record_stats to ensure metrics are available for the trace.
Actionable comments posted: 1
♻️ Duplicate comments (2)
tensorrt_llm/executor/result.py (2)
417-417: Confirm SpanKind is exported by tracing module.The code uses tracing.SpanKind.SERVER. Ensure SpanKind is re-exported from tensorrt_llm.llmapi.tracing; otherwise this will fail at runtime.
Run this to verify exports:
#!/bin/bash # Verify SpanKind, SpanAttributes, helpers, and global_otlp_tracer exist rg -n -C2 -P 'class\s+SpanAttributes\b|def\s+global_otlp_tracer\b|def\s+extract_trace_context\b|def\s+insufficient_request_metrics_warning\b' tensorrt_llm/llmapi rg -n -C2 -P '\bSpanKind\b' tensorrt_llm/llmapi
404-409: Emit a single request-level span (dedupe across streaming/beams).do_tracing() is called per-sequence and possibly multiple times during streaming. Without a guard, you’ll emit duplicate request spans. Add a one-shot flag.
Apply this diff:
     metrics_dict = self.metrics_dict
     if not metrics_dict or not req_perf_metrics_dict:
         # Insufficient request metrics available; trace generation aborted.
         tracing.insufficient_request_metrics_warning()
         return
+    # Avoid emitting duplicate request-level spans
+    if getattr(self, "_trace_emitted", False):
+        return
And set the flag after span emission:
@@
             json.dumps([output.finish_reason])
             if output.finish_reason else None)
+        self._trace_emitted = True
🧹 Nitpick comments (6)
tensorrt_llm/executor/result.py (6)
432-434: Guard optional GEN_AI_REQUEST_TOP_K attribute (may not exist).SpanAttributes may not define GEN_AI_REQUEST_TOP_K (it’s not listed in the PR objectives). Unconditional access risks AttributeError at runtime.
Apply this diff to set it only if available:
- safe_set_attr(span, tracing.SpanAttributes.GEN_AI_REQUEST_TOP_K, - sampling_params.top_k) + if hasattr(tracing.SpanAttributes, "GEN_AI_REQUEST_TOP_K"): + safe_set_attr( + span, + tracing.SpanAttributes.GEN_AI_REQUEST_TOP_K, + getattr(sampling_params, "top_k", None), + )
396-401: Add a concise docstring to do_tracing for maintainability.A short docstring clarifies preconditions (tracer present, metrics required) and what’s emitted.
Apply this diff:
def do_tracing( self, output: CompletionOutput, req_perf_metrics_dict: Optional[dict[str, float]] = None, ): + """ + Emit a single request-level OpenTelemetry span for this generation request. + Preconditions: + - tracing.global_otlp_tracer() is configured, + - req_perf_metrics_dict and self.metrics_dict contain timing metrics. + The span is linked to upstream context via self.trace_headers and is guarded + to avoid duplicate emissions across streaming/beam updates. + """
401-403: Minor: combine early-return condition to reduce overhead and log spam.You can short-circuit both the tracer check and missing metrics in one condition.
Apply this diff:
- if not tracing.global_otlp_tracer(): - return + if not tracing.global_otlp_tracer(): + returnAnd rely on the subsequent guard to warn once per call. Optionally, consider rate-limiting insufficient_request_metrics_warning() to avoid noisy logs on streaming.
295-295: Call site is fine; rely on one-shot guard inside do_tracing.Given the dedupe guard in do_tracing(), keeping this unconditional call here is acceptable. Optionally, you could gate on self._done to skip earlier attempts during streaming.
Example:
- self.do_tracing(output, req_perf_metrics_dict) + if self._done: + self.do_tracing(output, req_perf_metrics_dict)
417-419: Consider SpanKind.INTERNAL for in-process request-level span.If an upstream HTTP layer already emits a SERVER span, this nested span is better modeled as INTERNAL to avoid misclassification.
Apply this diff:
- kind=tracing.SpanKind.SERVER, + kind=tracing.SpanKind.INTERNAL,
3-3: Remove unused time import after start_time fix.Once start_time uses ARRIVAL_TIME (epoch ns), time is no longer used.
Apply this diff:
-import time
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
tensorrt_llm/executor/result.py (6 hunks)
tensorrt_llm/llmapi/tracing.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tensorrt_llm/llmapi/tracing.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tensorrt_llm/executor/result.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tensorrt_llm/executor/result.py
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tensorrt_llm/executor/result.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (6)
tensorrt_llm/executor/result.py (6)
441-447: LGTM: setting request id and prompt/usage tokens.Using self.id and prompt_token_ids length for span attributes is consistent and low-risk.
456-458: LGTM: queue time attribute wired from processed metrics.This aligns with RequestEventTiming → MetricNames mapping.
459-461: LGTM: finish reasons captured as a JSON array.Compact and interoperable with downstream consumers.
166-167: LGTM: initialize trace_headers on base class.Prevents attribute errors when tracing context is absent.
573-574: LGTM: propagate trace headers from request to result.Ensures context extraction works for downstream spans.
14-15: All required tracing exports verified
The following symbols are already listed in __all__ in tensorrt_llm/llmapi/tracing.py:
- SpanAttributes
- SpanKind
- extract_trace_context
- global_otlp_tracer
- insufficient_request_metrics_warning
No changes needed.
3971a3c to 6e9fe8b (compare)
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
tensorrt_llm/llmapi/disagg_utils.py (1)
1-1: Add NVIDIA copyright header.Per the coding guidelines, prepend the NVIDIA copyright header to all source files.
Apply this patch at the top of the file:
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+
 import logging
tensorrt_llm/commands/serve.py (1)
1-1: Add NVIDIA copyright header.Please prepend the standard NVIDIA header.
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+
 import asyncio
tensorrt_llm/serve/openai_disagg_server.py (1)
1-1: Insert NVIDIA copyright header (keep shebang on line 1).Add the header immediately after the shebang to comply with guidelines.
#!/usr/bin/env python +# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
♻️ Duplicate comments (1)
tensorrt_llm/llmapi/tracing.py (1)
118-123: Use explicit runtime check instead of assert in set_global_otlp_tracer.Asserts can be disabled with -O; keep this safety check active in production.
 def set_global_otlp_tracer(tracer: Tracer):
     """Set the global OTLP Tracer instance in the current process."""
     global _global_tracer_
-    assert _global_tracer_ is None
+    if _global_tracer_ is not None:
+        raise RuntimeError("Global OTLP tracer has already been set")
     _global_tracer_ = tracer
🧹 Nitpick comments (7)
tensorrt_llm/llmapi/disagg_utils.py (2)
45-49: Move field docstring to a proper class docstring (current string literal won’t be recognized).The string literal after the field assignment is not a valid class docstring (must be the first statement). Move it to the class docstring so tooling can pick it up.
-@dataclass -class ObservabilityConfig(): - otlp_traces_endpoint: Optional[str] = None - """Target URL to which OpenTelemetry traces will be sent.""" +@dataclass +class ObservabilityConfig: + """Observability-related options. + + Attributes: + otlp_traces_endpoint: Target URL to which OpenTelemetry traces will be sent. + """ + otlp_traces_endpoint: Optional[str] = None
33-38: Type annotation for RouterConfig.server_role should be Optional.Default is None but annotation is ServerRole. Make it Optional[ServerRole] to avoid type checker warnings.
 @dataclass
 class RouterConfig():
     type: str = "round_robin"
     args: dict = field(default_factory=dict)
-    server_role: ServerRole = None
+    server_role: Optional[ServerRole] = None
tensorrt_llm/commands/serve.py (1)
35-70: Minimize work inside signal handlers; avoid logger usage there.Signal handlers should do the least possible work; logger may not be async-signal-safe. Consider setting a flag or writing directly to stderr with os.write, then performing cleanup in normal control flow.
Example minimal adjustment:
def _signal_handler_cleanup_child(signum, frame): """Signal handler to clean up the child process.""" global _child_p_global - if _child_p_global and _child_p_global.poll() is None: - # Using print for safety in signal handlers - logger.info( - f"Parent process (PID {os.getpid()}) received signal {signal.Signals(signum).name}. Terminating child process (PID {_child_p_global.pid})." - ) + if _child_p_global and _child_p_global.poll() is None: + try: + os.write(2, f"Signal {signal.Signals(signum).name}: terminating child PID {_child_p_global.pid}\n".encode()) + except Exception: + pass _child_p_global.terminate() try: _child_p_global.wait( timeout=10) # Allow 10 seconds for graceful termination except subprocess.TimeoutExpired: - logger.info( - f"Child process (PID {_child_p_global.pid}) did not terminate gracefully after signal. Killing." - ) + try: + os.write(2, f"Child {_child_p_global.pid} not terminated; killing.\n".encode()) + except Exception: + pass _child_p_global.kill() try: _child_p_global.wait(timeout=10) # Allow 10 seconds for kill except subprocess.TimeoutExpired: - logger.info( - f"Child process (PID {_child_p_global.pid}) failed to die even after kill command from signal handler." - ) + try: + os.write(2, f"Child {_child_p_global.pid} failed to die after kill.\n".encode()) + except Exception: + passtensorrt_llm/serve/openai_disagg_server.py (4)
348-356: Align return type hints with possible StreamingResponse; wrap overly long lines.
- send_chat_request can return a StreamingResponse when request.stream is True; adjust annotation accordingly.
- A few lines exceed 120 chars; wrap to satisfy Ruff E501.
-async def create_chat_generator(self, url: str, request: ChatCompletionRequest, - trace_headers: Optional[Mapping[str, str]] = None): - async for chunk in self.create_generator(url, request, "/v1/chat/completions", trace_headers): +async def create_chat_generator( + self, + url: str, + request: ChatCompletionRequest, + trace_headers: Optional[Mapping[str, str]] = None, +): + async for chunk in self.create_generator( + url, request, "/v1/chat/completions", trace_headers + ): yield chunk-async def send_chat_request(self, url: str, request: ChatCompletionRequest, - trace_headers: Optional[Mapping[str, str]] = None) -> ChatCompletionResponse: - return await self.send_request(url, request, "/v1/chat/completions", ChatCompletionResponse, self.create_chat_generator, trace_headers) +async def send_chat_request( + self, + url: str, + request: ChatCompletionRequest, + trace_headers: Optional[Mapping[str, str]] = None, +) -> Union[ChatCompletionResponse, StreamingResponse]: + return await self.send_request( + url, + request, + "/v1/chat/completions", + ChatCompletionResponse, + self.create_chat_generator, + trace_headers, + )Also applies to: 394-401
36-36: Typo in yapf directive. The comment says "yapf: enale"; it should be "enable", or the directive should be removed.

```diff
-# yapf: enale
+# yapf: enable
```
7-7: Avoid confusion with the project's Mapping class by aliasing collections.abc.Mapping. The repo defines tensorrt_llm.mapping.Mapping; aliasing here reduces ambiguity when reading code.

```diff
-from collections.abc import Mapping
+from collections.abc import Mapping as HeadersMapping
 ...
-    trace_headers: Optional[Mapping[str, str]] = None):
+    trace_headers: Optional[HeadersMapping[str, str]] = None):
```
280-282: Optional: enrich event attributes for better observability. Attach key attributes (e.g., server chosen, stream flag, router) to the event for richer traces.

```diff
-    tracing.add_event('picking generation server')
+    tracing.add_event(
+        "disagg.pick_gen_server",
+        {
+            "request.stream": bool(req.stream),
+            "ctx_present": bool(ctx_response is not None),
+        },
+    )
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (4)
- tensorrt_llm/commands/serve.py (6 hunks)
- tensorrt_llm/llmapi/disagg_utils.py (4 hunks)
- tensorrt_llm/llmapi/tracing.py (1 hunks)
- tensorrt_llm/serve/openai_disagg_server.py (16 hunks)
🧬 Code graph analysis (2)
tensorrt_llm/llmapi/tracing.py (1)
tensorrt_llm/logger.py (1)
warning(131-132)
tensorrt_llm/serve/openai_disagg_server.py (3)
tensorrt_llm/mapping.py (1)
Mapping(20-453)
tensorrt_llm/llmapi/disagg_utils.py (4)
ConditionalDisaggConfig(41-42), MetadataServerConfig(64-69), ObservabilityConfig(46-48), RouterConfig(34-37)
tensorrt_llm/llmapi/tracing.py (4)
init_tracer(62-75), trace_span(167-198), inject_trace_headers(107-110), add_event(149-152)
🪛 Ruff (0.12.2)
tensorrt_llm/serve/openai_disagg_server.py
363-363: Line too long (150 > 120)
(E501)
395-395: Line too long (137 > 120)
(E501)
396-396: Line too long (140 > 120)
(E501)
400-400: Line too long (143 > 120)
(E501)
🔇 Additional comments (3)
tensorrt_llm/llmapi/disagg_utils.py (1)
59-60: Observability config plumbed correctly.
- New Optional[ObservabilityConfig] field on DisaggServerConfig looks good.
- extract_disagg_cfg accepts a dict and instantiates ObservabilityConfig safely.
- Properly passed into DisaggServerConfig constructor.
No functional concerns here.
Also applies to: 89-90, 123-125, 126-130
tensorrt_llm/llmapi/tracing.py (1)
3-9: Export surface looks complete (including set_global_otlp_tracer). The __all__ list now includes set_global_otlp_tracer as expected. Good catch and fix.
tensorrt_llm/commands/serve.py (1)
89-90: OTLP endpoint wiring into CLI and LLM args looks correct. The CLI option, serve() signature, get_llm_args() parameter, and propagation into llm_args are consistent. No issues spotted.
Also applies to: 113-135, 272-276, 286-288, 312-315
Actionable comments posted: 1
♻️ Duplicate comments (4)
tensorrt_llm/llmapi/tracing.py (3)
125-130: Consider a more robust check instead of assert. Using assert for runtime state validation can be disabled with Python's -O flag; use an explicit check that raises an exception for production safety.

```diff
 def set_global_otlp_tracer(tracer: Tracer):
     """Set the global OTLP Tracer instance in the current process."""
     global _global_tracer_
-    assert _global_tracer_ is None
+    if _global_tracer_ is not None:
+        raise RuntimeError("Global OTLP tracer has already been set")
     _global_tracer_ = tracer
```
164-168: Guard against missing OpenTelemetry and fix the type annotation. The function will fail when OpenTelemetry is not installed because get_current_span is undefined. Additionally, the types.Attributes type annotation will cause import errors.

```diff
 def add_event(name: str,
-              attributes: types.Attributes = None,
+              attributes: typing.Optional[dict] = None,
               timestamp: typing.Optional[int] = None) -> None:
+    """Add an event to the current span if tracing is available."""
+    if not is_otel_available():
+        return
     get_current_span().add_event(name, attributes, timestamp)
```
112-118: Fix the inverted condition in inject_trace_headers. The ternary is inverted: it creates an empty dict when headers are provided and tries to extract from headers when they are None. Additionally, the function should guard against missing OpenTelemetry.

```diff
 def inject_trace_headers(headers: Mapping[str, str]) -> Mapping[str, str]:
+    """Return a dict containing normalized trace headers with current context injected.
+
+    Returns an empty dict if OpenTelemetry is unavailable or tracing is disabled.
+    """
+    if not is_otel_available() or not is_tracing_enabled():
+        return {}
-    if is_tracing_enabled():
-        trace_headers = extract_trace_headers(headers) if not headers else {}
-        TraceContextTextMapPropagator().inject(trace_headers)
-        return trace_headers
-    return None
+    trace_headers = extract_trace_headers(headers) if headers else {}
+    TraceContextTextMapPropagator().inject(trace_headers)
+    return trace_headers
```

tensorrt_llm/serve/openai_disagg_server.py (1)
363-364: Fix the streaming-path parameter mismatch. The streaming branch incorrectly passes headers as the third argument to create_generator, but the method signature expects end_point as the third parameter and trace_headers as the fourth.
🧹 Nitpick comments (2)
tensorrt_llm/metrics/enums.py (1)
9-9: Consider a consistent quoting style for enum values. The new ARRIVAL_TIMESTAMP value uses single quotes while the other values in the enum use double quotes.

```diff
-    ARRIVAL_TIMESTAMP = 'arrival_timestamp'
+    ARRIVAL_TIMESTAMP = "arrival_timestamp"
```

tensorrt_llm/serve/openai_disagg_server.py (1)
364-364: Fix line-length violations. Several lines exceed the 120-character limit flagged by the linter.

Apply these formatting fixes:

```diff
-        trace_headers: Optional[Mapping[str, str]] = None) -> Union[CompletionResponse, ChatCompletionResponse, StreamingResponse]:
+        trace_headers: Optional[Mapping[str, str]] = None) -> Union[
+            CompletionResponse, ChatCompletionResponse, StreamingResponse]:

-    async def send_completion_request(self, url: str, request: CompletionRequest,
-                                      trace_headers: Optional[Mapping[str, str]] = None) -> Union[CompletionResponse, StreamingResponse]:
-        return await self.send_request(url, request, "/v1/completions", CompletionResponse, self.create_completion_generator, trace_headers)
+    async def send_completion_request(self, url: str, request: CompletionRequest,
+                                      trace_headers: Optional[Mapping[str, str]] = None) -> Union[
+                                          CompletionResponse, StreamingResponse]:
+        return await self.send_request(url, request, "/v1/completions", CompletionResponse,
+                                       self.create_completion_generator, trace_headers)

-    async def send_chat_request(self, url: str, request: ChatCompletionRequest,
-                                trace_headers: Optional[Mapping[str, str]] = None) -> ChatCompletionResponse:
-        return await self.send_request(url, request, "/v1/chat/completions", ChatCompletionResponse, self.create_chat_generator, trace_headers)
+    async def send_chat_request(self, url: str, request: ChatCompletionRequest,
+                                trace_headers: Optional[Mapping[str, str]] = None) -> ChatCompletionResponse:
+        return await self.send_request(url, request, "/v1/chat/completions", ChatCompletionResponse,
+                                       self.create_chat_generator, trace_headers)
```

Also applies to: 364-364, 396-396, 397-397, 401-401
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (7)
- tensorrt_llm/executor/result.py (7 hunks)
- tensorrt_llm/executor/worker.py (1 hunks)
- tensorrt_llm/llmapi/llm.py (5 hunks)
- tensorrt_llm/llmapi/tracing.py (1 hunks)
- tensorrt_llm/metrics/enums.py (1 hunks)
- tensorrt_llm/serve/openai_disagg_server.py (16 hunks)
- tensorrt_llm/serve/openai_server.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tensorrt_llm/serve/openai_server.py
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tensorrt_llm/executor/result.py
🧬 Code graph analysis (5)
tensorrt_llm/executor/worker.py (1)
tensorrt_llm/metrics/enums.py (1)
RequestEventTiming(12-19)
tensorrt_llm/serve/openai_disagg_server.py (4)
tensorrt_llm/llmapi/disagg_utils.py (4)
ConditionalDisaggConfig(41-42), MetadataServerConfig(64-69), ObservabilityConfig(46-48), RouterConfig(34-37)
tensorrt_llm/serve/router.py (1)
create_router(615-653)
tensorrt_llm/llmapi/tracing.py (5)
init_tracer(62-75), trace_span(182-213), inject_trace_headers(112-117), add_event(164-167), SpanEvents(152-156)
tensorrt_llm/serve/openai_server.py (1)
openai_completion(371-511)
tensorrt_llm/llmapi/llm.py (3)
cpp/include/tensorrt_llm/batch_manager/llmRequest.h (1)
tensorrt_llm(37-265)
tensorrt_llm/metrics/enums.py (1)
MetricNames(4-9)
tensorrt_llm/llmapi/tracing.py (1)
init_tracer(62-75)
tensorrt_llm/executor/result.py (2)
tensorrt_llm/llmapi/tracing.py (7)
global_otlp_tracer(120-122), insufficient_request_metrics_warning(177-179), extract_trace_context(92-98), SpanKind(51-52), SpanAttributes(136-149), add_event(164-167), SpanEvents(152-156)
tensorrt_llm/metrics/enums.py (2)
MetricNames(4-9), RequestEventTiming(12-19)
tensorrt_llm/llmapi/tracing.py (1)
tensorrt_llm/logger.py (1)
warning(131-132)
🪛 Ruff (0.12.2)
tensorrt_llm/serve/openai_disagg_server.py
364-364: Line too long (150 > 120)
(E501)
396-396: Line too long (137 > 120)
(E501)
397-397: Line too long (140 > 120)
(E501)
401-401: Line too long (143 > 120)
(E501)
🔇 Additional comments (31)
tensorrt_llm/llmapi/llm.py (3)
215-223: LGTM! Clean OTLP tracer initialization with proper error handling. The implementation correctly initializes the OTLP tracer when configured and gracefully handles initialization failures without disrupting LLM startup.

328-328: Good addition for trace-context propagation. The trace_headers parameter enables proper distributed tracing by allowing trace context to be propagated through the request pipeline.

453-456: LGTM! Arrival-timestamp tracking for accurate latency measurements. The code correctly captures the arrival timestamp when performance metrics are requested, which is essential for calculating accurate end-to-end latencies in the tracing implementation.
tensorrt_llm/executor/result.py (4)
396-409: Early return when metrics are insufficient. The implementation correctly guards against missing metrics before attempting to generate traces. The early returns prevent errors when either the global tracer or required metrics are unavailable.

432-478: Well-structured span attributes with proper null safety. The safe_set_attr helper elegantly handles optional values, ensuring that only non-null attributes are added to the span. The comprehensive set of attributes aligns well with the PR objectives for OpenTelemetry tracing.

479-493: KV cache transfer events properly recorded with timestamps. The implementation correctly adds KV-cache transfer events only when both start and end timestamps are available, preventing partial or invalid event data.
413-424: Verify E2E metric units and the default used in do_tracing's time correction. Before merging, please confirm the following in tensorrt_llm/executor/result.py (around the do_tracing method):

- Ensure that both req_perf_metrics_dict[RequestEventTiming.LAST_TOKEN_TIME] and req_perf_metrics_dict[RequestEventTiming.ARRIVAL_TIME] originate from the same clock source in the C++ runtime and are reported in seconds.
- Check whether the default -1 in metrics_dict.get(MetricNames.E2E, -1) could skew the fallback branch (time.time() - e2e - arrival_time), and consider using 0 or guarding against negative values if the metric is missing.
- Validate that mixing Python's time.time() (system epoch) with the C++ timestamps does not introduce a constant offset; if the C++ runtime uses a steady clock, you may need to convert, or switch to time.monotonic() to align origins.

tensorrt_llm/llmapi/tracing.py (2)
1-2: LGTM! Proper copyright header.
58-76: Well-designed tracer initialization with proper error handling. The init_tracer function properly checks for OpenTelemetry availability and provides clear error messages with the original traceback when dependencies are missing. The implementation follows best practices for setting up the TracerProvider with a BatchSpanProcessor.
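For reference, a minimal sketch of that setup with the standard OpenTelemetry SDK; the service name, exporter protocol, and function signature here are assumptions, not the actual init_tracer:

```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor


def init_tracer(module_name: str, otlp_traces_endpoint: str):
    # Build a provider that batches spans and ships them to the OTLP endpoint.
    provider = TracerProvider(
        resource=Resource.create({"service.name": "tensorrt-llm"}))
    provider.add_span_processor(
        BatchSpanProcessor(OTLPSpanExporter(endpoint=otlp_traces_endpoint)))
    trace.set_tracer_provider(provider)
    return trace.get_tracer(module_name)
```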
7-7: Import looks good. The Mapping import from collections.abc is appropriately added to support the new trace-headers functionality.

14-14: Essential imports added for observability. The imports for Request, the tracing module, and ObservabilityConfig are correctly added to support the new OpenTelemetry tracing functionality.

Also applies to: 21-21, 24-24

50-51: Constructor signature updated correctly. The addition of the observability_cfg parameter with proper Optional typing aligns with the observability integration requirements.
59-68: OTLP tracer initialization with proper error handling. The tracer initialization logic correctly checks for the OTLP endpoint configuration and handles exceptions gracefully. The logging provides good visibility into initialization success/failure.

144-145: Function signature updated for trace propagation. The merge_streaming_responses method correctly adds the trace_headers parameter to support trace-context propagation in streaming scenarios.

157-157: Trace headers properly propagated to downstream calls. The trace headers are correctly passed to both completion and chat request methods in the streaming path.
Also applies to: 159-159
169-170: Proper tracing decorator and signature update. The @tracing.trace_span("llm_request") decorator correctly instruments the completion endpoint, and the addition of the raw_request: Request parameter enables trace-header extraction.
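For reference, a runnable sketch of the extraction pattern this enables; the route, helper name, and response body are illustrative, not the actual server code:

```python
from fastapi import FastAPI, Request

app = FastAPI()
TRACE_HEADER_NAMES = ("traceparent", "tracestate")


def pick_trace_headers(headers) -> dict:
    # Keep only the W3C trace-context headers from the incoming request.
    return {k: v for k, v in headers.items() if k.lower() in TRACE_HEADER_NAMES}


@app.post("/v1/completions")
async def openai_completion(raw_request: Request):
    body = await raw_request.json()
    trace_headers = pick_trace_headers(raw_request.headers)
    # In the real server these headers are handed to generate_async(...,
    # trace_headers=trace_headers) so the emitted span joins the caller's trace.
    return {"prompt": body.get("prompt"), "trace_headers": trace_headers}
```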
179-179: Consistent parameter passing for disaggregated requests. The raw_request parameter is properly passed to _send_disagg_request to maintain trace context.

184-185: Chat completion endpoint properly instrumented. The chat completion endpoint follows the same pattern as the completion endpoint, with proper tracing decoration and parameter updates.
Also applies to: 188-188
202-203: Context request method updated for trace propagation. The _send_context_request method signature correctly includes the trace_headers parameter for downstream trace propagation.

212-212: Trace headers passed to context server calls. The trace headers are properly propagated to both chat and completion requests sent to context servers.
Also applies to: 215-215
228-228: Disaggregated request method enhanced for tracing. The method signature correctly adds the raw_request parameter, and trace-header injection is properly implemented using the tracing utility.

Also applies to: 231-231

260-262: Server selection events properly instrumented. The context server selection event is correctly added with appropriate attributes for observability.

264-264: Context request includes trace headers. The trace headers are properly passed to the context request to maintain trace continuity.

287-287: Generation server selection instrumented. The generation server selection event is correctly added with server information for observability.

297-297: Trace headers propagated to generation requests. The trace headers are correctly passed to both completion and chat generation requests.
Also applies to: 300-300
309-309: Streaming response includes trace context. The trace headers are properly passed to the streaming response merger to maintain trace context throughout the streaming flow.
326-332: Generator method enhanced with trace-header support. The create_generator method correctly adds a trace-header parameter and merges the headers with the Content-Type header for downstream requests.
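A minimal sketch of that merging pattern; the function name and session handling are illustrative, not the actual create_generator:

```python
from typing import Optional

import aiohttp


async def post_with_trace(session: aiohttp.ClientSession, url: str,
                          end_point: str, payload: dict,
                          trace_headers: Optional[dict] = None):
    # Merge the JSON content type with any inbound trace headers before forwarding.
    headers = {"Content-Type": "application/json", **(trace_headers or {})}
    async with session.post(url + end_point, json=payload,
                            headers=headers) as resp:
        resp.raise_for_status()
        return await resp.json()
```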
333-334: HTTP request includes trace headers. The trace headers are properly included in the HTTP POST request to ensure trace propagation to downstream services.

349-351: Completion and chat generators updated consistently. Both generator methods are consistently updated to accept and pass trace headers to the base generator method.
Also applies to: 354-356
367-375: Non-streaming path correctly handles trace headers. The trace headers are properly merged with the Content-Type header and included in the HTTP POST request for non-streaming scenarios.

395-397: Request methods consistently updated with trace headers. Both send_completion_request and send_chat_request are consistently updated to accept and pass trace headers to the base send_request method.

Also applies to: 399-401
Actionable comments posted: 4
♻️ Duplicate comments (3)
tensorrt_llm/executor/result.py (3)
3-3: Remove the unused time import after fixing the start_time computation. Once do_tracing uses absolute timestamps directly (see below), time becomes unnecessary.

```diff
-import time
```
401-409: Emit the request-level span once, and only when inputs are sufficient. Add a duplicate-emission guard and keep the existing sufficiency checks. Suggested diff:

```diff
-        if not tracing.global_otlp_tracer():
+        if not tracing.global_otlp_tracer():
             return

-        metrics_dict = self.metrics_dict
+        # Avoid emitting duplicate request-level spans (e.g., beam/n>1)
+        if getattr(self, "_trace_emitted", False):
+            return
+
+        metrics_dict = self.metrics_dict
         if not metrics_dict or not req_perf_metrics_dict:
             # Insufficient request metrics available; trace generation aborted.
             tracing.insufficient_request_metrics_warning()
             return
```
413-431: Fix the span start_time computation; avoid mixing epochs and durations. The current logic derives a "time_correction" using time.time() and E2E, which can produce invalid start times (misordered spans, drops). Use the absolute arrival timestamp when available, otherwise ARRIVAL_TIME, both in epoch seconds, and convert to ns.

Apply this diff:

```diff
-        # Since arrival_time and other timing metrics are based on different time origins,
-        # we need to apply corrections to align them with absolute timestamps
-        time_correction = 0
-        arrival_timestamp = metrics_dict.get(MetricNames.ARRIVAL_TIMESTAMP, 0)
-        arrival_time = req_perf_metrics_dict.get(
-            RequestEventTiming.ARRIVAL_TIME, 0)
-        if arrival_timestamp > 0:
-            time_correction = arrival_timestamp - arrival_time
-        else:
-            time_correction = time.time() - metrics_dict.get(
-                MetricNames.E2E, -1) - arrival_time
-
-        with tracing.global_otlp_tracer().start_as_current_span(
+        # Compute absolute span start time (epoch-ns). Prefer ARRIVAL_TIMESTAMP, else ARRIVAL_TIME.
+        arrival_time_s = float(req_perf_metrics_dict.get(RequestEventTiming.ARRIVAL_TIME, 0.0))
+        arrival_timestamp_s = float(metrics_dict.get(MetricNames.ARRIVAL_TIMESTAMP, 0.0))
+        start_time_ns = (
+            int(arrival_timestamp_s * 1e9) if arrival_timestamp_s > 0.0
+            else int(arrival_time_s * 1e9)
+        )
+
+        with tracing.global_otlp_tracer().start_as_current_span(
             "llm_request",
             kind=tracing.SpanKind.SERVER,
             context=trace_context,
-            start_time=int((arrival_time + time_correction) * 1e9),
+            start_time=start_time_ns,
         ) as span:
```
🧹 Nitpick comments (3)
tensorrt_llm/executor/result.py (3)
468-470: Set finish reasons as a list attribute, not a JSON string. OTel attributes support sequences; avoid json.dumps to keep type fidelity.

```diff
-            safe_set_attr(
-                span, tracing.SpanAttributes.GEN_AI_RESPONSE_FINISH_REASONS,
-                json.dumps([output.finish_reason])
-                if output.finish_reason else None)
+            safe_set_attr(
+                span,
+                tracing.SpanAttributes.GEN_AI_RESPONSE_FINISH_REASONS,
+                [output.finish_reason] if output.finish_reason else None,
+            )
```
472-477: Guard the KV cache transfer time computation. If either start or end is missing, the subtraction yields misleading negatives or meaningless values. Compute only when both are present.

```diff
-            safe_set_attr(
-                span,
-                tracing.SpanAttributes.GEN_AI_LATENCY_KV_CACHE_TRANSFER_TIME,
-                req_perf_metrics_dict.get(
-                    RequestEventTiming.KV_CACHE_TRANSFER_END, 0.0) -
-                req_perf_metrics_dict.get(
-                    RequestEventTiming.KV_CACHE_TRANSFER_START, 0.0))
+            kv_start_s = float(req_perf_metrics_dict.get(RequestEventTiming.KV_CACHE_TRANSFER_START, 0.0))
+            kv_end_s = float(req_perf_metrics_dict.get(RequestEventTiming.KV_CACHE_TRANSFER_END, 0.0))
+            kv_duration_s = kv_end_s - kv_start_s if (kv_start_s and kv_end_s) else None
+            safe_set_attr(
+                span,
+                tracing.SpanAttributes.GEN_AI_LATENCY_KV_CACHE_TRANSFER_TIME,
+                kv_duration_s,
+            )
```
396-400: Optional: factor out a tiny helper to set attributes safely. safe_set_attr appears inline; consider a small local helper, or reuse a shared helper from tracing if one is available, to reduce repetition across the codebase. Not blocking.
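A minimal sketch of such a helper, assuming only the standard span.set_attribute API:

```python
def safe_set_attr(span, key, value) -> None:
    # Record the attribute only when a real value is present; skip None entirely.
    if value is not None:
        span.set_attribute(key, value)
```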
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
- tensorrt_llm/executor/result.py (7 hunks)
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tensorrt_llm/executor/result.py
🧬 Code graph analysis (1)
tensorrt_llm/executor/result.py (2)
tensorrt_llm/llmapi/tracing.py (7)
global_otlp_tracer(120-122), insufficient_request_metrics_warning(177-179), extract_trace_context(92-98), SpanKind(51-52), SpanAttributes(136-149), add_event(164-167), SpanEvents(152-156)
tensorrt_llm/metrics/enums.py (2)
MetricNames(4-9), RequestEventTiming(12-19)
🔇 Additional comments (5)
tensorrt_llm/executor/result.py (5)
14-15: Tracing module import is appropriate and consistent with the module-namespace guideline. Using from tensorrt_llm.llmapi import tracing keeps the module namespace and matches our guidelines. No action needed.

319-319: Metrics merge semantics LGTM. Using self.metrics_dict.update(response.metrics) prevents clobbering previously recorded stats and is the right call.

394-395: Metrics accumulation LGTM. Switching to update preserves earlier labels (e.g., finish reason) while adding runtime-derived metrics.

609-609: Trace-context propagation LGTM. Assigning self.trace_headers = generation_request.trace_headers ensures downstream spans join the inbound trace context.
848-873: Units sanity confirmed: all latency metrics are handled in seconds.

- The _process_req_perf_metrics function returns TTFT, E2E, REQUEST_QUEUE_TIME, and TPOT as floating-point seconds, derived from time.time() differences.
- Those values are set directly on OpenTelemetry spans in tensorrt_llm/executor/result.py (via safe_set_attr(...)) and logged as histograms in tensorrt_llm/metrics/collector.py, both expecting seconds.
- The benchmark script (tensorrt_llm/serve/scripts/benchmark_serving.py) deliberately converts seconds to milliseconds when printing _ms-suffixed metrics for human-readable reports; this is separate and does not affect the OTel data.

No code changes required; dashboards and alerts should assume these OTel spans report seconds.
67dc753 to
1b42813
Compare
63afd96 to
30b1340
Compare
|
/bot run --disable-fail-fast |
|
PR_Github #22041 [ run ] triggered by Bot. Commit: |
|
/bot run --disable-fail-fast |
Signed-off-by: zhanghaotong <[email protected]>
|
/bot run --disable-fail-fast |
|
PR_Github #22118 [ run ] triggered by Bot. Commit: |
|
PR_Github #22041 [ run ] completed with state |
|
PR_Github #22118 [ run ] completed with state |
|
/bot run --disable-fail-fast |
|
PR_Github #22258 [ run ] triggered by Bot. Commit: |
|
PR_Github #22258 [ run ] completed with state |
|
/bot run --disable-fail-fast |
|
PR_Github #22364 [ run ] triggered by Bot. Commit: |
|
PR_Github #22364 [ run ] completed with state |
Signed-off-by: Zhang Haotong <[email protected]>
|
/bot run --disable-fail-fast |
|
PR_Github #22566 [ run ] triggered by Bot. Commit: |
|
PR_Github #22566 [ run ] completed with state |
|
/bot skip --comment "Pipelines 17011 and 16859 have passed respectively" |
Approving the code related to dist-serving.
|
PR_Github #22617 [ skip ] triggered by Bot. Commit: |
|
PR_Github #22617 [ skip ] completed with state |
Signed-off-by: Zhang Haotong <[email protected]> Signed-off-by: zhanghaotong <[email protected]> Signed-off-by: Shunkang <[email protected]> Co-authored-by: Zhang Haotong <[email protected]> Co-authored-by: Shunkang <[email protected]>
Signed-off-by: Zhang Haotong <[email protected]> Signed-off-by: zhanghaotong <[email protected]> Signed-off-by: Shunkang <[email protected]> Co-authored-by: Zhang Haotong <[email protected]> Co-authored-by: Shunkang <[email protected]>
Signed-off-by: Zhang Haotong <[email protected]> Signed-off-by: zhanghaotong <[email protected]> Signed-off-by: Shunkang <[email protected]> Co-authored-by: Zhang Haotong <[email protected]> Co-authored-by: Shunkang <[email protected]>
Signed-off-by: Zhang Haotong <[email protected]> Signed-off-by: zhanghaotong <[email protected]> Signed-off-by: Shunkang <[email protected]> Co-authored-by: Zhang Haotong <[email protected]> Co-authored-by: Shunkang <[email protected]>
[None][feat] Add opentelemetry tracing
Description
This PR implements request-level OpenTelemetry tracing with the following attributes and events:
These trace attributes integrate with the API server's monitoring system through OpenTelemetry, enabling granular observability of generative AI services at the request level.
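For example, a caller that has OpenTelemetry configured can join its own trace with the server-side llm_request span by sending W3C trace headers; a minimal client sketch, where the endpoint URL and model name are placeholders:

```python
import requests
from opentelemetry import trace
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("client_call"):
    headers = {"Content-Type": "application/json"}
    # Adds the W3C "traceparent" (and "tracestate") headers for the active span.
    TraceContextTextMapPropagator().inject(headers)
    resp = requests.post(
        "http://localhost:8000/v1/completions",  # placeholder endpoint
        json={"model": "my-model", "prompt": "Hello", "max_tokens": 16},
        headers=headers,
    )
    print(resp.json())
```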
Test Coverage
For standalone instance:

For P/D separation scenarios, we can automatically chain the invocation process and generate kv_cache-related events.



Summary by CodeRabbit
New Features
Improvements
Chores