Skip to content

Conversation

@zhongxuanwang-nv
Copy link
Contributor

@zhongxuanwang-nv zhongxuanwang-nv commented Sep 28, 2025

Description

Closes 1897

Added:

  • ContextState fields: workflow_run_id, workflow_trace_id (src/nat/builder/context.py)
  • Non-zero ID generators & validators for OpenTelemetry-sized IDs (src/nat/data_models/span.py)
  • Workflow start/end event emission with metadata: workflow_run_id, workflow_trace_id, conversation_id (src/nat/runtime/runner.py)
  • Span attributes for correlation: .workflow.run_id, .workflow.trace_id, .conversation.id (src/nat/observability/exporter/span_exporter.py)
  • Tests: tests/nat/opentelemetry/test_otel_span_ids.py, tests/nat/runtime/test_runner_trace_ids.py, tests/nat/runtime/test_session_traceparent.py, updates to tests/nat/runtime/test_runner.py

Changed:

  • Ensure non-zero 128-bit trace IDs and 64-bit span IDs in OTEL plugin (packages/nvidia_nat_opentelemetry/src/nat/plugins/opentelemetry/otel_span.py)
  • Runner establishes workflow-scoped run/trace IDs and resets them on exit (src/nat/runtime/runner.py)
  • Exporter prefers parent/context trace ID; falls back to workflow trace ID and sets correlated attributes (src/nat/observability/exporter/span_exporter.py)
  • SpanContext defaults & field validators enforce size/range and non-zero invariants (src/nat/data_models/span.py)
  • Parse/propagate incoming W3C traceparent and workflow headers to session/runner (src/nat/runtime/session.py, tests/nat/runtime/test_session_traceparent.py)

Removed:

  • None

Why it matters:

  • Consistent cross-span correlation and easier log/trace joins across agents and sub-spans
  • Fewer broken traces: eliminates zero/invalid IDs, aligns with OpenTelemetry sizing
  • Improves debugging and fleet-wide reporting via explicit .workflow.* and .conversation.id attributes
  • Safer header handling for external callers; preserves upstream traces

Breaking changes / Migrations:

  • None

Docs:

  • None

By Submitting this PR I confirm:

  • I am familiar with the Contributing Guidelines.
  • We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original work, or you have rights to submit it under the same license, or a compatible license.
    • Any contribution which contains commits that are not Signed-Off will not be accepted.
  • When the PR is ready for review, new or existing tests cover these changes.
  • When the PR is ready for review, the documentation is up to date with these changes.

Summary by CodeRabbit

  • New Features

    • Workflow-level run and trace IDs are created, stored in context, propagated, and emitted with WORKFLOW_START/WORKFLOW_END events.
    • Incoming HTTP tracing headers (traceparent, workflow-trace-id, workflow-run-id) are parsed into context.
  • Improvements

    • Span metadata now includes workflow IDs and richer attributes (event type, function info, timestamps, conversation/run IDs).
    • ID generation and validation tightened to ensure non-zero, correctly sized trace and span IDs; span context ID generation approach updated.
  • Tests

    • Added tests validating span ID formats, runner trace/run ID behavior, and header parsing.

Signed-off-by: Daniel Wang <[email protected]>
@zhongxuanwang-nv zhongxuanwang-nv self-assigned this Sep 28, 2025
@zhongxuanwang-nv zhongxuanwang-nv added improvement Improvement to existing functionality non-breaking Non-breaking change labels Sep 28, 2025
@coderabbitai
Copy link

coderabbitai bot commented Sep 28, 2025

Walkthrough

Generate non-zero OTEL-style 128-bit trace IDs and 64-bit span IDs with stricter validation and parsing; add Context workflow_run_id and workflow_trace_id; parse trace headers into context; propagate workflow IDs through Runner (emit WORKFLOW_START/END) and Exporter (trace fallback and attributes); add tests for IDs and header parsing.

Changes

Cohort / File(s) Summary of changes
OTEL span construction & ID validation
packages/nvidia_nat_opentelemetry/src/nat/plugins/opentelemetry/otel_span.py, src/nat/data_models/span.py, tests/nat/opentelemetry/test_otel_span_ids.py
OtelSpan now constructs SpanContext using updated UUID-based ID generation; SpanContext gets non-zero ID generators and validators for trace_id (128-bit) and span_id (64-bit) accepting int/str/None and enforcing bit-width/non-zero constraints; tests assert non-zero and expected hex lengths.
Context state: workflow IDs
src/nat/builder/context.py
Added ContextState ContextVars `workflow_run_id: ContextVar[str
Runner: workflow ID scoping & events
src/nat/runtime/runner.py, tests/nat/runtime/test_runner_trace_ids.py
Runner derives/creates or preserves workflow_run_id and workflow_trace_id (UUIDs when absent), stores via tokens, emits WORKFLOW_START and WORKFLOW_END (TraceMetadata with workflow IDs) for result and streaming flows, and restores prior context; tests verify propagation and restoration.
Session: header parsing → workflow IDs
src/nat/runtime/session.py, tests/nat/runtime/test_session_traceparent.py
Added handling of W3C traceparent header (preferred) and fallback workflow-trace-id/workflow-run-id headers; parses and sets workflow IDs into ContextState with tolerant error handling; tests cover valid, invalid, and edge cases.
Exporter: span attribute population & trace fallback
src/nat/observability/exporter/span_exporter.py
Exporter reads workflow IDs from context and, when creating sub-spans, selects trace_id from span context → parent → workflow_trace_id fallback; constructs consolidated attributes including workflow.run_id and workflow.trace_id (formatted as 032x) attached to spans.
Tests: integration and unit coverage
tests/nat/opentelemetry/test_otel_span_ids.py, tests/nat/runtime/test_runner_trace_ids.py, tests/nat/runtime/test_session_traceparent.py
New tests validate OtelSpan ID properties, Runner workflow ID preservation/propagation, and Session header parsing behavior across edge cases.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Client
  participant SessionManager
  participant ContextState

  Client->>SessionManager: HTTP request (headers)
  alt traceparent present
    note right of SessionManager #DDEBF7: extract 32-hex trace_id → int
    SessionManager->>ContextState: set workflow_trace_id
  else workflow-trace-id present
    note right of SessionManager #F3F2DD: parse UUID/hex → int
    SessionManager->>ContextState: set workflow_trace_id
  end
  alt workflow-run-id present
    SessionManager->>ContextState: set workflow_run_id
  end
Loading
sequenceDiagram
  autonumber
  participant Runner
  participant ContextState
  participant ExporterManager
  participant Function

  Runner->>ContextState: derive/set workflow_run_id & workflow_trace_id (tokens)
  Runner->>ExporterManager: start()
  note right of Runner #E8F8E8: emit WORKFLOW_START (TraceMetadata)
  Runner->>Function: execute (result or stream)
  Function-->>Runner: output(s)
  note right of Runner #FFEFE0: emit WORKFLOW_END (TraceMetadata + payload)
  Runner->>ContextState: restore previous values (finally)
Loading
sequenceDiagram
  autonumber
  participant Exporter
  participant ContextState
  participant SpanSystem

  Exporter->>ContextState: read workflow_trace_id, workflow_run_id
  alt parent span exists
    Exporter->>SpanSystem: create sub-span (use parent.trace_id)
  else workflow_trace_id available
    Exporter->>SpanSystem: create sub-span (trace_id = workflow_trace_id)
  else
    Exporter->>SpanSystem: create sub-span (auto-generated non-zero IDs)
  end
  note over SpanSystem #F5F5FF: span attributes include workflow.run_id and workflow.trace_id (formatted 032x)
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested labels

! - Release

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit's high-level summary is enabled.
Title Check ✅ Passed The title uses an imperative verb, clearly describes the main change, remains under the 72-character limit, and concisely conveys the addition of hierarchical IDs for telemetry and reporting.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
✨ Finishing touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9e43759 and 58ae9cc.

📒 Files selected for processing (1)
  • src/nat/data_models/span.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (6)
**/*.{py,yaml,yml}

📄 CodeRabbit inference engine (.cursor/rules/nat-test-llm.mdc)

**/*.{py,yaml,yml}: Configure response_seq as a list of strings; values cycle per call, and [] yields an empty string.
Configure delay_ms to inject per-call artificial latency in milliseconds for nat_test_llm.

Files:

  • src/nat/data_models/span.py
**/*.py

📄 CodeRabbit inference engine (.cursor/rules/nat-test-llm.mdc)

**/*.py: Programmatic use: create TestLLMConfig(response_seq=[...], delay_ms=...), add with builder.add_llm("", cfg).
When retrieving the test LLM wrapper, use builder.get_llm(name, wrapper_type=LLMFrameworkEnum.) and call the framework’s method (e.g., ainvoke, achat, call).

**/*.py: In code comments/identifiers use NAT abbreviations as specified: nat for API namespace/CLI, nvidia-nat for package name, NAT for env var prefixes; do not use these abbreviations in documentation
Follow PEP 20 and PEP 8; run yapf with column_limit=120; use 4-space indentation; end files with a single trailing newline
Run ruff check --fix as linter (not formatter) using pyproject.toml config; fix warnings unless explicitly ignored
Respect naming: snake_case for functions/variables, PascalCase for classes, UPPER_CASE for constants
Treat pyright warnings as errors during development
Exception handling: use bare raise to re-raise; log with logger.error() when re-raising to avoid duplicate stack traces; use logger.exception() when catching without re-raising
Provide Google-style docstrings for every public module, class, function, and CLI command; first line concise and ending with a period; surround code entities with backticks
Validate and sanitize all user input, especially in web or CLI interfaces
Prefer httpx with SSL verification enabled by default and follow OWASP Top-10 recommendations
Use async/await for I/O-bound work; profile CPU-heavy paths with cProfile or mprof before optimizing; cache expensive computations with functools.lru_cache or external cache; leverage NumPy vectorized operations when beneficial

Files:

  • src/nat/data_models/span.py
src/**/*.py

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

All importable Python code must live under src/ (or packages//src/)

Files:

  • src/nat/data_models/span.py
src/nat/**/*

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

Changes in src/nat should prioritize backward compatibility

Files:

  • src/nat/data_models/span.py

⚙️ CodeRabbit configuration file

This directory contains the core functionality of the toolkit. Changes should prioritize backward compatibility.

Files:

  • src/nat/data_models/span.py
{src/**/*.py,packages/*/src/**/*.py}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

All public APIs must have Python 3.11+ type hints on parameters and return values; prefer typing/collections.abc abstractions; use typing.Annotated when useful

Files:

  • src/nat/data_models/span.py
**/*

⚙️ CodeRabbit configuration file

**/*: # Code Review Instructions

  • Ensure the code follows best practices and coding standards. - For Python code, follow
    PEP 20 and
    PEP 8 for style guidelines.
  • Check for security vulnerabilities and potential issues. - Python methods should use type hints for all parameters and return values.
    Example:
    def my_function(param1: int, param2: str) -> bool:
        pass
  • For Python exception handling, ensure proper stack trace preservation:
    • When re-raising exceptions: use bare raise statements to maintain the original stack trace,
      and use logger.error() (not logger.exception()) to avoid duplicate stack trace output.
    • When catching and logging exceptions without re-raising: always use logger.exception()
      to capture the full stack trace information.

Documentation Review Instructions - Verify that documentation and comments are clear and comprehensive. - Verify that the documentation doesn't contain any TODOs, FIXMEs or placeholder text like "lorem ipsum". - Verify that the documentation doesn't contain any offensive or outdated terms. - Verify that documentation and comments are free of spelling mistakes, ensure the documentation doesn't contain any

words listed in the ci/vale/styles/config/vocabularies/nat/reject.txt file, words that might appear to be
spelling mistakes but are listed in the ci/vale/styles/config/vocabularies/nat/accept.txt file are OK.

Misc. - All code (except .mdc files that contain Cursor rules) should be licensed under the Apache License 2.0,

and should contain an Apache License 2.0 header comment at the top of each file.

  • Confirm that copyright years are up-to date whenever a file is changed.

Files:

  • src/nat/data_models/span.py
🧬 Code graph analysis (1)
src/nat/data_models/span.py (1)
packages/nvidia_nat_opentelemetry/src/nat/plugins/opentelemetry/otel_span.py (2)
  • trace_id (274-280)
  • span_id (265-271)
🪛 Ruff (0.13.2)
src/nat/data_models/span.py

156-156: Avoid specifying long messages outside the exception class

(TRY003)


167-167: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


167-167: Avoid specifying long messages outside the exception class

(TRY003)


171-171: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (3)
src/nat/data_models/span.py (3)

131-139: LGTM!

The generator functions correctly produce non-zero IDs. uuid.uuid4() guarantees non-zero values per RFC 4122, and the 64-bit shift for span_id appropriately reduces the UUID to the required size.


147-157: LGTM!

The trace_id validator correctly handles all input types:

  • Converts valid UUID strings to 128-bit integers
  • Regenerates when None
  • Validates non-zero and bit-width constraints

The static analysis hint about TRY003 (inline exception messages) is pedantic for this context—the error messages are clear and concise.


168-172: LGTM!

The validation logic correctly regenerates on None and enforces non-zero 64-bit constraints. The implementation is consistent with the trace_id validator approach.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@zhongxuanwang-nv
Copy link
Contributor Author

@coderabbitai review

@coderabbitai
Copy link

coderabbitai bot commented Sep 28, 2025

@zhongxuanwang-nv I'll review the changes in this PR that adds hierarchical IDs to NAT for consistent telemetry and reporting.


🧠 Learnings used
Learnt from: willkill07
PR: NVIDIA/NeMo-Agent-Toolkit#710
File: .coderabbit.yaml:33-33
Timestamp: 2025-08-25T15:24:13.143Z
Learning: User feedback indicates that CodeRabbit's chat commands, particularly `coderabbitai review` for manual incremental reviews, need better documentation and discoverability within repositories, especially when auto_incremental_review is disabled.
✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@zhongxuanwang-nv zhongxuanwang-nv changed the title Add hierarchical IDs to NAT for consistent telemetry and reporting Add hierarchical IDs for consistent telemetry and reporting Sep 28, 2025
@zhongxuanwang-nv zhongxuanwang-nv marked this pull request as ready for review September 29, 2025 03:59
@zhongxuanwang-nv zhongxuanwang-nv requested a review from a team as a code owner September 29, 2025 03:59
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/nat/observability/exporter/span_exporter.py (1)

156-160: Span kind always falls back to UNKNOWN due to Enum passed to string-keyed map; subspan name also shows Enum repr.

event_type_to_span_kind expects a string key (e.g., "WORKFLOW_START"), but event.event_type is an Enum; similarly sub_span_name should use the string value. This degrades telemetry classification.

-        if event.payload.name:
-            sub_span_name = f"{event.payload.name}"
-        else:
-            sub_span_name = f"{event.payload.event_type}"
+        if event.payload.name:
+            sub_span_name = f"{event.payload.name}"
+        else:
+            sub_span_name = f"{getattr(event.payload.event_type, 'value', str(event.payload.event_type))}"

-        span_kind = event_type_to_span_kind(event.event_type)
+        span_kind = event_type_to_span_kind(getattr(event.event_type, 'value', str(event.event_type)))

Also applies to: 196-198

🧹 Nitpick comments (13)
tests/nat/opentelemetry/test_otel_span_ids.py (1)

19-25: Strengthen assertions to cover accessors and parent propagation.

Add quick checks that the OtelSpan accessors reflect the same IDs as the context, and that a child span inherits the trace_id from its parent.

Apply:

 def test_otel_span_ids_are_non_zero():
     s = OtelSpan(name="test", context=None, parent=None, attributes={})
     ctx = s.get_span_context()
     assert ctx.trace_id != 0
     assert ctx.span_id != 0
     assert len(f"{ctx.trace_id:032x}") == 32
     assert len(f"{ctx.span_id:016x}") == 16
+    # Accessors mirror context
+    assert s.trace_id == ctx.trace_id
+    assert s.span_id == ctx.span_id
+    # Child inherits parent's trace_id
+    child = OtelSpan(name="child", context=None, parent=s, attributes={})
+    assert child.trace_id == s.trace_id
src/nat/runtime/session.py (3)

16-23: Add module logger for parse diagnostics.

 import asyncio
 import contextvars
 import typing
 from collections.abc import Awaitable
 from collections.abc import Callable
 from contextlib import asynccontextmanager
 from contextlib import nullcontext
+import logging
+import uuid

As per coding guidelines.


164-187: Avoid bare except, validate 32‑nybble non‑zero IDs, and log parse failures.

Current try/except: pass drops signal and accepts malformed inputs. Narrow exceptions, enforce spec length/non‑zero, and log once.

-        # W3C Trace Context header: traceparent: 00-<trace-id>-<span-id>-<flags>
+        # W3C Trace Context header: traceparent: 00-<trace-id>-<span-id>-<flags>
         traceparent = request.headers.get("traceparent")
         if traceparent:
-            try:
-                parts = traceparent.split("-")
-                if len(parts) >= 4:
-                    trace_id_hex = parts[1]
-                    if len(trace_id_hex) == 32:
-                        trace_id_int = int(trace_id_hex, 16)
-                        self._context_state.workflow_trace_id.set(trace_id_int)
-            except Exception:
-                pass
+            try:
+                parts = traceparent.split("-")
+                if len(parts) >= 4:
+                    trace_id_hex = parts[1]
+                    # Enforce 128-bit hex and non-zero
+                    if len(trace_id_hex) == 32:
+                        trace_id_int = int(trace_id_hex, 16)
+                        if trace_id_int != 0:
+                            self._context_state.workflow_trace_id.set(trace_id_int)
+            except (ValueError, TypeError) as e:
+                logging.getLogger(__name__).debug("Ignoring invalid traceparent header '%s': %s", traceparent, e)
 
-        if not self._context_state.workflow_trace_id.get():
-            workflow_trace_id = request.headers.get("workflow-trace-id")
-            if workflow_trace_id:
-                try:
-                    self._context_state.workflow_trace_id.set(int(workflow_trace_id, 16))
-                except Exception:
-                    pass
+        if not self._context_state.workflow_trace_id.get():
+            workflow_trace_id = request.headers.get("workflow-trace-id")
+            if workflow_trace_id:
+                try:
+                    norm = workflow_trace_id.strip().lower().removeprefix("0x")
+                    if len(norm) == 32:
+                        tid = int(norm, 16)
+                        if tid != 0:
+                            self._context_state.workflow_trace_id.set(tid)
+                except (ValueError, TypeError) as e:
+                    logging.getLogger(__name__).debug("Ignoring invalid workflow-trace-id '%s': %s",
+                                                      workflow_trace_id, e)

Based on coding guidelines.


185-188: Validate workflow-run-id as UUID before accepting.

Prevents arbitrary strings; keeps telemetry clean.

-        workflow_run_id = request.headers.get("workflow-run-id")
-        if workflow_run_id:
-            self._context_state.workflow_run_id.set(workflow_run_id)
+        workflow_run_id = request.headers.get("workflow-run-id")
+        if workflow_run_id:
+            try:
+                uuid.UUID(workflow_run_id)
+            except (ValueError, AttributeError):
+                logging.getLogger(__name__).debug("Ignoring invalid workflow-run-id '%s'", workflow_run_id)
+            else:
+                self._context_state.workflow_run_id.set(workflow_run_id)

As per coding guidelines.

packages/nvidia_nat_opentelemetry/src/nat/plugins/opentelemetry/otel_span.py (2)

88-97: Prefer secrets.randbits for full-entropy IDs and simpler loops.

uuid4() provides ~122 bits entropy and adds structure; secrets.randbits (or os.urandom) gives full-width random bits and simpler code.

-        if context is None or isinstance(context, Context):
-            # Generate non-zero IDs per OTel spec
-            while True:
-                trace_id = uuid.uuid4().int & ((1 << 128) - 1)
-                if trace_id != 0:
-                    break
-            while True:
-                span_id = uuid.uuid4().int & ((1 << 64) - 1)
-                if span_id != 0:
-                    break
+        if context is None or isinstance(context, Context):
+            # Generate non-zero IDs per OTel spec
+            import secrets
+            trace_id = 0
+            while trace_id == 0:
+                trace_id = secrets.randbits(128)
+            span_id = 0
+            while span_id == 0:
+                span_id = secrets.randbits(64)

115-117: Avoid hard-coding instrumentation scope version.

Wire this to the package version (e.g., nvidia_nat_opentelemetry.__version__) or accept via constructor to keep telemetry accurate across releases.

-        self._instrumentation_scope = instrumentation_scope or InstrumentationScope("nat", "1.0.0")
+        from importlib.metadata import version, PackageNotFoundError
+        if instrumentation_scope is None:
+            try:
+                lib_version = version("nvidia-nat-opentelemetry")
+            except PackageNotFoundError:
+                lib_version = "0"
+            self._instrumentation_scope = InstrumentationScope("nat", lib_version)
+        else:
+            self._instrumentation_scope = instrumentation_scope
src/nat/builder/context.py (2)

122-134: Fix return-type docstring for input_message.

Docstring claims str but the property returns typing.Any. Align for correctness and typing clarity.

     def input_message(self):
         """
-        Retrieves the input message from the context state.
+        Retrieves the input message from the context state.
 
-        The input_message property is used to access the message stored in the
-        context state. This property returns the message as it is currently
-        maintained in the context.
+        The `input_message` property exposes the message stored in the context state.
 
         Returns:
-            str: The input message retrieved from the context state.
+            typing.Any: The input message retrieved from the context state.
         """
         return self._context_state.input_message.get()

Optionally add an explicit return annotation: def input_message(self) -> typing.Any: .... As per coding guidelines.


201-214: New workflow ID properties look good; consider docstrings with first-line summary style.

Minor nit: ensure the first line ends with a period and wraps code entities in backticks.

tests/nat/runtime/test_session_traceparent.py (1)

27-64: Add precedence and negative cases; remove unnecessary asyncio.

  • When both headers are present, assert traceparent wins.
  • Add malformed header cases (wrong length, zero trace_id) to ensure nothing is set.
  • Add a test for workflow-run-id UUID validation if you adopt the validation change.
  • Test is sync; you can drop @pytest.mark.asyncio and async def.

Example additions:

 @pytest.mark.parametrize("header_case", [
     "traceparent",
     "workflow-trace-id",
 ])
-@pytest.mark.asyncio
-async def test_session_sets_trace_id_from_headers(header_case: str):
+def test_session_sets_trace_id_from_headers(header_case: str):
     ...
 
+def test_traceparent_precedence_over_workflow_trace_id():
+    trace_id_hex = "c" * 32
+    other_hex = "d" * 32
+    headers = [
+        (b"traceparent", f"00-{trace_id_hex}-{'e'*16}-01".encode()),
+        (b"workflow-trace-id", other_hex.encode()),
+    ]
+    request = Request({"type": "http", "method": "GET", "path": "/", "headers": headers, "client": ("127.0.0.1", 0)})
+    ctx_state = ContextState.get()
+    prev = ctx_state.workflow_trace_id.set(None)
+    try:
+        SessionManager(workflow=_DummyWorkflow(), max_concurrency=0).set_metadata_from_http_request(request)
+        assert ctx_state.workflow_trace_id.get() == int(trace_id_hex, 16)
+    finally:
+        ctx_state.workflow_trace_id.reset(prev)
+
+@pytest.mark.parametrize("bad_header", ["", "abc", "0"*32, "g"*32])
+def test_ignores_malformed_traceparent(bad_header: str):
+    headers = [(b"traceparent", f"00-{bad_header}-{'f'*16}-01".encode())]
+    request = Request({"type": "http", "method": "GET", "path": "/", "headers": headers, "client": ("127.0.0.1", 0)})
+    ctx_state = ContextState.get()
+    prev = ctx_state.workflow_trace_id.set(None)
+    try:
+        SessionManager(workflow=_DummyWorkflow(), max_concurrency=0).set_metadata_from_http_request(request)
+        assert ctx_state.workflow_trace_id.get() is None
+    finally:
+        ctx_state.workflow_trace_id.reset(prev)

As per coding guidelines.

tests/nat/runtime/test_runner_trace_ids.py (1)

35-39: Strengthen assertions: also validate workflow_run_id is set during execution.

This ensures both IDs are present, not just the trace.

     async def ainvoke(self, _message, _to_type=None):
         ctx = Context.get()
-        assert isinstance(ctx.workflow_trace_id, int) and ctx.workflow_trace_id != 0
+        assert isinstance(ctx.workflow_trace_id, int) and ctx.workflow_trace_id != 0
+        assert isinstance(ctx.workflow_run_id, str) and ctx.workflow_run_id
         return {"ok": True}

     async def astream(self, _message, _to_type=None):
         ctx = Context.get()
-        assert isinstance(ctx.workflow_trace_id, int) and ctx.workflow_trace_id != 0
+        assert isinstance(ctx.workflow_trace_id, int) and ctx.workflow_trace_id != 0
+        assert isinstance(ctx.workflow_run_id, str) and ctx.workflow_run_id
         yield "chunk-1"

Also applies to: 40-44

src/nat/runtime/runner.py (2)

146-162: Duplicate run/trace-ID establishment in result and result_stream.

Extract into a small helper to reduce drift and ease future changes.

As per coding guidelines.

 class Runner:
+    def _establish_workflow_ids(self) -> tuple[str, int, typing.ContextManager[None], typing.ContextManager[None]]:
+        existing_run_id = self._context_state.workflow_run_id.get()
+        existing_trace_id = self._context_state.workflow_trace_id.get()
+        workflow_run_id = existing_run_id if existing_run_id else str(uuid.uuid4())
+        if existing_trace_id:
+            workflow_trace_id = existing_trace_id
+        else:
+            while True:
+                workflow_trace_id = uuid.uuid4().int
+                if workflow_trace_id != 0:
+                    break
+        tkn_run = self._context_state.workflow_run_id.set(workflow_run_id)
+        tkn_trace = self._context_state.workflow_trace_id.set(workflow_trace_id)
+        return workflow_run_id, workflow_trace_id, tkn_run, tkn_trace

Then replace the duplicated blocks with:

-            # Establish workflow run and trace identifiers
-            existing_run_id = ...
-            ...
-            token_run_id = self._context_state.workflow_run_id.set(workflow_run_id)
-            token_trace_id = self._context_state.workflow_trace_id.set(workflow_trace_id)
+            (workflow_run_id,
+             workflow_trace_id,
+             token_run_id,
+             token_trace_id) = self._establish_workflow_ids()

Also applies to: 231-247


205-207: Simplify logging to avoid message building.

Minor style; keep logger.error per guidelines when re-raising.

As per coding guidelines.

-            err_msg = f": {e}" if str(e).strip() else "."
-            logger.error("Error running workflow%s", err_msg)
+            logger.error("Error running workflow: %s", e)
src/nat/data_models/span.py (1)

153-166: Narrow exception handling in validators.

Catching Exception is unnecessary here; limit to ValueError | TypeError for int() coercion.

As per coding guidelines.

-        try:
-            value = int(v)
-        except Exception:
+        try:
+            value = int(v)
+        except (ValueError, TypeError):
             return _generate_nonzero_trace_id()
...
-        try:
-            value = int(v)
-        except Exception:
+        try:
+            value = int(v)
+        except (ValueError, TypeError):
             return _generate_nonzero_span_id()

Also applies to: 168-181

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 189a048 and 7e2e07e.

📒 Files selected for processing (9)
  • packages/nvidia_nat_opentelemetry/src/nat/plugins/opentelemetry/otel_span.py (1 hunks)
  • src/nat/builder/context.py (3 hunks)
  • src/nat/data_models/span.py (1 hunks)
  • src/nat/observability/exporter/span_exporter.py (3 hunks)
  • src/nat/runtime/runner.py (4 hunks)
  • src/nat/runtime/session.py (1 hunks)
  • tests/nat/opentelemetry/test_otel_span_ids.py (1 hunks)
  • tests/nat/runtime/test_runner_trace_ids.py (1 hunks)
  • tests/nat/runtime/test_session_traceparent.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (10)
**/*.{py,yaml,yml}

📄 CodeRabbit inference engine (.cursor/rules/nat-test-llm.mdc)

**/*.{py,yaml,yml}: Configure response_seq as a list of strings; values cycle per call, and [] yields an empty string.
Configure delay_ms to inject per-call artificial latency in milliseconds for nat_test_llm.

Files:

  • tests/nat/runtime/test_session_traceparent.py
  • src/nat/builder/context.py
  • packages/nvidia_nat_opentelemetry/src/nat/plugins/opentelemetry/otel_span.py
  • src/nat/runtime/session.py
  • tests/nat/opentelemetry/test_otel_span_ids.py
  • src/nat/runtime/runner.py
  • tests/nat/runtime/test_runner_trace_ids.py
  • src/nat/observability/exporter/span_exporter.py
  • src/nat/data_models/span.py
**/*.py

📄 CodeRabbit inference engine (.cursor/rules/nat-test-llm.mdc)

**/*.py: Programmatic use: create TestLLMConfig(response_seq=[...], delay_ms=...), add with builder.add_llm("", cfg).
When retrieving the test LLM wrapper, use builder.get_llm(name, wrapper_type=LLMFrameworkEnum.) and call the framework’s method (e.g., ainvoke, achat, call).

**/*.py: In code comments/identifiers use NAT abbreviations as specified: nat for API namespace/CLI, nvidia-nat for package name, NAT for env var prefixes; do not use these abbreviations in documentation
Follow PEP 20 and PEP 8; run yapf with column_limit=120; use 4-space indentation; end files with a single trailing newline
Run ruff check --fix as linter (not formatter) using pyproject.toml config; fix warnings unless explicitly ignored
Respect naming: snake_case for functions/variables, PascalCase for classes, UPPER_CASE for constants
Treat pyright warnings as errors during development
Exception handling: use bare raise to re-raise; log with logger.error() when re-raising to avoid duplicate stack traces; use logger.exception() when catching without re-raising
Provide Google-style docstrings for every public module, class, function, and CLI command; first line concise and ending with a period; surround code entities with backticks
Validate and sanitize all user input, especially in web or CLI interfaces
Prefer httpx with SSL verification enabled by default and follow OWASP Top-10 recommendations
Use async/await for I/O-bound work; profile CPU-heavy paths with cProfile or mprof before optimizing; cache expensive computations with functools.lru_cache or external cache; leverage NumPy vectorized operations when beneficial

Files:

  • tests/nat/runtime/test_session_traceparent.py
  • src/nat/builder/context.py
  • packages/nvidia_nat_opentelemetry/src/nat/plugins/opentelemetry/otel_span.py
  • src/nat/runtime/session.py
  • tests/nat/opentelemetry/test_otel_span_ids.py
  • src/nat/runtime/runner.py
  • tests/nat/runtime/test_runner_trace_ids.py
  • src/nat/observability/exporter/span_exporter.py
  • src/nat/data_models/span.py
tests/**/*.py

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

Unit tests reside under tests/ and should use markers defined in pyproject.toml (e.g., integration)

Files:

  • tests/nat/runtime/test_session_traceparent.py
  • tests/nat/opentelemetry/test_otel_span_ids.py
  • tests/nat/runtime/test_runner_trace_ids.py

⚙️ CodeRabbit configuration file

tests/**/*.py: - Ensure that tests are comprehensive, cover edge cases, and validate the functionality of the code. - Test functions should be named using the test_ prefix, using snake_case. - Any frequently repeated code should be extracted into pytest fixtures. - Pytest fixtures should define the name argument when applying the pytest.fixture decorator. The fixture
function being decorated should be named using the fixture_ prefix, using snake_case. Example:
@pytest.fixture(name="my_fixture")
def fixture_my_fixture():
pass

Files:

  • tests/nat/runtime/test_session_traceparent.py
  • tests/nat/opentelemetry/test_otel_span_ids.py
  • tests/nat/runtime/test_runner_trace_ids.py
{tests/**/*.py,examples/*/tests/**/*.py}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

{tests/**/*.py,examples/*/tests/**/*.py}: Use pytest (with pytest-asyncio for async); name test files test_*.py; test functions start with test_; extract repeated code into fixtures; fixtures must set name in decorator and be named with fixture_ prefix
Mock external services with pytest_httpserver or unittest.mock; do not hit live endpoints
Mark expensive tests with @pytest.mark.slow or @pytest.mark.integration

Files:

  • tests/nat/runtime/test_session_traceparent.py
  • tests/nat/opentelemetry/test_otel_span_ids.py
  • tests/nat/runtime/test_runner_trace_ids.py
**/*

⚙️ CodeRabbit configuration file

**/*: # Code Review Instructions

  • Ensure the code follows best practices and coding standards. - For Python code, follow
    PEP 20 and
    PEP 8 for style guidelines.
  • Check for security vulnerabilities and potential issues. - Python methods should use type hints for all parameters and return values.
    Example:
    def my_function(param1: int, param2: str) -> bool:
        pass
  • For Python exception handling, ensure proper stack trace preservation:
    • When re-raising exceptions: use bare raise statements to maintain the original stack trace,
      and use logger.error() (not logger.exception()) to avoid duplicate stack trace output.
    • When catching and logging exceptions without re-raising: always use logger.exception()
      to capture the full stack trace information.

Documentation Review Instructions - Verify that documentation and comments are clear and comprehensive. - Verify that the documentation doesn't contain any TODOs, FIXMEs or placeholder text like "lorem ipsum". - Verify that the documentation doesn't contain any offensive or outdated terms. - Verify that documentation and comments are free of spelling mistakes, ensure the documentation doesn't contain any

words listed in the ci/vale/styles/config/vocabularies/nat/reject.txt file, words that might appear to be
spelling mistakes but are listed in the ci/vale/styles/config/vocabularies/nat/accept.txt file are OK.

Misc. - All code (except .mdc files that contain Cursor rules) should be licensed under the Apache License 2.0,

and should contain an Apache License 2.0 header comment at the top of each file.

  • Confirm that copyright years are up-to date whenever a file is changed.

Files:

  • tests/nat/runtime/test_session_traceparent.py
  • src/nat/builder/context.py
  • packages/nvidia_nat_opentelemetry/src/nat/plugins/opentelemetry/otel_span.py
  • src/nat/runtime/session.py
  • tests/nat/opentelemetry/test_otel_span_ids.py
  • src/nat/runtime/runner.py
  • tests/nat/runtime/test_runner_trace_ids.py
  • src/nat/observability/exporter/span_exporter.py
  • src/nat/data_models/span.py
src/**/*.py

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

All importable Python code must live under src/ (or packages//src/)

Files:

  • src/nat/builder/context.py
  • src/nat/runtime/session.py
  • src/nat/runtime/runner.py
  • src/nat/observability/exporter/span_exporter.py
  • src/nat/data_models/span.py
src/nat/**/*

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

Changes in src/nat should prioritize backward compatibility

Files:

  • src/nat/builder/context.py
  • src/nat/runtime/session.py
  • src/nat/runtime/runner.py
  • src/nat/observability/exporter/span_exporter.py
  • src/nat/data_models/span.py

⚙️ CodeRabbit configuration file

This directory contains the core functionality of the toolkit. Changes should prioritize backward compatibility.

Files:

  • src/nat/builder/context.py
  • src/nat/runtime/session.py
  • src/nat/runtime/runner.py
  • src/nat/observability/exporter/span_exporter.py
  • src/nat/data_models/span.py
{src/**/*.py,packages/*/src/**/*.py}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

All public APIs must have Python 3.11+ type hints on parameters and return values; prefer typing/collections.abc abstractions; use typing.Annotated when useful

Files:

  • src/nat/builder/context.py
  • packages/nvidia_nat_opentelemetry/src/nat/plugins/opentelemetry/otel_span.py
  • src/nat/runtime/session.py
  • src/nat/runtime/runner.py
  • src/nat/observability/exporter/span_exporter.py
  • src/nat/data_models/span.py
packages/*/src/**/*.py

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

Importable Python code inside packages must live under packages//src/

Files:

  • packages/nvidia_nat_opentelemetry/src/nat/plugins/opentelemetry/otel_span.py
packages/**/*

⚙️ CodeRabbit configuration file

packages/**/*: - This directory contains optional plugin packages for the toolkit, each should contain a pyproject.toml file. - The pyproject.toml file should declare a dependency on nvidia-nat or another package with a name starting
with nvidia-nat-. This dependency should be declared using ~=<version>, and the version should be a two
digit version (ex: ~=1.0).

  • Not all packages contain Python code, if they do they should also contain their own set of tests, in a
    tests/ directory at the same level as the pyproject.toml file.

Files:

  • packages/nvidia_nat_opentelemetry/src/nat/plugins/opentelemetry/otel_span.py
🧬 Code graph analysis (7)
tests/nat/runtime/test_session_traceparent.py (2)
src/nat/builder/context.py (2)
  • ContextState (65-114)
  • workflow_trace_id (209-213)
src/nat/runtime/session.py (2)
  • workflow (84-85)
  • set_metadata_from_http_request (142-187)
src/nat/runtime/session.py (1)
src/nat/builder/context.py (4)
  • get (113-114)
  • get (305-315)
  • workflow_trace_id (209-213)
  • workflow_run_id (202-206)
tests/nat/opentelemetry/test_otel_span_ids.py (1)
packages/nvidia_nat_opentelemetry/src/nat/plugins/opentelemetry/otel_span.py (3)
  • get_span_context (315-321)
  • trace_id (280-286)
  • span_id (271-277)
src/nat/runtime/runner.py (3)
src/nat/builder/context.py (11)
  • Context (117-315)
  • ContextState (65-114)
  • workflow_run_id (202-206)
  • get (113-114)
  • get (305-315)
  • workflow_trace_id (209-213)
  • conversation_id (185-192)
  • intermediate_step_manager (171-182)
  • metadata (89-92)
  • metadata (151-160)
  • output (58-59)
src/nat/observability/exporter_manager.py (2)
  • get (326-335)
  • start (201-257)
tests/nat/runtime/test_runner_trace_ids.py (3)
  • start (48-58)
  • ainvoke (35-38)
  • astream (40-43)
tests/nat/runtime/test_runner_trace_ids.py (3)
src/nat/runtime/runner.py (7)
  • context (90-91)
  • Runner (50-300)
  • convert (93-94)
  • result (126-127)
  • result (130-131)
  • result (133-216)
  • result_stream (218-300)
src/nat/builder/context.py (4)
  • Context (117-315)
  • ContextState (65-114)
  • workflow_trace_id (209-213)
  • workflow_run_id (202-206)
src/nat/observability/exporter_manager.py (1)
  • ExporterManager (26-335)
src/nat/observability/exporter/span_exporter.py (2)
src/nat/builder/context.py (5)
  • workflow_trace_id (209-213)
  • get (113-114)
  • get (305-315)
  • conversation_id (185-192)
  • workflow_run_id (202-206)
src/nat/data_models/span.py (2)
  • SpanContext (147-181)
  • Span (184-237)
src/nat/data_models/span.py (1)
packages/nvidia_nat_opentelemetry/src/nat/plugins/opentelemetry/otel_span.py (2)
  • trace_id (280-286)
  • span_id (271-277)
🪛 Ruff (0.13.1)
src/nat/runtime/session.py

174-175: try-except-pass detected, consider logging the exception

(S110)


174-174: Do not catch blind exception: Exception

(BLE001)


182-183: try-except-pass detected, consider logging the exception

(S110)


182-182: Do not catch blind exception: Exception

(BLE001)

src/nat/runtime/runner.py

144-144: Abstract raise to an inner function

(TRY301)


144-144: Avoid specifying long messages outside the exception class

(TRY003)


221-221: Avoid specifying long messages outside the exception class

(TRY003)


229-229: Abstract raise to an inner function

(TRY301)


229-229: Avoid specifying long messages outside the exception class

(TRY003)

tests/nat/runtime/test_runner_trace_ids.py

32-32: Unused method argument: to_type

(ARG002)


35-35: Unused method argument: to_type

(ARG002)


40-40: Unused method argument: to_type

(ARG002)


48-48: Unused method argument: context_state

(ARG002)

src/nat/data_models/span.py

161-161: Do not catch blind exception: Exception

(BLE001)


176-176: Do not catch blind exception: Exception

(BLE001)

🔇 Additional comments (1)
src/nat/observability/exporter/span_exporter.py (1)

170-188: Nit: attribute formatting is solid; consider consistently hex-formatting only when non-None.

Current expression handles it; no action required.

Please run one span-exporting flow and confirm nat.span.kind is not UNKNOWN for WORKFLOW_* and FUNCTION_* events after the above fix.

Copy link
Member

@willkill07 willkill07 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know too much about UUIDs :)
This is nice, but we can clean it up a lot!

Signed-off-by: Daniel Wang <[email protected]>
@zhongxuanwang-nv zhongxuanwang-nv changed the base branch from develop to release/1.3 October 1, 2025 00:03
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (1)
src/nat/data_models/span.py (1)

142-145: Fix typo in field descriptions.

The field descriptions contain a typo: "OTel-syle" should be "OpenTelemetry-style" on both lines 143 and 145.

Apply this diff:

-    trace_id: int = Field(default_factory=_generate_nonzero_trace_id,
-                          description="The OTel-syle 128-bit trace ID of the span.")
-    span_id: int = Field(default_factory=_generate_nonzero_span_id,
-                         description="The OTel-syle 64-bit span ID of the span.")
+    trace_id: int = Field(default_factory=_generate_nonzero_trace_id,
+                          description="The OpenTelemetry-style 128-bit trace ID of the span.")
+    span_id: int = Field(default_factory=_generate_nonzero_span_id,
+                         description="The OpenTelemetry-style 64-bit span ID of the span.")
🧹 Nitpick comments (3)
src/nat/runtime/runner.py (1)

225-234: Consider extracting workflow ID establishment logic.

The workflow ID establishment logic is identical in both result() and result_stream() methods (lines 146-155 and 225-234). Consider extracting this into a helper method to follow DRY principles.

Example refactor:

def _establish_workflow_ids(self) -> tuple[str, int, Any, Any]:
    """Establish workflow run and trace identifiers.
    
    Returns:
        Tuple of (workflow_run_id, workflow_trace_id, token_run_id, token_trace_id)
    """
    existing_run_id = self._context_state.workflow_run_id.get()
    existing_trace_id = self._context_state.workflow_trace_id.get()
    
    workflow_run_id = existing_run_id or str(uuid.uuid4())
    workflow_trace_id = existing_trace_id or uuid.uuid4().int
    
    token_run_id = self._context_state.workflow_run_id.set(workflow_run_id)
    token_trace_id = self._context_state.workflow_trace_id.set(workflow_trace_id)
    
    return workflow_run_id, workflow_trace_id, token_run_id, token_trace_id

Then use it in both methods:

workflow_run_id, workflow_trace_id, token_run_id, token_trace_id = self._establish_workflow_ids()
src/nat/data_models/span.py (2)

147-157: Validator logic is correct; consider custom exception type.

The validation logic correctly handles UUID string conversion, None regeneration, and zero rejection. The implementation follows suggestions from previous reviews.

Static analysis suggests using a custom exception type instead of generic Exception. While the current implementation works, consider defining a custom exception for better error handling:

class InvalidSpanIdError(ValueError):
    """Raised when a span ID is invalid."""
    pass

Then update line 156:

-        raise Exception("A null UUID is invalid")
+        raise InvalidSpanIdError("A null UUID is invalid")

159-169: Validator logic is correct; note duplication.

The span_id validator correctly mirrors the trace_id validator logic. Similar to the trace_id validator, consider using a custom exception type (see previous comment).

The validation logic is duplicated between _validate_trace_id and _validate_span_id. Consider extracting common logic if this pattern is used elsewhere:

def _validate_id(v: int | str | None, generator_fn: callable, id_type: str) -> int:
    """Common validation logic for trace_id and span_id."""
    if isinstance(v, str):
        v = uuid.UUID(v).int
    if isinstance(v, type(None)):
        v = generator_fn()
    if v == 0:
        raise InvalidSpanIdError(f"A null {id_type} is invalid")
    return v

However, given the validators are already concise, this refactor is optional.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7e2e07e and 5b87eb8.

📒 Files selected for processing (3)
  • packages/nvidia_nat_opentelemetry/src/nat/plugins/opentelemetry/otel_span.py (1 hunks)
  • src/nat/data_models/span.py (1 hunks)
  • src/nat/runtime/runner.py (4 hunks)
🧰 Additional context used
📓 Path-based instructions (8)
**/*.{py,yaml,yml}

📄 CodeRabbit inference engine (.cursor/rules/nat-test-llm.mdc)

**/*.{py,yaml,yml}: Configure response_seq as a list of strings; values cycle per call, and [] yields an empty string.
Configure delay_ms to inject per-call artificial latency in milliseconds for nat_test_llm.

Files:

  • src/nat/data_models/span.py
  • src/nat/runtime/runner.py
  • packages/nvidia_nat_opentelemetry/src/nat/plugins/opentelemetry/otel_span.py
**/*.py

📄 CodeRabbit inference engine (.cursor/rules/nat-test-llm.mdc)

**/*.py: Programmatic use: create TestLLMConfig(response_seq=[...], delay_ms=...), add with builder.add_llm("", cfg).
When retrieving the test LLM wrapper, use builder.get_llm(name, wrapper_type=LLMFrameworkEnum.) and call the framework’s method (e.g., ainvoke, achat, call).

**/*.py: In code comments/identifiers use NAT abbreviations as specified: nat for API namespace/CLI, nvidia-nat for package name, NAT for env var prefixes; do not use these abbreviations in documentation
Follow PEP 20 and PEP 8; run yapf with column_limit=120; use 4-space indentation; end files with a single trailing newline
Run ruff check --fix as linter (not formatter) using pyproject.toml config; fix warnings unless explicitly ignored
Respect naming: snake_case for functions/variables, PascalCase for classes, UPPER_CASE for constants
Treat pyright warnings as errors during development
Exception handling: use bare raise to re-raise; log with logger.error() when re-raising to avoid duplicate stack traces; use logger.exception() when catching without re-raising
Provide Google-style docstrings for every public module, class, function, and CLI command; first line concise and ending with a period; surround code entities with backticks
Validate and sanitize all user input, especially in web or CLI interfaces
Prefer httpx with SSL verification enabled by default and follow OWASP Top-10 recommendations
Use async/await for I/O-bound work; profile CPU-heavy paths with cProfile or mprof before optimizing; cache expensive computations with functools.lru_cache or external cache; leverage NumPy vectorized operations when beneficial

Files:

  • src/nat/data_models/span.py
  • src/nat/runtime/runner.py
  • packages/nvidia_nat_opentelemetry/src/nat/plugins/opentelemetry/otel_span.py
src/**/*.py

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

All importable Python code must live under src/ (or packages//src/)

Files:

  • src/nat/data_models/span.py
  • src/nat/runtime/runner.py
src/nat/**/*

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

Changes in src/nat should prioritize backward compatibility

Files:

  • src/nat/data_models/span.py
  • src/nat/runtime/runner.py

⚙️ CodeRabbit configuration file

This directory contains the core functionality of the toolkit. Changes should prioritize backward compatibility.

Files:

  • src/nat/data_models/span.py
  • src/nat/runtime/runner.py
{src/**/*.py,packages/*/src/**/*.py}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

All public APIs must have Python 3.11+ type hints on parameters and return values; prefer typing/collections.abc abstractions; use typing.Annotated when useful

Files:

  • src/nat/data_models/span.py
  • src/nat/runtime/runner.py
  • packages/nvidia_nat_opentelemetry/src/nat/plugins/opentelemetry/otel_span.py
**/*

⚙️ CodeRabbit configuration file

**/*: # Code Review Instructions

  • Ensure the code follows best practices and coding standards. - For Python code, follow
    PEP 20 and
    PEP 8 for style guidelines.
  • Check for security vulnerabilities and potential issues. - Python methods should use type hints for all parameters and return values.
    Example:
    def my_function(param1: int, param2: str) -> bool:
        pass
  • For Python exception handling, ensure proper stack trace preservation:
    • When re-raising exceptions: use bare raise statements to maintain the original stack trace,
      and use logger.error() (not logger.exception()) to avoid duplicate stack trace output.
    • When catching and logging exceptions without re-raising: always use logger.exception()
      to capture the full stack trace information.

Documentation Review Instructions - Verify that documentation and comments are clear and comprehensive. - Verify that the documentation doesn't contain any TODOs, FIXMEs or placeholder text like "lorem ipsum". - Verify that the documentation doesn't contain any offensive or outdated terms. - Verify that documentation and comments are free of spelling mistakes, ensure the documentation doesn't contain any

words listed in the ci/vale/styles/config/vocabularies/nat/reject.txt file, words that might appear to be
spelling mistakes but are listed in the ci/vale/styles/config/vocabularies/nat/accept.txt file are OK.

Misc. - All code (except .mdc files that contain Cursor rules) should be licensed under the Apache License 2.0,

and should contain an Apache License 2.0 header comment at the top of each file.

  • Confirm that copyright years are up-to date whenever a file is changed.

Files:

  • src/nat/data_models/span.py
  • src/nat/runtime/runner.py
  • packages/nvidia_nat_opentelemetry/src/nat/plugins/opentelemetry/otel_span.py
packages/*/src/**/*.py

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

Importable Python code inside packages must live under packages//src/

Files:

  • packages/nvidia_nat_opentelemetry/src/nat/plugins/opentelemetry/otel_span.py
packages/**/*

⚙️ CodeRabbit configuration file

packages/**/*: - This directory contains optional plugin packages for the toolkit, each should contain a pyproject.toml file. - The pyproject.toml file should declare a dependency on nvidia-nat or another package with a name starting
with nvidia-nat-. This dependency should be declared using ~=<version>, and the version should be a two
digit version (ex: ~=1.0).

  • Not all packages contain Python code, if they do they should also contain their own set of tests, in a
    tests/ directory at the same level as the pyproject.toml file.

Files:

  • packages/nvidia_nat_opentelemetry/src/nat/plugins/opentelemetry/otel_span.py
🧬 Code graph analysis (2)
src/nat/data_models/span.py (1)
packages/nvidia_nat_opentelemetry/src/nat/plugins/opentelemetry/otel_span.py (2)
  • trace_id (274-280)
  • span_id (265-271)
src/nat/runtime/runner.py (3)
src/nat/builder/context.py (11)
  • Context (117-315)
  • ContextState (65-114)
  • workflow_run_id (202-206)
  • get (113-114)
  • get (305-315)
  • workflow_trace_id (209-213)
  • conversation_id (185-192)
  • intermediate_step_manager (171-182)
  • metadata (89-92)
  • metadata (151-160)
  • output (58-59)
src/nat/observability/exporter_manager.py (2)
  • get (326-335)
  • start (201-257)
tests/nat/runtime/test_runner_trace_ids.py (3)
  • start (48-58)
  • ainvoke (35-38)
  • astream (40-43)
🪛 Ruff (0.13.1)
src/nat/data_models/span.py

156-156: Create your own exception

(TRY002)


156-156: Avoid specifying long messages outside the exception class

(TRY003)


168-168: Create your own exception

(TRY002)


168-168: Avoid specifying long messages outside the exception class

(TRY003)

src/nat/runtime/runner.py

144-144: Abstract raise to an inner function

(TRY301)


144-144: Avoid specifying long messages outside the exception class

(TRY003)


215-215: Avoid specifying long messages outside the exception class

(TRY003)


223-223: Abstract raise to an inner function

(TRY301)


223-223: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (7)
packages/nvidia_nat_opentelemetry/src/nat/plugins/opentelemetry/otel_span.py (1)

89-91: LGTM: UUID v4 guarantees non-zero IDs.

The implementation correctly uses uuid.uuid4() to generate non-zero IDs as suggested in previous reviews. UUID v4 properties ensure the trace_id (128-bit) and span_id (64-bit) are always non-zero due to version and variant bits.

src/nat/runtime/runner.py (5)

146-155: LGTM: Workflow ID establishment follows suggested pattern.

The implementation correctly:

  • Preserves existing workflow_run_id and workflow_trace_id when present
  • Falls back to UUID generation when absent
  • Uses token-based context management for proper cleanup

157-189: LGTM: Proper workflow event emission with consistent metadata.

The implementation correctly:

  • Emits WORKFLOW_START and WORKFLOW_END events at appropriate points
  • Includes consistent metadata (workflow_run_id, workflow_trace_id as hex, conversation_id)
  • Attaches output data to WORKFLOW_END event
  • Integrates properly with the intermediate step manager

206-210: LGTM: Proper token cleanup in finally block.

The finally block correctly resets workflow_run_id and workflow_trace_id tokens, ensuring context isolation is maintained even when exceptions occur.


236-269: LGTM: Workflow event emission for streaming path.

The implementation correctly emits WORKFLOW_START and WORKFLOW_END events for the streaming path with consistent metadata. Note that similar duplication exists here as in the ID establishment logic (see previous comment about extracting helper methods).


284-288: LGTM: Proper token cleanup in finally block.

The finally block correctly resets workflow_run_id and workflow_trace_id tokens for the streaming path, maintaining consistency with the result() method.

src/nat/data_models/span.py (1)

131-138: LGTM: Non-zero ID generators align with UUID v4 guarantees.

The implementation correctly generates non-zero IDs using UUID v4 properties as suggested in previous reviews. The trace_id uses the full 128-bit UUID, while span_id uses the lower 64 bits via right shift.

@coderabbitai coderabbitai bot added the external This issue was filed by someone outside of the NeMo Agent toolkit team label Oct 1, 2025
@zhongxuanwang-nv
Copy link
Contributor Author

@willkill07 Thank you so much Will for all of the comments for PR-863!! I was scared when I saw there were 9 comments 😂 but I'm relieved to see most of them are about one issue (I should've dived deeper to realize that uuid.uuid4 already guarantees non-zero-ness!)

Signed-off-by: Zhongxuan (Daniel) Wang <[email protected]>
@copy-pr-bot
Copy link

copy-pr-bot bot commented Oct 1, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@zhongxuanwang-nv
Copy link
Contributor Author

/ok to test 3a2186b

@willkill07 willkill07 removed the external This issue was filed by someone outside of the NeMo Agent toolkit team label Oct 1, 2025
@dagardner-nv dagardner-nv deleted the branch NVIDIA:release/1.3 October 1, 2025 20:06
@willkill07 willkill07 reopened this Oct 1, 2025
Signed-off-by: Daniel Wang <[email protected]>
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (1)
src/nat/data_models/span.py (1)

142-145: Typo in field descriptions: "OTel-syle" should be "OpenTelemetry-style".

Both trace_id and span_id field descriptions contain the typo "OTel-syle".

Apply this diff:

-    trace_id: int = Field(default_factory=_generate_nonzero_trace_id,
-                          description="The OTel-syle 128-bit trace ID of the span.")
-    span_id: int = Field(default_factory=_generate_nonzero_span_id,
-                         description="The OTel-syle 64-bit span ID of the span.")
+    trace_id: int = Field(default_factory=_generate_nonzero_trace_id,
+                          description="The OpenTelemetry-style 128-bit trace ID of the span.")
+    span_id: int = Field(default_factory=_generate_nonzero_span_id,
+                         description="The OpenTelemetry-style 64-bit span ID of the span.")
🧹 Nitpick comments (1)
src/nat/runtime/session.py (1)

175-176: Consider logging header parse failures for observability.

The bare except Exception: pass blocks silently swallow all parsing errors, making it difficult to diagnose misconfigurations or malformed headers from upstream callers.

Add debug-level logging to improve observability while maintaining resilience:

         except Exception:
+            logger.debug("Failed to parse traceparent header: %s", traceparent, exc_info=True)
             pass
         except Exception:
+            logger.debug("Failed to parse workflow-trace-id header: %s", workflow_trace_id, exc_info=True)
             pass

Based on coding guidelines (exception handling should use logger.error() when re-raising or logger.exception() when catching).

Also applies to: 183-184

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3a2186b and 471cda2.

📒 Files selected for processing (3)
  • src/nat/data_models/span.py (1 hunks)
  • src/nat/runtime/session.py (2 hunks)
  • tests/nat/runtime/test_session_traceparent.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (8)
**/*.{py,yaml,yml}

📄 CodeRabbit inference engine (.cursor/rules/nat-test-llm.mdc)

**/*.{py,yaml,yml}: Configure response_seq as a list of strings; values cycle per call, and [] yields an empty string.
Configure delay_ms to inject per-call artificial latency in milliseconds for nat_test_llm.

Files:

  • tests/nat/runtime/test_session_traceparent.py
  • src/nat/runtime/session.py
  • src/nat/data_models/span.py
**/*.py

📄 CodeRabbit inference engine (.cursor/rules/nat-test-llm.mdc)

**/*.py: Programmatic use: create TestLLMConfig(response_seq=[...], delay_ms=...), add with builder.add_llm("", cfg).
When retrieving the test LLM wrapper, use builder.get_llm(name, wrapper_type=LLMFrameworkEnum.) and call the framework’s method (e.g., ainvoke, achat, call).

**/*.py: In code comments/identifiers use NAT abbreviations as specified: nat for API namespace/CLI, nvidia-nat for package name, NAT for env var prefixes; do not use these abbreviations in documentation
Follow PEP 20 and PEP 8; run yapf with column_limit=120; use 4-space indentation; end files with a single trailing newline
Run ruff check --fix as linter (not formatter) using pyproject.toml config; fix warnings unless explicitly ignored
Respect naming: snake_case for functions/variables, PascalCase for classes, UPPER_CASE for constants
Treat pyright warnings as errors during development
Exception handling: use bare raise to re-raise; log with logger.error() when re-raising to avoid duplicate stack traces; use logger.exception() when catching without re-raising
Provide Google-style docstrings for every public module, class, function, and CLI command; first line concise and ending with a period; surround code entities with backticks
Validate and sanitize all user input, especially in web or CLI interfaces
Prefer httpx with SSL verification enabled by default and follow OWASP Top-10 recommendations
Use async/await for I/O-bound work; profile CPU-heavy paths with cProfile or mprof before optimizing; cache expensive computations with functools.lru_cache or external cache; leverage NumPy vectorized operations when beneficial

Files:

  • tests/nat/runtime/test_session_traceparent.py
  • src/nat/runtime/session.py
  • src/nat/data_models/span.py
tests/**/*.py

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

Unit tests reside under tests/ and should use markers defined in pyproject.toml (e.g., integration)

Files:

  • tests/nat/runtime/test_session_traceparent.py

⚙️ CodeRabbit configuration file

tests/**/*.py: - Ensure that tests are comprehensive, cover edge cases, and validate the functionality of the code. - Test functions should be named using the test_ prefix, using snake_case. - Any frequently repeated code should be extracted into pytest fixtures. - Pytest fixtures should define the name argument when applying the pytest.fixture decorator. The fixture
function being decorated should be named using the fixture_ prefix, using snake_case. Example:
@pytest.fixture(name="my_fixture")
def fixture_my_fixture():
pass

Files:

  • tests/nat/runtime/test_session_traceparent.py
{tests/**/*.py,examples/*/tests/**/*.py}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

{tests/**/*.py,examples/*/tests/**/*.py}: Use pytest (with pytest-asyncio for async); name test files test_*.py; test functions start with test_; extract repeated code into fixtures; fixtures must set name in decorator and be named with fixture_ prefix
Mock external services with pytest_httpserver or unittest.mock; do not hit live endpoints
Mark expensive tests with @pytest.mark.slow or @pytest.mark.integration

Files:

  • tests/nat/runtime/test_session_traceparent.py
**/*

⚙️ CodeRabbit configuration file

**/*: # Code Review Instructions

  • Ensure the code follows best practices and coding standards. - For Python code, follow
    PEP 20 and
    PEP 8 for style guidelines.
  • Check for security vulnerabilities and potential issues. - Python methods should use type hints for all parameters and return values.
    Example:
    def my_function(param1: int, param2: str) -> bool:
        pass
  • For Python exception handling, ensure proper stack trace preservation:
    • When re-raising exceptions: use bare raise statements to maintain the original stack trace,
      and use logger.error() (not logger.exception()) to avoid duplicate stack trace output.
    • When catching and logging exceptions without re-raising: always use logger.exception()
      to capture the full stack trace information.

Documentation Review Instructions - Verify that documentation and comments are clear and comprehensive. - Verify that the documentation doesn't contain any TODOs, FIXMEs or placeholder text like "lorem ipsum". - Verify that the documentation doesn't contain any offensive or outdated terms. - Verify that documentation and comments are free of spelling mistakes, ensure the documentation doesn't contain any

words listed in the ci/vale/styles/config/vocabularies/nat/reject.txt file, words that might appear to be
spelling mistakes but are listed in the ci/vale/styles/config/vocabularies/nat/accept.txt file are OK.

Misc. - All code (except .mdc files that contain Cursor rules) should be licensed under the Apache License 2.0,

and should contain an Apache License 2.0 header comment at the top of each file.

  • Confirm that copyright years are up-to date whenever a file is changed.

Files:

  • tests/nat/runtime/test_session_traceparent.py
  • src/nat/runtime/session.py
  • src/nat/data_models/span.py
src/**/*.py

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

All importable Python code must live under src/ (or packages//src/)

Files:

  • src/nat/runtime/session.py
  • src/nat/data_models/span.py
src/nat/**/*

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

Changes in src/nat should prioritize backward compatibility

Files:

  • src/nat/runtime/session.py
  • src/nat/data_models/span.py

⚙️ CodeRabbit configuration file

This directory contains the core functionality of the toolkit. Changes should prioritize backward compatibility.

Files:

  • src/nat/runtime/session.py
  • src/nat/data_models/span.py
{src/**/*.py,packages/*/src/**/*.py}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

All public APIs must have Python 3.11+ type hints on parameters and return values; prefer typing/collections.abc abstractions; use typing.Annotated when useful

Files:

  • src/nat/runtime/session.py
  • src/nat/data_models/span.py
🧬 Code graph analysis (3)
tests/nat/runtime/test_session_traceparent.py (2)
src/nat/runtime/session.py (4)
  • context (89-90)
  • workflow (85-86)
  • session (93-128)
  • set_metadata_from_http_request (143-188)
src/nat/builder/context.py (5)
  • ContextState (65-114)
  • workflow_trace_id (209-213)
  • conversation_id (185-192)
  • user_message_id (195-199)
  • workflow_run_id (202-206)
src/nat/runtime/session.py (1)
src/nat/builder/context.py (4)
  • get (113-114)
  • get (305-315)
  • workflow_trace_id (209-213)
  • workflow_run_id (202-206)
src/nat/data_models/span.py (1)
packages/nvidia_nat_opentelemetry/src/nat/plugins/opentelemetry/otel_span.py (2)
  • trace_id (274-280)
  • span_id (265-271)
🪛 Ruff (0.13.2)
src/nat/runtime/session.py

175-176: try-except-pass detected, consider logging the exception

(S110)


175-175: Do not catch blind exception: Exception

(BLE001)


183-184: try-except-pass detected, consider logging the exception

(S110)


183-183: Do not catch blind exception: Exception

(BLE001)

src/nat/data_models/span.py

156-156: Create your own exception

(TRY002)


156-156: Avoid specifying long messages outside the exception class

(TRY003)


167-167: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


167-167: Create your own exception

(TRY002)


167-167: Avoid specifying long messages outside the exception class

(TRY003)


171-171: Create your own exception

(TRY002)


171-171: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (7)
src/nat/runtime/session.py (2)

186-189: LGTM: workflow-run-id handling is straightforward.

The workflow run ID is treated as an opaque string identifier, which is appropriate for correlation purposes.


165-177: Critical: uuid.UUID() cannot parse unhyphenated 32-character hex strings.

Line 173 calls uuid.UUID(trace_id_hex).int where trace_id_hex is extracted from the W3C traceparent header (format: 00-{32-hex}-{16-hex}-{flags}). The trace ID portion is 32 contiguous hexadecimal characters without hyphens, but uuid.UUID() expects either a hyphenated UUID string or an explicit hex= parameter.

This will raise a ValueError on every valid traceparent header and fall through to the silent exception handler, effectively breaking W3C trace context propagation.

Apply this diff to fix the parsing:

             parts = traceparent.split("-")
             if len(parts) >= 4:
                 trace_id_hex = parts[1]
                 if len(trace_id_hex) == 32:
-                    trace_id_int = uuid.UUID(trace_id_hex).int
+                    trace_id_int = int(trace_id_hex, 16)
                     self._context_state.workflow_trace_id.set(trace_id_int)

Alternatively, if you prefer to use uuid.UUID() for validation:

-                    trace_id_int = uuid.UUID(trace_id_hex).int
+                    trace_id_int = uuid.UUID(hex=trace_id_hex).int

Likely an incorrect or invalid review comment.

tests/nat/runtime/test_session_traceparent.py (4)

76-103: Well-structured test with proper cleanup.

The test correctly:

  • Constructs a Starlette Request with parametrized headers
  • Obtains the singleton ContextState and sets/resets tokens in a try/finally block
  • Verifies that set_metadata_from_http_request populates workflow_trace_id correctly

Token cleanup ensures test isolation.


123-159: Thorough metadata validation test.

The test validates all metadata headers (conversation-id, user-message-id, workflow-run-id) both individually and in combination, with proper context variable isolation using token reset in finally blocks.


51-54: Error cases are well covered.

The test suite includes error handling scenarios:

  • Both headers invalid (lines 51-54)
  • Negative and overflow values for workflow-trace-id (lines 71-72)
  • Malformed traceparent with fallback (lines 45-49)

This addresses the past review comment requesting error case coverage.

Based on learnings.

Also applies to: 71-72


36-73: Parametrized tests are comprehensive and session.py’s UUID parsing is correct
The use of uuid.UUID(trace_id_hex) and uuid.UUID(workflow_trace_id) already accepts both 32-character hex strings (case-insensitive) and standard hyphenated UUID formats (and rejects invalid lengths), so all TRACE_ID_CASES will pass without changes to session.py.

src/nat/data_models/span.py (1)

131-139: LGTM: ID generators correctly produce non-zero OpenTelemetry-sized IDs.

  • _generate_nonzero_trace_id() generates a 128-bit random integer using uuid.uuid4().int
  • _generate_nonzero_span_id() derives a 64-bit span ID by shifting the UUID right by 64 bits

Both functions are guaranteed to produce non-zero values due to UUID version and variant bits.

Based on learnings (past review comments from willkill07).

Copy link
Member

@willkill07 willkill07 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Last round of feedback. Otherwise LGTM

@zhongxuanwang-nv
Copy link
Contributor Author

/ok to test 9e43759

@coderabbitai coderabbitai bot added the ! - Release Related to a Major Release label Oct 2, 2025
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
src/nat/data_models/span.py (1)

143-145: Fix typo in field descriptions.

The descriptions contain "OTel-syle" which should be "OpenTelemetry-style" on both lines 143 and 145.

Apply this diff:

-                          description="The OTel-syle 128-bit trace ID of the span.")
+                          description="The OpenTelemetry-style 128-bit trace ID of the span.")
     span_id: int = Field(default_factory=_generate_nonzero_span_id,
-                         description="The OTel-syle 64-bit span ID of the span.")
+                         description="The OpenTelemetry-style 64-bit span ID of the span.")
🧹 Nitpick comments (1)
src/nat/data_models/span.py (1)

151-152: UUID parsing may fail for unhyphenated hex strings.

The validator uses uuid.UUID(v) which requires hyphens or braces. If trace_id strings arrive as raw 32-character hex (e.g., from external systems or headers), parsing will fail.

Apply this diff to handle both formats:

         if isinstance(v, str):
-            v = uuid.UUID(v).int
+            try:
+                # Support both hyphenated UUID and 32-char hex
+                if len(v) == 32 and '-' not in v:
+                    v = uuid.UUID(hex=v).int
+                else:
+                    v = uuid.UUID(v).int
+            except (ValueError, AttributeError) as err:
+                raise ValueError(f"Invalid trace_id format: {v}") from err
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 471cda2 and 9e43759.

📒 Files selected for processing (2)
  • src/nat/data_models/span.py (1 hunks)
  • tests/nat/runtime/test_session_traceparent.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/nat/runtime/test_session_traceparent.py
🧰 Additional context used
📓 Path-based instructions (6)
**/*.{py,yaml,yml}

📄 CodeRabbit inference engine (.cursor/rules/nat-test-llm.mdc)

**/*.{py,yaml,yml}: Configure response_seq as a list of strings; values cycle per call, and [] yields an empty string.
Configure delay_ms to inject per-call artificial latency in milliseconds for nat_test_llm.

Files:

  • src/nat/data_models/span.py
**/*.py

📄 CodeRabbit inference engine (.cursor/rules/nat-test-llm.mdc)

**/*.py: Programmatic use: create TestLLMConfig(response_seq=[...], delay_ms=...), add with builder.add_llm("", cfg).
When retrieving the test LLM wrapper, use builder.get_llm(name, wrapper_type=LLMFrameworkEnum.) and call the framework’s method (e.g., ainvoke, achat, call).

**/*.py: In code comments/identifiers use NAT abbreviations as specified: nat for API namespace/CLI, nvidia-nat for package name, NAT for env var prefixes; do not use these abbreviations in documentation
Follow PEP 20 and PEP 8; run yapf with column_limit=120; use 4-space indentation; end files with a single trailing newline
Run ruff check --fix as linter (not formatter) using pyproject.toml config; fix warnings unless explicitly ignored
Respect naming: snake_case for functions/variables, PascalCase for classes, UPPER_CASE for constants
Treat pyright warnings as errors during development
Exception handling: use bare raise to re-raise; log with logger.error() when re-raising to avoid duplicate stack traces; use logger.exception() when catching without re-raising
Provide Google-style docstrings for every public module, class, function, and CLI command; first line concise and ending with a period; surround code entities with backticks
Validate and sanitize all user input, especially in web or CLI interfaces
Prefer httpx with SSL verification enabled by default and follow OWASP Top-10 recommendations
Use async/await for I/O-bound work; profile CPU-heavy paths with cProfile or mprof before optimizing; cache expensive computations with functools.lru_cache or external cache; leverage NumPy vectorized operations when beneficial

Files:

  • src/nat/data_models/span.py
src/**/*.py

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

All importable Python code must live under src/ (or packages//src/)

Files:

  • src/nat/data_models/span.py
src/nat/**/*

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

Changes in src/nat should prioritize backward compatibility

Files:

  • src/nat/data_models/span.py

⚙️ CodeRabbit configuration file

This directory contains the core functionality of the toolkit. Changes should prioritize backward compatibility.

Files:

  • src/nat/data_models/span.py
{src/**/*.py,packages/*/src/**/*.py}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

All public APIs must have Python 3.11+ type hints on parameters and return values; prefer typing/collections.abc abstractions; use typing.Annotated when useful

Files:

  • src/nat/data_models/span.py
**/*

⚙️ CodeRabbit configuration file

**/*: # Code Review Instructions

  • Ensure the code follows best practices and coding standards. - For Python code, follow
    PEP 20 and
    PEP 8 for style guidelines.
  • Check for security vulnerabilities and potential issues. - Python methods should use type hints for all parameters and return values.
    Example:
    def my_function(param1: int, param2: str) -> bool:
        pass
  • For Python exception handling, ensure proper stack trace preservation:
    • When re-raising exceptions: use bare raise statements to maintain the original stack trace,
      and use logger.error() (not logger.exception()) to avoid duplicate stack trace output.
    • When catching and logging exceptions without re-raising: always use logger.exception()
      to capture the full stack trace information.

Documentation Review Instructions - Verify that documentation and comments are clear and comprehensive. - Verify that the documentation doesn't contain any TODOs, FIXMEs or placeholder text like "lorem ipsum". - Verify that the documentation doesn't contain any offensive or outdated terms. - Verify that documentation and comments are free of spelling mistakes, ensure the documentation doesn't contain any

words listed in the ci/vale/styles/config/vocabularies/nat/reject.txt file, words that might appear to be
spelling mistakes but are listed in the ci/vale/styles/config/vocabularies/nat/accept.txt file are OK.

Misc. - All code (except .mdc files that contain Cursor rules) should be licensed under the Apache License 2.0,

and should contain an Apache License 2.0 header comment at the top of each file.

  • Confirm that copyright years are up-to date whenever a file is changed.

Files:

  • src/nat/data_models/span.py
🧬 Code graph analysis (1)
src/nat/data_models/span.py (1)
packages/nvidia_nat_opentelemetry/src/nat/plugins/opentelemetry/otel_span.py (2)
  • trace_id (274-280)
  • span_id (265-271)
🪛 Ruff (0.13.2)
src/nat/data_models/span.py

156-156: Avoid specifying long messages outside the exception class

(TRY003)


167-167: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


167-167: Avoid specifying long messages outside the exception class

(TRY003)


171-171: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: CI Pipeline / Check
🔇 Additional comments (2)
src/nat/data_models/span.py (2)

131-133: LGTM!

The trace ID generator correctly uses uuid.uuid4().int to produce a 128-bit non-zero value, aligning with OpenTelemetry standards.


136-138: LGTM!

The span ID generator correctly extracts the lower 64 bits from a UUID, producing a non-zero 64-bit value per OpenTelemetry conventions.

Signed-off-by: Daniel Wang <[email protected]>
@zhongxuanwang-nv
Copy link
Contributor Author

/ok to test 58ae9cc

@zhongxuanwang-nv zhongxuanwang-nv merged commit cb31e79 into NVIDIA:release/1.3 Oct 2, 2025
17 checks passed
yczhang-nv pushed a commit to yczhang-nv/NeMo-Agent-Toolkit that referenced this pull request Oct 2, 2025
## Description
<!-- Note: The pull request title will be included in the CHANGELOG. -->
<!-- Provide a standalone description of changes in this PR. -->
<!-- Reference any issues closed by this PR with "closes #1234". All PRs
should have an issue they close-->
Closes 1897

**Added**:
- `ContextState` fields: `workflow_run_id`, `workflow_trace_id`
(`src/nat/builder/context.py`)
- Non-zero ID generators & validators for OpenTelemetry-sized IDs
(`src/nat/data_models/span.py`)
- Workflow start/end event emission with metadata: `workflow_run_id`,
`workflow_trace_id`, `conversation_id` (`src/nat/runtime/runner.py`)
- Span attributes for correlation: `.workflow.run_id`,
`.workflow.trace_id`, `.conversation.id`
(`src/nat/observability/exporter/span_exporter.py`)
- Tests: `tests/nat/opentelemetry/test_otel_span_ids.py`,
`tests/nat/runtime/test_runner_trace_ids.py`,
`tests/nat/runtime/test_session_traceparent.py`, updates to
`tests/nat/runtime/test_runner.py`

**Changed**:
- Ensure non-zero 128-bit trace IDs and 64-bit span IDs in OTEL plugin
(`packages/nvidia_nat_opentelemetry/src/nat/plugins/opentelemetry/otel_span.py`)
- Runner establishes workflow-scoped run/trace IDs and resets them on
exit (`src/nat/runtime/runner.py`)
- Exporter prefers parent/context trace ID; falls back to workflow trace
ID and sets correlated attributes
(`src/nat/observability/exporter/span_exporter.py`)
- `SpanContext` defaults & field validators enforce size/range and
non-zero invariants (`src/nat/data_models/span.py`)
- Parse/propagate incoming W3C `traceparent` and workflow headers to
session/runner (`src/nat/runtime/session.py`,
`tests/nat/runtime/test_session_traceparent.py`)

**Removed**:
- None

**Why it matters**:
- Consistent cross-span correlation and easier log/trace joins across
agents and sub-spans
- Fewer broken traces: eliminates zero/invalid IDs, aligns with
OpenTelemetry sizing
- Improves debugging and fleet-wide reporting via explicit `.workflow.*`
and `.conversation.id` attributes
- Safer header handling for external callers; preserves upstream traces

**Breaking changes / Migrations**:
- None

**Docs**:
- None

## By Submitting this PR I confirm:
- I am familiar with the [Contributing
Guidelines](https://github.com/NVIDIA/NeMo-Agent-Toolkit/blob/develop/docs/source/resources/contributing.md).
- We require that all contributors "sign-off" on their commits. This
certifies that the contribution is your original work, or you have
rights to submit it under the same license, or a compatible license.
- Any contribution which contains commits that are not Signed-Off will
not be accepted.
- When the PR is ready for review, new or existing tests cover these
changes.
- When the PR is ready for review, the documentation is up to date with
these changes.

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Workflow-level run and trace IDs are created, stored in context,
propagated, and emitted with WORKFLOW_START/WORKFLOW_END events.
* Incoming HTTP tracing headers (traceparent, workflow-trace-id,
workflow-run-id) are parsed into context.

* **Improvements**
* Span metadata now includes workflow IDs and richer attributes (event
type, function info, timestamps, conversation/run IDs).
* ID generation and validation tightened to ensure non-zero, correctly
sized trace and span IDs; span context ID generation approach updated.

* **Tests**
* Added tests validating span ID formats, runner trace/run ID behavior,
and header parsing.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Daniel Wang <[email protected]>
Signed-off-by: Zhongxuan (Daniel) Wang <[email protected]>
Signed-off-by: Yuchen Zhang <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

! - Release Related to a Major Release improvement Improvement to existing functionality non-breaking Non-breaking change

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants