-
Notifications
You must be signed in to change notification settings - Fork 416
Add hierarchical IDs for consistent telemetry and reporting #863
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add hierarchical IDs for consistent telemetry and reporting #863
Conversation
Signed-off-by: Daniel Wang <[email protected]>
WalkthroughGenerate non-zero OTEL-style 128-bit trace IDs and 64-bit span IDs with stricter validation and parsing; add Context workflow_run_id and workflow_trace_id; parse trace headers into context; propagate workflow IDs through Runner (emit WORKFLOW_START/END) and Exporter (trace fallback and attributes); add tests for IDs and header parsing. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
actor Client
participant SessionManager
participant ContextState
Client->>SessionManager: HTTP request (headers)
alt traceparent present
note right of SessionManager #DDEBF7: extract 32-hex trace_id → int
SessionManager->>ContextState: set workflow_trace_id
else workflow-trace-id present
note right of SessionManager #F3F2DD: parse UUID/hex → int
SessionManager->>ContextState: set workflow_trace_id
end
alt workflow-run-id present
SessionManager->>ContextState: set workflow_run_id
end
sequenceDiagram
autonumber
participant Runner
participant ContextState
participant ExporterManager
participant Function
Runner->>ContextState: derive/set workflow_run_id & workflow_trace_id (tokens)
Runner->>ExporterManager: start()
note right of Runner #E8F8E8: emit WORKFLOW_START (TraceMetadata)
Runner->>Function: execute (result or stream)
Function-->>Runner: output(s)
note right of Runner #FFEFE0: emit WORKFLOW_END (TraceMetadata + payload)
Runner->>ContextState: restore previous values (finally)
sequenceDiagram
autonumber
participant Exporter
participant ContextState
participant SpanSystem
Exporter->>ContextState: read workflow_trace_id, workflow_run_id
alt parent span exists
Exporter->>SpanSystem: create sub-span (use parent.trace_id)
else workflow_trace_id available
Exporter->>SpanSystem: create sub-span (trace_id = workflow_trace_id)
else
Exporter->>SpanSystem: create sub-span (auto-generated non-zero IDs)
end
note over SpanSystem #F5F5FF: span attributes include workflow.run_id and workflow.trace_id (formatted 032x)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Suggested labels
Pre-merge checks and finishing touches✅ Passed checks (3 passed)
✨ Finishing touches
🧪 Generate unit tests
📜 Recent review detailsConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro 📒 Files selected for processing (1)
🧰 Additional context used📓 Path-based instructions (6)**/*.{py,yaml,yml}📄 CodeRabbit inference engine (.cursor/rules/nat-test-llm.mdc)
Files:
**/*.py📄 CodeRabbit inference engine (.cursor/rules/nat-test-llm.mdc)
Files:
src/**/*.py📄 CodeRabbit inference engine (.cursor/rules/general.mdc)
Files:
src/nat/**/*📄 CodeRabbit inference engine (.cursor/rules/general.mdc)
Files:
⚙️ CodeRabbit configuration file
Files:
{src/**/*.py,packages/*/src/**/*.py}📄 CodeRabbit inference engine (.cursor/rules/general.mdc)
Files:
**/*⚙️ CodeRabbit configuration file
Files:
🧬 Code graph analysis (1)src/nat/data_models/span.py (1)
🪛 Ruff (0.13.2)src/nat/data_models/span.py156-156: Avoid specifying long messages outside the exception class (TRY003) 167-167: Within an (B904) 167-167: Avoid specifying long messages outside the exception class (TRY003) 171-171: Avoid specifying long messages outside the exception class (TRY003) 🔇 Additional comments (3)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
@coderabbitai review |
|
@zhongxuanwang-nv I'll review the changes in this PR that adds hierarchical IDs to NAT for consistent telemetry and reporting. 🧠 Learnings used✅ Actions performedReview triggered.
|
Signed-off-by: Zhongxuan (Daniel) Wang <[email protected]>
Signed-off-by: Daniel Wang <[email protected]>
Signed-off-by: Daniel Wang <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/nat/observability/exporter/span_exporter.py (1)
156-160: Span kind always falls back to UNKNOWN due to Enum passed to string-keyed map; subspan name also shows Enum repr.
event_type_to_span_kindexpects a string key (e.g.,"WORKFLOW_START"), butevent.event_typeis an Enum; similarlysub_span_nameshould use the string value. This degrades telemetry classification.- if event.payload.name: - sub_span_name = f"{event.payload.name}" - else: - sub_span_name = f"{event.payload.event_type}" + if event.payload.name: + sub_span_name = f"{event.payload.name}" + else: + sub_span_name = f"{getattr(event.payload.event_type, 'value', str(event.payload.event_type))}" - span_kind = event_type_to_span_kind(event.event_type) + span_kind = event_type_to_span_kind(getattr(event.event_type, 'value', str(event.event_type)))Also applies to: 196-198
🧹 Nitpick comments (13)
tests/nat/opentelemetry/test_otel_span_ids.py (1)
19-25: Strengthen assertions to cover accessors and parent propagation.Add quick checks that the
OtelSpanaccessors reflect the same IDs as the context, and that a child span inherits the trace_id from its parent.Apply:
def test_otel_span_ids_are_non_zero(): s = OtelSpan(name="test", context=None, parent=None, attributes={}) ctx = s.get_span_context() assert ctx.trace_id != 0 assert ctx.span_id != 0 assert len(f"{ctx.trace_id:032x}") == 32 assert len(f"{ctx.span_id:016x}") == 16 + # Accessors mirror context + assert s.trace_id == ctx.trace_id + assert s.span_id == ctx.span_id + # Child inherits parent's trace_id + child = OtelSpan(name="child", context=None, parent=s, attributes={}) + assert child.trace_id == s.trace_idsrc/nat/runtime/session.py (3)
16-23: Add module logger for parse diagnostics.import asyncio import contextvars import typing from collections.abc import Awaitable from collections.abc import Callable from contextlib import asynccontextmanager from contextlib import nullcontext +import logging +import uuidAs per coding guidelines.
164-187: Avoid bareexcept, validate 32‑nybble non‑zero IDs, and log parse failures.Current
try/except: passdrops signal and accepts malformed inputs. Narrow exceptions, enforce spec length/non‑zero, and log once.- # W3C Trace Context header: traceparent: 00-<trace-id>-<span-id>-<flags> + # W3C Trace Context header: traceparent: 00-<trace-id>-<span-id>-<flags> traceparent = request.headers.get("traceparent") if traceparent: - try: - parts = traceparent.split("-") - if len(parts) >= 4: - trace_id_hex = parts[1] - if len(trace_id_hex) == 32: - trace_id_int = int(trace_id_hex, 16) - self._context_state.workflow_trace_id.set(trace_id_int) - except Exception: - pass + try: + parts = traceparent.split("-") + if len(parts) >= 4: + trace_id_hex = parts[1] + # Enforce 128-bit hex and non-zero + if len(trace_id_hex) == 32: + trace_id_int = int(trace_id_hex, 16) + if trace_id_int != 0: + self._context_state.workflow_trace_id.set(trace_id_int) + except (ValueError, TypeError) as e: + logging.getLogger(__name__).debug("Ignoring invalid traceparent header '%s': %s", traceparent, e) - if not self._context_state.workflow_trace_id.get(): - workflow_trace_id = request.headers.get("workflow-trace-id") - if workflow_trace_id: - try: - self._context_state.workflow_trace_id.set(int(workflow_trace_id, 16)) - except Exception: - pass + if not self._context_state.workflow_trace_id.get(): + workflow_trace_id = request.headers.get("workflow-trace-id") + if workflow_trace_id: + try: + norm = workflow_trace_id.strip().lower().removeprefix("0x") + if len(norm) == 32: + tid = int(norm, 16) + if tid != 0: + self._context_state.workflow_trace_id.set(tid) + except (ValueError, TypeError) as e: + logging.getLogger(__name__).debug("Ignoring invalid workflow-trace-id '%s': %s", + workflow_trace_id, e)Based on coding guidelines.
185-188: Validateworkflow-run-idas UUID before accepting.Prevents arbitrary strings; keeps telemetry clean.
- workflow_run_id = request.headers.get("workflow-run-id") - if workflow_run_id: - self._context_state.workflow_run_id.set(workflow_run_id) + workflow_run_id = request.headers.get("workflow-run-id") + if workflow_run_id: + try: + uuid.UUID(workflow_run_id) + except (ValueError, AttributeError): + logging.getLogger(__name__).debug("Ignoring invalid workflow-run-id '%s'", workflow_run_id) + else: + self._context_state.workflow_run_id.set(workflow_run_id)As per coding guidelines.
packages/nvidia_nat_opentelemetry/src/nat/plugins/opentelemetry/otel_span.py (2)
88-97: Prefersecrets.randbitsfor full-entropy IDs and simpler loops.
uuid4()provides ~122 bits entropy and adds structure;secrets.randbits(oros.urandom) gives full-width random bits and simpler code.- if context is None or isinstance(context, Context): - # Generate non-zero IDs per OTel spec - while True: - trace_id = uuid.uuid4().int & ((1 << 128) - 1) - if trace_id != 0: - break - while True: - span_id = uuid.uuid4().int & ((1 << 64) - 1) - if span_id != 0: - break + if context is None or isinstance(context, Context): + # Generate non-zero IDs per OTel spec + import secrets + trace_id = 0 + while trace_id == 0: + trace_id = secrets.randbits(128) + span_id = 0 + while span_id == 0: + span_id = secrets.randbits(64)
115-117: Avoid hard-coding instrumentation scope version.Wire this to the package version (e.g.,
nvidia_nat_opentelemetry.__version__) or accept via constructor to keep telemetry accurate across releases.- self._instrumentation_scope = instrumentation_scope or InstrumentationScope("nat", "1.0.0") + from importlib.metadata import version, PackageNotFoundError + if instrumentation_scope is None: + try: + lib_version = version("nvidia-nat-opentelemetry") + except PackageNotFoundError: + lib_version = "0" + self._instrumentation_scope = InstrumentationScope("nat", lib_version) + else: + self._instrumentation_scope = instrumentation_scopesrc/nat/builder/context.py (2)
122-134: Fix return-type docstring forinput_message.Docstring claims
strbut the property returnstyping.Any. Align for correctness and typing clarity.def input_message(self): """ - Retrieves the input message from the context state. + Retrieves the input message from the context state. - The input_message property is used to access the message stored in the - context state. This property returns the message as it is currently - maintained in the context. + The `input_message` property exposes the message stored in the context state. Returns: - str: The input message retrieved from the context state. + typing.Any: The input message retrieved from the context state. """ return self._context_state.input_message.get()Optionally add an explicit return annotation:
def input_message(self) -> typing.Any: .... As per coding guidelines.
201-214: New workflow ID properties look good; consider docstrings with first-line summary style.Minor nit: ensure the first line ends with a period and wraps code entities in backticks.
tests/nat/runtime/test_session_traceparent.py (1)
27-64: Add precedence and negative cases; remove unnecessary asyncio.
- When both headers are present, assert
traceparentwins.- Add malformed header cases (wrong length, zero trace_id) to ensure nothing is set.
- Add a test for
workflow-run-idUUID validation if you adopt the validation change.- Test is sync; you can drop
@pytest.mark.asyncioandasync def.Example additions:
@pytest.mark.parametrize("header_case", [ "traceparent", "workflow-trace-id", ]) -@pytest.mark.asyncio -async def test_session_sets_trace_id_from_headers(header_case: str): +def test_session_sets_trace_id_from_headers(header_case: str): ... +def test_traceparent_precedence_over_workflow_trace_id(): + trace_id_hex = "c" * 32 + other_hex = "d" * 32 + headers = [ + (b"traceparent", f"00-{trace_id_hex}-{'e'*16}-01".encode()), + (b"workflow-trace-id", other_hex.encode()), + ] + request = Request({"type": "http", "method": "GET", "path": "/", "headers": headers, "client": ("127.0.0.1", 0)}) + ctx_state = ContextState.get() + prev = ctx_state.workflow_trace_id.set(None) + try: + SessionManager(workflow=_DummyWorkflow(), max_concurrency=0).set_metadata_from_http_request(request) + assert ctx_state.workflow_trace_id.get() == int(trace_id_hex, 16) + finally: + ctx_state.workflow_trace_id.reset(prev) + +@pytest.mark.parametrize("bad_header", ["", "abc", "0"*32, "g"*32]) +def test_ignores_malformed_traceparent(bad_header: str): + headers = [(b"traceparent", f"00-{bad_header}-{'f'*16}-01".encode())] + request = Request({"type": "http", "method": "GET", "path": "/", "headers": headers, "client": ("127.0.0.1", 0)}) + ctx_state = ContextState.get() + prev = ctx_state.workflow_trace_id.set(None) + try: + SessionManager(workflow=_DummyWorkflow(), max_concurrency=0).set_metadata_from_http_request(request) + assert ctx_state.workflow_trace_id.get() is None + finally: + ctx_state.workflow_trace_id.reset(prev)As per coding guidelines.
tests/nat/runtime/test_runner_trace_ids.py (1)
35-39: Strengthen assertions: also validateworkflow_run_idis set during execution.This ensures both IDs are present, not just the trace.
async def ainvoke(self, _message, _to_type=None): ctx = Context.get() - assert isinstance(ctx.workflow_trace_id, int) and ctx.workflow_trace_id != 0 + assert isinstance(ctx.workflow_trace_id, int) and ctx.workflow_trace_id != 0 + assert isinstance(ctx.workflow_run_id, str) and ctx.workflow_run_id return {"ok": True} async def astream(self, _message, _to_type=None): ctx = Context.get() - assert isinstance(ctx.workflow_trace_id, int) and ctx.workflow_trace_id != 0 + assert isinstance(ctx.workflow_trace_id, int) and ctx.workflow_trace_id != 0 + assert isinstance(ctx.workflow_run_id, str) and ctx.workflow_run_id yield "chunk-1"Also applies to: 40-44
src/nat/runtime/runner.py (2)
146-162: Duplicate run/trace-ID establishment inresultandresult_stream.Extract into a small helper to reduce drift and ease future changes.
As per coding guidelines.
class Runner: + def _establish_workflow_ids(self) -> tuple[str, int, typing.ContextManager[None], typing.ContextManager[None]]: + existing_run_id = self._context_state.workflow_run_id.get() + existing_trace_id = self._context_state.workflow_trace_id.get() + workflow_run_id = existing_run_id if existing_run_id else str(uuid.uuid4()) + if existing_trace_id: + workflow_trace_id = existing_trace_id + else: + while True: + workflow_trace_id = uuid.uuid4().int + if workflow_trace_id != 0: + break + tkn_run = self._context_state.workflow_run_id.set(workflow_run_id) + tkn_trace = self._context_state.workflow_trace_id.set(workflow_trace_id) + return workflow_run_id, workflow_trace_id, tkn_run, tkn_traceThen replace the duplicated blocks with:
- # Establish workflow run and trace identifiers - existing_run_id = ... - ... - token_run_id = self._context_state.workflow_run_id.set(workflow_run_id) - token_trace_id = self._context_state.workflow_trace_id.set(workflow_trace_id) + (workflow_run_id, + workflow_trace_id, + token_run_id, + token_trace_id) = self._establish_workflow_ids()Also applies to: 231-247
205-207: Simplify logging to avoid message building.Minor style; keep
logger.errorper guidelines when re-raising.As per coding guidelines.
- err_msg = f": {e}" if str(e).strip() else "." - logger.error("Error running workflow%s", err_msg) + logger.error("Error running workflow: %s", e)src/nat/data_models/span.py (1)
153-166: Narrow exception handling in validators.Catching
Exceptionis unnecessary here; limit toValueError | TypeErrorforint()coercion.As per coding guidelines.
- try: - value = int(v) - except Exception: + try: + value = int(v) + except (ValueError, TypeError): return _generate_nonzero_trace_id() ... - try: - value = int(v) - except Exception: + try: + value = int(v) + except (ValueError, TypeError): return _generate_nonzero_span_id()Also applies to: 168-181
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
packages/nvidia_nat_opentelemetry/src/nat/plugins/opentelemetry/otel_span.py(1 hunks)src/nat/builder/context.py(3 hunks)src/nat/data_models/span.py(1 hunks)src/nat/observability/exporter/span_exporter.py(3 hunks)src/nat/runtime/runner.py(4 hunks)src/nat/runtime/session.py(1 hunks)tests/nat/opentelemetry/test_otel_span_ids.py(1 hunks)tests/nat/runtime/test_runner_trace_ids.py(1 hunks)tests/nat/runtime/test_session_traceparent.py(1 hunks)
🧰 Additional context used
📓 Path-based instructions (10)
**/*.{py,yaml,yml}
📄 CodeRabbit inference engine (.cursor/rules/nat-test-llm.mdc)
**/*.{py,yaml,yml}: Configure response_seq as a list of strings; values cycle per call, and [] yields an empty string.
Configure delay_ms to inject per-call artificial latency in milliseconds for nat_test_llm.
Files:
tests/nat/runtime/test_session_traceparent.pysrc/nat/builder/context.pypackages/nvidia_nat_opentelemetry/src/nat/plugins/opentelemetry/otel_span.pysrc/nat/runtime/session.pytests/nat/opentelemetry/test_otel_span_ids.pysrc/nat/runtime/runner.pytests/nat/runtime/test_runner_trace_ids.pysrc/nat/observability/exporter/span_exporter.pysrc/nat/data_models/span.py
**/*.py
📄 CodeRabbit inference engine (.cursor/rules/nat-test-llm.mdc)
**/*.py: Programmatic use: create TestLLMConfig(response_seq=[...], delay_ms=...), add with builder.add_llm("", cfg).
When retrieving the test LLM wrapper, use builder.get_llm(name, wrapper_type=LLMFrameworkEnum.) and call the framework’s method (e.g., ainvoke, achat, call).
**/*.py: In code comments/identifiers use NAT abbreviations as specified: nat for API namespace/CLI, nvidia-nat for package name, NAT for env var prefixes; do not use these abbreviations in documentation
Follow PEP 20 and PEP 8; run yapf with column_limit=120; use 4-space indentation; end files with a single trailing newline
Run ruff check --fix as linter (not formatter) using pyproject.toml config; fix warnings unless explicitly ignored
Respect naming: snake_case for functions/variables, PascalCase for classes, UPPER_CASE for constants
Treat pyright warnings as errors during development
Exception handling: use bare raise to re-raise; log with logger.error() when re-raising to avoid duplicate stack traces; use logger.exception() when catching without re-raising
Provide Google-style docstrings for every public module, class, function, and CLI command; first line concise and ending with a period; surround code entities with backticks
Validate and sanitize all user input, especially in web or CLI interfaces
Prefer httpx with SSL verification enabled by default and follow OWASP Top-10 recommendations
Use async/await for I/O-bound work; profile CPU-heavy paths with cProfile or mprof before optimizing; cache expensive computations with functools.lru_cache or external cache; leverage NumPy vectorized operations when beneficial
Files:
tests/nat/runtime/test_session_traceparent.pysrc/nat/builder/context.pypackages/nvidia_nat_opentelemetry/src/nat/plugins/opentelemetry/otel_span.pysrc/nat/runtime/session.pytests/nat/opentelemetry/test_otel_span_ids.pysrc/nat/runtime/runner.pytests/nat/runtime/test_runner_trace_ids.pysrc/nat/observability/exporter/span_exporter.pysrc/nat/data_models/span.py
tests/**/*.py
📄 CodeRabbit inference engine (.cursor/rules/general.mdc)
Unit tests reside under tests/ and should use markers defined in pyproject.toml (e.g., integration)
Files:
tests/nat/runtime/test_session_traceparent.pytests/nat/opentelemetry/test_otel_span_ids.pytests/nat/runtime/test_runner_trace_ids.py
⚙️ CodeRabbit configuration file
tests/**/*.py: - Ensure that tests are comprehensive, cover edge cases, and validate the functionality of the code. - Test functions should be named using thetest_prefix, using snake_case. - Any frequently repeated code should be extracted into pytest fixtures. - Pytest fixtures should define the name argument when applying the pytest.fixture decorator. The fixture
function being decorated should be named using thefixture_prefix, using snake_case. Example:
@pytest.fixture(name="my_fixture")
def fixture_my_fixture():
pass
Files:
tests/nat/runtime/test_session_traceparent.pytests/nat/opentelemetry/test_otel_span_ids.pytests/nat/runtime/test_runner_trace_ids.py
{tests/**/*.py,examples/*/tests/**/*.py}
📄 CodeRabbit inference engine (.cursor/rules/general.mdc)
{tests/**/*.py,examples/*/tests/**/*.py}: Use pytest (with pytest-asyncio for async); name test files test_*.py; test functions start with test_; extract repeated code into fixtures; fixtures must set name in decorator and be named with fixture_ prefix
Mock external services with pytest_httpserver or unittest.mock; do not hit live endpoints
Mark expensive tests with @pytest.mark.slow or @pytest.mark.integration
Files:
tests/nat/runtime/test_session_traceparent.pytests/nat/opentelemetry/test_otel_span_ids.pytests/nat/runtime/test_runner_trace_ids.py
**/*
⚙️ CodeRabbit configuration file
**/*: # Code Review Instructions
- Ensure the code follows best practices and coding standards. - For Python code, follow
PEP 20 and
PEP 8 for style guidelines.- Check for security vulnerabilities and potential issues. - Python methods should use type hints for all parameters and return values.
Example:def my_function(param1: int, param2: str) -> bool: pass- For Python exception handling, ensure proper stack trace preservation:
- When re-raising exceptions: use bare
raisestatements to maintain the original stack trace,
and uselogger.error()(notlogger.exception()) to avoid duplicate stack trace output.- When catching and logging exceptions without re-raising: always use
logger.exception()
to capture the full stack trace information.Documentation Review Instructions - Verify that documentation and comments are clear and comprehensive. - Verify that the documentation doesn't contain any TODOs, FIXMEs or placeholder text like "lorem ipsum". - Verify that the documentation doesn't contain any offensive or outdated terms. - Verify that documentation and comments are free of spelling mistakes, ensure the documentation doesn't contain any
words listed in the
ci/vale/styles/config/vocabularies/nat/reject.txtfile, words that might appear to be
spelling mistakes but are listed in theci/vale/styles/config/vocabularies/nat/accept.txtfile are OK.Misc. - All code (except .mdc files that contain Cursor rules) should be licensed under the Apache License 2.0,
and should contain an Apache License 2.0 header comment at the top of each file.
- Confirm that copyright years are up-to date whenever a file is changed.
Files:
tests/nat/runtime/test_session_traceparent.pysrc/nat/builder/context.pypackages/nvidia_nat_opentelemetry/src/nat/plugins/opentelemetry/otel_span.pysrc/nat/runtime/session.pytests/nat/opentelemetry/test_otel_span_ids.pysrc/nat/runtime/runner.pytests/nat/runtime/test_runner_trace_ids.pysrc/nat/observability/exporter/span_exporter.pysrc/nat/data_models/span.py
src/**/*.py
📄 CodeRabbit inference engine (.cursor/rules/general.mdc)
All importable Python code must live under src/ (or packages//src/)
Files:
src/nat/builder/context.pysrc/nat/runtime/session.pysrc/nat/runtime/runner.pysrc/nat/observability/exporter/span_exporter.pysrc/nat/data_models/span.py
src/nat/**/*
📄 CodeRabbit inference engine (.cursor/rules/general.mdc)
Changes in src/nat should prioritize backward compatibility
Files:
src/nat/builder/context.pysrc/nat/runtime/session.pysrc/nat/runtime/runner.pysrc/nat/observability/exporter/span_exporter.pysrc/nat/data_models/span.py
⚙️ CodeRabbit configuration file
This directory contains the core functionality of the toolkit. Changes should prioritize backward compatibility.
Files:
src/nat/builder/context.pysrc/nat/runtime/session.pysrc/nat/runtime/runner.pysrc/nat/observability/exporter/span_exporter.pysrc/nat/data_models/span.py
{src/**/*.py,packages/*/src/**/*.py}
📄 CodeRabbit inference engine (.cursor/rules/general.mdc)
All public APIs must have Python 3.11+ type hints on parameters and return values; prefer typing/collections.abc abstractions; use typing.Annotated when useful
Files:
src/nat/builder/context.pypackages/nvidia_nat_opentelemetry/src/nat/plugins/opentelemetry/otel_span.pysrc/nat/runtime/session.pysrc/nat/runtime/runner.pysrc/nat/observability/exporter/span_exporter.pysrc/nat/data_models/span.py
packages/*/src/**/*.py
📄 CodeRabbit inference engine (.cursor/rules/general.mdc)
Importable Python code inside packages must live under packages//src/
Files:
packages/nvidia_nat_opentelemetry/src/nat/plugins/opentelemetry/otel_span.py
packages/**/*
⚙️ CodeRabbit configuration file
packages/**/*: - This directory contains optional plugin packages for the toolkit, each should contain apyproject.tomlfile. - Thepyproject.tomlfile should declare a dependency onnvidia-nator another package with a name starting
withnvidia-nat-. This dependency should be declared using~=<version>, and the version should be a two
digit version (ex:~=1.0).
- Not all packages contain Python code, if they do they should also contain their own set of tests, in a
tests/directory at the same level as thepyproject.tomlfile.
Files:
packages/nvidia_nat_opentelemetry/src/nat/plugins/opentelemetry/otel_span.py
🧬 Code graph analysis (7)
tests/nat/runtime/test_session_traceparent.py (2)
src/nat/builder/context.py (2)
ContextState(65-114)workflow_trace_id(209-213)src/nat/runtime/session.py (2)
workflow(84-85)set_metadata_from_http_request(142-187)
src/nat/runtime/session.py (1)
src/nat/builder/context.py (4)
get(113-114)get(305-315)workflow_trace_id(209-213)workflow_run_id(202-206)
tests/nat/opentelemetry/test_otel_span_ids.py (1)
packages/nvidia_nat_opentelemetry/src/nat/plugins/opentelemetry/otel_span.py (3)
get_span_context(315-321)trace_id(280-286)span_id(271-277)
src/nat/runtime/runner.py (3)
src/nat/builder/context.py (11)
Context(117-315)ContextState(65-114)workflow_run_id(202-206)get(113-114)get(305-315)workflow_trace_id(209-213)conversation_id(185-192)intermediate_step_manager(171-182)metadata(89-92)metadata(151-160)output(58-59)src/nat/observability/exporter_manager.py (2)
get(326-335)start(201-257)tests/nat/runtime/test_runner_trace_ids.py (3)
start(48-58)ainvoke(35-38)astream(40-43)
tests/nat/runtime/test_runner_trace_ids.py (3)
src/nat/runtime/runner.py (7)
context(90-91)Runner(50-300)convert(93-94)result(126-127)result(130-131)result(133-216)result_stream(218-300)src/nat/builder/context.py (4)
Context(117-315)ContextState(65-114)workflow_trace_id(209-213)workflow_run_id(202-206)src/nat/observability/exporter_manager.py (1)
ExporterManager(26-335)
src/nat/observability/exporter/span_exporter.py (2)
src/nat/builder/context.py (5)
workflow_trace_id(209-213)get(113-114)get(305-315)conversation_id(185-192)workflow_run_id(202-206)src/nat/data_models/span.py (2)
SpanContext(147-181)Span(184-237)
src/nat/data_models/span.py (1)
packages/nvidia_nat_opentelemetry/src/nat/plugins/opentelemetry/otel_span.py (2)
trace_id(280-286)span_id(271-277)
🪛 Ruff (0.13.1)
src/nat/runtime/session.py
174-175: try-except-pass detected, consider logging the exception
(S110)
174-174: Do not catch blind exception: Exception
(BLE001)
182-183: try-except-pass detected, consider logging the exception
(S110)
182-182: Do not catch blind exception: Exception
(BLE001)
src/nat/runtime/runner.py
144-144: Abstract raise to an inner function
(TRY301)
144-144: Avoid specifying long messages outside the exception class
(TRY003)
221-221: Avoid specifying long messages outside the exception class
(TRY003)
229-229: Abstract raise to an inner function
(TRY301)
229-229: Avoid specifying long messages outside the exception class
(TRY003)
tests/nat/runtime/test_runner_trace_ids.py
32-32: Unused method argument: to_type
(ARG002)
35-35: Unused method argument: to_type
(ARG002)
40-40: Unused method argument: to_type
(ARG002)
48-48: Unused method argument: context_state
(ARG002)
src/nat/data_models/span.py
161-161: Do not catch blind exception: Exception
(BLE001)
176-176: Do not catch blind exception: Exception
(BLE001)
🔇 Additional comments (1)
src/nat/observability/exporter/span_exporter.py (1)
170-188: Nit: attribute formatting is solid; consider consistently hex-formatting only when non-None.Current expression handles it; no action required.
Please run one span-exporting flow and confirm
nat.span.kindis notUNKNOWNfor WORKFLOW_* and FUNCTION_* events after the above fix.
willkill07
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know too much about UUIDs :)
This is nice, but we can clean it up a lot!
packages/nvidia_nat_opentelemetry/src/nat/plugins/opentelemetry/otel_span.py
Outdated
Show resolved
Hide resolved
f87d320 to
7e2e07e
Compare
Signed-off-by: Daniel Wang <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (1)
src/nat/data_models/span.py (1)
142-145: Fix typo in field descriptions.The field descriptions contain a typo: "OTel-syle" should be "OpenTelemetry-style" on both lines 143 and 145.
Apply this diff:
- trace_id: int = Field(default_factory=_generate_nonzero_trace_id, - description="The OTel-syle 128-bit trace ID of the span.") - span_id: int = Field(default_factory=_generate_nonzero_span_id, - description="The OTel-syle 64-bit span ID of the span.") + trace_id: int = Field(default_factory=_generate_nonzero_trace_id, + description="The OpenTelemetry-style 128-bit trace ID of the span.") + span_id: int = Field(default_factory=_generate_nonzero_span_id, + description="The OpenTelemetry-style 64-bit span ID of the span.")
🧹 Nitpick comments (3)
src/nat/runtime/runner.py (1)
225-234: Consider extracting workflow ID establishment logic.The workflow ID establishment logic is identical in both
result()andresult_stream()methods (lines 146-155 and 225-234). Consider extracting this into a helper method to follow DRY principles.Example refactor:
def _establish_workflow_ids(self) -> tuple[str, int, Any, Any]: """Establish workflow run and trace identifiers. Returns: Tuple of (workflow_run_id, workflow_trace_id, token_run_id, token_trace_id) """ existing_run_id = self._context_state.workflow_run_id.get() existing_trace_id = self._context_state.workflow_trace_id.get() workflow_run_id = existing_run_id or str(uuid.uuid4()) workflow_trace_id = existing_trace_id or uuid.uuid4().int token_run_id = self._context_state.workflow_run_id.set(workflow_run_id) token_trace_id = self._context_state.workflow_trace_id.set(workflow_trace_id) return workflow_run_id, workflow_trace_id, token_run_id, token_trace_idThen use it in both methods:
workflow_run_id, workflow_trace_id, token_run_id, token_trace_id = self._establish_workflow_ids()src/nat/data_models/span.py (2)
147-157: Validator logic is correct; consider custom exception type.The validation logic correctly handles UUID string conversion, None regeneration, and zero rejection. The implementation follows suggestions from previous reviews.
Static analysis suggests using a custom exception type instead of generic
Exception. While the current implementation works, consider defining a custom exception for better error handling:class InvalidSpanIdError(ValueError): """Raised when a span ID is invalid.""" passThen update line 156:
- raise Exception("A null UUID is invalid") + raise InvalidSpanIdError("A null UUID is invalid")
159-169: Validator logic is correct; note duplication.The span_id validator correctly mirrors the trace_id validator logic. Similar to the trace_id validator, consider using a custom exception type (see previous comment).
The validation logic is duplicated between
_validate_trace_idand_validate_span_id. Consider extracting common logic if this pattern is used elsewhere:def _validate_id(v: int | str | None, generator_fn: callable, id_type: str) -> int: """Common validation logic for trace_id and span_id.""" if isinstance(v, str): v = uuid.UUID(v).int if isinstance(v, type(None)): v = generator_fn() if v == 0: raise InvalidSpanIdError(f"A null {id_type} is invalid") return vHowever, given the validators are already concise, this refactor is optional.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
packages/nvidia_nat_opentelemetry/src/nat/plugins/opentelemetry/otel_span.py(1 hunks)src/nat/data_models/span.py(1 hunks)src/nat/runtime/runner.py(4 hunks)
🧰 Additional context used
📓 Path-based instructions (8)
**/*.{py,yaml,yml}
📄 CodeRabbit inference engine (.cursor/rules/nat-test-llm.mdc)
**/*.{py,yaml,yml}: Configure response_seq as a list of strings; values cycle per call, and [] yields an empty string.
Configure delay_ms to inject per-call artificial latency in milliseconds for nat_test_llm.
Files:
src/nat/data_models/span.pysrc/nat/runtime/runner.pypackages/nvidia_nat_opentelemetry/src/nat/plugins/opentelemetry/otel_span.py
**/*.py
📄 CodeRabbit inference engine (.cursor/rules/nat-test-llm.mdc)
**/*.py: Programmatic use: create TestLLMConfig(response_seq=[...], delay_ms=...), add with builder.add_llm("", cfg).
When retrieving the test LLM wrapper, use builder.get_llm(name, wrapper_type=LLMFrameworkEnum.) and call the framework’s method (e.g., ainvoke, achat, call).
**/*.py: In code comments/identifiers use NAT abbreviations as specified: nat for API namespace/CLI, nvidia-nat for package name, NAT for env var prefixes; do not use these abbreviations in documentation
Follow PEP 20 and PEP 8; run yapf with column_limit=120; use 4-space indentation; end files with a single trailing newline
Run ruff check --fix as linter (not formatter) using pyproject.toml config; fix warnings unless explicitly ignored
Respect naming: snake_case for functions/variables, PascalCase for classes, UPPER_CASE for constants
Treat pyright warnings as errors during development
Exception handling: use bare raise to re-raise; log with logger.error() when re-raising to avoid duplicate stack traces; use logger.exception() when catching without re-raising
Provide Google-style docstrings for every public module, class, function, and CLI command; first line concise and ending with a period; surround code entities with backticks
Validate and sanitize all user input, especially in web or CLI interfaces
Prefer httpx with SSL verification enabled by default and follow OWASP Top-10 recommendations
Use async/await for I/O-bound work; profile CPU-heavy paths with cProfile or mprof before optimizing; cache expensive computations with functools.lru_cache or external cache; leverage NumPy vectorized operations when beneficial
Files:
src/nat/data_models/span.pysrc/nat/runtime/runner.pypackages/nvidia_nat_opentelemetry/src/nat/plugins/opentelemetry/otel_span.py
src/**/*.py
📄 CodeRabbit inference engine (.cursor/rules/general.mdc)
All importable Python code must live under src/ (or packages//src/)
Files:
src/nat/data_models/span.pysrc/nat/runtime/runner.py
src/nat/**/*
📄 CodeRabbit inference engine (.cursor/rules/general.mdc)
Changes in src/nat should prioritize backward compatibility
Files:
src/nat/data_models/span.pysrc/nat/runtime/runner.py
⚙️ CodeRabbit configuration file
This directory contains the core functionality of the toolkit. Changes should prioritize backward compatibility.
Files:
src/nat/data_models/span.pysrc/nat/runtime/runner.py
{src/**/*.py,packages/*/src/**/*.py}
📄 CodeRabbit inference engine (.cursor/rules/general.mdc)
All public APIs must have Python 3.11+ type hints on parameters and return values; prefer typing/collections.abc abstractions; use typing.Annotated when useful
Files:
src/nat/data_models/span.pysrc/nat/runtime/runner.pypackages/nvidia_nat_opentelemetry/src/nat/plugins/opentelemetry/otel_span.py
**/*
⚙️ CodeRabbit configuration file
**/*: # Code Review Instructions
- Ensure the code follows best practices and coding standards. - For Python code, follow
PEP 20 and
PEP 8 for style guidelines.- Check for security vulnerabilities and potential issues. - Python methods should use type hints for all parameters and return values.
Example:def my_function(param1: int, param2: str) -> bool: pass- For Python exception handling, ensure proper stack trace preservation:
- When re-raising exceptions: use bare
raisestatements to maintain the original stack trace,
and uselogger.error()(notlogger.exception()) to avoid duplicate stack trace output.- When catching and logging exceptions without re-raising: always use
logger.exception()
to capture the full stack trace information.Documentation Review Instructions - Verify that documentation and comments are clear and comprehensive. - Verify that the documentation doesn't contain any TODOs, FIXMEs or placeholder text like "lorem ipsum". - Verify that the documentation doesn't contain any offensive or outdated terms. - Verify that documentation and comments are free of spelling mistakes, ensure the documentation doesn't contain any
words listed in the
ci/vale/styles/config/vocabularies/nat/reject.txtfile, words that might appear to be
spelling mistakes but are listed in theci/vale/styles/config/vocabularies/nat/accept.txtfile are OK.Misc. - All code (except .mdc files that contain Cursor rules) should be licensed under the Apache License 2.0,
and should contain an Apache License 2.0 header comment at the top of each file.
- Confirm that copyright years are up-to date whenever a file is changed.
Files:
src/nat/data_models/span.pysrc/nat/runtime/runner.pypackages/nvidia_nat_opentelemetry/src/nat/plugins/opentelemetry/otel_span.py
packages/*/src/**/*.py
📄 CodeRabbit inference engine (.cursor/rules/general.mdc)
Importable Python code inside packages must live under packages//src/
Files:
packages/nvidia_nat_opentelemetry/src/nat/plugins/opentelemetry/otel_span.py
packages/**/*
⚙️ CodeRabbit configuration file
packages/**/*: - This directory contains optional plugin packages for the toolkit, each should contain apyproject.tomlfile. - Thepyproject.tomlfile should declare a dependency onnvidia-nator another package with a name starting
withnvidia-nat-. This dependency should be declared using~=<version>, and the version should be a two
digit version (ex:~=1.0).
- Not all packages contain Python code, if they do they should also contain their own set of tests, in a
tests/directory at the same level as thepyproject.tomlfile.
Files:
packages/nvidia_nat_opentelemetry/src/nat/plugins/opentelemetry/otel_span.py
🧬 Code graph analysis (2)
src/nat/data_models/span.py (1)
packages/nvidia_nat_opentelemetry/src/nat/plugins/opentelemetry/otel_span.py (2)
trace_id(274-280)span_id(265-271)
src/nat/runtime/runner.py (3)
src/nat/builder/context.py (11)
Context(117-315)ContextState(65-114)workflow_run_id(202-206)get(113-114)get(305-315)workflow_trace_id(209-213)conversation_id(185-192)intermediate_step_manager(171-182)metadata(89-92)metadata(151-160)output(58-59)src/nat/observability/exporter_manager.py (2)
get(326-335)start(201-257)tests/nat/runtime/test_runner_trace_ids.py (3)
start(48-58)ainvoke(35-38)astream(40-43)
🪛 Ruff (0.13.1)
src/nat/data_models/span.py
156-156: Create your own exception
(TRY002)
156-156: Avoid specifying long messages outside the exception class
(TRY003)
168-168: Create your own exception
(TRY002)
168-168: Avoid specifying long messages outside the exception class
(TRY003)
src/nat/runtime/runner.py
144-144: Abstract raise to an inner function
(TRY301)
144-144: Avoid specifying long messages outside the exception class
(TRY003)
215-215: Avoid specifying long messages outside the exception class
(TRY003)
223-223: Abstract raise to an inner function
(TRY301)
223-223: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (7)
packages/nvidia_nat_opentelemetry/src/nat/plugins/opentelemetry/otel_span.py (1)
89-91: LGTM: UUID v4 guarantees non-zero IDs.The implementation correctly uses
uuid.uuid4()to generate non-zero IDs as suggested in previous reviews. UUID v4 properties ensure the trace_id (128-bit) and span_id (64-bit) are always non-zero due to version and variant bits.src/nat/runtime/runner.py (5)
146-155: LGTM: Workflow ID establishment follows suggested pattern.The implementation correctly:
- Preserves existing workflow_run_id and workflow_trace_id when present
- Falls back to UUID generation when absent
- Uses token-based context management for proper cleanup
157-189: LGTM: Proper workflow event emission with consistent metadata.The implementation correctly:
- Emits WORKFLOW_START and WORKFLOW_END events at appropriate points
- Includes consistent metadata (workflow_run_id, workflow_trace_id as hex, conversation_id)
- Attaches output data to WORKFLOW_END event
- Integrates properly with the intermediate step manager
206-210: LGTM: Proper token cleanup in finally block.The finally block correctly resets workflow_run_id and workflow_trace_id tokens, ensuring context isolation is maintained even when exceptions occur.
236-269: LGTM: Workflow event emission for streaming path.The implementation correctly emits WORKFLOW_START and WORKFLOW_END events for the streaming path with consistent metadata. Note that similar duplication exists here as in the ID establishment logic (see previous comment about extracting helper methods).
284-288: LGTM: Proper token cleanup in finally block.The finally block correctly resets workflow_run_id and workflow_trace_id tokens for the streaming path, maintaining consistency with the
result()method.src/nat/data_models/span.py (1)
131-138: LGTM: Non-zero ID generators align with UUID v4 guarantees.The implementation correctly generates non-zero IDs using UUID v4 properties as suggested in previous reviews. The trace_id uses the full 128-bit UUID, while span_id uses the lower 64 bits via right shift.
|
@willkill07 Thank you so much Will for all of the comments for PR-863!! I was scared when I saw there were 9 comments 😂 but I'm relieved to see most of them are about one issue (I should've dived deeper to realize that uuid.uuid4 already guarantees non-zero-ness!) |
Signed-off-by: Zhongxuan (Daniel) Wang <[email protected]>
|
/ok to test 3a2186b |
Signed-off-by: Daniel Wang <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
♻️ Duplicate comments (1)
src/nat/data_models/span.py (1)
142-145: Typo in field descriptions: "OTel-syle" should be "OpenTelemetry-style".Both
trace_idandspan_idfield descriptions contain the typo "OTel-syle".Apply this diff:
- trace_id: int = Field(default_factory=_generate_nonzero_trace_id, - description="The OTel-syle 128-bit trace ID of the span.") - span_id: int = Field(default_factory=_generate_nonzero_span_id, - description="The OTel-syle 64-bit span ID of the span.") + trace_id: int = Field(default_factory=_generate_nonzero_trace_id, + description="The OpenTelemetry-style 128-bit trace ID of the span.") + span_id: int = Field(default_factory=_generate_nonzero_span_id, + description="The OpenTelemetry-style 64-bit span ID of the span.")
🧹 Nitpick comments (1)
src/nat/runtime/session.py (1)
175-176: Consider logging header parse failures for observability.The bare
except Exception: passblocks silently swallow all parsing errors, making it difficult to diagnose misconfigurations or malformed headers from upstream callers.Add debug-level logging to improve observability while maintaining resilience:
except Exception: + logger.debug("Failed to parse traceparent header: %s", traceparent, exc_info=True) passexcept Exception: + logger.debug("Failed to parse workflow-trace-id header: %s", workflow_trace_id, exc_info=True) passBased on coding guidelines (exception handling should use logger.error() when re-raising or logger.exception() when catching).
Also applies to: 183-184
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
src/nat/data_models/span.py(1 hunks)src/nat/runtime/session.py(2 hunks)tests/nat/runtime/test_session_traceparent.py(1 hunks)
🧰 Additional context used
📓 Path-based instructions (8)
**/*.{py,yaml,yml}
📄 CodeRabbit inference engine (.cursor/rules/nat-test-llm.mdc)
**/*.{py,yaml,yml}: Configure response_seq as a list of strings; values cycle per call, and [] yields an empty string.
Configure delay_ms to inject per-call artificial latency in milliseconds for nat_test_llm.
Files:
tests/nat/runtime/test_session_traceparent.pysrc/nat/runtime/session.pysrc/nat/data_models/span.py
**/*.py
📄 CodeRabbit inference engine (.cursor/rules/nat-test-llm.mdc)
**/*.py: Programmatic use: create TestLLMConfig(response_seq=[...], delay_ms=...), add with builder.add_llm("", cfg).
When retrieving the test LLM wrapper, use builder.get_llm(name, wrapper_type=LLMFrameworkEnum.) and call the framework’s method (e.g., ainvoke, achat, call).
**/*.py: In code comments/identifiers use NAT abbreviations as specified: nat for API namespace/CLI, nvidia-nat for package name, NAT for env var prefixes; do not use these abbreviations in documentation
Follow PEP 20 and PEP 8; run yapf with column_limit=120; use 4-space indentation; end files with a single trailing newline
Run ruff check --fix as linter (not formatter) using pyproject.toml config; fix warnings unless explicitly ignored
Respect naming: snake_case for functions/variables, PascalCase for classes, UPPER_CASE for constants
Treat pyright warnings as errors during development
Exception handling: use bare raise to re-raise; log with logger.error() when re-raising to avoid duplicate stack traces; use logger.exception() when catching without re-raising
Provide Google-style docstrings for every public module, class, function, and CLI command; first line concise and ending with a period; surround code entities with backticks
Validate and sanitize all user input, especially in web or CLI interfaces
Prefer httpx with SSL verification enabled by default and follow OWASP Top-10 recommendations
Use async/await for I/O-bound work; profile CPU-heavy paths with cProfile or mprof before optimizing; cache expensive computations with functools.lru_cache or external cache; leverage NumPy vectorized operations when beneficial
Files:
tests/nat/runtime/test_session_traceparent.pysrc/nat/runtime/session.pysrc/nat/data_models/span.py
tests/**/*.py
📄 CodeRabbit inference engine (.cursor/rules/general.mdc)
Unit tests reside under tests/ and should use markers defined in pyproject.toml (e.g., integration)
Files:
tests/nat/runtime/test_session_traceparent.py
⚙️ CodeRabbit configuration file
tests/**/*.py: - Ensure that tests are comprehensive, cover edge cases, and validate the functionality of the code. - Test functions should be named using thetest_prefix, using snake_case. - Any frequently repeated code should be extracted into pytest fixtures. - Pytest fixtures should define the name argument when applying the pytest.fixture decorator. The fixture
function being decorated should be named using thefixture_prefix, using snake_case. Example:
@pytest.fixture(name="my_fixture")
def fixture_my_fixture():
pass
Files:
tests/nat/runtime/test_session_traceparent.py
{tests/**/*.py,examples/*/tests/**/*.py}
📄 CodeRabbit inference engine (.cursor/rules/general.mdc)
{tests/**/*.py,examples/*/tests/**/*.py}: Use pytest (with pytest-asyncio for async); name test files test_*.py; test functions start with test_; extract repeated code into fixtures; fixtures must set name in decorator and be named with fixture_ prefix
Mock external services with pytest_httpserver or unittest.mock; do not hit live endpoints
Mark expensive tests with @pytest.mark.slow or @pytest.mark.integration
Files:
tests/nat/runtime/test_session_traceparent.py
**/*
⚙️ CodeRabbit configuration file
**/*: # Code Review Instructions
- Ensure the code follows best practices and coding standards. - For Python code, follow
PEP 20 and
PEP 8 for style guidelines.- Check for security vulnerabilities and potential issues. - Python methods should use type hints for all parameters and return values.
Example:def my_function(param1: int, param2: str) -> bool: pass- For Python exception handling, ensure proper stack trace preservation:
- When re-raising exceptions: use bare
raisestatements to maintain the original stack trace,
and uselogger.error()(notlogger.exception()) to avoid duplicate stack trace output.- When catching and logging exceptions without re-raising: always use
logger.exception()
to capture the full stack trace information.Documentation Review Instructions - Verify that documentation and comments are clear and comprehensive. - Verify that the documentation doesn't contain any TODOs, FIXMEs or placeholder text like "lorem ipsum". - Verify that the documentation doesn't contain any offensive or outdated terms. - Verify that documentation and comments are free of spelling mistakes, ensure the documentation doesn't contain any
words listed in the
ci/vale/styles/config/vocabularies/nat/reject.txtfile, words that might appear to be
spelling mistakes but are listed in theci/vale/styles/config/vocabularies/nat/accept.txtfile are OK.Misc. - All code (except .mdc files that contain Cursor rules) should be licensed under the Apache License 2.0,
and should contain an Apache License 2.0 header comment at the top of each file.
- Confirm that copyright years are up-to date whenever a file is changed.
Files:
tests/nat/runtime/test_session_traceparent.pysrc/nat/runtime/session.pysrc/nat/data_models/span.py
src/**/*.py
📄 CodeRabbit inference engine (.cursor/rules/general.mdc)
All importable Python code must live under src/ (or packages//src/)
Files:
src/nat/runtime/session.pysrc/nat/data_models/span.py
src/nat/**/*
📄 CodeRabbit inference engine (.cursor/rules/general.mdc)
Changes in src/nat should prioritize backward compatibility
Files:
src/nat/runtime/session.pysrc/nat/data_models/span.py
⚙️ CodeRabbit configuration file
This directory contains the core functionality of the toolkit. Changes should prioritize backward compatibility.
Files:
src/nat/runtime/session.pysrc/nat/data_models/span.py
{src/**/*.py,packages/*/src/**/*.py}
📄 CodeRabbit inference engine (.cursor/rules/general.mdc)
All public APIs must have Python 3.11+ type hints on parameters and return values; prefer typing/collections.abc abstractions; use typing.Annotated when useful
Files:
src/nat/runtime/session.pysrc/nat/data_models/span.py
🧬 Code graph analysis (3)
tests/nat/runtime/test_session_traceparent.py (2)
src/nat/runtime/session.py (4)
context(89-90)workflow(85-86)session(93-128)set_metadata_from_http_request(143-188)src/nat/builder/context.py (5)
ContextState(65-114)workflow_trace_id(209-213)conversation_id(185-192)user_message_id(195-199)workflow_run_id(202-206)
src/nat/runtime/session.py (1)
src/nat/builder/context.py (4)
get(113-114)get(305-315)workflow_trace_id(209-213)workflow_run_id(202-206)
src/nat/data_models/span.py (1)
packages/nvidia_nat_opentelemetry/src/nat/plugins/opentelemetry/otel_span.py (2)
trace_id(274-280)span_id(265-271)
🪛 Ruff (0.13.2)
src/nat/runtime/session.py
175-176: try-except-pass detected, consider logging the exception
(S110)
175-175: Do not catch blind exception: Exception
(BLE001)
183-184: try-except-pass detected, consider logging the exception
(S110)
183-183: Do not catch blind exception: Exception
(BLE001)
src/nat/data_models/span.py
156-156: Create your own exception
(TRY002)
156-156: Avoid specifying long messages outside the exception class
(TRY003)
167-167: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
167-167: Create your own exception
(TRY002)
167-167: Avoid specifying long messages outside the exception class
(TRY003)
171-171: Create your own exception
(TRY002)
171-171: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (7)
src/nat/runtime/session.py (2)
186-189: LGTM: workflow-run-id handling is straightforward.The workflow run ID is treated as an opaque string identifier, which is appropriate for correlation purposes.
165-177: Critical:uuid.UUID()cannot parse unhyphenated 32-character hex strings.Line 173 calls
uuid.UUID(trace_id_hex).intwheretrace_id_hexis extracted from the W3Ctraceparentheader (format:00-{32-hex}-{16-hex}-{flags}). The trace ID portion is 32 contiguous hexadecimal characters without hyphens, butuuid.UUID()expects either a hyphenated UUID string or an explicithex=parameter.This will raise a
ValueErroron every validtraceparentheader and fall through to the silent exception handler, effectively breaking W3C trace context propagation.Apply this diff to fix the parsing:
parts = traceparent.split("-") if len(parts) >= 4: trace_id_hex = parts[1] if len(trace_id_hex) == 32: - trace_id_int = uuid.UUID(trace_id_hex).int + trace_id_int = int(trace_id_hex, 16) self._context_state.workflow_trace_id.set(trace_id_int)Alternatively, if you prefer to use
uuid.UUID()for validation:- trace_id_int = uuid.UUID(trace_id_hex).int + trace_id_int = uuid.UUID(hex=trace_id_hex).intLikely an incorrect or invalid review comment.
tests/nat/runtime/test_session_traceparent.py (4)
76-103: Well-structured test with proper cleanup.The test correctly:
- Constructs a Starlette
Requestwith parametrized headers- Obtains the singleton
ContextStateand sets/resets tokens in a try/finally block- Verifies that
set_metadata_from_http_requestpopulatesworkflow_trace_idcorrectlyToken cleanup ensures test isolation.
123-159: Thorough metadata validation test.The test validates all metadata headers (
conversation-id,user-message-id,workflow-run-id) both individually and in combination, with proper context variable isolation using token reset in finally blocks.
51-54: Error cases are well covered.The test suite includes error handling scenarios:
- Both headers invalid (lines 51-54)
- Negative and overflow values for
workflow-trace-id(lines 71-72)- Malformed
traceparentwith fallback (lines 45-49)This addresses the past review comment requesting error case coverage.
Based on learnings.
Also applies to: 71-72
36-73: Parametrized tests are comprehensive and session.py’s UUID parsing is correct
The use of uuid.UUID(trace_id_hex) and uuid.UUID(workflow_trace_id) already accepts both 32-character hex strings (case-insensitive) and standard hyphenated UUID formats (and rejects invalid lengths), so all TRACE_ID_CASES will pass without changes to session.py.src/nat/data_models/span.py (1)
131-139: LGTM: ID generators correctly produce non-zero OpenTelemetry-sized IDs.
_generate_nonzero_trace_id()generates a 128-bit random integer usinguuid.uuid4().int_generate_nonzero_span_id()derives a 64-bit span ID by shifting the UUID right by 64 bitsBoth functions are guaranteed to produce non-zero values due to UUID version and variant bits.
Based on learnings (past review comments from willkill07).
willkill07
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Last round of feedback. Otherwise LGTM
Signed-off-by: Daniel Wang <[email protected]>
Signed-off-by: Daniel Wang <[email protected]>
|
/ok to test 9e43759 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (1)
src/nat/data_models/span.py (1)
143-145: Fix typo in field descriptions.The descriptions contain "OTel-syle" which should be "OpenTelemetry-style" on both lines 143 and 145.
Apply this diff:
- description="The OTel-syle 128-bit trace ID of the span.") + description="The OpenTelemetry-style 128-bit trace ID of the span.") span_id: int = Field(default_factory=_generate_nonzero_span_id, - description="The OTel-syle 64-bit span ID of the span.") + description="The OpenTelemetry-style 64-bit span ID of the span.")
🧹 Nitpick comments (1)
src/nat/data_models/span.py (1)
151-152: UUID parsing may fail for unhyphenated hex strings.The validator uses
uuid.UUID(v)which requires hyphens or braces. If trace_id strings arrive as raw 32-character hex (e.g., from external systems or headers), parsing will fail.Apply this diff to handle both formats:
if isinstance(v, str): - v = uuid.UUID(v).int + try: + # Support both hyphenated UUID and 32-char hex + if len(v) == 32 and '-' not in v: + v = uuid.UUID(hex=v).int + else: + v = uuid.UUID(v).int + except (ValueError, AttributeError) as err: + raise ValueError(f"Invalid trace_id format: {v}") from err
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
src/nat/data_models/span.py(1 hunks)tests/nat/runtime/test_session_traceparent.py(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/nat/runtime/test_session_traceparent.py
🧰 Additional context used
📓 Path-based instructions (6)
**/*.{py,yaml,yml}
📄 CodeRabbit inference engine (.cursor/rules/nat-test-llm.mdc)
**/*.{py,yaml,yml}: Configure response_seq as a list of strings; values cycle per call, and [] yields an empty string.
Configure delay_ms to inject per-call artificial latency in milliseconds for nat_test_llm.
Files:
src/nat/data_models/span.py
**/*.py
📄 CodeRabbit inference engine (.cursor/rules/nat-test-llm.mdc)
**/*.py: Programmatic use: create TestLLMConfig(response_seq=[...], delay_ms=...), add with builder.add_llm("", cfg).
When retrieving the test LLM wrapper, use builder.get_llm(name, wrapper_type=LLMFrameworkEnum.) and call the framework’s method (e.g., ainvoke, achat, call).
**/*.py: In code comments/identifiers use NAT abbreviations as specified: nat for API namespace/CLI, nvidia-nat for package name, NAT for env var prefixes; do not use these abbreviations in documentation
Follow PEP 20 and PEP 8; run yapf with column_limit=120; use 4-space indentation; end files with a single trailing newline
Run ruff check --fix as linter (not formatter) using pyproject.toml config; fix warnings unless explicitly ignored
Respect naming: snake_case for functions/variables, PascalCase for classes, UPPER_CASE for constants
Treat pyright warnings as errors during development
Exception handling: use bare raise to re-raise; log with logger.error() when re-raising to avoid duplicate stack traces; use logger.exception() when catching without re-raising
Provide Google-style docstrings for every public module, class, function, and CLI command; first line concise and ending with a period; surround code entities with backticks
Validate and sanitize all user input, especially in web or CLI interfaces
Prefer httpx with SSL verification enabled by default and follow OWASP Top-10 recommendations
Use async/await for I/O-bound work; profile CPU-heavy paths with cProfile or mprof before optimizing; cache expensive computations with functools.lru_cache or external cache; leverage NumPy vectorized operations when beneficial
Files:
src/nat/data_models/span.py
src/**/*.py
📄 CodeRabbit inference engine (.cursor/rules/general.mdc)
All importable Python code must live under src/ (or packages//src/)
Files:
src/nat/data_models/span.py
src/nat/**/*
📄 CodeRabbit inference engine (.cursor/rules/general.mdc)
Changes in src/nat should prioritize backward compatibility
Files:
src/nat/data_models/span.py
⚙️ CodeRabbit configuration file
This directory contains the core functionality of the toolkit. Changes should prioritize backward compatibility.
Files:
src/nat/data_models/span.py
{src/**/*.py,packages/*/src/**/*.py}
📄 CodeRabbit inference engine (.cursor/rules/general.mdc)
All public APIs must have Python 3.11+ type hints on parameters and return values; prefer typing/collections.abc abstractions; use typing.Annotated when useful
Files:
src/nat/data_models/span.py
**/*
⚙️ CodeRabbit configuration file
**/*: # Code Review Instructions
- Ensure the code follows best practices and coding standards. - For Python code, follow
PEP 20 and
PEP 8 for style guidelines.- Check for security vulnerabilities and potential issues. - Python methods should use type hints for all parameters and return values.
Example:def my_function(param1: int, param2: str) -> bool: pass- For Python exception handling, ensure proper stack trace preservation:
- When re-raising exceptions: use bare
raisestatements to maintain the original stack trace,
and uselogger.error()(notlogger.exception()) to avoid duplicate stack trace output.- When catching and logging exceptions without re-raising: always use
logger.exception()
to capture the full stack trace information.Documentation Review Instructions - Verify that documentation and comments are clear and comprehensive. - Verify that the documentation doesn't contain any TODOs, FIXMEs or placeholder text like "lorem ipsum". - Verify that the documentation doesn't contain any offensive or outdated terms. - Verify that documentation and comments are free of spelling mistakes, ensure the documentation doesn't contain any
words listed in the
ci/vale/styles/config/vocabularies/nat/reject.txtfile, words that might appear to be
spelling mistakes but are listed in theci/vale/styles/config/vocabularies/nat/accept.txtfile are OK.Misc. - All code (except .mdc files that contain Cursor rules) should be licensed under the Apache License 2.0,
and should contain an Apache License 2.0 header comment at the top of each file.
- Confirm that copyright years are up-to date whenever a file is changed.
Files:
src/nat/data_models/span.py
🧬 Code graph analysis (1)
src/nat/data_models/span.py (1)
packages/nvidia_nat_opentelemetry/src/nat/plugins/opentelemetry/otel_span.py (2)
trace_id(274-280)span_id(265-271)
🪛 Ruff (0.13.2)
src/nat/data_models/span.py
156-156: Avoid specifying long messages outside the exception class
(TRY003)
167-167: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
167-167: Avoid specifying long messages outside the exception class
(TRY003)
171-171: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: CI Pipeline / Check
🔇 Additional comments (2)
src/nat/data_models/span.py (2)
131-133: LGTM!The trace ID generator correctly uses
uuid.uuid4().intto produce a 128-bit non-zero value, aligning with OpenTelemetry standards.
136-138: LGTM!The span ID generator correctly extracts the lower 64 bits from a UUID, producing a non-zero 64-bit value per OpenTelemetry conventions.
Signed-off-by: Daniel Wang <[email protected]>
|
/ok to test 58ae9cc |
## Description <!-- Note: The pull request title will be included in the CHANGELOG. --> <!-- Provide a standalone description of changes in this PR. --> <!-- Reference any issues closed by this PR with "closes #1234". All PRs should have an issue they close--> Closes 1897 **Added**: - `ContextState` fields: `workflow_run_id`, `workflow_trace_id` (`src/nat/builder/context.py`) - Non-zero ID generators & validators for OpenTelemetry-sized IDs (`src/nat/data_models/span.py`) - Workflow start/end event emission with metadata: `workflow_run_id`, `workflow_trace_id`, `conversation_id` (`src/nat/runtime/runner.py`) - Span attributes for correlation: `.workflow.run_id`, `.workflow.trace_id`, `.conversation.id` (`src/nat/observability/exporter/span_exporter.py`) - Tests: `tests/nat/opentelemetry/test_otel_span_ids.py`, `tests/nat/runtime/test_runner_trace_ids.py`, `tests/nat/runtime/test_session_traceparent.py`, updates to `tests/nat/runtime/test_runner.py` **Changed**: - Ensure non-zero 128-bit trace IDs and 64-bit span IDs in OTEL plugin (`packages/nvidia_nat_opentelemetry/src/nat/plugins/opentelemetry/otel_span.py`) - Runner establishes workflow-scoped run/trace IDs and resets them on exit (`src/nat/runtime/runner.py`) - Exporter prefers parent/context trace ID; falls back to workflow trace ID and sets correlated attributes (`src/nat/observability/exporter/span_exporter.py`) - `SpanContext` defaults & field validators enforce size/range and non-zero invariants (`src/nat/data_models/span.py`) - Parse/propagate incoming W3C `traceparent` and workflow headers to session/runner (`src/nat/runtime/session.py`, `tests/nat/runtime/test_session_traceparent.py`) **Removed**: - None **Why it matters**: - Consistent cross-span correlation and easier log/trace joins across agents and sub-spans - Fewer broken traces: eliminates zero/invalid IDs, aligns with OpenTelemetry sizing - Improves debugging and fleet-wide reporting via explicit `.workflow.*` and `.conversation.id` attributes - Safer header handling for external callers; preserves upstream traces **Breaking changes / Migrations**: - None **Docs**: - None ## By Submitting this PR I confirm: - I am familiar with the [Contributing Guidelines](https://github.com/NVIDIA/NeMo-Agent-Toolkit/blob/develop/docs/source/resources/contributing.md). - We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original work, or you have rights to submit it under the same license, or a compatible license. - Any contribution which contains commits that are not Signed-Off will not be accepted. - When the PR is ready for review, new or existing tests cover these changes. - When the PR is ready for review, the documentation is up to date with these changes. <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Workflow-level run and trace IDs are created, stored in context, propagated, and emitted with WORKFLOW_START/WORKFLOW_END events. * Incoming HTTP tracing headers (traceparent, workflow-trace-id, workflow-run-id) are parsed into context. * **Improvements** * Span metadata now includes workflow IDs and richer attributes (event type, function info, timestamps, conversation/run IDs). * ID generation and validation tightened to ensure non-zero, correctly sized trace and span IDs; span context ID generation approach updated. * **Tests** * Added tests validating span ID formats, runner trace/run ID behavior, and header parsing. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: Daniel Wang <[email protected]> Signed-off-by: Zhongxuan (Daniel) Wang <[email protected]> Signed-off-by: Yuchen Zhang <[email protected]>
Description
Closes 1897
Added:
ContextStatefields:workflow_run_id,workflow_trace_id(src/nat/builder/context.py)src/nat/data_models/span.py)workflow_run_id,workflow_trace_id,conversation_id(src/nat/runtime/runner.py).workflow.run_id,.workflow.trace_id,.conversation.id(src/nat/observability/exporter/span_exporter.py)tests/nat/opentelemetry/test_otel_span_ids.py,tests/nat/runtime/test_runner_trace_ids.py,tests/nat/runtime/test_session_traceparent.py, updates totests/nat/runtime/test_runner.pyChanged:
packages/nvidia_nat_opentelemetry/src/nat/plugins/opentelemetry/otel_span.py)src/nat/runtime/runner.py)src/nat/observability/exporter/span_exporter.py)SpanContextdefaults & field validators enforce size/range and non-zero invariants (src/nat/data_models/span.py)traceparentand workflow headers to session/runner (src/nat/runtime/session.py,tests/nat/runtime/test_session_traceparent.py)Removed:
Why it matters:
.workflow.*and.conversation.idattributesBreaking changes / Migrations:
Docs:
By Submitting this PR I confirm:
Summary by CodeRabbit
New Features
Improvements
Tests