Distributed tracing implementation docs and guides #745

Merged: Gregory-Pereira merged 4 commits into llm-d:main from sallyom:dist-tracing-implement on Mar 4, 2026

@sallyom sallyom commented Feb 12, 2026

Everything required to enable tracing is here!
This also adds an install-otel-jaeger script, meant to be installed in the same namespace where you are running your llm-d components.
It also adds cleanup to the generate-traffic scripts, a new traffic script specifically for pd (prefill/decode disaggregation), and a dashboard for pd.

Must accompany the following for complete implementation:

sallyom commented Feb 12, 2026

pd-disaggregation traces in Jaeger:
[Screenshot, 2026-02-12 at 3:09:35 PM]
[Screenshot, 2026-02-12 at 3:11:38 PM]

Comment thread docs/monitoring/tracing/README.md Outdated
Comment thread docs/monitoring/tracing/README.md Outdated
Comment thread docs/monitoring/scripts/install-otel-collector-jaeger.sh Outdated
@sallyom sallyom force-pushed the dist-tracing-implement branch 3 times, most recently from 6e3eaef to 21fa3ff, on February 26, 2026 17:11
@sallyom sallyom force-pushed the dist-tracing-implement branch 3 times, most recently from b9b951d to 2cd9877, on February 26, 2026 18:12
sallyom and others added 4 commits March 3, 2026 14:31
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: sallyom <somalley@redhat.com>
Route all tracing through an OTel Collector instead of exporting directly
to Jaeger. The new install script deploys both the collector and Jaeger
into the user-specified namespace (-n required). It auto-detects the
OpenTelemetry Operator CRD and uses an OpenTelemetryCollector CR when
available, falling back to a standalone Deployment otherwise. Both paths
produce the same service name (otel-collector) so all chart defaults use
the short name http://otel-collector:4317, resolving within the workload
namespace with no cross-namespace references.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: sallyom <somalley@redhat.com>
Signed-off-by: sallyom <somalley@redhat.com>
Signed-off-by: sallyom <somalley@redhat.com>
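
The auto-detection described in the commit message above can be sketched roughly as follows. This is only an illustration of the idea, not the contents of install-otel-collector-jaeger.sh: the function names `collector_mode` and `otlp_endpoint` and the overridable `KUBECTL` variable are hypothetical, and the CRD name is the one the OpenTelemetry Operator normally registers.

```shell
#!/bin/sh
# Sketch only: the real install script may be structured differently.
# KUBECTL is overridable (an assumption made for illustration) so the
# detection logic can be exercised without a live cluster.
KUBECTL="${KUBECTL:-kubectl}"

# CRD normally registered by the OpenTelemetry Operator.
OTEL_CRD="opentelemetrycollectors.opentelemetry.io"

# Print "operator" when the CRD exists (use an OpenTelemetryCollector CR),
# otherwise "deployment" (fall back to a standalone Deployment).
collector_mode() {
  if "$KUBECTL" get crd "$OTEL_CRD" >/dev/null 2>&1; then
    echo "operator"
  else
    echo "deployment"
  fi
}

# Both install paths expose the same Service name, so workloads in the
# same namespace can use the short DNS name with no namespace suffix.
otlp_endpoint() {
  echo "http://otel-collector:4317"
}
```

Calling `collector_mode` against a cluster prints which install path would be taken. Because the endpoint is a short name, it resolves only from pods in the namespace passed to the script with `-n`.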
@Gregory-Pereira Gregory-Pereira force-pushed the dist-tracing-implement branch from 2cd9877 to 2084417 on March 3, 2026 22:36
@Gregory-Pereira Gregory-Pereira left a comment

/lgtm

@diegocastanibm diegocastanibm left a comment

LGTM

@Gregory-Pereira Gregory-Pereira left a comment

/lgtm

@Gregory-Pereira Gregory-Pereira merged commit 8be4e5f into llm-d:main Mar 4, 2026
38 of 39 checks passed
diegocastanibm pushed a commit to diegocastanibm/llm-d that referenced this pull request Mar 18, 2026
* Add tracing config to all GAIE guide values

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: sallyom <somalley@redhat.com>

* docs: replace install-jaeger.sh with install-otel-collector-jaeger.sh

Route all tracing through an OTel Collector instead of exporting directly
to Jaeger. The new install script deploys both the collector and Jaeger
into the user-specified namespace (-n required). It auto-detects the
OpenTelemetry Operator CRD and uses an OpenTelemetryCollector CR when
available, falling back to a standalone Deployment otherwise. Both paths
produce the same service name (otel-collector) so all chart defaults use
the short name http://otel-collector:4317, resolving within the workload
namespace with no cross-namespace references.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: sallyom <somalley@redhat.com>

* add tracing to new guides and clean up/update tracing README

Signed-off-by: sallyom <somalley@redhat.com>

* update promql queries

Signed-off-by: sallyom <somalley@redhat.com>

---------

Signed-off-by: sallyom <somalley@redhat.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
4 participants