[AMD] Fix CI RuntimeError: opentelemetry package is not installed by yichiche · Pull Request #23940 · sgl-project/sglang

yichiche · 2026-04-28T13:39:07Z

Motivation

The AMD ROCm CI is failing on all tracing-related tests across multiple jobs:

multimodal-gen-test-1-gpu-amd: test_spans_exported, test_spans_without_traceparent, test_batch_requests in test_tracing.py
multimodal-gen-test-2-gpu-amd: TestDisaggZImageTracing::test_disagg_spans_share_trace_id in test_disagg_server.py

All fail with the same root cause:

RuntimeError: opentelemetry package is not installed!!!
Please not enable tracing or install opentelemetry

This was introduced when OpenTelemetry tracing was added to the diffusion pipeline (#21254) and tracing CI tests were added (#21740), but the tracing optional dependency group was never wired into the ROCm install path.

The CUDA pyproject.toml correctly includes sglang[tracing] in its all extra (line 169), but the ROCm-specific pyproject_other.toml does not include it in all_hip.

Modifications

Single-line change in python/pyproject_other.toml:

Add "sglang[tracing]" to the all_hip optional dependency group:

-all_hip = ["sglang[srt_hip]", "sglang[diffusion_hip]"]
+all_hip = ["sglang[srt_hip]", "sglang[diffusion_hip]", "sglang[tracing]"]

The tracing extra was already defined in the same file (line 84-89) with the correct packages:

opentelemetry-sdk
opentelemetry-api
opentelemetry-exporter-otlp
opentelemetry-exporter-otlp-proto-grpc

The ROCm Dockerfile (docker/rocm.Dockerfile, line 256) installs python[all_hip], so this change automatically pulls in the tracing packages during image build without any Dockerfile modifications.

Accuracy Tests

N/A - dependency-only change, no model or kernel code affected.

Speed Tests and Profiling

N/A - no runtime code changes.

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
Follow the SGLang code style guidance.

Review and Merge Process

Ping Merge Oncalls to start the process. See the PR Merge Process.
Get approvals from CODEOWNERS and other reviewers.
Trigger CI tests with comments or contact authorized users to do so.
- Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

gemini-code-assist

Code Review

This pull request updates python/pyproject_other.toml to include sglang[tracing] in the all_hip dependency group. The reviewer suggests extending this change to other platform-specific groups (all_hpu, all_musa, all_mps) and the test extra to ensure consistency and prevent CI failures across different environments.

gemini-code-assist · 2026-04-28T13:48:21Z

 ]

-all_hip = ["sglang[srt_hip]", "sglang[diffusion_hip]"]
+all_hip = ["sglang[srt_hip]", "sglang[diffusion_hip]", "sglang[tracing]"]


While adding sglang[tracing] to all_hip correctly addresses the issue for AMD ROCm CI, the same inconsistency exists for other platform-specific 'all' extras in this file (all_hpu, all_musa, all_mps). To ensure feature parity and prevent similar CI failures on these platforms, consider adding sglang[tracing] to them as well.

Additionally, since the failure occurred during tracing-related tests, it might be beneficial to include sglang[tracing] in the test extra (line 157) to ensure these dependencies are always available for testing environments, regardless of the platform-specific 'all' extra used.

yichiche · 2026-04-29T01:20:55Z

@amd-bot ci-status

amd-bot · 2026-04-29T01:25:26Z

@yichiche

CI Status for PR #23940

PR: [AMD] Fix CI RuntimeError: opentelemetry package is not installed
Changed files: python/pyproject_other.toml (+1/-1), scripts/ci/amd/amd_ci_install_dependency.sh (+2/-2)

AMD: 8 failures (0 likely related) | Others: 0 failures

The PR change is verified working — installer now logs Installing python extras: [dev_hip,tracing,diffusion] and opentelemetry-sdk-1.41.1 (and the rest) gets installed. The original RuntimeError: opentelemetry package is not installed is gone from every shard with a usable log. None of the remaining failures involve the tracing import path.

multimodal-gen-test-1-gpu-amd (mi300, 0) is solved by [CI] Fix CI error: multimodal-gen-test-1-gpu-amd timeout by increasing partitions #24002
multimodal-gen-test-2-gpu-amd (mi300, 2) is solved by [AMD] Fix CI error: Increase disagg test timeout from 120s to 300s #24015

AMD CI Failures

Job	Test File	Test Function	Error	Related?	Explanation	Log
multimodal-gen-test-1-gpu-amd (mi300, 0)	N/A	N/A	Step "Run diffusion server tests (1-GPU)" cancelled after ~1h40m; Azure blob log lost (`BlobNotFound`)	🟢 Unlikely	Runner timeout / cancellation; install step succeeded. PR only changes pip extras and cannot trigger a runtime cancellation.	Log
multimodal-gen-test-1-gpu-amd (mi300, 2)	N/A	N/A	`manifest for 10.245.143.50:5000/rocm/sgl-dev:v0.5.10.post1-rocm700-mi30x-20260428 not found` then `docker pull rate limit` ×6 → `failed after 6 attempts`	🟢 Unlikely	Pure infrastructure (private registry image not yet pushed + Docker Hub anonymous rate limit). Unrelated to this PR.	Log
multimodal-gen-test-2-gpu-amd (mi300, 0)	`sglang/multimodal_gen/test/server/test_server_2_gpu.py`	`TestDiffusionServerTwoGpu::test_diffusion_generation` (all 11 parametrizations)	`RuntimeError: Server exited early (code 1)` → `torch.distributed.DistBackendError: NCCL ncclUnhandledCudaError` during `init_process_group`	🟢 Unlikely	NCCL/CUDA dist init failure on the runner (every parametrization same root cause). opentelemetry installed fine; failure is in `parallel_state.py` device init, not in tracing or in any code touched by the PR. Likely runner/driver issue.	Log
multimodal-gen-test-2-gpu-amd (mi300, 2)	`sglang/multimodal_gen/test/server/test_disagg_server.py`	`TestDisaggZImage1Rank::test_generates_image`, `TestDisaggZImage2RankDenoiser::test_generates_image_with_sp2_denoiser`, `TestDisaggZImageTracing::test_disagg_spans_share_trace_id`	Warmup `POST /v1/images/generations` → `500 Internal Server Error`; server log: `DiffusionServer timeout: request ... not completed within 120.0s`	🟡 Possibly	The tracing test (`test_disagg_spans_share_trace_id`) is now past the opentelemetry import (which the PR fixed) and reaches warmup — so the PR did its job for that test. The other 2 tests in the shard are non-tracing (Z-Image disagg) and passed in baseline run 25053383235, so something is making the disagg cluster slower in the PR run. Possibly a flake (model warmup hitting 120s ceiling on shared 2-GPU runner), but worth a rerun before merge.	Log
stage-b-test-1-gpu-small-amd-mi35x (mi35x-1)	`test/registered/core/test_gpt_oss_1gpu.py`	`TestGptOss1Gpu.test_mxfp4_20b`	`AssertionError: False is not true` after `retry() exceed maximum number of retries.`	🟢 Unlikely	Known never-passed on mi35x — in-flight fix #23829 sets `SGLANG_USE_AITER=1`. Failure pre-dates this PR and is unrelated (no tracing or pyproject involvement).	Log
stage-b-test-2-gpu-large-amd (mi300, 1)	`test/registered/hicache/test_hicache_storage_file_backend.py`	`TestHiCacheStorageAccuracy.test_eval_accuracy`	`AssertionError: 0.04500000000000004 not less than 0.03 : Accuracy should be consistent between cache states` (gsm8k accuracy delta 0.045 vs threshold 0.03)	🟢 Unlikely	Same shard PASSED on baseline scheduled main run job 73386927933. PR only adds a pip extra — it cannot affect hicache eval accuracy. Looks like an accuracy flake (just barely over threshold).	Log
wait-for-stage-b-amd	N/A	N/A	Gating job timed out / failed because stage-b dependency jobs failed	🟢 Unlikely	Cascading from the stage-b failures above.	Log
pr-test-amd-finish	N/A	N/A	`multimodal-gen-test-1-gpu-amd: failure` → exit 1	🟢 Unlikely	Aggregator — fails because of upstream failures.	Log

Details

The PR's intended fix is verified working. In baseline scheduled run 25053383235, the multimodal-gen-test-1-gpu-amd (mi325, 3) shard ran test_tracing.py and emitted ERROR test_spans_exported, ERROR test_spans_without_traceparent, ERROR test_batch_requests — all with RuntimeError: opentelemetry package is not installed!!! Please not enable tracing or install opentelemetry. The same shard in this PR run installed opentelemetry-sdk-1.41.1, opentelemetry-api-1.41.1, opentelemetry-exporter-otlp-1.41.1, opentelemetry-exporter-otlp-proto-grpc-1.41.1 (visible in logs of shards we have, e.g. job 73408828905 line 1535) and that import error is gone everywhere.

The single 🟡 row is multimodal-gen-test-2-gpu-amd (mi300, 2). The tracing test there now gets past the import and fails at the warmup HTTP call, alongside two non-tracing Z-Image disagg tests in the same shard. The two non-tracing tests passed in the baseline run, so the new failure is not just opentelemetry-related — but the PR doesn't touch any disagg/server runtime code. Most plausible read is a 2-GPU disagg-cluster warmup flake; a rerun should clear it. If reruns also fail this exact set, the underlying disagg-cluster startup needs separate investigation, independent of this PR.

Bottom line: the PR fix is correct and minimally scoped. None of the 8 red jobs implicates the PR's diff. Recommend rerunning the AMD jobs (especially `multimodal-gen-test-2-gpu-amd (mi300, 0)` and `(mi300, 2)`, plus the docker-pull-rate-limited shard 2) to confirm.

Generated by amd-bot using Claude Code CLI

bingxche · 2026-04-29T10:01:21Z

Tests passed. Can be merged.

https://github.com/sgl-project/sglang/actions/runs/25098445552/job/73541121555

…l-project#23940) Co-authored-by: Bingxu Chen <bingxche@amd.com>

[AMD] Fix CI RuntimeError: opentelemetry package is not installed

e476551

github-actions Bot added the dependencies Pull requests that update a dependency file label Apr 28, 2026

gemini-code-assist Bot reviewed Apr 28, 2026

View reviewed changes

add tracing in install dependency

b1bc152

github-actions Bot added the amd label Apr 28, 2026

bingxche added the run-ci label Apr 28, 2026

Merge branch 'main' into yichiche/fix-ci-opentelemetry

0ad3c97

This was referenced Apr 29, 2026

[AMD] Support fp8 MLA for diffusion model #20319

Merged

[AMD] Fix CI error: Increase disagg test timeout from 120s to 300s #24015

Closed

bingxche approved these changes Apr 29, 2026

View reviewed changes

bingxche merged commit 180bb26 into sgl-project:main Apr 29, 2026
112 of 127 checks passed

vguduruTT pushed a commit to vguduruTT/sglang that referenced this pull request May 2, 2026

[AMD] Fix CI RuntimeError: opentelemetry package is not installed (sg…

fb882cf

…l-project#23940) Co-authored-by: Bingxu Chen <bingxche@amd.com>

LucQueen pushed a commit to LucQueen/sglang that referenced this pull request May 12, 2026

[AMD] Fix CI RuntimeError: opentelemetry package is not installed (sg…

fd87496

…l-project#23940) Co-authored-by: Bingxu Chen <bingxche@amd.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[AMD] Fix CI RuntimeError: opentelemetry package is not installed#23940

[AMD] Fix CI RuntimeError: opentelemetry package is not installed#23940
bingxche merged 3 commits into
sgl-project:mainfrom
yichiche:yichiche/fix-ci-opentelemetry

yichiche commented Apr 28, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Apr 28, 2026

Uh oh!

yichiche commented Apr 29, 2026

Uh oh!

amd-bot commented Apr 29, 2026 •

edited by yichiche

Loading

Uh oh!

bingxche commented Apr 29, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

yichiche commented Apr 28, 2026

Motivation

Modifications

Accuracy Tests

Speed Tests and Profiling

Checklist

Review and Merge Process

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Apr 28, 2026

Choose a reason for hiding this comment

Uh oh!

yichiche commented Apr 29, 2026

Uh oh!

amd-bot commented Apr 29, 2026 • edited by yichiche Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

CI Status for PR #23940

AMD CI Failures

Details

Bottom line: the PR fix is correct and minimally scoped. None of the 8 red jobs implicates the PR's diff. Recommend rerunning the AMD jobs (especially multimodal-gen-test-2-gpu-amd (mi300, 0) and (mi300, 2), plus the docker-pull-rate-limited shard 2) to confirm.

Uh oh!

bingxche commented Apr 29, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

amd-bot commented Apr 29, 2026 •

edited by yichiche

Loading

Bottom line: the PR fix is correct and minimally scoped. None of the 8 red jobs implicates the PR's diff. Recommend rerunning the AMD jobs (especially `multimodal-gen-test-2-gpu-amd (mi300, 0)` and `(mi300, 2)`, plus the docker-pull-rate-limited shard 2) to confirm.