
[AMD] Fix CI RuntimeError: opentelemetry package is not installed#23940

Merged
bingxche merged 3 commits into
sgl-project:mainfrom
yichiche:yichiche/fix-ci-opentelemetry
Apr 29, 2026


@yichiche
Collaborator

Motivation

The AMD ROCm CI is failing on all tracing-related tests across multiple jobs:

  • multimodal-gen-test-1-gpu-amd: test_spans_exported, test_spans_without_traceparent, test_batch_requests in test_tracing.py
  • multimodal-gen-test-2-gpu-amd: TestDisaggZImageTracing::test_disagg_spans_share_trace_id in test_disagg_server.py

All fail with the same root cause:

RuntimeError: opentelemetry package is not installed!!!
Please not enable tracing or install opentelemetry

This was introduced when OpenTelemetry tracing was added to the diffusion pipeline (#21254) and tracing CI tests were added (#21740), but the tracing optional dependency group was never wired into the ROCm install path.

The CUDA pyproject.toml correctly includes sglang[tracing] in its all extra (line 169), but the ROCm-specific pyproject_other.toml does not include it in all_hip.
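The failure mode above is typical of an optional-dependency guard: the package is imported only when the feature is switched on, and a missing install surfaces as a RuntimeError at that point. The following is a minimal sketch of that pattern, not sglang's actual code; `require_module` and `start_tracing` are hypothetical names introduced for illustration.

```python
import importlib


def require_module(module_name: str, feature: str):
    """Import an optional dependency or fail with an actionable message."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Re-raise as RuntimeError so the caller sees which feature needs it.
        raise RuntimeError(
            f"{module_name} package is not installed; "
            f"install it or disable {feature}"
        ) from exc


def start_tracing(enabled: bool, module_name: str = "opentelemetry"):
    # Hypothetical entry point: the optional package is only touched
    # when tracing was actually requested.
    if not enabled:
        return None
    return require_module(module_name, "tracing")
```

With a guard like this, an environment without the tracing packages works until a test enables tracing, which matches why only the tracing-related tests failed.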

Modifications

Single-line change in python/pyproject_other.toml:

Add "sglang[tracing]" to the all_hip optional dependency group:

-all_hip = ["sglang[srt_hip]", "sglang[diffusion_hip]"]
+all_hip = ["sglang[srt_hip]", "sglang[diffusion_hip]", "sglang[tracing]"]

The tracing extra was already defined in the same file (lines 84-89) with the correct packages:

  • opentelemetry-sdk
  • opentelemetry-api
  • opentelemetry-exporter-otlp
  • opentelemetry-exporter-otlp-proto-grpc

The ROCm Dockerfile (docker/rocm.Dockerfile, line 256) installs python[all_hip], so this change automatically pulls in the tracing packages during image build without any Dockerfile modifications.
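One way to confirm inside a built image that the tracing packages actually landed is to query the installed distribution metadata. This is a rough sketch using the standard library's `importlib.metadata`; the helper name `missing_distributions` is an assumption for illustration and is not part of the PR.

```python
from importlib import metadata

# Distribution names taken from the tracing extra listed above.
TRACING_DISTRIBUTIONS = [
    "opentelemetry-sdk",
    "opentelemetry-api",
    "opentelemetry-exporter-otlp",
    "opentelemetry-exporter-otlp-proto-grpc",
]


def missing_distributions(names):
    """Return the distribution names that are not installed."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    absent = missing_distributions(TRACING_DISTRIBUTIONS)
    if absent:
        raise SystemExit(f"missing tracing packages: {', '.join(absent)}")
    print("all tracing packages are installed")
```

Run as a script, it exits with a non-zero status when any of the four packages is absent, so it could serve as a quick post-install sanity check.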

Accuracy Tests

N/A - dependency-only change, no model or kernel code affected.

Speed Tests and Profiling

N/A - no runtime code changes.

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

@github-actions github-actions Bot added the dependencies label Apr 28, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request updates python/pyproject_other.toml to include sglang[tracing] in the all_hip dependency group. The reviewer suggests extending this change to other platform-specific groups (all_hpu, all_musa, all_mps) and the test extra to ensure consistency and prevent CI failures across different environments.

-all_hip = ["sglang[srt_hip]", "sglang[diffusion_hip]"]
+all_hip = ["sglang[srt_hip]", "sglang[diffusion_hip]", "sglang[tracing]"]

medium

While adding sglang[tracing] to all_hip correctly addresses the issue for AMD ROCm CI, the same inconsistency exists for other platform-specific 'all' extras in this file (all_hpu, all_musa, all_mps). To ensure feature parity and prevent similar CI failures on these platforms, consider adding sglang[tracing] to them as well.

Additionally, since the failure occurred during tracing-related tests, it might be beneficial to include sglang[tracing] in the test extra (line 157) to ensure these dependencies are always available for testing environments, regardless of the platform-specific 'all' extra used.
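The consistency check suggested here could be automated. The sketch below assumes the extras live under the standard `[project.optional-dependencies]` table and that the dependency is spelled exactly `sglang[tracing]`; the function names are hypothetical, and the contents of `all_hpu` in the usage example are placeholders rather than the real file contents.

```python
def extras_missing_dependency(extras, required="sglang[tracing]", prefix="all"):
    """Return extras whose name starts with `prefix` but lack `required`."""
    missing = []
    for name, deps in extras.items():
        # Exact string match: a differently spelled requirement is not detected.
        if name.startswith(prefix) and required not in deps:
            missing.append(name)
    return sorted(missing)


def load_extras(path):
    """Read the optional-dependencies table from a pyproject-style TOML file."""
    import tomllib  # standard library from Python 3.11 onward

    with open(path, "rb") as f:
        data = tomllib.load(f)
    return data.get("project", {}).get("optional-dependencies", {})


if __name__ == "__main__":
    example = {
        "all_hip": ["sglang[srt_hip]", "sglang[diffusion_hip]", "sglang[tracing]"],
        "all_hpu": ["placeholder-dependency"],  # placeholder contents
        "tracing": ["opentelemetry-sdk"],
    }
    print(extras_missing_dependency(example))
```

Pointing `load_extras` at python/pyproject_other.toml and passing the result to `extras_missing_dependency` would list the platform groups that still omit the tracing extra.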

@yichiche
Collaborator Author

@amd-bot ci-status

@amd-bot

amd-bot commented Apr 29, 2026

@yichiche

CI Status for PR #23940

PR: [AMD] Fix CI RuntimeError: opentelemetry package is not installed
Changed files: python/pyproject_other.toml (+1/-1), scripts/ci/amd/amd_ci_install_dependency.sh (+2/-2)

AMD: 8 failures (0 likely related) | Others: 0 failures

The PR change is verified working — installer now logs Installing python extras: [dev_hip,tracing,diffusion] and opentelemetry-sdk-1.41.1 (and the rest) gets installed. The original RuntimeError: opentelemetry package is not installed is gone from every shard with a usable log. None of the remaining failures involve the tracing import path.

AMD CI Failures

| Job | Test File | Test Function | Error | Related? | Explanation | Log |
| --- | --- | --- | --- | --- | --- | --- |
| multimodal-gen-test-1-gpu-amd (mi300, 0) | N/A | N/A | Step "Run diffusion server tests (1-GPU)" cancelled after ~1h40m; Azure blob log lost (BlobNotFound) | 🟢 Unlikely | Runner timeout / cancellation; install step succeeded. PR only changes pip extras and cannot trigger a runtime cancellation. | Log |
| multimodal-gen-test-1-gpu-amd (mi300, 2) | N/A | N/A | manifest for 10.245.143.50:5000/rocm/sgl-dev:v0.5.10.post1-rocm700-mi30x-20260428 not found then docker pull rate limit ×6 → failed after 6 attempts | 🟢 Unlikely | Pure infrastructure (private registry image not yet pushed + Docker Hub anonymous rate limit). Unrelated to this PR. | Log |
| multimodal-gen-test-2-gpu-amd (mi300, 0) | sglang/multimodal_gen/test/server/test_server_2_gpu.py | TestDiffusionServerTwoGpu::test_diffusion_generation (all 11 parametrizations) | RuntimeError: Server exited early (code 1); torch.distributed.DistBackendError: NCCL ncclUnhandledCudaError during init_process_group | 🟢 Unlikely | NCCL/CUDA dist init failure on the runner (every parametrization same root cause). opentelemetry installed fine; failure is in parallel_state.py device init, not in tracing or in any code touched by the PR. Likely runner/driver issue. | Log |
| multimodal-gen-test-2-gpu-amd (mi300, 2) | sglang/multimodal_gen/test/server/test_disagg_server.py | TestDisaggZImage1Rank::test_generates_image, TestDisaggZImage2RankDenoiser::test_generates_image_with_sp2_denoiser, TestDisaggZImageTracing::test_disagg_spans_share_trace_id | Warmup POST /v1/images/generations: 500 Internal Server Error; server log: DiffusionServer timeout: request ... not completed within 120.0s | 🟡 Possibly | The tracing test (test_disagg_spans_share_trace_id) is now past the opentelemetry import (which the PR fixed) and reaches warmup, so the PR did its job for that test. The other 2 tests in the shard are non-tracing (Z-Image disagg) and passed in baseline run 25053383235, so something is making the disagg cluster slower in the PR run. Possibly a flake (model warmup hitting 120s ceiling on shared 2-GPU runner), but worth a rerun before merge. | Log |
| stage-b-test-1-gpu-small-amd-mi35x (mi35x-1) | test/registered/core/test_gpt_oss_1gpu.py | TestGptOss1Gpu.test_mxfp4_20b | AssertionError: False is not true after retry() exceed maximum number of retries. | 🟢 Unlikely | Known never-passed on mi35x; in-flight fix #23829 sets SGLANG_USE_AITER=1. Failure pre-dates this PR and is unrelated (no tracing or pyproject involvement). | Log |
| stage-b-test-2-gpu-large-amd (mi300, 1) | test/registered/hicache/test_hicache_storage_file_backend.py | TestHiCacheStorageAccuracy.test_eval_accuracy | AssertionError: 0.04500000000000004 not less than 0.03 : Accuracy should be consistent between cache states (gsm8k accuracy delta 0.045 vs threshold 0.03) | 🟢 Unlikely | Same shard PASSED on baseline scheduled main run job 73386927933. PR only adds a pip extra; it cannot affect hicache eval accuracy. Looks like an accuracy flake (just barely over threshold). | Log |
| wait-for-stage-b-amd | N/A | N/A | Gating job timed out / failed because stage-b dependency jobs failed | 🟢 Unlikely | Cascading from the stage-b failures above. | Log |
| pr-test-amd-finish | N/A | N/A | multimodal-gen-test-1-gpu-amd: failure → exit 1 | 🟢 Unlikely | Aggregator; fails because of upstream failures. | Log |

Details

The PR's intended fix is verified working. In baseline scheduled run 25053383235, the multimodal-gen-test-1-gpu-amd (mi325, 3) shard ran test_tracing.py and emitted ERROR test_spans_exported, ERROR test_spans_without_traceparent, ERROR test_batch_requests — all with RuntimeError: opentelemetry package is not installed!!! Please not enable tracing or install opentelemetry. The same shard in this PR run installed opentelemetry-sdk-1.41.1, opentelemetry-api-1.41.1, opentelemetry-exporter-otlp-1.41.1, opentelemetry-exporter-otlp-proto-grpc-1.41.1 (visible in logs of shards we have, e.g. job 73408828905 line 1535) and that import error is gone everywhere.

The single 🟡 row is multimodal-gen-test-2-gpu-amd (mi300, 2). The tracing test there now gets past the import and fails at the warmup HTTP call, alongside two non-tracing Z-Image disagg tests in the same shard. The two non-tracing tests passed in the baseline run, so the new failure is not just opentelemetry-related — but the PR doesn't touch any disagg/server runtime code. Most plausible read is a 2-GPU disagg-cluster warmup flake; a rerun should clear it. If reruns also fail this exact set, the underlying disagg-cluster startup needs separate investigation, independent of this PR.

Bottom line: the PR fix is correct and minimally scoped. None of the 8 red jobs implicates the PR's diff. Recommend rerunning the AMD jobs (especially multimodal-gen-test-2-gpu-amd (mi300, 0) and (mi300, 2), plus the docker-pull-rate-limited shard 2) to confirm.

Generated by amd-bot using Claude Code CLI

@bingxche
Collaborator

@bingxche bingxche merged commit 180bb26 into sgl-project:main Apr 29, 2026
112 of 127 checks passed
vguduruTT pushed a commit to vguduruTT/sglang that referenced this pull request May 2, 2026
LucQueen pushed a commit to LucQueen/sglang that referenced this pull request May 12, 2026

Labels

amd, dependencies, run-ci


3 participants