Skip to content

[ROCm][CI] Fix AMD pipeline generator command wrapping, gating, and git diff behavior#347

Open
AndreasKaratzas wants to merge 6 commits into
mainfrom
akaratza_amdci_markers_fix
Open

[ROCm][CI] Fix AMD pipeline generator command wrapping, gating, and git diff behavior#347
AndreasKaratzas wants to merge 6 commits into
mainfrom
akaratza_amdci_markers_fix

Conversation

@AndreasKaratzas
Copy link
Copy Markdown
Member

@AndreasKaratzas AndreasKaratzas commented Apr 23, 2026

This PR hardens AMD support in the Python pipeline generator and addresses a few related generator issues.

buildkite/bootstrap-amd.sh

  • Set a default for AMD_USE_PIPELINE_GENERATOR=0.
  • Removed the old attempted handoff into the Python generator.
  • Left a clear message that bootstrap-amd.sh still uses the dedicated Jinja path.

buildkite/pipeline_generator/buildkite_step.py

  • Fixed source file dependency matching to use real exact/prefix path matching instead of substring matching.
  • Added shared blocking logic so AMD mirror steps follow the same auto-run vs block rules as their parent steps.
  • Updated AMD mirror steps to pass commands through:
    • bash .buildkite/scripts/hardware_ci/run-amd-test.sh
    • VLLM_TEST_COMMANDS
  • Updated AMD custom command overrides to go through the same command preparation path.
  • Replaced the old boolean-style GPU setup toggle with:
    • setup_profile="nvidia" | "amd" | "none"
  • Added _get_setup_commands(...) so setup logic lives in one place.
  • Kept NVIDIA behavior unchanged:
    • nvidia-smi
    • CUDA coredump exports
  • Updated AMD mirror steps to use:
    • setup_profile="amd"
  • AMD mirror commands now include:
    • amd-smi
  • AMD mirror commands still exclude:
    • nvidia-smi
    • CUDA coredump exports

buildkite/pipeline_generator/utils_lib/git_utils.py

  • Removed git add . from the diff helper.
  • Made git diff ranges explicit:
    • HEAD~1..HEAD on main
    • merge-base..HEAD on other branches
  • Fail clearly if merge base cannot be determined.

buildkite/pipeline_generator/pyproject.toml

  • Removed the stale utils module entry from py-modules.

buildkite/tests/pipeline_generator/test_step.py

  • Updated the test to match the current step.py API.
  • Added a small local sys.path setup.
  • Stubbed global config in-test so step loading stays deterministic.
  • Switched coverage to the real step-loading and grouping flow.

buildkite/tests/pipeline_generator/test_amd_mirror.py

  • Added regression coverage for AMD mirror command construction and gating.
  • Added coverage for AMD custom command overrides.
  • Added a regression test for all three setup profiles:
    • nvidia
    • amd
    • none
  • Strengthened AMD mirror tests to verify:
    • amd-smi is present
    • nvidia-smi is absent
    • CUDA coredump env is absent

cc @kenroche

…it diff behavior

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
@AndreasKaratzas AndreasKaratzas requested a review from khluu April 23, 2026 04:58
Copy link
Copy Markdown
Member Author

@AndreasKaratzas AndreasKaratzas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@khluu These changes affect the upstream CI logic too, so check them out if you like them or if I should remove them.

merge_base = merge_base_commit
if not merge_base:
pass
raise RuntimeError("Failed to determine merge base commit for git diff.")
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@khluu Lmk if I should revert this part. I added it cause I thought that we should not silently pass right?

Comment on lines +150 to +168
def _get_setup_commands(step: Step, setup_profile: SetupProfile) -> List[str]:
if step.label.startswith(":docker:") or step.no_plugin or setup_profile == "none":
return []

if setup_profile == "nvidia":
return [
"echo '--- :nvidia: GPU Info'",
"(command nvidia-smi || true)",
"echo '--- :gear: CUDA Coredump Setup'",
"export CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1 && export CUDA_COREDUMP_SHOW_PROGRESS=1 && export CUDA_COREDUMP_GENERATION_FLAGS='skip_nonrelocated_elf_images,skip_global_memory,skip_shared_memory,skip_local_memory,skip_constbank_memory'",
]

if setup_profile == "amd":
return [
"echo '--- :amd: GPU Info'",
"(command amd-smi || true)",
]

raise ValueError(f"Unsupported setup profile: {setup_profile}")
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added different profile for AMD and that also gives flexibility for anyone else.

Comment on lines +230 to +243
def _matches_source_dependency(source_file: str, diff_file: str) -> bool:
normalized = source_file.rstrip("/")
if not normalized:
return False
return diff_file == normalized or diff_file.startswith(f"{normalized}/")


def _step_is_blocked(step: Step, list_file_diff: List[str]) -> bool:
global_config = get_global_config()
return (not _step_should_run(step, list_file_diff) or (
step.optional and global_config["nightly"] != "1"
))


Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The generator used raw substring matching:

if source_file in diff_file
That is too loose. Examples:

vllm/ would match notvllm/foo.py
src/file would match src/file_extra.py
The helper changes that to proper exact-or-prefix path matching:

exact file match: diff_file == normalized
directory prefix match: diff_file.startswith(f"{normalized}/")
So this one exists to make gating correct.

_step_is_blocked() is mostly a cleanup/helper.

I added it because the “should this step be behind a manual block?” logic was being repeated.

…s_fix

# Conflicts:
#	buildkite/pipeline_generator/buildkite_step.py
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
…tream vllm

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Copy link
Copy Markdown
Member

@khluu khluu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you also break this down to smaller PRs?

if branch == "main":
output = subprocess.check_output(
["git", "diff", "--name-only", "--diff-filter=ACMDR", "HEAD~1"],
["git", "diff", "--name-only", "--diff-filter=ACMDR", "HEAD~1", "HEAD"],
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are we adding HEAD at the end?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants