[CK] Add render group to AITER and FA dockers#6563

Merged: DDEle merged 3 commits into develop from yiding/fix-ck-aiter-fa-docker-render-group on Apr 21, 2026

@DDEle DDEle commented Apr 20, 2026

Motivation

The AITER and FA test dockers (Dockerfile.aiter, Dockerfile.fa) inherit from the rocm/pytorch base image. Recent updates to that base image dropped the render group from /etc/group, so every parallel test stage now fails on the test agents with:

docker: Error response from daemon: Unable to find group render:
no matching entries in group file.

Docker resolves the --group-add render that Jenkins passes to docker run against the container's /etc/group, not the host's. So even though the test agents have render in their /etc/group (GID 109), the lookup inside the container fails.
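The name lookup can be illustrated with a small sketch. It only models a by-name lookup in two group(5)-format files and is not how Docker implements it. `lookup_gid` and the sample file contents are hypothetical; only the host GID 109 for render comes from the description above.

```shell
#!/bin/sh
# lookup_gid NAME FILE
# Print the GID of group NAME in a group(5)-format FILE.
# Exits non-zero when FILE has no entry for NAME.
lookup_gid() {
    awk -F: -v name="$1" '
        $1 == name { print $3; found = 1; exit }
        END { exit !found }
    ' "$2"
}

# Hypothetical stand-ins for the host's and the container's /etc/group.
host_groups=$(mktemp)
container_groups=$(mktemp)
printf 'root:x:0:\nrender:x:109:\n' > "$host_groups"
printf 'root:x:0:\nvideo:x:44:\nubuntu:x:1000:\n' > "$container_groups"

# The host knows the name, the container does not.
lookup_gid render "$host_groups"    # prints 109
lookup_gid render "$container_groups" \
    || echo "render: no matching entry in the container group file"

rm -f "$host_groups" "$container_groups"
```

A numeric GID needs no name lookup, which is why passing the host's GID directly is unaffected by what the image ships.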

This failure has hit every recent develop build (#673, #674, #686, #688, #699, #708; 6 days in a row): the AITER tests fail within seconds, and the cascading failure aborts all downstream Build/FMHA/TILE_ENGINE stages.

Technical Details

Add groupadd -f render to both Dockerfile.aiter and Dockerfile.fa, mirroring what the main Dockerfile already does (Dockerfile:96) and what Dockerfile.pytorch does (Dockerfile.pytorch:4). The -f flag makes the command idempotent: it exits successfully, without error, if the group already exists.

This guarantees the render group is always present in the container, regardless of whether the base image happens to ship it.

Test Plan

Triggering AITER CI job:

Test Result

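Submission Checklist

[x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.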

The AITER and FA test dockers (Dockerfile.aiter, Dockerfile.fa) inherit
from the rocm/pytorch base image. Recent updates to the base image
dropped the render group from /etc/group, which makes docker run fail
on the test agents with:

  docker: Error response from daemon: Unable to find group render:
  no matching entries in group file.

This breaks all parallel stages because Jenkins resolves --group-add
render against the container's /etc/group, not the host's.

Add `groupadd -f render` to both dockers (matching Dockerfile.pytorch
and the main Dockerfile) so the group always exists in the container,
regardless of base image drift.
@DDEle DDEle requested a review from a team as a code owner April 20, 2026 08:25
@DDEle DDEle force-pushed the yiding/fix-ck-aiter-fa-docker-render-group branch from 80842f5 to 39f4755 Compare April 20, 2026 08:52
groupadd -f render without -g auto-assigns the next free GID (1001 on
the new rocm/pytorch base where only ubuntu=1000 exists), which then
collides with the explicit `groupadd -g 1001 jenkins` on the next line:

  groupadd: GID '1001' already exists

Move the render groupadd to after the jenkins user/group is created so
render gets 1002 instead. The actual GID render lands on doesn't matter
for GPU access because the docker run command also passes
`--group-add=109` (the host's render GID) directly.
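
The GID collision described above can be reproduced with a simplified model of how groupadd picks a default GID: one above the highest GID already in use within the allowed range. `next_gid` is a hypothetical helper, and the default range of 1000 to 60000 is an assumption based on common Debian/Ubuntu `login.defs` values, not something taken from the base image.

```shell
#!/bin/sh
# next_gid FILE [GID_MIN] [GID_MAX]
# Simplified model of groupadd's default GID choice: the highest GID in
# [GID_MIN, GID_MAX] found in FILE plus one, or GID_MIN if none is in range.
next_gid() {
    awk -F: -v min="${2:-1000}" -v max="${3:-60000}" '
        $3 + 0 >= min + 0 && $3 + 0 <= max + 0 && $3 + 0 > top { top = $3 + 0 }
        END { print (top ? top + 1 : min) }
    ' "$1"
}

# Hypothetical group file for the new base image: only ubuntu=1000 is in range.
groups_file=$(mktemp)
printf 'root:x:0:\nvideo:x:44:\nubuntu:x:1000:\n' > "$groups_file"

# render created first would take 1001, the GID reserved for jenkins.
next_gid "$groups_file"    # prints 1001

# With jenkins created first, render lands on 1002 instead.
printf 'jenkins:x:1001:\n' >> "$groups_file"
next_gid "$groups_file"    # prints 1002

rm -f "$groups_file"
```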
Belt-and-suspenders: current rocm/pytorch base still ships video group
(GID 44), but add `groupadd -f video` next to the render one so a
future base drop wouldn't break docker run --group-add video either.

`-f` is a no-op when the group already exists.
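
The resulting ordering can be sketched as a shell function. This is not the actual Dockerfile content: `setup_groups` and the `GROUPADD` override are hypothetical, added only so the sequence can be dry-run without root, and the jenkins user creation itself is left out.

```shell
#!/bin/sh
# GROUPADD is a hypothetical override hook for dry runs; a Dockerfile RUN
# step would call groupadd directly.
GROUPADD="${GROUPADD:-groupadd}"

setup_groups() {
    # Explicit GID first, so an auto-assigned GID cannot claim 1001 before it.
    $GROUPADD -g 1001 jenkins || return 1
    # -f: exit successfully if the group already exists.
    for g in render video; do
        $GROUPADD -f "$g" || return 1
    done
}
```

Calling it as `GROUPADD="echo groupadd" setup_groups` prints the three commands in order instead of running them.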
@DDEle DDEle merged commit 6559ac9 into develop Apr 21, 2026
52 of 54 checks passed
@DDEle DDEle deleted the yiding/fix-ck-aiter-fa-docker-render-group branch April 21, 2026 05:35
assistant-librarian Bot pushed a commit to ROCm/composable_kernel that referenced this pull request Apr 21, 2026
[CK] Add render group to AITER and FA dockers
aledudek pushed a commit that referenced this pull request May 20, 2026