fix(ml): fix rocm ci #25541

Closed
mertalev wants to merge 3 commits into main from fix/ml-rocm-build

@mertalev
Member

Description

The targets were expanded in #23458 and ccache was added, but the build now takes over six hours. This PR removes the added target and temporarily ups parallelism to get a working image, which should hopefully let ccache improve future build times.

@bo0tzz
Member

bo0tzz commented Jan 26, 2026

This is just raw C(++?) compilation and not anything we could speed up with gpu hwaccel, right?

@mertalev
Member Author

> This is just raw C(++?) compilation and not anything we could speed up with gpu hwaccel, right?

Yeah, there's no way to use a GPU for this. Fortunately, caching should be very effective if it can build once and start using it, though I'm not 100% sure whether it will.

@mertalev
Member Author

mertalev commented Jan 27, 2026

Oof, it's still failing. Maybe the version bump added more things to compile? I suppose I can try reverting the ORT upgrade for ROCm. It's still the same version... why is it slower?

@savely-krasovsky
Contributor

I would put the cache behind a flag, since it's genuinely useful during local development. Hitting an error three hours into a build, fixing it within a few seconds, and then starting all over again is a real pain.

@mertalev
Member Author

mertalev commented Feb 6, 2026

> I would put the cache behind a flag, since it's genuinely useful during local development. Hitting an error three hours into a build, fixing it within a few seconds, and then starting all over again is a real pain.

It seems to time out when I just include the cache mount and ccache env, even if I don't have --use_cache set. But it builds when those are removed. Very odd... I guess bringing the image back to a working state is the priority before worrying about anything else.
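A minimal sketch of the flag-gated cache idea discussed above, not the actual build script. It assumes the cache is controlled by two things mentioned in the thread, the ccache environment and the `--use_cache` build flag, and gates both behind one switch so that a disabled cache leaves the build untouched. The helper name `ccache_settings`, the `ML_USE_CCACHE` variable, and the `/ccache` path are hypothetical; `CCACHE_DIR` is ccache's standard cache-location variable.

```python
import os


def ccache_settings(enabled: bool, cache_dir: str = "/ccache") -> tuple[dict[str, str], list[str]]:
    """Return (extra env vars, extra build flags) for an opt-in compiler cache.

    Hypothetical helper: when the cache is disabled, nothing is added, so the
    build command and environment stay identical to the uncached build.
    """
    if not enabled:
        return {}, []
    env = {"CCACHE_DIR": cache_dir}  # where ccache stores its objects
    flags = ["--use_cache"]          # flag named in the thread
    return env, flags


# ML_USE_CCACHE is an assumed opt-in switch, e.g. set for local development.
env, flags = ccache_settings(os.environ.get("ML_USE_CCACHE") == "1")
```

The cache mount itself would still need to be made conditional in the image build, which this sketch does not cover.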

@kprinssu
Contributor

kprinssu commented Feb 10, 2026

@mertalev Is removing gfx906 support intentional? I have various Vega VII/MI50 cards and ORT works well on them.

To reduce build times further, I suggest also removing the extended list of GPU architectures. I primarily target gfx906 and gfx1200. My build times are roughly 3-5 mins on my 5950X.
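As a rough illustration of narrowing the architecture list, the sketch below builds a CMake define from a short list of targets. It assumes the build picks up its GPU targets from CMake's `CMAKE_HIP_ARCHITECTURES` variable (a semicolon-separated list); whether this project's build actually uses that variable is an assumption, and `hip_arch_define` is a hypothetical helper.

```python
def hip_arch_define(archs: list[str]) -> str:
    """Build a CMAKE_HIP_ARCHITECTURES define from a list of gfx targets.

    Hypothetical helper: normalizes names, drops duplicates while keeping
    order, and rejects anything that does not look like a gfx target.
    """
    seen: list[str] = []
    for arch in archs:
        arch = arch.strip().lower()
        if not arch.startswith("gfx"):
            raise ValueError(f"not a gfx target: {arch!r}")
        if arch not in seen:
            seen.append(arch)
    # CMake lists are semicolon-separated.
    return "CMAKE_HIP_ARCHITECTURES=" + ";".join(seen)


print(hip_arch_define(["gfx906", "gfx1200"]))
# prints: CMAKE_HIP_ARCHITECTURES=gfx906;gfx1200
```

Fewer targets means fewer device-code variants to compile, which is where the build-time saving would come from.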

@savely-krasovsky
Contributor

@kprinssu yes, but I guess we need an image that will work across all architectures. Two old arches make no sense.

@savely-krasovsky
Contributor

I believe we should split AMD into two backends: the old ROCm-based backend for older arches and a new MIGraphX backend for newer arches.

@kprinssu
Contributor

@savely-krasovsky We can support older arches with newer ROCm support. However, we will need to build rocBLAS or use a janky solution.

I am personally using the official Arch Linux team's rocBLAS binaries from here: https://archlinux.org/packages/extra/x86_64/rocblas/

mertalev closed this Feb 10, 2026
@mertalev
Member Author

No longer needed since we upgraded the build server.
