Add workgroup size attribute to AMDGPU functions in codegen #4342
Conversation
Force-pushed 5602ae5 to 9ccc89d
Does this solve the issue of INVALID_ISA errors discussed in the forum when running certain models? For example, can you run resnet 18 now?
@masahi Yes, indeed, it fixes this; in fact, I'm doing this with @mvermeulen's tests in mind. @petrex This patch works independently from yours but will profit from yours. The query API for
Sure, let's not use a hardcoded value. By the way, you are testing on gfx900, right (or another arch)?
Great! Thanks.
src/codegen/llvm/codegen_amdgpu.cc (Outdated)
  }
}
LOG(WARNING) << "Cannot get maximum number of threads for AMD codegen";
return 1024;
Can we set this value to 256?
In general 1024 is fine as long as the kernel does not use a lot of SGPRs/VGPRs.
However, we also see kernels with high VGPR usage (> 128) generate incorrect results with workgroup size == 1024.
But this is the max workgroup size at compile time, so the kernels will be compiled to use fewer VGPRs when it is set larger.
Personally, I'd have half a mind to make it a LOG(FATAL), but I'm adapting to what the other code uses.
I've also seen a weird issue when using 1024 threads per block. Making it 256 fixed it for me.
OK, so I only ever had the trouble when I didn't set a workgroup size at compilation (which is what this patch fixes by setting the max workgroup size) and then used many threads at launch. But I've changed the default to 256 based on your expertise.
Since @t-vi pinged me: this is not entirely correct. Let me explain. Our LC backend assumes, in the absence of an explicit annotation, a max workgroup size of 256 and generates code for that. This impacts us differently than CUDA, since we finalize to ISA at compile time, as opposed to some IR that gets finalized at runtime. So indeed, if a kernel is dispatched with more than 256 threads, it may fail in interesting ways at runtime. There is internal discussion going on to finally mitigate this behavior on the FE level. However, it is, as @t-vi correctly asserted, easy to fix: explicit annotation with __launch_bounds__() and the max workgroup size will fix this. Hence, just dropping back to 256 is not the optimal solution; it is a workaround. The optimal solution is to figure out the best workgroup size for a given kernel and annotate it explicitly. I would hence recommend @t-vi to use the threads per block he finds performance-optimal.
Well, this is just the default, so it should not matter...
The optimal value depends on your VGPR/SGPR consumption, and that varies among kernels/architectures. 256 is a safe bet across multiple workloads/architectures.
Also, the TVM AMD backend does not use HIP for kernel construction; the LLVM backend generates the ISA. Are you referring to the Nvidia path when you say to annotate the kernel with __launch_bounds__()?
No. As said, NV doesn't have this issue since they compile to PTX IR, and any launch_bounds annotation is simply a performance optimization. This is independent of HIP: if you want to use a workgroup size > 256, you must tell LC about it. launch_bounds is the way to do it for HIP source kernels; there are obviously equivalent processes along the stack to get said information to LC.
There is nothing inherently unstable about our HW with workgroup sizes > 256; you simply must use it correctly.
Let's keep the default number of threads per block at 256. This is also consistent with what topi uses.
When we did not set the workgroup size, LLVM would use too many registers for kernel launches with many threads. This resulted in "invalid ISA" errors. Here we set the maximum workgroup size to the maximum threads per block from the device API. Of course, one might look into allowing configurations with fewer threads at runtime to use more registers.
Force-pushed 9ccc89d to ad59e64
I changed the default to 256. This patch is independent of #4305 in terms of merge order (the decoupling of the device API from codegen at work 🙂).
Thanks, @t-vi. Can you sort out the CI issues? I saw tests fail, just not sure if it is valid.
So for me it says "All checks have passed 1 successful check".
The flaky test is a known issue (#4284). Finding a way to set the optimal workgroup size would be interesting future work.