
Add workgroup size attribute to AMDGPU functions in codegen #4342

Merged
merged 1 commit into apache:master on Nov 15, 2019

Conversation

@t-vi (Contributor) commented Nov 14, 2019

When we did not set the workgroup size, LLVM would use too many registers for kernel launches with many threads. This resulted in "invalid ISA" errors. Here we set the maximum workgroup size to the maximum threads per block reported by the device API.

One might later look into allowing configurations with fewer threads at runtime to use more registers.
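For context, the mechanism here is an LLVM function attribute on the generated kernel. Below is a minimal sketch of attaching it via the LLVM C++ API; the attribute name amdgpu-flat-work-group-size is LLVM's AMDGPU convention, while the helper name and call site are hypothetical, not the exact TVM code:

```cpp
#include <string>

#include "llvm/IR/Function.h"

// Tell LLVM's AMDGPU backend the workgroup size range this kernel may be
// launched with. The backend budgets registers against the maximum, so the
// upper bound must cover every thread count used at launch time.
void AnnotateWorkgroupSize(llvm::Function* kernel, int max_threads_per_block) {
  // The attribute value has the form "min,max".
  kernel->addFnAttr("amdgpu-flat-work-group-size",
                    "1," + std::to_string(max_threads_per_block));
}
```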

@t-vi (Contributor, Author) commented Nov 14, 2019

@masahi as a potential reviewer
@petrex FYI

@t-vi t-vi force-pushed the rocm_workgroup_size branch 3 times, most recently from 5602ae5 to 9ccc89d on November 14, 2019 at 21:21
@masahi masahi self-assigned this Nov 14, 2019
@petrex (Contributor) commented Nov 14, 2019

Thanks @t-vi.

@masahi This PR utilizes the device query for kMaxThreadsPerBlock; in that case we might need to land #4305 first. Thanks.

Background: details on ROCm register/thread usage control. This PR enables something similar for the AMD LLVM backend.

@masahi (Member) commented Nov 14, 2019

Does this solve the INVALID_ISA errors discussed in the forum when running certain models? For example, can you run ResNet-18 now?

@t-vi (Contributor, Author) commented Nov 14, 2019

@masahi Yes, indeed, it fixes this; in fact, I'm doing this with @mvermeulen's tests in mind.

@petrex This patch works independently of yours but will profit from it. The query API for kMaxThreadsPerBlock doesn't change; with your PR we just get better (i.e. non-hardcoded) defaults.
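To illustrate the query being discussed, here is a rough sketch of asking TVM's device API for kMaxThreadsPerBlock. It assumes the TVMContext/DeviceAPI::GetAttr interface of the TVM runtime of that era; the helper name and fallback value are illustrative, not the merged code:

```cpp
#include <tvm/runtime/device_api.h>
#include <tvm/runtime/packed_func.h>

// Ask the ROCm runtime for the device's max threads per block. If no device
// is present (e.g. when cross-compiling), fall back to a conservative value.
int DetectMaxThreadsPerBlock() {
  TVMContext ctx;
  ctx.device_type = kDLROCM;
  ctx.device_id = 0;
  tvm::runtime::DeviceAPI* api = tvm::runtime::DeviceAPI::Get(ctx, true);
  if (api != nullptr) {
    tvm::runtime::TVMRetValue val;
    api->GetAttr(ctx, tvm::runtime::kMaxThreadsPerBlock, &val);
    int64_t max_threads = val;
    if (max_threads > 0) return static_cast<int>(max_threads);
  }
  return 256;  // conservative default, as settled on in the review below
}
```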

@petrex (Contributor) commented Nov 14, 2019

Sure, let's not use a hardcoded value. By the way, you are testing on gfx900, right (or another arch)?

@masahi (Member) commented Nov 14, 2019

Great! Thanks.

```cpp
    }
  }
  // Fallback when the device API cannot report a value at compile time.
  LOG(WARNING) << "Cannot get maximum number of threads for AMD codegen";
  return 1024;
```
Inline review comment from a Contributor on the hunk above:
Can we set this value to 256?

In general 1024 is fine as long as the kernel does not use a lot of SGPRs/VGPRs. However, we also see kernels with high VGPR usage (> 128) generate incorrect results with workgroup size == 1024.
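As background for the register numbers in this thread, a back-of-the-envelope occupancy calculation. This sketch assumes GCN-era values (wave size 64, 4 SIMDs per CU, 256 VGPRs per SIMD, one workgroup resident on a single CU); exact figures vary by architecture:

```cpp
#include <iostream>

// Rough per-thread VGPR budget for a given workgroup size on GCN-like
// hardware: all waves of one workgroup share a CU's register files.
int MaxVgprsPerThread(int workgroup_size) {
  const int kWaveSize = 64;       // threads per wavefront
  const int kSimdsPerCu = 4;      // SIMD units per compute unit
  const int kVgprsPerSimd = 256;  // VGPR file depth per SIMD
  int waves = (workgroup_size + kWaveSize - 1) / kWaveSize;
  int waves_per_simd = (waves + kSimdsPerCu - 1) / kSimdsPerCu;
  return kVgprsPerSimd / waves_per_simd;
}

int main() {
  std::cout << MaxVgprsPerThread(1024) << "\n";  // 64: high-VGPR kernels can't fit
  std::cout << MaxVgprsPerThread(256) << "\n";   // 256: effectively unconstrained
}
```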

@t-vi (Contributor, Author) replied Nov 14, 2019:

But this is the max workgroup size at compile time, so the kernels will be held to fewer VGPRs per thread when it is set larger.
Personally, I have half a mind to make it a LOG(FATAL), but I'm adapting to what the other code uses.

Reply from a Member:

I've also seen a weird issue when using 1024 threads per block. Making it 256 fixed it for me.

@t-vi (Contributor, Author) replied Nov 15, 2019:

OK, I only ever saw the trouble when I didn't set the workgroup size at compilation (which is what this patch fixes by setting the max workgroup size) and then used many threads at launch, but I've changed the default to 256 based on your expertise.

Reply from another commenter:

Since @t-vi pinged me: this is not entirely correct. Let me explain. Our LC backend assumes, in the absence of explicit annotation, the max workgroup size to be 256 and generates code for that. This impacts us differently than CUDA, since we finalize to ISA at compile time, as opposed to some IR that gets finalized at runtime. So indeed, if a kernel is dispatched with more than 256 threads, it may fail in interesting ways at runtime. There is internal discussion going on to finally mitigate this behavior at the FE level.

However, it is, as @t-vi correctly asserted, easy to fix: explicit annotation with __launch_bounds__() and the max workgroup size will fix this. Hence, just dropping back to 256 is not the optimal solution; it is a workaround. The optimal solution is to figure out the best workgroup size for a given kernel and annotate it explicitly. I would hence recommend @t-vi use the threads per block he finds performance-optimal.
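For reference, this is what the annotation looks like on a HIP source kernel; a minimal sketch with a made-up kernel, assuming standard HIP __launch_bounds__ syntax:

```cpp
#include <hip/hip_runtime.h>

// Declaring the launch bound up front lets the compiler budget registers
// for 1024-thread blocks when it finalizes the ISA.
__global__ void __launch_bounds__(1024) scale(float* data, float alpha, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= alpha;
}
```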

@t-vi (Contributor, Author) replied:

Well, this is just the default, so it should not matter...

Reply from a Contributor:

The optimal value depends on your VGPR/SGPR consumption, and that varies among kernels/architectures. 256 is a safe bet across multiple workloads/architectures.

Also, the TVM AMD backend does not use HIP for kernel construction; the LLVM backend generates the ISA. Are you referring to the NVIDIA path when you say to annotate the kernel with __launch_bounds__()?

The same commenter replied:

No. As said, NV doesn't have this issue, since they compile to PTX IR and any launch_bounds annotation is simply a performance optimization. This is independent of HIP: if you want to use a workgroup size > 256, you must tell LC about it. launch_bounds is the way to do it for HIP source kernels; there are obviously equivalent mechanisms along the stack to get said information to LC.

There is nothing inherently unstable about our HW with workgroup sizes > 256; you simply must use it correctly.

@masahi (Member) left a review:

Let's keep the default number of threads per block at 256. This is also consistent with what topi uses.

@t-vi (Contributor, Author) commented Nov 15, 2019

I changed the default to 256. This patch is independent of #4305 in terms of merge order (the decoupling of the device API from codegen at work 🙂).

@petrex (Contributor) commented Nov 15, 2019

Thanks, @t-vi. Can you sort out the CI issues? I saw tests fail, just not sure whether the failure is valid.

@t-vi (Contributor, Author) commented Nov 15, 2019

For me it says "All checks have passed", with 1 successful check.

@masahi (Member) commented Nov 15, 2019

The flaky test is a known issue (#4284). Finding a way to set the optimal workgroup size would be interesting future work.

@masahi masahi merged commit 0a9f7e9 into apache:master Nov 15, 2019
zxy844288792 pushed a commit to zxy844288792/tvm that referenced this pull request Nov 26, 2019
yongwww pushed a commit to neo-ai/tvm that referenced this pull request Nov 26, 2019