
Add workgroup size attribute to AMDGPU functions in codegen #4342

Merged
merged 1 commit into apache:master on Nov 15, 2019

Conversation

@t-vi (Contributor) commented Nov 14, 2019

When we did not set the workgroup size, LLVM would use too many registers for kernel launches with many threads. This resulted in "invalid ISA" errors. Here we set the maximum workgroup size to the maximum threads per block reported by the device API.

One might later look into allowing configurations with fewer threads at runtime to use more registers.
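For context, the mechanism here is an LLVM function attribute on the generated kernel. Below is a minimal sketch of attaching it via the LLVM C++ API; the attribute name amdgpu-flat-work-group-size is LLVM's AMDGPU convention, while the helper name and call site are hypothetical, not the exact TVM code:

```cpp
#include <string>

#include "llvm/IR/Function.h"

// Tell LLVM's AMDGPU backend the workgroup size range this kernel may be
// launched with. The backend budgets registers against the maximum, so the
// upper bound must cover every thread count used at launch time.
void AnnotateWorkgroupSize(llvm::Function* kernel, int max_threads_per_block) {
  // The attribute value has the form "min,max".
  kernel->addFnAttr("amdgpu-flat-work-group-size",
                    "1," + std::to_string(max_threads_per_block));
}
```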

@t-vi (Contributor, Author) commented Nov 14, 2019

@masahi as a potential reviewer
@petrex FYI

@t-vi t-vi force-pushed the rocm_workgroup_size branch 3 times, most recently from 5602ae5 to 9ccc89d on November 14, 2019 at 21:21
@masahi masahi self-assigned this Nov 14, 2019
@petrex (Contributor) commented Nov 14, 2019

Thanks @t-vi.

@masahi This PR utilizes the device query for kMaxThreadsPerBlock; in that case we might need to land #4305 first. Thanks.

Background: details on ROCm register/thread usage control. This PR enables something similar for the AMD LLVM backend.

@masahi (Member) commented Nov 14, 2019

Does this solve the INVALID_ISA errors discussed in the forum when running certain models? For example, can you run ResNet-18 now?

@t-vi (Contributor, Author) commented Nov 14, 2019

@masahi Yes, indeed, it fixes this; in fact, I'm doing this with @mvermeulen's tests in mind.

@petrex This patch works independently of yours but will profit from it. The query API for kMaxThreadsPerBlock doesn't change; with your PR we just get better (i.e. non-hardcoded) defaults.
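To illustrate the query being discussed, here is a rough sketch of asking TVM's device API for kMaxThreadsPerBlock. It assumes the TVMContext/DeviceAPI::GetAttr interface of the TVM runtime of that era; the helper name and fallback value are illustrative, not the merged code:

```cpp
#include <tvm/runtime/device_api.h>
#include <tvm/runtime/packed_func.h>

// Ask the ROCm runtime for the device's max threads per block. If no device
// is present (e.g. when cross-compiling), fall back to a conservative value.
int DetectMaxThreadsPerBlock() {
  TVMContext ctx;
  ctx.device_type = kDLROCM;
  ctx.device_id = 0;
  tvm::runtime::DeviceAPI* api = tvm::runtime::DeviceAPI::Get(ctx, true);
  if (api != nullptr) {
    tvm::runtime::TVMRetValue val;
    api->GetAttr(ctx, tvm::runtime::kMaxThreadsPerBlock, &val);
    int64_t max_threads = val;
    if (max_threads > 0) return static_cast<int>(max_threads);
  }
  return 256;  // conservative default, as settled on in the review below
}
```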

@petrex (Contributor) commented Nov 14, 2019

Sure, let's not use a hardcoded value. By the way, you are testing on gfx900, right (or another arch)?

@masahi (Member) commented Nov 14, 2019

Great! Thanks.

```cpp
    }
  }
  // Fallback when the device API cannot report a value at compile time.
  LOG(WARNING) << "Cannot get maximum number of threads for AMD codegen";
  return 1024;
```
Inline review comment from a Contributor on the hunk above:
Can we set this value to 256?

In general 1024 is fine as long as the kernel does not use a lot of SGPRs/VGPRs. However, we also see kernels with high VGPR usage (> 128) generate incorrect results with workgroup size == 1024.
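As background for the register numbers in this thread, a back-of-the-envelope occupancy calculation. This sketch assumes GCN-era values (wave size 64, 4 SIMDs per CU, 256 VGPRs per SIMD, one workgroup resident on a single CU); exact figures vary by architecture:

```cpp
#include <iostream>

// Rough per-thread VGPR budget for a given workgroup size on GCN-like
// hardware: all waves of one workgroup share a CU's register files.
int MaxVgprsPerThread(int workgroup_size) {
  const int kWaveSize = 64;       // threads per wavefront
  const int kSimdsPerCu = 4;      // SIMD units per compute unit
  const int kVgprsPerSimd = 256;  // VGPR file depth per SIMD
  int waves = (workgroup_size + kWaveSize - 1) / kWaveSize;
  int waves_per_simd = (waves + kSimdsPerCu - 1) / kSimdsPerCu;
  return kVgprsPerSimd / waves_per_simd;
}

int main() {
  std::cout << MaxVgprsPerThread(1024) << "\n";  // 64: high-VGPR kernels can't fit
  std::cout << MaxVgprsPerThread(256) << "\n";   // 256: effectively unconstrained
}
```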

@t-vi (Contributor, Author) replied Nov 14, 2019:

But this is the max workgroup size at compile time, so the kernels will be held to fewer VGPRs per thread when it is set larger.
Personally, I have half a mind to make it a LOG(FATAL), but I'm adapting to what the other code uses.

Reply from a Member:

I've also seen a weird issue when using 1024 threads per block. Making it 256 fixed it for me.

@t-vi (Contributor, Author) replied Nov 15, 2019:

OK, I only ever saw the trouble when I didn't set the workgroup size at compilation (which is what this patch fixes by setting the max workgroup size) and then used many threads at launch, but I've changed the default to 256 based on your expertise.

Reply from another commenter:

Since @t-vi pinged me: this is not entirely correct. Let me explain. Our LC backend assumes, in the absence of explicit annotation, the max workgroup size to be 256 and generates code for that. This impacts us differently than CUDA, since we finalize to ISA at compile time, as opposed to some IR that gets finalized at runtime. So indeed, if a kernel is dispatched with more than 256 threads, it may fail in interesting ways at runtime. There is internal discussion going on to finally mitigate this behavior at the FE level.

However, it is, as @t-vi correctly asserted, easy to fix: explicit annotation with __launch_bounds__() and the max workgroup size will fix this. Hence, just dropping back to 256 is not the optimal solution; it is a workaround. The optimal solution is to figure out the best workgroup size for a given kernel and annotate it explicitly. I would hence recommend @t-vi use the threads per block he finds performance-optimal.
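For reference, this is what the annotation looks like on a HIP source kernel; a minimal sketch with a made-up kernel, assuming standard HIP __launch_bounds__ syntax:

```cpp
#include <hip/hip_runtime.h>

// Declaring the launch bound up front lets the compiler budget registers
// for 1024-thread blocks when it finalizes the ISA.
__global__ void __launch_bounds__(1024) scale(float* data, float alpha, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= alpha;
}
```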

@t-vi (Contributor, Author) replied:

Well, this is just the default, so it should not matter...

Reply from a Contributor:

The optimal value depends on your VGPR/SGPR consumption, and that varies among kernels/architectures. 256 is a safe bet across multiple workloads/architectures.

Also, the TVM AMD backend does not use HIP for kernel construction; the LLVM backend generates the ISA. Are you referring to the NVIDIA path when you say to annotate the kernel with __launch_bounds__()?

The same commenter replied:

No. As said, NV doesn't have this issue, since they compile to PTX IR and any launch_bounds annotation is simply a performance optimization. This is independent of HIP: if you want to use a workgroup size > 256, you must tell LC about it. launch_bounds is the way to do it for HIP source kernels; there are obviously equivalent mechanisms along the stack to get said information to LC.

There is nothing inherently unstable about our HW with workgroup sizes > 256; you simply must use it correctly.

@masahi (Member) left a review:

Let's keep the default number of threads per block at 256. This is also consistent with what topi uses.

@t-vi (Contributor, Author) commented Nov 15, 2019

I changed the default to 256. This patch is independent of #4305 in terms of merge order (the decoupling of the device API from codegen at work 🙂).

@petrex (Contributor) commented Nov 15, 2019

Thanks, @t-vi. Can you sort out the CI issues? I saw tests fail, just not sure whether the failure is valid.

@t-vi (Contributor, Author) commented Nov 15, 2019

For me it says "All checks have passed", with 1 successful check.

@masahi (Member) commented Nov 15, 2019

The flaky test is a known issue (#4284). Finding a way to set the optimal workgroup size would be interesting future work.

@masahi masahi merged commit 0a9f7e9 into apache:master Nov 15, 2019
zxy844288792 pushed a commit to zxy844288792/tvm that referenced this pull request Nov 26, 2019
yongwww pushed a commit to neo-ai/tvm that referenced this pull request Nov 26, 2019