
Conversation

@abulavin
Contributor

@abulavin abulavin commented Aug 5, 2024

Removed hard-coded references to "cuda" in the do_bench function of testing.py in favour of using PyTorch's DeviceInterface object

The existing do_bench hard-codes references to a cuda device type, making the implementation inflexible with respect to other device types in torch. PyTorch has a DeviceInterface class that provides a generic interface to different device types. Using the device interface, we can keep cuda as the default device in do_bench, while the implementation can now be used with other device types, e.g. out-of-tree backends.
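For context, the shape of the change is roughly the following. This is a hedged, simplified sketch rather than the exact diff: it assumes PyTorch's `get_interface_for_device` helper from `torch._dynamo.device_interface`, and it omits the warmup/repetition loop of the real `do_bench`.

```python
# Simplified sketch: do_bench takes a device_type instead of assuming "cuda"
# and goes through PyTorch's DeviceInterface for synchronization and timing.
import torch
from torch._dynamo.device_interface import get_interface_for_device


def do_bench(fn, device_type="cuda"):
    di = get_interface_for_device(device_type)

    fn()                # warm up / trigger compilation
    di.synchronize()

    # 256 MB int buffer used to evict the L2 cache between measurements,
    # mirroring what the current do_bench does with a hard-coded cuda tensor.
    cache = torch.empty(int(256e6 // 4), dtype=torch.int, device=device_type)

    start = di.Event(enable_timing=True)
    end = di.Event(enable_timing=True)
    cache.zero_()
    start.record()
    fn()
    end.record()
    di.synchronize()
    return start.elapsed_time(end)  # milliseconds
```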


The core Triton is a small number of people, and we receive many PRs (thank
you!). To help us review your code more quickly, if you are a new
contributor (less than 3 PRs merged) we ask that you complete the following
tasks and include the filled-out checklist in your PR description.

Complete the following tasks before sending your PR, and replace [ ] with
[x] to indicate you have done them.

- [x] I am not making a trivial change, such as fixing a typo in a comment.

- [x] I have written a PR description following these
  [rules](https://cbea.ms/git-commit/#why-not-how).

- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.

- Select one of the following.
  - [ ] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
  - [x] This PR does not need a test because existing tests in triton already exercise the implementation of do_bench.

- Select one of the following.
  - [x] I have not added any `lit` tests.
  - [ ] The `lit` tests I have added follow these
    [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices),
    including the "tests should be minimal" section. (Usually running Python code
    and using the instructions it generates is not minimal.)

@abulavin abulavin requested a review from ptillet as a code owner August 5, 2024 13:41
@Jokeren
Contributor

Jokeren commented Aug 5, 2024

@int3 I think it might be good to block these PRs until the new benchmark interface lands. Could you please coordinate the effort and give us a timeline for when you think it will be ready?

@int3
Contributor

int3 commented Aug 5, 2024

Thanks for the heads up @Jokeren! Yes, we should definitely coordinate this.

Could I get input on whether backwards compatibility is something worth accounting for, per #4417 (comment)? That's mostly what I've been waiting for.

I should be able to put up a PR for the new benchmark interface within a week.

@int3
Contributor

int3 commented Aug 5, 2024

@abulavin I was actually thinking about your approach initially, but I opted instead to make the autotuner generic and keep do_bench platform-specific, because things like do_bench_cudagraph don't fit cleanly into the do_bench code structure. I think having multiple do_bench_{platform} implementations isn't the worst thing; there'll be some code duplication, but that's better than having to shoehorn multiple implementations into the same template. What do you think?

@AnthonyBarbier

I think having multiple do_bench_{platform} implementations isn't the worst thing

I don't think that's a good idea: do_bench is called directly by pytorch in the Triton inductor backend and that part of the code is device agnostic. (The changes in this PR + updating the callsites in PyTorch to forward the device type is all we need to make it work with our out-of-tree + non-cuda device type torch backend).

If you start having do_bench_{platform} functions then pytorch would have to hardcode some kind of switch / case dispatch with a set list of supported devices.

I think what you described in the other thread would be more sensible: having a default (device-agnostic) version which works for all backends, with the option for backends to provide their own do_bench implementation if they want to do things differently. (But at least the default would work.)
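A minimal sketch of that shape, using hypothetical names (`register_benchmarker` and `get_benchmarker` do not exist in Triton; the generic `do_bench` from `triton.testing` is used as the fallback):

```python
from triton.testing import do_bench

# Hypothetical registry: a backend may register its own benchmarker,
# everything else falls back to the generic, device-agnostic do_bench.
_BENCHMARKERS = {}


def register_benchmarker(device_type):
    def decorator(fn):
        _BENCHMARKERS[device_type] = fn
        return fn
    return decorator


def get_benchmarker(device_type):
    return _BENCHMARKERS.get(device_type, do_bench)
```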

@nmacchioni
Contributor

I don't think that's a good idea: do_bench is called directly by pytorch in the Triton inductor backend and that part of the code is device agnostic. (The changes in this PR + updating the callsites in PyTorch to forward the device type is all we need to make it work with our out-of-tree + non-cuda device type torch backend).

If you start having do_bench_{platform} functions then pytorch would have to hardcode some kind of switch / case dispatch with a set list of supported devices.

PyTorch (Inductor specifically) is moving towards using its own implementation of do_bench, so this should not be an issue on that front: pytorch/pytorch#130926

@int3
Contributor

int3 commented Aug 6, 2024

The changes in this PR + updating the callsites in PyTorch to forward the device type is all we need to make it work with our out-of-tree + non-cuda device type torch backend

It would generally work, but making do_bench properly device-agnostic will require a bit more -- the cache size should be parameterizable. We could handle that with an extra field on DeviceInterface, I suppose, or a flush_cache() method. But the fast_flush option is also device-specific (see #840), and we don't have a way of abstracting that out with DeviceInterface since the interface is stateless. Now fast_flush could possibly be removed as a parameter, it seems like it was there for migration purposes, but I could imagine that we would want to pass in other device-specific parameters in the future.
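To illustrate the two options being weighed here (both hypothetical -- neither an `l2_cache_size` field nor a `flush_cache()` method exists on PyTorch's `DeviceInterface` today):

```python
import torch
from torch._dynamo.device_interface import DeviceInterface


class SomeBackendInterface(DeviceInterface):
    # Option 1: an extra class-level field that do_bench could read.
    l2_cache_size = 256 * 1024 * 1024  # hypothetical attribute

    # Option 2: a method that hides how the cache is flushed on this backend.
    @staticmethod
    def flush_cache():
        buf = torch.empty(SomeBackendInterface.l2_cache_size // 4,
                          dtype=torch.int, device="cpu")  # placeholder device
        buf.zero_()
```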

I think what you described in the other thread would be more sensible

I have been describing the same thing in both threads, actually -- each backend would have its own default do_bench implementation, and the autotuner would pick the appropriate one.

@nmacchioni
Contributor

the cache size should be parameterizable

Device-aware cache size for L2 flushing is a good (but dangerous) idea; we are adding L2 cache size as an attribute in torch.cuda.get_device_properties (the name is confusing, but this is not Nvidia-specific). I say this is dangerous because, depending on your benchmarker, changing the cache flush size can and probably will dramatically affect the benchmarking results (i.e. you can end up in CPU-GPU lock-step, and then your benchmarking results are completely false).
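For reference, once that attribute lands it could be queried along these lines (a hedged example; the exact attribute name may differ across PyTorch versions, hence the `getattr` fallback):

```python
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(torch.cuda.current_device())
    # Fall back to None if this PyTorch build does not expose the attribute yet.
    l2_bytes = getattr(props, "L2_cache_size", None)
    print("L2 cache size (bytes):", l2_bytes)
```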

Now fast_flush could possibly be removed as a parameter, it seems like it was there for migration purposes,

I agree with this; I can't see any situation in which fast_flush=False is better.

but I could imagine that we would want to pass in other device-specific parameters in the future.

Shouldn't we figure this out first? In my opinion we should first have concrete evidence that, for example, do_bench is slow, inaccurate, or unusable on AMD devices before we go through the effort (and added complexity) of having backends select (or define) their benchmarker. And, if this concrete evidence does exist, we should first try to use that new knowledge to improve the generic do_bench before going down the per-device-benchmarker route.

@int3
Contributor

int3 commented Aug 6, 2024

Shouldn't we figure this out first?

I mean, there's value in making things easily extensible, even if you're not using it at the moment. I don't think there's much added complexity here; it just makes the code a little bit less DRY, which IMO is not a bad thing in a quickly evolving codebase.

But if we can make the existing do_bench properly generic by removing the fast_flush parameter & making the cache parameterizable, I think having it as the default implementation is clean enough. We should still make do_bench itself a parameter to the autotuner, so we can easily add implementations that use extra parameters (or fewer, in the case of cudagraph benchmarking.) And if we ever need something like fast_flush in the future, we should implement that by creating a copy of do_bench_{device}, rather than shoehorning it into the DeviceInterface. I think having DeviceInterface methods that really only get implemented by one device & are no-ops on all others is pretty ugly, and I want to minimize that.
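A hypothetical sketch of what "do_bench as a parameter to the autotuner" could look like internally (the `benchmarker` name is illustrative only, not an existing argument of Triton's autotuner):

```python
from triton.testing import do_bench


class Autotuner:
    def __init__(self, configs, benchmarker=do_bench):
        self.configs = configs
        # The injected benchmarker can be do_bench, do_bench_cudagraph, or a
        # backend-specific implementation with its own extra parameters.
        self.benchmarker = benchmarker

    def _time(self, kernel_call):
        # Benchmark one candidate config with whatever benchmarker was injected.
        return self.benchmarker(kernel_call, warmup=25, rep=100)
```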

@int3
Contributor

int3 commented Aug 7, 2024

So to summarize the discussion / next steps:

  1. I think we should land this; it's a step in the right direction (cc @abulavin @Jokeren)
  2. We should remove the fast_flush option (pretty trivial)
  3. We should extend DeviceInterface in PyTorch to support retrieving the last level cache size, then use it in do_bench
  4. Changes to the autotuner will proceed as outlined in Proposal for device-agnostic autotuner #4417

These changes can all proceed concurrently; merge conflicts shouldn't be too hard to resolve. And I think landing things incrementally will make development easier, rather than blocking everything on the new autotuner interface. @Jokeren how does that sound?

@nmacchioni
Contributor

  1. We should extend DeviceInterface in PyTorch to support retrieving the last level cache size, then use it in do_bench

We should be very cautious changing the size of the cache flush. I've attempted this before, and the end result was that we could not reliably obtain similar benchmarking results using exact L2 cache size vs. the default 256MB. In experimental runs this did significantly decrease E2E performance of many models.

@int3
Contributor

int3 commented Aug 7, 2024

I see... that's unfortunate :/ I guess we should make cache flush size a parameter then and default it to 256MB. No need to modify the DeviceInterface after all.
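Something along these lines (a sketch only; the `flush_cache_size` parameter name is illustrative, not the final API):

```python
import torch


def do_bench(fn, device_type="cuda", flush_cache_size=int(256e6)):
    # Buffer sized by the caller (default 256 MB) used to evict the L2 cache
    # between timed runs; the rest of do_bench would stay unchanged.
    cache = torch.empty(flush_cache_size // 4, dtype=torch.int,
                        device=device_type)
    cache.zero_()
    ...
```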

@Jokeren
Contributor

Jokeren commented Aug 7, 2024

I'm ok with merging

@nmacchioni
Contributor

I see... that's unfortunate :/ I guess we should make cache flush size a parameter then and default it to 256MB. No need to modify the DeviceInterface after all.

BTW pytorch/pytorch#132819 just merged, so we could always make this "more" device agnostic by taking max(256MB, L2 cache size), but this isn't really needed just yet
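If that ever becomes useful, the max(256MB, L2 cache size) idea might look roughly like this (hedged; the attribute name from the PyTorch PR above is guarded with `getattr`):

```python
import torch

DEFAULT_FLUSH_BYTES = int(256e6)
flush_bytes = DEFAULT_FLUSH_BYTES
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(torch.cuda.current_device())
    # Use the larger of the 256 MB default and the reported L2 cache size.
    flush_bytes = max(DEFAULT_FLUSH_BYTES, getattr(props, "L2_cache_size", 0))
```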

@int3
Contributor

int3 commented Aug 9, 2024

@Jokeren Jokeren changed the title from "Use PyTorch device_interface in do_bench" to "[FRONTEND] Use PyTorch device_interface in do_bench" Aug 9, 2024
@Jokeren Jokeren merged commit 36789d2 into triton-lang:main Aug 9, 2024
@abulavin abulavin deleted the artemiyb/use-device-interface-in-do-bench branch August 12, 2024 11:34
galexite pushed a commit to graphcore/triton-fork that referenced this pull request Sep 16, 2024
bertmaher pushed a commit to bertmaher/triton that referenced this pull request Dec 10, 2024