
Conversation

@abulavin
Contributor

@abulavin abulavin commented Aug 5, 2024

Removed hard-coded references to "cuda" in the do_bench function of testing.py in favour of using PyTorch's DeviceInterface object

The existing do_bench hard-codes references to a cuda device type, making the implementation inflexible with respect to other device types in torch. PyTorch has a DeviceInterface class that provides a generic interface to different device types. Using the device interface, we can keep cuda as the default device in do_bench, while the implementation can now be used with other device types, e.g. out-of-tree backends.
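For context, the shape of the change is roughly the following. This is a hedged, simplified sketch rather than the exact diff: it assumes PyTorch's `get_interface_for_device` helper from `torch._dynamo.device_interface`, and it omits the warmup/repetition loop of the real `do_bench`.

```python
# Simplified sketch: do_bench takes a device_type instead of assuming "cuda"
# and goes through PyTorch's DeviceInterface for synchronization and timing.
import torch
from torch._dynamo.device_interface import get_interface_for_device


def do_bench(fn, device_type="cuda"):
    di = get_interface_for_device(device_type)

    fn()                # warm up / trigger compilation
    di.synchronize()

    # 256 MB int buffer used to evict the L2 cache between measurements,
    # mirroring what the current do_bench does with a hard-coded cuda tensor.
    cache = torch.empty(int(256e6 // 4), dtype=torch.int, device=device_type)

    start = di.Event(enable_timing=True)
    end = di.Event(enable_timing=True)
    cache.zero_()
    start.record()
    fn()
    end.record()
    di.synchronize()
    return start.elapsed_time(end)  # milliseconds
```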


The core Triton is a small number of people, and we receive many PRs (thank
you!). To help us review your code more quickly, if you are a new
contributor (less than 3 PRs merged) we ask that you complete the following
tasks and include the filled-out checklist in your PR description.

Complete the following tasks before sending your PR, and replace [ ] with
[x] to indicate you have done them.

- [x] I am not making a trivial change, such as fixing a typo in a comment.

- [x] I have written a PR description following these
  [rules](https://cbea.ms/git-commit/#why-not-how).

- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.

- Select one of the following.
  - [ ] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
  - [x] This PR does not need a test because existing tests in triton already exercise the implementation of do_bench.

- Select one of the following.
  - [x] I have not added any `lit` tests.
  - [ ] The `lit` tests I have added follow these
    [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices),
    including the "tests should be minimal" section. (Usually running Python code
    and using the instructions it generates is not minimal.)

@abulavin abulavin requested a review from ptillet as a code owner August 5, 2024 13:41
@Jokeren
Contributor

Jokeren commented Aug 5, 2024

@int3 I think it might be good to block these PRs until the new benchmark interface lands. Could you please coordinate the effort and give us a timeline for when you think it will be ready?

@int3
Contributor

int3 commented Aug 5, 2024

Thanks for the heads up @Jokeren! Yes, we should definitely coordinate this.

Could I get input on whether backwards compatibility is something worth accounting for, per #4417 (comment)? That's mostly what I've been waiting for.

I should be able to put up a PR for the new benchmark interface within a week.

@int3
Contributor

int3 commented Aug 5, 2024

@abulavin I was actually thinking about your approach initially, but I opted instead to make the autotuner generic and keep do_bench platform-specific, because things like do_bench_cudagraph don't fit cleanly into the do_bench code structure. I think having multiple do_bench_{platform} implementations isn't the worst thing; there'll be some code duplication, but that's better than having to shoehorn multiple implementations into the same template. What do you think?

@AnthonyBarbier

I think having multiple do_bench_{platform} implementations isn't the worst thing

I don't think that's a good idea: do_bench is called directly by pytorch in the Triton inductor backend and that part of the code is device agnostic. (The changes in this PR + updating the callsites in PyTorch to forward the device type is all we need to make it work with our out-of-tree + non-cuda device type torch backend).

If you start having do_bench_{platform} functions then pytorch would have to hardcode some kind of switch / case dispatch with a set list of supported devices.

I think what you described in the other thread would be more sensible: having a default (device-agnostic) version which works for all backends, with the option for backends to provide their own do_bench implementation if they want to do things differently. (But at least the default would work.)
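A minimal sketch of that shape, using hypothetical names (`register_benchmarker` and `get_benchmarker` do not exist in Triton; the generic `do_bench` from `triton.testing` is used as the fallback):

```python
from triton.testing import do_bench

# Hypothetical registry: a backend may register its own benchmarker,
# everything else falls back to the generic, device-agnostic do_bench.
_BENCHMARKERS = {}


def register_benchmarker(device_type):
    def decorator(fn):
        _BENCHMARKERS[device_type] = fn
        return fn
    return decorator


def get_benchmarker(device_type):
    return _BENCHMARKERS.get(device_type, do_bench)
```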

@nmacchioni
Contributor

I don't think that's a good idea: do_bench is called directly by pytorch in the Triton inductor backend and that part of the code is device agnostic. (The changes in this PR + updating the callsites in PyTorch to forward the device type is all we need to make it work with our out-of-tree + non-cuda device type torch backend).

If you start having do_bench_{platform} functions then pytorch would have to hardcode some kind of switch / case dispatch with a set list of supported devices.

PyTorch (Inductor specifically) is moving towards using its own implementation of do_bench, so this should not be an issue on that front: pytorch/pytorch#130926

@int3
Contributor

int3 commented Aug 6, 2024

The changes in this PR + updating the callsites in PyTorch to forward the device type is all we need to make it work with our out-of-tree + non-cuda device type torch backend

It would generally work, but making do_bench properly device-agnostic will require a bit more -- the cache size should be parameterizable. We could handle that with an extra field on DeviceInterface, I suppose, or a flush_cache() method. But the fast_flush option is also device-specific (see #840), and we don't have a way of abstracting that out with DeviceInterface since the interface is stateless. Now fast_flush could possibly be removed as a parameter, it seems like it was there for migration purposes, but I could imagine that we would want to pass in other device-specific parameters in the future.
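To illustrate the two options being weighed here (both hypothetical -- neither an `l2_cache_size` field nor a `flush_cache()` method exists on PyTorch's `DeviceInterface` today):

```python
import torch
from torch._dynamo.device_interface import DeviceInterface


class SomeBackendInterface(DeviceInterface):
    # Option 1: an extra class-level field that do_bench could read.
    l2_cache_size = 256 * 1024 * 1024  # hypothetical attribute

    # Option 2: a method that hides how the cache is flushed on this backend.
    @staticmethod
    def flush_cache():
        buf = torch.empty(SomeBackendInterface.l2_cache_size // 4,
                          dtype=torch.int, device="cpu")  # placeholder device
        buf.zero_()
```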

I think what you described in the other thread would be more sensible

I have been describing the same thing in both threads, actually -- each backend would have its own default do_bench implementation, and the autotuner would pick the appropriate one.

@nmacchioni
Contributor

the cache size should be parameterizable

Device-aware cache size for L2 flushing is a good (but dangerous) idea; we are adding L2 cache size as an attribute in torch.cuda.get_device_properties (the name is confusing, but this is not Nvidia-specific). I say this is dangerous because, depending on your benchmarker, changing the cache flush size can and probably will dramatically affect the benchmarking results (i.e. you can end up in CPU-GPU lock-step, and then your benchmarking results are completely false).
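For reference, once that attribute lands it could be queried along these lines (a hedged example; the exact attribute name may differ across PyTorch versions, hence the `getattr` fallback):

```python
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(torch.cuda.current_device())
    # Fall back to None if this PyTorch build does not expose the attribute yet.
    l2_bytes = getattr(props, "L2_cache_size", None)
    print("L2 cache size (bytes):", l2_bytes)
```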

Now fast_flush could possibly be removed as a parameter, it seems like it was there for migration purposes,

I agree with this; I can't see any situation in which fast_flush=False is better.

but I could imagine that we would want to pass in other device-specific parameters in the future.

Shouldn't we figure this out first? In my opinion we should first have concrete evidence that, for example, do_bench is slow, inaccurate, or unusable on AMD devices before we go through the effort (and added complexity) of having backends select (or define) their benchmarker. And, if this concrete evidence does exist, we should first try to use that new knowledge to improve the generic do_bench before going down the per-device-benchmarker route.

@int3
Contributor

int3 commented Aug 6, 2024

Shouldn't we figure this out first?

I mean, there's value in making things easily extensible, even if you're not using it at the moment. I don't think there's much added complexity here; it just makes the code a little bit less DRY, which IMO is not a bad thing in a quickly evolving codebase.

But if we can make the existing do_bench properly generic by removing the fast_flush parameter & making the cache parameterizable, I think having it as the default implementation is clean enough. We should still make do_bench itself a parameter to the autotuner, so we can easily add implementations that use extra parameters (or fewer, in the case of cudagraph benchmarking.) And if we ever need something like fast_flush in the future, we should implement that by creating a copy of do_bench_{device}, rather than shoehorning it into the DeviceInterface. I think having DeviceInterface methods that really only get implemented by one device & are no-ops on all others is pretty ugly, and I want to minimize that.
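A hypothetical sketch of what "do_bench as a parameter to the autotuner" could look like internally (the `benchmarker` name is illustrative only, not an existing argument of Triton's autotuner):

```python
from triton.testing import do_bench


class Autotuner:
    def __init__(self, configs, benchmarker=do_bench):
        self.configs = configs
        # The injected benchmarker can be do_bench, do_bench_cudagraph, or a
        # backend-specific implementation with its own extra parameters.
        self.benchmarker = benchmarker

    def _time(self, kernel_call):
        # Benchmark one candidate config with whatever benchmarker was injected.
        return self.benchmarker(kernel_call, warmup=25, rep=100)
```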

@int3
Contributor

int3 commented Aug 7, 2024

So to summarize the discussion / next steps:

  1. I think we should land this; it's a step in the right direction (cc @abulavin @Jokeren)
  2. We should remove the fast_flush option (pretty trivial)
  3. We should extend DeviceInterface in PyTorch to support retrieving the last level cache size, then use it in do_bench
  4. Changes to the autotuner will proceed as outlined in Proposal for device-agnostic autotuner #4417

These changes can all proceed concurrently; merge conflicts shouldn't be too hard to resolve. And I think landing things incrementally will make development easier, rather than blocking everything on the new autotuner interface. @Jokeren how does that sound?

@nmacchioni
Contributor

  1. We should extend DeviceInterface in PyTorch to support retrieving the last level cache size, then use it in do_bench

We should be very cautious changing the size of the cache flush. I've attempted this before, and the end result was that we could not reliably obtain similar benchmarking results using exact L2 cache size vs. the default 256MB. In experimental runs this did significantly decrease E2E performance of many models.

@int3
Contributor

int3 commented Aug 7, 2024

I see... that's unfortunate :/ I guess we should make cache flush size a parameter then and default it to 256MB. No need to modify the DeviceInterface after all.
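Something along these lines (a sketch only; the `flush_cache_size` parameter name is illustrative, not the final API):

```python
import torch


def do_bench(fn, device_type="cuda", flush_cache_size=int(256e6)):
    # Buffer sized by the caller (default 256 MB) used to evict the L2 cache
    # between timed runs; the rest of do_bench would stay unchanged.
    cache = torch.empty(flush_cache_size // 4, dtype=torch.int,
                        device=device_type)
    cache.zero_()
    ...
```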

@Jokeren
Contributor

Jokeren commented Aug 7, 2024

I'm ok with merging

@nmacchioni
Contributor

I see... that's unfortunate :/ I guess we should make cache flush size a parameter then and default it to 256MB. No need to modify the DeviceInterface after all.

BTW pytorch/pytorch#132819 just merged, so we could always make this "more" device agnostic by taking max(256MB, L2 cache size), but this isn't really needed just yet
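If that ever becomes useful, the max(256MB, L2 cache size) idea might look roughly like this (hedged; the attribute name from the PyTorch PR above is guarded with `getattr`):

```python
import torch

DEFAULT_FLUSH_BYTES = int(256e6)
flush_bytes = DEFAULT_FLUSH_BYTES
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(torch.cuda.current_device())
    # Use the larger of the 256 MB default and the reported L2 cache size.
    flush_bytes = max(DEFAULT_FLUSH_BYTES, getattr(props, "L2_cache_size", 0))
```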

@int3
Contributor

int3 commented Aug 9, 2024

@Jokeren Jokeren changed the title from "Use PyTorch device_interface in do_bench" to "[FRONTEND] Use PyTorch device_interface in do_bench" Aug 9, 2024
@Jokeren Jokeren merged commit 36789d2 into triton-lang:main Aug 9, 2024
@abulavin abulavin deleted the artemiyb/use-device-interface-in-do-bench branch August 12, 2024 11:34
galexite pushed a commit to graphcore/triton-fork that referenced this pull request Sep 16, 2024
bertmaher pushed a commit to bertmaher/triton that referenced this pull request Dec 10, 2024