[FRONTEND] Use PyTorch device_interface in do_bench #4470
Conversation
@int3 I think it might be good to block these PRs until the new benchmark interface is in place. Could you please coordinate the effort and give us a timeline for when you think the new benchmark interface will be ready?
Thanks for the heads up @Jokeren! Yes, we should definitely coordinate this. Could I get input on whether backwards compatibility is something worth accounting for, per #4417 (comment)? That's mostly what I've been waiting for. I should be able to put up a PR for the new benchmark interface within a week.
@abulavin I actually was thinking about your approach initially, but I opted to instead make the autotuner generic and keep …
I don't think that's a good idea: if you start having … I think what you described in the other thread would be more sensible: having a default (device-agnostic) version which works for all backends, with the option for the backends to provide their own …
PyTorch (Inductor specifically) is moving towards using its own implementation of do_bench, so this should not be an issue on that front: pytorch/pytorch#130926
It would generally work, but making …
I have been describing the same thing in both threads, actually -- each backend would have its own default …
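A minimal sketch of the dispatch pattern being discussed here, under the assumption that backends could expose their own benchmarker; `get_benchmarker` and the per-backend `do_bench` hook are hypothetical names, not an existing Triton API, and the generic `triton.testing.do_bench` is used as the fallback:

```python
# Hypothetical sketch only: `backend.do_bench` is an assumed hook, not an
# existing Triton API. The idea is that the autotuner asks the active backend
# for a benchmarker and falls back to the generic, device-agnostic do_bench.
from triton.testing import do_bench as default_do_bench


def get_benchmarker(backend):
    # Use the backend-specific benchmarker if the backend provides one,
    # otherwise fall back to the shared default implementation.
    return getattr(backend, "do_bench", default_do_bench)
```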
Device-aware cache size for L2 flushing is a good (but dangerous) idea; we are adding L2 cache size as an attribute in torch.cuda.get_device_properties (the name is confusing, but this is not Nvidia-specific). I say this is dangerous because, depending on your benchmarker, changing the cache flush size can and probably will dramatically affect the benchmarking results (i.e. you can enter CPU-GPU lock step, and now your benchmarking results are completely false).
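For reference, a small sketch of how such a property could be read once it is exposed; the attribute name `L2_cache_size` is an assumption here and may not exist in older PyTorch builds:

```python
import torch

# Sketch only: assumes the device properties expose an `L2_cache_size`
# attribute (in bytes); fall back gracefully if the attribute is missing.
props = torch.cuda.get_device_properties(torch.cuda.current_device())
l2_bytes = getattr(props, "L2_cache_size", None)
if l2_bytes is not None:
    print(f"L2 cache size: {l2_bytes / 2**20:.1f} MiB")
```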
I agree with this; I can't see any situation in which …
Shouldn't we figure this out first? In my opinion, we should first have concrete evidence that, for example, …
I mean, there's value in making things easily extensible, even if you're not using it at the moment. I don't think there's much added complexity here; it just makes the code a little bit less DRY, which IMO is not a bad thing in a quickly evolving codebase. But if we can make the existing …
So, to summarize the discussion / next steps: …
These changes can all proceed concurrently; merge conflicts shouldn't be too hard to resolve. And I think landing things incrementally will make development easier, rather than blocking everything on the new autotuner interface. @Jokeren, how does that sound?
We should be very cautious about changing the size of the cache flush. I've attempted this before, and the end result was that we could not reliably obtain similar benchmarking results using the exact L2 cache size vs. the default 256MB. In experimental runs this significantly decreased the E2E performance of many models.
I see... that's unfortunate :/ I guess we should make cache flush size a parameter then and default it to 256MB. No need to modify the DeviceInterface after all.
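A rough sketch of what that could look like; the `cache_flush_size` parameter name, the helper, and its default are illustrative only, though the buffer-zeroing pattern mirrors the usual L2-flush approach:

```python
import torch

DEFAULT_CACHE_FLUSH_SIZE = 256 * 1024 * 1024  # keep the current 256MB default


def make_cache_flusher(cache_flush_size=DEFAULT_CACHE_FLUSH_SIZE, device="cuda"):
    # Allocate a buffer large enough to evict the L2 cache; zeroing it between
    # timed runs forces the benchmarked kernel to start from a cold cache.
    cache = torch.empty(cache_flush_size, dtype=torch.int8, device=device)
    return cache.zero_
```

A benchmark loop would call the returned flusher before each timed iteration; keeping 256MB as the default preserves today's behaviour.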
I'm ok with merging …
BTW pytorch/pytorch#132819 just merged, so we could always make this "more" device-agnostic by taking max(256MB, L2 cache size), but this isn't really needed just yet.
For cross-referencing: …
Removed hard-coded references to "cuda" in the do_bench function of testing.py in favour of using PyTorch's `DeviceInterface` object.

The existing do_bench hard-codes references to a `cuda` device type, making the implementation inflexible with respect to other device types in torch. PyTorch has a `DeviceInterface` class that is a generic interface to different device types. Using the device interface, we can continue using the `cuda` device by default in do_bench, but the implementation is now flexible enough to be used with other device types, e.g. out-of-tree backends.

---

The core Triton is a small number of people, and we receive many PRs (thank you!). To help us review your code more quickly, **if you are a new contributor (less than 3 PRs merged) we ask that you complete the following tasks and include the filled-out checklist in your PR description.**

Complete the following tasks before sending your PR, and replace `[ ]` with `[x]` to indicate you have done them.

- [x] I am not making a trivial change, such as fixing a typo in a comment.
- [x] I have written a PR description following these [rules](https://cbea.ms/git-commit/#why-not-how).
- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.
- Select one of the following.
  - [ ] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
  - [x] This PR does not need a test because `existing tests in triton exist that exercise the implementation of do_bench`.
- Select one of the following.
  - [x] I have not added any `lit` tests.
  - [ ] The `lit` tests I have added follow these [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices), including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.)
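As a rough illustration of the change described above, a minimal timing helper built on PyTorch's `DeviceInterface` might look like the following; `get_interface_for_device` lives in `torch._dynamo.device_interface`, and the loop here is a simplified stand-in for the real do_bench rather than the actual implementation:

```python
import torch
from torch._dynamo.device_interface import get_interface_for_device


def bench_once(fn, device_type="cuda"):
    """Time one call of `fn` via the generic DeviceInterface (sketch only)."""
    di = get_interface_for_device(device_type)  # e.g. the CUDA interface for "cuda"
    start = di.Event(enable_timing=True)
    end = di.Event(enable_timing=True)

    fn()               # warm up
    di.synchronize()   # device-agnostic replacement for torch.cuda.synchronize()

    start.record()
    fn()
    end.record()
    di.synchronize()
    return start.elapsed_time(end)  # milliseconds
```

The key point is that the hard-coded `torch.cuda.*` calls are replaced by the interface looked up from the device type, so out-of-tree backends that register their own `DeviceInterface` get the same benchmarking path for free.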