forked from pytorch/FBGEMM
Ifu 2022 09 15 #18
Merged
Conversation
On an AMD Instinct MI250 system (optimization in progress), this increases performance by ~10% for fp32 weights, somewhat less for fp16 weights, across various output types.
Summary: Pull Request resolved: pytorch#1257 Add x86 intrinsics guard Reviewed By: pallab-zz, jianyuh Differential Revision: D38968961 fbshipit-source-id: 462eb5c31f5fba919c0772eb22d93e18aca61f7b
Summary: Pull Request resolved: pytorch#1258 Fix several things on cache flushing, time recording, etc. Reviewed By: jspark1105 Differential Revision: D39007727 fbshipit-source-id: cccc13b794c0b3761a8a04c331b30eb775a73471
Summary: Pull Request resolved: pytorch#1260 This diff just adds automatic fix hints for facebook-cuda-safe-api-call-check and facebook-cuda-safe-kernel-call-check from clang-tidy Reviewed By: r-barnes Differential Revision: D38945713 fbshipit-source-id: 11c676a71bed8d23bc414c8498141c7afa94b033
Summary: Pull Request resolved: pytorch#1262 PyTorch Tensor should be passed by const reference by default; copying it incurs an atomic reference count bump. I understand that we're launching GPU kernels so this is unlikely to improve latency, but removing waste seems better than not removing it. Reviewed By: ajtulloch, jspark1105 Differential Revision: D39040322 fbshipit-source-id: 3941ef27e02002a57e9cb2811db14d8bc471132d
… and optimize for ROCm (pytorch#1240) Summary: Make weihanmines's PR #13 upstreamable. sryap, would you please review the PR and consider converting it to a draft? Thank you. Pull Request resolved: pytorch#1240 Reviewed By: sryap Differential Revision: D38507621 Pulled By: shintaro-iwasaki fbshipit-source-id: 5b4532c0e79ce49a2f93c2a455a6392a1c7c2f16
Summary: Pull Request resolved: pytorch#1264 - Update the metric rule dependency so it can be used outside FBGEMM. - Update the accumulator type for AUC Reviewed By: jspark1105 Differential Revision: D39090840 fbshipit-source-id: 0556c53f7ffedac68f6475a487e9aa8e9c28d4e4
Summary: The input length overflows a 32-bit int when it exceeds 2,147,483,647. The PyTorch side reported an issue with FBGEMM Quantize (see pytorch/pytorch#80501) where the user's input length was 5,117,410,688, but FBGEMM uses **int** to represent the input length, so a wrong value is computed even in the single-threaded case. This PR fixes only the two APIs used on the PyTorch side, **Quantize** and **FindMinMax**; other functions may also need to be updated to use a higher-precision dtype. Pull Request resolved: pytorch#1261 Reviewed By: jianyuh Differential Revision: D39089686 Pulled By: jspark1105 fbshipit-source-id: 9623bbb20bdba0f98040a1c8143e4bc552d2a6cb
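The failure mode described above can be sketched in a few lines of host C++. These helpers (`fits_in_int32`, `truncate_to_int32`) are hypothetical illustrations, not FBGEMM's actual API: they only show that a length beyond `INT32_MAX` is silently wrapped when stored in a 32-bit `int`.

```cpp
#include <cstdint>

// Hypothetical helpers illustrating the overflow: any length above
// INT32_MAX cannot be represented by a 32-bit `int`.
bool fits_in_int32(std::int64_t len) {
  return len >= INT32_MIN && len <= INT32_MAX;
}

// What the buggy code effectively computed: the length reduced modulo 2^32.
std::int64_t truncate_to_int32(std::int64_t len) {
  return static_cast<std::int64_t>(static_cast<std::int32_t>(len));
}
```

For the reported input length of 5,117,410,688, truncation yields 822,443,392 (5,117,410,688 − 2^32), which is why a 64-bit length type is required.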
Summary: Add Mx2, Mx4, 2xN, and 4xN specific transposes on AVX512 to improve transpose performance for those shapes. * When the shape is Mx2 or Mx4 and N == ld_src, the Mx2 and Mx4 transposes achieve higher performance. * When the shape is 2xN or 4xN and M == ld_dst, the 2xN and 4xN transposes achieve higher performance. Pull Request resolved: pytorch#1177 Reviewed By: jianyuh Differential Revision: D37638879 Pulled By: jiyuanzFB fbshipit-source-id: e9fba362fc1c86ea96cb5cfe35408ab4488b0750
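To clarify the leading-dimension condition above: when N == ld_src the source rows are contiguous in memory, so a specialized Mx2/Mx4 kernel can treat the whole matrix as one contiguous stream rather than loading row by row. This is a plain scalar reference for the strided-transpose semantics, not the AVX512 kernel itself.

```cpp
#include <cstddef>

// Reference semantics of a strided transpose: src is M x N with row
// stride ld_src, dst is N x M with row stride ld_dst.
void transpose_ref(std::size_t M, std::size_t N,
                   const float* src, std::size_t ld_src,
                   float* dst, std::size_t ld_dst) {
  for (std::size_t i = 0; i < M; ++i)
    for (std::size_t j = 0; j < N; ++j)
      dst[j * ld_dst + i] = src[i * ld_src + j];
}
```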
…ch#1265) Summary: Pull Request resolved: pytorch#1265 Similar idea to D34830535 but w/o H<->D synchronization. Use at most 16 times of SM thread blocks to avoid unnecessary work. Make lru/lfu_cache_find_uncached_kernel work when gridDim.x * blockDim.y < N_unique Reviewed By: suphoff Differential Revision: D39086495 fbshipit-source-id: 32e83ad92986ddf6481f90b39f65f1e131076276
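The grid-sizing idea above ("use at most 16 times of SM thread blocks") can be sketched as a small host-side helper. The function name and parameters here are hypothetical, not FBGEMM's actual code: the point is that the block count is capped at a small multiple of the SM count so surplus blocks do no unnecessary work, with each block grid-striding over the remaining items.

```cpp
#include <algorithm>

// Hypothetical helper: cap the grid at max_blocks_per_sm * num_sms
// thread blocks, each covering block_dim_y items per grid-stride step.
int cache_kernel_grid_size(long long n_unique, int block_dim_y,
                           int num_sms, int max_blocks_per_sm = 16) {
  const long long needed = (n_unique + block_dim_y - 1) / block_dim_y;
  const long long cap =
      static_cast<long long>(num_sms) * max_blocks_per_sm;
  return static_cast<int>(std::min(needed, cap));
}
```

With this cap, gridDim.x * blockDim.y can be smaller than N_unique, which is exactly why the kernel must grid-stride rather than assume one block row per unique index.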
…pytorch#1268) Summary: Pull Request resolved: pytorch#1268 Follow up of D39086495 applying to other cache kernels Reviewed By: jianyuh Differential Revision: D39124243 fbshipit-source-id: e0cdc8436c1c0036560eb5a9b53c2209d404e6cf
…rch#1269) Summary: Pull Request resolved: pytorch#1269 The sample inputs are loaded to CPU, and for merge, we should still be able to accept sample inputs on CPU. For mpe, if all the inputs are on CPU, the CPU op is dispatched. So add this corner case to the mpe CPU op. Reviewed By: jianyuh Differential Revision: D39108861 fbshipit-source-id: a0ff132f07f2a5b5583aa26ff227ebda61ae7833
Summary: Pull Request resolved: pytorch#1259 This operator is similar to padding_fused_tbe_input_combine_cpu, but pads lengths with zeros instead. Reviewed By: jianyuh Differential Revision: D39008642 fbshipit-source-id: 51075a80868742353ff22ad763354174df3a5097
Summary: Pull Request resolved: pytorch#1270 Reviewed By: xiaosun86 Differential Revision: D39124669 fbshipit-source-id: b6c3abea10613149a05f406c4c91c23009896a64
Summary: Pull Request resolved: pytorch#1272 Copying zipf generation from cupy. cupy is not in third_party yet; I quickly tried adding it but ran into a couple of problems, so I decided to just copy the zipf part. We may also want to consider adding zipf generation to PyTorch. TBE benchmarking is significantly faster, so this should help TBE optimization. TODO: more rigorous correctness check Reviewed By: ajtulloch Differential Revision: D39093435 fbshipit-source-id: 7aad735d4ef6e79d9d19114cb22fdad80d69b5fb
…torch#1276) Summary: A bracket is missing when using the `skipIfRocm` decorator in `uvm_test.py`, causing the tests to be skipped on CUDA. Pull Request resolved: pytorch#1276 Reviewed By: jspark1105 Differential Revision: D39263693 Pulled By: suphoff fbshipit-source-id: 2787d44d4885e830ec505f75772e172abeec2bc7
Summary: Pull Request resolved: pytorch#1271 Build FBGEMM for aarch64 using simde Reviewed By: pallab-zz, suphoff Differential Revision: D39175324 fbshipit-source-id: 7331fe5d43a9ff455a9348451c790ad2322187a7
Summary: Pull Request resolved: pytorch#1284 TODO: CPU implementation for backward pass + autograd function. Reviewed By: xiaosun86 Differential Revision: D39124668 fbshipit-source-id: 92b0c201654af0e452c118a55720ad539214fdf7
Summary: Pull Request resolved: pytorch#1287 Before this diff, HIP does 4 sequential scalar loads for the half input in TBE's Vec4T. This diff does a vector load for 4 halves. Reviewed By: jspark1105 Differential Revision: D39267283 fbshipit-source-id: 657b3ce7ea771a664b24450bb24da9a05535ddc2
Summary: Pull Request resolved: pytorch#1286 Reviewed By: jspark1105, mjanderson09 Differential Revision: D39335880 fbshipit-source-id: 885e3e300239be775eb6c760e5ba28af832986d9
Summary: Pull Request resolved: pytorch#1292 Reviewed By: jspark1105 Differential Revision: D39408174 fbshipit-source-id: de4481ee6a96200df3d147dc4ba8bc98faf050f9
Summary: Pull Request resolved: pytorch#1294 Reviewed By: jspark1105 Differential Revision: D39408601 fbshipit-source-id: 187c82aa6db475827b359298ff9639e9f6b784e2
pytorch#1293) Summary: Pull Request resolved: pytorch#1293 Reviewed By: jspark1105 Differential Revision: D39408600 fbshipit-source-id: 26c6c1c430764e0bb4ec5276c956c08ea4933ab5
Summary: Pull Request resolved: pytorch#1295 The LXU cache logic is in the critical path of the forward TBE kernel. Even when the LXU cache is not used, the kernel still checks whether a row should be fetched from the cache or HBM at runtime. The branching logic should be harmless for the memory (subsystem) bound case. However, it could add significant overhead if TBE is conditional bound. (We have observed that FP16 weight type is generally compute or conditional bound, while FP32 weight type is memory bound.) This diff adds a static conditional in the forward TBE kernel to enable/disable the LXU cache code path at compile time. At runtime, the host selects the kernel with/without cache enabled based on whether the LXU cache is present. This diff also moves the conditional outside the D loop. It should add a small benefit for the large D cases when cache is used. Reviewed By: jspark1105 Differential Revision: D39353035 fbshipit-source-id: ee34e04d78fc9b4098923a7600e6a3b9e6ce388f
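The compile-time specialization described above can be sketched with a `bool` template parameter. The names here (`row_location`, `lookup`) are hypothetical and the real kernel is CUDA; the sketch only shows how a cache-disabled instantiation eliminates the per-row branch entirely, while the host picks the right instantiation once at launch time.

```cpp
#include <cstdint>

// Device-side idea: with kUseCache == false, the cache check is dead
// code that the compiler removes from that instantiation.
template <bool kUseCache>
std::int64_t row_location(std::int64_t row, std::int64_t cache_slot) {
  if (kUseCache && cache_slot >= 0) {
    return cache_slot;  // row resides in the LXU cache
  }
  return row;           // fetch from HBM / main embedding storage
}

// Host-side dispatch: select the kernel variant once, based on whether
// an LXU cache is present, instead of branching per element at runtime.
std::int64_t lookup(bool cache_present, std::int64_t row,
                    std::int64_t cache_slot) {
  return cache_present ? row_location<true>(row, cache_slot)
                       : row_location<false>(row, cache_slot);
}
```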
Summary: Pull Request resolved: pytorch#1277 A similar idea to D37739239 (pytorch@fbd89e8) applied to pooled case. TODO * ~~We can try this for other places like bwd.~~ * ~~Handle AMD~~ * Maybe we want to set kThreadGroupSize based on common/average D not based on max_D * Extend to inference quantized emb fwd * Check code size Reviewed By: sryap Differential Revision: D39247288 fbshipit-source-id: 38af2dabac6652ac347d4b5ae185beed2b2c50ba
…h#1278) Summary: Pull Request resolved: pytorch#1278 Similar to D39247288. Applied to bwd warp_per_row Reviewed By: jianyuh Differential Revision: D39247622 fbshipit-source-id: c6fe9c7974acbac8a8558ddbb87e60dba6917a85
Summary: Pull Request resolved: pytorch#1296 ITT Reviewed By: brad-mengchi Differential Revision: D39413020 fbshipit-source-id: c8ec25fbb0ec86d6e4ed3624140febfb67dc7446
Summary: Pull Request resolved: pytorch#1273 ITT Reviewed By: jspark1105 Differential Revision: D39185322 fbshipit-source-id: 0e3ffd6d7b58115325a02e92ab2aec590154c099
…in test-infra (pytorch#1289) Summary: [As communicated internally](https://fb.workplace.com/groups/pytorch.dev/permalink/1189033455008466/), all repositories now rely on a single scale-config.yml that is on pytorch/test-infra. As such this file is no longer used and to avoid confusion it is better to remove it. Here is a short summary of the announcement: > [As previously announced](https://fb.workplace.com/groups/pytorch.dev/permalink/1173939633184515/), the scale-config.yml file in each repository for the pytorch/ organization is now not being used to control GHA runners. On its place, [the file with same path on test-infra](https://github.com/pytorch/test-infra/blob/main/.github/scale-config.yml) repository is controlling and enabling runners. If you feel the need for new runners, or change settings for current ones, feel free to submit a PR with required changes on the former file. Pull Request resolved: pytorch#1289 Reviewed By: DanilBaibak Differential Revision: D39424893 Pulled By: jeanschmidt fbshipit-source-id: ecc4a3d74a77ddec3d484111618979fdc59705f7
Summary: Pull Request resolved: pytorch#1279 Similar to D39247288 (pytorch@411e550). Applied to bwd warp_per_row Reviewed By: sryap Differential Revision: D39262119 fbshipit-source-id: 111269dc6c92328eb5252b504aca68452618315b
Summary: Pull Request resolved: pytorch#1282 To prepare for D39266484 Reviewed By: jianyuh Differential Revision: D39266468 fbshipit-source-id: e81e272daaa4259b2444f8dcf2686614cf8d81e2
Summary: Pull Request resolved: pytorch#1299 This PR fixes FBGEMM compilation for AMD GPUs. Because `kThreadGroupSize` is 64 for AMD GPUs, we cannot compile `shfl_sync_mask` correctly (1 << 64 causes an overflow). Reviewed By: sryap Differential Revision: D39452453 fbshipit-source-id: a5500c285f8ecfb8da744afdcb09a9f97668df2f
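The overflow above comes from the shift count equaling the operand width: shifting a 64-bit value left by 64 is undefined behavior in C++, so the mask for a 64-lane AMD wavefront cannot be built as `(1 << kThreadGroupSize) - 1`. A hedged sketch of one way to special-case it (the helper name is hypothetical, not FBGEMM's actual code):

```cpp
#include <cstdint>

// Full-group shuffle mask: special-case the 64-lane group, since
// (uint64_t{1} << 64) is undefined behavior.
constexpr std::uint64_t shfl_sync_mask(int thread_group_size) {
  return thread_group_size >= 64
             ? ~std::uint64_t{0}                          // all 64 lanes
             : (std::uint64_t{1} << thread_group_size) - 1;
}
```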
…and fix up comments. On an AMD Instinct MI250 system (optimization in progress), this increases performance by ~10% for fp32 weights, somewhat less for fp16 weights, for various types of output.
HIP: improve perf for quantized embedding forward kernel
Summary: Pull Request resolved: pytorch#1303 ITT Reviewed By: jspark1105 Differential Revision: D39457909 fbshipit-source-id: 9a2bb3a6462d123c175db2dd8b6bda546487cc5e
Summary: Pull Request resolved: pytorch#1316 https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/caches-and-self-modifying-code Reviewed By: pallab-zz Differential Revision: D39420206 fbshipit-source-id: 6b92a829a0865cf62251c9757debc3ca2dff14ed
Summary: Pull Request resolved: pytorch#1300 Update all Vec4T structs to use vector load for FP16 (mainly for HIP) Reviewed By: mjanderson09 Differential Revision: D39454067 fbshipit-source-id: 9441be236b6fd2ad45a45e5ebbd2cdc83b12f308
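The vectorized half load in the two Vec4T commits above can be sketched on the host. The types here are hypothetical stand-ins (the real code uses HIP's `__half` and GPU vector types): the point is that four 16-bit halves are fetched as one 64-bit word instead of four scalar loads.

```cpp
#include <cstdint>
#include <cstring>

// Stand-in for a 4-wide half vector.
struct Half4 {
  std::uint16_t h[4];
};

// One 64-bit load covering all four 16-bit halves, instead of four
// sequential scalar loads.
Half4 load_half4(const std::uint16_t* src) {
  std::uint64_t packed;
  std::memcpy(&packed, src, sizeof(packed));
  Half4 out;
  std::memcpy(out.h, &packed, sizeof(out.h));
  return out;
}
```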
…ch#1283) Summary: Pull Request resolved: pytorch#1283 Introduce a new group of runs called really_long_runs and use multiple thread blocks on them. ~~TODO: check if the very long run is covered in unit tests~~ Reviewed By: jianyuh Differential Revision: D39266484 fbshipit-source-id: 37e70639b35c08816926e71812635869f2788da6
Summary: Pull Request resolved: pytorch#1317 Just for simplification. Reviewed By: jspark1105 Differential Revision: D39454460 fbshipit-source-id: 78096327ffee239bb85ba2207069060a16f1c23d
Summary: Pull Request resolved: pytorch#1249 Use WeightDecayMode.NONE as default instead of L2 to make sure weight decay is not unintentionally turned on when `weight_decay` is set to non-zero by other default values. Reviewed By: jianyuh Differential Revision: D38853048 fbshipit-source-id: 2d019cf332507fbb2758384811e652676957a494
Summary: In the pull request, I am committing changes to add documentation to fbgemm_gpu module. - modified: fbgemm_gpu/docs/source/python-api/index.rst - modified: fbgemm_gpu/fbgemm_gpu/__init__.py - added: fbgemm_gpu/fbgemm_gpu/_fbgemm_gpu_docs.py Pull Request resolved: pytorch#1290 Reviewed By: suphoff Differential Revision: D39394632 Pulled By: gvavvari fbshipit-source-id: e4544d57935a75f636b2a4451a7acdbc33b4e573
Summary: Pull Request resolved: pytorch#1323 Before this diff, the TBE benchmark profiled and reported both forward and backward performance together when profiling backward. This diff updates the TBE benchmark to profile only backward. Reviewed By: jspark1105 Differential Revision: D39491994 fbshipit-source-id: ade42fe8ada80324f314c89d6c320ca261315840
amathews-amd approved these changes on Sep 15, 2022
LGTM
liligwu pushed a commit that referenced this pull request on Feb 8, 2023
20220915_tests_log.txt