
Ifu 2022 09 15 #18

Merged
merged 45 commits into main on Sep 15, 2022

Conversation

@liligwu (Collaborator) commented Sep 15, 2022

root and others added 30 commits August 16, 2022 00:20
On an AMD Instinct MI250 system (optimization in progress), this increases performance by ~10% for fp32 weights, and somewhat less for fp16 weights, across various output types.
Summary:
Pull Request resolved: pytorch#1257

Add x86 intrinsics guard

Reviewed By: pallab-zz, jianyuh

Differential Revision: D38968961

fbshipit-source-id: 462eb5c31f5fba919c0772eb22d93e18aca61f7b
Summary:
Pull Request resolved: pytorch#1258

Fix several things on cache flushing, time recording, etc.

Reviewed By: jspark1105

Differential Revision: D39007727

fbshipit-source-id: cccc13b794c0b3761a8a04c331b30eb775a73471
Summary:
Pull Request resolved: pytorch#1260

This diff just adds automatic fix hints for facebook-cuda-safe-api-call-check and facebook-cuda-safe-kernel-call-check from clang-tidy

Reviewed By: r-barnes

Differential Revision: D38945713

fbshipit-source-id: 11c676a71bed8d23bc414c8498141c7afa94b033
Summary:
Pull Request resolved: pytorch#1262

PyTorch Tensor should be passed by const reference by default; copying it incurs an atomic reference count bump.

I understand that we're launching GPU kernels so this is unlikely to improve latency, but removing waste seems better than not removing it.
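To illustrate the point, here is a minimal sketch (hypothetical function names, assuming ATen is available, not FBGEMM's actual signatures): the by-value parameter copies the tensor handle and atomically bumps its refcount on every call, while the const reference does not.

```cpp
#include <ATen/ATen.h>

// Pass-by-value: constructing the parameter copies the tensor handle and
// atomically increments the underlying TensorImpl's reference count.
int64_t numel_by_value(at::Tensor t) { return t.numel(); }

// Pass-by-const-reference: no copy, no refcount traffic; the callee can still
// read the tensor, and can make an explicit copy if it truly needs ownership.
int64_t numel_by_const_ref(const at::Tensor& t) { return t.numel(); }

int main() {
  at::Tensor t = at::zeros({8});
  return static_cast<int>(numel_by_value(t) - numel_by_const_ref(t));  // 0
}
```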

Reviewed By: ajtulloch, jspark1105

Differential Revision: D39040322

fbshipit-source-id: 3941ef27e02002a57e9cb2811db14d8bc471132d
… and optimize for ROCm (pytorch#1240)

Summary:
Make weihanmines's PR #13 upstreamable.
sryap, would you please review the PR and consider converting it to a draft? Thank you.

Pull Request resolved: pytorch#1240

Reviewed By: sryap

Differential Revision: D38507621

Pulled By: shintaro-iwasaki

fbshipit-source-id: 5b4532c0e79ce49a2f93c2a455a6392a1c7c2f16
Summary:
Pull Request resolved: pytorch#1264

- Update the metric rule dependency for using it outside FBGEMM.
- Update the accumulator type for AUC

Reviewed By: jspark1105

Differential Revision: D39090840

fbshipit-source-id: 0556c53f7ffedac68f6475a487e9aa8e9c28d4e4
Summary:
There is a data overflow for the int type when the input length > 2,147,483,647. The PyTorch side has reported an issue for code using FBGEMM Quantize (see pytorch/pytorch#80501); the user's example has an input length of 5,117,410,688, but the FBGEMM side uses **int** to represent the input length, so a wrong value is produced even in the single-thread case.

This PR fixes only the two **Quantize** and **FindMinMax** APIs that are used on the PyTorch side; other functions may also need to be updated to use a higher-precision dtype.
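A minimal sketch of the failure mode (illustrative only, not the FBGEMM signatures): a 64-bit length silently truncates when narrowed to int, so any loop bound derived from it is wrong.

```cpp
#include <cstdint>
#include <cstdio>

int main() {
  std::int64_t len = 5117410688LL;       // input length from the reported issue
  int narrowed = static_cast<int>(len);  // truncates to 822443392 on typical platforms
  std::printf("int64_t len = %lld, int len = %d\n",
              static_cast<long long>(len), narrowed);
  // A loop such as `for (int i = 0; i < narrowed; ++i)` would then process far
  // fewer elements than requested, so the length parameter must be a 64-bit type.
  return 0;
}
```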

Pull Request resolved: pytorch#1261

Reviewed By: jianyuh

Differential Revision: D39089686

Pulled By: jspark1105

fbshipit-source-id: 9623bbb20bdba0f98040a1c8143e4bc552d2a6cb
Summary:
Add shape-specific Mx2, Mx4, 2xN, and 4xN transposes on AVX-512 to improve transpose performance for those shapes.
* When the shape is Mx2 or Mx4 and N == ld_src, the Mx2 and Mx4 transposes achieve higher performance.
* When the shape is 2xN or 4xN and M == ld_dst, the 2xN and 4xN transposes achieve higher performance.
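A plain C++ sketch of the dispatch idea (illustrative names, not the AVX-512 kernels themselves): when N == ld_src (or M == ld_dst), the matrix is a single contiguous block, which is what lets a shape-specialized kernel pack several rows into one wide vector register.

```cpp
#include <cstdint>

template <typename T>
void transpose_ref(int64_t M, int64_t N, const T* src, int64_t ld_src,
                   T* dst, int64_t ld_dst) {
  for (int64_t i = 0; i < M; ++i)
    for (int64_t j = 0; j < N; ++j)
      dst[j * ld_dst + i] = src[i * ld_src + j];
}

template <typename T>
void transpose_dispatch(int64_t M, int64_t N, const T* src, int64_t ld_src,
                        T* dst, int64_t ld_dst) {
  if ((N == 2 || N == 4) && N == ld_src) {
    // Mx2 / Mx4 fast path: source rows are contiguous, so a specialized
    // AVX-512 kernel could load many rows per register here.
    transpose_ref(M, N, src, ld_src, dst, ld_dst);
  } else if ((M == 2 || M == 4) && M == ld_dst) {
    // 2xN / 4xN fast path: destination rows are contiguous.
    transpose_ref(M, N, src, ld_src, dst, ld_dst);
  } else {
    transpose_ref(M, N, src, ld_src, dst, ld_dst);  // generic path
  }
}
```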

Pull Request resolved: pytorch#1177

Reviewed By: jianyuh

Differential Revision: D37638879

Pulled By: jiyuanzFB

fbshipit-source-id: e9fba362fc1c86ea96cb5cfe35408ab4488b0750
…ch#1265)

Summary:
Pull Request resolved: pytorch#1265

Similar idea to D34830535 but without H<->D synchronization.
Launch at most 16x the number of SMs worth of thread blocks to avoid unnecessary work.
Make lru/lfu_cache_find_uncached_kernel work correctly when gridDim.x * blockDim.y < N_unique.
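A simplified CUDA sketch of the pattern (not the actual FBGEMM kernel): a grid-stride loop over the unique indices keeps the kernel correct even when the launch caps the grid at a small multiple (e.g. 16x) of the SM count, so gridDim.x * blockDim.y can be smaller than N_unique.

```cuda
#include <cstdint>

__global__ void find_uncached_sketch(const int64_t* unique_indices,
                                     int32_t N_unique,
                                     int32_t* is_uncached) {
  // Each thread group (one threadIdx.y row of the block) walks the unique
  // indices with a grid-wide stride instead of assuming one group per index.
  for (int32_t n = blockIdx.x * blockDim.y + threadIdx.y; n < N_unique;
       n += gridDim.x * blockDim.y) {
    // The real kernel probes the cache sets with the threadIdx.x lanes here;
    // this sketch only marks the slot as processed.
    if (threadIdx.x == 0) {
      is_uncached[n] = 1;
    }
  }
}
```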

Reviewed By: suphoff

Differential Revision: D39086495

fbshipit-source-id: 32e83ad92986ddf6481f90b39f65f1e131076276
…pytorch#1268)

Summary:
Pull Request resolved: pytorch#1268

Follow up of D39086495 applying to other cache kernels

Reviewed By: jianyuh

Differential Revision: D39124243

fbshipit-source-id: e0cdc8436c1c0036560eb5a9b53c2209d404e6cf
…rch#1269)

Summary:
Pull Request resolved: pytorch#1269

The sample inputs are loaded on CPU, and for merge we should still be able to handle sample inputs on CPU.

For mpe, if all the inputs are on CPU, the CPU op is dispatched, so add this corner case to the mpe CPU op.

Reviewed By: jianyuh

Differential Revision: D39108861

fbshipit-source-id: a0ff132f07f2a5b5583aa26ff227ebda61ae7833
Summary:
Pull Request resolved: pytorch#1259

This operator is similar to padding_fused_tbe_input_combine_cpu, but it pads lengths with zeros instead.

Reviewed By: jianyuh

Differential Revision: D39008642

fbshipit-source-id: 51075a80868742353ff22ad763354174df3a5097
Summary: Pull Request resolved: pytorch#1270

Reviewed By: xiaosun86

Differential Revision: D39124669

fbshipit-source-id: b6c3abea10613149a05f406c4c91c23009896a64
Summary:
Pull Request resolved: pytorch#1272

Copy the zipf generation from cupy. cupy is not in third_party yet; I quickly tried adding it but ran into a couple of problems, so I decided to just copy the zipf part. We may also want to consider adding zipf generation to PyTorch.

TBE benchmarking is significantly faster, so this should help TBE optimization.

TODO: more rigorous correctness check
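For illustration only, a plain C++ sketch of Zipf-distributed index generation (this is not the copied cupy code, which lives in the Python benchmark): indices are drawn with probability proportional to 1/k^s, which concentrates accesses on a small set of hot rows, similar to real embedding traffic.

```cpp
#include <cmath>
#include <cstdint>
#include <random>
#include <vector>

// Draw n embedding indices in [0, num_rows) with P(k) proportional to 1 / (k+1)^s.
std::vector<int64_t> sample_zipf(int64_t num_rows, double s, int64_t n,
                                 std::mt19937_64& rng) {
  std::vector<double> weights(num_rows);
  for (int64_t k = 0; k < num_rows; ++k) {
    weights[k] = 1.0 / std::pow(static_cast<double>(k + 1), s);
  }
  std::discrete_distribution<int64_t> dist(weights.begin(), weights.end());
  std::vector<int64_t> indices(n);
  for (auto& idx : indices) {
    idx = dist(rng);
  }
  return indices;
}
```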

Reviewed By: ajtulloch

Differential Revision: D39093435

fbshipit-source-id: 7aad735d4ef6e79d9d19114cb22fdad80d69b5fb
…torch#1276)

Summary:
A bracket is missing when using the `skipIfRocm` decorator in `uvm_test.py`, which causes the tests to be skipped on CUDA.

Pull Request resolved: pytorch#1276

Reviewed By: jspark1105

Differential Revision: D39263693

Pulled By: suphoff

fbshipit-source-id: 2787d44d4885e830ec505f75772e172abeec2bc7
Summary:
Pull Request resolved: pytorch#1271

Build fbgemm for aarch64 using simde.
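A hedged illustration of how SIMDe enables this (not FBGEMM's actual sources; assumes the SIMDe headers are on the include path): each x86 intrinsic is exposed under a simde_ prefix and is translated to native NEON or scalar code on aarch64.

```cpp
#include <simde/x86/sse.h>

// Adds two 4-float vectors; builds on x86 (native SSE) and on aarch64 (NEON/scalar).
void add4(const float* a, const float* b, float* out) {
  simde__m128 va = simde_mm_loadu_ps(a);             // unaligned 4-float load
  simde__m128 vb = simde_mm_loadu_ps(b);
  simde_mm_storeu_ps(out, simde_mm_add_ps(va, vb));  // elementwise add, store
}
```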

Reviewed By: pallab-zz, suphoff

Differential Revision: D39175324

fbshipit-source-id: 7331fe5d43a9ff455a9348451c790ad2322187a7
Summary:
Pull Request resolved: pytorch#1284

TODO: CPU implementation for backward pass + autograd function.

Reviewed By: xiaosun86

Differential Revision: D39124668

fbshipit-source-id: 92b0c201654af0e452c118a55720ad539214fdf7
Summary:
Pull Request resolved: pytorch#1287

Before this diff, HIP does 4 sequential scalar loads for the half
input in TBE's Vec4T.  This diff does a vector load for 4 halves.
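A hedged CUDA/HIP sketch of the change (types and names are simplified, not FBGEMM's Vec4T): the four halves are fetched with one 8-byte vector load instead of four scalar loads, assuming the pointer is 8-byte aligned.

```cuda
#include <cuda_fp16.h>

struct __align__(8) Half4 {
  __half2 a;
  __half2 b;
};

// Before: four separate scalar loads.
__device__ inline Half4 load_half4_scalar(const __half* p) {
  Half4 v;
  v.a = __halves2half2(p[0], p[1]);
  v.b = __halves2half2(p[2], p[3]);
  return v;
}

// After: one 64-bit vector load fetches all four halves (p must be 8-byte aligned).
__device__ inline Half4 load_half4_vector(const __half* p) {
  Half4 v;
  *reinterpret_cast<uint2*>(&v) = *reinterpret_cast<const uint2*>(p);
  return v;
}
```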

Reviewed By: jspark1105

Differential Revision: D39267283

fbshipit-source-id: 657b3ce7ea771a664b24450bb24da9a05535ddc2
Summary: Pull Request resolved: pytorch#1286

Reviewed By: jspark1105, mjanderson09

Differential Revision: D39335880

fbshipit-source-id: 885e3e300239be775eb6c760e5ba28af832986d9
Summary: Pull Request resolved: pytorch#1292

Reviewed By: jspark1105

Differential Revision: D39408174

fbshipit-source-id: de4481ee6a96200df3d147dc4ba8bc98faf050f9
Summary: Pull Request resolved: pytorch#1294

Reviewed By: jspark1105

Differential Revision: D39408601

fbshipit-source-id: 187c82aa6db475827b359298ff9639e9f6b784e2
pytorch#1293)

Summary: Pull Request resolved: pytorch#1293

Reviewed By: jspark1105

Differential Revision: D39408600

fbshipit-source-id: 26c6c1c430764e0bb4ec5276c956c08ea4933ab5
Summary:
Pull Request resolved: pytorch#1295

The LXU cache logic is in the critical path of the forward TBE kernel.
Even when the LXU cache is not used, the kernel still checks whether a
row should be fetched from the cache or HBM at runtime.  The branching
logic should be harmless for the memory (subsystem) bound case.
However, it could add significant overhead if TBE is conditional
bound.  (We have observed that FP16 weight type is generally compute
or conditional bound, while FP32 weight type is memory bound.)

This diff adds a static conditional in the forward TBE kernel to
enable/disable the LXU cache code path at compile time.  At runtime,
the host selects the kernel with/without cache enabled based on
whether the LXU cache is present.

This diff also moves the conditional outside the D loop.  It should
add a small benefit for the large D cases when cache is used.
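A minimal CUDA sketch of the compile-time switch (names and the copy loop are illustrative, not the real TBE kernel): a template parameter removes the cache branch entirely in the no-cache instantiation, and the per-row conditional is resolved once, outside the D loop.

```cuda
#include <cstdint>

template <bool kUseCache>
__global__ void tbe_forward_sketch(const float* dev_weights,
                                   const float* lxu_cache_weights,
                                   const int32_t* cache_locations,
                                   const int64_t* rows, int32_t num_rows,
                                   int32_t D, float* output) {
  int32_t n = blockIdx.x * blockDim.y + threadIdx.y;
  if (n >= num_rows) {
    return;
  }
  // Resolve the source row once, before the D loop.  When kUseCache is false,
  // the compiler drops the cache branch entirely.
  const float* row = &dev_weights[rows[n] * D];
  if (kUseCache && cache_locations[n] >= 0) {
    row = &lxu_cache_weights[static_cast<int64_t>(cache_locations[n]) * D];
  }
  for (int32_t d = threadIdx.x; d < D; d += blockDim.x) {
    output[static_cast<int64_t>(n) * D + d] = row[d];
  }
}

// Host side (sketch): pick the instantiation per launch.
//   if (use_lxu_cache) tbe_forward_sketch<true><<<grid, block>>>(...);
//   else               tbe_forward_sketch<false><<<grid, block>>>(...);
```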

Reviewed By: jspark1105

Differential Revision: D39353035

fbshipit-source-id: ee34e04d78fc9b4098923a7600e6a3b9e6ce388f
Summary:
Pull Request resolved: pytorch#1277

A similar idea to D37739239 (pytorch@fbd89e8), applied to the pooled case.

TODO
* ~~We can try this for other places like bwd.~~
* ~~Handle AMD~~
* Maybe we want to set kThreadGroupSize based on common/average D not based on max_D
* Extend to inference quantized emb fwd
* Check code size
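A hedged CUDA sketch of the sub-warp idea referenced here (illustrative thresholds and names; the real kernel does pooled accumulation rather than a plain copy): the kernel is compiled for several thread-group sizes and the host picks one from max_D, so small embedding dims do not leave most lanes idle.

```cuda
#include <cstdint>

template <int kThreadGroupSize>
__global__ void pooled_fwd_sketch(const float* weights, const int64_t* rows,
                                  int32_t num_rows, int32_t D, float* out) {
  int32_t n = blockIdx.x * blockDim.y + threadIdx.y;  // one thread group per row
  if (n >= num_rows) {
    return;
  }
  const float* src = &weights[rows[n] * D];
  for (int32_t d = threadIdx.x; d < D; d += kThreadGroupSize) {
    out[static_cast<int64_t>(n) * D + d] = src[d];
  }
}

void launch_pooled_fwd(const float* weights, const int64_t* rows,
                       int32_t num_rows, int32_t D, int32_t max_D, float* out) {
  // Fewer lanes per row for small max_D; blockDim.y packs more rows per block.
  if (max_D <= 32) {
    pooled_fwd_sketch<8><<<(num_rows + 31) / 32, dim3(8, 32)>>>(
        weights, rows, num_rows, D, out);
  } else if (max_D <= 64) {
    pooled_fwd_sketch<16><<<(num_rows + 15) / 16, dim3(16, 16)>>>(
        weights, rows, num_rows, D, out);
  } else {
    pooled_fwd_sketch<32><<<(num_rows + 7) / 8, dim3(32, 8)>>>(
        weights, rows, num_rows, D, out);
  }
}
```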

Reviewed By: sryap

Differential Revision: D39247288

fbshipit-source-id: 38af2dabac6652ac347d4b5ae185beed2b2c50ba
…h#1278)

Summary:
Pull Request resolved: pytorch#1278

Similar to D39247288, applied to bwd warp_per_row.

Reviewed By: jianyuh

Differential Revision: D39247622

fbshipit-source-id: c6fe9c7974acbac8a8558ddbb87e60dba6917a85
Summary:
Pull Request resolved: pytorch#1296

ITT

Reviewed By: brad-mengchi

Differential Revision: D39413020

fbshipit-source-id: c8ec25fbb0ec86d6e4ed3624140febfb67dc7446
Summary:
Pull Request resolved: pytorch#1273

ITT

Reviewed By: jspark1105

Differential Revision: D39185322

fbshipit-source-id: 0e3ffd6d7b58115325a02e92ab2aec590154c099
…in test-infra (pytorch#1289)

Summary:
[As communicated internally](https://fb.workplace.com/groups/pytorch.dev/permalink/1189033455008466/), all repositories now rely on a single scale-config.yml that is on pytorch/test-infra. As such this file is no longer used and to avoid confusion it is better to remove it.

Here is a short summary of the announcement:

> [As previously announced](https://fb.workplace.com/groups/pytorch.dev/permalink/1173939633184515/), the scale-config.yml file in each repository for the pytorch/ organization is now not being used to control GHA runners. On its place, [the file with same path on test-infra](https://github.com/pytorch/test-infra/blob/main/.github/scale-config.yml) repository is controlling and enabling runners. If you feel the need for new runners, or change settings for current ones, feel free to submit a PR with required changes on the former file.

Pull Request resolved: pytorch#1289

Reviewed By: DanilBaibak

Differential Revision: D39424893

Pulled By: jeanschmidt

fbshipit-source-id: ecc4a3d74a77ddec3d484111618979fdc59705f7
Summary:
Pull Request resolved: pytorch#1279

Similar to D39247288 (pytorch@411e550), applied to bwd warp_per_row.

Reviewed By: sryap

Differential Revision: D39262119

fbshipit-source-id: 111269dc6c92328eb5252b504aca68452618315b
jspark1105 and others added 15 commits September 12, 2022 19:54
Summary:
Pull Request resolved: pytorch#1282

To prepare for D39266484

Reviewed By: jianyuh

Differential Revision: D39266468

fbshipit-source-id: e81e272daaa4259b2444f8dcf2686614cf8d81e2
Summary:
Pull Request resolved: pytorch#1299

This PR fixes FBGEMM compilation for AMD GPUs. Because `kThreadGroupSize` is 64 for AMD GPUs, we cannot compile `shfl_sync_mask` correctly (1 << 64 causes an overflow).
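A small sketch of the fix (the exact FBGEMM constant and its type may differ): when the thread group spans the full 64-lane wavefront, the all-lanes mask has to be produced without shifting a 64-bit value by 64 bits, which is undefined behaviour.

```cpp
#include <cstdint>

constexpr uint64_t full_lane_mask(int n) {
  // Avoid the oversized shift: 64 lanes get all bits set directly.
  return n >= 64 ? ~uint64_t{0} : ((uint64_t{1} << n) - 1);
}

constexpr int kThreadGroupSize = 64;  // wavefront width on AMD GPUs
constexpr uint64_t shfl_sync_mask = full_lane_mask(kThreadGroupSize);
```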

Reviewed By: sryap

Differential Revision: D39452453

fbshipit-source-id: a5500c285f8ecfb8da744afdcb09a9f97668df2f
and fix up comments

On an AMD Instinct MI250 system (optimization in progress), this increases performance by ~10% for fp32 weights, somewhat less for fp16 weights, across various output types.
HIP: improve perf for quantized embedding forward kernel
Summary:
Pull Request resolved: pytorch#1303

ITT

Reviewed By: jspark1105

Differential Revision: D39457909

fbshipit-source-id: 9a2bb3a6462d123c175db2dd8b6bda546487cc5e
Summary:
Pull Request resolved: pytorch#1316

https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/caches-and-self-modifying-code

Reviewed By: pallab-zz

Differential Revision: D39420206

fbshipit-source-id: 6b92a829a0865cf62251c9757debc3ca2dff14ed
Summary:
Pull Request resolved: pytorch#1300

Update all Vec4T structs to use vector load for FP16 (mainly for HIP)

Reviewed By: mjanderson09

Differential Revision: D39454067

fbshipit-source-id: 9441be236b6fd2ad45a45e5ebbd2cdc83b12f308
…ch#1283)

Summary:
Pull Request resolved: pytorch#1283

Introduce a new group of runs called really_long_runs and use multiple thread blocks on them.

~~TODO: check if the very long run is covered in unit tests~~

Reviewed By: jianyuh

Differential Revision: D39266484

fbshipit-source-id: 37e70639b35c08816926e71812635869f2788da6
Summary:
Pull Request resolved: pytorch#1317

Just for simplification.

Reviewed By: jspark1105

Differential Revision: D39454460

fbshipit-source-id: 78096327ffee239bb85ba2207069060a16f1c23d
Summary:
Pull Request resolved: pytorch#1249

Use WeightDecayMode.NONE as the default instead of L2 to make sure weight decay is not unintentionally turned on when `weight_decay` is set to a non-zero value by other defaults.

Reviewed By: jianyuh

Differential Revision: D38853048

fbshipit-source-id: 2d019cf332507fbb2758384811e652676957a494
Summary:
In this pull request, I am committing changes that add documentation to the fbgemm_gpu module.

- modified: fbgemm_gpu/docs/source/python-api/index.rst
- modified: fbgemm_gpu/fbgemm_gpu/__init__.py
- added: fbgemm_gpu/fbgemm_gpu/_fbgemm_gpu_docs.py

Pull Request resolved: pytorch#1290

Reviewed By: suphoff

Differential Revision: D39394632

Pulled By: gvavvari

fbshipit-source-id: e4544d57935a75f636b2a4451a7acdbc33b4e573
Summary:
Pull Request resolved: pytorch#1323

Before this diff, the TBE benchmark profiled and reported forward and backward performance together when profiling the backward pass.  This diff
updates the TBE benchmark to profile only the backward pass.

Reviewed By: jspark1105

Differential Revision: D39491994

fbshipit-source-id: ade42fe8ada80324f314c89d6c320ca261315840

@amathews-amd amathews-amd left a comment


LGTM

@liligwu liligwu merged commit 5f72103 into main Sep 15, 2022
liligwu pushed a commit that referenced this pull request Feb 8, 2023