
Ifu 2022 09 15 #18

Merged
merged 45 commits into main on Sep 15, 2022

Conversation

@liligwu (Collaborator) commented Sep 15, 2022

root and others added 30 commits August 16, 2022 00:20
On an AMD Instinct MI250 system (optimization in progress), this increases performance by ~10% for fp32 weights, and somewhat less for fp16 weights, across various output types.
Summary:
Pull Request resolved: pytorch#1257

Add x86 intrinsics guard

Reviewed By: pallab-zz, jianyuh

Differential Revision: D38968961

fbshipit-source-id: 462eb5c31f5fba919c0772eb22d93e18aca61f7b
Summary:
Pull Request resolved: pytorch#1258

Fix several things on cache flushing, time recording, etc.

Reviewed By: jspark1105

Differential Revision: D39007727

fbshipit-source-id: cccc13b794c0b3761a8a04c331b30eb775a73471
Summary:
Pull Request resolved: pytorch#1260

This diff just adds automatic fix hints for facebook-cuda-safe-api-call-check and facebook-cuda-safe-kernel-call-check from clang-tidy

Reviewed By: r-barnes

Differential Revision: D38945713

fbshipit-source-id: 11c676a71bed8d23bc414c8498141c7afa94b033
Summary:
Pull Request resolved: pytorch#1262

PyTorch Tensor should be passed by const reference by default; copying it incurs an atomic reference count bump.

I understand that we're launching GPU kernels so this is unlikely to improve latency, but removing waste seems better than not removing it.
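To illustrate the point, here is a minimal sketch (hypothetical function names, assuming ATen is available, not FBGEMM's actual signatures): the by-value parameter copies the tensor handle and atomically bumps its refcount on every call, while the const reference does not.

```cpp
#include <ATen/ATen.h>

// Pass-by-value: constructing the parameter copies the tensor handle and
// atomically increments the underlying TensorImpl's reference count.
int64_t numel_by_value(at::Tensor t) { return t.numel(); }

// Pass-by-const-reference: no copy, no refcount traffic; the callee can still
// read the tensor, and can make an explicit copy if it truly needs ownership.
int64_t numel_by_const_ref(const at::Tensor& t) { return t.numel(); }

int main() {
  at::Tensor t = at::zeros({8});
  return static_cast<int>(numel_by_value(t) - numel_by_const_ref(t));  // 0
}
```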

Reviewed By: ajtulloch, jspark1105

Differential Revision: D39040322

fbshipit-source-id: 3941ef27e02002a57e9cb2811db14d8bc471132d
… and optimize for ROCm (pytorch#1240)

Summary:
Make weihanmines's PR #13 upstreamable.
sryap, would you please review the PR and consider converting it to a draft? Thank you.

Pull Request resolved: pytorch#1240

Reviewed By: sryap

Differential Revision: D38507621

Pulled By: shintaro-iwasaki

fbshipit-source-id: 5b4532c0e79ce49a2f93c2a455a6392a1c7c2f16
Summary:
Pull Request resolved: pytorch#1264

- Update the metric rule dependency for using it outside FBGEMM.
- Update the accumulator type for AUC

Reviewed By: jspark1105

Differential Revision: D39090840

fbshipit-source-id: 0556c53f7ffedac68f6475a487e9aa8e9c28d4e4
Summary:
There is a data overflow for the int type when the input length > 2,147,483,647. The PyTorch side has reported an issue for code using FBGEMM Quantize (see pytorch/pytorch#80501); the user's example has an input length of 5,117,410,688, but the FBGEMM side uses **int** to represent the input length, so a wrong value is produced even in the single-thread case.

This PR fixes only the two **Quantize** and **FindMinMax** APIs that are used on the PyTorch side; other functions may also need to be updated to use a higher-precision dtype.
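A minimal sketch of the failure mode (illustrative only, not the FBGEMM signatures): a 64-bit length silently truncates when narrowed to int, so any loop bound derived from it is wrong.

```cpp
#include <cstdint>
#include <cstdio>

int main() {
  std::int64_t len = 5117410688LL;       // input length from the reported issue
  int narrowed = static_cast<int>(len);  // truncates to 822443392 on typical platforms
  std::printf("int64_t len = %lld, int len = %d\n",
              static_cast<long long>(len), narrowed);
  // A loop such as `for (int i = 0; i < narrowed; ++i)` would then process far
  // fewer elements than requested, so the length parameter must be a 64-bit type.
  return 0;
}
```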

Pull Request resolved: pytorch#1261

Reviewed By: jianyuh

Differential Revision: D39089686

Pulled By: jspark1105

fbshipit-source-id: 9623bbb20bdba0f98040a1c8143e4bc552d2a6cb
Summary:
Add shape-specific Mx2, Mx4, 2xN, and 4xN transposes on AVX-512 to improve transpose performance for those shapes.
* When the shape is Mx2 or Mx4 and N == ld_src, the Mx2 and Mx4 transposes achieve higher performance.
* When the shape is 2xN or 4xN and M == ld_dst, the 2xN and 4xN transposes achieve higher performance.
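A plain C++ sketch of the dispatch idea (illustrative names, not the AVX-512 kernels themselves): when N == ld_src (or M == ld_dst), the matrix is a single contiguous block, which is what lets a shape-specialized kernel pack several rows into one wide vector register.

```cpp
#include <cstdint>

template <typename T>
void transpose_ref(int64_t M, int64_t N, const T* src, int64_t ld_src,
                   T* dst, int64_t ld_dst) {
  for (int64_t i = 0; i < M; ++i)
    for (int64_t j = 0; j < N; ++j)
      dst[j * ld_dst + i] = src[i * ld_src + j];
}

template <typename T>
void transpose_dispatch(int64_t M, int64_t N, const T* src, int64_t ld_src,
                        T* dst, int64_t ld_dst) {
  if ((N == 2 || N == 4) && N == ld_src) {
    // Mx2 / Mx4 fast path: source rows are contiguous, so a specialized
    // AVX-512 kernel could load many rows per register here.
    transpose_ref(M, N, src, ld_src, dst, ld_dst);
  } else if ((M == 2 || M == 4) && M == ld_dst) {
    // 2xN / 4xN fast path: destination rows are contiguous.
    transpose_ref(M, N, src, ld_src, dst, ld_dst);
  } else {
    transpose_ref(M, N, src, ld_src, dst, ld_dst);  // generic path
  }
}
```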

Pull Request resolved: pytorch#1177

Reviewed By: jianyuh

Differential Revision: D37638879

Pulled By: jiyuanzFB

fbshipit-source-id: e9fba362fc1c86ea96cb5cfe35408ab4488b0750
…ch#1265)

Summary:
Pull Request resolved: pytorch#1265

Similar idea to D34830535 but without H<->D synchronization.
Launch at most 16x the number of SMs worth of thread blocks to avoid unnecessary work.
Make lru/lfu_cache_find_uncached_kernel work correctly when gridDim.x * blockDim.y < N_unique.
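A simplified CUDA sketch of the pattern (not the actual FBGEMM kernel): a grid-stride loop over the unique indices keeps the kernel correct even when the launch caps the grid at a small multiple (e.g. 16x) of the SM count, so gridDim.x * blockDim.y can be smaller than N_unique.

```cuda
#include <cstdint>

__global__ void find_uncached_sketch(const int64_t* unique_indices,
                                     int32_t N_unique,
                                     int32_t* is_uncached) {
  // Each thread group (one threadIdx.y row of the block) walks the unique
  // indices with a grid-wide stride instead of assuming one group per index.
  for (int32_t n = blockIdx.x * blockDim.y + threadIdx.y; n < N_unique;
       n += gridDim.x * blockDim.y) {
    // The real kernel probes the cache sets with the threadIdx.x lanes here;
    // this sketch only marks the slot as processed.
    if (threadIdx.x == 0) {
      is_uncached[n] = 1;
    }
  }
}
```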

Reviewed By: suphoff

Differential Revision: D39086495

fbshipit-source-id: 32e83ad92986ddf6481f90b39f65f1e131076276
…pytorch#1268)

Summary:
Pull Request resolved: pytorch#1268

Follow up of D39086495 applying to other cache kernels

Reviewed By: jianyuh

Differential Revision: D39124243

fbshipit-source-id: e0cdc8436c1c0036560eb5a9b53c2209d404e6cf
…rch#1269)

Summary:
Pull Request resolved: pytorch#1269

The sample inputs are loaded on CPU, and for merge we should still be able to handle sample inputs on CPU.

For mpe, if all the inputs are on CPU, the CPU op is dispatched, so add this corner case to the mpe CPU op.

Reviewed By: jianyuh

Differential Revision: D39108861

fbshipit-source-id: a0ff132f07f2a5b5583aa26ff227ebda61ae7833
Summary:
Pull Request resolved: pytorch#1259

This operator is similar to padding_fused_tbe_input_combine_cpu, but it pads lengths with zeros instead.

Reviewed By: jianyuh

Differential Revision: D39008642

fbshipit-source-id: 51075a80868742353ff22ad763354174df3a5097
Summary: Pull Request resolved: pytorch#1270

Reviewed By: xiaosun86

Differential Revision: D39124669

fbshipit-source-id: b6c3abea10613149a05f406c4c91c23009896a64
Summary:
Pull Request resolved: pytorch#1272

Copy the zipf generation from cupy. cupy is not in third_party yet; I quickly tried adding it but ran into a couple of problems, so I decided to just copy the zipf part. We may also want to consider adding zipf generation to PyTorch.

TBE benchmarking is significantly faster, so this should help TBE optimization.

TODO: more rigorous correctness check
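For illustration only, a plain C++ sketch of Zipf-distributed index generation (this is not the copied cupy code, which lives in the Python benchmark): indices are drawn with probability proportional to 1/k^s, which concentrates accesses on a small set of hot rows, similar to real embedding traffic.

```cpp
#include <cmath>
#include <cstdint>
#include <random>
#include <vector>

// Draw n embedding indices in [0, num_rows) with P(k) proportional to 1 / (k+1)^s.
std::vector<int64_t> sample_zipf(int64_t num_rows, double s, int64_t n,
                                 std::mt19937_64& rng) {
  std::vector<double> weights(num_rows);
  for (int64_t k = 0; k < num_rows; ++k) {
    weights[k] = 1.0 / std::pow(static_cast<double>(k + 1), s);
  }
  std::discrete_distribution<int64_t> dist(weights.begin(), weights.end());
  std::vector<int64_t> indices(n);
  for (auto& idx : indices) {
    idx = dist(rng);
  }
  return indices;
}
```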

Reviewed By: ajtulloch

Differential Revision: D39093435

fbshipit-source-id: 7aad735d4ef6e79d9d19114cb22fdad80d69b5fb
…torch#1276)

Summary:
A bracket is missing when using the `skipIfRocm` decorator in `uvm_test.py`, which causes the tests to be skipped on CUDA.

Pull Request resolved: pytorch#1276

Reviewed By: jspark1105

Differential Revision: D39263693

Pulled By: suphoff

fbshipit-source-id: 2787d44d4885e830ec505f75772e172abeec2bc7
Summary:
Pull Request resolved: pytorch#1271

Build fbgemm for aarch64 using simde.
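A hedged illustration of how SIMDe enables this (not FBGEMM's actual sources; assumes the SIMDe headers are on the include path): each x86 intrinsic is exposed under a simde_ prefix and is translated to native NEON or scalar code on aarch64.

```cpp
#include <simde/x86/sse.h>

// Adds two 4-float vectors; builds on x86 (native SSE) and on aarch64 (NEON/scalar).
void add4(const float* a, const float* b, float* out) {
  simde__m128 va = simde_mm_loadu_ps(a);             // unaligned 4-float load
  simde__m128 vb = simde_mm_loadu_ps(b);
  simde_mm_storeu_ps(out, simde_mm_add_ps(va, vb));  // elementwise add, store
}
```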

Reviewed By: pallab-zz, suphoff

Differential Revision: D39175324

fbshipit-source-id: 7331fe5d43a9ff455a9348451c790ad2322187a7
Summary:
Pull Request resolved: pytorch#1284

TODO: CPU implementation for backward pass + autograd function.

Reviewed By: xiaosun86

Differential Revision: D39124668

fbshipit-source-id: 92b0c201654af0e452c118a55720ad539214fdf7
Summary:
Pull Request resolved: pytorch#1287

Before this diff, HIP does 4 sequential scalar loads for the half
input in TBE's Vec4T.  This diff does a vector load for 4 halves.
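A hedged CUDA/HIP sketch of the change (types and names are simplified, not FBGEMM's Vec4T): the four halves are fetched with one 8-byte vector load instead of four scalar loads, assuming the pointer is 8-byte aligned.

```cuda
#include <cuda_fp16.h>

struct __align__(8) Half4 {
  __half2 a;
  __half2 b;
};

// Before: four separate scalar loads.
__device__ inline Half4 load_half4_scalar(const __half* p) {
  Half4 v;
  v.a = __halves2half2(p[0], p[1]);
  v.b = __halves2half2(p[2], p[3]);
  return v;
}

// After: one 64-bit vector load fetches all four halves (p must be 8-byte aligned).
__device__ inline Half4 load_half4_vector(const __half* p) {
  Half4 v;
  *reinterpret_cast<uint2*>(&v) = *reinterpret_cast<const uint2*>(p);
  return v;
}
```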

Reviewed By: jspark1105

Differential Revision: D39267283

fbshipit-source-id: 657b3ce7ea771a664b24450bb24da9a05535ddc2
Summary: Pull Request resolved: pytorch#1286

Reviewed By: jspark1105, mjanderson09

Differential Revision: D39335880

fbshipit-source-id: 885e3e300239be775eb6c760e5ba28af832986d9
Summary: Pull Request resolved: pytorch#1292

Reviewed By: jspark1105

Differential Revision: D39408174

fbshipit-source-id: de4481ee6a96200df3d147dc4ba8bc98faf050f9
Summary: Pull Request resolved: pytorch#1294

Reviewed By: jspark1105

Differential Revision: D39408601

fbshipit-source-id: 187c82aa6db475827b359298ff9639e9f6b784e2
pytorch#1293)

Summary: Pull Request resolved: pytorch#1293

Reviewed By: jspark1105

Differential Revision: D39408600

fbshipit-source-id: 26c6c1c430764e0bb4ec5276c956c08ea4933ab5
Summary:
Pull Request resolved: pytorch#1295

The LXU cache logic is in the critical path of the forward TBE kernel.
Even when the LXU cache is not used, the kernel still checks whether a
row should be fetched from the cache or HBM at runtime.  The branching
logic should be harmless for the memory (subsystem) bound case.
However, it could add significant overhead if TBE is conditional
bound.  (We have observed that FP16 weight type is generally compute
or conditional bound, while FP32 weight type is memory bound.)

This diff adds a static conditional in the forward TBE kernel to
enable/disable the LXU cache code path at compile time.  At runtime,
the host selects the kernel with/without cache enabled based on
whether the LXU cache is present.

This diff also moves the conditional outside the D loop.  It should
add a small benefit for the large D cases when cache is used.
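A minimal CUDA sketch of the compile-time switch (names and the copy loop are illustrative, not the real TBE kernel): a template parameter removes the cache branch entirely in the no-cache instantiation, and the per-row conditional is resolved once, outside the D loop.

```cuda
#include <cstdint>

template <bool kUseCache>
__global__ void tbe_forward_sketch(const float* dev_weights,
                                   const float* lxu_cache_weights,
                                   const int32_t* cache_locations,
                                   const int64_t* rows, int32_t num_rows,
                                   int32_t D, float* output) {
  int32_t n = blockIdx.x * blockDim.y + threadIdx.y;
  if (n >= num_rows) {
    return;
  }
  // Resolve the source row once, before the D loop.  When kUseCache is false,
  // the compiler drops the cache branch entirely.
  const float* row = &dev_weights[rows[n] * D];
  if (kUseCache && cache_locations[n] >= 0) {
    row = &lxu_cache_weights[static_cast<int64_t>(cache_locations[n]) * D];
  }
  for (int32_t d = threadIdx.x; d < D; d += blockDim.x) {
    output[static_cast<int64_t>(n) * D + d] = row[d];
  }
}

// Host side (sketch): pick the instantiation per launch.
//   if (use_lxu_cache) tbe_forward_sketch<true><<<grid, block>>>(...);
//   else               tbe_forward_sketch<false><<<grid, block>>>(...);
```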

Reviewed By: jspark1105

Differential Revision: D39353035

fbshipit-source-id: ee34e04d78fc9b4098923a7600e6a3b9e6ce388f
Summary:
Pull Request resolved: pytorch#1277

A similar idea to D37739239 (pytorch@fbd89e8), applied to the pooled case.

TODO
* ~~We can try this for other places like bwd.~~
* ~~Handle AMD~~
* Maybe we want to set kThreadGroupSize based on common/average D not based on max_D
* Extend to inference quantized emb fwd
* Check code size
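A hedged CUDA sketch of the sub-warp idea referenced here (illustrative thresholds and names; the real kernel does pooled accumulation rather than a plain copy): the kernel is compiled for several thread-group sizes and the host picks one from max_D, so small embedding dims do not leave most lanes idle.

```cuda
#include <cstdint>

template <int kThreadGroupSize>
__global__ void pooled_fwd_sketch(const float* weights, const int64_t* rows,
                                  int32_t num_rows, int32_t D, float* out) {
  int32_t n = blockIdx.x * blockDim.y + threadIdx.y;  // one thread group per row
  if (n >= num_rows) {
    return;
  }
  const float* src = &weights[rows[n] * D];
  for (int32_t d = threadIdx.x; d < D; d += kThreadGroupSize) {
    out[static_cast<int64_t>(n) * D + d] = src[d];
  }
}

void launch_pooled_fwd(const float* weights, const int64_t* rows,
                       int32_t num_rows, int32_t D, int32_t max_D, float* out) {
  // Fewer lanes per row for small max_D; blockDim.y packs more rows per block.
  if (max_D <= 32) {
    pooled_fwd_sketch<8><<<(num_rows + 31) / 32, dim3(8, 32)>>>(
        weights, rows, num_rows, D, out);
  } else if (max_D <= 64) {
    pooled_fwd_sketch<16><<<(num_rows + 15) / 16, dim3(16, 16)>>>(
        weights, rows, num_rows, D, out);
  } else {
    pooled_fwd_sketch<32><<<(num_rows + 7) / 8, dim3(32, 8)>>>(
        weights, rows, num_rows, D, out);
  }
}
```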

Reviewed By: sryap

Differential Revision: D39247288

fbshipit-source-id: 38af2dabac6652ac347d4b5ae185beed2b2c50ba
…h#1278)

Summary:
Pull Request resolved: pytorch#1278

Similar to D39247288, applied to bwd warp_per_row.

Reviewed By: jianyuh

Differential Revision: D39247622

fbshipit-source-id: c6fe9c7974acbac8a8558ddbb87e60dba6917a85
Summary:
Pull Request resolved: pytorch#1296

ITT

Reviewed By: brad-mengchi

Differential Revision: D39413020

fbshipit-source-id: c8ec25fbb0ec86d6e4ed3624140febfb67dc7446
Summary:
Pull Request resolved: pytorch#1273

ITT

Reviewed By: jspark1105

Differential Revision: D39185322

fbshipit-source-id: 0e3ffd6d7b58115325a02e92ab2aec590154c099
…in test-infra (pytorch#1289)

Summary:
[As communicated internally](https://fb.workplace.com/groups/pytorch.dev/permalink/1189033455008466/), all repositories now rely on a single scale-config.yml that is on pytorch/test-infra. As such this file is no longer used and to avoid confusion it is better to remove it.

Here is a short summary of the announcement:

> [As previously announced](https://fb.workplace.com/groups/pytorch.dev/permalink/1173939633184515/), the scale-config.yml file in each repository for the pytorch/ organization is now not being used to control GHA runners. On its place, [the file with same path on test-infra](https://github.com/pytorch/test-infra/blob/main/.github/scale-config.yml) repository is controlling and enabling runners. If you feel the need for new runners, or change settings for current ones, feel free to submit a PR with required changes on the former file.

Pull Request resolved: pytorch#1289

Reviewed By: DanilBaibak

Differential Revision: D39424893

Pulled By: jeanschmidt

fbshipit-source-id: ecc4a3d74a77ddec3d484111618979fdc59705f7
Summary:
Pull Request resolved: pytorch#1279

Similar to D39247288 (pytorch@411e550), applied to bwd warp_per_row.

Reviewed By: sryap

Differential Revision: D39262119

fbshipit-source-id: 111269dc6c92328eb5252b504aca68452618315b
jspark1105 and others added 15 commits September 12, 2022 19:54
Summary:
Pull Request resolved: pytorch#1282

To prepare for D39266484

Reviewed By: jianyuh

Differential Revision: D39266468

fbshipit-source-id: e81e272daaa4259b2444f8dcf2686614cf8d81e2
Summary:
Pull Request resolved: pytorch#1299

This PR fixes FBGEMM compilation for AMD GPUs. Because `kThreadGroupSize` is 64 for AMD GPUs, we cannot compile `shfl_sync_mask` correctly (1 << 64 causes an overflow).
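A small sketch of the fix (the exact FBGEMM constant and its type may differ): when the thread group spans the full 64-lane wavefront, the all-lanes mask has to be produced without shifting a 64-bit value by 64 bits, which is undefined behaviour.

```cpp
#include <cstdint>

constexpr uint64_t full_lane_mask(int n) {
  // Avoid the oversized shift: 64 lanes get all bits set directly.
  return n >= 64 ? ~uint64_t{0} : ((uint64_t{1} << n) - 1);
}

constexpr int kThreadGroupSize = 64;  // wavefront width on AMD GPUs
constexpr uint64_t shfl_sync_mask = full_lane_mask(kThreadGroupSize);
```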

Reviewed By: sryap

Differential Revision: D39452453

fbshipit-source-id: a5500c285f8ecfb8da744afdcb09a9f97668df2f
and fix up comments

On an AMD Instinct MI250 system (optimization in progress), this increases performance by ~10% for fp32 weights, somewhat less for fp16 weights, across various output types.
HIP: improve perf for quantized embedding forward kernel
Summary:
Pull Request resolved: pytorch#1303

ITT

Reviewed By: jspark1105

Differential Revision: D39457909

fbshipit-source-id: 9a2bb3a6462d123c175db2dd8b6bda546487cc5e
Summary:
Pull Request resolved: pytorch#1316

https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/caches-and-self-modifying-code

Reviewed By: pallab-zz

Differential Revision: D39420206

fbshipit-source-id: 6b92a829a0865cf62251c9757debc3ca2dff14ed
Summary:
Pull Request resolved: pytorch#1300

Update all Vec4T structs to use vector load for FP16 (mainly for HIP)

Reviewed By: mjanderson09

Differential Revision: D39454067

fbshipit-source-id: 9441be236b6fd2ad45a45e5ebbd2cdc83b12f308
…ch#1283)

Summary:
Pull Request resolved: pytorch#1283

Introduce a new group of runs called really_long_runs and use multiple thread blocks on them.

~~TODO: check if the very long run is covered in unit tests~~

Reviewed By: jianyuh

Differential Revision: D39266484

fbshipit-source-id: 37e70639b35c08816926e71812635869f2788da6
Summary:
Pull Request resolved: pytorch#1317

Just for simplification.

Reviewed By: jspark1105

Differential Revision: D39454460

fbshipit-source-id: 78096327ffee239bb85ba2207069060a16f1c23d
Summary:
Pull Request resolved: pytorch#1249

Use WeightDecayMode.NONE as the default instead of L2 to make sure weight decay is not unintentionally turned on when `weight_decay` is set to a non-zero value by other defaults.

Reviewed By: jianyuh

Differential Revision: D38853048

fbshipit-source-id: 2d019cf332507fbb2758384811e652676957a494
Summary:
In this pull request, I am committing changes that add documentation to the fbgemm_gpu module.

- modified: fbgemm_gpu/docs/source/python-api/index.rst
- modified: fbgemm_gpu/fbgemm_gpu/__init__.py
- added: fbgemm_gpu/fbgemm_gpu/_fbgemm_gpu_docs.py

Pull Request resolved: pytorch#1290

Reviewed By: suphoff

Differential Revision: D39394632

Pulled By: gvavvari

fbshipit-source-id: e4544d57935a75f636b2a4451a7acdbc33b4e573
Summary:
Pull Request resolved: pytorch#1323

Before this diff, the TBE benchmark profiled and reported forward and backward performance together when profiling the backward pass.  This diff
updates the TBE benchmark to profile only the backward pass.

Reviewed By: jspark1105

Differential Revision: D39491994

fbshipit-source-id: ade42fe8ada80324f314c89d6c320ca261315840

@amathews-amd amathews-amd left a comment


LGTM

@liligwu liligwu merged commit 5f72103 into main Sep 15, 2022
liligwu pushed a commit that referenced this pull request Feb 8, 2023