Fix set value grad #59034

Merged: 7 commits, Dec 25, 2023

@zoooo0820 zoooo0820 commented Nov 15, 2023

PR types

Bug fixes

PR changes

OPs

Description

Pcard-66985

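Previously, to support the distributed dynamic semi-automatic parallel mode, the set_value operator and its backward operator were migrated to phi. Before that migration, set_value handled both cases, value as a scalar and value as a tensor, and chose between them based on whether the input ValueTensor was present. Its backward operator, set_value_grad, worked the same way.

phi requires these to be two explicitly separate operators: set_value (value is a scalar) and set_value_with_tensor (value is a tensor). The backward operators therefore need the same split. In the pre-phi operator definition under fluid/operator, the backward of the former was incorrectly set to assign, so the phi migration in #58893 was modeled on incorrect behavior. As a result, when value is a scalar, the backward result of the assignment currently does not match expectations and needs to be fixed.

This PR adds a new kernel, set_value_with_scalar_grad, to compute the backward for this case and replace the previous incorrect assign behavior. Underneath, it still reuses SetValueGradImpl.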
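
To make the difference between the two backward behaviors concrete, here is a minimal 1-D sketch. It is not Paddle's kernel: the function names, the `std::vector<float>` representation, and the contiguous `[start, end)` slice are all assumptions made for illustration. It models `x[start:end] = scalar`, where the overwritten elements no longer depend on `x`, so their gradient should be zero rather than copied through.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 1-D model of the backward of `x[start:end] = scalar`.
// Positions outside the slice pass the upstream gradient through unchanged;
// positions inside the slice were overwritten, so they receive zero.
std::vector<float> ScalarSetValueGradSketch(const std::vector<float>& out_grad,
                                            std::size_t start,
                                            std::size_t end) {
  assert(start <= end && end <= out_grad.size());
  std::vector<float> x_grad = out_grad;  // start from the upstream gradient
  for (std::size_t i = start; i < end; ++i) {
    x_grad[i] = 0.0f;  // overwritten element: no dependence on x
  }
  return x_grad;
}

// Hypothetical model of the earlier behavior, where the backward was a
// plain assign: the upstream gradient is copied without zeroing the slice.
std::vector<float> AssignGradSketch(const std::vector<float>& out_grad) {
  return out_grad;
}
```

With an upstream gradient of `{1, 2, 3, 4}` and the slice `[1, 3)`, the first function yields `{1, 0, 0, 4}` while the assign-style version returns `{1, 2, 3, 4}`, which is the mismatch this PR describes.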

@wanghuancoder wanghuancoder left a comment

This change also needs a review from Yongkang. After this change, does SetValueGradOpTranscriber in paddle/fluid/ir_adaptor/translator/op_translator.cc need to be adjusted?

Comment on lines 356 to 358
switch (rank) {
case 1:
SetValueGradImpl<T, Context, 1>(dev_ctx,

Could this just call SetValueGradKernel directly and pass nullptr for value_grad?
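
The idea in this comment can be sketched as one shared backward routine with an optional output. This is a simplified illustration, not the real `SetValueGradKernel` signature: the function names, the 1-D `std::vector<float>` data, and the assumption that the value has exactly the shape of the slice (no broadcasting) are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical shared backward. Either output pointer may be nullptr,
// in which case that gradient is simply not computed.
void SetValueGradSketch(const std::vector<float>& out_grad,
                        std::size_t start,
                        std::size_t end,
                        std::vector<float>* x_grad,
                        std::vector<float>* value_grad) {
  assert(start <= end && end <= out_grad.size());
  if (x_grad != nullptr) {
    *x_grad = out_grad;
    for (std::size_t i = start; i < end; ++i) {
      (*x_grad)[i] = 0.0f;  // overwritten positions get zero gradient
    }
  }
  if (value_grad != nullptr) {
    // The assigned tensor receives the upstream gradient of the slice.
    const auto first = out_grad.begin() + static_cast<std::ptrdiff_t>(start);
    const auto last = out_grad.begin() + static_cast<std::ptrdiff_t>(end);
    value_grad->assign(first, last);
  }
}

// Scalar case: there is no value tensor, so no value gradient is requested.
void SetValueWithScalarGradSketch(const std::vector<float>& out_grad,
                                  std::size_t start,
                                  std::size_t end,
                                  std::vector<float>* x_grad) {
  SetValueGradSketch(out_grad, start, end, x_grad, nullptr);
}
```

Routing the scalar case through the shared routine keeps the zeroing logic in one place, which is presumably the motivation behind the suggestion.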

Comment on lines 154 to 155
op->SetType("set_value_grad");
op->SetInput(framework::GradVarName("Out"), this->OutputGrad("Out"));

This should be handled case by case: if ValueTensor is present, call set_value_grad; if there is no ValueTensor, call set_value_with_scalar_grad.
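
The branching suggested here can be sketched as a small selection helper. The helper name and its boolean parameter are hypothetical; only the two op type strings come from this conversation, and the real grad-op maker would inspect the forward op's inputs rather than take a flag.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: choose the backward op type from whether the
// forward set_value op was given a ValueTensor input.
std::string SelectSetValueGradOpType(bool has_value_tensor) {
  return has_value_tensor ? "set_value_grad" : "set_value_with_scalar_grad";
}
```

The returned name would then be what is passed to `op->SetType(...)` in the grad-op maker.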

op->SetType("assign");
op->SetInput("X", this->OutputGrad("Out"));
op->SetOutput("Out", this->InputGrad("Input"));
op->SetType("set_value_with_scalar_grad");

This line still does not appear to be covered. Do we need to register a new Op, set_value_with_scalar_grad?

paddle-ci-bot bot commented Nov 26, 2023

Sorry to inform you that 6cc1f71's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.

@kangguangli kangguangli left a comment

LGTM

@jeff41404 jeff41404 left a comment

LGTM

@jeff41404 jeff41404 merged commit 85e3693 into PaddlePaddle:develop Dec 25, 2023
29 checks passed
@zoooo0820 zoooo0820 deleted the fix_set_value_grad branch December 25, 2023 07:11
Wanglongzhi2001 pushed a commit to Wanglongzhi2001/Paddle that referenced this pull request Jan 7, 2024
* first fix the UT

* fix set value grad

* polish code

* add static mode backward test

* always has input valuetensor

* add dygraph test
zoooo0820 added a commit to zoooo0820/Paddle that referenced this pull request Jan 18, 2024
XiaoguangHu01 pushed a commit that referenced this pull request Jan 19, 2024
* Fix set value grad (#59034)

* first fix the UT

* fix set value grad

* polish code

* add static mode backward test

* always has input valuetensor

* add dygraph test

* Fix shape error in combined-indexing setitem (#60447)

* add ut

* fix shape error in combine-indexing

* fix ut

* Set value with scalar (#60452)

* set_value with scalar

* fix ut

* remove test_pir

* remove one test since 2.6 not support uint8-add
xiaoguoguo626807 pushed a commit that referenced this pull request Sep 30, 2024
* fix windows bug for common lib (#60308)

* fix windows bug

* fix windows bug

* fix windows bug

* fix windows bug

* fix windows bug

* fix windows bug

* Update inference_lib.cmake

* [Dy2St] Disable `test_bert` on CPU (#60173) (#60324)

Co-authored-by: gouzil <[email protected]>

* [Cherry-pick] fix weight quant kernel bug when n div 64 != 0 (#60184)

* fix weight-only quant kernel error for n div 64 !=0

* code style fix

* tile (#60261)

* add chunk allocator posix_memalign return value check (#60208) (#60495)

* fix chunk allocator posix_memalign return value check;test=develop

* fix chunk allocator posix_memalign return value check;test=develop

* fix chunk allocator posix_memalign return value check;test=develop

* update 2023 security advisory, test=document_fix (#60532)

* fix fleetutil get_online_pass_interval bug2; test=develop (#60545)

* fix fused_rope diff (#60217) (#60593)

* [cherry-pick]fix fleetutil get_online_pass_interval bug3 (#60620)

* fix fleetutil get_online_pass_interval bug3; test=develop

* fix fleetutil get_online_pass_interval bug3; test=develop

* fix fleetutil get_online_pass_interval bug3; test=develop

* [cherry-pick]update pdsa-2023-019 (#60649)

* update 2023 security advisory, test=document_fix

* update pdsa-2023-019, test=document_fix

* [Dy2St][2.6] Disable `test_grad` on release/2.6 (#60662)

* fix bug of ci (#59926) (#60785)

* [Dy2St][2.6] Disable `test_transformer` on `release/2.6` and update README (#60786)

* [Dy2St][2.6] Disable `test_transformer` on release/2.6 and update README

* [Docs] Update latest release version in README (#60691)

* restore order

* [Dy2St][2.6] Increase `test_transformer` and `test_mobile_net` ut time (#60829) (#60875)

* [Cherry-pick] fix set_value with scalar grad (#60930)

* Fix set value grad (#59034)

* first fix the UT

* fix set value grad

* polish code

* add static mode backward test

* always has input valuetensor

* add dygraph test

* Fix shape error in combined-indexing setitem (#60447)

* add ut

* fix shape error in combine-indexing

* fix ut

* Set value with scalar (#60452)

* set_value with scalar

* fix ut

* remove test_pir

* remove one test since 2.6 not support uint8-add

* [cherry-pick] This PR enable offset of generator for custom device. (#60616) (#60772)

* fix core dump when fallback gather_nd_grad and MemoryAllocateHost (#61067)

* fix qat tests (#61211) (#61284)

* [Security] fix draw security problem (#61161) (#61338)

* fix draw security problem

* fix _decompress security problem (#61294) (#61337)

* Fix CVE-2024-0521 (#61032) (#61287)

This uses shlex for safe command parsing to fix arbitrary code injection

Co-authored-by: ndren <[email protected]>

* [Security] fix security problem for prune_by_memory_estimation (#61382)

* OS Command Injection prune_by_memory_estimation fix

* Fix StyleCode

* [Security] fix security problem for run_cmd (#61285) (#61398)

* fix security problem for run_cmd

* [Security] fix download security problem (#61162) (#61388)

* fix download security problem

* check eval for security (#61389)

* [cherry-pick] adapt c_embedding to phi namespace for custom devices (#60774) (#61045)

Co-authored-by: Tian <[email protected]>

* [CherryPick] Fix issue 60092 (#61427)

* fix issue 60092

* update

* update

* update

* Fix unique (#60840) (#61044)

* fix unique kernel, row to num_out

* cinn(py-dsl): skip eval string in python-dsl (#61380) (#61586)

* remove _wget (#61356) (#61569)

* remove _wget

* remove _wget

* remove wget test

* fix layer_norm decompose dtyte bugs, polish codes (#61631)

* fix doc style (#61688)

* merge (#61866)

* [security] refine _get_program_cache_key (#61827) (#61896)

* security, refine _get_program_cache_key

* repeat_interleave support bf16 dtype (#61854) (#61899)

* repeat_interleave support bf16 dtype

* support bf16 on cpu

* Support Fake GroupWise Quant (#61900)

* fix launch when elastic run (#61847) (#61878)

* [Paddle-TRT] fix solve (#61806)

* [Cherry-Pick] Fix CacheKV Quant Bug (#61966)

* fix cachekv quant problem

* add unittest

* Sychronized the paddle2.4 adaptation changes

* clear third_part dependencies

* change submodules to right commits

* build pass with cpu only

* build success with maca

* build success with cutlass and fused kernels

* build with flash_attn and mccl

* build with test, fix some bugs

* fix some bugs

* fixed some compilation bugs

* fix bug in previous commit

* fix bug with split when col_size biger than 256

* add row_limit to show full kernel name

* add env.sh

Change-Id: I6fded2761a44af952a4599691e19a1976bd9b9d1

* add shape record

Change-Id: I273f5a5e97e2a31c1c8987ee1c3ce44a6acd6738

* modify paddle version

Change-Id: I97384323c38066e22562a6fe8f44b245cbd68f98

* wuzhao optimized the performance of elementwise kernel.

Change-Id: I607bc990415ab5ff7fb3337f628b3ac765d3186c

* fix split when dtype is fp16

Change-Id: Ia55d31d11e6fa214d555326a553eaee3e928e597

* fix bug in previous commit

Change-Id: I0fa66120160374da5a774ef2c04f133a54517069

* adapt flash_attn  new capi

Change-Id: Ic669be18daee9cecbc8542a14e02cdc4b8d429ba

* change eigen path

Change-Id: I514c0028e16d19a3084656cc9aa0838a115fc75c

* modify mcname -> replaced_name

Change-Id: Idc520d2db200ed5aa32da9573b19483d81a0fe9e

* fix some build bugs

Change-Id: I50067dfa3fcaa019b5736f4426df6d4e5f64107d

* add PADDLE_ENABLE_SAME_RAND_A100

Change-Id: I2d4ab6ed0b5fac3568562860b0ba1c4f8e346c61
done

* remove redundant warning, add patch from 2.6.1

Change-Id: I958d5bebdc68eb42fe433c76a3737330e00a72aa

* improve VectorizedBroadcastKernel

(cherry picked from commit 19069b26c0bf05a80cc834162db072f6b8aa2536)
Change-Id: Iaf5719d72ab52adbedc40d4788c52eb1ce4d517c
Signed-off-by: m00891 <[email protected]>

* fix bugs

(cherry picked from commit b007853a75dbd5de63028f4af82c15a5d3d81f7c)
Change-Id: Iaec0418c384ad2c81c354ef09d81f3e9dfcf82f1
Signed-off-by: m00891 <[email protected]>

* split ElementwiseDivGrad

(cherry picked from commit eb6470406b7d440c135a3f7ff68fbed9494e9c1f)
Change-Id: I60e8912be8f8d40ca83a54af1493adfa2962b2d6
Signed-off-by: m00891 <[email protected]>

* in VectorizedElementwiseKernel, it can now use vecSize = 8

(cherry picked from commit a873000a6c3bc9e2540e178d460e74e15a3d4de5)
Change-Id: Ia703b1e9e959558988fcd09182387da839d33922
Signed-off-by: m00891 <[email protected]>

* improve ModulatedDeformableCol2imCoordGpuKernel:1.block size 512->64;2.FastDivMod;3.fix VL1;4.remove DmcnGetCoordinateWeight divergent branches.

(cherry picked from commit 82c914bdd29f0eef87a52b229ff84bc456a1beeb)
Change-Id: I60b1fa9a9c89ade25e6b057c38e08616a24fa5e3
Signed-off-by: m00891 <[email protected]>

* Optimize depthwise_conv2d_grad compute (InputGrad):
1.use shared memory to optimize data load from global memory;
2.different blocksize for different input shape
3.FastDivMod for input shape div, >> and & for stride div.

(cherry picked from commit b34a5634d848f3799f5a8bcf884731dba72d3b20)
Change-Id: I0d8f22f2a2b9d99dc9fbfc1fb69b7bed66010229
Signed-off-by: m00891 <[email protected]>

* improve VectorizedBroadcastKernel with LoadType =
 2(kMixed)

(cherry picked from commit 728b9547f65e096b45f39f096783d2bb49e8556f)
Change-Id: I282dd8284a7cde54061780a22b397133303f51e5
Signed-off-by: m00891 <[email protected]>

* fix ElementwiseDivGrad

(cherry picked from commit 5f99c31904e94fd073bdd1696c3431cccaa376cb)
Change-Id: I3ae0d6c01eec124d12fa226a002b10d0c40f820c
Signed-off-by: m00891 <[email protected]>

* Revert "Optimize depthwise_conv2d_grad compute (InputGrad):"

This reverts commit b34a5634d848f3799f5a8bcf884731dba72d3b20.

(cherry picked from commit 398f5cde81e2131ff7014edfe1d7beaaf806adbb)
Change-Id: I637685b91860a7dea6df6cbba0ff2cf31363e766
Signed-off-by: m00891 <[email protected]>

* improve ElementwiseDivGrad and ElementwiseMulGrad

(cherry picked from commit fe32db418d8f075e083f31dca7010398636a6e67)
Change-Id: I4f7e0f2b5afd4e704ffcd7258def63afc43eea9c
Signed-off-by: m00891 <[email protected]>

* improve FilterBBoxes

(cherry picked from commit fe4655e86b92f5053fa886af49bf199307960a05)
Change-Id: I35003420292359f8a41b19b7ca2cbaae17dc5b45
Signed-off-by: m00891 <[email protected]>

* improve deformable_conv_grad op:1.adaptive block size;2.FastDivMod;3.move ldg up.

(cherry picked from commit a7cb0ed275a3488f79445ef31456ab6560e9de43)
Change-Id: Ia89df4e5a26de64baae4152837d2ce3076c56df1
Signed-off-by: m00891 <[email protected]>

* improve ModulatedDeformableIm2colGpuKernel:1.adaptive block size;2.FastDivMod;3.move ldg up.

(cherry picked from commit 4fb857655d09f55783d9445b91a2d953ed14d0b8)
Change-Id: I7df7f3af7b4615e5e96d33b439e5276be6ddb732
Signed-off-by: m00891 <[email protected]>

* improve KeBNBackwardData:replace 1.0/sqrt with rsqrt

(cherry picked from commit 333cba7aca1edf7a0e87623a0e55e230cd1e9451)
Change-Id: Ic808d42003677ed543621eb22a797f0ab7751baa
Signed-off-by: m00891 <[email protected]>

* Improve KeBNBackwardData, FilterGradAddupGpuKernel kernels. Improve nonzero and masked_select (forward only) OP.

(cherry picked from commit c907b40eb3f9ded6ee751e522c2a97a353ac93bd)
Change-Id: I7f4845405e64e7599134a8c497f464ac04dead88
Signed-off-by: m00891 <[email protected]>

* Optimize depthwise_conv2d:
1. 256 Blocksize launch for small shape inputgrad;
2. FastDivMod in inputgrad and filtergrad;
3. shared memory to put output_grad_data in small shape.

(cherry picked from commit f9f29bf7b8d929fb95eb1153a79d8a6b96d5b6d2)
Change-Id: I1a3818201784031dbedc320286ea5f4802dbb6b1
Signed-off-by: m00891 <[email protected]>

* Improve CheckFiniteAndUnscaleKernel by splitting the kernel into multiple tensors.

(cherry picked from commit 3bd200f262271a333b3947326442b86af7fb6da1)
Change-Id: I57c94cc5e709be8926e1b21da14b653cb18eabc3
Signed-off-by: m00891 <[email protected]>

* Revert "Improve CheckFiniteAndUnscaleKernel by splitting the kernel into multiple tensors."

This reverts commit 3bd200f262271a333b3947326442b86af7fb6da1.

(cherry picked from commit 86ed8adaa8c20d3c824eecb0ee1e10d365bcea37)
Change-Id: I5b8b7819fdf99255c65fe832d5d77f8e439bdecb
Signed-off-by: m00891 <[email protected]>

* improve ScatterInitCUDAKernel and ScatterCUDAKernel

(cherry picked from commit cddb01a83411c45f68363248291c0c4685e60b24)
Change-Id: Ie106ff8d65c21a8545c40636f021b73f3ad84587
Signed-off-by: m00891 <[email protected]>

* fix bugs and make the code easier to read

(cherry picked from commit 07ea3acf347fda434959c8c9cc3533c0686d1836)
Change-Id: Id7a727fd18fac4a662f8af1bf6c6b5ebc6233c9f
Signed-off-by: m00891 <[email protected]>

* Optimize FilterGard and InputGradSpL

Use tmp to store ldg data in the loop so calculate and ldg time
can fold each other.

(cherry picked from commit 7ddab49d868cdb6deb7c3e17c5ef9bbdbab86c3e)
Change-Id: I46399594d1d7f76b78b9860e483716fdae8fc7d6
Signed-off-by: m00891 <[email protected]>

* Improve CheckFiniteAndUnscaleKernel by putting address access to shared memory and making single thread do more tasks.

(cherry picked from commit 631ffdda2847cda9562e591dc87b3f529a51a978)
Change-Id: Ie9ffdd872ab06ff34d4daf3134d6744f5221e41e
Signed-off-by: m00891 <[email protected]>

* Optimize SwinTransformer

1.LayerNormBackward: remove if statement, now will always loop VPT
times for ldg128 in compiler, bool flag to control if write action
will be taken or not;
2.ContiguousCaseOneFunc: tmp saving division result for less division

(cherry picked from commit 422d676507308d26f6107bed924424166aa350d3)
Change-Id: I37aab7e2f97ae6b61c0f50ae4134f5eb1743d429
Signed-off-by: m00891 <[email protected]>

* Optimize LayerNormBackwardComputeGradInputWithSmallFeatureSize

Set BlockDim.z to make blockSize always be 512, each block can
handle several batches.
Then all threads will loop 4 times for better performance.

(cherry picked from commit 7550c90ca29758952fde13eeea74857ece41908b)
Change-Id: If24de87a0af19ee07e29ac2e7e237800f0181148
Signed-off-by: m00891 <[email protected]>

* improve KeMatrixTopK:1.fix private memory;2.modify max grid size;3.change it to 64 warp reduce.

(cherry picked from commit a346af182b139dfc7737e5f6473dc394b21635d7)
Change-Id: I6c8d8105fd77947c662e6d22a0d15d7bad076bde
Signed-off-by: m00891 <[email protected]>

* Modify LayerNorm Optimization

Might have lossdiff with old optimization without atomicAdd.

(cherry picked from commit 80b0bcaa9a307c94dbeda658236fd75e104ccccc)
Change-Id: I4a7c4ec2a0e885c2d581dcebc74464830dae7637
Signed-off-by: m00891 <[email protected]>

* improve roi_align op:1.adaptive block size;2.FastDivMod.

(cherry picked from commit cc421d7861c359740de0d2870abcfde4354d8c71)
Change-Id: I55c049e951f93782af1c374331f44b521ed75dfe
Signed-off-by: m00891 <[email protected]>

* add workaround for parameters dislocation when calling BatchedGEMM<float16>.

Change-Id: I5788c73a9c45f65e60ed5a88d16a473bbb888927

* fix McFlashAttn string

Change-Id: I8b34f02958ddccb3467f639daaac8044022f3d34

* [C500-27046] fix wb issue

Change-Id: I77730da567903f43ef7a9992925b90ed4ba179c7

* Support compiling external ops

Change-Id: I1b7eb58e7959daff8660ce7889ba390cdfae0c1a

* support flash attn varlen api and support arm build

Change-Id: I94d422c969bdb83ad74262e03efe38ca85ffa673

* Add a copyright notice

Change-Id: I8ece364d926596a40f42d973190525d9b8224d99

* Modify some third-party dependency addresses to public network addresses

---------

Signed-off-by: m00891 <[email protected]>
Co-authored-by: risemeup1 <[email protected]>
Co-authored-by: Nyakku Shigure <[email protected]>
Co-authored-by: gouzil <[email protected]>
Co-authored-by: Wang Bojun <[email protected]>
Co-authored-by: lizexu123 <[email protected]>
Co-authored-by: danleifeng <[email protected]>
Co-authored-by: Vigi Zhang <[email protected]>
Co-authored-by: tianhaodongbd <[email protected]>
Co-authored-by: zyfncg <[email protected]>
Co-authored-by: JYChen <[email protected]>
Co-authored-by: zhaohaixu <[email protected]>
Co-authored-by: Spelling <[email protected]>
Co-authored-by: zhouzj <[email protected]>
Co-authored-by: wanghuancoder <[email protected]>
Co-authored-by: ndren <[email protected]>
Co-authored-by: Nguyen Cong Vinh <[email protected]>
Co-authored-by: Ruibin Cheung <[email protected]>
Co-authored-by: Tian <[email protected]>
Co-authored-by: Yuanle Liu <[email protected]>
Co-authored-by: zhuyipin <[email protected]>
Co-authored-by: 6clc <[email protected]>
Co-authored-by: Wenyu <[email protected]>
Co-authored-by: Xianduo Li <[email protected]>
Co-authored-by: Wang Xin <[email protected]>
Co-authored-by: Chang Xu <[email protected]>
Co-authored-by: wentao yu <[email protected]>
Co-authored-by: zhink <[email protected]>
Co-authored-by: handiz <[email protected]>
Co-authored-by: zhimin Pan <[email protected]>
Co-authored-by: m00891 <[email protected]>
Co-authored-by: shuliu <[email protected]>
Co-authored-by: Yanxin Zhou <[email protected]>
Co-authored-by: Zhao Wu <[email protected]>
Co-authored-by: m00932 <[email protected]>
Co-authored-by: Fangzhou Feng <[email protected]>
Co-authored-by: junwang <[email protected]>
Co-authored-by: m01097 <[email protected]>

4 participants