[HybridParallel]Support fp16 in dygraph hybrid parallel #36420

haohongxiang · 2021-10-13T14:00:06Z

PR types

Bug fixes

PR changes

Others

Describe

[HybridParallel]Support fp16 in dygraph hybrid parallel

Loss-curve precision comparison between single-card FP32 and 3D hybrid parallel + recompute + FP16

ForFishes

LGTM

zhiqiu · 2021-10-14T08:54:45Z

python/paddle/distributed/fleet/meta_parallel/pp_utils/utils.py

-        else:
-            ctx.is_fw_autocast = True
-        ctx.amp_mode = 'O1'
+        ctx.is_fw_autocast = False if tracer._amp_level == 0 else True


Use `tracer._amp_level == core.AmpLevel.O0` instead of comparing against the literal `0`.
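A minimal sketch of the suggestion, runnable without Paddle. The `AmpLevel` class below is a hypothetical stand-in for `core.AmpLevel` (assumed here to behave like an integer enum), and `is_fw_autocast` is an illustrative helper, not a function from the PR.

```python
from enum import IntEnum


class AmpLevel(IntEnum):
    """Hypothetical stand-in for core.AmpLevel; only the levels named in this review."""
    O0 = 0  # AMP disabled
    O1 = 1
    O2 = 2


def is_fw_autocast(amp_level):
    """Return True when the forward pass ran under autocast (any level except O0)."""
    # Comparing against the named member documents intent better than `== 0`.
    return amp_level != AmpLevel.O0


print(is_fw_autocast(AmpLevel.O0))
print(is_fw_autocast(AmpLevel.O2))
```

The named member also keeps the check correct if the numeric values behind the enum ever change.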

zhiqiu · 2021-10-14T08:55:06Z

python/paddle/distributed/fleet/meta_parallel/pp_utils/utils.py

-            ctx.is_fw_autocast = True
-        ctx.amp_mode = 'O1'
+        ctx.is_fw_autocast = False if tracer._amp_level == 0 else True
+        ctx.amp_level = 'O2' if tracer._amp_level == 2 else 'O1'


Same for the other AMP level.
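One reading of this comment is that the `tracer._amp_level == 2` comparison should use the enum as well. A sketch under that assumption, with a hypothetical `AmpLevel` stand-in for `core.AmpLevel` and an illustrative helper name:

```python
from enum import IntEnum


class AmpLevel(IntEnum):
    """Hypothetical stand-in for core.AmpLevel; only the levels named in this review."""
    O0 = 0
    O1 = 1
    O2 = 2


def saved_amp_level(tracer_level):
    """Mirror the diff above: store 'O2' for level O2, otherwise fall back to 'O1'."""
    return "O2" if tracer_level == AmpLevel.O2 else "O1"


print(saved_amp_level(AmpLevel.O2))
print(saved_amp_level(AmpLevel.O1))
```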

zhiqiu · 2021-10-14T08:56:44Z

python/paddle/distributed/fleet/meta_parallel/pipeline_parallel.py

-
-        train_loss = self._broadcast_final_loss()
-
+        with paddle.amp.auto_cast(enable=False):


It is OK to put the guard here, but I wonder whether it is needed.

Removing the guard causes a precision difference when broadcasting train_loss, so the guard is necessary here.
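To illustrate why a value that passes through FP16 can differ from its FP32 counterpart, here is a small standard-library sketch. It does not reproduce what `_broadcast_final_loss` does internally; the loss value and helper names are hypothetical.

```python
import struct


def round_to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def round_to_fp32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


loss = 2.34567  # hypothetical loss value
err16 = abs(round_to_fp16(loss) - loss)
err32 = abs(round_to_fp32(loss) - loss)
print(f"fp16 rounding error: {err16:.2e}")
print(f"fp32 rounding error: {err32:.2e}")
```

FP16 keeps 10 explicit mantissa bits against 23 for FP32, so a loss rounded to FP16 before the broadcast would no longer match an FP32 reference curve exactly.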

zhiqiu · 2021-10-14T08:58:29Z

python/paddle/distributed/fleet/base/fleet_base.py


            # TODO(shenliang03) Since dp allreduce in the optimizer is 
            # after the gradscaler, check_finite needs to synchronize global 
            # information. In the future, we should use check_group to speed.
+            self._found_inf = paddle.cast(self._found_inf, dtype="int32")


I think we could make found_inf int32 or fp32 from the start, to avoid these casts afterward.
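A pure-Python simulation of the pattern the cast supports: each rank converts its boolean flag to an integer, the integers are combined across ranks, and every rank ends up with the same global answer. The MAX reduction, the per-rank flags, and the helper name are assumptions for illustration; no Paddle API is called.

```python
def all_reduce_max(values):
    """Simulate an all-reduce with MAX: every rank receives the maximum over all ranks."""
    global_max = max(values)
    return [global_max] * len(values)


# Hypothetical per-rank flags: only rank 2 saw an inf/nan gradient.
found_inf_per_rank = [False, False, True, False]

# Stands in for paddle.cast(self._found_inf, dtype="int32") on each rank.
as_int32 = [int(flag) for flag in found_inf_per_rank]

reduced = all_reduce_max(as_int32)
global_found_inf = [bool(value) for value in reduced]
print(global_found_inf)
```

Storing the flag as an integer from the start, as suggested, would remove the conversion step before the reduction.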

zhiqiu

LGTM

ForFishes

LGTM

…imizer (#36707)

* fix bugs in HybridParallelClipGrad of hybrid_parallel_optimizer (#36237)
* fix bugs in HybridParallelClipGrad of hybrid_parallel_optimizer
* update
* update
* fix bugs in mp_layers, pp_layers and HybridParallelClipGrad (#36144)
* fix calling bug of HybridParallelClipGrad
* fix bugs of HybridParallelClipGrad
* add unittest of pp with HybridParallelClipGrad
* fix bugs in mp_layers.py
* update
* fix bugs in pp_layers.py
* update
* [HybridParallel]Rebuild code for pipeline (#36396)
* add no_sync for parameters sync
* add pipeline for moe
* [HybridParallel]Support fp16 in dygraph hybrid parallel (#36420)
* [HybridParallel]Support fp16 in dygraph hybrid parallel
* update
* update
* update for recompute
* add unittest of pp+fp16
* add unittest of recompute+fp16
* update
* modify ut
* modify ut of cond (#36475)
* fix bugs of ClipGradByGlobalNorm in HybridParallel (#36555)
* fix bugs of ClipGradByGlobalNorm
* add unittests
* add unittests
* [HybridParallel]fix bug of check_inf in fleet_base.py (#36651)
* fix bug of check_inf
* fix allreduce
* support ClipGradByGlobalNorm in sharding (#36012)
* support ClipGradByGlobalNorm in sharding
* support ClipGradByGlobalNorm in sharding
* test=allcase
* Update test_linalg_cond.py
* Update hybrid_parallel_util.py
* Update hybrid_parallel_util.py

Co-authored-by: ShenLiang
Co-authored-by: zhaoyingli

[HybridParallel]Support fp16 in dygraph hybrid parallel

37c6e64

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into pure_fp16_for_hp

285eff3

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into pure_fp16_for_hp

6a637a3

ForFishes previously approved these changes Oct 14, 2021

zhiqiu reviewed Oct 14, 2021

update

2cf4707

haohongxiang dismissed ForFishes’s stale review via 2cf4707 October 14, 2021 11:49

haohongxiang added 6 commits October 14, 2021 20:42

update

dfdc02e

update for recompute

6177ba1

add unittest of pp+fp16

190b15a

add unittest of recompute+fp16

fc70624

update

36acffc

modify ut

c454e00

zhiqiu approved these changes Oct 18, 2021

ForFishes approved these changes Oct 18, 2021

ForFishes merged commit 10f0a0f into PaddlePaddle:develop Oct 18, 2021

haohongxiang mentioned this pull request Oct 25, 2021

[HybridParallel]fix bug of check_inf in fleet_base.py #36651

Merged
