[Auto Parallel] Logical Partition & Dist Op #35117
Conversation
Thanks for your contribution!
    process_mesh.topology,
    model_parallel_axis, rank_id)
group = new_process_group(group_ranks)
# print("@@@@@@@@@@@@@@@@@@@@@ 5", group)
Is this comment necessary?
removed ~
# NOTE Theoretically, the MP param init broadcast should be handled by
# each dist op itself. But if we inserted the broadcast op at that moment, the
# broadcast would run before the initializer, which leads to an undetermined case.
if self._enable_tensor_parallel:
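To make the ordering issue in the NOTE concrete, here is a toy sketch (plain Python; the op tuples and the parameter name are hypothetical placeholders, not real Paddle ops):

```python
# A startup program modeled as an ordered op list, for illustration only.

# Naive order: the dist op inserts its broadcast at creation time, so the
# broadcast runs before the initializer; the per-rank random init then
# overwrites the synced value, leaving the parameter undetermined across ranks.
naive_startup = [
    ("c_broadcast", "linear_0.w_0"),      # syncs a not-yet-initialized buffer
    ("gaussian_random", "linear_0.w_0"),  # per-rank init overwrites it: BAD
]

# Safe order (deferring the broadcast, as the NOTE describes): run all
# initializers first, then broadcast the MP params at the end of startup.
safe_startup = [
    ("gaussian_random", "linear_0.w_0"),  # per-rank random init
    ("c_broadcast", "linear_0.w_0"),      # then sync from rank 0: OK
]
```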
A little curious: why doesn't MP split these parameters? The whole purpose of MP is to split parameters.
We expose nn.Linear to the user, and it consists of two ops: weight-matmul & bias-add.
In MP, every weight (matmul) is split, but in row parallel the bias in nn.Linear is not split.
This NOTE is for that special case: the nn.Linear bias in row parallel.
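A minimal NumPy sketch of that special case (the shapes and the 2-rank split are illustrative, not the PR's code): each rank holds a row-shard of the weight and computes only a partial matmul, the partials are summed across ranks (the allreduce), and the full, unsplit bias is added exactly once after the reduction.

```python
import numpy as np

np.random.seed(0)
x = np.random.rand(4, 8)      # input:  [batch, in_features]
W = np.random.rand(8, 6)      # weight: [in_features, out_features]
b = np.random.rand(6)         # bias:   [out_features]

y_ref = x @ W + b             # serial nn.Linear reference

# Row parallel with 2 "ranks": shard the in_features (reduction) axis.
x0, x1 = x[:, :4], x[:, 4:]   # each rank sees a slice of the activation
W0, W1 = W[:4, :], W[4:, :]   # each rank holds a row-shard of the weight

partial0 = x0 @ W0            # rank 0's partial output
partial1 = x1 @ W1            # rank 1's partial output

# Emulated allreduce(sum) of the partials, then ONE full bias-add.
y_par = (partial0 + partial1) + b

assert np.allclose(y_ref, y_par)
```

If each rank added the bias before the allreduce, the bias would be summed once per rank, which is why the bias-add must stay unsplit in row parallel.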
    return no_grad_set_name


def _get_no_grad_set(loss, no_grad_set=None):
Why do we need to take care of the no-grad set ourselves?
This is used for finetuning: when finetuning, we should allow the user to set which parameters will not be updated (no grad).
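As a hedged sketch of that finetuning flow (the layer names are made up; this relies only on the standard static-graph `Optimizer.minimize(..., no_grad_set=...)` interface), freezing a pretrained backbone could look like:

```python
import paddle

paddle.enable_static()

main_prog = paddle.static.Program()
startup_prog = paddle.static.Program()
with paddle.static.program_guard(main_prog, startup_prog):
    x = paddle.static.data(name="x", shape=[None, 8], dtype="float32")
    hidden = paddle.static.nn.fc(x, size=16, name="backbone_fc")  # pretrained
    out = paddle.static.nn.fc(hidden, size=1, name="head_fc")     # new head
    loss = paddle.mean(out)

    # Freeze the backbone: collect its parameter names into the no-grad set,
    # so backward skips them and only the head parameters get updated.
    frozen = {p.name for p in main_prog.all_parameters()
              if p.name.startswith("backbone_fc")}

    opt = paddle.optimizer.Adam(learning_rate=1e-4)
    opt.minimize(loss, no_grad_set=frozen)
```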
for idx, op in reversed(list(enumerate(main_global_block.ops))):
    if is_loss_grad_op(op):
        loss_grad_var = main_global_block.vars[op.output_arg_names[0]]
        main_global_block._insert_op_without_sync(
When should we use without_sync, and when should we use sync_with? What's the purpose of the sync_with function?
The sync means synchronization between the Python-end program and the C++-end program. Every time you modify one end and want the modification to take effect on the other end, you should sync them.
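For illustration, a minimal sketch of that pattern, assuming Paddle's internal static-graph Block methods (`_insert_op_without_sync` and `_sync_with_cpp` are private framework APIs, so treat this as a sketch rather than a supported recipe): batch the Python-side insertions, then sync once so the C++-end ProgramDesc catches up.

```python
import paddle

paddle.enable_static()

main_prog = paddle.static.Program()
block = main_prog.global_block()

for i in range(3):
    out = block.create_var(name=f"c_{i}", dtype="float32")
    # Each call mutates only the Python-side op list; the C++-end
    # ProgramDesc is deliberately left stale for now.
    block._insert_op_without_sync(
        i,
        type="fill_constant",
        outputs={"Out": [out]},
        attrs={"shape": [1], "dtype": out.dtype, "value": float(i)},
    )

# One sync pushes all pending Python-side modifications to the C++ end;
# syncing once per batch of edits is cheaper than syncing per op.
block._sync_with_cpp()
```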
# NOTE naive gradient sync without overlapping,
# so there is no need to sync between calc and comm
# collect the grad vars
grad_to_sync = []
The following statements to build grad_to_sync are very hard for me to understand.
The actual meaning is: the allreduce (of that var) needs to be synced. We should sync the allreduce to ensure the optimizer update is conducted after the grad allreduce.
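As a toy, single-process emulation of that guarantee (plain NumPy, a hypothetical two-rank data-parallel setup, not the PR's code): every gradient is allreduce-averaged before the update runs, which is why the naive non-overlapped scheme needs no extra calc/comm synchronization.

```python
import numpy as np

def allreduce_mean(grads_per_rank):
    """Emulate allreduce(sum) / nranks across data-parallel ranks."""
    return sum(grads_per_rank) / len(grads_per_rank)

param = np.ones(4)                                # replicated parameter
local_grads = [np.full(4, 0.2), np.full(4, 0.4)]  # one grad per "rank"

# Naive sync: allreduce first, optimizer update strictly after, so the
# update is guaranteed to consume fully synced gradients on every rank.
synced_grad = allreduce_mean(local_grads)
param -= 0.1 * synced_grad                        # identical SGD step everywhere

print(param)  # -> [0.97 0.97 0.97 0.97]
```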
By now, the fleet.distributed_strategy options that need to transpile the forward program are the following:
1. AMP
2. Recompute
3. sharding
In my opinion, it would be better to remove the statements without corresponding implementations.
done!
    self._data_parallel_axis = 0
    self._model_parallel_axis = 1
else:
    self._data_parallel_axis = -1
delete this dp/mp strategy in the next step
Got it, this will be deleted in the next major release in Sep. Auto Parallel will NOT hold a global view of MP/DP/PP.
LGTM
LGTM
PR types
New features
PR changes
Others
Describe
Add the Partitioner and Dist Op implementations for Auto Parallel.
Partitioner: converts a serial network into distributed networks where the ops and vars are partitioned across different ranks (the shard-shape math is sketched below).
Dist Op: implements the computation and communication logic of each dist op.
Tensor-Parallel & Data-Parallel are supported in Auto Parallel now~
Functions added by this PR are not supposed to be called by users directly; for how to use Auto Parallel, please refer to PR
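To make "partitioned across different ranks" concrete, here is a hypothetical sketch of the shard-shape arithmetic (the function name, the example 2x4 mesh, and the dims_mapping convention of one mesh axis per tensor dim with -1 meaning replicated are illustrative assumptions, not this PR's exact API):

```python
def local_shard_shape(serial_shape, mesh_topology, dims_mapping):
    """Per-rank shape of a tensor sharded over a process mesh.

    dims_mapping[i] is the mesh axis that shards tensor dim i,
    or -1 if that dim is replicated on every rank.
    """
    local = []
    for size, mesh_axis in zip(serial_shape, dims_mapping):
        if mesh_axis == -1:
            local.append(size)  # replicated: keep the full dim
        else:
            assert size % mesh_topology[mesh_axis] == 0
            local.append(size // mesh_topology[mesh_axis])
    return local

# A [512, 1024] weight on a 2x4 mesh (dp axis 0, mp axis 1), sharded on
# its output dim along the mp axis:
print(local_shard_shape([512, 1024], [2, 4], [-1, 1]))  # -> [512, 256]
```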