Wrong parameter order in `all_to_all` by eee4017 · Pull Request #65093 · PaddlePaddle/Paddle

eee4017 · 2024-06-12T17:06:48Z

PR Category

Distributed Strategy

PR Types

Bug fixes

Description

The parameter order in the PyBind binding differs from the order the Python code passes, which causes a bug. Swapping the parameter order resolves the issue.

Please refer to the PyBind C++ interface (`all_to_all_tensor_on_calc_stream` and `all_to_all_tensor`), where `out_tensor` should be the first parameter.

However, in `_all_to_all_tensor_in_dygraph`, `in_tensor` is passed as the first parameter.

Additionally, the parameter order in PaddlePaddle is not standardized: even the highest-level APIs do not share a consistent parameter sequence. Please compare alltoall (stream) and alltoall (distributed).
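The positional mismatch described in this PR can be illustrated with a minimal, pure-Python sketch. Every name below (`binding_all_to_all`, `buggy_wrapper`, `fixed_wrapper`, `run`) is a hypothetical stand-in, not Paddle's actual code, and plain lists replace tensors. It only shows how passing `(in, out)` to a callee that expects `(out, in)` silently writes into the wrong buffer.

```python
# Hypothetical stand-ins for illustration only; NOT Paddle's real functions.

def binding_all_to_all(out_buf, in_buf):
    """Mimics a binding whose first positional parameter is the output."""
    out_buf[:] = in_buf  # copy the input into the output, in place


def buggy_wrapper(in_buf, out_buf):
    # Bug: arguments are passed in (in, out) order, but the callee
    # expects (out, in), so the input buffer gets overwritten.
    binding_all_to_all(in_buf, out_buf)


def fixed_wrapper(in_buf, out_buf):
    # Fix: match the callee's order, output first.
    binding_all_to_all(out_buf, in_buf)


def run(wrapper):
    src = [1, 2, 3]  # data to send
    dst = [0, 0, 0]  # buffer that should receive the data
    wrapper(src, dst)
    return src, dst


buggy_result = run(buggy_wrapper)
fixed_result = run(fixed_wrapper)
```

In the buggy variant no exception is raised; the source data is simply clobbered, which is why this kind of mismatch is easy to miss without a test that checks buffer contents.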

paddle-bot · 2024-06-12T17:06:53Z

Your PR has been submitted. Thanks for your contribution to this open-source project!
Please wait for the CI results first. See the Paddle CI Manual for details.

jeng1220 · 2024-06-13T02:25:43Z

CI failed, but the failure was NOT related to this PR:

ERROR: test_simple_net_hybrid_strategy (__main__.TestSemiAutoParallelLlamaDataLoader)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/workspace/Paddle/test/collective/test_communication_api_base.py", line 79, in run_test_case
    self._launcher = subprocess.run(
  File "/usr/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/usr/bin/python', '-u', '-m', 'paddle.distributed.launch', '--log_dir', '/tmp/tmpkx_sv8w9', '--devices', '0,1,2,3,4,5,6,7', 'semi_auto_llama_dataloader.py']' returned non-zero exit status 1.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/workspace/Paddle/test/auto_parallel/hybrid_strategy/test_semi_auto_parallel_llama_model.py", line 224, in test_simple_net_hybrid_strategy
    self.run_test_case(
  File "/workspace/Paddle/test/collective/test_communication_api_base.py", line 90, in run_test_case
    raise RuntimeError(
RuntimeError: Error occurs when running this test case. The return code of command ['/usr/bin/python', '-u', '-m', 'paddle.distributed.launch', '--log_dir', '/tmp/tmpkx_sv8w9', '--devices', '0,1,2,3,4,5,6,7', 'semi_auto_llama_dataloader.py'] is 1
----------------------------------------------------------------------
Ran 9 tests in 444.641s
FAILED (errors=1)

tianshuo78520a · 2024-06-14T08:17:12Z

This issue is still being fixed. I will rerun this CI once the fix is in.

JZ-LIANG

It would be best to add a unit test in a follow-up:
test/collective/process_group_nccl.py

JZ-LIANG

LGTM

paddle-bot bot added the contributor External developers label Jun 12, 2024

eee4017 mentioned this pull request Jun 12, 2024

The CUDA Async Allocator #65092

Merged

jeng1220 added the NVIDIA label Jun 13, 2024

onecatcn assigned zyfncg and JZ-LIANG and unassigned zyfncg Jun 14, 2024

Wrong parameter order

914697d

eee4017 force-pushed the lawu/all_to_all branch from 6c15a23 to 914697d on June 19, 2024 06:37

JZ-LIANG reviewed Jun 21, 2024

JZ-LIANG approved these changes Jun 21, 2024

JZ-LIANG merged commit 5d3df27 into PaddlePaddle:develop Jun 21, 2024

co63oc pushed a commit to co63oc/Paddle that referenced this pull request Jun 25, 2024

Wrong parameter order (PaddlePaddle#65093)

188f7b3

Co-authored-by: lawrence910426 <lawu@nvidia.com>

co63oc pushed a commit to co63oc/Paddle that referenced this pull request Jun 25, 2024

Wrong parameter order (PaddlePaddle#65093)

a49dba8

Co-authored-by: lawrence910426 <lawu@nvidia.com>

quanxiang-liu mentioned this pull request Jul 4, 2024

Revert "Wrong parameter order in all_to_all" #65699

Closed

deepllz added a commit to deepllz/Paddle that referenced this pull request Jul 4, 2024

Revert "Wrong parameter order (PaddlePaddle#65093)"

4620a0a

This reverts commit 5d3df27.

deepllz mentioned this pull request Jul 4, 2024

Revert "Wrong parameter order in all_to_all" #65701

Merged

sneaxiy pushed a commit that referenced this pull request Jul 4, 2024

Revert "Wrong parameter order (#65093)" (#65701)

b0c74a8

This reverts commit 5d3df27.

eee4017 mentioned this pull request Jul 5, 2024

[2024.Q4] Wrong parameter order in all_to_all #65732

Closed

JZ-LIANG merged 1 commit into PaddlePaddle:develop from eee4017:lawu/all_to_all
