
PaddlePaddle 2.6.0 buglist, part 1 #60882

Closed
jeng1220 opened this issue Jan 17, 2024 · 16 comments

@jeng1220 (Collaborator) commented Jan 17, 2024

Describe the Bug

Running the unit tests on Ampere or Hopper GPUs produces multiple failures.
24 failures have been compiled so far:
PaddlePaddle 2.6.0 buglist - part 1.xlsx

Additional Supplementary Information

Paddle version: 2.6.0
Paddle With CUDA: True

OS: ubuntu 22.04
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: N/A
CMake version: version 3.25.1
Libc version: glibc 2.35
Python version: 3.10.12

CUDA version: 12.3.107
Build cuda_12.3.r12.3/compiler.33567101_0
cuDNN version: 8.9.7
NVIDIA driver version: 535.129.03
NVIDIA GPU list:
GPU 0: Tesla V100-SXM2-16GB
GPU 1: Tesla V100-SXM2-16GB
GPU 2: Tesla V100-SXM2-16GB
GPU 3: Tesla V100-SXM2-16GB
GPU 4: Tesla V100-SXM2-16GB
GPU 5: Tesla V100-SXM2-16GB
GPU 6: Tesla V100-SXM2-16GB
GPU 7: Tesla V100-SXM2-16GB
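For reference, a short shell sketch of how an environment summary like the one above can be regenerated inside the test container. This is a hedged sketch: it assumes paddle is importable and that `paddle.version.full_version` and `paddle.device.is_compiled_with_cuda` exist in this build; the `nvidia-smi` query flags are standard.

```shell
# Reproduce the environment report above (run inside the test container).
python -c "import paddle; print('Paddle version:', paddle.version.full_version)"
python -c "import paddle; print('Paddle With CUDA:', paddle.device.is_compiled_with_cuda())"
gcc --version | head -n1          # GCC version
cmake --version | head -n1        # CMake version
nvidia-smi --query-gpu=index,name,driver_version --format=csv,noheader  # GPU list + driver
```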

@jeng1220 (Collaborator, Author) commented

@onecatcn ,
After updating release/2.6 and resolving the nvrtc issue (#60943), many failures remain.
The buglist in the first post has been updated.

@jeng1220 (Collaborator, Author) commented Jan 22, 2024

@onecatcn ,
Tried develop (4db394f); 12 severe failures remain.
Reproduction script:

#!/bin/bash
set -x
cd paddle/build
ctest --output-on-failure -R test_cuda_graph_partial_graph_static_run
ctest --output-on-failure -R test_graph_reindex
ctest --output-on-failure -R test_cuda_graphed_layer
ctest --output-on-failure -R test_unique
ctest --output-on-failure -R test_weight_decay
ctest --output-on-failure -R test_unique_static_build
ctest --output-on-failure -R test_post_training_quantization_resnet50
ctest --output-on-failure -R test_communicator_half_async
ctest --output-on-failure -R test_trt_convert_scatter
ctest --output-on-failure -R test_trt_convert_assign
ctest --output-on-failure -R test_trt_convert_lookup_table
ctest --output-on-failure -R test_post_training_quantization_mobilenetv1
ctest --output-on-failure -R test_trt_convert_yolo_box # timeout

log:
unittest-dev.log

The following were already fixed on the develop branch but are missing from release/2.6.0 and need to be cherry-picked.
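The per-test ctest invocations in the reproduction script above can also be expressed as a single loop. A hedged sketch (same test names and flag; the `^...$` anchoring is an addition so that, e.g., `test_unique` does not also match `test_unique_static_build`):

```shell
#!/bin/bash
# Loop form of the reproduction script; assumes the same paddle/build tree.
set -x
cd paddle/build || exit 1

tests=(
  test_cuda_graph_partial_graph_static_run
  test_graph_reindex
  test_cuda_graphed_layer
  test_unique
  test_weight_decay
  test_unique_static_build
  test_post_training_quantization_resnet50
  test_communicator_half_async
  test_trt_convert_scatter
  test_trt_convert_assign
  test_trt_convert_lookup_table
  test_post_training_quantization_mobilenetv1
  test_trt_convert_yolo_box   # timeout
)

for t in "${tests[@]}"; do
  # Anchor the regex so each entry matches exactly one test.
  ctest --output-on-failure -R "^${t}$"
done
```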

@paddle-bot paddle-bot bot added status/following-up 跟进中 and removed status/new-issue 新建 labels Jan 23, 2024
@jeng1220 (Collaborator, Author) commented

Please prioritize the following unit tests:

  • test_layer_norm_op_static_build (Failed)
  • test_layer_norm_op (Failed)
  • test_llm_int8_linear (Failed)
  • test_post_training_quantization_resnet50 (Failed)
  • test_post_training_quantization_mobilenetv1 (Failed)

@tianshuo78520a (Contributor) commented

Regarding the prioritized unit tests listed above: I have assigned owners to investigate, and these will be resolved as soon as possible.

@tianshuo78520a (Contributor) commented

PR #61284 has been submitted to fix the following unit tests:
test_post_training_quantization_resnet50 (Failed)
test_post_training_quantization_mobilenetv1 (Failed)

@jeng1220 (Collaborator, Author) commented Feb 1, 2024

@leo0519 submitted #61377, which fixes the following on develop:

  • test_trt_convert_bitwise_or
  • test_trt_convert_bitwise_and
  • test_trt_convert_assign
  • test_trt_convert_scatter
  • test_trt_convert_cast

@XieYunshen (Contributor) commented

The following unit tests have been fixed by #61631:
test_layer_norm_op_static_build (Failed)
test_layer_norm_op (Failed)

@onecatcn (Contributor) commented Feb 22, 2024

Submitted #61591, which fixes test_llm_int8_linear (Failed).

@onecatcn (Contributor) commented

We are not able to reproduce the failures in the following 2 tests:
test_sparse_fused_attention_op
trt_dynamic_shape_test

@zlsh80826 (Collaborator) commented

@onecatcn
We solved both test_sparse_fused_attention_op and trt_dynamic_shape_test. I will submit a PR to fix them.

@tianshuo78520a (Contributor) commented

test_communicator_half_async fixed.
PR: #62092

@jeng1220 (Collaborator, Author) commented Feb 29, 2024

@tianshuo78520a ,

test_communicator_half_async passed with V100 but still failed with Ampere GPU:

53/122 Test  #509: test_communicator_half_async .........................................***Failed    2.91 sec
[2024-02-27 13:01:33,775] [    INFO] distributed_strategy.py:214 - distributed strategy initialized
[2024-02-27 13:01:33,776] [    INFO] distributed_strategy.py:214 - distributed strategy initialized
I0227 13:01:33.808914 29935 program_interpreter.cc:212] New Executor is Running.
/opt/paddle/paddle/build/python/paddle/distributed/ps/the_one_ps.py:739: UserWarning: The PS mode must use MemorySparseTable.
  warnings.warn("The PS mode must use MemorySparseTable.")
/opt/paddle/paddle/build/python/paddle/distributed/ps/the_one_ps.py:750: UserWarning: The shard_num of sparse table is not set, use default value 1000 in cpups.
  warnings.warn(
/opt/paddle/paddle/build/python/paddle/distributed/ps/the_one_ps.py:772: UserWarning: The accessor of sparse table is not set, use default value.
  warnings.warn(
I0227 13:01:33.823201 29935 server.cpp:1107] Server[paddle::distributed::DownpourPsClientService] is serving on port=8500.
I0227 13:01:33.823215 29935 server.cpp:1110] Check out http://8dc94ca26cec:8500 in web browser.
I0227 13:01:33.823283 29935 brpc_ps_client.cc:131] BrpcPsClient Service addr: 192.168.128.5, 8500, 0
/opt/paddle/paddle/build/python/paddle/distributed/fleet/base/role_maker.py:329: UserWarning: gloo is not initialized, will not communicator with other nodes
  warnings.warn(self._err_init)
/opt/paddle/paddle/build/python/paddle/distributed/fleet/base/role_maker.py:373: UserWarning: gloo is not initialized, will not communicator with other nodes
  warnings.warn(self._err_init)
/opt/paddle/paddle/build/python/paddle/distributed/ps/the_one_ps.py:1249: UserWarning: gloo may not initialize correctly
  warnings.warn("gloo may not initialize correctly")
I0227 13:01:33.824164 29935 brpc_ps_client.cc:200] Client connect success:192.168.128.5:8500,
E0227 13:01:33.824612 30340 brpc_ps_client.cc:386] resquest cmd_id:11 failed, err:[E111]Fail to connect Socket{id=3 addr=127.0.0.1:52887} (0x0x558e724878d0): Connection refused [R1][E112]Not connected to 127.0.0.1:52887 yet, server_id=0 [R2][E112]Not connected to 127.0.0.1:52887 yet, server_id=0 [R3][E112]Not connected to 127.0.0.1:52887 yet, server_id=0
F0227 13:01:33.824656 29935 fleet.cc:445] Check failed: status == 0 push dense param failed, status[-1]
*** Check failure stack trace: ***
    @     0x7f6b458e3cd3  google::LogMessage::Fail()
    @     0x7f6b458e6254  google::LogMessage::SendToLog()
    @     0x7f6b458e3810  google::LogMessage::Flush()
    @     0x7f6b458e67cf  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f6b467c609a  paddle::distributed::FleetWrapper::PushDenseParamSync()
    @     0x7f6b45410064  (unknown)
    @     0x7f6b450edae3  (unknown)
    @     0x558e6b13810e  (unknown)
    @     0x558e6b12ea7b  _PyObject_MakeTpCall
    @     0x558e6b146acb  (unknown)
    @     0x558e6b126cfa  _PyEval_EvalFrameDefault
    @     0x558e6b1389fc  _PyFunction_Vectorcall
    @     0x558e6b12145c  _PyEval_EvalFrameDefault
    @     0x558e6b1389fc  _PyFunction_Vectorcall
    @     0x558e6b12145c  _PyEval_EvalFrameDefault
    @     0x558e6b1389fc  _PyFunction_Vectorcall
    @     0x558e6b1235d7  _PyEval_EvalFrameDefault
    @     0x558e6b1389fc  _PyFunction_Vectorcall
    @     0x558e6b1235d7  _PyEval_EvalFrameDefault
    @     0x558e6b1389fc  _PyFunction_Vectorcall
    @     0x558e6b1235d7  _PyEval_EvalFrameDefault
    @     0x558e6b1389fc  _PyFunction_Vectorcall
    @     0x558e6b1235d7  _PyEval_EvalFrameDefault
    @     0x558e6b1389fc  _PyFunction_Vectorcall
    @     0x558e6b1235d7  _PyEval_EvalFrameDefault
    @     0x558e6b1389fc  _PyFunction_Vectorcall
    @     0x558e6b1235d7  _PyEval_EvalFrameDefault
    @     0x558e6b1467f1  (unknown)
    @     0x558e6b126cfa  _PyEval_EvalFrameDefault
    @     0x558e6b1389fc  _PyFunction_Vectorcall
    @     0x558e6b12145c  _PyEval_EvalFrameDefault
    @     0x558e6b1389fc  _PyFunction_Vectorcall
Subprocess aborted

@tianshuo78520a (Contributor) commented

(quoting the test_communicator_half_async Ampere failure log above)

We attempted to replicate it in the A100 environment, but it was not successful. Could you please confirm if there is any merged repair code?

@GhostScreaming (Contributor) commented Mar 1, 2024

Re-tested test_semi_auto_parallel_hybrid_strategy locally; on the release/2.6 branch it may hit timeout issues. The Docker container needs a sufficiently large shared memory (shm) size, otherwise NCCL communication may fail.
Fix: PR #62278
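For reproducing that locally, a hedged sketch of enlarging the container's shared memory at launch. The flag names are standard `docker run` options; the 32g size and the image name are illustrative assumptions, not values from this thread:

```shell
# Give the container a large /dev/shm so NCCL can allocate its shared buffers.
docker run --gpus all --shm-size=32g -it paddlepaddle/paddle:latest /bin/bash
# Alternative: share the host IPC namespace (and its /dev/shm) instead.
docker run --gpus all --ipc=host -it paddlepaddle/paddle:latest /bin/bash
```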

@zlsh80826 (Collaborator) commented

@onecatcn
PR #62477 fixes test_sparse_fused_attention_op and trt_dynamic_shape_test.

@jeng1220 (Collaborator, Author) commented Mar 7, 2024

@tianshuo78520a ,
test_communicator_half_async passed with V100 but still failed with Ampere GPU:

53/122 Test  #509: test_communicator_half_async .........................................***Failed    2.91 sec
...
*** Check failure stack trace: ***
    @     0x7f6b458e3cd3  google::LogMessage::Fail()
    @     0x7f6b458e6254  google::LogMessage::SendToLog()
    @     0x7f6b458e3810  google::LogMessage::Flush()
    @     0x7f6b458e67cf  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f6b467c609a  paddle::distributed::FleetWrapper::PushDenseParamSync()
    @     0x7f6b45410064  (unknown)
...
    @     0x558e6b1389fc  _PyFunction_Vectorcall
Subprocess aborted

We attempted to replicate it in the A100 environment, but it was not successful. Could you please confirm if there is any merged repair code?

After discussion, the test_communicator_half_async only affects CPU computing, so we will disable it on our side.
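One way the test can be skipped on a local runner is to exclude it from the ctest invocation. A minimal sketch, assuming the same paddle/build tree as the reproduction script; `-E` is ctest's standard exclude-regex flag:

```shell
# Run the suite while skipping the CPU-only test that aborts on Ampere.
cd paddle/build
ctest --output-on-failure -E '^test_communicator_half_async$'
```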

@paddle-bot paddle-bot bot added status/close 已关闭 and removed status/following-up 跟进中 labels Apr 11, 2024
7 participants