
PaddlePaddle 2.6.0 buglist, part 1 #60882

Closed
jeng1220 opened this issue Jan 17, 2024 · 16 comments

@jeng1220 (Collaborator) commented Jan 17, 2024

Describe the Bug

Running the unit tests on Ampere or Hopper GPUs produces multiple failures.
24 failures have been compiled so far:
PaddlePaddle 2.6.0 buglist - part 1.xlsx

Additional Supplementary Information

Paddle version: 2.6.0
Paddle With CUDA: True

OS: ubuntu 22.04
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: N/A
CMake version: version 3.25.1
Libc version: glibc 2.35
Python version: 3.10.12

CUDA version: 12.3.107
Build cuda_12.3.r12.3/compiler.33567101_0
cuDNN version: 8.9.7
NVIDIA driver version: 535.129.03
NVIDIA GPU list:
GPU 0: Tesla V100-SXM2-16GB
GPU 1: Tesla V100-SXM2-16GB
GPU 2: Tesla V100-SXM2-16GB
GPU 3: Tesla V100-SXM2-16GB
GPU 4: Tesla V100-SXM2-16GB
GPU 5: Tesla V100-SXM2-16GB
GPU 6: Tesla V100-SXM2-16GB
GPU 7: Tesla V100-SXM2-16GB
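For reference, a short shell sketch of how an environment summary like the one above can be regenerated inside the test container. This is a hedged sketch: it assumes paddle is importable and that `paddle.version.full_version` and `paddle.device.is_compiled_with_cuda` exist in this build; the `nvidia-smi` query flags are standard.

```shell
# Reproduce the environment report above (run inside the test container).
python -c "import paddle; print('Paddle version:', paddle.version.full_version)"
python -c "import paddle; print('Paddle With CUDA:', paddle.device.is_compiled_with_cuda())"
gcc --version | head -n1          # GCC version
cmake --version | head -n1        # CMake version
nvidia-smi --query-gpu=index,name,driver_version --format=csv,noheader  # GPU list + driver
```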

@jeng1220 (Collaborator, Author) commented

@onecatcn ,
After updating release/2.6 and resolving the nvrtc issue (#60943), many failures remain.
The buglist in the first post has been updated.

@jeng1220 (Collaborator, Author) commented Jan 22, 2024

@onecatcn ,
Tried develop (4db394f); 12 severe failures remain.
Reproduction script:

#!/bin/bash
set -x
cd paddle/build
ctest --output-on-failure -R test_cuda_graph_partial_graph_static_run
ctest --output-on-failure -R test_graph_reindex
ctest --output-on-failure -R test_cuda_graphed_layer
ctest --output-on-failure -R test_unique
ctest --output-on-failure -R test_weight_decay
ctest --output-on-failure -R test_unique_static_build
ctest --output-on-failure -R test_post_training_quantization_resnet50
ctest --output-on-failure -R test_communicator_half_async
ctest --output-on-failure -R test_trt_convert_scatter
ctest --output-on-failure -R test_trt_convert_assign
ctest --output-on-failure -R test_trt_convert_lookup_table
ctest --output-on-failure -R test_post_training_quantization_mobilenetv1
ctest --output-on-failure -R test_trt_convert_yolo_box # timeout

log:
unittest-dev.log

The following were already fixed on the develop branch but are missing from release/2.6.0 and need to be cherry-picked.
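The per-test ctest invocations in the reproduction script above can also be expressed as a single loop. A hedged sketch (same test names and flag; the `^...$` anchoring is an addition so that, e.g., `test_unique` does not also match `test_unique_static_build`):

```shell
#!/bin/bash
# Loop form of the reproduction script; assumes the same paddle/build tree.
set -x
cd paddle/build || exit 1

tests=(
  test_cuda_graph_partial_graph_static_run
  test_graph_reindex
  test_cuda_graphed_layer
  test_unique
  test_weight_decay
  test_unique_static_build
  test_post_training_quantization_resnet50
  test_communicator_half_async
  test_trt_convert_scatter
  test_trt_convert_assign
  test_trt_convert_lookup_table
  test_post_training_quantization_mobilenetv1
  test_trt_convert_yolo_box   # timeout
)

for t in "${tests[@]}"; do
  # Anchor the regex so each entry matches exactly one test.
  ctest --output-on-failure -R "^${t}$"
done
```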

@paddle-bot paddle-bot bot added status/following-up 跟进中 and removed status/new-issue 新建 labels Jan 23, 2024
@jeng1220 (Collaborator, Author) commented

Please prioritize the following unit tests:

  • test_layer_norm_op_static_build (Failed)
  • test_layer_norm_op (Failed)
  • test_llm_int8_linear (Failed)
  • test_post_training_quantization_resnet50 (Failed)
  • test_post_training_quantization_mobilenetv1 (Failed)

@tianshuo78520a (Contributor) commented

Regarding the prioritized unit tests listed above: I have assigned owners to investigate, and these will be resolved as soon as possible.

@tianshuo78520a (Contributor) commented

PR #61284 has been submitted to fix the following unit tests:
test_post_training_quantization_resnet50 (Failed)
test_post_training_quantization_mobilenetv1 (Failed)

@jeng1220 (Collaborator, Author) commented Feb 1, 2024

@leo0519 submitted #61377, which fixes the following on develop:

  • test_trt_convert_bitwise_or
  • test_trt_convert_bitwise_and
  • test_trt_convert_assign
  • test_trt_convert_scatter
  • test_trt_convert_cast

@XieYunshen (Contributor) commented

The following unit tests have been fixed by #61631:
test_layer_norm_op_static_build (Failed)
test_layer_norm_op (Failed)

@onecatcn (Contributor) commented Feb 22, 2024

Submitted #61591, which fixes test_llm_int8_linear (Failed).

@onecatcn (Contributor) commented

We are not able to reproduce the failures in the following 2 tests:
test_sparse_fused_attention_op
trt_dynamic_shape_test

@zlsh80826 (Collaborator) commented

@onecatcn
We solved both test_sparse_fused_attention_op and trt_dynamic_shape_test. I will submit a PR to fix them.

@tianshuo78520a (Contributor) commented

test_communicator_half_async fixed.
PR: #62092

@jeng1220 (Collaborator, Author) commented Feb 29, 2024

@tianshuo78520a ,

test_communicator_half_async passed with V100 but still failed with Ampere GPU:

53/122 Test  #509: test_communicator_half_async .........................................***Failed    2.91 sec
[2024-02-27 13:01:33,775] [    INFO] distributed_strategy.py:214 - distributed strategy initialized
[2024-02-27 13:01:33,776] [    INFO] distributed_strategy.py:214 - distributed strategy initialized
I0227 13:01:33.808914 29935 program_interpreter.cc:212] New Executor is Running.
/opt/paddle/paddle/build/python/paddle/distributed/ps/the_one_ps.py:739: UserWarning: The PS mode must use MemorySparseTable.
  warnings.warn("The PS mode must use MemorySparseTable.")
/opt/paddle/paddle/build/python/paddle/distributed/ps/the_one_ps.py:750: UserWarning: The shard_num of sparse table is not set, use default value 1000 in cpups.
  warnings.warn(
/opt/paddle/paddle/build/python/paddle/distributed/ps/the_one_ps.py:772: UserWarning: The accessor of sparse table is not set, use default value.
  warnings.warn(
I0227 13:01:33.823201 29935 server.cpp:1107] Server[paddle::distributed::DownpourPsClientService] is serving on port=8500.
I0227 13:01:33.823215 29935 server.cpp:1110] Check out http://8dc94ca26cec:8500 in web browser.
I0227 13:01:33.823283 29935 brpc_ps_client.cc:131] BrpcPsClient Service addr: 192.168.128.5, 8500, 0
/opt/paddle/paddle/build/python/paddle/distributed/fleet/base/role_maker.py:329: UserWarning: gloo is not initialized, will not communicator with other nodes
  warnings.warn(self._err_init)
/opt/paddle/paddle/build/python/paddle/distributed/fleet/base/role_maker.py:373: UserWarning: gloo is not initialized, will not communicator with other nodes
  warnings.warn(self._err_init)
/opt/paddle/paddle/build/python/paddle/distributed/ps/the_one_ps.py:1249: UserWarning: gloo may not initialize correctly
  warnings.warn("gloo may not initialize correctly")
I0227 13:01:33.824164 29935 brpc_ps_client.cc:200] Client connect success:192.168.128.5:8500,
E0227 13:01:33.824612 30340 brpc_ps_client.cc:386] resquest cmd_id:11 failed, err:[E111]Fail to connect Socket{id=3 addr=127.0.0.1:52887} (0x0x558e724878d0): Connection refused [R1][E112]Not connected to 127.0.0.1:52887 yet, server_id=0 [R2][E112]Not connected to 127.0.0.1:52887 yet, server_id=0 [R3][E112]Not connected to 127.0.0.1:52887 yet, server_id=0
F0227 13:01:33.824656 29935 fleet.cc:445] Check failed: status == 0 push dense param failed, status[-1]
*** Check failure stack trace: ***
    @     0x7f6b458e3cd3  google::LogMessage::Fail()
    @     0x7f6b458e6254  google::LogMessage::SendToLog()
    @     0x7f6b458e3810  google::LogMessage::Flush()
    @     0x7f6b458e67cf  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f6b467c609a  paddle::distributed::FleetWrapper::PushDenseParamSync()
    @     0x7f6b45410064  (unknown)
    @     0x7f6b450edae3  (unknown)
    @     0x558e6b13810e  (unknown)
    @     0x558e6b12ea7b  _PyObject_MakeTpCall
    @     0x558e6b146acb  (unknown)
    @     0x558e6b126cfa  _PyEval_EvalFrameDefault
    @     0x558e6b1389fc  _PyFunction_Vectorcall
    @     0x558e6b12145c  _PyEval_EvalFrameDefault
    @     0x558e6b1389fc  _PyFunction_Vectorcall
    @     0x558e6b12145c  _PyEval_EvalFrameDefault
    @     0x558e6b1389fc  _PyFunction_Vectorcall
    @     0x558e6b1235d7  _PyEval_EvalFrameDefault
    @     0x558e6b1389fc  _PyFunction_Vectorcall
    @     0x558e6b1235d7  _PyEval_EvalFrameDefault
    @     0x558e6b1389fc  _PyFunction_Vectorcall
    @     0x558e6b1235d7  _PyEval_EvalFrameDefault
    @     0x558e6b1389fc  _PyFunction_Vectorcall
    @     0x558e6b1235d7  _PyEval_EvalFrameDefault
    @     0x558e6b1389fc  _PyFunction_Vectorcall
    @     0x558e6b1235d7  _PyEval_EvalFrameDefault
    @     0x558e6b1389fc  _PyFunction_Vectorcall
    @     0x558e6b1235d7  _PyEval_EvalFrameDefault
    @     0x558e6b1467f1  (unknown)
    @     0x558e6b126cfa  _PyEval_EvalFrameDefault
    @     0x558e6b1389fc  _PyFunction_Vectorcall
    @     0x558e6b12145c  _PyEval_EvalFrameDefault
    @     0x558e6b1389fc  _PyFunction_Vectorcall
Subprocess aborted

@tianshuo78520a (Contributor) commented

(quoting the test_communicator_half_async Ampere failure log above)

We attempted to replicate it in the A100 environment, but it was not successful. Could you please confirm if there is any merged repair code?

@GhostScreaming (Contributor) commented Mar 1, 2024

Re-tested test_semi_auto_parallel_hybrid_strategy locally; on the release/2.6 branch it may hit timeout issues. The Docker container needs a sufficiently large shared memory (shm) size, otherwise NCCL communication may fail.
Fix: PR #62278
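For reproducing that locally, a hedged sketch of enlarging the container's shared memory at launch. The flag names are standard `docker run` options; the 32g size and the image name are illustrative assumptions, not values from this thread:

```shell
# Give the container a large /dev/shm so NCCL can allocate its shared buffers.
docker run --gpus all --shm-size=32g -it paddlepaddle/paddle:latest /bin/bash
# Alternative: share the host IPC namespace (and its /dev/shm) instead.
docker run --gpus all --ipc=host -it paddlepaddle/paddle:latest /bin/bash
```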

@zlsh80826 (Collaborator) commented

@onecatcn
PR #62477 fixes test_sparse_fused_attention_op and trt_dynamic_shape_test.

@jeng1220 (Collaborator, Author) commented Mar 7, 2024

@tianshuo78520a ,
test_communicator_half_async passed with V100 but still failed with Ampere GPU:

53/122 Test  #509: test_communicator_half_async .........................................***Failed    2.91 sec
...
*** Check failure stack trace: ***
    @     0x7f6b458e3cd3  google::LogMessage::Fail()
    @     0x7f6b458e6254  google::LogMessage::SendToLog()
    @     0x7f6b458e3810  google::LogMessage::Flush()
    @     0x7f6b458e67cf  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f6b467c609a  paddle::distributed::FleetWrapper::PushDenseParamSync()
    @     0x7f6b45410064  (unknown)
...
    @     0x558e6b1389fc  _PyFunction_Vectorcall
Subprocess aborted

We attempted to replicate it in the A100 environment, but it was not successful. Could you please confirm if there is any merged repair code?

After discussion, the test_communicator_half_async only affects CPU computing, so we will disable it on our side.
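One way the test can be skipped on a local runner is to exclude it from the ctest invocation. A minimal sketch, assuming the same paddle/build tree as the reproduction script; `-E` is ctest's standard exclude-regex flag:

```shell
# Run the suite while skipping the CPU-only test that aborts on Ampere.
cd paddle/build
ctest --output-on-failure -E '^test_communicator_half_async$'
```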

@paddle-bot paddle-bot bot added status/close 已关闭 and removed status/following-up 跟进中 labels Apr 11, 2024
7 participants