Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix 'BlasAXPBY uninplemented' error when run deepfefm with custom npu #47546

Closed
wants to merge 3 commits into from

Conversation

USTCKAY
Copy link
Contributor

@USTCKAY USTCKAY commented Nov 1, 2022

PR types

Bug fixes

PR changes

Others

Describe

fix 'BlasAXPBY uninplemented' error when run deepfefm model with custom npu.

为了解决BlasAXPBY uninplemented的问题,在paddle/fluid/eager/accumulation/accumulation_node.cc中包含了paddle/fluid/eager/api/generated/eager_generated/forwards/dygraph_functions.h,但dygraph_functions在inference中没有编译,且accumulation node在inference中没有被用到,所以选择在inference中不编译accumulation node。同时修改了paddle/fulid/eager和paddle/fluid/eager/api/utils目录下的CmakeLists,因为其中包含或依赖了accumualtion node。

而paddle/fluid/eager/tests/data_structure_tests目录下的几个tests也需要依赖dygraph_functions,所以需要加上依赖。
paddle/fluid/eager/tests/task_tests目录下的几个tests也需要依赖dygraph_functions,但仅在windowsCI环境下体现,所以也顺道加上了依赖。这些改动参考了这个PR #41198

另外增加了单测以覆盖accumulation node中新增的add_ad_func

@paddle-bot
Copy link

paddle-bot bot commented Nov 1, 2022

你的PR提交成功,感谢你对开源项目的贡献!
请关注后续CI自动化测试结果,详情请参考Paddle-CI手册
Your PR has been submitted. Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

ronny1996
ronny1996 previously approved these changes Nov 15, 2022
custom_operator_node
${DEVICE_EVENT_LIBS}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里 DEVICE_EVENT_LIBS 的修改是不是误带进来的?没有看到哪里对 DEVICE_EVENT_LIBS 做了定义?

Copy link
Contributor Author

@USTCKAY USTCKAY Nov 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

经过测试,确实是误加进来的,已去除。

tensor);
#ifdef PADDLE_WITH_CUSTOM_DEVICE
}
#endif
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这段代码直接参考 https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/eager/grad_tensor_holder.cc#L145 修改就行,is_custom_device 是 tensor.h 里面的函数,及时不开 with custom device 的编译选项这个接口也存在且能返回正确结果,所以这里需要把代码用 #ifdef PADDLE_WITH_CUSTOM_DEVICE 包起来。

Copy link
Contributor Author

@USTCKAY USTCKAY Nov 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里是因为目前“BlasAXPBY uninplemented”的问题仅在使用custom device时出现,为了不对其他情况造成影响,所以使用了条件编译。经过测试,不加条件编译的的话编译window和mac上的paddle会报错。
ae7dc6211b050d587da0101674668ec2

accumulation_node
SRCS accumulation_node.cc
DEPS gradient_accumulator grad_node_info)
endif()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

确认下这里为什么需要将 accumulation_node target 用 if(NOT (NOT WITH_PYTHON AND ON_INFER)) 包起来?是说当编译 C++ inference lib 包的时候就不需要编译 accumulation_node 了吗?

根据 accumulation_node.cc 里面的修改,应该对 accumulation_node 添加 ${eager_deps} 的依赖就行?而且这里还删除了对 phi_api 的依赖是为什么呢?

Copy link
Contributor Author

@USTCKAY USTCKAY Nov 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1.这里加条件是因为accumulation node依赖的dygraph_function在编inference时不会被编译,且accumulation node在inference中不会用到,所以选择在inference中不编译accumulation node。
2.之前删除phi_api是因为编译时会报错,怀疑是phi_api带来的循环依赖造成的,所以删除了phi_api。今天经过测试,不删除phi_api也不会再编译出错,准备加回来。

t_values, &tensor_values);
#ifdef PADDLE_WITH_CUSTOM_DEVICE
}
#endif
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

同上,参考 grad_tensor_holder.cc 修改

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里是因为目前“BlasAXPBY uninplemented”的问题仅在使用custom device时出现,为了不对其他情况造成影响,所以使用了条件编译。

hook_utils
SRCS hook_utils.cc
DEPS phi tensor_utils autograd_meta grad_node_info utils accumulation_node)
if(NOT (NOT WITH_PYTHON AND ON_INFER))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

同上这里,CMAKE修改的原因是?

Copy link
Contributor Author

@USTCKAY USTCKAY Nov 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

因为hook_utils依赖accumulation_node,而accumulation_node在inference时不编译,所以这里也加上了条件。

if(WITH_CUSTOM_DEVICE)
set(test_deps ${eager_deps} ${generated_deps})
else()
set(test_deps ${eager_deps})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里需要添加 generated_deps 的原因是?

Copy link
Contributor Author

@USTCKAY USTCKAY Nov 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

之前不加无法编译成功,今天经过测试发现不加也能编译成功,已去掉。

cc_test_old(test_egr_ds_tensor_wrapper SRCS tensor_wrapper_test.cc DEPS
${eager_deps})
${test_deps})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里需要添加 generated_deps 的原因是?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

因为这几个tests依赖dyprapg_function,不加编译通过不了。
截屏2022-11-15 18 12 42

@USTCKAY USTCKAY force-pushed the fix_blasaxpby branch 2 times, most recently from 8593b2f to a7e3748 Compare November 18, 2022 07:03
@USTCKAY USTCKAY force-pushed the fix_blasaxpby branch 2 times, most recently from 1611bb0 to 12b7281 Compare December 5, 2022 12:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants