@cangtianhuang (Contributor) commented Aug 6, 2025

PR Category

Operator Mechanism

PR Types

Performance

Description

Fixes the performance regression that the precision fix in #74081 caused for some models: https://console.cloud.baidu-int.com/devops/icafe/issue/DLTP-92332/show

The fix:

  1. Roll back to the ThrustCumsumKernel fast path
  2. Add fp16 and bf16 support to ThrustCumsumKernel
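
As a rough illustration of what a shape-based fast path can look like, the sketch below picks a scan path from the tensor shape. The function name `select_scan_path`, the returned labels, and the exact condition (every element lying along the scan axis) are assumptions made for illustration; the PR description does not spell out the actual dispatch condition.

```python
from math import prod


def select_scan_path(shape, axis):
    """Hypothetical dispatcher: choose a scan path from a non-empty shape.

    Assumption: the fast path applies when the tensor is effectively one
    long row, i.e. the scan axis holds every element of the tensor.
    """
    axis = axis % len(shape)  # normalize negative axes such as -1
    if prod(shape) == shape[axis]:
        return "thrust"       # single giant row: whole-array scan
    return "block_scan"       # many rows: one block per row


print(select_scan_path((2_000_000_000,), 0))  # thrust
print(select_scan_path((64, 1000), 1))        # block_scan
```

A shape such as `(1, 1000)` scanned along the last axis also counts as a single row under this assumed condition.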

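Earlier testing misjudged the numerical precision of the Thrust library. In new tests on the edge case of a very large 1D tensor (a single giant row), Thrust performs perfectly, whereas BlockScanKernel runs with grid_size == 1 and degenerates into serial execution, which slows the computation significantly.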
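
The serial degeneration can be pictured with a small CPU-side model. This is a minimal sketch, not the kernel itself: it assumes a row-per-block design in which each block walks its row tile by tile and carries the running prefix into the next tile. The name `block_scan_rows` and the `tile_size` parameter are hypothetical.

```python
from itertools import accumulate


def block_scan_rows(rows, tile_size):
    """Model a row-per-block inclusive scan.

    Returns the scanned rows and the longest chain of tile steps that
    must run one after another (the critical path).
    """
    scanned = []
    longest_chain = 0
    for row in rows:  # on a GPU, each row would go to its own block
        out, carry, steps = [], 0.0, 0
        for start in range(0, len(row), tile_size):
            tile = row[start:start + tile_size]
            # Scan this tile, seeded with the prefix from earlier tiles.
            tile_scan = list(accumulate(tile, initial=carry))[1:]
            out.extend(tile_scan)
            carry = tile_scan[-1]
            steps += 1  # the next tile cannot start before this one ends
        scanned.append(out)
        longest_chain = max(longest_chain, steps)
    return scanned, longest_chain


data = [float(i % 7) for i in range(4096)]
as_rows = [data[i:i + 64] for i in range(0, 4096, 64)]

_, many_rows_chain = block_scan_rows(as_rows, tile_size=64)
_, one_row_chain = block_scan_rows([data], tile_size=64)
print(many_rows_chain, one_row_chain)  # 1 64
```

With 64 rows, every block finishes after one tile step and the rows proceed in parallel. With the same data as a single row, one block has to run all 64 tile steps in sequence, which is the situation the description attributes to grid_size == 1.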

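The figures below compare the precision (relative to torch) and speed of the paddle.cumsum API through the BlockScanKernel branch and the ThrustCumsumKernel branch, for element counts from 200 thousand to 2 billion: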

[Two images: precision (relative to torch) and speed comparison of the BlockScanKernel and ThrustCumsumKernel branches]

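The results show that for 1D tensors, Thrust is significantly better than the current BlockScanKernel implementation in both precision and speed. The current BlockScanKernel implementation is designed mainly for multi-row data, where each block processes a different row in parallel.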
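
The measurements themselves are not reproduced here. As a hedged sketch of how a precision comparison of this kind can be scored, the code below measures the maximum absolute and relative error of a candidate scan against a reference. It is an assumption-laden stand-in: a float64 running sum plays the role of the torch reference, a float32-rounded running sum plays the role of the kernel under test, and all function names are hypothetical.

```python
import struct
from itertools import accumulate


def to_float32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def naive_cumsum_f32(values):
    """Left-to-right running sum, rounded to float32 after every add."""
    out, total = [], 0.0
    for v in values:
        total = to_float32(total + to_float32(v))
        out.append(total)
    return out


def reference_cumsum(values):
    """float64 running sum, standing in for the reference implementation."""
    return list(accumulate(values))


def max_errors(candidate, reference):
    """Return (max absolute error, max relative error) over all positions."""
    max_abs = max_rel = 0.0
    for c, r in zip(candidate, reference):
        abs_err = abs(c - r)
        max_abs = max(max_abs, abs_err)
        if r != 0.0:
            max_rel = max(max_rel, abs_err / abs(r))
    return max_abs, max_rel


exact = [0.5] * 1000      # every partial sum is exact in float32
inexact = [0.1] * 10000   # 0.1 is not representable, so error accumulates

print(max_errors(naive_cumsum_f32(exact), reference_cumsum(exact)))  # (0.0, 0.0)
print(max_errors(naive_cumsum_f32(inexact), reference_cumsum(inexact)))
```

Reporting both metrics is a deliberate choice: absolute error grows with the magnitude of the running sum on very long rows, while relative error keeps the comparison meaningful across element counts.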

Pcard-85711

paddle-bot bot commented Aug 6, 2025

Your PR has been submitted. Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

@cangtianhuang cangtianhuang changed the title from "[PHI] Fix BlockPrefixCallbackOp" to "[PHI] Fix paddle.cumsum calculation speed" on Aug 10, 2025
@lshpku lshpku merged commit 9db2cad into PaddlePaddle:develop Aug 12, 2025
68 of 69 checks passed
maxiaolong001 pushed a commit to maxiaolong001/Paddle that referenced this pull request Aug 12, 2025
* fix ThrustCumsumKernel

* refine

* refine ThrustCumsumKernel

* fix

* update ThrustCumsumKernel

* fix logcumsumexp in ThrustCumsumKernel
@cangtianhuang cangtianhuang deleted the fix-cumsum branch September 4, 2025 08:12