Skip to content

Conversation

@straigrand
Copy link
Contributor

@straigrand straigrand commented Jun 20, 2025

PR Category

Execute Infrastructure

PR Types

Improvements

Description

paddle.take_along_axis API 支持0-Size Tensor。

修改历程介绍如下:

  1. 问题复现与分析:

    • 根据任务要求,首先尝试使用PaddleAPITest复现BUG。在PaddleAPITest--accuracy=True精度对比模式下,由于paddle.take_along_axis(arr, index, axis)torch.take_along_dim(input, indices, dim)的参数名不统一,反复出现TypeError: missing a required argument的参数绑定错误,无法准确定位问题。
    • 切换到PaddleAPITest--paddle_only=True模式后,成功触发了Python层的TypeError: take_along_axis() got an unexpected keyword argument 'index'报错,这暴露了API前后端参数名不一致的问题。
    • 为绕过Python接口问题,并直接验证C++ Kernel,最终采用unittestOpTest框架编写单元测试。在未修复Kernel的情况下,成功复现了底层的C++错误:
      PreconditionNotMetError: Tensor holds no memory. Call Tensor::mutable_data firstly.
      [Hint: holder_ should not be null.] (at /home/aistudio/Paddle/paddle/phi/core/dense_tensor_impl.cc:46)
      [operator < take_along_axis > error]
      
    • 错误分析表明,当输入Tensor的元素个数为0时,算子Kernel未做特殊处理,导致程序在访问空Tensor的内存时崩溃。
  2. 前向修复 (Forward Fix):
    a. 定位API: 在Paddle/python/paddle/tensor/manipulation.py中找到了def take_along_axis(...)的Python定义,其核心实现调用了_C_ops.take_along_axis
    b. 定位算子定义: 使用grep发现,该算子没有独立的.yml文件,其定义位于paddle/phi/ops/yaml/ops.yaml中。
    c. 检查InferMeta: 根据ops.yaml的指引,在paddle/phi/infermeta/binary.cc中找到了TakeAlongAxisInferMeta函数。经分析,其out->set_dims(index.dims())逻辑能正确推导0-size Tensor的输出形状,无需修改。
    d. 修改Kernel:
    * 根据grep结果,定位到CPU Kernel文件为paddle/phi/kernels/cpu/take_along_axis_kernel.cc
    * 参照标准修复范式,在TakeAlongAxisKernel函数开头加入了对0-size情况的保护。核心逻辑是判断index.numel()是否为0,因为输出的形状完全由index决定。
    * 修复代码如下:

        // 案例一: 输出本身就是0-size
        if (index.numel() == 0) {
          dev_ctx.template Alloc<T>(out);
          return;
        }
        // 案例二: 输入是0-size,但输出不是0-size(广播情况)
        if (x.numel() == 0) {
          dev_ctx.template Alloc<T>(out);
          phi::funcs::SetConstant<Context, T> functor;
          functor(dev_ctx, out, static_cast<T>(0));
          return;
        }

    d. 依照以上原则修改CPU、GPU、XPU Kernel

  3. 反向修复 (Backward Fix):

    • 本次修复的核心在于解决前向Kernel(take_along_axis_kernel)在处理0-size输入时,因访问未初始化内存而导致的PreconditionNotMetError。
    • 在添加的OpTest单元测试中,我们通过self.check_grad()对反向梯度进行了检查。
    • 在修复了前向Kernel之后,check_grad能够顺利通过。这表明,对于我们测试的0-size场景,正确的前向输出(一个0-size Tensor)能够保证反向计算图的正确建立和执行,因此本次并未对take_along_axis_grad_kernel.cc文件进行修改。
  4. 添加单测 (Add Unit Test):

    • test/legacy_test/test_take_along_axis_op.py文件中,添加了新的TestTakeAlongAxis0Size测试类,它继承自op_test.OpTest
    • 该单测专门构造了输入arr为0-size(shape为[2, 0, 5]),而index不为0-size(shape为[2, 3, 5])的广播场景。
    • setUp方法如下:
      def setUp(self):
          self.python_api = paddle.take_along_axis
          self.op_type = "take_along_axis"
          self.dtype = "float64" # 使用fp64以满足梯度检查的精度要求
      
          x = np.zeros((2, 0, 5)).astype(self.dtype)
          indices = np.zeros((2, 3, 5)).astype("int64")
      
          self.inputs = {'Input': x, 'Index': indices}
          self.attrs = {'Axis': 1}
      
          output = np.zeros((2, 3, 5)).astype(self.dtype)
          self.outputs = {'Result': output}
    • 同时,该单测也覆盖了check_outputcheck_grad,确保了前后向的正确性。
  5. 测试结果

    • 本地单元测试: 在修复C++ Kernel并重新编译后,在feature/fix_take_along_axis_0size分支上运行添加的OpTest单元测试,结果为 OK,证明修复成功。
    • PaddleAPITest回测: 由于此API存在参数名不兼容问题,--accuracy模式无法使用。在--paddle_only模式下,修复后可顺利通过。

@paddle-bot
Copy link

paddle-bot bot commented Jun 20, 2025

你的PR提交成功,感谢你对开源项目的贡献!
请关注后续CI自动化测试结果,详情请参考Paddle-CI手册
Your PR has been submitted. Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

@paddle-bot paddle-bot bot added the contributor External developers label Jun 20, 2025
@CLAassistant
Copy link

CLAassistant commented Jun 20, 2025

CLA assistant check
All committers have signed the CLA.

@straigrand
Copy link
Contributor Author

我在本地补充了所有必要的验证测试。

  1. 单元测试: 我在test/legacy_test/test_take_along_axis_op.py中添加了符合OpTest规范的单元测试,在修复C++ Kernel并重新编译后,测试结果为 OK
  2. PaddleAPITest回归测试: 我创建了一个只包含take_along_axis 0-size case的独立测试文件,并使用--paddle_only模式进行了回归测试,测试命令顺利执行完毕,无任何报错,结果为空。

结论:
所有测试均已通过,证明本次修复是有效且健壮的。感谢您的审查!

@luotao1 luotao1 added the HappyOpenSource Pro 进阶版快乐开源活动,更具挑战性的任务 label Jun 20, 2025
@luotao1
Copy link
Contributor

luotao1 commented Jun 20, 2025

请签署CLA、修复CodeStyle流水线

README.md Outdated

PaddlePaddle is provided under the [Apache-2.0 license](LICENSE).


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个文件不需要修改,请恢复

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

收到!十分抱歉!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

已完成修改!

@straigrand straigrand force-pushed the feature/fix_take_along_axis_0size branch 3 times, most recently from 080875a to 8f0cf6a Compare June 22, 2025 00:15
@PaddlePaddle PaddlePaddle locked and limited conversation to collaborators Jun 23, 2025
@PaddlePaddle PaddlePaddle unlocked this conversation Jun 23, 2025
@straigrand straigrand force-pushed the feature/fix_take_along_axis_0size branch from 8f0cf6a to b128ba7 Compare June 23, 2025 03:31
@codecov-commenter
Copy link

Codecov Report

Attention: Patch coverage is 25.00000% with 6 lines in your changes missing coverage. Please review.

Please upload report for BASE (develop@2327fff). Learn more about missing BASE report.

Files with missing lines Patch % Lines
paddle/phi/kernels/cpu/take_along_axis_kernel.cc 25.00% 6 Missing ⚠️

❌ Your patch status has failed because the patch coverage (25.00%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             develop   #73499   +/-   ##
==========================================
  Coverage           ?   25.00%           
==========================================
  Files              ?        1           
  Lines              ?        8           
  Branches           ?        0           
==========================================
  Hits               ?        2           
  Misses             ?        6           
  Partials           ?        0           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@straigrand straigrand force-pushed the feature/fix_take_along_axis_0size branch 3 times, most recently from f6d123a to d67f794 Compare June 24, 2025 05:21
Comment on lines +32 to +39
if (index.numel() == 0) {
dev_ctx.template Alloc<T>(out);
return;
}
if (x.numel() == 0) {
phi::Full<T, Context>(
dev_ctx, common::vectorize(out->dims()), static_cast<T>(0), out);
return;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

同学好!现在一般 0size 的修复方法 be like:

if (out && out->numel() == 0) {
  dev_ctx.template Alloc<T>(out);
  return;
}

同学可以参考 PR 示例😊~:#72821

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

因为输出 out 形状已经被设置好了,如果 infermeta 无误,可以直接为其分配内存然后返回即可~

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

但是当输入x是0-size,但index不是0-size时,输出out的元素个数是不为0的,所以会跳出保护,则会导致测试失败,所以还是原方案更全面。

Copy link
Contributor

@cangtianhuang cangtianhuang Jun 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

但是当输入x是0-size,但index不是0-size时,输出out的元素个数是不为0的,所以会跳出保护,则会导致测试失败,所以还是原方案更全面。

可行👍那可以参考我之前的pr:#71961



# --- 场景一: 输入x为0-size, index不为0-size ---
class TestTakeAlongAxis0Size(OpTest):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

同学可以尝试继承 TestTakeAlongAxisOp 类,然后重写 init_data() 方法,不必再新建一个类😉

当然下面的写法也没有错~

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

通过继承TestTakeAlongAxisOp后,两个新写的测试类出现了报错,其中计算numpy结果和广播问题等无法兼容0-size的特殊情况,因此放弃继承TestTakeAlongAxisOp的方案。

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

通过继承TestTakeAlongAxisOp后,两个新写的测试类出现了报错,其中计算numpy结果和广播问题等无法兼容0-size的特殊情况,因此放弃继承TestTakeAlongAxisOp的方案。

是 numpy 对 0size 的支持有问题吗?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

通过继承TestTakeAlongAxisOp后,两个新写的测试类出现了报错,其中计算numpy结果和广播问题等无法兼容0-size的特殊情况,因此放弃继承TestTakeAlongAxisOp的方案。

是 numpy 对 0size 的支持有问题吗?

好像是的,在通过继承父类方法,其中通过numpy.take_along_axis来预先计算一个答案,但是输入数组为维度为0时,NumPy也无法处理这种情况,直接报了“索引越界”的错误(IndexError: index 0 is out of bounds for axis 1 with size 0)。

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

好滴~

@straigrand straigrand force-pushed the feature/fix_take_along_axis_0size branch 3 times, most recently from 344a067 to 5ec5f82 Compare June 25, 2025 15:11
@cangtianhuang
Copy link
Contributor

@luotao1 流水线尚未触发😶‍🌫️

@straigrand straigrand force-pushed the feature/fix_take_along_axis_0size branch 2 times, most recently from 4f0be74 to 7498d1b Compare June 26, 2025 14:40
@straigrand
Copy link
Contributor Author

@luotao1 麻烦您帮我触发流水线

@cangtianhuang
Copy link
Contributor

@luotao1 麻烦您帮我触发流水线

你尝试提交一个commit触发,比如merge一下最新的develop

self.check_output(check_pir=self.check_pir)

def test_check_grad(self):
self.grad = np.zeros_like(self.inputs['Input']).astype(self.dtype)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

此处的self.grad是不是没有被用到?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

确实是,将更新将其删除。

@straigrand straigrand force-pushed the feature/fix_take_along_axis_0size branch from 7498d1b to e4d74a7 Compare June 27, 2025 05:50
@straigrand
Copy link
Contributor Author

@luotao1 麻烦您帮我触发流水线

你尝试提交一个commit触发,比如merge一下最新的develop

可能是因为首次提交PR的缘故,需要维护者审核。

@luotao1
Copy link
Contributor

luotao1 commented Jun 27, 2025

我还是触发不起来。

  1. 不要覆盖原来的commit,试试merge develop吧。参考 https://www.paddlepaddle.org.cn/documentation/docs/zh/dev_guides/code_contributing_path_cn.html#span-id-caution1-pr-span 评审人 Code Review 过后,请不要使用 git push -f 强行提交代码,这样评审人无法看到修改前后的变化(diff),使评审变得困难。
  2. 确实要merge develop了,现在是落后107个commit。
image

@straigrand
Copy link
Contributor Author

我还是触发不起来。

  1. 不要覆盖原来的commit,试试merge develop吧。参考 https://www.paddlepaddle.org.cn/documentation/docs/zh/dev_guides/code_contributing_path_cn.html#span-id-caution1-pr-span 评审人 Code Review 过后,请不要使用 git push -f 强行提交代码,这样评审人无法看到修改前后的变化(diff),使评审变得困难。
  2. 确实要merge develop了,现在是落后107个commit。
image

收到,我将尽快完成修改。

@luotao1
Copy link
Contributor

luotao1 commented Jun 27, 2025

触发起来了😓,你先等等结果

@PaddlePaddle PaddlePaddle locked and limited conversation to collaborators Jun 27, 2025
@PaddlePaddle PaddlePaddle unlocked this conversation Jun 27, 2025
@straigrand straigrand force-pushed the feature/fix_take_along_axis_0size branch from e4d74a7 to 9e28676 Compare June 29, 2025 09:38
@PaddlePaddle PaddlePaddle locked and limited conversation to collaborators Jun 30, 2025
@PaddlePaddle PaddlePaddle unlocked this conversation Jun 30, 2025
@luotao1
Copy link
Contributor

luotao1 commented Jun 30, 2025

部分CI还是没有触发起来,比如 PR-CI-Windows-OPENBLAS (这个不需要approval)。你可以试试再提交一个commit,如果不行的话,重新提交一个PR

@straigrand
Copy link
Contributor Author

部分CI还是没有触发起来,比如 PR-CI-Windows-OPENBLAS (这个不需要approval)。你可以试试再提交一个commit,如果不行的话,重新提交一个PR

好滴,收到

@straigrand
Copy link
Contributor Author

已被新的PR #73736 代替。此PR的提交历史较为混乱且遇到了CI问题,因此关闭。所有修复已转移到新的PR中。

@straigrand straigrand closed this Jul 1, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

contributor External developers HappyOpenSource Pro 进阶版快乐开源活动,更具挑战性的任务

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants