Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

modify emit() of TopK to emit on batch_size rather than batch_size-1 #10030

Merged
merged 1 commit into from
Apr 10, 2024

Conversation

JasonLi-cn
Copy link
Contributor

Which issue does this PR close?

Closes #.

Rationale for this change

Simply modify the logic of "break into record batches as needed". This modification avoid generating an empty RecordBatch where the batch.num_rows() eq batch_size.

https://github.com/apache/arrow-datafusion/blob/03d8ba1f0d94bac6bb8bb33e95f00f9f6fb5275a/datafusion/physical-plan/src/topk/mod.rs#L210-L219

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense to me -- thank you @JasonLi-cn

@alamb alamb changed the title modify emit() of TopK modify emit() of TopK to emit on batch_size rather than batch_size-1 Apr 10, 2024
@alamb
Copy link
Contributor

alamb commented Apr 10, 2024

(BTW @JasonLi-cn I am not sure what is inspiring all your work recently to optimize DataFusion but it is much appreciated 🙏 )

@Dandandan Dandandan merged commit 69595a4 into apache:main Apr 10, 2024
25 checks passed
@Dandandan
Copy link
Contributor

Thanks @JasonLi-cn

@JasonLi-cn
Copy link
Contributor Author

JasonLi-cn commented Apr 11, 2024

(BTW @JasonLi-cn I am not sure what is inspiring all your work recently to optimize DataFusion but it is much appreciated 🙏 )

I have been using DataFusion and arrow-rs since 2021, but most of my work has been on developing TableScan or upper-level applications, as well as some optimizations for specific scenarios. My current work is mainly focused on performance optimization (which is also my interest) and feature enhancement, and I think it is necessary to contribute the improvements directly to the community, which is beneficial for the healthy development of both DataFusion and our own projects.
My PR may not be perfect, and I also thank all the community friends for their suggestions, from which I have learned a lot 🙏.

@alamb
Copy link
Contributor

alamb commented Apr 11, 2024

My current work is mainly focused on performance optimization (which is also my interest) and feature enhancement, and I think it is necessary to contribute the improvements directly to the community

100%

My PR may not be perfect, and I also thank all the community friends for their suggestions, from which I have learned a lot 🙏.

I think this describes everyone here (including myself!) I also learn a lot from interacting with other developers and working with the communnity

This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants