-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-37961][SQL] Override maxRows/maxRowsPerPartition for some logical operators #35250
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-37961][SQL] Override maxRows/maxRowsPerPartition for some logical operators #35250
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure whether it is wrong, sampling with replacement should not generate more rows than the input dataset.
But we can not impl a strict sampling with replacement, so PoissonSampler is used instead, which can not guarantee this attribute.
scala> val df = spark.range(0, 1000)
df: org.apache.spark.sql.Dataset[Long] = [id: bigint]
scala> df.count
res0: Long = 1000
scala> df.sample(true, 0.999999, 10).count
res1: Long = 1004
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
interesting, cc @sigmod @maryannxue @srielau
this seems like a correctness issue and we should fix it ASAP. @zhengruifeng can you open a new PR for the sample fix only so that we can backport?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan Sure
|
cc @wangyum FYI |
019c1e5 to
5d97ec8
Compare
|
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
|
Somehow this PR lost track. @zhengruifeng do you want to reopen it and get it in? |
|
@cloud-fan Sure, Let me update this PR |
5d97ec8 to
a57729a
Compare
da4efb0 to
5ef3cea
Compare
cloud-fan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM if tests pass
5ef3cea to
0d4693c
Compare
|
@cloud-fan the tests passed after I rebased the PR, it should be ready now. |
|
thanks, merging to master! |
What changes were proposed in this pull request?
1, override
maxRowsPerPartitioninSort,Expand,Sample,CollectMetrics;2, override
maxRowsinExcept,Expand,CollectMetrics;Why are the changes needed?
to provide an accurate value if possible
Does this PR introduce any user-facing change?
No
How was this patch tested?
added testsuites