@zhengruifeng
Contributor

@zhengruifeng zhengruifeng commented Jan 19, 2022

What changes were proposed in this pull request?

1. Override maxRowsPerPartition in Sort, Expand, Sample, and CollectMetrics.
2. Override maxRows in Except, Expand, and CollectMetrics.
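
To make the idea of per-operator row bounds concrete, here is a minimal sketch using a toy plan model. It is not Spark's code: the `Plan`, `Leaf`, `Sort`, `Expand`, and `Except` types below are hypothetical stand-ins, and the rules they encode are assumptions about what a sound upper bound looks like for each operator, not a copy of what this PR implements.

```scala
// Toy model of logical plan nodes. These are NOT Spark's classes.
sealed trait Plan {
  // Upper bound on the total number of output rows; None means "unknown".
  def maxRows: Option[Long]
  // Upper bound per partition; a global bound also bounds every partition.
  def maxRowsPerPartition: Option[Long] = maxRows
}

// A leaf with a known row count (hypothetical helper for the example).
final case class Leaf(rows: Long) extends Plan {
  val maxRows: Option[Long] = Some(rows)
}

// Sorting never adds rows. A non-global sort keeps the child's partitioning,
// so the child's per-partition bound carries over unchanged.
final case class Sort(global: Boolean, child: Plan) extends Plan {
  def maxRows: Option[Long] = child.maxRows
  override def maxRowsPerPartition: Option[Long] =
    if (global) maxRows else child.maxRowsPerPartition
}

// Each input row is emitted once per projection, so the bound scales linearly.
final case class Expand(numProjections: Int, child: Plan) extends Plan {
  def maxRows: Option[Long] = scale(child.maxRows)
  override def maxRowsPerPartition: Option[Long] = scale(child.maxRowsPerPartition)

  // Multiply in BigInt and give up (None) if the result overflows Long.
  private def scale(bound: Option[Long]): Option[Long] =
    bound.flatMap { m =>
      val n = BigInt(m) * numProjections
      if (n.isValidLong) Some(n.toLong) else None
    }
}

// Set difference can only drop rows from the left side.
final case class Except(left: Plan, right: Plan) extends Plan {
  def maxRows: Option[Long] = left.maxRows
}
```

In this model, `Expand(3, Leaf(10)).maxRows` evaluates to `Some(30)`, while `Expand(2, Leaf(Long.MaxValue)).maxRows` falls back to `None` rather than reporting an overflowed value.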

Why are the changes needed?

To provide accurate maxRows and maxRowsPerPartition values where possible.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added test suites.

@github-actions github-actions bot added the SQL label Jan 19, 2022
Contributor Author

I am not sure whether this is wrong: sampling with replacement should not generate more rows than the input dataset.

But we cannot implement strict sampling with replacement, so PoissonSampler is used instead, which cannot guarantee this property.

scala> val df = spark.range(0, 1000)
df: org.apache.spark.sql.Dataset[Long] = [id: bigint]

scala> df.count
res0: Long = 1000

scala> df.sample(true, 0.999999, 10).count
res1: Long = 1004
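
The overshoot above is what one would expect from Poisson-based sampling in general. The following is a minimal, Spark-free sketch of that technique, not Spark's PoissonSampler itself: `PoissonSketch`, `poisson`, and `sample` are hypothetical names, and the variate generator is Knuth's multiplication method, chosen here only for brevity. Each input element is emitted a Poisson(fraction) number of times, independently of the others, so the output size is itself Poisson-distributed with mean n * fraction and can exceed n.

```scala
import scala.util.Random

object PoissonSketch {
  // Draw one Poisson(lambda) variate using Knuth's multiplication method.
  // Adequate for small lambda; not the generator Spark uses.
  def poisson(lambda: Double, rng: Random): Int = {
    val limit = math.exp(-lambda)
    var k = 0
    var p = rng.nextDouble()
    while (p > limit) {
      k += 1
      p *= rng.nextDouble()
    }
    k
  }

  // Emit each element poisson(fraction) times. Nothing caps the total,
  // so the result may be longer than the input.
  def sample[A](input: Seq[A], fraction: Double, seed: Long): Seq[A] = {
    val rng = new Random(seed)
    input.flatMap(x => Seq.fill(poisson(fraction, rng))(x))
  }
}
```

For 1000 input rows and a fraction close to 1, the output size has mean about 1000 and standard deviation about 32 (the square root of the mean), so a count such as 1004 is unremarkable. Under this model, the input row count is therefore not a valid upper bound for a with-replacement sample.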

Contributor

Interesting. cc @sigmod @maryannxue @srielau

This seems like a correctness issue, and we should fix it ASAP. @zhengruifeng, can you open a new PR with only the Sample fix so that we can backport it?

Contributor Author

@cloud-fan Sure

@HyukjinKwon HyukjinKwon changed the title from "[SPARK-37961][SQL] override maxRows/maxRowsPerPartition for some logical operators" to "[SPARK-37961][SQL] Override maxRows/maxRowsPerPartition for some logical operators" Jan 20, 2022
@HyukjinKwon
Member

cc @wangyum FYI

@github-actions

github-actions bot commented Jun 5, 2022

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Jun 5, 2022
@github-actions github-actions bot closed this Jun 6, 2022
@cloud-fan
Contributor

Somehow we lost track of this PR. @zhengruifeng, do you want to reopen it and get it in?

@zhengruifeng
Contributor Author

@cloud-fan Sure, let me update this PR.

@zhengruifeng zhengruifeng reopened this Jun 9, 2022
@zhengruifeng zhengruifeng force-pushed the add_some_maxRows_maxRowsPerPartition branch from 5d97ec8 to a57729a Compare June 9, 2022 03:12
@github-actions github-actions bot closed this Jun 10, 2022
@wangyum wangyum removed the Stale label Jun 10, 2022
@wangyum wangyum reopened this Jun 10, 2022
@zhengruifeng zhengruifeng force-pushed the add_some_maxRows_maxRowsPerPartition branch 2 times, most recently from da4efb0 to 5ef3cea Compare June 11, 2022 07:23
Contributor

@cloud-fan cloud-fan left a comment

LGTM if tests pass

@zhengruifeng zhengruifeng force-pushed the add_some_maxRows_maxRowsPerPartition branch from 5ef3cea to 0d4693c Compare June 14, 2022 04:10
@zhengruifeng
Contributor Author

@cloud-fan The tests passed after I rebased the PR; it should be ready now.

@cloud-fan
Contributor

Thanks, merging to master!

@cloud-fan cloud-fan closed this in e841fa3 Jun 16, 2022
@zhengruifeng zhengruifeng deleted the add_some_maxRows_maxRowsPerPartition branch June 18, 2022 00:50
4 participants