-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-37038][SQL][WIP] DSV2 Sample Push Down #34311
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #144354 has finished for PR 34311 at commit
|
|
Kubernetes integration test starting |
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
|
Kubernetes integration test status failure |
|
Test build #144356 has finished for PR 34311 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this is not for ScanBuilder like SupportsPushDownFilters and others? Seems an inconsistent API design.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made both SupportsPushDownLimit and SupportsPushDownTableSample extend scan. Because at the time of pushing down Limit or Sample, we have already created Scan. The child inside Limit or Sample is a DataSourceV2ScanRelation.
|
cc @cloud-fan |
|
Kubernetes integration test starting |
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
|
Kubernetes integration test status failure |
|
Test build #144433 has finished for PR 34311 at commit
|
|
Test build #144434 has finished for PR 34311 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this groupBy thing mixed in but not related to sample push down ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. It's not related to sample push down.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not grasping all context, so this work requires the underlying DS support sample expression?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. It requires the underlying support. Sample is not part of the ANSI SQL standard so not all data sources support it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about the treatment for csv parquet format, will it make difference when sampling is pushed down to scan ? would that be supported ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should work if you make parquet scan or csv scan implement the interface SupportsPushDownTableSample, but I am not sure how parquet or csv handles sample.
|
Kubernetes integration test starting |
bd947bb to
1ee1105
Compare
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
|
Test build #144679 has finished for PR 34311 at commit
|
|
Test build #144677 has finished for PR 34311 at commit
|
|
Kubernetes integration test status failure |
dohongdayi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Finishing my review
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would Option[TableSample] be better type for sample might not be provided ?
| sql(s"INSERT INTO TABLE $catalogName.new_table values (15, 16)") | ||
| sql(s"INSERT INTO TABLE $catalogName.new_table values (17, 18)") | ||
| sql(s"INSERT INTO TABLE $catalogName.new_table values (19, 20)") | ||
| if (supportsTableSample) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If supportsTableSample was false, it would be no need to create testing table or insert testing data at all
| Set.empty, | ||
| Set.empty, | ||
| None, | ||
| null, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think here should be None
| filters: Set[Filter], | ||
| handledFilters: Set[Filter], | ||
| aggregation: Option[Aggregation], | ||
| sample: Option[TableSample], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So many pushdown related parameters, would it be better if they were wrapped by some parent case class?
|
|
||
| override def getTableSample(sample: Option[TableSample]): String = { | ||
| if (sample.nonEmpty) { | ||
| val method = if (sample.get.methodName.isEmpty) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If many of the dialects have default sample methods, would Option[String] be better type for TableSample.methodName?
| withReplacement: Boolean, | ||
| seed: Long) extends TableSample { | ||
|
|
||
| override def describe(): String = s"$methodName $lowerBound $lowerBound $upperBound" + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
two lowerBounds ?
| new AnalysisException(message, cause = Some(e)) | ||
| } | ||
|
|
||
| def supportsTableSample: Boolean = false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would supportsTableSample() need parameter methodName: Option[String], for the dialect might not support the specified sample method or not support any sample method at all ?
| case sample @ Sample(_, _, _, _, child) => child match { | ||
| case ScanOperation(_, _, sHolder: ScanBuilderHolder) => | ||
| val tableSample = LogicalExpressions.tableSample( | ||
| "", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't see any other possible value of TableSample.methodName beside "" here, so I'm not sure thatTableSample.methodName was important ?
|
|
||
| sample | ||
| : TABLESAMPLE '(' sampleMethod? ')' | ||
| : TABLESAMPLE '(' sampleMethod? ')' (REPEATABLE '('seed=INTEGER_VALUE')')? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make a separate PR for this SQL syntax change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
submitted #34442 for syntax change
|
It's too much work to rebase. I have submitted a new PR #34451 and will close this one. |
NP |
What changes were proposed in this pull request?
Push down Sample to data source for better performance. If Sample is pushed down, it will be removed from logical plan so it will not be applied at Spark any more.
Current Plan without Sample push down:
after Sample push down:
The new interface is implemented using JDBC for POC and end to end test.
TABLESAMPLEis not supported by all the databases.TABLESAMPLEhas been implemented using postgresql in this PR.Why are the changes needed?
Reduce IO and improve performance.
For SAMPLE, e.g.
SELECT * FROM t TABLESAMPLE (1 PERCENT), Spark retrieves all the data from table and then return 1% rows. It will dramatically reduce the transferred data size and improve performance if we can push Sample to data source side.Does this PR introduce any user-facing change?
Yes. new interface
SupportsPushDownTableSampleHow was this patch tested?
New test