
Conversation

@huaxingao
Contributor

@huaxingao huaxingao commented Oct 18, 2021

What changes were proposed in this pull request?

Push down Sample to the data source for better performance. If Sample is pushed down, it is removed from the logical plan, so it is no longer applied on the Spark side.

Current plan without Sample push down:

== Parsed Logical Plan ==
'Project [*]
+- 'Sample 0.0, 0.8, false, 157
   +- 'UnresolvedRelation [postgresql, new_table], [], false

== Analyzed Logical Plan ==
col1: int, col2: int
Project [col1#163, col2#164]
+- Sample 0.0, 0.8, false, 157
   +- SubqueryAlias postgresql.new_table
      +- RelationV2[col1#163, col2#164] new_table

== Optimized Logical Plan ==
Sample 0.0, 0.8, false, 157
+- RelationV2[col1#163, col2#164] new_table

== Physical Plan ==
*(1) Sample 0.0, 0.8, false, 157
+- *(1) Scan org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCScan$$anon$1@6dde4769 [col1#163,col2#164] PushedAggregates: [], PushedFilters: [], PushedGroupby: [], PushedLimit: [], PushedSample: TABLESAMPLE  0.0 0.8 false 157, ReadSchema: struct<col1:int,col2:int>

Plan after Sample push down:

== Parsed Logical Plan ==
'Project [*]
+- 'Sample 0.0, 0.8, false, 187
   +- 'UnresolvedRelation [postgresql, new_table], [], false

== Analyzed Logical Plan ==
col1: int, col2: int
Project [col1#163, col2#164]
+- Sample 0.0, 0.8, false, 187
   +- SubqueryAlias postgresql.new_table
      +- RelationV2[col1#163, col2#164] new_table

== Optimized Logical Plan ==
RelationV2[col1#163, col2#164] new_table

== Physical Plan ==
*(1) Scan org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCScan$$anon$1@65b57543 [col1#163,col2#164] PushedAggregates: [], PushedFilters: [], PushedGroupby: [], PushedLimit: [], PushedSample: TABLESAMPLE  0.0 0.8 false 187, ReadSchema: struct<col1:int,col2:int>

The new interface is implemented for JDBC as a proof of concept and for end-to-end testing. TABLESAMPLE is not supported by all databases; in this PR it is implemented for the PostgreSQL dialect.
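For PostgreSQL, the dialect translates the pushed-down sample into a TABLESAMPLE clause. Below is a minimal, self-contained sketch of that translation (illustrative only, not the exact code in this PR; SampleInfo stands in for the TableSample expression, and BERNOULLI is just one sampling method PostgreSQL accepts):

object PostgresTableSampleSketch {
  // Stand-in for the TableSample expression this PR introduces
  // (methodName, lowerBound, upperBound, withReplacement, seed).
  final case class SampleInfo(lowerBound: Double, upperBound: Double, seed: Long)

  // Render a pushed sample as PostgreSQL syntax, or an empty string if none was pushed.
  def getTableSample(sample: Option[SampleInfo]): String =
    sample
      .map(s => s"TABLESAMPLE BERNOULLI (${(s.upperBound - s.lowerBound) * 100}) REPEATABLE (${s.seed})")
      .getOrElse("")

  def main(args: Array[String]): Unit = {
    // Sample 0.0, 0.8, false, 157 from the plan above renders roughly as:
    // TABLESAMPLE BERNOULLI (80.0) REPEATABLE (157)
    println(getTableSample(Some(SampleInfo(0.0, 0.8, 157L))))
  }
}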

Why are the changes needed?

Reduce IO and improve performance.
For a sample query such as SELECT * FROM t TABLESAMPLE (1 PERCENT), Spark currently retrieves all the data from the table and then returns 1% of the rows. Pushing Sample down to the data source dramatically reduces the amount of data transferred and improves performance.
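As a concrete illustration (catalog, table, and seed are taken from the example plans above; the SQL and API calls below are a hedged sketch, not test code from this PR):

// Run in spark-shell (an existing SparkSession `spark` is assumed).
// REPEATABLE(seed) is the syntax addition split out into the follow-up PR mentioned below.
val df = spark.sql(
  "SELECT * FROM postgresql.new_table TABLESAMPLE (80 PERCENT) REPEATABLE (157)")
// Roughly equivalent DataFrame form: spark.table("postgresql.new_table").sample(0.8, 157)
df.explain(true) // with push down, no Sample node remains in the optimized plan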

Does this PR introduce any user-facing change?

Yes. A new interface, SupportsPushDownTableSample, is added.

How was this patch tested?

New test

@SparkQA

SparkQA commented Oct 18, 2021

Test build #144354 has finished for PR 34311 at commit aaca7fb.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 18, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48832/

@SparkQA

SparkQA commented Oct 18, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48833/

@SparkQA

SparkQA commented Oct 18, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48832/

@SparkQA

SparkQA commented Oct 18, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48833/

@SparkQA

SparkQA commented Oct 18, 2021

Test build #144356 has finished for PR 34311 at commit eb176f9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@viirya viirya Oct 18, 2021

Why is this not on ScanBuilder like SupportsPushDownFilters and the others? This seems like an inconsistent API design.

Contributor Author

I made both SupportsPushDownLimit and SupportsPushDownTableSample extend Scan because, by the time we push down Limit or Sample, the Scan has already been created: the child inside Limit or Sample is a DataSourceV2ScanRelation.
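A rough sketch of the shape being described, purely for illustration; Scan and TableSample below are local stand-ins for the real connector types, and the method name pushTableSample is an assumption rather than the exact API in this PR:

// Hedged sketch: a Scan-level mix-in for accepting a pushed-down table sample.
trait Scan
trait TableSample // methodName, lowerBound, upperBound, withReplacement, seed

trait SupportsPushDownTableSample extends Scan {
  // Returns true if the data source accepts the sample and will apply it itself.
  def pushTableSample(sample: TableSample): Boolean
}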

@huaxingao
Contributor Author

cc @cloud-fan

@SparkQA

SparkQA commented Oct 20, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48906/

@SparkQA

SparkQA commented Oct 20, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48907/

@SparkQA

SparkQA commented Oct 20, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48907/

@SparkQA

SparkQA commented Oct 20, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48906/

@SparkQA

SparkQA commented Oct 20, 2021

Test build #144433 has finished for PR 34311 at commit 0d9158a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 20, 2021

Test build #144434 has finished for PR 34311 at commit 09e4e9a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


Is this groupBy change mixed in here, even though it is not related to Sample push down?

Contributor Author

Right, it's not related to Sample push down.


I'm not grasping all the context; does this work require the underlying data source to support a sample expression?

Contributor Author

Yes, it requires underlying support. Sample is not part of the ANSI SQL standard, so not all data sources support it.


What about the treatment for the CSV and Parquet formats? Will it make a difference when sampling is pushed down to the scan? Would that be supported?

Contributor Author

It should work if you make the Parquet or CSV scan implement the SupportsPushDownTableSample interface, but I am not sure how Parquet or CSV would handle sampling.

@SparkQA

SparkQA commented Oct 28, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49146/

@SparkQA

SparkQA commented Oct 28, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49148/

@SparkQA

SparkQA commented Oct 28, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49146/

@SparkQA

SparkQA commented Oct 28, 2021

Test build #144679 has finished for PR 34311 at commit 1ee1105.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 28, 2021

Test build #144677 has finished for PR 34311 at commit bd947bb.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 28, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49148/

Contributor

@dohongdayi dohongdayi left a comment

Finishing my review

Contributor

Would Option[TableSample] be a better type, since a sample might not be provided?

sql(s"INSERT INTO TABLE $catalogName.new_table values (15, 16)")
sql(s"INSERT INTO TABLE $catalogName.new_table values (17, 18)")
sql(s"INSERT INTO TABLE $catalogName.new_table values (19, 20)")
if (supportsTableSample) {
Contributor

If supportsTableSample is false, there would be no need to create the test table or insert test data at all.
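One way to apply this suggestion, reusing the helpers already shown in the snippet above (a hedged sketch only; the CREATE TABLE statement and the elided assertions are illustrative):

// Guard the whole setup behind the capability flag, not just the assertions.
if (supportsTableSample) {
  sql(s"CREATE TABLE $catalogName.new_table (col1 INT, col2 INT)")
  sql(s"INSERT INTO TABLE $catalogName.new_table VALUES (15, 16), (17, 18), (19, 20)")
  // ... TABLESAMPLE assertions ...
}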

Set.empty,
Set.empty,
None,
null,
Contributor

I think this should be None here.

filters: Set[Filter],
handledFilters: Set[Filter],
aggregation: Option[Aggregation],
sample: Option[TableSample],
Contributor

There are so many pushdown-related parameters; would it be better to wrap them in a parent case class?
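One possible shape for such a wrapper, using the types from the parameter list shown above (a hedged sketch; the name PushedDownOperators is only a suggestion, not code from this PR):

// Bundle the pushdown results so the Scan-building parameter list stops growing.
case class PushedDownOperators(
    filters: Set[Filter],
    handledFilters: Set[Filter],
    aggregation: Option[Aggregation],
    sample: Option[TableSample])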


override def getTableSample(sample: Option[TableSample]): String = {
if (sample.nonEmpty) {
val method = if (sample.get.methodName.isEmpty) {
Contributor

If many of the dialects have default sample methods, would Option[String] be a better type for TableSample.methodName?

withReplacement: Boolean,
seed: Long) extends TableSample {

override def describe(): String = s"$methodName $lowerBound $lowerBound $upperBound" +
Contributor

Two lowerBounds?

new AnalysisException(message, cause = Some(e))
}

def supportsTableSample: Boolean = false
Contributor

Would supportsTableSample() need a methodName: Option[String] parameter, since a dialect might not support the specified sample method, or might not support any sample method at all?

case sample @ Sample(_, _, _, _, child) => child match {
case ScanOperation(_, _, sHolder: ScanBuilderHolder) =>
val tableSample = LogicalExpressions.tableSample(
"",
Contributor

I didn't see any possible value of TableSample.methodName other than "" here, so I'm not sure TableSample.methodName is important?


sample
: TABLESAMPLE '(' sampleMethod? ')'
: TABLESAMPLE '(' sampleMethod? ')' (REPEATABLE '('seed=INTEGER_VALUE')')?
Contributor

Can we make a separate PR for this SQL syntax change?

Contributor Author

Submitted #34442 for the syntax change.

@huaxingao
Contributor Author

It's too much work to rebase. I have submitted a new PR #34451 and will close this one.
@dohongdayi I have addressed your comments in the new PR. Thanks for reviewing!

@huaxingao huaxingao closed this Oct 31, 2021
@huaxingao huaxingao deleted the pushdownSample branch October 31, 2021 16:04
@dohongdayi
Contributor

It's too much work to rebase. I have submitted a new PR #34451 and will close this one. @dohongdayi I have addressed your comments in the new PR. Thanks for reviewing!

NP
