[SPARK-37627][SQL][FOLLOWUP] Separate SortedBucketTransform from BucketTransform #34914

huaxingao · 2021-12-15T23:52:52Z

What changes were proposed in this pull request?

1. Currently only a single bucket column is supported in BucketTransform; fix the code so that multiple bucket columns work.
2. Separate SortedBucketTransform from BucketTransform, and lay out the arguments of SortedBucketTransform as columns, numBuckets, sortedColumns, so that there is a way to tell columns and sortedColumns apart.
3. Add more test coverage.

Why are the changes needed?

Fix bugs in BucketTransform and SortedBucketTransform.

Does this PR introduce any user-facing change?

No

How was this patch tested?

New tests

SparkQA · 2021-12-16T01:06:12Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50721/

SparkQA · 2021-12-16T02:06:30Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50721/

SparkQA · 2021-12-16T02:31:44Z

Test build #146247 has finished for PR 34914 at commit 7188482.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2021-12-16T05:34:39Z

sql/catalyst/src/main/scala/org/apache/spark/sql/connector/expressions/expressions.scala

does this really work? I don't see a way for people to extract bucket and sort columns from arguments with the Transform API.

viirya · 2022-01-03T23:07:45Z

sql/catalyst/src/main/scala/org/apache/spark/sql/connector/expressions/expressions.scala

Shall we keep consistent order of columns and numBuckets for two cases in arguments?

If there are sortedColumns, we need numBuckets between columns and sortedColumns, because we need a way to figure out which elements in the array belong to columns and which belong to sortedColumns.
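The layout described in this reply can be sketched in isolation. This is a minimal illustration, not Spark's code: `Arg`, `Col`, `NumBuckets`, `bucketArgs`, and `sortedBucketArgs` are hypothetical stand-ins for the connector expression types, and the literal-first order for the plain bucket case is an assumption taken from the `BucketTransform(literal(numBuckets, IntegerType), references)` call quoted later in this thread.

```scala
// Hypothetical stand-ins for the connector expression types; not the real Spark API.
object ArgumentLayoutSketch {
  sealed trait Arg
  final case class Col(name: String) extends Arg    // plays the role of a NamedReference
  final case class NumBuckets(n: Int) extends Arg   // plays the role of the integer literal

  // Plain bucket (assumed layout): the literal first, then the bucket columns.
  def bucketArgs(n: Int, columns: Seq[String]): Seq[Arg] = {
    val cols: Seq[Arg] = columns.map(Col(_))
    NumBuckets(n) +: cols
  }

  // Sorted bucket: columns, then the literal as a separator, then sortedColumns.
  def sortedBucketArgs(
      n: Int,
      columns: Seq[String],
      sortedColumns: Seq[String]): Seq[Arg] = {
    val cols: Seq[Arg] = columns.map(Col(_))
    val sorted: Seq[Arg] = sortedColumns.map(Col(_))
    (cols :+ NumBuckets(n)) ++ sorted
  }
}
```

Because both column groups are flattened into one sequence, the literal in the middle is the only marker a reader of the arguments has for where columns end and sortedColumns begin.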

viirya · 2022-01-03T23:10:35Z

sql/catalyst/src/main/scala/org/apache/spark/sql/connector/expressions/expressions.scala

There is an extra space between case and Lit.

viirya · 2022-01-03T23:12:06Z

...lyst/src/test/scala/org/apache/spark/sql/connector/expressions/TransformExtractorSuite.scala

unnecessary change.

cloud-fan · 2022-01-05T16:32:42Z

sql/catalyst/src/main/scala/org/apache/spark/sql/connector/expressions/expressions.scala

+      this.copy(columns = newReferences)
+    } else {
+      val splits = newReferences.grouped(columns.length).toList
+      this.copy(columns = splits(0), sortedColumns = splits(1))


Should it be: columns = newReferences.take(columns.length), sortedColumns = newReferences.drop(columns.length)?

Changed. Thanks!
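The difference between the two splitting approaches shows up when sortedColumns is longer than columns. The following sketch uses plain strings in place of NamedReference; the helper names `splitWithGrouped` and `splitWithTakeDrop` are made up for illustration.

```scala
object SplitReferencesSketch {
  // Splits into fixed-size chunks and keeps only the first two chunks,
  // mirroring the grouped(columns.length) version in the diff above.
  def splitWithGrouped(refs: Seq[String], numColumns: Int): (Seq[String], Seq[String]) = {
    val splits = refs.grouped(numColumns).toList
    (splits(0), splits(1))
  }

  // Splits at a single position, as suggested in the review comment.
  def splitWithTakeDrop(refs: Seq[String], numColumns: Int): (Seq[String], Seq[String]) =
    (refs.take(numColumns), refs.drop(numColumns))
}
```

With `refs = Seq("a", "b", "c")` and one bucket column, `splitWithGrouped` returns `(Seq("a"), Seq("b"))` and silently drops `"c"`, while `splitWithTakeDrop` returns `(Seq("a"), Seq("b", "c"))`.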

cloud-fan · 2022-01-05T16:33:49Z

sql/catalyst/src/main/scala/org/apache/spark/sql/connector/expressions/expressions.scala

    sortedColumns: Seq[NamedReference] = Seq.empty[NamedReference]) extends RewritableTransform {

-  override val name: String = "bucket"
+  override val name: String = if (sortedColumns.nonEmpty) "sortedBucket" else "bucket"


Can we create a new class SortedBucketTransform to be clearer?

Added a new class SortedBucketTransform. Thanks!

cloud-fan · 2022-01-05T16:36:44Z

sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala

    val identifier = "testcat.table_name"
    withTable(identifier) {
-      sql(s"CREATE TABLE $identifier (a int, b string, c int) USING $v2Source PARTITIONED BY (c)" +
-        s" CLUSTERED BY (b) SORTED by (a) INTO 4 BUCKETS")


Why change this test?

Just want to make sure multiple columns/sortedColumns work ok.

cloud-fan · 2022-01-06T14:27:41Z

sql/catalyst/src/main/scala/org/apache/spark/sql/connector/expressions/expressions.scala

    BucketTransform(literal(numBuckets, IntegerType), references)

-  def bucket(
+  def sortedBucket(


It's OK to keep the name bucket, to match the name of this SQL feature

cloud-fan · 2022-01-06T14:32:23Z

sql/catalyst/src/main/scala/org/apache/spark/sql/connector/expressions/expressions.scala

+    columns: Seq[NamedReference],
+    sortedColumns: Seq[NamedReference] = Seq.empty[NamedReference]) extends RewritableTransform {
+
+  override val name: String = "sortedBucket"


sorted_bucket is more SQL-ish.

cloud-fan · 2022-01-06T14:33:29Z

sql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/InMemoryTable.scala

            throw new IllegalArgumentException(s"Match: unsupported argument(s) type - ($v, $t)")
        }
-      case BucketTransform(numBuckets, ref, _) =>
+      case BucketTransform(numBuckets, ref) =>


I think we can have a single BucketTransform.unapply, to match both BucketTransform and SortedBucketTransform, so that we can have a single case here and avoid duplicated code.
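One way to picture this suggestion is a single extractor object whose unapply accepts both shapes and reports empty sortedColumns for the plain case. The sketch below is an assumption-laden illustration: `Transform`, `Bucket`, `SortedBucket`, `Other`, `AnyBucket`, and `describe` are hypothetical names, not the classes or signatures that were merged.

```scala
// Hypothetical stand-ins; the real transforms carry NamedReference values.
object SharedExtractorSketch {
  sealed trait Transform
  final case class Bucket(numBuckets: Int, columns: Seq[String]) extends Transform
  final case class SortedBucket(
      numBuckets: Int,
      columns: Seq[String],
      sortedColumns: Seq[String]) extends Transform
  final case class Other(name: String) extends Transform

  // A single extractor for both shapes; a plain bucket yields empty sortedColumns.
  object AnyBucket {
    def unapply(t: Transform): Option[(Int, Seq[String], Seq[String])] = t match {
      case Bucket(n, cols) => Some((n, cols, Seq.empty))
      case SortedBucket(n, cols, sorted) => Some((n, cols, sorted))
      case _ => None
    }
  }

  // Callers need only one case to handle both kinds of bucketing.
  def describe(t: Transform): String = t match {
    case AnyBucket(n, cols, sorted) =>
      s"buckets=$n columns=${cols.mkString(",")} sorted=${sorted.mkString(",")}"
    case _ => "not a bucket transform"
  }
}
```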

cloud-fan · 2022-01-07T03:42:27Z

sql/catalyst/src/main/scala/org/apache/spark/sql/connector/expressions/expressions.scala

 private[sql] object BucketTransform {
-  def unapply(expr: Expression): Option[(Int, FieldReference, FieldReference)] =
-      expr match {
+  def unapply(expr: Expression): Option[(Int, FieldReference, FieldReference)] = expr match {


where do we use this unapply?

This was introduced in #30706 but doesn't seem to be used. I will remove it for now.

cloud-fan · 2022-01-07T03:43:58Z

sql/catalyst/src/main/scala/org/apache/spark/sql/connector/expressions/expressions.scala

+      var index: Int = -1
+      var posOfLit: Int = -1
+      var numOfBucket: Int = -1
+      arguments.foreach {


nit: we can do arguments.zipWithIndex.foreach, so that it's much easier to get posOfLit.
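As a rough sketch of the idea, pairing each argument with its index removes the need for the mutable index counter. This version goes one step further than the comment and uses `collectFirst` instead of `foreach`, which also removes the other two vars; that substitution is a choice made for this illustration. `Arg`, `Col`, `IntLit`, `findLiteral`, and `split` are hypothetical names.

```scala
// Hypothetical stand-ins for the argument expressions; not the real Spark API.
object LiteralPositionSketch {
  sealed trait Arg
  final case class Col(name: String) extends Arg
  final case class IntLit(value: Int) extends Arg

  // Returns (numBuckets, position of the literal), or None if there is no literal.
  def findLiteral(arguments: Seq[Arg]): Option[(Int, Int)] =
    arguments.zipWithIndex.collectFirst { case (IntLit(n), i) => (n, i) }

  // Everything before the literal is a bucket column, everything after is a sort column.
  def split(arguments: Seq[Arg]): Option[(Int, Seq[Arg], Seq[Arg])] =
    findLiteral(arguments).map { case (n, pos) =>
      (n, arguments.take(pos), arguments.drop(pos + 1))
    }
}
```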

cloud-fan · 2022-01-07T03:45:44Z

sql/catalyst/src/main/scala/org/apache/spark/sql/connector/expressions/expressions.scala

+          posOfLit = index
+        case _ => index = index + 1
+      }
+      Some(numOfBucket, FieldReference(arguments.take(posOfLit).map(_.describe)),


We know that the arguments of bucket/sorted_bucket are all NamedReference, so how about arguments.take(posOfLit).map(_.asInstanceOf[NamedReference])?

cloud-fan · 2022-01-07T03:46:48Z

sql/catalyst/src/main/scala/org/apache/spark/sql/connector/expressions/expressions.scala

+      }
+      Some(numOfBucket, FieldReference(arguments.take(posOfLit).map(_.describe)),
+        FieldReference(arguments.drop(posOfLit + 1).map(_.describe)))
+    case NamedTransform("bucket", Seq(Lit(value: Int, IntegerType), Ref(seq: Seq[String]))) =>


this doesn't seem to be right. It only matches bucket with a single bucket column.

Seems somehow only a single column is supported in BucketTransform. Will fix this.
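The limitation comes from the shape of the pattern: a `Seq(x, y)` pattern matches only a sequence of exactly two elements, so a literal followed by two or more column references falls through. A minimal sketch with hypothetical stand-in types (`Arg`, `Col`, `IntLit`) shows the difference between that pattern and one that accepts any number of columns:

```scala
// Hypothetical stand-ins; in Spark the pattern uses Lit and Ref extractors instead.
object MultiColumnMatchSketch {
  sealed trait Arg
  final case class Col(name: String) extends Arg
  final case class IntLit(value: Int) extends Arg

  // Matches only a two-element argument list: one literal and one column.
  def singleColumnOnly(args: Seq[Arg]): Option[(Int, Seq[String])] = args match {
    case Seq(IntLit(n), Col(c)) => Some((n, Seq(c)))
    case _ => None
  }

  // Matches the literal followed by one or more columns.
  def anyNumberOfColumns(args: Seq[Arg]): Option[(Int, Seq[String])] = args match {
    case IntLit(n) +: rest if rest.nonEmpty && rest.forall(_.isInstanceOf[Col]) =>
      Some((n, rest.collect { case Col(c) => c }))
    case _ => None
  }
}
```

For `Seq(IntLit(4), Col("a"), Col("b"))`, `singleColumnOnly` returns `None`, while `anyNumberOfColumns` returns `Some((4, Seq("a", "b")))`.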

viirya · 2022-01-07T06:09:43Z

sql/catalyst/src/main/scala/org/apache/spark/sql/connector/expressions/expressions.scala

 private[sql] object BucketTransform {
-  def unapply(expr: Expression): Option[(Int, FieldReference, FieldReference)] =
-      expr match {
+  def unapply(expr: Expression): Option[(Int, FieldReference, FieldReference)] = expr match {


Could you add some comments on unapply (if it is really used) about what it returns?

viirya · 2022-01-07T06:11:38Z

sql/catalyst/src/main/scala/org/apache/spark/sql/connector/expressions/expressions.scala

 private[sql] object BucketTransform {
-  def unapply(expr: Expression): Option[(Int, FieldReference, FieldReference)] =
-      expr match {
+  def unapply(expr: Expression): Option[(Int, FieldReference, FieldReference)] = expr match {


BTW, why does def unapply(expr: Expression) handle only BucketTransform, while def unapply(transform: Transform) handles both sorted_bucket and bucket?

viirya

This change seems to be not only "Add tests ...". It's better to update the title and description accordingly before merging.

huaxingao · 2022-01-10T01:05:19Z

It's better to update the title and description accordingly before merging.

Updated. Thanks!

cloud-fan

LGTM if tests pass

cloud-fan · 2022-01-14T01:44:26Z

thanks, merging to master!

huaxingao · 2022-01-14T06:02:07Z

Thanks!

…etTransform

### What changes were proposed in this pull request?
1. Currently only a single bucket column is supported in `BucketTransform`, fix the code to make multiple bucket columns work.
2. Separate `SortedBucketTransform` from `BucketTransform`, and make the `arguments` in `SortedBucketTransform` in the format of `columns numBuckets sortedColumns` so we have a way to find out the `columns` and `sortedColumns`.
3. add more test coverage.

### Why are the changes needed?
Fix bugs in `BucketTransform` and `SortedBucketTransform`.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
New tests

Closes apache#34914 from huaxingao/sorted_followup.

Lead-authored-by: Huaxin Gao <[email protected]>
Co-authored-by: huaxingao <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

github-actions bot added the SQL label Dec 15, 2021

huaxingao marked this pull request as draft December 16, 2021 05:19

cloud-fan reviewed Dec 16, 2021

huaxingao marked this pull request as ready for review January 3, 2022 07:24

viirya reviewed Jan 3, 2022

huaxingao added 4 commits January 4, 2022 15:52

[SPARK-37627][SQL][FOLLOWUP] Add test for sorted BucketTransform

73bd0e6

use literal to separate bucket cols and sorted cols

4db60f7

remove extra space and extra blank line

6b9f42c

resolve conflict

00a90da

huaxingao force-pushed the sorted_followup branch from 4719a02 to 00a90da Compare January 5, 2022 00:26

remove unnessary change

51ead2b

cloud-fan reviewed Jan 5, 2022

separate BucketTransform and SortedBucketTransform

3f220d0

cloud-fan reviewed Jan 6, 2022

address comments

61dc795

cloud-fan reviewed Jan 7, 2022

viirya reviewed Jan 7, 2022

address comments

77b2c12

huaxingao changed the title ~~[SPARK-37627][SQL][FOLLOWUP] Add tests for sorted BucketTransform~~ [SPARK-37627][SQL][FOLLOWUP] Separate SortedBucketTransform from BucketTransform Jan 10, 2022

cloud-fan approved these changes Jan 12, 2022

Trigger Build

dac5693

cloud-fan closed this in 2ed827a Jan 14, 2022

CTTY mentioned this pull request Jul 12, 2022

[HUDI-4186] Support Hudi with Spark 3.3.0 apache/hudi#5943

Merged

5 tasks
