-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-37627][SQL][FOLLOWUP] Separate SortedBucketTransform from BucketTransform #34914
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
|
Test build #146247 has finished for PR 34914 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this really work? I don't see a way for people to extract bucket and sort columns from arguments with the Transform API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we keep consistent order of columns and numBuckets for two cases in arguments?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there are sortedColumn, we need numBuckets in between of columns and sortedColumns, because we need a way to figure out which elements in the array are for columns, and which elements are for sortedColumns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is an extra space between case and Lit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unnecessary change.
4719a02 to
00a90da
Compare
| this.copy(columns = newReferences) | ||
| } else { | ||
| val splits = newReferences.grouped(columns.length).toList | ||
| this.copy(columns = splits(0), sortedColumns = splits(1)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it: columns = newReferences.take(columns.length), sortedColumns = newReferences.drop(columns.length)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed. Thanks!
| sortedColumns: Seq[NamedReference] = Seq.empty[NamedReference]) extends RewritableTransform { | ||
|
|
||
| override val name: String = "bucket" | ||
| override val name: String = if (sortedColumns.nonEmpty) "sortedBucket" else "bucket" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we create a new class SortedBucketTransform to be clearer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a new class SortedBucketTransform. Thanks!
| val identifier = "testcat.table_name" | ||
| withTable(identifier) { | ||
| sql(s"CREATE TABLE $identifier (a int, b string, c int) USING $v2Source PARTITIONED BY (c)" + | ||
| s" CLUSTERED BY (b) SORTED by (a) INTO 4 BUCKETS") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why changing this test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just want to make sure multiple columns/sortedColumns work ok.
| BucketTransform(literal(numBuckets, IntegerType), references) | ||
|
|
||
| def bucket( | ||
| def sortedBucket( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's OK to keep the name bucket, to match the name of this SQL feature
| columns: Seq[NamedReference], | ||
| sortedColumns: Seq[NamedReference] = Seq.empty[NamedReference]) extends RewritableTransform { | ||
|
|
||
| override val name: String = "sortedBucket" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorted_bucket is more SQL-ish.
| throw new IllegalArgumentException(s"Match: unsupported argument(s) type - ($v, $t)") | ||
| } | ||
| case BucketTransform(numBuckets, ref, _) => | ||
| case BucketTransform(numBuckets, ref) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can have a single BucketTransform.unapply, to match both BucketTransform and SortedBucketTransform, so that we can have a single case here and avoid duplicated code.
| private[sql] object BucketTransform { | ||
| def unapply(expr: Expression): Option[(Int, FieldReference, FieldReference)] = | ||
| expr match { | ||
| def unapply(expr: Expression): Option[(Int, FieldReference, FieldReference)] = expr match { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where do we use this unapply?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was introduced in #30706 but doesn't seem to be used. I will remove for now.
| var index: Int = -1 | ||
| var posOfLit: Int = -1 | ||
| var numOfBucket: Int = -1 | ||
| arguments.foreach { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: we can do arguments.zipWithIndex.foreach, so that it's much easier to get posOfLit.
| posOfLit = index | ||
| case _ => index = index + 1 | ||
| } | ||
| Some(numOfBucket, FieldReference(arguments.take(posOfLit).map(_.describe)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we know that the arguments of bucket/sorted_bucketare all NamedReference, how about arguments.take(posOfLit).map(_.asInstanceOf[NamedReference])?
| } | ||
| Some(numOfBucket, FieldReference(arguments.take(posOfLit).map(_.describe)), | ||
| FieldReference(arguments.drop(posOfLit + 1).map(_.describe))) | ||
| case NamedTransform("bucket", Seq(Lit(value: Int, IntegerType), Ref(seq: Seq[String]))) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this doesn't seem to be right. It only matches bucket with a single bucket column.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems somehow only a single column is supported in BucketTransform. Will fix this.
| private[sql] object BucketTransform { | ||
| def unapply(expr: Expression): Option[(Int, FieldReference, FieldReference)] = | ||
| expr match { | ||
| def unapply(expr: Expression): Option[(Int, FieldReference, FieldReference)] = expr match { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add some comments on unapply (if it is really used) about what it returns?
| private[sql] object BucketTransform { | ||
| def unapply(expr: Expression): Option[(Int, FieldReference, FieldReference)] = | ||
| expr match { | ||
| def unapply(expr: Expression): Option[(Int, FieldReference, FieldReference)] = expr match { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, why def unapply(expr: Expression) addresses only BucketTransform but def unapply(transform: Transform) addresses both sorted_bucket and bucket?
viirya
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change seems to be not only "Add tests ...". It's better to update the title and description accordingly before merging.
Updated. Thanks! |
cloud-fan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM if tests pass
|
thanks, merging to master! |
|
Thanks! |
…etTransform ### What changes were proposed in this pull request? 1. Currently only a single bucket column is supported in `BucketTransform`, fix the code to make multiple bucket columns work. 2. Separate `SortedBucketTransform` from `BucketTransform`, and make the `arguments` in `SortedBucketTransform` in the format of `columns numBuckets sortedColumns` so we have a way to find out the `columns` and `sortedColumns`. 3. add more test coverage. ### Why are the changes needed? Fix bugs in `BucketTransform` and `SortedBucketTransform`. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? New tests Closes apache#34914 from huaxingao/sorted_followup. Lead-authored-by: Huaxin Gao <[email protected]> Co-authored-by: huaxingao <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?
BucketTransform, fix the code to make multiple bucket columns work.SortedBucketTransformfromBucketTransform, and make theargumentsinSortedBucketTransformin the format ofcolumns numBuckets sortedColumnsso we have a way to find out thecolumnsandsortedColumns.Why are the changes needed?
Fix bugs in
BucketTransformandSortedBucketTransform.Does this PR introduce any user-facing change?
No
How was this patch tested?
New tests