[SPARK-40900][SQL] Reimplement frequentItems with dataframe operations
#38375
cc @HyukjinKwon
You can leverage Utils.tryWithResource
good point, will update
It seems we cannot use Utils.tryWithResource here: it supports only a single Closeable, but there are two, bos and out.
We could use Utils.tryWithSafeFinally but that's fine.
cool, let me update it.
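For readers following this thread, here is a minimal sketch of the two options discussed above. The helpers are local stand-ins written for this illustration (simplified, and assumed to have the same shape as the Spark utilities named by the reviewers), and the stream types behind `bos` and `out` are assumptions, since the reviewed code is not shown here.

```scala
import java.io.{ByteArrayOutputStream, Closeable, DataOutputStream}

object ResourceSketch {
  // Stand-in for a single-resource helper (assumed shape, not Spark's code).
  def tryWithResource[R <: Closeable, T](createResource: => R)(f: R => T): T = {
    val resource = createResource
    try f(resource) finally resource.close()
  }

  // Simplified stand-in: runs `block`, then always runs `finallyBlock`.
  def tryWithSafeFinally[T](block: => T)(finallyBlock: => Unit): T = {
    try block finally finallyBlock
  }

  // Option 1: nest two single-resource calls, one per Closeable.
  def serializeNested(values: Seq[Int]): Array[Byte] =
    tryWithResource(new ByteArrayOutputStream()) { bos =>
      tryWithResource(new DataOutputStream(bos)) { out =>
        values.foreach(out.writeInt)
      }
      // ByteArrayOutputStream stays readable after the wrapper is closed.
      bos.toByteArray
    }

  // Option 2: one try/finally-style helper that closes both resources.
  def serializeWithFinally(values: Seq[Int]): Array[Byte] = {
    val bos = new ByteArrayOutputStream()
    val out = new DataOutputStream(bos)
    tryWithSafeFinally {
      values.foreach(out.writeInt)
      out.flush()
      bos.toByteArray
    } {
      out.close()
      bos.close()
    }
  }
}
```

Both variants produce the same bytes; the second keeps the two streams at the same nesting level, which may be why it was suggested for the two-resource case.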
ditto
If this is a hot path, you can just use a plain for loop or while.
got it, will update
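As an illustration of the hot-path suggestion above, a minimal sketch contrasting a collection combinator with an index-based `while` loop. The function names and the summing task are hypothetical; the code the reviewer commented on is not shown in this thread.

```scala
object HotLoopSketch {
  // Functional style: concise, but goes through a closure per element.
  def sumFunctional(values: Array[Long]): Long = values.foldLeft(0L)(_ + _)

  // Plain while loop: index-based traversal with no closure allocation.
  def sumWhile(values: Array[Long]): Long = {
    var total = 0L
    var i = 0
    while (i < values.length) {
      total += values(i)
      i += 1
    }
    total
  }
}
```

Both functions return the same result; whether the loop form is measurably faster depends on the workload and the JIT, so it is worth benchmarking before rewriting.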
this is nice!
Can we reuse one GenericArrayData with cleaning up?
I guess it is only invoked once, so no need to reuse?
cc @cloud-fan FYI
0afdc5a to 5b6800f
override def dataType: DataType = ArrayType(child.dataType, containsNull = child.nullable)

override def inputTypes: Seq[AbstractDataType] = Seq(AnyDataType)
It seems we don't have any input type requirement, so we don't need to extend ImplicitCastInputTypes.
good point, will update
Merged to master.
### What changes were proposed in this pull request?
Reimplement `frequentItems` with dataframe operations

### Why are the changes needed?
1. Do not truncate the SQL plan any more.
2. Enable SQL optimizations such as column pruning.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing UTs and manual checks.

Closes apache#38375 from zhengruifeng/sql_stat_freq_item.

Authored-by: Ruifeng Zheng
Signed-off-by: Hyukjin Kwon
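For context on what a frequent-items computation does, here is a minimal pure-Scala sketch of a counter-based (Misra-Gries style) summary. It is not the code from this PR and not Spark's implementation; the object name, function name, and `support` parameter are chosen for illustration only.

```scala
import scala.collection.mutable

object FrequentItemsSketch {
  /** Returns candidates that include every item occurring in more than
    * a `support` fraction of `items`; it may also include false positives. */
  def frequentCandidates[T](items: Iterable[T], support: Double): Set[T] = {
    require(support > 0.0 && support <= 1.0, "support must be in (0, 1]")
    // Keep at most ceil(1 / support) counters at any time.
    val maxCounters = math.ceil(1.0 / support).toInt
    val counters = mutable.Map.empty[T, Long]
    items.foreach { item =>
      if (counters.contains(item)) {
        counters(item) += 1L
      } else if (counters.size < maxCounters) {
        counters(item) = 1L
      } else {
        // No free counter: decrement all counters and drop those reaching zero.
        counters.keys.toList.foreach { k =>
          val c = counters(k) - 1L
          if (c == 0L) counters.remove(k) else counters(k) = c
        }
      }
    }
    counters.keySet.toSet
  }
}
```

For example, `FrequentItemsSketch.frequentCandidates(Seq(1, 1, 1, 1, 2, 3, 1, 4, 1, 5), 0.4)` returns a set that contains `1`, which appears in 6 of the 10 positions. The result is a candidate set, so a second pass would be needed to confirm exact frequencies.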