Skip to content

Conversation

@zhengruifeng
Copy link
Contributor

What changes were proposed in this pull request?

Reimplement frequentItems with dataframe operations

Why are the changes needed?

1, do not truncate the sql plan any more;
2, enable sql optimization like column pruning

Does this PR introduce any user-facing change?

No

How was this patch tested?

existing UTs and manually check

@zhengruifeng
Copy link
Contributor Author

cc @HyukjinKwon

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can leverage Utils.tryWithResource

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point, will update

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seem we can not leverage Utils.tryWithResource here, since Utils.tryWithResource only support single Closeable but there are two ones bos and out.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could use Utils.tryWithSafeFinally but that's fine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool, let me update it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is a hot path, you can just use a plan for loop or while.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it , will update

@HyukjinKwon
Copy link
Member

this is nice!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we reuse one GenericArrayData with cleaning up?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it is only invoked once, so no need to reuse?

@HyukjinKwon
Copy link
Member

cc @cloud-fan FYI

@github-actions github-actions bot added the SQL label Oct 24, 2022

override def dataType: DataType = ArrayType(child.dataType, containsNull = child.nullable)

override def inputTypes: Seq[AbstractDataType] = Seq(AnyDataType)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems we don't have any input type requirement, we don't need to extend ImplicitCastInputTypes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point, will update

@HyukjinKwon
Copy link
Member

Merged to master.

@zhengruifeng zhengruifeng deleted the sql_stat_freq_item branch October 26, 2022 01:56
SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022
### What changes were proposed in this pull request?
Reimplement `frequentItems` with dataframe operations

### Why are the changes needed?
1, do not truncate the sql plan any more;
2, enable sql optimization like column pruning

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
existing UTs and manually check

Closes apache#38375 from zhengruifeng/sql_stat_freq_item.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants