-
Notifications
You must be signed in to change notification settings - Fork 3k
Spark - Implement FunctionCatalog and Truncate #5305
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
After this, I'll add This is to facilitate the usage of the various transforms from PySpark as well as SQL. Additionally, having a |
b6f983d to
ac78639
Compare
bd734c6 to
fa5ad6e
Compare
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/functions/TruncateFunction.java
Outdated
Show resolved
Hide resolved
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/BaseCatalog.java
Outdated
Show resolved
Hide resolved
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/functions/TruncateFunction.java
Outdated
Show resolved
Hide resolved
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/functions/SparkFunctions.java
Outdated
Show resolved
Hide resolved
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/functions/TruncateFunction.java
Outdated
Show resolved
Hide resolved
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/functions/TruncateFunction.java
Outdated
Show resolved
Hide resolved
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/functions/TruncateFunction.java
Outdated
Show resolved
Hide resolved
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/functions/TruncateFunction.java
Outdated
Show resolved
Hide resolved
spark/v3.3/spark/src/test/java/org/apache/iceberg/spark/sql/TestSparkTruncateFunction.java
Outdated
Show resolved
Hide resolved
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/functions/TruncateFunction.java
Outdated
Show resolved
Hide resolved
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/functions/TruncateFunction.java
Outdated
Show resolved
Hide resolved
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/functions/TruncateFunction.java
Outdated
Show resolved
Hide resolved
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/functions/TruncateFunction.java
Outdated
Show resolved
Hide resolved
1e21a4b to
bf3b2fd
Compare
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/functions/TruncateFunction.java
Outdated
Show resolved
Hide resolved
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/BaseCatalog.java
Show resolved
Hide resolved
6126c68 to
1074435
Compare
05ad451 to
f81e780
Compare
|
@rdblue PTAL. I was going to tag other people as well. |
2743a2a to
0f5c8f8
Compare
|
@rdblue Do you think I should break this into a 2 PRs? |
|
Looking at the storage partition joins, it looks like the function |
|
I'd love to take a look as well. I should have some time in a day. We also had some progress on bucketed joins internally. |
Thanks Anton. Was going to tag you today now that it's cleaned up. Also cc @huaxingao @flyrain @nastra @Fokko |
|
Right not we're requiring the call be to the So we should probably allow the empty namespace to resolve functions as well. |
|
Link to the code that resolves our own |
16626eb to
ed2f738
Compare
…on't verify for perf reasons
…ade in storage partitioned joins implementation
ed2f738 to
24efcba
Compare
|
Because this PR is so big, I'm going to separate out the I'm going to add a very simple function to be able to test it but keep the code to review a lot smaller. 👍 |
|
I've opened #5377 to cover just the This PR is too big, and this way we can focus on just the I've added an |
Implements
FunctionCatalogfor Spark 3.3 and implements all variants ofTruncate.FunctionCatalog
This allows users of
SparkCatalogandSparkSessionCatalogto usetruncatewithout having to register it as a UDF.All Iceberg functions that we register into the function catalog are accessible when used with an Iceberg spark catalog and:
e.g.
my_catalog.truncate(width, value).Note - Using
truncate(width, value)typically does not work, as Spark adds the namespace to the call.system.truncateshould be preferred.systemnamespace is referenced, to match called procedure syntax. Note this only works right now with theSparkCatalog, as theSparkSessionCataloghas logic in Spark to verify the namespace exists.e.g.
my_catalog.system.truncate(width, value)orsystem.truncate(6, column)Truncate
The truncate function also allows for a dynamic width or the width to come from a column - though typically the width will likely be static for one given call as it's mostly intended to be used to match partition transforms (specifically with joins or on non-partition columns to create a new column in the data without needing to partition on it).
This PR refactors the definition of the transform functions into a utility class where needed so that Spark’s magic functions can call them via the static
invokefunction and not duplicate logic. This allows Spark to include the functions in codegen.Special Considerations for Using Function Catalog Efficiently via Magic Functions and Code Gen
The requirements for magic functions to be used with codegen include that:
invokeis a static functioninvoketakes in the primitive types / native Spark types corresponding to each of Spark's input DataTypes (e.g. int for IntegerType and UTF8String for StringType).Further documentation on the magic functions is found here in the ScalarFunction JavaDoc
This partially closes #5349