Spark: Support truncate in FunctionCatalog #5431

Merged

aokolnychyi merged 27 commits into apache:master from kbendick:kb-add-spark-truncate-function

Aug 12, 2022

Contributor

kbendick commented Aug 4, 2022

This is an offshoot of #5305 and partially closes #5349.

Adds a system.truncate function that can be used in Spark SQL and is also exposed through FunctionCatalog, so that it can be turned into a transform for storage-partitioned joins.

This also breaks the definition of Truncate out into utility functions in TruncateUtil. Because different callers validate input at different times, none of the functions in TruncateUtil validate their input; they assume the calling code has already done so. This allows the Truncate transforms to validate their width once (on instantiation) and lets the Spark truncate function skip input validation for faster generated code.
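To illustrate the split described above, here is a minimal sketch, not the code from this PR. The class and method names (`TruncateSketch`, `truncateInt`, `IntTruncate`) are hypothetical, and the arithmetic assumes integer truncation rounds down to the nearest multiple of the width, including for negative values. The static helpers do no validation, while the transform-style wrapper checks the width once in its constructor.

```java
// Hypothetical sketch: names are illustrative and not taken from the PR.
final class TruncateSketch {
  private TruncateSketch() {}

  // Unvalidated helper: the caller must guarantee width > 0.
  // Rounds value down to a multiple of width, also for negative values.
  static int truncateInt(int width, int value) {
    return value - (((value % width) + width) % width);
  }

  // Same idea for long values; still no validation here.
  static long truncateLong(int width, long value) {
    return value - (((value % width) + width) % width);
  }

  // Transform-style wrapper: validates the width once, on instantiation,
  // and then delegates to the unvalidated helper on every call.
  static final class IntTruncate {
    private final int width;

    IntTruncate(int width) {
      if (width <= 0) {
        throw new IllegalArgumentException(
            "Invalid truncate width: " + width + " (must be > 0)");
      }
      this.width = width;
    }

    int apply(int value) {
      return truncateInt(width, value);
    }
  }

  public static void main(String[] args) {
    IntTruncate byTen = new IntTruncate(10);
    System.out.println(byTen.apply(37)); // prints 30
    System.out.println(byTen.apply(-3)); // prints -10
  }
}
```

A caller that already knows its width is valid, such as generated code, could call `truncateInt` directly and skip the per-call check. A very large width could overflow the intermediate sum in this sketch; handling that is left out for brevity.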

github-actions bot added API core spark labels

kbendick force-pushed the kb-add-spark-truncate-function branch 5 times, most recently from a602e50 to eae6bb8

August 4, 2022 18:09

rdblue changed the title ~~[SPARK] - Add Truncate Function to FunctionCatalog and Break Out Truncate Definition Into Shared Utility Class~~ Spark: Support truncate in FunctionCatalog

rdblue reviewed and left 21 comment threads, all since marked outdated and resolved:

spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/functions/TruncateFunction.java (12 threads)

api/src/main/java/org/apache/iceberg/util/TruncateUtil.java (1 thread)

core/src/test/java/org/apache/iceberg/util/TestTruncateUtil.java (2 threads)

spark/v3.3/spark/src/test/java/org/apache/iceberg/spark/sql/TestSparkTruncateFunction.java (6 threads)

kbendick added 3 commits

August 11, 2022 10:23


          Skip isNullAt check for non-primitive types that already check for null input values in invoke

eedaec4


          Add isNull check in all produceResults calls to avoid calling getXXXX on null on the special accessors of certain row types that might not tolerate it

5452d84


          Add missing type tests for truncation

f33b356

kbendick force-pushed the kb-add-spark-truncate-function branch from ddb851c to f33b356

August 11, 2022 17:23

kbendick added 4 commits

August 11, 2022 10:40


          Use WIDTH_ORDINAL and VALUE_ORDINAL constants

75efb62


          Remove odd ternary expression

e3fe438


          Add line between pulling types out of schema and validating the width type

c43e74e


          Ensure string truncation has two byte, three byte, and four byte unicode codepoints in tests

b6340b9
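The multi-byte test cases in the last commit matter because a width counted in Java `char`s or in bytes can cut a character in half. A minimal sketch of the idea follows, assuming truncation counts Unicode code points. `StringTruncateSketch` and `truncateString` are hypothetical names, and the sketch works on `java.lang.String` rather than on Spark's internal string type.

```java
// Hypothetical sketch: truncates by Unicode code points, not chars or bytes.
final class StringTruncateSketch {
  private StringTruncateSketch() {}

  // Unvalidated helper: the caller must guarantee width > 0 and value != null.
  static String truncateString(int width, String value) {
    int codePoints = value.codePointCount(0, value.length());
    if (codePoints <= width) {
      return value; // already short enough
    }
    // offsetByCodePoints returns the char index just past `width` code points,
    // so a surrogate pair is never split.
    return value.substring(0, value.offsetByCodePoints(0, width));
  }

  public static void main(String[] args) {
    // 'a' (1 byte in UTF-8), U+00E9 (2 bytes), U+20AC (3 bytes),
    // U+1F600 (4 bytes, a surrogate pair in Java), then 'z'.
    String input = "a\u00e9\u20ac\uD83D\uDE00z";
    String firstFour = truncateString(4, input);
    System.out.println(firstFour.codePointCount(0, firstFour.length())); // prints 4
    System.out.println(firstFour.length()); // prints 5, the pair takes two chars
  }
}
```

A plain `value.substring(0, 4)` on the same input would end in the middle of the surrogate pair, which is the kind of bug that four-byte test inputs are meant to catch.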

kbendick force-pushed the kb-add-spark-truncate-function branch from 1d63a04 to 9e86776

August 11, 2022 19:02

kbendick added 7 commits

August 11, 2022 12:02


          Use constant set for width types as none have variable length and remove separate function for validation

9e86776


          Use inputType.size

1b8f76c


          Use assertNull where possible

1de18f3


          Simplify exception message for wrong number of inputs and add test for fewer than 2 and more than 2 arguments

05b809b


          Use value consistently for the truncation value field

93ff98d


          Ensure truncation width is non-null - need to validate this will be accepted in storage partitioned joins impl

cd4f2fd


          Add check that dataframe with nullable integer will be rejected for use with truncate

ad5343f

Contributor Author

kbendick commented Aug 11, 2022

@aokolnychyi @rdblue I addressed all of your feedback.

Please take a look when you get a chance. Also, on the subject of nullability for the width column, I was able to make it work but reverted it based on this comment thread: #5431 (comment)

Knowing that Spark isn't the best at keeping track of nullability, I think this is better and adheres to the contract laid out in ScalarFunction as quoted by @aokolnychyi.

But take a look at the commit that added nullability-checking on the width field if you'd like: ad5343f


          Remove assertion that width field is non-null and associated tests per PR discussion

ce6dba1

kbendick force-pushed the kb-add-spark-truncate-function branch from fd7ef4b to ce6dba1

August 11, 2022 20:58

rdblue approved these changes

Contributor

rdblue commented Aug 11, 2022

Looks good to me. @aokolnychyi, do you want to take another look?

Contributor

aokolnychyi commented Aug 12, 2022

@rdblue, let me take a quick look now.

aokolnychyi approved these changes

Contributor

aokolnychyi left a comment

Looks great!

aokolnychyi merged commit 6a5051b into apache:master

Contributor

aokolnychyi commented Aug 12, 2022

Thanks, @kbendick! Could you cherry-pick this to 3.2?

kbendick deleted the kb-add-spark-truncate-function branch

August 12, 2022 18:19

Contributor Author

kbendick commented Aug 12, 2022

Here's the PR for bucket. All the feedback from this PR was more or less applied there as well - #5513

Contributor Author

kbendick commented Aug 12, 2022

> Thanks, @kbendick! Could you cherry-pick this to 3.2?

Sure thing!

kbendick mentioned this pull request

Spark 3.2: Support truncate in FunctionCatalog #5514

Merged

zhongyujiang pushed a commit to zhongyujiang/iceberg that referenced this pull request


          Spark 3.3: Support truncate in FunctionCatalog (apache#5431)

1849b63

(cherry picked from commit 6a5051b)
