-
Notifications
You must be signed in to change notification settings - Fork 29.1k
[SPARK-40687][SQL] Support data masking built-in function 'mask' #38146
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@HyukjinKwon , this PR is a generic approach to mask the string based on the arguments. This mask function can be applied to any string value and it does not expect a pattern on the input string. Apache Hive mask function has the same logic. |
|
Reference snowflake: https://docs.snowflake.com/en/user-guide/security-column-ddm-use.html |
@melin This PR implements the same data masking functionality as Hive
https://issues.apache.org/jira/browse/SPARK-40686
|
@HyukjinKwon @gengliangwang @vinodkc Yes, it looks like there is an overlap between this work in https://issues.apache.org/jira/browse/SPARK-40686, and https://issues.apache.org/jira/browse/SPARK-40623 which I created to track implementing masking functions in Spark. We should probably dedup these into one effort. @vinodkc do you want to take this one and I can close mine as a dup? I can then help with the review if needed. |
|
@dtenedor , yes, please close yours as a dup. I appreciate your help in reviewing this PR and on top of this change, I'm planning to add additional built-in mask functions supported in Hive |
|
@vinodkc sounds good! Thanks for working on this 👍 |
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/maskExpressions.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/maskExpressions.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/maskExpressions.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/maskExpressions.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/maskExpressions.scala
Outdated
Show resolved
Hide resolved
sql/core/src/test/resources/sql-tests/inputs/string-functions.sql
Outdated
Show resolved
Hide resolved
dtenedor
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly looks good. The test coverage is thorough, and the implementation is relatively simple because there are no error cases. I tried to think of test ideas.
6d5fbaf to
e7fdf2b
Compare
|
Hi @dtenedor , @HyukjinKwon @gengliangwang |
|
This change LGTM now |
2d407a6 to
7a8cd22
Compare
|
@HyukjinKwon , @dtenedor , Can you please check this PR? |
|
@vinodkc Yes, I said the change LGTM :) sadly I am unable to merge this PR on my own though. We will need @HyukjinKwon to merge it. |
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/maskExpressions.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/maskExpressions.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/maskExpressions.scala
Outdated
Show resolved
Hide resolved
sql/core/src/test/resources/sql-tests/inputs/string-functions.sql
Outdated
Show resolved
Hide resolved
|
@vinodkc this one LGTM overall. I will merge it after my comments are addressed. |
7a8cd22 to
a02ff23
Compare
a02ff23 to
bf3cf93
Compare
|
@gengliangwang, Review comments are addressed. |
|
@vinodkc Thanks for the work! I am focusing on other tasks this week. |
|
@vinodkc Can you reason why you needed -1 here rather than using e.g. NULL? Note that I found the docs here: |
|
@srielau , Thanks for checking this. There is no specific reason other than following same approach as in Hive documentation: https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFMask.java#L41 |
Yes please |
Yes, changing this to NULL would make the behavior more standard. I think it is OK to diverge from Hive in this case. 👍 |
|
Thanks for the suggestions , I'll raise a new PR to change -1 to NULL |
|
Thanks so much for your work, this data masking functionality will be very useful! |
|
@vinodkc In the source code of hive, the types supported by the mask have this annotation: value - value to mask. Supported types: TINYINT, SMALLINT, INT, BIGINT, STRING, VARCHAR, CHAR, DATE You only implemented the String type, do you need to further add parameter types? |
What changes were proposed in this pull request?
This PR supports data masking built-in Function mask, which returns a masked version of input string.
By default, upper case letters will be converted to "X", lower case letters will be converted to "x" and numbers will be converted to "n".
For example mask("abcd-EFGH-8765-4321") results in xxxx-XXXX-nnnn-nnnn will be able to override the characters used in the mask by supplying additional arguments: the second argument controls the mask character for upper case letters, the third argument for lower case letters and the fourth argument for numbers. For example, mask("abcd-EFGH-8765-4321", "U", "l", "#") should result in llll-UUUU-####-####
Examples:
Why are the changes needed?
To support data masking built-in function mask, which returns a masked version of the input string
Ref : Data masking functions
Does this PR introduce any user-facing change?
Yes, added a new build-in function named 'mask'
How was this patch tested?
Added test cases