Skip to content

Conversation

@vinodkc
Copy link
Contributor

@vinodkc vinodkc commented Oct 7, 2022

What changes were proposed in this pull request?

This PR supports data masking built-in Function mask, which returns a masked version of input string.
By default, upper case letters will be converted to "X", lower case letters will be converted to "x" and numbers will be converted to "n".

For example mask("abcd-EFGH-8765-4321") results in xxxx-XXXX-nnnn-nnnn will be able to override the characters used in the mask by supplying additional arguments: the second argument controls the mask character for upper case letters, the third argument for lower case letters and the fourth argument for numbers. For example, mask("abcd-EFGH-8765-4321", "U", "l", "#") should result in llll-UUUU-####-####

Examples:

 > SELECT mask('abcd-EFGH-8765-4321');
   xxxx-XXXX-nnnn-nnnn
 > SELECT mask('abcd-EFGH-8765-4321', 'Q');
   xxxx-QQQQ-nnnn-nnnn
 > SELECT mask('AbCD123-@$#', 'Q','q');
   QqQQnnn-@$#
 > SELECT mask('AbCD123-@$#');
   XxXXnnn-@$#
 > SELECT mask('AbCD123-@$#', 'Q');
   QxQQnnn-@$#
 > SELECT mask('AbCD123-@$#', 'Q','q');
   QqQQnnn-@$#
 > SELECT mask('AbCD123-@$#', 'Q','q', 'd');
   QqQQddd-@$#
 > SELECT mask('AbCD123-@$#', 'Q','q', 'd', 'o');
   QqQQdddoooo
 > SELECT mask('AbCD123-@$#', -1, 'q', 'd', 'o');
   AqCDdddoooo
 > SELECT mask('AbCD123-@$#', -1,-1, 'd', 'o');
   AbCDdddoooo
 > SELECT mask('AbCD123-@$#', -1,-1, -1, 'o');
   AbCD123oooo
 > SELECT mask(NULL, -1,-1, -1, 'o');
   NULL
 > SELECT mask(NULL);
   NULL
 > SELECT mask('AbCD123-@$#', -1, -1, -1, -1);
   AbCD123-@$#

Why are the changes needed?

To support data masking built-in function mask, which returns a masked version of the input string
Ref : Data masking functions

Does this PR introduce any user-facing change?

Yes, added a new build-in function named 'mask'

How was this patch tested?

Added test cases

@github-actions github-actions bot added the SQL label Oct 7, 2022
@HyukjinKwon
Copy link
Member

cc @dtenedor FYI. seems like it's sort of similar with #38101 somehow.

@vinodkc
Copy link
Contributor Author

vinodkc commented Oct 7, 2022

@HyukjinKwon , this PR is a generic approach to mask the string based on the arguments. This mask function can be applied to any string value and it does not expect a pattern on the input string. Apache Hive mask function has the same logic.
Eg:
Arguments:
* input - string value to mask. Supported types: STRING, VARCHAR, CHAR
* upperChar - character to replace upper-case characters with. Specify -1 to retain the original character. Default value: 'X'
* lowerChar - character to replace lower-case characters with. Specify -1 to retain the original character. Default value: 'x'
* digitChar - character to replace digit characters with. Specify -1 to retain the original character. Default value: 'n'
* otherChar - character to replace all other characters with. Specify -1 to retain the original character. Default value: -1

@vinodkc vinodkc changed the title [SPARK-40687][SQL] Support data masking built-in Function 'mask' [SPARK-40687][SQL] Support data masking built-in function 'mask' Oct 9, 2022
@melin
Copy link

melin commented Oct 25, 2022

@vinodkc
Copy link
Contributor Author

vinodkc commented Oct 25, 2022

Reference snowflake: https://docs.snowflake.com/en/user-guide/security-column-ddm-use.html

@melin
Thanks to the reference from Snoflake, which describes dynamic data masking based on users/roles. As Spark does not support access control/roles, so we cannot implement dynamic data masking based on users/roles.

This PR implements the same data masking functionality as Hive

  • mask(string str[, string upper[, string lower[, string number]]])

https://issues.apache.org/jira/browse/SPARK-40686

@dtenedor
Copy link
Contributor

cc @dtenedor FYI. seems like it's sort of similar with #38101 somehow.

@HyukjinKwon @gengliangwang @vinodkc Yes, it looks like there is an overlap between this work in https://issues.apache.org/jira/browse/SPARK-40686, and https://issues.apache.org/jira/browse/SPARK-40623 which I created to track implementing masking functions in Spark. We should probably dedup these into one effort. @vinodkc do you want to take this one and I can close mine as a dup? I can then help with the review if needed.

@vinodkc
Copy link
Contributor Author

vinodkc commented Oct 27, 2022

@dtenedor , yes, please close yours as a dup. I appreciate your help in reviewing this PR and on top of this change, I'm planning to add additional built-in mask functions supported in Hive

@dtenedor
Copy link
Contributor

@vinodkc sounds good! Thanks for working on this 👍

Copy link
Contributor

@dtenedor dtenedor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly looks good. The test coverage is thorough, and the implementation is relatively simple because there are no error cases. I tried to think of test ideas.

@vinodkc vinodkc force-pushed the br_supportMask branch 2 times, most recently from 6d5fbaf to e7fdf2b Compare November 3, 2022 21:42
@vinodkc
Copy link
Contributor Author

vinodkc commented Nov 7, 2022

Hi @dtenedor , @HyukjinKwon @gengliangwang
Review comments are resolved, can you please check

@dtenedor
Copy link
Contributor

dtenedor commented Nov 7, 2022

This change LGTM now

@vinodkc vinodkc force-pushed the br_supportMask branch 3 times, most recently from 2d407a6 to 7a8cd22 Compare November 9, 2022 05:50
@vinodkc
Copy link
Contributor Author

vinodkc commented Dec 5, 2022

@HyukjinKwon , @dtenedor , Can you please check this PR?

@dtenedor
Copy link
Contributor

dtenedor commented Dec 5, 2022

@vinodkc Yes, I said the change LGTM :) sadly I am unable to merge this PR on my own though. We will need @HyukjinKwon to merge it.

@gengliangwang
Copy link
Member

@vinodkc this one LGTM overall. I will merge it after my comments are addressed.

@vinodkc
Copy link
Contributor Author

vinodkc commented Dec 9, 2022

@gengliangwang, Review comments are addressed.

@gengliangwang
Copy link
Member

@vinodkc Thanks for the work! I am focusing on other tasks this week.
Merging this one to the master branch now.

@srielau
Copy link
Contributor

srielau commented Jan 9, 2023

@vinodkc Can you reason why you needed -1 here rather than using e.g. NULL?
AFAIK using a value from another type domain altogether for an argument is quite "unique".
Would you object to changing -1 to NULL?
@gengliangwang @dtenedor

Note that I found the docs here:
https://cwiki.apache.org/confluence/display/hive/languagemanual+udf#LanguageManualUDF-DataMaskingFunctions
But they do NOT describe -1 (or NULL) behavior

@vinodkc
Copy link
Contributor Author

vinodkc commented Jan 10, 2023

@srielau , Thanks for checking this. There is no specific reason other than following same approach as in Hive documentation: https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFMask.java#L41
If still need to change to NULL I can raise a new PR. @gengliangwang @dtenedor , please share your recommendations too

@srielau
Copy link
Contributor

srielau commented Jan 10, 2023

@srielau , Thanks for checking this. There is no specific reason other than following same approach as in Hive documentation: https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFMask.java#L41 If still need to change to NULL I can raise a new PR. @gengliangwang @dtenedor , please share your recommendations too

Yes please

@dtenedor
Copy link
Contributor

If still need to change to NULL I can raise a new PR. @gengliangwang @dtenedor , please share your recommendations too

Yes, changing this to NULL would make the behavior more standard. I think it is OK to diverge from Hive in this case. 👍

@vinodkc
Copy link
Contributor Author

vinodkc commented Jan 10, 2023

Thanks for the suggestions , I'll raise a new PR to change -1 to NULL

@dtenedor
Copy link
Contributor

Thanks so much for your work, this data masking functionality will be very useful!

@whutpencil
Copy link

@vinodkc In the source code of hive, the types supported by the mask have this annotation:

value - value to mask. Supported types: TINYINT, SMALLINT, INT, BIGINT, STRING, VARCHAR, CHAR, DATE

You only implemented the String type, do you need to further add parameter types?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants