Skip to content

Conversation

@liujinhui1994
Copy link
Contributor

@liujinhui1994 liujinhui1994 commented Sep 7, 2021

Tips

What is the purpose of the pull request

Data security is becoming more and more important, if hudi can support encryption, it is very welcome

  1. Specify column encryption
  2. Support footer encryption
  3. Custom encrypted client interface(Provide memory-based encryption client by default)
  4. Specify the encryption key

When querying, you need to pass the relevant key or obtain query permission based on the client's encrypted interface. If it fails, the result cannot be returned.

  1. When querying non-encrypted fields, the key is not passed, and the data is returned normally
  2. When querying encrypted fields, the key is not passed and the data is not returned
  3. When the encrypted field is queried, the key is passed, and the data is returned normally
  4. When querying all fields, the key is not passed and no result is returned. If passed, the data returns normally

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@liujinhui1994
Copy link
Contributor Author

Processing...

@yanghua
Copy link
Contributor

yanghua commented Sep 7, 2021

Hi @liujinhui1994 Can you share the use case or some documentation for this feature?

@rubenssoto
Copy link

Hi Guys,
Only to know, this feature will work with spark and prestodb?

Thank you

@vinothchandar vinothchandar self-assigned this Sep 7, 2021
pom.xml Outdated
<confluent.version>5.3.4</confluent.version>
<glassfish.version>2.17</glassfish.version>
<parquet.version>1.10.1</parquet.version>
<parquet.version>1.12.0</parquet.version>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

main thing is the parquet upgrade. Would this be compatible with Spark 2.x. guess not? I think this needs Spark 3.2?
https://github.com/apache/spark/blob/branch-3.2/pom.xml

this will also upgrade parquet-avro that we use everywhere. So just thinking about that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, vc,I am testing...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@liujinhui1994 : may I know whats the status on this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently working on the last problem I found. Some classes cannot be found during runtime. After this is processed, I think it should be fine. @nsivabalan

@liujinhui1994
Copy link
Contributor Author

https://issues.apache.org/jira/browse/HUDI-2370 @yanghua Here are some discussions

@liujinhui1994
Copy link
Contributor Author

@hudi-bot run azure

@liujinhui1994
Copy link
Contributor Author

@hudi-bot run azure

@nsivabalan
Copy link
Contributor

nsivabalan commented Sep 22, 2021

@liujinhui1994 : how did your testing go. Can you update w/ your findings. @vinothchandar : apart from parquet version upgrade, I also see that we are enabling vectorized reading by default in this patch. Just wanted to remind you just incase we need to watch out for something. Also, should we do parquet upgrade in a separate patch, so that we can do some testing around diff query types, engines etc to certify the upgrade.

@liujinhui1994
Copy link
Contributor Author

I tested and verified last week. After upgrading the parquet version, many unit tests and integration tests failed. I am still looking for a solution.

@liujinhui1994
Copy link
Contributor Author

@hudi-bot run azure

@liujinhui1994
Copy link
Contributor Author

Of course, we first upgrade parquet 1.12.0.
I still encounter some CI problems @nsivabalan

@liujinhui1994
Copy link
Contributor Author

@hudi-bot run azure

@liujinhui1994
Copy link
Contributor Author

@hudi-bot run azure

@xushiyan xushiyan self-assigned this Nov 6, 2021
@nsivabalan nsivabalan removed the priority:blocker Production down; release blocker label Nov 7, 2021
@xushiyan xushiyan added pr:author-action status:in-progress Work in progress priority:blocker Production down; release blocker labels Nov 7, 2021
@xushiyan
Copy link
Member

xushiyan commented Nov 8, 2021

@liujinhui1994 thanks for making this feature. As discussed, some suggestions on the next steps:

We'd like to see this gets into 0.10, but only without major parquet version upgrades. Understand you're using org.apache.parquet.crypto.* that only available from 1.12. Would it be possible to keep the feature in a separate module say hudi-security and make it useable only when people build Hudi with parquet 1.12? say -Dcustom.build=true. The main question here is whether building with parquet 1.12 makes other features work or not.

Kindly share with us any compatibility issues from your testings. cc @vinothchandar @nsivabalan

@xushiyan xushiyan removed the priority:blocker Production down; release blocker label Nov 9, 2021
@xushiyan
Copy link
Member

xushiyan commented Nov 9, 2021

@liujinhui1994 @vinothchandar @nsivabalan as discussed, removing this feature from 0.10. Let's continue the collaboration on next release. thanks

@vinothchandar vinothchandar self-assigned this Dec 25, 2021
Copy link
Member

@xushiyan xushiyan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@liujinhui1994 some high level feedback:

  • can you put some more info in the PR description to explain the high-level functionalities we want to add here?
  • parquet 1.12 is only used when hudi is built with spark 3.2. the encryption feature needs to be somehow guarded by checking the intended spark version. so we need to find a way to make the functionality only available when people using spark 3.2+

@liujinhui1994
Copy link
Contributor Author

@liujinhui1994 some high level feedback:

  • can you put some more info in the PR description to explain the high-level functionalities we want to add here?
  • parquet 1.12 is only used when hudi is built with spark 3.2. the encryption feature needs to be somehow guarded by checking the intended spark version. so we need to find a way to make the functionality only available when people using spark 3.2+

Point 1: already added
Point 2: Is it possible to add a description to the InMemoryKMS annotation? Only allow spark3.2+ to use it, or if there is any other good suggestion please let me know

@xushiyan
Copy link
Member

xushiyan commented Feb 9, 2022

@liujinhui1994 per our discussions, some points here

  • hoodie.parquet.encryption.keys can be used as the switch. default to not set. if not set, then encryption feature is disabled. any parquet settings users provide will be ignored
  • If users enable encryption but hoodie.table.base.file.format is not parquet, there should be some warning to highlight it to the users
  • it'll be good to mark this as experimental feature. you can clarify this with more details (like why experimental) in the config's docs and related java classes' docs. Release notes will also highlight it.

@liujinhui1994
Copy link
Contributor Author

@liujinhui1994根据我们的讨论,这里有几点

  • hoodie.parquet.encryption.keys可以用作开关。默认不设置。如果未设置,则禁用加密功能。用户提供的任何镶木地板设置都将被忽略
  • 如果用户启用了加密但hoodie.table.base.file.format不是镶木地板,则应该有一些警告以向用户突出显示
  • 将其标记为实验性功能会很好。您可以在配置的文档和相关的 java 类的文档中用更多的细节(比如为什么是实验性的)来澄清这一点。发行说明也将突出显示它。

@liujinhui1994 per our discussions, some points here

  • hoodie.parquet.encryption.keys can be used as the switch. default to not set. if not set, then encryption feature is disabled. any parquet settings users provide will be ignored
  • If users enable encryption but hoodie.table.base.file.format is not parquet, there should be some warning to highlight it to the users
  • it'll be good to mark this as experimental feature. you can clarify this with more details (like why experimental) in the config's docs and related java classes' docs. Release notes will also highlight it.

ok

@hudi-bot
Copy link
Collaborator

hudi-bot commented Feb 9, 2022

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@xushiyan
Copy link
Member

xushiyan commented Feb 14, 2022

@liujinhui1994 let's do #4804 instead. thanks

@xushiyan xushiyan closed this Feb 14, 2022
@liujinhui1994
Copy link
Contributor Author

@liujinhui1994 let's do #4804 instead. thanks

ok

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

status:in-progress Work in progress

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants