[SPARK-34990][SQL][TESTS] Add ParquetEncryptionSuite #32146
Conversation
Thanks, Maya!

Jenkins, okay to test.
add new line.
sql/core/pom.xml
remove unneeded change.
@dbtsai, unfortunately, without the change I get the following exception:

```
java.lang.NoClassDefFoundError: org/codehaus/jackson/type/TypeReference
	at org.apache.parquet.crypto.keytools.FileKeyWrapper.getEncryptionKeyMetadata(FileKeyWrapper.java:140)
	at org.apache.parquet.crypto.keytools.FileKeyWrapper.getEncryptionKeyMetadata(FileKeyWrapper.java:113)
	at org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory.getFileEncryptionProperties(PropertiesDrivenCryptoFactory.java:127)
	at org.apache.parquet.hadoop.ParquetOutputFormat.createEncryptionProperties(ParquetOutputFormat.java:554)
	at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:478)
	at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:420)
	at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:409)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetOutputWriter.scala:36)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:150)
	at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:126)
	at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:111)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:271)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:211)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:498)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1437)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:501)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: org.codehaus.jackson.type.TypeReference
	at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
```
I mean the extra new line change.
cc @dongjoon-hyun to check if it's okay to add jackson-mapper-asl as a new dependency in the core module.
No need to use AADs here; they don't enhance integrity protection.
same here
It might be helpful to add a link to a sample KmsClient in this comment,
https://github.com/apache/parquet-mr/blob/apache-parquet-1.12.0/parquet-hadoop/src/test/java/org/apache/parquet/crypto/keytools/samples/VaultClient.java
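For illustration, a minimal in-memory KmsClient sketch (test-only and deliberately insecure). The method signatures follow the KmsClient interface in parquet-hadoop 1.12.0; the class name and the Base64 "wrapping" scheme are assumptions made up for this example:

```scala
import java.nio.charset.StandardCharsets
import java.util.Base64

import org.apache.hadoop.conf.Configuration
import org.apache.parquet.crypto.keytools.KmsClient

// Test-only mock: "wraps" a key by tagging it with the master key id and
// Base64-encoding it. A real client would call out to a KMS such as Vault.
class InMemoryKmsClient extends KmsClient {

  override def initialize(
      configuration: Configuration,
      kmsInstanceID: String,
      kmsInstanceURL: String,
      accessToken: String): Unit = {
    // Nothing to initialize for the in-memory mock.
  }

  override def wrapKey(keyBytes: Array[Byte], masterKeyIdentifier: String): String = {
    // Insecure placeholder "wrapping", for tests only.
    val payload = masterKeyIdentifier + ":" + Base64.getEncoder.encodeToString(keyBytes)
    Base64.getEncoder.encodeToString(payload.getBytes(StandardCharsets.UTF_8))
  }

  override def unwrapKey(wrappedKey: String, masterKeyIdentifier: String): Array[Byte] = {
    val payload = new String(Base64.getDecoder.decode(wrappedKey), StandardCharsets.UTF_8)
    val Array(keyId, encodedKey) = payload.split(":", 2)
    require(keyId == masterKeyIdentifier, s"Unexpected master key id: $keyId")
    Base64.getDecoder.decode(encodedKey)
  }
}
```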
Thank you for pinging me, @dbtsai.
Thank you for your contribution, @andersonm-ibm. In terms of the dependency, adding jackson-mapper-asl to more important modules doesn't look good to me in many ways, even though it's already used in Apache Spark:
- jackson-mapper-asl is ancient (its last release was in 2013), so we cannot expect more bug patches.
- The package itself is officially declared to be replaced by com.fasterxml.jackson.core » jackson-databind.

In the sql/core module, can we use jackson-databind instead?
cc @viirya, @sunchao, @attilapiros, too.
sql/core/pom.xml
In the hive module it was inevitable due to Apache Hive, but this new addition will affect Spark's mllib module and kafka-0-10-sql module, too. Where does this come from?
- If this is used only for testing, this should be a test dependency, @andersonm-ibm (see the sketch after this list).
- Another option is moving this test case to the hive module.
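If the test-dependency route is taken, here is a sketch of what the declaration in sql/core/pom.xml might look like; the version is assumed to be managed in the parent POM:

```xml
<!-- Hypothetical test-scoped declaration; the version is assumed to be
     managed in the parent POM's dependencyManagement section. -->
<dependency>
  <groupId>org.codehaus.jackson</groupId>
  <artifactId>jackson-mapper-asl</artifactId>
  <scope>test</scope>
</dependency>
```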
Let's rename this: ParquetEncryptionTest -> ParquetEncryptionSuite.
Please remove this in Spark.
If you don't mind, please add a JIRA prefix like the following:
```diff
- test("Write and read an encrypted parquet") {
+ test("SPARK-34990: Write and read an encrypted parquet") {
```
Thank you for pinging me, @dongjoon-hyun. Can we restore the PR description template? Spark pull requests follow the format, and it will be in the commit log. Thanks.
Please separate the import groups with newlines, in the order of: 1. Java imports, 2. Scala imports, 3. third-party imports, 4. Spark imports. E.g.,
```scala
import java.io.File
import java.nio.charset.StandardCharsets
import java.util.{Base64, HashMap, Map}

import scala.sys.process._

import org.apache.hadoop.conf.Configuration
import org.apache.parquet.crypto.{KeyAccessDeniedException, ParquetCryptoRuntimeException}
import org.apache.parquet.crypto.keytools.{KeyToolkit, KmsClient}

import org.apache.spark.sql.QueryTest
import org.apache.spark.sql.test.SharedSparkSession
```
Instead of `show`, should we verify the content?
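For example, a sketch using QueryTest's checkAnswer; `inputDF` and `readDF` are hypothetical names for the data written and read back:

```scala
// Compare the decrypted read-back against the original rows instead of
// printing them with show(). The DataFrame names are hypothetical.
checkAnswer(readDF, inputDF.collect().toSeq)
```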
Is println necessary?
ditto.
@dongjoon-hyun @dbtsai I agree that replacing codehaus jackson with the fasterxml one is the right thing to do in the next parquet version. Regarding the current situation: parquet 1.12.0 has been released with the codehaus runtime dependency. This jackson was leveraged by PME a few years back and tested with Spark 2.4 and 3.0. All these Spark distros, and the latest 3.1.1, have the codehaus jar. The current master drops this dependency in core, but maybe it can be kept for one more release, so PME is enabled in Spark 3.2.0? We will work on replacing the jackson in parquet, making sure it's properly tested (including backwards compatibility with 1.12.0), etc.; this can take some time, so the next parquet version would go into the next Spark version after 3.2.0?

@ggershinsky. You should split (1)
dongjoon-hyun
left a comment
Please don't forget this comment, @andersonm-ibm:
Another option is moving this test case to the hive module.
@dongjoon-hyun Got you, sounds good! Having the codehaus jackson jar in the Spark 3.2.0 distribution will enable users to work with Parquet encryption out of the box in this Spark version.
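(For context, a hedged sketch of what out-of-the-box usage could look like. The property names follow parquet-mr 1.12.0's PropertiesDrivenCryptoFactory, KeyToolkit, and its InMemoryKMS test mock; the key material, column name, and output path are made-up examples.)

```scala
// Sketch: write an encrypted Parquet file using made-up test keys and the
// mock in-memory KMS that ships with parquet-mr's tests.
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("parquet.crypto.factory.class",
  "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")
hadoopConf.set("parquet.encryption.kms.client.class",
  "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS")
hadoopConf.set("parquet.encryption.key.list",
  "keyA:AAECAwQFBgcICQoLDA0ODw==, keyB:AAECAAECAAECAAECAAECAA==")

// Encrypt the hypothetical "square" column with keyA and the footer with keyB.
df.write
  .option("parquet.encryption.column.keys", "keyA:square")
  .option("parquet.encryption.footer.key", "keyB")
  .parquet("/tmp/encrypted_table")
```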
Thank you, @dongjoon-hyun, for all your comments and suggestions. I'll move the test case to the hive module and update the PR.
ok to test
Thank you for the update, @andersonm-ibm.
Could you run `dev/lint-scala` and fix the Scala style error?
```
[error] /Users/dongjoon/PRS/SPARK-PR-32146/sql/hive/src/test/scala/org/apache/spark/sql/hive/ParquetEncryptionSuite.scala:28:0:
org.apache.parquet.crypto. is in wrong order relative to org.apache.parquet.crypto.keytools..
```
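(The fix is presumably just ordering the parent package before its subpackage, e.g.:)

```scala
// Scalastyle expects the parent package import before the subpackage import.
import org.apache.parquet.crypto.{KeyAccessDeniedException, ParquetCryptoRuntimeException}
import org.apache.parquet.crypto.keytools.{KeyToolkit, KmsClient}
```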
Test build #137836 has finished for PR 32146 at commit

Kubernetes integration test starting

Kubernetes integration test status failure

Test build #137838 has finished for PR 32146 at commit

Retest this please

Test build #137856 has started for PR 32146 at commit

Kubernetes integration test starting

Kubernetes integration test status failure

Test build #137841 has finished for PR 32146 at commit
attilapiros
left a comment
@andersonm-ibm We are very close. I just found two minor things.
sql/hive/src/test/scala/org/apache/spark/sql/hive/ParquetEncryptionSuite.scala
Please address @attilapiros's comment. That doesn't make a difference for the test result.
… verify all parquet parts in parquet folder.
Kubernetes integration test starting

Kubernetes integration test status failure

Test build #137880 has finished for PR 32146 at commit

jenkins retest this please

Kubernetes integration test starting

Kubernetes integration test status failure

Retest this please

Kubernetes integration test unable to build dist. exiting with code: 1

Test build #137887 has finished for PR 32146 at commit
dongjoon-hyun
left a comment
+1, LGTM. Thank you, @andersonm-ibm, @ggershinsky, @dbtsai, @viirya, @attilapiros.
Since all comments are addressed, I'll merge this to master for Apache Spark 3.2.
@andersonm-ibm, I added you to the Apache Spark contributor group, and SPARK-34990 is assigned to you.

Test build #137894 has finished for PR 32146 at commit

Thank you, @dongjoon-hyun, and thank you for your patient help!
What changes were proposed in this pull request?
A simple test that writes and reads an encrypted Parquet file and verifies that it is encrypted by checking its magic string (in encrypted footer mode), as sketched below.
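(For illustration, a hedged sketch of the magic-string check; the file path and helper name are hypothetical. In encrypted footer mode, Parquet files begin with "PARE" instead of the plaintext "PAR1".)

```scala
import java.io.{File, RandomAccessFile}
import java.nio.charset.StandardCharsets

// Read the first four bytes of a Parquet part file and return its magic
// string. A real test would pick a part file from the directory it wrote.
def magicString(parquetFile: File): String = {
  val in = new RandomAccessFile(parquetFile, "r")
  try {
    val magic = new Array[Byte](4)
    in.readFully(magic)
    new String(magic, StandardCharsets.UTF_8)
  } finally {
    in.close()
  }
}

// Hypothetical path; encrypted-footer files should start with "PARE".
assert(magicString(new File("/tmp/encrypted_table/part-00000.parquet")) == "PARE")
```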
Why are the changes needed?
To provide a test coverage for Parquet encryption.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
SBT with Hadoop 3.2 and Java 11, by adding [test-java11] to the PR title. (The Jenkins Java 11 build is broken due to a missing JDK 11 installation.)