Skip to content

Conversation

@pvary
Copy link
Contributor

@pvary pvary commented Nov 7, 2025

The EncryptedOutputFile objects generated by the StandardEncryptionManager.encrypt method hide the underlying OutputFile.

Unfortunately, there are some hidden requirements in the Parquet implementation for encryption to work properly.
Parquet encryption was only functioning when the target was a HadoopOutputFile. Using StandardEncryptedOutputFile.encryptingOutputFile() produced an AesGcmOutputFile, which resulted in corrupt files.

Updated StandardEncryptedOutputFile to return itself as the encryptingOutputFile, instead of exposing the underlying AesGcmOutputFile. The methods coming from the OutputFile are now wrapped and return the values directly from the wrapped output file.

Added a test to highlight the issue.

The test currently fails without the patch, but successful if we change:

        Parquet.write(encryptedOutputFile.encryptingOutputFile())

to

        Parquet.write(encryptedOutputFile)

After the patch both OutputFile objects could be used.

@pvary pvary force-pushed the stand_encry_fix branch 4 times, most recently from 8d5b0ef to 322292d Compare November 8, 2025 06:14
…ncryptionManager.ecrypt().encryptingOutputFile()
NativeEncryptionKeyMetadata keyMetadata =
((NativeEncryptionOutputFile) encryptedOutputFile).keyMetadata();
FileAppender<GenericRecord> writer =
Parquet.write(encryptedOutputFile.encryptingOutputFile())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder why this does not result in double encryption (sending PME to AesGcmOutputStream). Seemingly, encryptedOutputFile.encryptingOutputFile().create() calls new AesGcmOutputFile(..).create()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't fully understand your comment, but if you take a look at the Parquet.write implementation, there are different methods for Encrypted/non-Encrypted files, but all of them ends in using the EncryptedOutputFile.encryptingOutputFile().

With the new implementation the StandardEncryptedOutputFile.encryptingOutputFile() returns the same as StandardEncryptedOutputFile.encryptingOutputFile().encryptingOutputFile(). Parquet uses the OutputStreams, which are also the same, no matter how many times you call the encryptingOutputFile() before getting the OutputStream`

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I'll need to have a deeper look, will play with the code.

@github-actions
Copy link

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Dec 11, 2025
@github-actions
Copy link

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this Dec 19, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants