-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Parquet-2030: Expose page size row check configurations to ParquetWri… #895
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
gszadovszky
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this. I have some notes about the tests.
(Note: we do not need to validate the whole feature of the page size check only the fact that the properties have passed properly.)
parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestParquetWriter.java
Outdated
Show resolved
Hide resolved
parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestParquetWriter.java
Outdated
Show resolved
Hide resolved
Yeah you are right, that should be in |
|
Unfortunately I cannot find a way to only assert that properties have been set. It probably could be done with extensive refactoring but I don't see the point to do that in this PR. Also other property setters in the builder are not unit tested so I guess I could just remove these tests completely. However I did notice that I can assert the number of flushes by counting the number of blocks in the Parquet file footer. Asserting that way would make the test easier to understand. Let me know what you prefer. Remove the tests or fix the assertations? I'll continue after that. |
|
Yeah, checking only that properties are really passed is not easy. I did not want to say to remove all the tests completely (or any) just that we don't need to test every aspect of the original feature. I think, it is enough to test the difference between setting the properties and not so you know they are passed. You can see that passing such properties are not that obvious so I would keep some test to be sure. The idea of checking the number of row groups is a good point. |
|
I now have two tests. One with default params and one with changes params. And I assert by checking the number of blocks in the footer. I also removed the changes I did for |
gszadovszky
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for your efforts!
* 'master' of https://github.com/apache/parquet-mr: (222 commits) PARQUET-2052: Integer overflow when writing huge binary using dictionary encoding (apache#910) PARQUET-2041: Add zstd to `parquet.compression` description of ParquetOutputFormat Javadoc (apache#899) PARQUET-2050: Expose repetition & definition level from ColumnIO (apache#908) PARQUET-1761: Lower Logging Level in ParquetOutputFormat (apache#745) PARQUET-2046: Upgrade Apache POM to 23 (apache#904) PARQUET-2048: Deprecate BaseRecordReader (apache#906) PARQUET-1922: Deprecate IOExceptionUtils (apache#825) PARQUET-2037: Write INT96 with parquet-avro (apache#901) PARQUET-2044: Enable ZSTD buffer pool by default (apache#903) PARQUET-2038: Upgrade Jackson version used in parquet encryption. (apache#898) Revert "[WIP] Refactor GroupReadSupport to unuse deprecated api (apache#894)" PARQUET-2027: Fix calculating directory offset for merge (apache#896) [WIP] Refactor GroupReadSupport to unuse deprecated api (apache#894) PARQUET-2030: Expose page size row check configurations to ParquetWriter.Builder (apache#895) PARQUET-2031: Upgrade to parquet-format 2.9.0 (apache#897) PARQUET-1448: Review of ParquetFileReader (apache#892) PARQUET-2020: Remove deprecated modules (apache#888) PARQUET-2025: Update Snappy version to 1.1.8.3 (apache#893) PARQUET-2022: ZstdDecompressorStream should close `zstdInputStream` (apache#889) PARQUET-1982: Random access to row groups in ParquetFileReader (apache#871) ... # Conflicts: # parquet-column/src/main/java/org/apache/parquet/example/data/simple/SimpleGroup.java # parquet-hadoop/pom.xml # parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java # parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java
This MR fixes https://issues.apache.org/jira/browse/PARQUET-2030. It exposes
page.size.row.check.maxandpage.size.row.check.maxtoParquetWriter.Builder