Skip to content

Conversation

@Miksu82
Copy link
Contributor

@Miksu82 Miksu82 commented Apr 19, 2021

This MR fixes https://issues.apache.org/jira/browse/PARQUET-2030. It exposes page.size.row.check.max and page.size.row.check.max to ParquetWriter.Builder

Copy link
Contributor

@gszadovszky gszadovszky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this. I have some notes about the tests.
(Note: we do not need to validate the whole feature of the page size check only the fact that the properties have passed properly.)

@Miksu82
Copy link
Contributor Author

Miksu82 commented Apr 20, 2021

We do not need to validate the whole feature of the page size check only the fact that the properties have passed properly

Yeah you are right, that should be in InternalParquetRecordWriter unit tests. I'll try to simplify the tests I've added so that they only test that the property has been set.

@Miksu82
Copy link
Contributor Author

Miksu82 commented Apr 20, 2021

Unfortunately I cannot find a way to only assert that properties have been set. It probably could be done with extensive refactoring but I don't see the point to do that in this PR. Also other property setters in the builder are not unit tested so I guess I could just remove these tests completely.

However I did notice that I can assert the number of flushes by counting the number of blocks in the Parquet file footer. Asserting that way would make the test easier to understand.

Let me know what you prefer. Remove the tests or fix the assertations? I'll continue after that.

@gszadovszky
Copy link
Contributor

Yeah, checking only that properties are really passed is not easy. I did not want to say to remove all the tests completely (or any) just that we don't need to test every aspect of the original feature. I think, it is enough to test the difference between setting the properties and not so you know they are passed. You can see that passing such properties are not that obvious so I would keep some test to be sure.

The idea of checking the number of row groups is a good point.

@Miksu82
Copy link
Contributor Author

Miksu82 commented Apr 21, 2021

I now have two tests. One with default params and one with changes params. And I assert by checking the number of blocks in the footer.

I also removed the changes I did for TestOutputFile since they are not needed anymore.

@Miksu82 Miksu82 requested a review from gszadovszky April 21, 2021 08:28
Copy link
Contributor

@gszadovszky gszadovszky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for your efforts!

@gszadovszky gszadovszky merged commit 8e40e69 into apache:master Apr 22, 2021
elikkatz added a commit to TheWeatherCompany/parquet-mr that referenced this pull request Jun 2, 2021
* 'master' of https://github.com/apache/parquet-mr: (222 commits)
  PARQUET-2052: Integer overflow when writing huge binary using dictionary encoding (apache#910)
  PARQUET-2041: Add zstd to `parquet.compression` description of ParquetOutputFormat Javadoc (apache#899)
  PARQUET-2050: Expose repetition & definition level from ColumnIO (apache#908)
  PARQUET-1761: Lower Logging Level in ParquetOutputFormat (apache#745)
  PARQUET-2046: Upgrade Apache POM to 23 (apache#904)
  PARQUET-2048: Deprecate BaseRecordReader (apache#906)
  PARQUET-1922: Deprecate IOExceptionUtils (apache#825)
  PARQUET-2037: Write INT96 with parquet-avro (apache#901)
  PARQUET-2044: Enable ZSTD buffer pool by default (apache#903)
  PARQUET-2038: Upgrade Jackson version used in parquet encryption. (apache#898)
  Revert "[WIP] Refactor GroupReadSupport to unuse deprecated api (apache#894)"
  PARQUET-2027: Fix calculating directory offset for merge (apache#896)
  [WIP] Refactor GroupReadSupport to unuse deprecated api (apache#894)
  PARQUET-2030: Expose page size row check configurations to ParquetWriter.Builder (apache#895)
  PARQUET-2031: Upgrade to parquet-format 2.9.0 (apache#897)
  PARQUET-1448: Review of ParquetFileReader (apache#892)
  PARQUET-2020: Remove deprecated modules (apache#888)
  PARQUET-2025: Update Snappy version to 1.1.8.3 (apache#893)
  PARQUET-2022: ZstdDecompressorStream should close `zstdInputStream` (apache#889)
  PARQUET-1982: Random access to row groups in ParquetFileReader (apache#871)
  ...

# Conflicts:
#	parquet-column/src/main/java/org/apache/parquet/example/data/simple/SimpleGroup.java
#	parquet-hadoop/pom.xml
#	parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java
#	parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants