PARQUET-1647: [Java][Parquet] Implement FLOAT16 logical type #1142
The CI failures are likely because the addition of the logical type to parquet-format is not yet merged, so that PR branch needs to be manually installed for the build to pass. I'm not sure there's a good solution yet, since this implementation needs to exist before the parquet-format PR can be voted on and merged.
Thanks! This LGTM so far (good work on the conversions), just a few comments.
Thanks for working on this @zhangjiashen!
I think there are some follow-up work items:
- Add a `Float16Statistics` as suggested by @benibus. Internally it leverages the `BINARY_AS_FLOAT16_COMPARATOR` to collect min/max values. Also make sure it is well covered in the unit test.
- Add some e2e tests to make sure statistics are correctly collected by the parquet writer and can be read from the parquet reader.
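To make the first item concrete, here is a minimal Java sketch of min/max collection over float16 values stored as 2-byte arrays. It is not the PR's `Float16Statistics` or `BINARY_AS_FLOAT16_COMPARATOR`: the class `Float16MinMaxSketch` and its methods are hypothetical, the little-endian byte order is assumed from the parquet-format definition of FLOAT16, and skipping NaN values is an assumption of this sketch rather than something decided in this thread.

```java
/** Hypothetical sketch: tracks min/max over float16 values held as 2-byte little-endian arrays. */
final class Float16MinMaxSketch {
  private byte[] min; // null until the first non-NaN value is seen
  private byte[] max;

  /** Decodes IEEE 754 half-precision bits (little-endian, assumed) into a float. */
  static float toFloat(byte[] le) {
    int bits = (le[0] & 0xFF) | ((le[1] & 0xFF) << 8);
    int exponent = (bits >>> 10) & 0x1F;
    int mantissa = bits & 0x3FF;
    float magnitude;
    if (exponent == 0) {
      // Zero or subnormal: mantissa * 2^-24.
      magnitude = mantissa * (1.0f / (1 << 24));
    } else if (exponent == 0x1F) {
      // All-ones exponent: infinity when the mantissa is zero, NaN otherwise.
      magnitude = (mantissa == 0) ? Float.POSITIVE_INFINITY : Float.NaN;
    } else {
      // Normal: (1 + mantissa / 1024) * 2^(exponent - 15).
      magnitude = (float) Math.scalb(1.0 + mantissa / 1024.0, exponent - 15);
    }
    return (bits & 0x8000) == 0 ? magnitude : -magnitude;
  }

  /** Compares two encoded values by the number they represent, not by raw bytes. */
  static int compare(byte[] a, byte[] b) {
    return Float.compare(toFloat(a), toFloat(b));
  }

  /** Folds one encoded value into the running min/max. */
  void update(byte[] value) {
    if (Float.isNaN(toFloat(value))) {
      return; // assumption of this sketch: NaN does not contribute to min/max
    }
    if (min == null || compare(value, min) < 0) min = value.clone();
    if (max == null || compare(value, max) > 0) max = value.clone();
  }

  byte[] min() { return min; }
  byte[] max() { return max; }
}
```

An unsigned byte-wise comparison would order -2.0 (bytes `0x00 0xC0`) after 1.0 (bytes `0x00 0x3C`), which is why the comparison has to go through the decoded value.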
I agree that this is not elegant. For now, we can only review the functionality and completeness of the current implementation. Once the format change is submitted (and released), we need to take another pass on this PR.
Agreed! We can merge the PR first once this diff is ready.
45a1ae2 to c170a47
I will take a look later this week. cc @shangxinli @gszadovszky
Thanks for addressing the comments quickly! I have left some comments, mainly on the point that we should only use the FIXED_LENGTH_BYTE_ARRAY type, not the BINARY type.
Other than that, we are missing an E2E test case. Please follow what https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestParquetWriterTruncation.java does: write some float16 values across several row groups and check that the statistics returned by the reader are correct.
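On the first point, a short sketch of why a fixed 2-byte representation is a natural fit: a float16 value always occupies exactly 16 bits, so a fixed-length array of length 2 can be validated up front, which a variable-length binary does not guarantee. The helper below is illustrative only. `Float16BytesSketch`, `toBytes`, and `toBits` are hypothetical names, not code from this PR, and the little-endian layout is assumed from the parquet-format definition of FLOAT16.

```java
/** Hypothetical sketch: packs and unpacks float16 bits as a fixed 2-byte little-endian array. */
final class Float16BytesSketch {
  static final int LENGTH = 2; // a float16 is always exactly two bytes

  /** Packs raw half-precision bits into two bytes, least-significant byte first (assumed order). */
  static byte[] toBytes(short bits) {
    return new byte[] {(byte) (bits & 0xFF), (byte) ((bits >>> 8) & 0xFF)};
  }

  /** Unpacks two bytes into raw half-precision bits, rejecting any other length. */
  static short toBits(byte[] value) {
    if (value == null || value.length != LENGTH) {
      throw new IllegalArgumentException("float16 must be exactly " + LENGTH + " bytes");
    }
    return (short) ((value[0] & 0xFF) | ((value[1] & 0xFF) << 8));
  }
}
```

With this layout, 1.0 (bits `0x3C00`) is stored as the bytes `0x00 0x3C`.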
3dd25a9 to efe5d34
@wgtmac @benibus @gszadovszky Thanks a lot for reviewing this PR and providing good suggestions. Most of them are addressed; could you please take another look in case I missed any? If we think this PR is ready, should we first merge the parquet-format PR and release a library version for this PR to depend on? CC: @shangxinli
I haven't seen the e2e test yet. Please check https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestParquetWriterTruncation.java for reference.
9c43d6c to d64761e
Thanks @zhangjiashen!
+1
apache/parquet-format#184 is merged. Could you try to set
@wgtmac, I don't think we automatically deploy snapshot versions. And, we will need a final release of parquet-format anyway, before we can get this one merged.
e3738f8 to 447dd5b
OK, then let's wait until format v2.10 is released. Once two PoC implementations of apache/parquet-format#197 have been finished, I will kick off the release process.
@zhangjiashen This can be rebased to adopt parquet-format 2.10.0
447dd5b to 2a7ecf6
@wgtmac I just rebased onto the master branch. Could you please take a look when you get a chance?
8e77307 to 323899e
Could you please rebase it?
96f40e2 to 3a2d06d
Rebased, can you help merge this PR?
3a2d06d to caa030c
+1
Thanks @zhangjiashen for working on this!
BTW, it would be good to add an interoperability test (as a follow-up PR) to read parquet files from here: apache/parquet-testing@da467da. You may want to take a look at this example: https://github.com/apache/parquet-mr/blob/44b56225be6fe7b74667f4f2430326ef1f076cc5/parquet-hadoop/src/test/java/org/apache/parquet/hadoop/codec/TestInteropReadLz4RawCodec.java#L40
Will merge this by the end of this week if there is no objection.
Thanks @zhangjiashen for your persistence on this! In case I didn't already make this clear, there were a lot of eyes on this project and your contributions were crucial to its success - so thank you for stepping up here.
Merged this. Thanks @zhangjiashen and everyone!
Is it planned to add the interoperability test mentioned in #1142 (comment)?
I think so. @pitrou
I plan to add some interoperability tests.
Added interop tests in #1235
This is to implement logical type Float16.