PARQUET-2037: Write INT96 with parquet-avro #901

gszadovszky · 2021-05-04T12:46:20Z

Make sure you have checked all steps below.

Jira

My PR addresses the following Parquet Jira issues and references them in the PR title. For example, "PARQUET-1234: My Parquet PR"
- https://issues.apache.org/jira/browse/PARQUET-XXX
- In case you are adding a dependency, check if the license complies with the ASF 3rd Party License Policy.

Tests

My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
1. Subject is separated from body by a blank line
2. Subject is limited to 50 characters (not including Jira issue reference)
3. Subject does not end with a period
4. Subject uses the imperative mood ("add", not "adding")
5. Body wraps at 72 characters
6. Body explains "what" and "why", not "how"

Documentation

In case of new functionality, my PR adds documentation that describes how to use it.
- All the public functions and classes in the PR contain Javadoc that explains what they do

shangxinli · 2021-05-05T04:08:22Z

In the ticket, you mentioned two ways to solve this issue, and I see you implemented the second one. What was the reason behind that choice? I don't favor one over the other; I just want to know the pros and cons of each.

gszadovszky · 2021-05-10T08:19:02Z

@shangxinli,
The other option was to use the doc field of the Avro schema field. That would be a misuse, since anyone can put anything in that field: it is meant to document the field for humans, not to drive decisions in the code. At the Parquet sync it was my only idea for resolving this issue, and even I did not like it.
I think the 2nd option (covered by this PR) is much better, but I only realized it was possible after the Parquet sync meeting, so we could not discuss it in person.

shangxinli · 2021-05-10T14:47:15Z

Yeah, agreed. One concern: for a deeply nested schema, it might not be straightforward for the user to know the exact path to set manually in the configuration. The last time I worked on an Avro schema, the field was nested 20+ layers deep, with 'Type' elements in the middle that carried a 'name'. That is not very human-readable and makes mistakes easy. But I have less experience working with schemas, so I am not certain this is a real issue.
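To make the concern about hand-written paths concrete, here is a minimal sketch of how such paths could be derived mechanically instead of typed by hand. It rests on assumptions not stated in this thread: that the configuration value is a comma-separated list of dot-separated field names, and that the fields in question are 12-byte fixed types. The class `Int96PathSketch`, its methods, and the `Map`-based schema model are all hypothetical stand-ins; this does not use the real Avro `Schema` API or any parquet-avro class.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for a nested Avro record schema:
// each field maps either to a nested record (a Map) or to a leaf type name (a String).
final class Int96PathSketch {

    // Leaf marker used by this sketch for a 12-byte fixed type (naming is an assumption).
    static final String FIXED_12 = "fixed(12)";

    // Returns the dot-separated paths of all fields whose leaf type is FIXED_12,
    // in the order the fields are declared.
    static List<String> fixed12Paths(Map<String, Object> record) {
        List<String> out = new ArrayList<>();
        collect(record, "", out);
        return out;
    }

    // Depth-first walk that extends the path prefix at every nesting level.
    private static void collect(Map<String, Object> record, String prefix, List<String> out) {
        for (Map.Entry<String, Object> field : record.entrySet()) {
            String path = prefix.isEmpty() ? field.getKey() : prefix + "." + field.getKey();
            Object type = field.getValue();
            if (type instanceof Map) {
                @SuppressWarnings("unchecked")
                Map<String, Object> nested = (Map<String, Object>) type;
                collect(nested, path, out);
            } else if (FIXED_12.equals(type)) {
                out.add(path);
            }
        }
    }

    // Joins the paths into one comma-separated value (the assumed configuration format).
    static String toConfigValue(List<String> paths) {
        return String.join(",", paths);
    }

    public static void main(String[] args) {
        Map<String, Object> event = new LinkedHashMap<>();
        event.put("ts", FIXED_12);
        event.put("name", "string");

        Map<String, Object> root = new LinkedHashMap<>();
        root.put("id", "long");
        root.put("event", event);
        root.put("created", FIXED_12);

        System.out.println(toConfigValue(fixed12Paths(root)));
    }
}
```

A real helper would walk the actual Avro schema, including unions, arrays, and maps, which this sketch deliberately leaves out.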

gszadovszky · 2021-05-10T15:44:49Z

@shangxinli, I agree this is not a perfect solution, but I could not come up with a better one. Also, this feature will not be widely used, since INT96 is deprecated. Maybe it is even better that this feature is not always easy to use :)

shangxinli · 2021-05-10T16:49:58Z

LGTM

* 'master' of https://github.com/apache/parquet-mr: (222 commits)
  PARQUET-2052: Integer overflow when writing huge binary using dictionary encoding (apache#910)
  PARQUET-2041: Add zstd to `parquet.compression` description of ParquetOutputFormat Javadoc (apache#899)
  PARQUET-2050: Expose repetition & definition level from ColumnIO (apache#908)
  PARQUET-1761: Lower Logging Level in ParquetOutputFormat (apache#745)
  PARQUET-2046: Upgrade Apache POM to 23 (apache#904)
  PARQUET-2048: Deprecate BaseRecordReader (apache#906)
  PARQUET-1922: Deprecate IOExceptionUtils (apache#825)
  PARQUET-2037: Write INT96 with parquet-avro (apache#901)
  PARQUET-2044: Enable ZSTD buffer pool by default (apache#903)
  PARQUET-2038: Upgrade Jackson version used in parquet encryption. (apache#898)
  Revert "[WIP] Refactor GroupReadSupport to unuse deprecated api (apache#894)"
  PARQUET-2027: Fix calculating directory offset for merge (apache#896)
  [WIP] Refactor GroupReadSupport to unuse deprecated api (apache#894)
  PARQUET-2030: Expose page size row check configurations to ParquetWriter.Builder (apache#895)
  PARQUET-2031: Upgrade to parquet-format 2.9.0 (apache#897)
  PARQUET-1448: Review of ParquetFileReader (apache#892)
  PARQUET-2020: Remove deprecated modules (apache#888)
  PARQUET-2025: Update Snappy version to 1.1.8.3 (apache#893)
  PARQUET-2022: ZstdDecompressorStream should close `zstdInputStream` (apache#889)
  PARQUET-1982: Random access to row groups in ParquetFileReader (apache#871)
  ...

# Conflicts:
#   parquet-column/src/main/java/org/apache/parquet/example/data/simple/SimpleGroup.java
#   parquet-hadoop/pom.xml
#   parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java
#   parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java

PARQUET-2037: Write INT96 with parquet-avro

0bc5bb2

gszadovszky merged commit c72862b into apache:master May 12, 2021

wgzhao mentioned this pull request Oct 28, 2021

[Bug]: hdfsreader shows "INT96 is deprecated. As interim enable READ_INT96_AS_FIXED flag to read as byte array." when reading parquet wgzhao/Addax#422

Closed
