Conversation

@gszadovszky
Contributor

Make sure you have checked all steps below.

Jira

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message" (an example follows this list):
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"
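
For illustration, here is a hypothetical commit message that satisfies all six rules; the Jira ID and the change it describes are invented:

    PARQUET-XXXX: Add null check for column chunk metadata

    Readers crash with a NullPointerException when a chunk has no metadata.
    Skipping such chunks keeps reads working with files from older writers.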

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and classes in the PR contain Javadoc that explains what they do

long origPos = -1;
try {
  origPos = in.getPos();
  in.seek(chunk.getStartingPos());
Contributor

Do we assume the dictionary page is always at the chunk's starting address?

Contributor Author

It is not obvious that one has to look for this statement in the Encoding docs, but it is there:

The dictionary page is written first, before the data pages of the column chunk.
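
To make the pattern concrete, here is a minimal sketch of reading the dictionary page header by seeking to the chunk start. It assumes the parquet-format PageHeader/Util API and parquet-mr's SeekableInputStream and ColumnChunkMetaData; it is not the actual ParquetFileReader code, and the class and method names are made up:

import java.io.IOException;
import org.apache.parquet.format.PageHeader;
import org.apache.parquet.format.PageType;
import org.apache.parquet.format.Util;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.io.SeekableInputStream;

// Sketch only: relies on the spec guarantee quoted above that the dictionary
// page, if present, is written before the data pages of the column chunk.
class DictionaryPageProbe {
  static PageHeader readDictionaryPageHeader(SeekableInputStream in, ColumnChunkMetaData chunk)
      throws IOException {
    long origPos = in.getPos();          // remember where the caller left off
    try {
      in.seek(chunk.getStartingPos());   // dictionary page sits at the chunk start
      PageHeader header = Util.readPageHeader(in);
      // Only treat it as a dictionary page if the header actually says so.
      return header.getType() == PageType.DICTIONARY_PAGE ? header : null;
    } finally {
      in.seek(origPos);                  // restore the position for later reads
    }
  }
}

Checking the page header type, rather than trusting the position alone, keeps such a reader honest about the assumption.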

Contributor

I know it is true today, but what if that assumption is broken as more and more page types are added? Can we add something to the Encoding docs so that people do not change that assumption?

Contributor Author

I agree it should be specified more clearly, and maybe not only in the Encoding doc but also somewhere on the "main" page, but I feel that is a separate topic.

Contributor

Can you create a Jira for it, @gszadovszky, so that we don't lose track of it?

Other than that, LGTM!

Contributor Author

Sure, @shangxinli. Check out PARQUET-2034 for details.

Contributor

Thanks @gszadovszky

@shangxinli shangxinli merged commit 2ce35c7 into apache:master Apr 23, 2021
gszadovszky added a commit that referenced this pull request Apr 26, 2021
elikkatz added a commit to TheWeatherCompany/parquet-mr that referenced this pull request Jun 2, 2021
* 'master' of https://github.com/apache/parquet-mr: (222 commits)
  PARQUET-2052: Integer overflow when writing huge binary using dictionary encoding (apache#910)
  PARQUET-2041: Add zstd to `parquet.compression` description of ParquetOutputFormat Javadoc (apache#899)
  PARQUET-2050: Expose repetition & definition level from ColumnIO (apache#908)
  PARQUET-1761: Lower Logging Level in ParquetOutputFormat (apache#745)
  PARQUET-2046: Upgrade Apache POM to 23 (apache#904)
  PARQUET-2048: Deprecate BaseRecordReader (apache#906)
  PARQUET-1922: Deprecate IOExceptionUtils (apache#825)
  PARQUET-2037: Write INT96 with parquet-avro (apache#901)
  PARQUET-2044: Enable ZSTD buffer pool by default (apache#903)
  PARQUET-2038: Upgrade Jackson version used in parquet encryption. (apache#898)
  Revert "[WIP] Refactor GroupReadSupport to unuse deprecated api (apache#894)"
  PARQUET-2027: Fix calculating directory offset for merge (apache#896)
  [WIP] Refactor GroupReadSupport to unuse deprecated api (apache#894)
  PARQUET-2030: Expose page size row check configurations to ParquetWriter.Builder (apache#895)
  PARQUET-2031: Upgrade to parquet-format 2.9.0 (apache#897)
  PARQUET-1448: Review of ParquetFileReader (apache#892)
  PARQUET-2020: Remove deprecated modules (apache#888)
  PARQUET-2025: Update Snappy version to 1.1.8.3 (apache#893)
  PARQUET-2022: ZstdDecompressorStream should close `zstdInputStream` (apache#889)
  PARQUET-1982: Random access to row groups in ParquetFileReader (apache#871)
  ...

# Conflicts:
#	parquet-column/src/main/java/org/apache/parquet/example/data/simple/SimpleGroup.java
#	parquet-hadoop/pom.xml
#	parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java
#	parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java