Parquet-1872: Add TransCompression command to parquet-tools #796

Merged: 3 commits, Jun 23, 2020

Conversation

shangxinli

Make sure you have checked all steps below.

Jira

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain Javadoc that explain what it does

@gszadovszky gszadovszky left a comment

There are a lot of code duplications between cli and tools. I would suggest adding these to parquet-hadoop as a utility for this functionality and use it from the two command line tools. The unit test also can be placed in parquet-hadoop so you write it once only. (You may keep simple unit tests at the tools side to verify the tool itself but the functionality shall be verified in the module where it is implemented.)

@shangxinli

shangxinli commented Jun 20, 2020

There are a lot of code duplications between cli and tools. I would suggest adding these to parquet-hadoop as a utility for this functionality and use it from the two command line tools. The unit test also can be placed in parquet-hadoop so you write it once only. (You may keep simple unit tests at the tools side to verify the tool itself but the functionality shall be verified in the module where it is implemented.)

reply:
I was able to move the core part to parquet-hadoop. However, it is hard to share the test class between parquet-hadoop and parquet-cli because, by default, test classes are not packaged, as this article explains: https://dzone.com/articles/sharing-test-classes-between-multiple-modules-in-a. So I cannot reuse the test methods from the parquet-hadoop tests, and we would end up with implementations in three places. The two approaches I see are listed below. Let me know your thoughts.

  1. Keep as is. parquet-tools will be removed later.
  2. Move most of the code into parquet-hadoop and implement the unit tests there. We don't add unit tests in parquet-cli and parquet-tools. Given that the code in parquet-cli/tools would be just a simple wrapper, that should be OK.

My second commit is based on option 2.
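
For illustration, the "shared core plus thin wrappers" layout of the second approach can be sketched as below. This is only a stand-in under stated assumptions: every class and method name is hypothetical, nothing here is parquet-mr API, and the core re-encodes a GZIP byte array as zlib/DEFLATE using `java.util.zip` rather than rewriting a Parquet file. The point is that the two command classes hold no logic of their own, so only the core needs thorough tests.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical shared core (would live in the common module).
// It is the only class that contains conversion logic.
final class TransCompressCore {
  private TransCompressCore() {}

  // Decompresses GZIP input and recompresses it in zlib/DEFLATE format.
  static byte[] gzipToDeflate(byte[] gzipped) {
    ByteArrayOutputStream sink = new ByteArrayOutputStream();
    try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(gzipped));
         OutputStream out = new DeflaterOutputStream(sink)) {
      copy(in, out);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    // The DEFLATE stream is finished when the try block closes it.
    return sink.toByteArray();
  }

  static void copy(InputStream in, OutputStream out) throws IOException {
    byte[] buffer = new byte[4096];
    int n;
    while ((n = in.read(buffer)) != -1) {
      out.write(buffer, 0, n);
    }
  }
}

// Hypothetical stand-in for the parquet-tools command: delegates only.
final class ToolsCommandSketch {
  static byte[] execute(byte[] input) {
    return TransCompressCore.gzipToDeflate(input);
  }
}

// Hypothetical stand-in for the parquet-cli command: delegates only.
final class CliCommandSketch {
  static byte[] run(byte[] input) {
    return TransCompressCore.gzipToDeflate(input);
  }
}

final class SharedCoreDemo {
  // Helper: GZIP-compress a string so the demo has input data.
  static byte[] gzip(String text) {
    ByteArrayOutputStream sink = new ByteArrayOutputStream();
    try (OutputStream out = new GZIPOutputStream(sink)) {
      out.write(text.getBytes(StandardCharsets.UTF_8));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return sink.toByteArray();
  }

  // Helper: decompress zlib/DEFLATE bytes back into a string.
  static String inflate(byte[] deflated) {
    ByteArrayOutputStream sink = new ByteArrayOutputStream();
    try (InputStream in = new InflaterInputStream(new ByteArrayInputStream(deflated))) {
      TransCompressCore.copy(in, sink);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return new String(sink.toByteArray(), StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    byte[] gz = gzip("hello parquet");
    System.out.println(inflate(ToolsCommandSketch.execute(gz)));
    System.out.println(inflate(CliCommandSketch.run(gz)));
  }
}
```

Because both wrappers are one-line delegations, a round-trip test against the core covers the behavior that either tool exposes.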

@@ -53,7 +53,7 @@ public void readFully(byte[] bytes) throws IOException {

 @Override
 public void readFully(byte[] bytes, int start, int len) throws IOException {
-  stream.readFully(bytes);
+  stream.readFully(bytes, start, len);

Good catch!
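
To make the effect of that one-line change concrete, here is a minimal sketch, assuming a wrapper around `java.io.DataInputStream`. The wrapper class and its method names are hypothetical and are not the class touched by this PR; only `DataInputStream.readFully` is a real API. The buggy variant drops `start` and `len` and fills the whole buffer, while the fixed variant writes only the requested slice.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Hypothetical wrapper that mirrors the shape of the method in the diff.
final class DelegatingStreamSketch {
  private final DataInputStream stream;

  DelegatingStreamSketch(byte[] data) {
    this.stream = new DataInputStream(new ByteArrayInputStream(data));
  }

  // Buggy variant: ignores start and len, so it reads bytes.length bytes
  // into the buffer starting at index 0.
  void readFullyBuggy(byte[] bytes, int start, int len) throws IOException {
    stream.readFully(bytes);
  }

  // Fixed variant: forwards start and len to the underlying stream.
  void readFullyFixed(byte[] bytes, int start, int len) throws IOException {
    stream.readFully(bytes, start, len);
  }
}

final class ReadFullyDemo {
  // Reads 2 bytes into positions 1..2 of a 4-byte buffer and returns the buffer.
  static byte[] read(boolean fixed) {
    DelegatingStreamSketch in =
        new DelegatingStreamSketch(new byte[] {1, 2, 3, 4, 5, 6});
    byte[] buffer = new byte[4];
    try {
      if (fixed) {
        in.readFullyFixed(buffer, 1, 2);
      } else {
        in.readFullyBuggy(buffer, 1, 2);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return buffer;
  }

  public static void main(String[] args) {
    System.out.println("buggy: " + Arrays.toString(read(false))); // [1, 2, 3, 4]
    System.out.println("fixed: " + Arrays.toString(read(true)));  // [0, 1, 2, 0]
  }
}
```

The buggy variant also consumes more of the stream than the caller asked for, and it throws `EOFException` when fewer than `bytes.length` bytes remain, even if `len` bytes are available.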

@gszadovszky gszadovszky left a comment

One finding is still open (about the logic of retrieving the statistics) and there is one new finding about the data generation in the unit test. Otherwise, it looks great.

@gszadovszky gszadovszky merged commit e4988f3 into apache:master Jun 23, 2020
shangxinli added a commit to shangxinli/parquet-mr that referenced this pull request Aug 9, 2021
Summary:

source branch: prune
conflict: No
commits:
27c2d9625d8cc8375a48b972c202b9ec1f4a3acb
4f2997edbf5e2d67b56b134dcf78ebcb3ec28bc2
fac0f62af5163084abb8b302759cd62fbe477be6

####below message are auto generated by arc diff
Add parquet file diff utility

ParquetFileWriter missing Api for DataPageV2

Parquet-1872: Add TransCompression command to parquet-tools (apache#796)

Reviewers: shangx

Reviewed By: shangx

Differential Revision: https://code.uberinternal.com/D4970793
@asfimport asfimport mentioned this pull request Jun 23, 2024
3 tasks