Skip to content

Conversation

@steveloughran
Copy link
Contributor

This updates Parquet's Hadoop dependency to 3.2.0.
This version adds compatibility with Java 11, as well
as many other features and bug fixes.

Jira

Tests!

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:
    it's a dependency update.

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain Javadoc that explain what it does

This updates Parquet's Hadoop dependency to 3.2.0.
This version adds compatibility with Java 11, as well
as many other features and bug fixes.
@steveloughran
Copy link
Contributor Author

thrift module doesn't compile is using an hadoop internal class tagged as private & which made an incompatible change in hadoop 3. see HADOOP-12436

Error:  Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.8.1:compile (default-compile) on project parquet-thrift: Compilation failure
Error:  /home/runner/work/parquet-mr/parquet-mr/parquet-thrift/src/main/java/org/apache/parquet/thrift/projection/deprecated/PathGlobPattern.java:[55,49] incompatible types: com.google.re2j.Pattern cannot be converted to java.util.regex.Pattern

the good news, the class is deprecated, which explains why nobody has seen it in the wild. Any attempt to use that class would fail with hadoop 3.x on the classpath.

The deprecated parquet-thrift class PathGlobPattern doesn't
compile against hadoop 3.x because in HADOOP-12436 the
nominally private class org.apache.hadoop.fs.GlobPattern
implementation switched from using java.util.regex.Pattern
to com.google.re2j.PatternSyntaxException.

The fact nobody has ever reported this problem implies that it
is never used on any hadoop 3 release, ever.

This commit fixes the build by moving to the google classes.
The alternative strategy would actually be to fork the hadoop
class. This will work unless/until the hadoop project changes
the class again.

It may be time to consider removing entirely. Clearly nobody
is actually using it.
@steveloughran steveloughran changed the title PARQUET-2158:. pgrade Hadoop dependency to version 3.2.0 PARQUET-2158: Upgrade Hadoop dependency to version 3.2.0 Jun 13, 2022
@steveloughran steveloughran marked this pull request as draft June 13, 2022 13:03
Disables the API compatibility check and adds rej2j as a 'provided'
dependency so that the relevant auditing checks do not fail.
@steveloughran steveloughran marked this pull request as ready for review June 13, 2022 14:46
@steveloughran
Copy link
Contributor Author

This PR fixes Parquet to build/link against Hadoop 3.2.0 and higher. It would be cleaner to remove the deprecated class causing compatibility issues -the fact that nobody has ever reported linkage errors implies it is not in active use

<japicmp.version>0.14.2</japicmp.version>
<shade.prefix>shaded.parquet</shade.prefix>
<hadoop.version>2.10.1</hadoop.version>
<hadoop.version>3.2.0</hadoop.version>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm why 3.2.0, not 3.3.1/3.3.2?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for the question

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was being unambitious. move to this, the oldest 3.x release working on java11 ensures that anything else on a version >= to this should link properly.

if you do want to be more current, well, spark is on 3.3.3, hive is trying to move to 3.3.x and I will be doing a 3.3.4 release in a week's time, which is just some security changes mostly of relevance to servers


import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;
import com.google.re2j.Pattern;
Copy link
Member

@sunchao sunchao Jun 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this may not work for projects like Spark who are using Hadoop shaded client, since the GlobPattern.compiled is relocated to org.apache.hadoop.shaded.com.google.re2j.Pattern.

It might be easier to just remove the class as it has been marked as deprecated since Parquet 1.8.0, 2015. It is also not used anywhere in the project.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for cutting. i will update the patch

@steveloughran
Copy link
Contributor Author

i will do a separate PR to remove PathGlobPattern; not this week though.

It is used in DeprecatedFieldProjectionFilter, and that is used in org.apache.parquet.hadoop.thrift.ThriftReadSupport if "parquet.thrift.column.filter" is set. that use would have to be cut and rather than just print a deprecation warning, actually fail.

nobody must be using this on anything with ASF hadoop binaries 3.2+ or they would have complained about linkage errors by now.

@shangxinli
Copy link
Contributor

LGTM

@shangxinli shangxinli merged commit 1e1c383 into apache:master Jul 19, 2022
@steveloughran
Copy link
Contributor Author

thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants