Conversation

@Fokko Fokko commented Apr 29, 2023

Make sure you have checked all steps below.

Complements the work of #951. This PR removes the breaking change that made Parquet incompatible with Hadoop <2.9. With this change, we dynamically check whether the hasCapabilities method is available and use it if so; otherwise we fall back to the previous implementation, extended so that it also checks the wrapped streams.
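The probe-then-fallback pattern the PR describes can be sketched with plain reflection. The class below is illustrative only, not the Parquet code: it probes for `InputStream.readAllBytes()` (which only exists on Java 9+) the same way the PR probes for `hasCapabilities` (which only exists on Hadoop 2.9+), and falls back to a manual path when the method is absent.

```java
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.lang.reflect.Method;

public class ReflectiveFallback {
  // Resolve the optional method once, at class-load time; null means the
  // runtime is too old and the legacy path must be used instead.
  private static final Method READ_ALL_BYTES = probe();

  private static Method probe() {
    try {
      // Stand-in for FSDataInputStream.hasCapabilities (Hadoop 2.9+):
      // readAllBytes() only exists on Java 9+.
      return InputStream.class.getMethod("readAllBytes");
    } catch (NoSuchMethodException e) {
      return null;
    }
  }

  static byte[] read(InputStream in) throws Exception {
    if (READ_ALL_BYTES != null) {
      // Newer runtime: call the method we found reflectively.
      return (byte[]) READ_ALL_BYTES.invoke(in);
    }
    // Older runtime: fall back to draining the stream manually.
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    int b;
    while ((b = in.read()) != -1) {
      out.write(b);
    }
    return out.toByteArray();
  }
}
```

Resolving the `Method` once in a static initializer mirrors the PR's approach: the cost of the lookup (and of a `NoSuchMethodException` on old runtimes) is paid a single time rather than per stream.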

Jira

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain Javadoc that explain what it does

@Fokko Fokko force-pushed the fd-support-hadoop-2-7-3 branch from c30a45a to e6d948f Compare April 29, 2023 18:54
@Fokko Fokko changed the title Bring back support for Hadoop 2.7.3 PARQUET-2276: Bring back support for Hadoop 2.7.3 Apr 29, 2023
private static final Logger LOG = LoggerFactory.getLogger(HadoopStreams.class);

private static final Class<?> byteBufferReadableClass = getReadableClass();
static final Constructor<SeekableInputStream> h2SeekableConstructor = getH2SeekableConstructor();
Contributor

Rather than using two static methods, you can use DynConstructors instead to make this one expression and reduce error handling code:

private static final DynConstructors.Ctor<SeekableInputStream> h2streamCtor =
    new DynConstructors.Builder(SeekableInputStream.class)
        .impl("org.apache.parquet.hadoop.util.H2SeekableInputStream", FSDataInputStream.class)
        .orNull()
        .build();

...
if (h2streamCtor != null) {
  return h2streamCtor.newInstance(stream);
}

Contributor

Huh, I guess this was how it was before? Nevermind on the refactoring then.

Contributor Author

Yes, I copied most from the old code to avoid refactoring. I think we can greatly simplify it because it was still taking Hadoop1 into account. We still have to check if the wrapped stream is ByteBufferReadable: https://github.com/apache/hadoop/blob/release-2.4.1/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FSDataInputStream.java#L142-L148

The `hasCapabilities` method does the same, but in a more elegant way.

InputStream wrapped = stream.getWrappedStream();
if (wrapped instanceof FSDataInputStream) {
  LOG.debug("Checking on wrapped stream {} of {} whether is ByteBufferReadable", wrapped, stream);
  return isWrappedStreamByteBufferReadableLegacy((FSDataInputStream) wrapped);
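The recursive unwrapping being discussed here can be shown self-contained. The classes below are stand-ins (a `Wrapper` playing the role of FSDataInputStream and a marker interface playing the role of Hadoop's ByteBufferReadable), not the real Hadoop types, but the recursion has the same shape as the legacy check:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class UnwrapSketch {
  // Marker interface standing in for Hadoop's ByteBufferReadable.
  interface ByteBufferReadable {}

  // Stand-in for FSDataInputStream: delegates to another stream and
  // exposes it via getWrappedStream().
  static class Wrapper extends InputStream {
    private final InputStream wrapped;
    Wrapper(InputStream wrapped) { this.wrapped = wrapped; }
    InputStream getWrappedStream() { return wrapped; }
    @Override public int read() throws IOException { return wrapped.read(); }
  }

  // An innermost stream that actually advertises the capability.
  static class ReadableStream extends ByteArrayInputStream implements ByteBufferReadable {
    ReadableStream(byte[] buf) { super(buf); }
  }

  // Mirrors the legacy check: recurse through wrappers until the
  // innermost stream answers the instanceof test.
  static boolean isByteBufferReadable(InputStream stream) {
    if (stream instanceof Wrapper) {
      return isByteBufferReadable(((Wrapper) stream).getWrappedStream());
    }
    return stream instanceof ByteBufferReadable;
  }
}
```

Note that a doubly wrapped stream (a `Wrapper` inside a `Wrapper`) is handled by the recursion, which is exactly the unusual-but-possible case raised below.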
Contributor


Why would an FSDataInputStream have another one inside?

Contributor Author


This came from the issue from Presto: prestodb/presto#17435

Contributor


you can if you try hard, it's just really unusual

you can never wrap an instance by itself.

@rdblue left a comment (Contributor)


I think this should work after comparing it with older code, but it seems like there are some easy improvements to me.

@Fokko Fokko force-pushed the fd-support-hadoop-2-7-3 branch from 2d8b57c to 7822034 Compare May 2, 2023 21:42
@Fokko Fokko force-pushed the fd-support-hadoop-2-7-3 branch from 7822034 to cef627a Compare May 2, 2023 21:49
Fokko commented May 2, 2023

> I think this should work after comparing it with older code, but it seems like there are some easy improvements to me.

@rdblue I agree, I did some cleaning up. Let me know what you think.

Fokko commented May 3, 2023

FWIW, I also ran the Iceberg tests and they ran fine (except the bloom filter ones; more details here).

@wgtmac left a comment (Member)

LGTM. Thanks @Fokko and @rdblue

BTW, is it possible to warn the users somewhere that they are running on a low version of Hadoop and it is time to upgrade?

@steveloughran commented (Contributor)

I repeat my stance on this: to claim Hadoop 2.7 runtime compatibility you should be building against Java 7. If you don't, make clear it's fairly qualified support ("Hadoop 2.7.3 on Java 8 only") and not worry about the bits of Hadoop which break if they try to do that (Kerberos, S3A, anything with joda-time, ...).

* @param stream stream to probe
* @return an H2SeekableInputStream to access, or an H1SeekableInputStream if the stream is not seekable
*/
private static SeekableInputStream isWrappedStreamByteBufferReadableLegacy(FSDataInputStream stream) {
Contributor

Usually, isSomething methods return a boolean. This is unwrapping, so I'd prefer naming it unwrapByteBufferReadableLegacy or something to be more clear.

  LOG.debug("Checking on wrapped stream {} of {} whether is ByteBufferReadable", wrapped, stream);
  return isWrappedStreamByteBufferReadableLegacy((FSDataInputStream) wrapped);
}
if (stream.getWrappedStream() instanceof ByteBufferReadable) {
Contributor

I prefer using the same whitespace conventions as in Iceberg, although that's a bit more relaxed over here.

return null;
}

Boolean hasCapabilities = hasCapabilitiesMethod.invoke(stream, "in:readbytebuffer");
Contributor

Is this a boxed boolean? If so, should we update the check to handle null?

Contributor Author

I was assuming that it needed to be an object, but a primitive works as well, so changed that. Thanks for catching this!
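For reference, the boxing pitfall discussed here is easy to reproduce with plain reflection: `Method.invoke` always returns `Object`, so a `boolean` result comes back boxed, and unboxing it directly in an `if` would throw a `NullPointerException` should the call ever return null. A minimal null-safe pattern (using `String.isEmpty` as an illustrative stand-in for the Hadoop method):

```java
import java.lang.reflect.Method;

public class BoxedBooleanCheck {
  // invoke() returns Object; a primitive boolean result comes back
  // boxed as a Boolean, which in general may be null.
  static boolean invokeBooleanSafely(Object target, String methodName) throws Exception {
    Method m = target.getClass().getMethod(methodName);
    Object result = m.invoke(target);
    // Boolean.TRUE.equals(null) is false, so a null result cannot NPE here.
    return Boolean.TRUE.equals(result);
  }
}
```

Assigning the result to a primitive `boolean` (as the PR ended up doing) is fine precisely when the invoked method's declared return type is the primitive `boolean`, since reflection then always produces a non-null box.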


Boolean hasCapabilities = hasCapabilitiesMethod.invoke(stream, "in:readbytebuffer");

if (hasCapabilities) {
Contributor

This variable would be more clear if it were called isByteBufferReadable since that's the capability we are checking for.

@rdblue left a comment (Contributor)

Merge when you're ready. I think this is correct.

Fokko commented May 5, 2023

Thanks @shangxinli, @wgtmac, @rdblue, and @steveloughran for the review.

Steve, I'm aware of your stance and I respect it. Unfortunately, a lot of companies are still using internally heavily patched versions of Hadoop, and to get traction in downstream projects we still have to maintain compatibility.

@Fokko Fokko merged commit dededb6 into apache:master May 5, 2023
@Fokko Fokko deleted the fd-support-hadoop-2-7-3 branch May 5, 2023 22:14
Fokko added a commit to Fokko/parquet-mr that referenced this pull request May 5, 2023
* Bring back support for Hadoop 2.7.3

* Simplify the code

* Fix the naming

* Comments
Fokko added a commit that referenced this pull request May 9, 2023
* Bring back support for Hadoop 2.7.3

* Simplify the code

* Fix the naming

* Comments