
Conversation

@steveloughran
Contributor

This extends #951

It improves binding to streams which implement
ByteBufferReadable through recursive probes of wrapped
streams and direct querying of the stream on Hadoop 3.3.0+.

Since HDFS-14111, all input streams in the Hadoop codebase
which implement ByteBufferReadable return true on the
StreamCapabilities probe hasCapability("in:readbytebuffer").

This means the best way to probe for the API on those versions
is to ask the stream.

The StreamCapabilities probe was added in Hadoop 2.9. Along with
making all use of ByteBufferReadable non-reflective, this makes
the checks fairly straightforward.

The recursive check is from #951; the change is that it no
longer needs to use reflection.

Tests verify that if a stream implements `ByteBufferReadable` then
it will be bound to H2SeekableInputStream, even if multiply wrapped
by FSDataInputStreams, and that if it doesn't, it won't.

Jira

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain Javadoc that explain what it does

7c00 and others added 2 commits March 24, 2022 11:21
HadoopStreams.wrap produces a wrong H2SeekableInputStream if the
passed-in FSDataInputStream wraps another FSDataInputStream.
This extends apache#951

Since [HDFS-14111](https://issues.apache.org/jira/browse/HDFS-14111), all
input streams in the Hadoop codebase which implement `ByteBufferReadable`
return true on the StreamCapabilities probe
`stream.hasCapability("in:readbytebuffer")`;
those which don't are forbidden to do so.

This means that on Hadoop 3.3.0+ the preferred way to probe for the API
is to ask the stream.

The StreamCapabilities probe was added in Hadoop 2.9. Along with
making all use of `ByteBufferReadable` non-reflective, this makes
the checks fairly straightforward.

Tests verify that if a stream implements `ByteBufferReadable` then
it will be bound to H2SeekableInputStream, even if multiply wrapped
by FSDataInputStreams, and that if it doesn't, it won't.
 * @return true if it is safe to use an H2SeekableInputStream to access the data
 */
private static boolean isWrappedStreamByteBufferReadable(FSDataInputStream stream) {
  if (stream.hasCapability("in:readbytebuffer")) {
Member

it appears this is a relatively new method added in https://issues.apache.org/jira/browse/HADOOP-15012 (Hadoop 2.10.0, 2.9.1, 3.1.0 and 3.0.1). Should we care about older provided Hadoop versions?
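
Purely illustrative and not part of the PR: if older Hadoop releases did have to be
tolerated, the probe could guard against the method being missing at runtime, along
these lines:

```java
// Hypothetical guard, only needed when running against Hadoop releases older
// than 2.9.1 / 3.0.1, where FSDataInputStream.hasCapability() does not exist.
static boolean safeHasCapability(FSDataInputStream stream, String capability) {
  try {
    return stream.hasCapability(capability);
  } catch (NoSuchMethodError e) {
    // Old Hadoop jar on the classpath: the capability probe is unavailable.
    return false;
  }
}
```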

Contributor Author

If you are targeting the older Hadoop releases, you'd also need to build Java 7 artifacts. Does anyone want to do that?

Member

Personally I'm in favor of moving on and adopting the new APIs, especially if we are going to depend on Hadoop 3 features more. Maybe we can call the next Parquet release 1.13.0 and declare that it's no longer compatible with older Hadoop versions?

cc @shangxinli

Contributor Author

That would be nice. Do that, and the library we are working on to help give 3.2+ apps access to the higher-performance cloud storage APIs when available would be great.

Contributor

Let's be careful about introducing incompatibility; Hadoop is a fundamental dependency for Parquet.

}
}

@SuppressWarnings("unchecked")
Member

I don't understand why Parquet needs to use reflection to look up a class defined by itself.

Contributor Author

I believe it's because of the transitive dependencies;
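
For context, a minimal sketch of the reflective pattern in question, with illustrative
names that may not match the actual code; looking the class up by name defers linkage
against the Hadoop types it references until runtime:

```java
// Hypothetical sketch: binding to the ByteBuffer-capable stream wrapper via
// reflection, so that classes referencing Hadoop-2-only types are only loaded
// when the (transitively provided) Hadoop version actually supplies them.
static SeekableInputStream reflectiveH2Stream(FSDataInputStream stream) throws Exception {
  Class<?> cls = Class.forName("org.apache.parquet.hadoop.util.H2SeekableInputStream");
  return (SeekableInputStream) cls.getConstructor(FSDataInputStream.class).newInstance(stream);
}
```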

@shangxinli
Contributor

This PR is combined with #951.
