
Conversation

@steveloughran
Contributor

This extends #951

It improves binding to streams which implement
ByteBufferReadable through recursive probes of wrapped
streams and direct querying of the stream on Hadoop 3.3.0+.

Since HDFS-14111, all input streams in the Hadoop codebase
which implement ByteBufferReadable return true on the
StreamCapabilities probe hasCapability("in:readbytebuffer").

This means the best way to probe for the API on those versions
is to ask the stream.

The StreamCapabilities probe was added in Hadoop 2.9. Along with
making all use of ByteBufferReadable non-reflective, this makes
the checks fairly straightforward.

The recursive check is from #951; the change is that it no
longer needs to use reflection.

Tests verify that if a stream implements `ByteBufferReadable` then
it will be bound to H2SeekableInputStream, even if multiply wrapped
by FSDataInputStreams, and that if it doesn't, it won't.

Jira

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain Javadoc that explain what it does

7c00 and others added 2 commits March 24, 2022 11:21
HadoopStreams.wrap produces a wrong H2SeekableInputStream if the
passed-in FSDataInputStream wraps another FSDataInputStream.
This extends apache#951

Since [HDFS-14111](https://issues.apache.org/jira/browse/HDFS-14111), all
input streams in the Hadoop codebase which implement `ByteBufferReadable`
return true on the StreamCapabilities probe
`stream.hasCapability("in:readbytebuffer")`;
those which don't are forbidden to do so.

This means that on Hadoop 3.3.0+ the preferred way to probe for the API
is to ask the stream.

The StreamCapabilities probe was added in Hadoop 2.9. Along with
making all use of `ByteBufferReadable` non-reflective, this makes
the checks fairly straightforward.

Tests verify that if a stream implements `ByteBufferReadable` then
it will be bound to H2SeekableInputStream, even if multiply wrapped
by FSDataInputStreams, and that if it doesn't, it won't.
 * @return true if it is safe to use an H2SeekableInputStream to access the data
 */
private static boolean isWrappedStreamByteBufferReadable(FSDataInputStream stream) {
  if (stream.hasCapability("in:readbytebuffer")) {
Member

it appears this is a relatively new method added in https://issues.apache.org/jira/browse/HADOOP-15012 (Hadoop 2.10.0, 2.9.1, 3.1.0 and 3.0.1). Should we care about older provided Hadoop versions?
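
Purely illustrative and not part of the PR: if older Hadoop releases did have to be
tolerated, the probe could guard against the method being missing at runtime, along
these lines:

```java
// Hypothetical guard, only needed when running against Hadoop releases older
// than 2.9.1 / 3.0.1, where FSDataInputStream.hasCapability() does not exist.
static boolean safeHasCapability(FSDataInputStream stream, String capability) {
  try {
    return stream.hasCapability(capability);
  } catch (NoSuchMethodError e) {
    // Old Hadoop jar on the classpath: the capability probe is unavailable.
    return false;
  }
}
```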

Contributor Author

If you are targeting the older Hadoop releases, you'd also need to build Java 7 artifacts. Does anyone want to do that?

Member

Personally I'm in favor of moving on and adopting the new APIs, especially if we are going to depend on Hadoop 3 features more. Maybe we can call the next Parquet release 1.13.0 and declare that it's no longer compatible with older Hadoop versions?

cc @shangxinli

Contributor Author

That would be nice. Do that, and the library we are working on to help give 3.2+ apps access to the higher-performance cloud storage APIs when available would be great.

Contributor

Let's be careful about introducing incompatibility; Hadoop is a fundamental dependency for Parquet.

}
}

@SuppressWarnings("unchecked")
Member

I don't understand why Parquet needs to use reflection to look up a class defined by itself.

Contributor Author

I believe it's because of the transitive dependencies;
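
For context, a minimal sketch of the reflective pattern in question, with illustrative
names that may not match the actual code; looking the class up by name defers linkage
against the Hadoop types it references until runtime:

```java
// Hypothetical sketch: binding to the ByteBuffer-capable stream wrapper via
// reflection, so that classes referencing Hadoop-2-only types are only loaded
// when the (transitively provided) Hadoop version actually supplies them.
static SeekableInputStream reflectiveH2Stream(FSDataInputStream stream) throws Exception {
  Class<?> cls = Class.forName("org.apache.parquet.hadoop.util.H2SeekableInputStream");
  return (SeekableInputStream) cls.getConstructor(FSDataInputStream.class).newInstance(stream);
}
```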

@shangxinli
Contributor

This PR is combined with #951.
