PARQUET-2134: Improve binding to ByteBufferReadable #971
Conversation
HadoopStreams.wrap makes the wrong H2SeekableInputStream binding decision if the passed-in FSDataInputStream wraps another FSDataInputStream.

This extends apache#951. Since [HDFS-14111](https://issues.apache.org/jira/browse/HDFS-14111), all input streams in the Hadoop codebase which implement `ByteBufferReadable` return true on the StreamCapabilities probe `stream.hasCapability("in:readbytebuffer")`; those which don't are forbidden to do so. This means that on Hadoop 3.3.0+ the preferred way to probe for the API is to ask the stream. The StreamCapabilities probe itself was added in Hadoop 2.9. Along with making all use of `ByteBufferReadable` non-reflective, this makes the checks fairly straightforward. Tests verify that if a stream implements `ByteBufferReadable` then it will be bound to H2SeekableInputStream, even if multiply wrapped by FSDataInputStreams, and that if it doesn't, it won't.
```java
 * @return true if it is safe to use a H2SeekableInputStream to access the data
 */
private static boolean isWrappedStreamByteBufferReadable(FSDataInputStream stream) {
  if (stream.hasCapability("in:readbytebuffer")) {
```
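For orientation, the rest of the probe presumably follows the recursive pattern the description outlines. The following is a hedged reconstruction from the PR text, not the exact diff; imports from `org.apache.hadoop.fs` and `java.io.InputStream` are assumed:

```java
// Reconstructed sketch: trust a positive StreamCapabilities answer,
// otherwise unwrap nested FSDataInputStreams and fall back to an
// instanceof check on the innermost stream.
private static boolean isWrappedStreamByteBufferReadable(FSDataInputStream stream) {
  if (stream.hasCapability("in:readbytebuffer")) {
    // the stream declares that it, and whatever it wraps,
    // implements read(ByteBuffer)
    return true;
  }
  InputStream wrapped = stream.getWrappedStream();
  if (wrapped instanceof FSDataInputStream) {
    // recurse: an FSDataInputStream may wrap another FSDataInputStream
    return isWrappedStreamByteBufferReadable((FSDataInputStream) wrapped);
  }
  return wrapped instanceof ByteBufferReadable;
}
```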
It appears this is a relatively new method, added in https://issues.apache.org/jira/browse/HADOOP-15012 (Hadoop 2.10.0, 2.9.1, 3.1.0 and 3.0.1). Should we care about older provided Hadoop versions?
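If pre-HADOOP-15012 Hadoop releases did have to be supported, one defensive option (purely illustrative, not something this PR does; the helper name is hypothetical) would be to trap the linkage error raised when the method is missing at runtime:

```java
// Hypothetical guard: code compiled against a Hadoop that has
// hasCapability() but run against an older hadoop-common sees the
// missing method as a NoSuchMethodError, not an exception.
private static boolean safeHasReadByteBufferCapability(FSDataInputStream stream) {
  try {
    return stream.hasCapability("in:readbytebuffer");
  } catch (NoSuchMethodError e) {
    // pre-HADOOP-15012 hadoop-common on the classpath
    return false;
  }
}
```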
If you are targeting the older Hadoop releases, you'd also need to build Java 7 artifacts. Does anyone want to do that?
Personally I'm in favor of moving on and adopting the new APIs, especially if we are going to depend on Hadoop 3 features more. Maybe we can call the next Parquet release 1.13.0 and declare that it's no longer compatible with older Hadoop versions?
cc @shangxinli
That would be nice. Do that, and the library we are working on to help give 3.2+ apps access to the higher-performance cloud storage APIs, when available, would be great.
Let's be careful about introducing incompatibility; Hadoop is a fundamental dependency for Parquet.
```java
  }
}

@SuppressWarnings("unchecked")
```
I don't understand why Parquet needs to use reflection to look up a class defined by itself.
I believe it's because of the transitive dependencies.
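For background on the pattern being questioned: reflective lookup is usually reserved for classes that may be absent from the runtime classpath, such as ones supplied only transitively. A generic, hedged illustration of that pattern (the helper and the class name passed to it are examples, not this PR's code):

```java
// Generic illustration: probe the classpath without creating a hard
// compile-time link, so a missing transitive dependency degrades to
// "false" instead of throwing NoClassDefFoundError.
static boolean isClassAvailable(String className) {
  try {
    Class.forName(className, false, Thread.currentThread().getContextClassLoader());
    return true;
  } catch (ClassNotFoundException e) {
    return false;
  }
}
```

For example, `isClassAvailable("org.apache.hadoop.fs.ByteBufferReadable")` reports whether the interface can be linked at all before any `instanceof` check is attempted.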
This PR is combined with #951.
This extends #951.

It improves binding to streams which implement ByteBufferReadable through recursive probes of wrapped streams and direct querying of the stream on Hadoop 3.3.0+.

Since HDFS-14111, all input streams in the hadoop codebase which implement ByteBufferReadable return true on the StreamCapabilities probe hasCapability("in:readbytebuffer"). This means the best way to probe for the API on those versions is to ask the stream.

The StreamCapabilities probe was added in Hadoop 2.9. Along with making all use of ByteBufferReadable non-reflective, this makes the checks fairly straightforward.

The recursive check is from #951; the change is that it no longer needs to use reflection.

Tests verify that if a stream implements `ByteBufferReadable` then it will be bound to H2SeekableInputStream, even if multiply wrapped by FSDataInputStreams, and that if it doesn't, it won't.
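A minimal sketch of what such a test could look like; `MockByteBufferReadableStream` is a hypothetical test double (implementing `Seekable`, `PositionedReadable` and `ByteBufferReadable`), and the class-name assertion is illustrative rather than the PR's actual test code:

```java
import static org.junit.Assert.assertEquals;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.parquet.hadoop.util.HadoopStreams;
import org.apache.parquet.io.SeekableInputStream;
import org.junit.Test;

public class TestHadoopStreamsBinding {
  @Test
  public void testDoublyWrappedByteBufferReadable() throws Exception {
    // hypothetical stream implementing Seekable, PositionedReadable
    // and ByteBufferReadable
    FSDataInputStream inner = new FSDataInputStream(new MockByteBufferReadableStream());
    // wrap the FSDataInputStream in another FSDataInputStream
    FSDataInputStream outer = new FSDataInputStream(inner);
    SeekableInputStream wrapped = HadoopStreams.wrap(outer);
    // the recursive probe should still detect ByteBufferReadable
    assertEquals("H2SeekableInputStream", wrapped.getClass().getSimpleName());
  }
}
```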