PARQUET-2276: Bring back support for Hadoop 2.7.3 #1084
Conversation
private static final Logger LOG = LoggerFactory.getLogger(HadoopStreams.class);

private static final Class<?> byteBufferReadableClass = getReadableClass();
static final Constructor<SeekableInputStream> h2SeekableConstructor = getH2SeekableConstructor();
Rather than using two static methods, you can use DynConstructors instead to make this one expression and reduce error handling code:

private static final DynConstructors.Ctor<SeekableInputStream> h2streamCtor =
    new DynConstructors.Builder(SeekableInputStream.class)
        .impl("org.apache.parquet.hadoop.util.H2SeekableInputStream", FSDataInputStream.class)
        .orNull()
        .build();
...
if (h2streamCtor != null) {
  return h2streamCtor.newInstance(stream);
}
Looks like you'd need to add the orNull handling. I don't see it here: https://github.com/apache/parquet-mr/blob/master/parquet-common/src/main/java/org/apache/parquet/util/DynConstructors.java
Huh, I guess this was how it was before? Nevermind on the refactoring then.
Yes, I copied most of it from the old code to avoid refactoring. I think we can greatly simplify it because it was still taking Hadoop 1 into account. We still have to check whether the wrapped stream is ByteBufferReadable: https://github.com/apache/hadoop/blob/release-2.4.1/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FSDataInputStream.java#L142-L148
The `hasCapabilities` method does the same, but in a more elegant way.
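The two detection paths being discussed can be modeled in a self-contained sketch. Note that `LegacyStream`, `ModernStream`, and the `ByteBufferReadable` interface below are hypothetical stand-ins for the real Hadoop types, which are not reproduced here:

```java
// Sketch of the two detection paths, assuming stand-in types instead of Hadoop's.
import java.lang.reflect.Method;

interface ByteBufferReadable {} // stand-in for org.apache.hadoop.fs.ByteBufferReadable

class LegacyStream { // models FSDataInputStream on Hadoop < 2.9 (no hasCapability)
  private final Object wrapped;

  LegacyStream(Object wrapped) {
    this.wrapped = wrapped;
  }

  Object getWrappedStream() {
    return wrapped;
  }
}

class ModernStream extends LegacyStream { // models Hadoop >= 2.9 (has hasCapability)
  ModernStream(Object wrapped) {
    super(wrapped);
  }

  public boolean hasCapability(String capability) {
    return "in:readbytebuffer".equals(capability);
  }
}

class CapabilityCheck {
  // Preferred path: call hasCapability reflectively when the method exists.
  static Boolean viaHasCapability(Object stream) {
    try {
      Method m = stream.getClass().getMethod("hasCapability", String.class);
      return (Boolean) m.invoke(stream, "in:readbytebuffer");
    } catch (ReflectiveOperationException e) {
      return null; // method absent: the caller must use the legacy fallback
    }
  }

  // Legacy fallback: inspect the wrapped stream directly.
  static boolean viaInstanceof(LegacyStream stream) {
    return stream.getWrappedStream() instanceof ByteBufferReadable;
  }

  public static void main(String[] args) {
    System.out.println(viaHasCapability(new LegacyStream(new Object()))); // null
    System.out.println(viaInstanceof(new LegacyStream(new ByteBufferReadable() {}))); // true
    System.out.println(viaHasCapability(new ModernStream(new Object()))); // true
  }
}
```

When `hasCapability` is absent, `viaHasCapability` returns null rather than throwing, so the caller can decide to fall back to the instanceof probe.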
InputStream wrapped = stream.getWrappedStream();
if (wrapped instanceof FSDataInputStream) {
  LOG.debug("Checking on wrapped stream {} of {} whether is ByteBufferReadable", wrapped, stream);
  return isWrappedStreamByteBufferReadableLegacy((FSDataInputStream) wrapped);
Why would an FSDataInputStream have another one inside?
This came from the issue from Presto: prestodb/presto#17435
You can if you try hard; it's just really unusual. You can never wrap an instance by itself.
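The unwrap-and-recurse logic being reviewed can be sketched in a self-contained way. `Wrapper` and `Readable` below are hypothetical stand-ins for FSDataInputStream and ByteBufferReadable, not the real Hadoop types:

```java
// Minimal sketch of recursing through nested stream wrappers (stand-in types).
interface Readable {}

class Wrapper {
  private final Object inner;

  Wrapper(Object inner) {
    this.inner = inner;
  }

  Object getWrappedStream() {
    return inner;
  }
}

class Unwrap {
  // Walk down getWrappedStream() until the innermost stream is reached, then
  // test that stream for the capability interface.
  static boolean isByteBufferReadable(Wrapper stream) {
    Object wrapped = stream.getWrappedStream();
    if (wrapped instanceof Wrapper) {
      // A wrapper can never contain itself, so this recursion terminates.
      return isByteBufferReadable((Wrapper) wrapped);
    }
    return wrapped instanceof Readable;
  }

  public static void main(String[] args) {
    Wrapper doubleWrapped = new Wrapper(new Wrapper(new Readable() {}));
    System.out.println(isByteBufferReadable(doubleWrapped)); // true
    System.out.println(isByteBufferReadable(new Wrapper(new Object()))); // false
  }
}
```

This mirrors the double-wrapping scenario from the linked Presto issue: the capability check must look at the innermost stream, not the outer wrapper.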
rdblue left a comment
I think this should work after comparing it with older code, but it seems like there are some easy improvements to me.
@rdblue I agree, I did some cleaning up. Let me know what you think.

FWIW, I also ran the Iceberg tests and they ran fine (except the bloom filter ones, more details here).
wgtmac left a comment
I repeat my stance on this: to claim Hadoop 2.7 runtime compatibility you should be building against Java 7. If you don't, well, make clear it's fairly qualified support, "Hadoop 2.7.3 on Java 8 only", and not worry about the bits of Hadoop which break if they try to do that (Kerberos, S3A, anything with joda-time, ...).
 * @param stream stream to probe
 * @return an H2SeekableInputStream to access, or H1SeekableInputStream if the stream is not seekable
 */
private static SeekableInputStream isWrappedStreamByteBufferReadableLegacy(FSDataInputStream stream) {
Usually, `isSomething` methods return a boolean. This is unwrapping, so I'd prefer naming it `unwrapByteBufferReadableLegacy` or something to be more clear.
  LOG.debug("Checking on wrapped stream {} of {} whether is ByteBufferReadable", wrapped, stream);
  return isWrappedStreamByteBufferReadableLegacy((FSDataInputStream) wrapped);
}
if (stream.getWrappedStream() instanceof ByteBufferReadable) {
I prefer using the same whitespace conventions as in Iceberg, although that's a bit more relaxed over here.
  return null;
}

Boolean hasCapabilities = hasCapabilitiesMethod.invoke(stream, "in:readbytebuffer");
Is this a boxed boolean? If so, should we update the check to handle null?
I was assuming that it needed to be an object, but a primitive works as well, so changed that. Thanks for catching this!
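The null hazard behind this exchange can be shown in a short self-contained sketch; `describe` is a hypothetical helper, not code from the PR:

```java
// Sketch of why the boxed Boolean matters: unboxing null throws, while
// Boolean.TRUE.equals() is null-safe and treats null as false.
class BoxedBooleanCheck {
  static String describe(Boolean hasCapabilities) {
    // if (hasCapabilities) { ... } would throw NullPointerException if a
    // reflective call handed back null; the null-safe form never does.
    if (Boolean.TRUE.equals(hasCapabilities)) {
      return "byte buffer readable";
    }
    return "fall back";
  }

  public static void main(String[] args) {
    System.out.println(describe(Boolean.TRUE)); // byte buffer readable
    System.out.println(describe(null));         // fall back
  }
}
```

Declaring the variable as a primitive `boolean`, as done in the PR, avoids the problem entirely because the compiler then forces the reflective result to be unboxed at the assignment, where a null would fail fast.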
Boolean hasCapabilities = hasCapabilitiesMethod.invoke(stream, "in:readbytebuffer");

if (hasCapabilities) {
This variable would be more clear if it were called isByteBufferReadable since that's the capability we are checking for.
rdblue left a comment
Merge when you're ready. I think this is correct.
Thanks @shangxinli, @wgtmac, @rdblue, and @steveloughran for the review. Steve, I'm aware of your stance and I also respect it. Unfortunately, a lot of companies are still using (internally heavily patched) versions of Hadoop, and to get traction in downstream projects we still have to maintain compatibility.
* Bring back support for Hadoop 2.7.3
* Simplify the code
* Fix the naming
* Comments
Complements the work of #951. This PR removes the breaking change that caused Parquet to be incompatible with Hadoop <2.9. In this change, we dynamically check if the `hasCapabilities` method is available, and if it is, use it. Otherwise, we fall back to the previous implementation, with the addition that it also checks the wrapped streams.
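The "dynamically check if the method is available" pattern can be sketched in a self-contained way. `String.isEmpty` below is only a stand-in probe target so the example compiles without Hadoop; the real code would look up `hasCapability` on the Hadoop stream class:

```java
import java.lang.reflect.Method;

// Sketch of resolving an optional method once at class-load time, assuming a
// stand-in probe target (String.isEmpty) instead of the real Hadoop method.
class OptionalMethodLookup {
  static final Method OPTIONAL = lookup();

  private static Method lookup() {
    try {
      return String.class.getMethod("isEmpty"); // stand-in for hasCapability
    } catch (NoSuchMethodException e) {
      return null; // older runtime: callers take the legacy fallback path
    }
  }

  public static void main(String[] args) throws Exception {
    if (OPTIONAL != null) {
      System.out.println(OPTIONAL.invoke("")); // invokes "".isEmpty()
    } else {
      System.out.println("legacy fallback");
    }
  }
}
```

Doing the lookup once in a static initializer keeps the reflective cost out of the per-stream path: every later call only performs a null check plus an `invoke`.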