-
Notifications
You must be signed in to change notification settings - Fork 4.8k
HIVE-28026: Reading proto data more than 2GB from multiple splits fails #5033
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@zabetak, can you please review? |
|
why do we have this copy of ProtoMessageWritable in tez package namespace in hive code? @harishjp, do you know anything about this? |
@abstractdog , see HIVE-19288, looks like at that time copied code was in unreleased version of Tez. Now since it's available we can just drop the copy and use Tez libs. btw, why ProtoMessageWritable doesn't implement Closeable? DataOutputStream & DataInputStream are never closed, is that ok? |
|
Yes, patch should be done in tez and these files in hive should be removed. It was added because tez changes were not yet released and we wanted this in hive at the moment. I think I made a big mistake back then, the Writer and Reader should have been in hadoop and both tez and hive should be using the Writer and Reader. I was hesitant to co-ordinate between 3 projects and wait for releases in correct order to commit changes to each of the projects. Anyways, right now we can remove the tez code in hive and only have it in tez. |
no worries, thanks @harishjp for clarifying, I think this is not the first and not even the last time we do such thing :) thanks @deniskuzZ for double-checking |
|
|
Have written a java program to reproduce this issue: https://github.com/Aggarwal-Raghav/proto-reader |
|
Please correct me if I am wrong but it seems that the consensus so far is to fix this in the Tez repo and do nothing here. From a Hive perspective what we should do is move forward with HIVE-28028. Are we all on the same page? |
|
Ok, I will raise this PR in TEZ then. |
|
Thank you @Aggarwal-Raghav! Could you please after the tez change remove this unnecessary file? |
|
Sure |
|
Have created the PR in tez TEZ-4540. Hence closing this PR |

What changes were proposed in this pull request?
HIVE-28026
Why are the changes needed?
Query: select * from <table_name>
Explanation:
On running the above mentioned query on a hive proto table, multiple tez containers will be spawned to process the data. In a container, if there are multiple hdfs splits and the combined size of decompressed data is more than 2GB then the query fails with the following error:
"While parsing a protocol message, the input ended unexpectedly in the middle of a field. This could mean either that the input has been truncated or that an embedded message misreported its own length."This is happening because of CodedInputStream i.e. byteLimit += totalBytesRetired + pos;
byteLimit is getting InterOverflow as totalBytesRetired is retaining count of all the bytes that it has read as CodedInputStream is initiliazed once for a container.
hive/ql/src/java/org/apache/tez/dag/history/logging/proto/ProtoMessageWritable.java
Line 96 in 564d7e5
This is different from issue reproduced in https://github.com/zabetak/protobuf-large-message as there it is a single proto data file more than 2GB, but in my case, there are multiple file total resulting in 2GB.
Limitation:
This fix will still not resolve the issue which is mentioned protocolbuffers/protobuf#11729
Does this PR introduce any user-facing change?
NO
Is the change a dependency upgrade?
NO
How was this patch tested?
On a cluster