Skip to content

Conversation

@Aggarwal-Raghav
Copy link
Contributor

@Aggarwal-Raghav Aggarwal-Raghav commented Jan 24, 2024

What changes were proposed in this pull request?

HIVE-28026

Why are the changes needed?

Query: select * from <table_name>

Explanation:
On running the above mentioned query on a hive proto table, multiple tez containers will be spawned to process the data. In a container, if there are multiple hdfs splits and the combined size of decompressed data is more than 2GB then the query fails with the following error:
"While parsing a protocol message, the input ended unexpectedly in the middle of a field. This could mean either that the input has been truncated or that an embedded message misreported its own length."

This is happening because of CodedInputStream i.e. byteLimit += totalBytesRetired + pos;
byteLimit is getting InterOverflow as totalBytesRetired is retaining count of all the bytes that it has read as CodedInputStream is initiliazed once for a container.

This is different from issue reproduced in https://github.com/zabetak/protobuf-large-message as there it is a single proto data file more than 2GB, but in my case, there are multiple file total resulting in 2GB.

Limitation:
This fix will still not resolve the issue which is mentioned protocolbuffers/protobuf#11729

Does this PR introduce any user-facing change?

NO

Is the change a dependency upgrade?

NO

How was this patch tested?

On a cluster

@Aggarwal-Raghav
Copy link
Contributor Author

@zabetak, can you please review?

@abstractdog
Copy link
Contributor

why do we have this copy of ProtoMessageWritable in tez package namespace in hive code? @harishjp, do you know anything about this?

https://github.com/apache/tez/blob/master/tez-plugins/tez-protobuf-history-plugin/src/main/java/org/apache/tez/dag/history/logging/proto/ProtoMessageWritable.java

@deniskuzZ
Copy link
Member

deniskuzZ commented Jan 24, 2024

why do we have this copy of ProtoMessageWritable in tez package namespace in hive code? @harishjp, do you know anything about this?

https://github.com/apache/tez/blob/master/tez-plugins/tez-protobuf-history-plugin/src/main/java/org/apache/tez/dag/history/logging/proto/ProtoMessageWritable.java

@abstractdog , see HIVE-19288, looks like at that time copied code was in unreleased version of Tez. Now since it's available we can just drop the copy and use Tez libs.
That said, I think, patch should be done in Tez + imports change in Hive

btw, why ProtoMessageWritable doesn't implement Closeable? DataOutputStream & DataInputStream are never closed, is that ok?

@harishjp
Copy link
Contributor

Yes, patch should be done in tez and these files in hive should be removed. It was added because tez changes were not yet released and we wanted this in hive at the moment.

I think I made a big mistake back then, the Writer and Reader should have been in hadoop and both tez and hive should be using the Writer and Reader. I was hesitant to co-ordinate between 3 projects and wait for releases in correct order to commit changes to each of the projects. Anyways, right now we can remove the tez code in hive and only have it in tez.

@abstractdog
Copy link
Contributor

abstractdog commented Jan 24, 2024

Yes, patch should be done in tez and these files in hive should be removed. It was added because tez changes were not yet released and we wanted this in hive at the moment.

I think I made a big mistake back then, the Writer and Reader should have been in hadoop and both tez and hive should be using the Writer and Reader. I was hesitant to co-ordinate between 3 projects and wait for releases in correct order to commit changes to each of the projects. Anyways, right now we can remove the tez code in hive and only have it in tez.

no worries, thanks @harishjp for clarifying, I think this is not the first and not even the last time we do such thing :)
created https://issues.apache.org/jira/browse/HIVE-28028

thanks @deniskuzZ for double-checking

@sonarqubecloud
Copy link

Quality Gate Passed Quality Gate passed

Kudos, no new issues were introduced!

0 New issues
0 Security Hotspots
No data about Coverage
No data about Duplication

See analysis details on SonarCloud

@Aggarwal-Raghav
Copy link
Contributor Author

Have written a java program to reproduce this issue: https://github.com/Aggarwal-Raghav/proto-reader
Hope this helps

@zabetak
Copy link
Member

zabetak commented Jan 30, 2024

Please correct me if I am wrong but it seems that the consensus so far is to fix this in the Tez repo and do nothing here. From a Hive perspective what we should do is move forward with HIVE-28028. Are we all on the same page?

@Aggarwal-Raghav
Copy link
Contributor Author

Ok, I will raise this PR in TEZ then.

@aturoczy
Copy link

aturoczy commented Feb 1, 2024

Thank you @Aggarwal-Raghav! Could you please after the tez change remove this unnecessary file?

@Aggarwal-Raghav
Copy link
Contributor Author

Sure

@Aggarwal-Raghav
Copy link
Contributor Author

Have created the PR in tez TEZ-4540. Hence closing this PR

@Aggarwal-Raghav Aggarwal-Raghav deleted the proto branch August 16, 2024 10:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants