HIVE-28026: Reading proto data more than 2GB from multiple splits fails #5033

Aggarwal-Raghav · 2024-01-24T13:27:48Z

What changes were proposed in this pull request?

Why are the changes needed?

Query: select * from <table_name>

Explanation:
Running the above query on a Hive proto table spawns multiple Tez containers to process the data. If a container handles multiple HDFS splits and the combined size of the decompressed data exceeds 2GB, the query fails with the following error:
"While parsing a protocol message, the input ended unexpectedly in the middle of a field. This could mean either that the input has been truncated or that an embedded message misreported its own length."

This happens because of the following statement in CodedInputStream: byteLimit += totalBytesRetired + pos;
byteLimit hits an integer overflow because totalBytesRetired keeps a running count of all the bytes read so far, and the CodedInputStream is initialized only once for a container.

hive/ql/src/java/org/apache/tez/dag/history/logging/proto/ProtoMessageWritable.java

Line 96 in 564d7e5

cin = CodedInputStream.newInstance(din);
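
The overflow described above can be illustrated with a minimal, self-contained sketch. It models only the quoted arithmetic (`byteLimit += totalBytesRetired + pos`) with plain `int` values and does not call protobuf; the class name `ByteLimitOverflowDemo`, the helper `computeLimit`, and the sample values are hypothetical and chosen purely for illustration.

```java
// A simplified model of the arithmetic only, not the protobuf implementation.
final class ByteLimitOverflowDemo {

    // Hypothetical helper: mirrors "byteLimit += totalBytesRetired + pos" using int math.
    static int computeLimit(int messageLength, int totalBytesRetired, int pos) {
        int byteLimit = messageLength;
        byteLimit += totalBytesRetired + pos; // silently wraps around on overflow
        return byteLimit;
    }

    // Same sum computed in long, to show the value the int version cannot represent.
    static long exactLimit(int messageLength, int totalBytesRetired, int pos) {
        return (long) messageLength + totalBytesRetired + pos;
    }

    public static void main(String[] args) {
        // Assumed state: one stream has already consumed almost 2GB across several splits.
        int totalBytesRetired = Integer.MAX_VALUE - 1000;
        int pos = 500;
        int messageLength = 4096; // size of the next message to read

        int wrapped = computeLimit(messageLength, totalBytesRetired, pos);
        long exact = exactLimit(messageLength, totalBytesRetired, pos);

        System.out.println("int limit (wrapped): " + wrapped);
        System.out.println("exact limit (long):  " + exact);
        System.out.println("limit is negative:   " + (wrapped < 0));
    }
}
```

With a fresh counter (for example `totalBytesRetired = 0`) the same helper returns an ordinary positive limit, which is why the failure only shows up once a single stream instance has accumulated close to 2GB of reads.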

This differs from the issue reproduced in https://github.com/zabetak/protobuf-large-message: there, a single proto data file is larger than 2GB, whereas in my case multiple files together add up to more than 2GB.

Limitation:
This fix does not resolve the issue described in protocolbuffers/protobuf#11729.

Does this PR introduce any user-facing change?

NO

Is the change a dependency upgrade?

NO

How was this patch tested?

On a cluster

Aggarwal-Raghav · 2024-01-24T13:28:15Z

@zabetak, can you please review?

abstractdog · 2024-01-24T13:53:13Z

why do we have this copy of ProtoMessageWritable in tez package namespace in hive code? @harishjp, do you know anything about this?

https://github.com/apache/tez/blob/master/tez-plugins/tez-protobuf-history-plugin/src/main/java/org/apache/tez/dag/history/logging/proto/ProtoMessageWritable.java

deniskuzZ · 2024-01-24T15:15:44Z

@abstractdog, see HIVE-19288. It looks like the copied code was only in an unreleased version of Tez at that time. Now that it's available, we can just drop the copy and use the Tez libs.
That said, I think the patch should be done in Tez, plus an imports change in Hive.

btw, why doesn't ProtoMessageWritable implement Closeable? DataOutputStream & DataInputStream are never closed, is that ok?

harishjp · 2024-01-24T15:23:54Z

Yes, patch should be done in tez and these files in hive should be removed. It was added because tez changes were not yet released and we wanted this in hive at the moment.

I think I made a big mistake back then, the Writer and Reader should have been in hadoop and both tez and hive should be using the Writer and Reader. I was hesitant to co-ordinate between 3 projects and wait for releases in correct order to commit changes to each of the projects. Anyways, right now we can remove the tez code in hive and only have it in tez.

abstractdog · 2024-01-24T15:30:37Z

no worries, thanks @harishjp for clarifying, I think this is not the first and won't be the last time we do such a thing :)
created https://issues.apache.org/jira/browse/HIVE-28028

thanks @deniskuzZ for double-checking

sonarqubecloud · 2024-01-24T18:35:07Z

Quality Gate passed

Kudos, no new issues were introduced!

0 New issues
0 Security Hotspots
No data about Coverage
No data about Duplication


Aggarwal-Raghav · 2024-01-29T06:38:17Z

Have written a java program to reproduce this issue: https://github.com/Aggarwal-Raghav/proto-reader
Hope this helps

zabetak · 2024-01-30T08:46:05Z

Please correct me if I am wrong but it seems that the consensus so far is to fix this in the Tez repo and do nothing here. From a Hive perspective what we should do is move forward with HIVE-28028. Are we all on the same page?

Aggarwal-Raghav · 2024-02-01T16:47:17Z

Ok, I will raise this PR in TEZ then.

aturoczy · 2024-02-01T20:03:39Z

Thank you @Aggarwal-Raghav! Once the Tez change is in, could you please remove this unnecessary file?

Aggarwal-Raghav · 2024-02-02T04:31:04Z

Sure

Aggarwal-Raghav · 2024-02-03T10:39:15Z

Have created the PR in Tez: TEZ-4540. Hence closing this PR.

HIVE-28026: Reading proto data more than 2GB from multiple splits fails

a546ceb

github-actions bot requested a review from abstractdog January 24, 2024 13:28

asf-ci-hive added the tests pending label Jan 24, 2024

asf-ci-hive added tests passed and removed tests pending labels Jan 25, 2024

Aggarwal-Raghav mentioned this pull request Feb 3, 2024

TEZ-4540: Reading proto data more than 2GB from multiple splits fails apache/tez#334

Merged

Aggarwal-Raghav closed this Feb 3, 2024

Aggarwal-Raghav deleted the proto branch August 16, 2024 10:55
