[HUDI-313] Fix select count star error when querying a realtime table #972
Conversation
 * the read columns' id is an empty string and Hive will combine it with Hoodie required projection ids and becomes
 * e.g. ",2,0,3" and will cause an error. This method is used to avoid this situation.
 */
private static synchronized Configuration cleanProjectionColumnIds(Configuration conf) {
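A minimal standalone sketch of what such a cleanup does, with a plain `Map` standing in for Hadoop's `Configuration` (the property key mirrors Hive's `hive.io.file.readcolumn.ids`; the `clean` helper name is illustrative, not Hudi's actual API):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: a Map stands in for Hadoop's Configuration so the example is
// self-contained. The helper name "clean" is hypothetical.
public class ProjectionIdCleanup {
  static final String READ_COLUMN_IDS = "hive.io.file.readcolumn.ids";

  // Drop the stray leading comma from an id list like ",2,0,3" so it parses
  // as "2,0,3"; leave already well-formed lists untouched.
  static String clean(String ids) {
    return ids.startsWith(",") ? ids.substring(1) : ids;
  }

  static void cleanProjectionColumnIds(Map<String, String> conf) {
    String ids = conf.getOrDefault(READ_COLUMN_IDS, "");
    conf.put(READ_COLUMN_IDS, clean(ids));
  }

  public static void main(String[] args) {
    Map<String, String> conf = new HashMap<>();
    conf.put(READ_COLUMN_IDS, ",2,0,3");  // the malformed value Hive produces
    cleanProjectionColumnIds(conf);
    System.out.println(conf.get(READ_COLUMN_IDS)); // prints 2,0,3
  }
}
```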
Is there a reason for using synchronized? (Is this for non-Hive-on-MR based jobs?)
Good question. Actually I am not sure about this. But I found that the HoodieParquetRealtimeInputFormat::addRequiredProjectionFields method is synchronized, so I guessed this method should be similar to that.
For Spark, for example, multiple tasks run in the same executor; I think that could be a use case.
@zhedoubushishi That makes sense. Although, the hoodie projection column ids are added right below by the realtime format's addRequiredProjectionFields method (which is invoked by Hive). Can we perform this check before adding those projection columns themselves?
As you said, the weird comma is added in HiveInputFormat.java, which then directly calls getRecordReader from HoodieParquetRealtimeInputFormat.java. I don't see a way to do this check any earlier unless we do it in the Hive code.
@zhedoubushishi: You can synchronize on the passed conf object instead of static synchronization, which becomes a global lock at the JVM level.
You can do something like

synchronized (conf) {
  // ...
}

inside your cleanProjectionColumnIds.
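To illustrate the suggestion, here is a sketch (again with a `Map` standing in for the Hadoop `Configuration`, and hypothetical method names) where the lock is scoped to the one conf instance being mutated, rather than a JVM-wide lock from a `static synchronized` method:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: locking on the shared conf object (a plain Map here) means two
// jobs with different conf objects never contend, unlike static synchronized,
// which serializes every caller in the JVM.
public class ConfLockSketch {
  static final String READ_COLUMN_IDS = "hive.io.file.readcolumn.ids";

  static void cleanProjectionColumnIds(Map<String, String> conf) {
    synchronized (conf) {  // per-conf lock, not a global one
      String ids = conf.getOrDefault(READ_COLUMN_IDS, "");
      if (ids.startsWith(",")) {
        conf.put(READ_COLUMN_IDS, ids.substring(1));
      }
    }
  }

  public static void main(String[] args) throws InterruptedException {
    Map<String, String> conf = new HashMap<>();
    conf.put(READ_COLUMN_IDS, ",2,0,3");
    // Two tasks racing on the same conf: the cleanup is idempotent, and the
    // lock keeps the read-check-write sequence atomic.
    Thread t1 = new Thread(() -> cleanProjectionColumnIds(conf));
    Thread t2 = new Thread(() -> cleanProjectionColumnIds(conf));
    t1.start(); t2.start();
    t1.join(); t2.join();
    System.out.println(conf.get(READ_COLUMN_IDS)); // prints 2,0,3
  }
}
```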
Otherwise, looks ok.
That makes sense. Code changes are done.
vinothchandar left a comment
Just the one comment... and you may also want to change the target branch to master instead of release-0.5.0?

Sorry, I used the wrong branch. Fixed now.
Force-pushed from 220cab5 to 45a7d77
@zhedoubushishi Like @bvaradar mentioned, please synchronize on the jobConf object, after which this is good to go.
/**
 * Hive will append read columns' ids to old columns' ids during getRecordReader. In some cases, e.g. SELECT COUNT(*),
 * the read columns' id is an empty string and Hive will combine it with Hoodie required projection ids and becomes
 * e.g. ",2,0,3" and will cause an error. This method is used to avoid this situation.
 */
As discussed with you internally as well, this appears to be a bug in Hive. It manifests because Hudi needs to append its minimum set of projection columns, i.e. its metadata columns, even in the case of a count query.
But ideally this needs to be fixed in Hive so it does not happen in the first place: https://github.com/apache/hive/blob/master/serde/src/java/org/apache/hadoop/hive/serde2/ColumnProjectionUtils.java#L119
Can we file a Jira with Hive, and add it to the comment here?
Yeah, after the discussion and some investigation, Hive is where this bug originates: it creates projection column ids like ",2,0,3". What my code actually does is handle this bug inside Hudi.
Hive has fixed this bug after 3.0.0, but earlier versions still face this problem. The Jira for Hive is here: https://issues.apache.org/jira/browse/HIVE-22438.
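For context on how the leading comma arises, the problem pattern is an unconditional separator when appending ids to a possibly empty existing list. The following is a hypothetical simplification of that pattern, not Hive's actual ColumnProjectionUtils code:

```java
// Hypothetical simplification of the append pattern corrected under
// HIVE-22438: the comma separator is emitted even when the existing id list
// is empty, yielding a leading comma that later breaks integer parsing.
public class AppendBugSketch {
  static String appendReadColumnIds(String existing, int... ids) {
    StringBuilder sb = new StringBuilder(existing);
    for (int id : ids) {
      sb.append(',').append(id); // unconditional comma -> ",2,0,3" on empty input
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // SELECT COUNT(*): Hive's read-column list is "", and Hudi then appends
    // its required projection ids (2, 0, 3 here are illustrative values).
    System.out.println(appendReadColumnIds("", 2, 0, 3)); // prints ,2,0,3
  }
}
```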
@zhedoubushishi Thanks for addressing the comments. I'm planning to add some more changes on top of this PR and will add the JIRA in the comments when I open the PR.
Co-authored-by: Surya Prasanna Kumar Yalla <[email protected]>
Co-authored-by: Timothy Brown <[email protected]>
Jira: https://jira.apache.org/jira/browse/HUDI-313