[HUDI-313] Fix select count star error when querying a realtime table #972
Conversation
 * the read columns' id is an empty string and Hive will combine it with Hoodie required projection ids and becomes
 * e.g. ",2,0,3" and will cause an error. This method is used to avoid this situation.
 */
private static synchronized Configuration cleanProjectionColumnIds(Configuration conf) {
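A minimal standalone sketch of what such a cleanup does, with a plain `Map` standing in for Hadoop's `Configuration` (the property key mirrors Hive's `hive.io.file.readcolumn.ids`; the `clean` helper name is illustrative, not Hudi's actual API):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: a Map stands in for Hadoop's Configuration so the example is
// self-contained. The helper name "clean" is hypothetical.
public class ProjectionIdCleanup {
  static final String READ_COLUMN_IDS = "hive.io.file.readcolumn.ids";

  // Drop the stray leading comma from an id list like ",2,0,3" so it parses
  // as "2,0,3"; leave already well-formed lists untouched.
  static String clean(String ids) {
    return ids.startsWith(",") ? ids.substring(1) : ids;
  }

  static void cleanProjectionColumnIds(Map<String, String> conf) {
    String ids = conf.getOrDefault(READ_COLUMN_IDS, "");
    conf.put(READ_COLUMN_IDS, clean(ids));
  }

  public static void main(String[] args) {
    Map<String, String> conf = new HashMap<>();
    conf.put(READ_COLUMN_IDS, ",2,0,3");  // the malformed value Hive produces
    cleanProjectionColumnIds(conf);
    System.out.println(conf.get(READ_COLUMN_IDS)); // prints 2,0,3
  }
}
```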
Is there a reason for using synchronized? (Is this for non-Hive-on-MR based jobs?)
Good question. Actually I am not sure about this. But I found that the HoodieParquetRealtimeInputFormat::addRequiredProjectionFields method is synchronized, so I guessed this method should be similar to that.
For Spark, for example, multiple tasks run in the same executor; I think that could be a use case.
@zhedoubushishi That makes sense. Although, the hoodie projection column ids are added right below by the realtime format's addRequiredProjectionFields method (which is invoked by Hive). Can we perform this check before adding those projection columns themselves?
As you said, the weird comma is added in HiveInputFormat.java, which then directly calls getRecordReader from HoodieParquetRealtimeInputFormat.java. I don't see a way to do this check any earlier unless we do it in the Hive code.
@zhedoubushishi: You can synchronize on the passed conf object instead of static synchronization, which becomes a global lock at the JVM level.
You can do something like

synchronized (conf) {
  // ...
}

inside your cleanProjectionColumnIds.
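To illustrate the suggestion, here is a sketch (again with a `Map` standing in for the Hadoop `Configuration`, and hypothetical method names) where the lock is scoped to the one conf instance being mutated, rather than a JVM-wide lock from a `static synchronized` method:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: locking on the shared conf object (a plain Map here) means two
// jobs with different conf objects never contend, unlike static synchronized,
// which serializes every caller in the JVM.
public class ConfLockSketch {
  static final String READ_COLUMN_IDS = "hive.io.file.readcolumn.ids";

  static void cleanProjectionColumnIds(Map<String, String> conf) {
    synchronized (conf) {  // per-conf lock, not a global one
      String ids = conf.getOrDefault(READ_COLUMN_IDS, "");
      if (ids.startsWith(",")) {
        conf.put(READ_COLUMN_IDS, ids.substring(1));
      }
    }
  }

  public static void main(String[] args) throws InterruptedException {
    Map<String, String> conf = new HashMap<>();
    conf.put(READ_COLUMN_IDS, ",2,0,3");
    // Two tasks racing on the same conf: the cleanup is idempotent, and the
    // lock keeps the read-check-write sequence atomic.
    Thread t1 = new Thread(() -> cleanProjectionColumnIds(conf));
    Thread t2 = new Thread(() -> cleanProjectionColumnIds(conf));
    t1.start(); t2.start();
    t1.join(); t2.join();
    System.out.println(conf.get(READ_COLUMN_IDS)); // prints 2,0,3
  }
}
```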
Otherwise, looks ok.
That makes sense. Code changes are done.
vinothchandar left a comment
Just the one comment... and you may also want to change the target branch to master instead of release-0.5.0?

Sorry, I used the wrong branch. Fixed now.
Force-pushed from 220cab5 to 45a7d77
@zhedoubushishi Like @bvaradar mentioned, please synchronize on the jobConf object, after which this is good to go.
/**
 * Hive will append read columns' ids to old columns' ids during getRecordReader. In some cases, e.g. SELECT COUNT(*),
 * the read columns' id is an empty string and Hive will combine it with Hoodie required projection ids and becomes
 * e.g. ",2,0,3" and will cause an error. This method is used to avoid this situation.
 */
As discussed with you internally as well, this appears to be a bug in Hive. It manifests because Hudi needs to append its minimum set of projection columns, i.e. its metadata columns, even in the case of a count query.
But ideally this needs to be fixed in Hive so it does not happen in the first place: https://github.com/apache/hive/blob/master/serde/src/java/org/apache/hadoop/hive/serde2/ColumnProjectionUtils.java#L119
Can we file a Jira with Hive, and add it to the comment here?
Yeah, after the discussion and some investigation, Hive is where this bug originates: it creates projection column ids like ",2,0,3". What my code actually does is handle this bug inside Hudi.
Hive has fixed this bug after 3.0.0, but earlier versions still face this problem. The Jira for Hive is here: https://issues.apache.org/jira/browse/HIVE-22438.
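For context on how the leading comma arises, the problem pattern is an unconditional separator when appending ids to a possibly empty existing list. The following is a hypothetical simplification of that pattern, not Hive's actual ColumnProjectionUtils code:

```java
// Hypothetical simplification of the append pattern corrected under
// HIVE-22438: the comma separator is emitted even when the existing id list
// is empty, yielding a leading comma that later breaks integer parsing.
public class AppendBugSketch {
  static String appendReadColumnIds(String existing, int... ids) {
    StringBuilder sb = new StringBuilder(existing);
    for (int id : ids) {
      sb.append(',').append(id); // unconditional comma -> ",2,0,3" on empty input
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // SELECT COUNT(*): Hive's read-column list is "", and Hudi then appends
    // its required projection ids (2, 0, 3 here are illustrative values).
    System.out.println(appendReadColumnIds("", 2, 0, 3)); // prints ,2,0,3
  }
}
```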
@zhedoubushishi Thanks for addressing the comments. I'm planning to add some more changes on top of this PR and will add the JIRA in the comments when I open the PR.
Co-authored-by: Surya Prasanna Kumar Yalla <[email protected]>
Co-authored-by: Timothy Brown <[email protected]>
Jira: https://jira.apache.org/jira/browse/HUDI-313