
Conversation

@zhedoubushishi
Contributor

/**
 * Hive appends the read columns' ids to the old columns' ids during getRecordReader. In some cases, e.g. SELECT COUNT(*),
 * the read columns' id list is an empty string; Hive then combines it with Hoodie's required projection ids and produces
 * e.g. ",2,0,3", which causes an error. This method is used to avoid that situation.
 */
private static synchronized Configuration cleanProjectionColumnIds(Configuration conf) {
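
For context, a minimal sketch of what such a cleanup could look like, assuming the projected ids live under Hive's `ColumnProjectionUtils.READ_COLUMN_IDS_CONF_STR` key; the real Hudi method may differ in details such as locking and return value:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.serde2.ColumnProjectionUtils;

public class ProjectionCleanupSketch {

  // Hypothetical sketch: strip a leading comma from the projected column id list,
  // e.g. ",2,0,3" -> "2,0,3", so downstream parsing does not choke on an empty id.
  static Configuration cleanProjectionColumnIds(Configuration conf) {
    String ids = conf.get(ColumnProjectionUtils.READ_COLUMN_IDS_CONF_STR, "");
    if (!ids.isEmpty() && ids.charAt(0) == ',') {
      conf.set(ColumnProjectionUtils.READ_COLUMN_IDS_CONF_STR, ids.substring(1));
    }
    return conf;
  }
}
```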
Contributor

Is there a reason for using synchronized? (Is this for non-Hive-on-MR based jobs?)

Contributor Author

Good question. Actually, I'm not sure about this. But I found that the HoodieParquetRealtimeInputFormat::addRequiredProjectionFields method is synchronized, so I assumed this method should be similar.

Contributor Author

For Spark, for example, multiple tasks run in the same executor; I think that could be a use case.

Contributor

@zhedoubushishi That makes sense. However, the Hoodie projection column ids are added by the addRequiredProjectionFields method right below, in the realtime format (which is invoked by Hive). Can we perform this check before adding those projection columns themselves?

Contributor Author

As you said, the stray comma is added in HiveInputFormat.java, which then directly calls getRecordReader from HoodieParquetRealtimeInputFormat.java. I don't see a way to do this check any earlier unless we do it in the Hive code.

Contributor

@zhedoubushishi: You can synchronize on the passed conf object instead of using static synchronization, which amounts to a global lock at the JVM level.

You can do something like
synchronized(conf) {
....
}
inside your cleanProjectionColumnIds.
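
For illustration, a sketch contrasting the two lock scopes being discussed; the method names here are made up for the example:

```java
import org.apache.hadoop.conf.Configuration;

class LockScopeSketch {

  // static synchronized locks on the class object: every caller in the JVM
  // that goes through this method is serialized, even for unrelated confs.
  private static synchronized void cleanWithGlobalLock(Configuration conf) {
    // ... mutate projection ids in conf ...
  }

  // Synchronizing on the conf instance only serializes callers that share
  // the same Configuration/JobConf, e.g. tasks in one Spark executor.
  private static void cleanWithPerConfLock(Configuration conf) {
    synchronized (conf) {
      // ... mutate projection ids in conf ...
    }
  }
}
```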

Contributor

Otherwise, looks ok.

Contributor Author

> @zhedoubushishi: You can synchronize on the passed conf object instead of using static synchronization, which amounts to a global lock at the JVM level.
>
> You can do something like
> synchronized(conf) {
> ....
> }
> inside your cleanProjectionColumnIds.

That makes sense. Code changes are done.

@zhedoubushishi changed the base branch from release-0.5.0 to master on October 25, 2019 21:47
@zhedoubushishi changed the base branch from master to release-0.5.0 on October 25, 2019 21:48
Member

@vinothchandar left a comment

Just the one comment. And you may also want to change the target branch to master instead of release-0.5.0?

@zhedoubushishi
Contributor Author

zhedoubushishi commented Oct 29, 2019

> Just the one comment. And you may also want to change the target branch to master instead of release-0.5.0?

Sorry I used the wrong branch. Fixed now.

@zhedoubushishi force-pushed the fix-select-count-star-error-for-realtime-table branch from 220cab5 to 45a7d77 on October 29, 2019 23:54
@zhedoubushishi changed the base branch from release-0.5.0 to master on October 29, 2019 23:55
@n3nash
Contributor

n3nash commented Oct 30, 2019

@zhedoubushishi As @bvaradar mentioned, please synchronize on the jobConf object, after which this is good to go.

Comment on lines 200 to 204
/**
 * Hive appends the read columns' ids to the old columns' ids during getRecordReader. In some cases, e.g. SELECT COUNT(*),
 * the read columns' id list is an empty string; Hive then combines it with Hoodie's required projection ids and produces
 * e.g. ",2,0,3", which causes an error. This method is used to avoid that situation.
 */
Contributor

As discussed with you internally as well, this appears to be a bug in Hive. It manifests because Hudi needs to append its minimum set of projection columns, i.e. its metadata columns, even in the case of a count query.

But ideally this needs to be fixed in Hive so it does not happen in the first place. https://github.com/apache/hive/blob/master/serde/src/java/org/apache/hadoop/hive/serde2/ColumnProjectionUtils.java#L119

Can we file a Jira with Hive and add it to the comment here?
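
To make the failure mode concrete, a small sketch of how the leading comma can arise; this only illustrates the behavior described in the thread, and the exact output depends on the Hive version on the classpath:

```java
import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.serde2.ColumnProjectionUtils;

public class ProjectionAppendDemo {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Per the thread, a SELECT COUNT(*) leaves the read column id list as an empty string.
    conf.set(ColumnProjectionUtils.READ_COLUMN_IDS_CONF_STR, "");
    // Append Hudi's required projection ids (e.g. 2, 0, 3); on affected Hive
    // versions this yields a list with a leading comma such as ",2,0,3".
    ColumnProjectionUtils.appendReadColumns(conf, Arrays.asList(2, 0, 3));
    System.out.println(conf.get(ColumnProjectionUtils.READ_COLUMN_IDS_CONF_STR));
  }
}
```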

Contributor Author

Yeah, after the discussion and some investigation, Hive is where this bug originates, creating projection column ids like ",2,0,3". What my code does is handle that bug inside Hudi.
Hive fixed this bug after 3.0.0, but before 3.0.0 we would still face this problem. The Hive Jira is here: https://issues.apache.org/jira/browse/HIVE-22438.

@n3nash
Contributor

n3nash commented Nov 1, 2019

@zhedoubushishi Thanks for addressing the comments. I'm planning to add some more changes on top of this PR and will add the JIRA in the comments when I open the PR.

@n3nash merged commit ee0fd06 into apache:master on Nov 1, 2019