[HUDI-5275] Fix reading data using the HoodieHiveCatalog will cause the Spark write to fail #7295
Conversation
parameters.put(PATH.key(), path);
Map<String, String> hoodieProps = loadFromHoodiePropertieFile(path, hiveConf);
parameters.putAll(TableOptionProperties.translateSparkTableProperties2Flink(hoodieProps));
if (!parameters.containsKey(FlinkOptions.HIVE_STYLE_PARTITIONING.key())) {
Do we need to put the explicit table options here? Which function needs this config, and where do we miss it then?
We can read the table config through StreamerUtil.createMetaClient(xx).getTableConfig().
We actually get the original Hive table params first here: parameters = hiveTable.getParameters(). You mean there are options missing for Spark?
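For reference, a minimal sketch of that suggestion, assuming StreamerUtil.createMetaClient(basePath, hadoopConf) and HoodieTableConfig.HIVE_STYLE_PARTITIONING_ENABLE are available in this module:

import org.apache.hadoop.conf.Configuration;
import org.apache.hudi.common.table.HoodieTableConfig;
import org.apache.hudi.common.table.HoodieTableMetaClient;
import org.apache.hudi.util.StreamerUtil;

// Sketch only: read the flag persisted by the writer in .hoodie/hoodie.properties
// instead of re-deriving it in the catalog.
static String readHiveStylePartitioning(String path, Configuration hadoopConf) {
  HoodieTableMetaClient metaClient = StreamerUtil.createMetaClient(path, hadoopConf);
  HoodieTableConfig tableConfig = metaClient.getTableConfig();
  // HIVE_STYLE_PARTITIONING_ENABLE maps to hoodie.datasource.write.hive_style_partitioning.
  return tableConfig.getString(HoodieTableConfig.HIVE_STYLE_PARTITIONING_ENABLE);
}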
Yes, after Spark creates the table, there are some table properties in .hoodie/hoodie.properties, such as hoodie.datasource.write.hive_style_partitioning. These properties are not included in the Hive table properties.
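For illustration, a hypothetical excerpt of the .hoodie/hoodie.properties written by Spark for such a table (table name and values are made up):

hoodie.table.name=hudi_non_partitioned_tbl
hoodie.table.type=COPY_ON_WRITE
hoodie.datasource.write.partitionpath.field=
hoodie.datasource.write.hive_style_partitioning=true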
But we only append Hive options here; curious why this can cause an error?
Because Spark sets the hoodie.datasource.write.hive_style_partitioning property to true when creating a Hudi non-partitioned table, and records hoodie.datasource.write.hive_style_partitioning=true in hoodie.properties. But HoodieHiveCatalog does not see this property, and its own inference gets it wrong, so hoodie.datasource.write.hive_style_partitioning is set to false and written to the Hive table options. As a result, Spark sees two different values for hoodie.datasource.write.hive_style_partitioning when reading the table, and reports an error.
if (!parameters.containsKey(FlinkOptions.HIVE_STYLE_PARTITIONING.key())) {
Path hoodieTablePath = new Path(path);
boolean hiveStyle = Arrays.stream(FSUtils.getFs(hoodieTablePath, hiveConf).listStatus(hoodieTablePath))
.map(fileStatus -> fileStatus.getPath().getName())
.filter(f -> !f.equals(".hoodie") && !f.equals("default"))
.anyMatch(FilePathUtils::isHiveStylePartitioning);
parameters.put(FlinkOptions.HIVE_STYLE_PARTITIONING.key(), String.valueOf(hiveStyle));
}
The params in this code are not inferred from the hoodie.properties file, so the inference can be wrong.
Got you. I think we should not put extra options like FlinkOptions.HIVE_STYLE_PARTITIONING here; the right fix is to supplement the table config options on the Flink read/write path. I have applied a patch actually.
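A rough sketch of that direction (not the patch that was actually applied; the helper name and the exact hook point on the read/write path are assumptions):

import org.apache.flink.configuration.Configuration;
import org.apache.hudi.common.table.HoodieTableConfig;
import org.apache.hudi.configuration.FlinkOptions;

// Sketch: on the Flink read/write path, fall back to the table config
// (.hoodie/hoodie.properties) for options the user/catalog did not set,
// rather than storing them as extra catalog options.
static void supplementTableConfig(Configuration conf, HoodieTableConfig tableConfig) {
  if (!conf.contains(FlinkOptions.HIVE_STYLE_PARTITIONING)
      && tableConfig.contains(HoodieTableConfig.HIVE_STYLE_PARTITIONING_ENABLE)) {
    conf.setString(FlinkOptions.HIVE_STYLE_PARTITIONING.key(),
        tableConfig.getString(HoodieTableConfig.HIVE_STYLE_PARTITIONING_ENABLE));
  }
}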
codope left a comment
Could you please add a test for this scenario?
@danny0405 Hello, do you have any comments on this PR?
Yeah, I have reviewed and applied a patch. The idea of the fix is: we had better not add extra options like FlinkOptions.HIVE_STYLE_PARTITIONING. Can you apply the patch and add a test case?
public static Map<String, String> loadFromHoodiePropertieFile(String basePath, Configuration hadoopConf) {
  Path propertiesFilePath = getHoodiePropertiesFilePath(basePath);
  return getPropsFromFile(basePath, hadoopConf, propertiesFilePath);
}
Hi, gentle ping :)
Sorry, just saw it. Is it better to replace loadFromHoodiePropertieFile with StreamerUtil.getTableConfig here?
Because there has been no response for a long time and the code freeze is very near, I worked based on the patch and will land it once the CI passes: #7666. Closing this one instead.
Change Logs
Fix the issue where reading data using the HoodieHiveCatalog causes the Spark write to fail.
The original HoodieHiveCatalog read the table only from the Hive table attributes, not from the hoodie.properties file, so some table attributes were read incorrectly: the inferred hoodie.datasource.write.hive_style_partitioning attribute is inconsistent with the real one.
Impact
Fixes the case where reading data using the HoodieHiveCatalog causes the Spark write to fail.
Risk level (write none, low medium or high below)
low
Documentation Update
Contributor's checklist