
Conversation

@waywtdcc
Contributor

@waywtdcc waywtdcc commented Nov 24, 2022

Change Logs

Fix an issue where reading data through the HoodieHiveCatalog causes Spark writes to fail.
The original HoodieHiveCatalog read the table options only from the Hive table properties, not from the hoodie.properties file, so the table attributes were resolved incorrectly. As a result, the hoodie.datasource.write.hive_style_partitioning attribute became inconsistent with the real value.

Impact

Fixes an issue where reading data through the HoodieHiveCatalog causes Spark writes to fail.

Risk level (write none, low medium or high below)

low

Documentation Update

  • Fix an issue where reading data through the HoodieHiveCatalog causes Spark writes to fail

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

parameters.put(PATH.key(), path);
Map<String, String> hoodieProps = loadFromHoodiePropertieFile(path, hiveConf);
parameters.putAll(TableOptionProperties.translateSparkTableProperties2Flink(hoodieProps));
if (!parameters.containsKey(FlinkOptions.HIVE_STYLE_PARTITIONING.key())) {
Contributor

Do we need to put the explicit table options here? Which function needs this config, and where do we miss it then?

Contributor
@danny0405 danny0405 Nov 24, 2022

We can read the table config through StreamerUtil.createMetaClient(xx).getTableConfig().

We actually get the original Hive table params first here: parameters = hiveTable.getParameters(). Do you mean there are options missing for Spark?
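
For illustration, a minimal sketch of reading the on-storage table config this way, assuming Hudi 0.12.x APIs and that StreamerUtil.createMetaClient is essentially a thin wrapper around HoodieTableMetaClient.builder(); the raw property key lookup and the "false" fallback are assumptions made for clarity, not the actual patch:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hudi.common.config.TypedProperties;
    import org.apache.hudi.common.table.HoodieTableMetaClient;

    final class TableConfigSketch {
      // Read the value Spark persisted under <basePath>/.hoodie/hoodie.properties,
      // falling back to "false" when the key is absent (assumed default for this sketch).
      static String hiveStylePartitioning(String basePath, Configuration hadoopConf) {
        HoodieTableMetaClient metaClient = HoodieTableMetaClient.builder()
            .setConf(hadoopConf)   // Hadoop conf derived from the HiveConf
            .setBasePath(basePath) // Hudi table base path
            .build();
        TypedProperties props = metaClient.getTableConfig().getProps();
        return props.getProperty("hoodie.datasource.write.hive_style_partitioning", "false");
      }
    }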

Contributor Author

Yes. After Spark creates the table, some table properties are stored in .hoodie/hoodie.properties, such as hoodie.datasource.write.hive_style_partitioning. These properties are not included in the Hive table properties.

Contributor

But we only append Hive options here; curious why this can cause an error?

Contributor Author

Because Spark sets the hoodie.datasource.write.hive_style_partitioning property to true when creating a Hudi non-partitioned table, and records hoodie.datasource.write.hive_style_partitioning=true in hoodie.properties. HoodieHiveCatalog does not see this property, so it infers the value instead, and the inference is wrong: hoodie.datasource.write.hive_style_partitioning gets set to false and is uploaded to the Hive table options. As a result, Spark sees two different values for hoodie.datasource.write.hive_style_partitioning when reading the table, and reports an error.

Contributor Author

if (!parameters.containsKey(FlinkOptions.HIVE_STYLE_PARTITIONING.key())) {
  Path hoodieTablePath = new Path(path);
  // Infer hive-style partitioning by listing the directories under the table path,
  // skipping the .hoodie metadata folder and the default partition.
  boolean hiveStyle = Arrays.stream(FSUtils.getFs(hoodieTablePath, hiveConf).listStatus(hoodieTablePath))
      .map(fileStatus -> fileStatus.getPath().getName())
      .filter(f -> !f.equals(".hoodie") && !f.equals("default"))
      .anyMatch(FilePathUtils::isHiveStylePartitioning);
  parameters.put(FlinkOptions.HIVE_STYLE_PARTITIONING.key(), String.valueOf(hiveStyle));
}

--- The params in this code are not inferred from the hoodie.properties file, which leads to an inference error.
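
For comparison, a hedged sketch of a lookup that prefers the persisted table config and only falls back to directory inference; the helpers readHiveStyleFromTableConfig and inferHiveStyleFromPartitionDirs are hypothetical placeholders for the hoodie.properties lookup shown earlier and for the listing logic above:

    // Prefer the value persisted in .hoodie/hoodie.properties; only infer from the
    // directory layout when the table config does not carry the key.
    if (!parameters.containsKey(FlinkOptions.HIVE_STYLE_PARTITIONING.key())) {
      String fromTableConfig = readHiveStyleFromTableConfig(path, hiveConf); // hypothetical helper, may return null
      String hiveStyle = fromTableConfig != null
          ? fromTableConfig
          : String.valueOf(inferHiveStyleFromPartitionDirs(path, hiveConf)); // hypothetical wrapper around the listing above
      parameters.put(FlinkOptions.HIVE_STYLE_PARTITIONING.key(), hiveStyle);
    }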

Contributor

Got you. I think we should not put extra options like FlinkOptions.HIVE_STYLE_PARTITIONING here; the right fix is to supplement the table config options on the Flink read/write path. I have actually applied a patch.

@hudi-bot
Collaborator

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@waywtdcc waywtdcc requested a review from danny0405 November 28, 2022 08:31
@codope codope added the priority:high (Significant impact; potential bugs) and engine:flink (Flink integration) labels Nov 29, 2022
@codope codope changed the title [HUDI-5275][flink]Fix reading data using the HoodieHiveCatalog will cause the Spark write to fail [HUDI-5275] Fix reading data using the HoodieHiveCatalog will cause the Spark write to fail Nov 29, 2022
@codope codope added the engine:spark (Spark integration) and area:catalog (Catalog integration) labels Nov 29, 2022
@nsivabalan nsivabalan added the priority:blocker (Production down; release blocker) and release-0.12.2 (Patches targetted for 0.12.2) labels and removed the priority:high (Significant impact; potential bugs) label Dec 5, 2022
Member
@codope codope left a comment

Could you please add a test for this scenario?

@codope codope added the priority:critical (Production degraded; pipelines stalled) label and removed the priority:blocker (Production down; release blocker) label Dec 7, 2022
@waywtdcc
Contributor Author

@danny0405 Hello, do you have any comments on this PR?

@danny0405
Contributor

Yeah, I have reviewed it and applied a patch:
5275.zip

The idea of the fix is: we had better not add extra options like FlinkOptions.HIVE_STYLE_PARTITIONING when reading the table; a more proper way to fix the issue is to merge the table config options into the reader/writer path.
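
A minimal sketch of that merge, assuming the table config has been read into plain java.util.Properties (as in the earlier snippet); the helper name mergeTableConfig is hypothetical, and the only point illustrated is that explicitly set write options keep priority over values coming from hoodie.properties:

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Properties;

    final class MergeSketch {
      // Hypothetical helper: copy table-config values only where the write config
      // does not already define the key, so explicit write options keep priority.
      static Map<String, String> mergeTableConfig(Map<String, String> writeOptions, Properties tableConfigProps) {
        Map<String, String> merged = new HashMap<>(writeOptions);
        for (String key : tableConfigProps.stringPropertyNames()) {
          merged.putIfAbsent(key, tableConfigProps.getProperty(key));
        }
        return merged;
      }
    }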

Can you apply the patch and add a test case in TestHoodieTableFactory covering 2 cases (to verify that the write config options always have higher priority):

  1. the table source merges built-in table config options that are not defined in the write config.
  2. the table source cannot override an existing write config option if the option value differs.

public static Map<String, String> loadFromHoodiePropertieFile(String basePath, Configuration hadoopConf) {
  // Load the table properties persisted under <basePath>/.hoodie/hoodie.properties.
  Path propertiesFilePath = getHoodiePropertiesFilePath(basePath);
  return getPropsFromFile(basePath, hadoopConf, propertiesFilePath);
}
Contributor

Hi, gentle ping :)

Contributor Author

Sorry, just saw it. Would it be better to replace loadFromHoodiePropertieFile with StreamerUtil.getTableConfig here?

@danny0405
Contributor

Because there has been no response for a long time and the code freeze is very near, I continued the work based on the patch and will land it once the CI passes: #7666

Closing this one instead.

@danny0405 danny0405 closed this Jan 13, 2023

Labels

  • area:catalog Catalog integration
  • engine:flink Flink integration
  • engine:spark Spark integration
  • priority:critical Production degraded; pipelines stalled
  • release-0.12.2 Patches targetted for 0.12.2
