
Conversation

@dongkelun (Contributor) commented Mar 16, 2022

Create a non-partitioned Hudi table with Spark SQL:

create table test_hudi_table (
  id int,
  name string,
  price double,
  ts long,
  dt string
) using hudi
options (
  primaryKey = 'id',
  preCombineField = 'ts',
  type = 'cow'
)
location '/tmp/test_hudi_table';

Then run a count in Hive on the Tez engine:

select count(1) from test_hudi_table;

The query fails with the following exception.

On Hudi 0.9.0:

ERROR : Status: Failed
ERROR : Vertex failed, vertexName=Map 1, vertexId=vertex_1647336877182_0100_4_00, diagnostics=[Vertex vertex_1647336877182_0100_4_00 [Map 1] killed/failed due to:ROOT_INPUT_INIT_FAILURE, Vertex Input: test_hudi_table initializer failed, vertex=vertex_1647336877182_0100_4_00 [Map 1], java.lang.NullPointerException
        at org.apache.hudi.hadoop.utils.HoodieInputFormatUtils.getTableMetaClientForBasePath(HoodieInputFormatUtils.java:327)
        at org.apache.hudi.hadoop.InputPathHandler.parseInputPaths(InputPathHandler.java:107)
        at org.apache.hudi.hadoop.InputPathHandler.<init>(InputPathHandler.java:68)
        at org.apache.hudi.hadoop.HoodieParquetInputFormat.listStatus(HoodieParquetInputFormat.java:80)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:325)
        at org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:524)
        at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:779)
        at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:243)
        at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:278)
        at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:269)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
        at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:269)
        at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:253)
        at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:108)
        at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:41)
        at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:77)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

Hudi master also throws an exception:

ERROR : Status: Failed
ERROR : Vertex failed, vertexName=Map 1, vertexId=vertex_1647336877182_0106_1_00, diagnostics=[Vertex vertex_1647336877182_0106_1_00 [Map 1] killed/failed due to:ROOT_INPUT_INIT_FAILURE, Vertex Input: test_hudi_table initializer failed, vertex=vertex_1647336877182_0106_1_00 [Map 1], java.lang.NullPointerException
        at org.apache.hudi.hadoop.utils.HoodieInputFormatUtils.getTableMetaClientForBasePathUnchecked(HoodieInputFormatUtils.java:335)
        at org.apache.hudi.hadoop.InputPathHandler.parseInputPaths(InputPathHandler.java:110)
        at org.apache.hudi.hadoop.InputPathHandler.<init>(InputPathHandler.java:72)
        at org.apache.hudi.hadoop.HoodieCopyOnWriteTableInputFormat.listStatus(HoodieCopyOnWriteTableInputFormat.java:109)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:325)
        at org.apache.hudi.hadoop.HoodieParquetInputFormatBase.getSplits(HoodieParquetInputFormatBase.java:68)
        at org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:524)
        at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:779)
        at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:243)
        at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:278)
        at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:269)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
        at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:269)
        at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:253)
        at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:108)
        at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:41)
        at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:77)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

What is the purpose of the pull request

Fix the Hive count exception thrown when the table is empty and the path depth is less than 3.

Brief change log

  • Change the value of DEFAULT_LEVELS_TO_BASEPATH from 3 to 0 (see the sketch below for why the old default fails)
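
The failure is easiest to see in isolation. Below is a minimal sketch (LevelsToBasePathDemo and guessBasePath are illustrative names, not the Hudi implementation) of why a hard-coded three-level climb fails for a shallow table path: starting from /tmp/test_hudi_table, the walk runs past the filesystem root and returns null, which the caller later dereferences as the NullPointerException above.

import org.apache.hadoop.fs.Path;

public class LevelsToBasePathDemo {
  // Mirrors the old default of 3 levels, which assumed a
  // yyyy/mm/dd-style partition layout under the base path.
  static final int DEFAULT_LEVELS_TO_BASEPATH = 3;

  static Path guessBasePath(Path inputPath, int levels) {
    Path p = inputPath;
    for (int i = 0; i < levels && p != null; i++) {
      p = p.getParent(); // getParent() returns null once we pass the root
    }
    return p;
  }

  public static void main(String[] args) {
    // /tmp/test_hudi_table is only two levels below the root, so climbing
    // three levels prints "null"; dereferencing that null is the NPE.
    System.out.println(guessBasePath(new Path("/tmp/test_hudi_table"), DEFAULT_LEVELS_TO_BASEPATH));
  }
}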

Verify this pull request

This change added tests and can be verified as follows:

  • Added testInputFormatLoadWithEmptyTable in TestHoodieParquetInputFormat.
  • Added testInputFormatLoadWithEmptyTable in TestHoodieHFileInputFormat.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@xushiyan changed the title from "[Hudi-3643] Fix hive count exception when the table is empty and the path depth is less than 3" to "[HUDI-3643] Fix hive count exception when the table is empty and the path depth is less than 3" on Mar 29, 2022
@xiarixiaoyao self-assigned this on Mar 30, 2022
@xiarixiaoyao (Contributor) commented:

@dongkelun
This is a problem with the Hoodie logic, not with the default value.

If we cannot find the PartitionMetadata file, we can check whether the .hoodie directory exists instead of checking for the PartitionMetadata file.

@dongkelun (Contributor, Author) commented:

If it is a partition path, the .hoodie directory does not exist there.
The purpose here is to find the base path: if a .hoodie directory exists, the path is the base path. Otherwise, we still need to determine the base path by reading the level recorded in the .hoodie_partition_metadata file (see the sketch below).
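
To illustrate the fallback described above, here is a minimal sketch, assuming .hoodie_partition_metadata is a Java properties file whose partitionDepth entry records how many levels the partition sits below the base path (the key name and file layout are assumptions for illustration; BasePathFromPartitionMetadata is a hypothetical helper, not Hudi code):

import java.io.IOException;
import java.util.Properties;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BasePathFromPartitionMetadata {
  static Path basePathFromMetadata(FileSystem fs, Path partitionPath) throws IOException {
    Properties props = new Properties();
    // Read the depth recorded in the partition metadata file (assumed layout).
    try (FSDataInputStream in = fs.open(new Path(partitionPath, ".hoodie_partition_metadata"))) {
      props.load(in);
    }
    int depth = Integer.parseInt(props.getProperty("partitionDepth"));
    // Climb that many levels to reach the base path.
    Path p = partitionPath;
    for (int i = 0; i < depth && p != null; i++) {
      p = p.getParent(); // null once we pass the filesystem root
    }
    return p;
  }
}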

@xiarixiaoyao (Contributor) commented:

If it is not the root directory of the table, we can move to the parent directory and continue; once we find a .hoodie directory, we can parse hoodie.properties to verify it (see the sketch below).
By the way, for partitioned tables, is there a situation where a partition path exists but .hoodie_partition_metadata does not?
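
A minimal sketch of this suggestion (FindBasePathByHoodieDir and findBasePath are hypothetical names, not the merged Hudi change): climb parent directories until a .hoodie folder appears; that directory marks the table base path, and hoodie.properties under it can then be parsed to verify the table.

import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FindBasePathByHoodieDir {
  static Path findBasePath(FileSystem fs, Path start) throws IOException {
    for (Path p = start; p != null; p = p.getParent()) {
      if (fs.isDirectory(new Path(p, ".hoodie"))) {
        // Found the base path; verification could read
        // <basePath>/.hoodie/hoodie.properties here.
        return p;
      }
    }
    return null; // reached the root without finding a Hudi table
  }
}

Unlike the fixed-depth walk sketched earlier, this loop terminates cleanly at the root and simply reports "not a Hudi table" instead of handing a null base path back to the caller.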

@dongkelun (Contributor, Author) commented:

1. truncate table deletes the .hoodie_partition_metadata file but does not delete the partition path.
2. I think your idea is sound; I'll try implementing it that way.

@dongkelun force-pushed the HUDI-3643 branch 2 times, most recently from 9c30548 to 0d444f0 on March 30, 2022 12:18
@hudi-bot (Collaborator) commented:

CI report:

Bot commands — @hudi-bot supports the following commands:
  • @hudi-bot run azure — re-run the last Azure build

@nsivabalan added the priority:blocker (Production down; release blocker) label and removed the priority:critical (Production degraded; pipelines stalled) label on Mar 30, 2022
@xushiyan (Member) left a comment:

LGTM.

@xushiyan merged commit 6a83964 into apache:master on Apr 7, 2022
xushiyan pushed a commit that referenced this pull request Apr 14, 2022

Labels

priority:blocker Production down; release blocker
