
Conversation

@dongkelun (Contributor) commented Mar 16, 2022

Create a non-partitioned Hudi table with Spark SQL:

create table test_hudi_table (
  id int,
  name string,
  price double,
  ts long,
  dt string
) using hudi
options (
  primaryKey = 'id',
  preCombineField = 'ts',
  type = 'cow'
)
location '/tmp/test_hudi_table';

Then run a count in Hive on the Tez engine:

select count(1) from test_hudi_table;

The query fails with the following exception.

On Hudi 0.9.0:

ERROR : Status: Failed
ERROR : Vertex failed, vertexName=Map 1, vertexId=vertex_1647336877182_0100_4_00, diagnostics=[Vertex vertex_1647336877182_0100_4_00 [Map 1] killed/failed due to:ROOT_INPUT_INIT_FAILURE, Vertex Input: test_hudi_table initializer failed, vertex=vertex_1647336877182_0100_4_00 [Map 1], java.lang.NullPointerException
        at org.apache.hudi.hadoop.utils.HoodieInputFormatUtils.getTableMetaClientForBasePath(HoodieInputFormatUtils.java:327)
        at org.apache.hudi.hadoop.InputPathHandler.parseInputPaths(InputPathHandler.java:107)
        at org.apache.hudi.hadoop.InputPathHandler.<init>(InputPathHandler.java:68)
        at org.apache.hudi.hadoop.HoodieParquetInputFormat.listStatus(HoodieParquetInputFormat.java:80)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:325)
        at org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:524)
        at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:779)
        at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:243)
        at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:278)
        at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:269)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
        at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:269)
        at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:253)
        at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:108)
        at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:41)
        at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:77)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

Hudi master also throws an exception:

ERROR : Status: Failed
ERROR : Vertex failed, vertexName=Map 1, vertexId=vertex_1647336877182_0106_1_00, diagnostics=[Vertex vertex_1647336877182_0106_1_00 [Map 1] killed/failed due to:ROOT_INPUT_INIT_FAILURE, Vertex Input: test_hudi_table initializer failed, vertex=vertex_1647336877182_0106_1_00 [Map 1], java.lang.NullPointerException
        at org.apache.hudi.hadoop.utils.HoodieInputFormatUtils.getTableMetaClientForBasePathUnchecked(HoodieInputFormatUtils.java:335)
        at org.apache.hudi.hadoop.InputPathHandler.parseInputPaths(InputPathHandler.java:110)
        at org.apache.hudi.hadoop.InputPathHandler.<init>(InputPathHandler.java:72)
        at org.apache.hudi.hadoop.HoodieCopyOnWriteTableInputFormat.listStatus(HoodieCopyOnWriteTableInputFormat.java:109)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:325)
        at org.apache.hudi.hadoop.HoodieParquetInputFormatBase.getSplits(HoodieParquetInputFormatBase.java:68)
        at org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:524)
        at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:779)
        at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:243)
        at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:278)
        at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:269)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
        at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:269)
        at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:253)
        at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:108)
        at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:41)
        at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:77)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

What is the purpose of the pull request

Fix the Hive count exception thrown when the table is empty and the path depth is less than 3.

Brief change log

  • Change the value of DEFAULT_LEVELS_TO_BASEPATH from 3 to 0 (see the sketch below for why the old default fails)
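
The failure is easiest to see in isolation. Below is a minimal sketch (LevelsToBasePathDemo and guessBasePath are illustrative names, not the Hudi implementation) of why a hard-coded three-level climb fails for a shallow table path: starting from /tmp/test_hudi_table, the walk runs past the filesystem root and returns null, which the caller later dereferences as the NullPointerException above.

import org.apache.hadoop.fs.Path;

public class LevelsToBasePathDemo {
  // Mirrors the old default of 3 levels, which assumed a
  // yyyy/mm/dd-style partition layout under the base path.
  static final int DEFAULT_LEVELS_TO_BASEPATH = 3;

  static Path guessBasePath(Path inputPath, int levels) {
    Path p = inputPath;
    for (int i = 0; i < levels && p != null; i++) {
      p = p.getParent(); // getParent() returns null once we pass the root
    }
    return p;
  }

  public static void main(String[] args) {
    // /tmp/test_hudi_table is only two levels below the root, so climbing
    // three levels prints "null"; dereferencing that null is the NPE.
    System.out.println(guessBasePath(new Path("/tmp/test_hudi_table"), DEFAULT_LEVELS_TO_BASEPATH));
  }
}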

Verify this pull request

This change added tests and can be verified as follows:

  • Added testInputFormatLoadWithEmptyTable in TestHoodieParquetInputFormat.
  • Added testInputFormatLoadWithEmptyTable in TestHoodieHFileInputFormat.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@xushiyan changed the title from "[Hudi-3643] Fix hive count exception when the table is empty and the path depth is less than 3" to "[HUDI-3643] Fix hive count exception when the table is empty and the path depth is less than 3" on Mar 29, 2022
@xiarixiaoyao self-assigned this on Mar 30, 2022
@xiarixiaoyao (Contributor) commented:

@dongkelun
This is a problem with the Hoodie logic, not with the default value.

If we cannot find the PartitionMetadata file, we can check whether the .hoodie directory exists instead of checking for the PartitionMetadata file.

@dongkelun (Contributor, Author) commented:

If it is a partition path, the .hoodie directory does not exist there.
The purpose here is to find the base path: if a .hoodie directory exists, the path is the base path. Otherwise, we still need to determine the base path by reading the level recorded in the .hoodie_partition_metadata file (see the sketch below).
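
To illustrate the fallback described above, here is a minimal sketch, assuming .hoodie_partition_metadata is a Java properties file whose partitionDepth entry records how many levels the partition sits below the base path (the key name and file layout are assumptions for illustration; BasePathFromPartitionMetadata is a hypothetical helper, not Hudi code):

import java.io.IOException;
import java.util.Properties;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BasePathFromPartitionMetadata {
  static Path basePathFromMetadata(FileSystem fs, Path partitionPath) throws IOException {
    Properties props = new Properties();
    // Read the depth recorded in the partition metadata file (assumed layout).
    try (FSDataInputStream in = fs.open(new Path(partitionPath, ".hoodie_partition_metadata"))) {
      props.load(in);
    }
    int depth = Integer.parseInt(props.getProperty("partitionDepth"));
    // Climb that many levels to reach the base path.
    Path p = partitionPath;
    for (int i = 0; i < depth && p != null; i++) {
      p = p.getParent(); // null once we pass the filesystem root
    }
    return p;
  }
}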

@xiarixiaoyao (Contributor) commented:

If it is not the root directory of the table, we can move to the parent directory and continue; once we find a .hoodie directory, we can parse hoodie.properties to verify it (see the sketch below).
By the way, for partitioned tables, is there a situation where a partition path exists but .hoodie_partition_metadata does not?
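
A minimal sketch of this suggestion (FindBasePathByHoodieDir and findBasePath are hypothetical names, not the merged Hudi change): climb parent directories until a .hoodie folder appears; that directory marks the table base path, and hoodie.properties under it can then be parsed to verify the table.

import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FindBasePathByHoodieDir {
  static Path findBasePath(FileSystem fs, Path start) throws IOException {
    for (Path p = start; p != null; p = p.getParent()) {
      if (fs.isDirectory(new Path(p, ".hoodie"))) {
        // Found the base path; verification could read
        // <basePath>/.hoodie/hoodie.properties here.
        return p;
      }
    }
    return null; // reached the root without finding a Hudi table
  }
}

Unlike the fixed-depth walk sketched earlier, this loop terminates cleanly at the root and simply reports "not a Hudi table" instead of handing a null base path back to the caller.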

@dongkelun (Contributor, Author) commented:

1. truncate table deletes the .hoodie_partition_metadata file but does not delete the partition path.
2. I think your idea is sound; I'll try implementing it that way.

@dongkelun force-pushed the HUDI-3643 branch 2 times, most recently from 9c30548 to 0d444f0 on March 30, 2022 12:18
@hudi-bot (Collaborator) commented:

CI report:

Bot commands — @hudi-bot supports the following commands:
  • @hudi-bot run azure — re-run the last Azure build

@nsivabalan added the priority:blocker (Production down; release blocker) label and removed the priority:critical (Production degraded; pipelines stalled) label on Mar 30, 2022
@xushiyan (Member) left a comment:

LGTM.

@xushiyan merged commit 6a83964 into apache:master on Apr 7, 2022
xushiyan pushed a commit that referenced this pull request Apr 14, 2022

Labels

priority:blocker Production down; release blocker
