[HUDI-9334] Optimize Parallelism of show_invalid_parquet #13206
zhangyue19921010 merged 8 commits into apache:master
Conversation
zhangyue19921010
left a comment
LGTM, waiting for CI to pass.
```scala
case ex: Exception =>
  isInvalid = true
}
val parquetRdd = jsc.parallelize(fileStatus, Math.max(fileStatus.size, 1)).filter(fileStatus => {
```
How about allowing a custom parallelism to be passed in, and then aggregating the file statuses into that parallelism? Using the number of partitions as the parallelism leads to too little concurrency and long-running single tasks. But wouldn't directly using the number of files as the degree of parallelism produce too many tasks? Some scenarios involve tens of thousands of files, and tens of thousands of concurrent tasks would put a lot of pressure on Spark's task scheduling.
Thanks for the advice. I'll introduce a custom parallelism to avoid creating too many tasks.
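One way to read the reviewer's concern is to let the task count grow with the file count but cap it at a user-supplied limit. A minimal sketch of that idea (the helper name `chooseParallelism` and the cap semantics are illustrative only; the merged code may differ):

```scala
// Sketch only: scale RDD parallelism with the number of files, but cap it
// at a user-supplied limit so tens of thousands of files do not turn into
// tens of thousands of Spark tasks.
def chooseParallelism(numFiles: Int, userParallelism: Int): Int =
  math.max(1, math.min(numFiles, userParallelism))
```

With a cap like this, each Spark task handles a batch of files instead of a single file, which keeps scheduling overhead bounded.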
@hudi-bot run azure
```scala
ProcedureParameter.optional(2, "needDelete", DataTypes.BooleanType, false),
ProcedureParameter.optional(3, "partitions", DataTypes.StringType, ""),
ProcedureParameter.optional(4, "instants", DataTypes.StringType, "")
ProcedureParameter.required(1, "customParallelism", DataTypes.IntegerType),
```
How about

```scala
ProcedureParameter.optional(1, "parallelism", DataTypes.IntegerType, 100),
```

```scala
if (fileStatus.isEmpty) {
  Seq.empty
} else {
  val parquetRdd = jsc.parallelize(fileStatus, Math.max(fileStatus.size, parallelism)).filter(fileStatus => {
```
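With an optional `parallelism` parameter along these lines, calling the procedure could look like the following (the `path` parameter name and the values shown are assumptions for illustration, not taken from this PR):

```scala
// Hypothetical invocation from a Spark session; only `parallelism` is
// confirmed by the discussion above, other names are assumed.
spark.sql(
  "CALL show_invalid_parquet(path => 'hdfs:///tmp/hudi_tbl', parallelism => 200)")
```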
@hudi-bot run azure
There are test failures:

```
[ERROR] Errors:
[ERROR] TestDataSourceUtils>HoodieClientTestBase.setUp:70->HoodieSparkClientTestHarness.initResources:
```
* [HUDI-9334] optimize parallelism of show_invalid_parquet
* solve type mismatch
* introduce customParallelism
* rm invalid imports
* switch parallelism to optional
* fix scalastyle
* fix scalastyle

Co-authored-by: fhan <yfhanfei@jd.com>
Change Logs
The former parallelism was based on the number of partition paths, which may not be enough in some scenarios.
This PR optimizes that by deriving the parallelism from the number of files, tunable via an optional `parallelism` parameter.
before: (screenshot)

after: (screenshot)
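The gist of the change can be sketched as follows (a simplified paraphrase of the diff discussed above, not the exact merged code; `partitionPaths`, `fileStatus`, and `parallelism` stand in for the PR's variables):

```scala
// Before (sketch): parallelism tied to the number of partition paths,
// so a table with few partitions but many files gets few, long tasks.
val beforeRdd = jsc.parallelize(partitionPaths, Math.max(partitionPaths.size, 1))

// After (sketch): parallelism derived from the number of files, with a
// user-supplied `parallelism` parameter to tune the task count.
val afterRdd = jsc.parallelize(fileStatus, Math.max(fileStatus.size, parallelism))
```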
Impact
hudi-spark-datasource
Risk level (write none, low, medium, or high below)
low
Documentation Update
none
Contributor's checklist