[HUDI-9334] Optimize Parallelism of show_invalid_parquet #13206

Merged
zhangyue19921010 merged 8 commits into apache:master from fhan688:optimize-show-invalid-parquet
Jun 16, 2025

@fhan688 fhan688 commented Apr 22, 2025

Change Logs

Previously, the parallelism was derived from the number of partition paths, which may not be enough in some scenarios.
This PR improves it by deriving the parallelism from the number of files instead.

before:
101

after:
102

Impact

hudi-spark-datasource

Risk level (write none, low medium or high below)

low

Documentation Update

none

@github-actions github-actions bot added the size:S PR with lines of changes in (10, 100] label Apr 22, 2025

@zhangyue19921010 zhangyue19921010 left a comment

LGTM. Waiting for CI to pass.

case ex: Exception =>
isInvalid = true
}
val parquetRdd = jsc.parallelize(fileStatus, Math.max(fileStatus.size, 1)).filter(fileStatus => {
Member

How about allowing a custom parallelism to be passed here and then grouping the file statuses into that many slices? Using the number of partitions as the parallelism gives too little concurrency and a long running time for each task. But won't using the number of files directly as the parallelism result in too many tasks? In some scenarios there can be tens of thousands of files, and tens of thousands of tasks would put a lot of pressure on Spark's task scheduling.
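
The grouping idea in this comment can be sketched without Spark. The following is a minimal illustration, not the code in this PR: `BucketSketch`, `bucket`, and the `requested` parameter are hypothetical names, and plain strings stand in for file statuses. It splits the files into at most `requested` roughly equal groups, so the task count stays bounded even with tens of thousands of files.

```scala
// Hypothetical helper for illustration only; not part of Hudi.
object BucketSketch {
  /** Splits `files` into at most `requested` roughly equal groups (at least one group if non-empty). */
  def bucket[A](files: Seq[A], requested: Int): List[Seq[A]] = {
    if (files.isEmpty) {
      Nil
    } else {
      // Never more groups than files, never fewer than one.
      val slices = math.max(1, math.min(files.size, requested))
      // Round up so that no more than `slices` groups are produced.
      val perSlice = math.ceil(files.size.toDouble / slices).toInt
      files.grouped(perSlice).toList
    }
  }
}
```

Under these assumptions, each group would be validated by one task, which keeps scheduling overhead proportional to the requested parallelism rather than to the file count.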

Contributor Author

Thanks for the advice. I'll introduce a custom parallelism parameter to avoid creating too many tasks.

@zhangyue19921010

@hudi-bot run azure

@github-actions github-actions bot added size:M PR with lines of changes in (100, 300] and removed size:S PR with lines of changes in (10, 100] labels Jun 12, 2025
ProcedureParameter.optional(2, "needDelete", DataTypes.BooleanType, false),
ProcedureParameter.optional(3, "partitions", DataTypes.StringType, ""),
ProcedureParameter.optional(4, "instants", DataTypes.StringType, "")
ProcedureParameter.required(1, "customParallelism", DataTypes.IntegerType),
Contributor

How about

    ProcedureParameter.optional(1, "parallelism", DataTypes.IntegerType, 100),

    if (fileStatus.isEmpty) {
      Seq.empty
    } else {
      val parquetRdd = jsc.parallelize(fileStatus, Math.max(fileStatus.size, parallelism)).filter(fileStatus => {
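
A point that may be worth checking in the expression above: `Math.max(fileStatus.size, parallelism)` yields a slice count at least as large as the file count, so it would not limit the number of tasks when there are many files. The sketch below only compares the arithmetic. `SliceCountSketch`, `withMax`, and `withMin` are hypothetical names, and whether the merged code uses either form is not shown in this thread.

```scala
// Hypothetical comparison for illustration only; not part of Hudi.
object SliceCountSketch {
  /** Slice count as written in the suggestion above. */
  def withMax(numFiles: Int, parallelism: Int): Int =
    math.max(numFiles, parallelism)

  /** A capped alternative (an assumption): at most `parallelism`, at least 1. */
  def withMin(numFiles: Int, parallelism: Int): Int =
    math.max(1, math.min(numFiles, parallelism))
}
```

With 50,000 files and a parallelism of 100, `withMax` gives 50,000 slices while `withMin` gives 100. With 7 files, `withMax` gives 100 slices, most of which would be empty, while `withMin` gives 7.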

@github-actions github-actions bot added size:S PR with lines of changes in (10, 100] and removed size:M PR with lines of changes in (100, 300] labels Jun 12, 2025
@zhangyue19921010

@hudi-bot run azure

fhan688 commented Jun 13, 2025

@hudi-bot run azure

@danny0405

There are test failures:

[ERROR] Errors: 
[ERROR]   TestDataSourceUtils>HoodieClientTestBase.setUp:70->HoodieSparkClientTestHarness.initResources:

@hudi-bot

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@zhangyue19921010 zhangyue19921010 merged commit d4edee2 into apache:master Jun 16, 2025
114 of 122 checks passed
alexr17 pushed a commit to alexr17/hudi that referenced this pull request Aug 25, 2025
* [HUDI-9334] optimize parallelism of show_invalid_parquet

* solve type mismatch

* introduce customParallelism

* rm invalid imports

* switch parallelism to optional

* fix scalastyle

* fix scalastyle

---------

Co-authored-by: fhan <yfhanfei@jd.com>
