
Conversation

@Zouxxyy (Contributor) commented Dec 3, 2022

Change Logs

fix https://issues.apache.org/jira/browse/HUDI-5327

Currently, collect is used internally in bulk insert for [[Dataset<Row>]] when executing clustering, which causes two problems:

  1. A Spark job is generated inside the call, so when there are many clustering groups, too many Spark jobs are generated, which makes the Spark app harder to follow.
  2. Because no Executor is explicitly specified when submitting Spark jobs through supplyAsync, the number of Spark jobs that can run simultaneously is limited to the number of CPU cores of the driver, which may cause a performance bottleneck.

So, just remove collect in bulk insert for [[Dataset<Row>]] (see the sketch below).
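
For illustration, here is a rough, generic Spark sketch of the idea (hypothetical names, not the actual Hudi patch): instead of collecting the per-group write statuses to the driver and re-parallelizing them, the distributed RDD is returned directly, so no extra Spark job is triggered per clustering group.

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object RemoveCollectSketch {
  // Hypothetical stand-in for Hudi's WriteStatus.
  final case class WriteStatusLike(fileId: String, totalRecords: Long)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("sketch").getOrCreate()
    val sc = spark.sparkContext

    // Pretend each partition writes one file and reports one status for it.
    val rows: RDD[Int] = sc.parallelize(1 to 100000, numSlices = 8)
    val statuses: RDD[WriteStatusLike] = rows.mapPartitionsWithIndex { (idx, it) =>
      Iterator(WriteStatusLike(s"file-$idx", it.size.toLong))
    }

    // Before: collect() runs a Spark job for this clustering group, and the
    // driver then re-parallelizes the statuses into a brand-new RDD.
    val reParallelized = sc.parallelize(statuses.collect())

    // After: return the distributed RDD as-is; downstream code decides when
    // (and whether) to run an action on it.
    val returnedDirectly: RDD[WriteStatusLike] = statuses

    println(s"before=${reParallelized.count()} after=${returnedDirectly.count()}")
    spark.stop()
  }
}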

Impact

Makes the Spark app simpler and avoids possible performance bottlenecks when hoodie.datasource.write.row.writer.enable is enabled.
In addition, performClusteringWithRecordsRDD does not have the above problems because it does not use collect internally, so this change just keeps the two code paths consistent.

Risk level (write none, low medium or high below)

low

Documentation Update

None

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@hudi-bot (Collaborator) commented Dec 3, 2022

CI report:



writer.getWriteStatuses.asScala.map(_.toWriteStatus).iterator
}).collect()
table.getContext.parallelize(writeStatuses.toList.asJava)
Contributor commented:

I don't think this could be a reason for performance problems. Can you please elaborate on what you're trying to achieve here?

cc @boneanxs

@Zouxxyy (Contributor, Author) commented Dec 3, 2022

@alexeykudinkin

Currently, collect is used internally in bulk insert for [[Dataset<Row>]] when executing clustering, which causes:

  1. A Spark job is generated inside the call, so when there are many clustering groups, too many Spark jobs are generated, which makes the Spark app harder to follow.
  2. Because no Executor is explicitly specified when submitting Spark jobs through CompletableFuture.supplyAsync, the number of Spark jobs that can run simultaneously is limited to the number of CPU cores of the driver, which may cause a performance bottleneck.

In addition, performClusteringWithRecordsRDD does not have the above problems because it does not use collect internally, so I just keep their behavior consistent.

You can see https://issues.apache.org/jira/browse/HUDI-5327, where I described the case I encountered.

cc @boneanxs

@boneanxs (Contributor) commented Dec 5, 2022

Hey @Zouxxyy, thanks for raising this issue! It's great to see you trying out this feature!

The reason to collect the data here is that HoodieData<WriteStatus> will be used multiple times after performClustering. I recall there is an isEmpty check that could take a lot of time (validateWriteResult), so here we directly convert to a list of WriteStatus, which reduces that time.

For the second issue, I noticed this and raised a PR to fix it: #7343. Will that address your problem? Feel free to review it!

I think performClusteringWithRecordsRDD also has the same issue: for example, when RDDSpatialCurveSortPartitioner is used to optimize the data layout, it will call RDD.isEmpty, which triggers a new job.
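
For context, a minimal generic Spark sketch (not Hudi code) of the behavior being discussed: every action on a lazy RDD, including isEmpty, launches its own Spark job and may recompute the lineage, whereas a collection already materialized on the driver can be checked for free.

import org.apache.spark.sql.SparkSession

object IsEmptyCostSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("sketch").getOrCreate()
    val sc = spark.sparkContext

    // Stand-in for the write statuses produced by clustering.
    val writeStatuses = sc.parallelize(1 to 1000000).map(i => s"status-$i")

    // Lazy path: each action is a separate job in the Spark UI and, without
    // caching, re-runs the map above.
    val empty = writeStatuses.isEmpty()   // job 1
    val total = writeStatuses.count()     // job 2

    // Eager path (what the collect-based code does): one job materializes
    // everything on the driver, and later checks are local, at the cost of
    // holding every element in driver memory.
    val collected = writeStatuses.collect()  // job 3
    println(s"empty=$empty total=$total localEmpty=${collected.isEmpty}")

    spark.stop()
  }
}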

@Zouxxyy (Contributor, Author) commented Dec 5, 2022

@boneanxs
Regarding the isEmpty check that could take a lot of time, I provided a PR to fix it, #7373, so maybe we don't need #7343.

Contributor commented:

Yes, we can fix this by directly using getStat, but what if updateIndex computes writeStatusList multiple times? If we directly dereference RDD<WriteStatus> to a list of WriteStatus at one suitable point (as performClusteringWithRecordsAsRow already does), we no longer need to worry about this kind of issue.

As for the thread-pool parallelism that could cause the performance issue, I think performClusteringWithRecordsRDD has the same issue too. Since we might call partitioner.repartitionRecords, a new job could also be raised inside the Future thread, for example https://github.com/apache/hudi/blob/ea48a85efcf8e331d0cc105d426e830b8bfe5b37/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/RDDSpatialCurveSortPartitioner.java#L66 (checking whether the RDD is empty or not), or the sortBy call in RDDCustomColumnsSortPartitioner (sortBy uses RangePartitioner, which needs to sample the RDD first to decide the ranges, and that also raises a job inside the Future).
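
A minimal sketch (generic Spark plus java.util.concurrent, with hypothetical names, not the actual clustering executor) of the thread-pool point: jobs submitted via CompletableFuture.supplyAsync without an explicit executor run on ForkJoinPool.commonPool(), whose default parallelism is roughly the driver's core count, so that becomes the ceiling on how many clustering-group jobs run at once; passing an explicit, larger executor lifts that ceiling.

import java.util.concurrent.{CompletableFuture, Executors}
import org.apache.spark.sql.SparkSession

object SupplyAsyncParallelismSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("sketch").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical stand-in: each clustering group triggers one Spark job
    // (e.g. the isEmpty/sampling jobs mentioned above).
    def runOneGroup(groupId: Int): String = {
      val n = sc.parallelize(1 to 10000).map(_ * groupId).count()
      s"group-$groupId processed $n records"
    }

    // Default: no executor given, so the common ForkJoinPool is used and its
    // parallelism (availableProcessors - 1 by default) caps concurrent groups.
    val capped = (1 to 20).map(g => CompletableFuture.supplyAsync[String](() => runOneGroup(g)))
    capped.foreach(_.join())

    // With an explicit pool, the cap moves to the pool size instead.
    val pool = Executors.newFixedThreadPool(20)
    val widened = (1 to 20).map(g => CompletableFuture.supplyAsync[String](() => runOneGroup(g), pool))
    widened.foreach(_.join())
    pool.shutdown()

    spark.stop()
  }
}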

@Zouxxyy (Contributor, Author) commented Dec 6, 2022

@boneanxs
Regarding #7343, you are probably right; I overlooked that other operations may also generate a job. However, I'm wondering whether it's necessary to specifically add a parameter for this.

@Zouxxyy (Contributor, Author) commented:

@boneanxs
For the RDD reuse problem, I think we should use persist (fixed in #7373) instead of using collect and creating a new RDD; see the sketch below.
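
A minimal generic sketch of the two reuse strategies being compared (illustrative Spark code, not the actual Hudi change): persist the RDD so later actions reuse cached blocks, versus collect to the driver and re-parallelize into a new RDD.

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object PersistVsCollectSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("sketch").getOrCreate()
    val sc = spark.sparkContext

    // Stand-in for the write statuses that are reused several times downstream.
    val writeStatuses = sc.parallelize(1 to 100000).map(i => s"status-$i")

    // Option A (persist, as #7373 proposes): later actions read cached blocks
    // on executors; if blocks are evicted or an executor is lost, lineage
    // recomputes them.
    val persisted = writeStatuses.persist(StorageLevel.MEMORY_AND_DISK)
    val emptyCheck = persisted.isEmpty()
    val count = persisted.count()

    // Option B (collect + parallelize, the current code path): reuse is cheap
    // afterwards, but the driver must hold every element in memory at once.
    val redistributed = sc.parallelize(persisted.collect())
    println(s"empty=$emptyCheck count=$count redistributed=${redistributed.count()}")

    persisted.unpersist()
    spark.stop()
  }
}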

Contributor commented:

(quoting the comment above) Regarding #7343, you are probably right; I overlooked that other operations may also generate a job. However, I'm wondering whether it's necessary to specifically add a parameter for this.

I'd really appreciate it if you could review the PR and share your thoughts; could you please explain more in that PR? :)

Contributor commented:

@Zouxxyy in this case we should actually not rely on persist as a way to avoid double execution, since persisting is essentially just a caching mechanism (re-using cached blocks on executors) and it can't be relied upon: it could fail at any point if, for example, one of the executors fails, making you recompute the whole RDD.

@Zouxxyy (Contributor, Author) commented:

@alexeykudinkin, OK. WriteStatus has a large class attribute, writtenRecords, so this is fine as long as collect does not cause an OOM.

@YannByron YannByron self-requested a review December 5, 2022 04:49
@YannByron YannByron self-assigned this Dec 5, 2022
@nsivabalan nsivabalan added the priority:blocker Production down; release blocker label Dec 5, 2022
@nsivabalan nsivabalan added the release-0.12.2 Patches targetted for 0.12.2 label Dec 6, 2022
@Zouxxyy Zouxxyy closed this Dec 12, 2022