Conversation

@alexeykudinkin
Contributor

What is the purpose of the pull request

Avoid including the whole MultipleSparkJobExecutionStrategy object in the closure that Spark has to serialize
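
For illustration, a minimal sketch of the general pattern behind this fix (hypothetical class and field names, not the actual Hudi change): referencing an instance field inside a Spark lambda implicitly captures this, so Spark tries to serialize the whole enclosing object; copying what the closure needs into a local variable first keeps the closure small.

import org.apache.spark.api.java.JavaRDD;

class ExampleStrategy {
  private final String schemaJson = "{...}";

  JavaRDD<String> captureWholeObject(JavaRDD<String> rdd) {
    // Referencing the field inside the lambda captures 'this', so Spark
    // must serialize the entire ExampleStrategy instance; if it is not
    // Serializable, this fails with "Task not serializable".
    return rdd.map(record -> record + schemaJson);
  }

  JavaRDD<String> captureOnlyWhatIsNeeded(JavaRDD<String> rdd) {
    // Copy the field into an effectively-final local first; only the
    // String is captured and serialized, not the enclosing object.
    String localSchemaJson = schemaJson;
    return rdd.map(record -> record + localSchemaJson);
  }
}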

Brief change log

See above

Verify this pull request

This pull request is a trivial rework / code cleanup without any test coverage.

Committer checklist

  • Has a corresponding JIRA in PR title & commit
  • Commit message is descriptive of the change
  • CI is green
  • Necessary doc changes done or have another open PR
  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

Alexey Kudinkin added 2 commits March 4, 2022 14:15
Contributor

@yihua yihua left a comment

LGTM. Good catch and a nice fix!

@yihua yihua self-assigned this Mar 4, 2022
Member

@xushiyan xushiyan left a comment

Good catch!

@alexeykudinkin
Contributor Author

@hudi-bot run azure

@hudi-bot
Collaborator

hudi-bot commented Mar 5, 2022

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@nsivabalan nsivabalan merged commit f0bcee3 into apache:master Mar 7, 2022
@nsivabalan nsivabalan added the priority:critical Production degraded; pipelines stalled label Mar 7, 2022
@boneanxs
Contributor

boneanxs commented Mar 8, 2022

Hi guys, I also hit this exception when enabling async clustering in a HoodieSparkStreaming job. It is not the same stack trace as the one this issue fixed; here is the stack trace I got:

 ERROR AsyncClusteringService: Clustering executor failed java.util.concurrent.CompletionException: org.apache.spark.SparkException: Task not serializable 
at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273) 
at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280) 
at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1606) 
at java.util.concurrent.CompletableFuture$AsyncSupply.exec(CompletableFuture.java:1596) 
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) 
at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056) 
at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692) 
at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)
Caused by: org.apache.spark.SparkException: Task not serializable 
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:416) 
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:406) 
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162) 
at org.apache.spark.SparkContext.clean(SparkContext.scala:2467) 
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$1(RDD.scala:912) 
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) 
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) 
at org.apache.spark.rdd.RDD.withScope(RDD.scala:414) 
at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:911) 
at org.apache.spark.api.java.JavaRDDLike.mapPartitionsWithIndex(JavaRDDLike.scala:103) 
at org.apache.spark.api.java.JavaRDDLike.mapPartitionsWithIndex$(JavaRDDLike.scala:99) 
at org.apache.spark.api.java.AbstractJavaRDDLike.mapPartitionsWithIndex(JavaRDDLike.scala:45) 
at org.apache.hudi.table.action.commit.SparkBulkInsertHelper.bulkInsert(SparkBulkInsertHelper.java:115) 
at org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy.performClusteringWithRecordsRDD(SparkSortAndSizeExecutionStrategy.java:68) 
at org.apache.hudi.client.clustering.run.strategy.MultipleSparkJobExecutionStrategy.lambda$runClusteringForGroupAsync$4(MultipleSparkJobExecutionStrategy.java:175) 
at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604) ... 5 more

Caused by: java.util.ConcurrentModificationException 
at java.util.LinkedHashMap$LinkedHashIterator.nextNode(LinkedHashMap.java:719) 
at java.util.LinkedHashMap$LinkedKeyIterator.next(LinkedHashMap.java:742) 
at java.util.HashSet.writeObject(HashSet.java:287) 
at sun.reflect.GeneratedMethodAccessor54.invoke(Unknown Source) 
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
at java.lang.reflect.Method.invoke(Method.java:498) 
at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1140) 
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) 
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) 
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) 
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) 
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) 
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) 
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) 
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) 
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) 
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) 
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) 
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) 
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) 
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) 
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) 
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) 
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) 
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) 
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) 
at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378) 
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) 
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) 
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) 
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) 
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) 
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) 
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44) 
at org.apache.spark.serializer.JavaSerializerInstance

From my perspective, it might be that TypedProperties#keys is not thread-safe: another thread modifies this HashSet (e.g., via put or putAll on TypedProperties) while Spark is iterating over it to serialize it. TypedProperties is reachable through HoodieTable's config (HoodieWriteConfig), so the approach of this PR, avoiding serializing HoodieTable, could fix it too.
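
For reference, a minimal standalone sketch of that class of race (plain JDK code, not Hudi code; names are made up for illustration): serializing a HashSet iterates over it, so a concurrent writer usually triggers the same ConcurrentModificationException as in HashSet.writeObject above.

import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;
import java.util.HashSet;
import java.util.Set;

public class CmeOnSerialize {
  public static void main(String[] args) throws Exception {
    Set<String> keys = new HashSet<>();
    for (int i = 0; i < 100_000; i++) {
      keys.add("key" + i);
    }

    // Keep mutating the set from another thread...
    Thread writer = new Thread(() -> {
      for (int i = 0; i < 100_000; i++) {
        keys.add("extra" + i);
      }
    });
    writer.start();

    // ...while serializing it: HashSet.writeObject iterates the set and
    // usually fails with ConcurrentModificationException.
    try (ObjectOutputStream oos = new ObjectOutputStream(new ByteArrayOutputStream())) {
      oos.writeObject(keys);
    }
    writer.join();
  }
}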

But when I tried to solve it the same way this PR did, I unfortunately found that avoiding serializing HoodieTable would require a lot of changes (changing the constructors of BulkInsertMapFunction, SparkLazyInsertIterable, HoodieLazyInsertIterable, and many kinds of WriteHandler); I'm afraid that would be a huge change.

Another solution is to make TypedProperties itself thread-safe; there are two ways to do that:

  1. Only change keys to Collections.newSetFromMap(new ConcurrentHashMap<>()). This avoids the ConcurrentModificationException, but TypedProperties is still not truly thread-safe, because modifying the keys attribute and saving the key-value pair happen as two separate steps, for example:

  // synchronized here does not actually help, because the get methods are not synchronized
  public synchronized Object put(Object key, Object value) {
    keys.remove(key);
    keys.add(key);
    // A reader can observe the key in keys before its value has been saved by TypedProperties
    return super.put(key, value);
  }
  2. Don't let TypedProperties extend Properties; instead use an internal ConcurrentHashMap to store keys and values, which makes TypedProperties truly thread-safe:

import java.io.Serializable;
import java.util.Collections;
import java.util.Enumeration;
import java.util.Objects;
import java.util.Properties;
import java.util.concurrent.ConcurrentHashMap;

public class TypedProperties implements Serializable {

  // ConcurrentHashMap makes individual reads and writes thread-safe, and its
  // iterators are weakly consistent, so serialization cannot hit a
  // ConcurrentModificationException.
  private final ConcurrentHashMap<Object, Object> props = new ConcurrentHashMap<>();

  public TypedProperties() {
  }

  public TypedProperties(Properties defaults) {
    if (Objects.nonNull(defaults)) {
      for (String key : defaults.stringPropertyNames()) {
        put(key, defaults.getProperty(key));
      }
    }
  }

  public Object put(Object key, Object value) {
    return props.put(key, value);
  }

  public Enumeration<Object> keys() {
    return Collections.enumeration(props.keySet());
  }
...
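
And a quick standalone check (again not Hudi code) of why the ConcurrentHashMap-backed design in option 2 avoids the exception: iteration over its key set is weakly consistent, so enumerating the keys while another thread is writing never throws ConcurrentModificationException.

import java.util.Collections;
import java.util.Enumeration;
import java.util.concurrent.ConcurrentHashMap;

public class WeaklyConsistentEnumeration {
  public static void main(String[] args) throws InterruptedException {
    ConcurrentHashMap<Object, Object> props = new ConcurrentHashMap<>();
    for (int i = 0; i < 100_000; i++) {
      props.put("key" + i, "value");
    }

    Thread writer = new Thread(() -> {
      for (int i = 0; i < 100_000; i++) {
        props.put("extra" + i, "value");
      }
    });
    writer.start();

    // Weakly consistent iteration: never throws ConcurrentModificationException;
    // concurrent puts may or may not be reflected in this enumeration.
    Enumeration<Object> keys = Collections.enumeration(props.keySet());
    int count = 0;
    while (keys.hasMoreElements()) {
      keys.nextElement();
      count++;
    }
    writer.join();
    System.out.println("Enumerated " + count + " keys without a CME");
  }
}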

Do you guys have any other suggestions? Thanks~

@alexeykudinkin
Contributor Author

alexeykudinkin commented Mar 8, 2022

@boneanxs can you create a JIRA for your issue so that we can keep track of it and concentrate all of the conversation there?

@boneanxs
Contributor

boneanxs commented Mar 9, 2022

@alexeykudinkin, @xushiyan, @yihua Sure, created a JIRA ticket: https://issues.apache.org/jira/browse/HUDI-3593; looking forward to getting your feedback :-)

vingov pushed a commit to vingov/hudi that referenced this pull request Apr 3, 2022
… object into the closure for Spark to serialize (apache#4954)

- Avoid including whole MultipleSparkJobExecutionStrategy object into the closure for Spark to serialize
stayrascal pushed a commit to stayrascal/hudi that referenced this pull request Apr 12, 2022
… object into the closure for Spark to serialize (apache#4954)

- Avoid including whole MultipleSparkJobExecutionStrategy object into the closure for Spark to serialize