
Conversation

@alexeykudinkin (Contributor) commented Jan 18, 2023

Change Logs

Due to RFC-46, the profile of the data being serialized by Hudi has changed considerably: previously we were mostly passing around Avro payloads, while now we hold our own internal HoodieRecord implementations.

When classes are not explicitly registered with Kryo, it has to serialize the class's fully qualified name (FQN) as the id every time an object is serialized, which carries a lot of unnecessary overhead.

To work around this, #7026 added HoodieSparkKryoRegistrar, registering some of the most commonly serialized Hudi classes. However, during rebasing/merging of the RFC-46 feature branch these changes were partially reverted, and this PR takes a stab at reinstating them.

On top of that, we had to revisit our current approach of bundling and shading Kryo universally for all bundles. Instead:

  • For engines providing Kryo (Spark, Flink) we don't bundle it at all
  • For other bundles still requiring it we bundle and shade it (the same way we do it today)

Impact

This will improve the performance of ser/de during shuffles, since no FQNs will need to be serialized.
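As a toy illustration of the overhead being removed (this is not Kryo's actual wire encoding, and the class and method names below are invented for the example): an unregistered class is tagged on the wire by its FQN string, while a registered one is tagged by a small integer id.

```java
import java.nio.charset.StandardCharsets;

// Toy model of the FQN-vs-id overhead: NOT Kryo's real encoding, just the idea.
public class RegistrationOverhead {

    // Bytes spent identifying the class when it is NOT registered: the full FQN string.
    static int unregisteredTagSize(String fqn) {
        return fqn.getBytes(StandardCharsets.UTF_8).length;
    }

    // Bytes spent when the class IS registered: a small varint id (1-2 bytes here).
    static int registeredTagSize(int id) {
        return id < 128 ? 1 : 2;
    }

    public static void main(String[] args) {
        String fqn = "org.apache.hudi.common.model.HoodieRecordGlobalLocation";
        // The per-object saving is paid on every record crossing a shuffle boundary.
        System.out.println("unregistered tag: " + unregisteredTagSize(fqn) + " bytes");
        System.out.println("registered tag:   " + registeredTagSize(10) + " bytes");
    }
}
```

The per-record saving compounds across a shuffle, which is where the ~20-30% figure discussed below comes from.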

Risk level (write none, low, medium or high below)

Low: our bundle validation and testing should uncover packaging issues, if any.

Documentation Update

A documentation update is required, specifying that the --conf spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar property is now mandatory.
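In spark-defaults.conf terms, the documented setup would look like the fragment below (the registrator class name comes from this PR; the Kryo serializer setting is the companion config already used in the tests touched here):

```properties
# Required: Hudi's registrator, so Hudi classes get compact Kryo ids
spark.kryo.registrator    org.apache.spark.HoodieSparkKryoRegistrar
# Companion setting typically used alongside it
spark.serializer          org.apache.spark.serializer.KryoSerializer
```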

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@alexeykudinkin alexeykudinkin added the priority:blocker Production down; release blocker label Jan 18, 2023
@alexeykudinkin alexeykudinkin changed the title Fixing Kryo registration to be properly wired into Spark sessions [HUDI-5579] Fixing Kryo registration to be properly wired into Spark sessions Jan 18, 2023
@alexeykudinkin alexeykudinkin force-pushed the ak/kryo-shd-fix branch 3 times, most recently from 70fcf7c to 11cceab Compare January 19, 2023 00:50
@alexeykudinkin alexeykudinkin added the engine:spark Spark integration label Jan 19, 2023
```scala
// NOTE: We're copying the definition of the config introduced in Spark 3.0
// (to stay compatible w/ Spark 2.4)
private val KRYO_USER_REGISTRATORS = "spark.kryo.registrator"
```

Contributor:

I guess we can make it public so that there is no need to hard-code the option key spark.kryo.registrator everywhere.

Contributor Author:

We actually won't be able to use it everywhere, so I'd rather stick with the Spark option key for consistency (which is how we handle every other option as well).

```scala
def register(conf: SparkConf): SparkConf = {
  conf.registerKryoClasses(new HoodieSparkKryoProvider().registerClasses())
  conf.set(KRYO_USER_REGISTRATORS, Seq(classOf[HoodieSparkKryoRegistrar].getName).mkString(","))
}
```
Contributor:

Does .mkString(",") make sense here?

Contributor Author:

We need to convert it to a string, so I kept it generic so that we can drop in one more class later. Not strictly necessary, though.
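To spell out the point of the exchange: Spark's spark.kryo.registrator value is a comma-separated list of class names, so building it with a join leaves room for appending a second registrator later. A minimal sketch of that pattern (the RegistratorConf class is invented for illustration, not Hudi code):

```java
import java.util.List;

// Illustrative helper: builds the value for spark.kryo.registrator.
public class RegistratorConf {

    // Spark parses this property as a comma-separated list of registrator FQNs.
    static String registratorValue(List<String> registratorClassNames) {
        return String.join(",", registratorClassNames);
    }

    public static void main(String[] args) {
        // Today there is a single registrator; the join is a no-op
        // but keeps the code generic, mirroring the mkString(",") above.
        System.out.println(registratorValue(
            List.of("org.apache.spark.HoodieSparkKryoRegistrar")));
    }
}
```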

```scala
.setMaster("local[4]")
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.set("spark.kryo.registrator", "org.apache.spark.HoodieSparkKryoRegistrar")
.set("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
```
Contributor:

Can we also move these common options into a utility method?

Contributor Author:

This is exactly the method you're referring to (used in tests).

```java
        HoodieRecordGlobalLocation.class
    };
})
.forEachOrdered(kryo::register);
```
Contributor:

A stateless function (one that has no side effects) is always the better choice, especially for a utility method; personally I prefer the old way we handled this.

Contributor Author:

Agreed in principle, but here we're actually aligning it with the KryoRegistrar interface, i.e. KryoRegistrator.
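One more note on why the registration is done with forEachOrdered: Kryo assigns integer ids to classes in registration order, and those ids must line up on both the serializing and deserializing side of a shuffle. A toy model of that invariant (a plain map stands in for a real Kryo instance; the OrderedRegistry class is invented for illustration):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy stand-in for Kryo's id table: ids are assigned in registration order.
public class OrderedRegistry {

    static Map<String, Integer> register(List<String> classNames) {
        Map<String, Integer> ids = new LinkedHashMap<>();
        for (String name : classNames) {
            ids.put(name, ids.size()); // sequential id, in call order
        }
        return ids;
    }

    public static void main(String[] args) {
        // Two JVMs registering the same classes in the same order agree on ids,
        // which is what makes shuffled bytes decodable on the other side.
        List<String> classes = List.of("HoodieKey", "HoodieRecordGlobalLocation");
        System.out.println(register(classes).equals(register(classes)));
    }
}
```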

```scala
.config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.config("hoodie.insert.shuffle.parallelism", "4")
.config("hoodie.upsert.shuffle.parallelism", "4")
.config("hoodie.delete.shuffle.parallelism", "4")
```
Contributor:

Not sure whether we can remove these parallelism options.

Contributor Author:

These are not removed -- they are replaced with options set in getSparkConfForTest.

```xml
<relocation>
  <pattern>com.esotericsoftware.kryo.</pattern>
  <shadedPattern>org.apache.hudi.com.esotericsoftware.kryo.</shadedPattern>
</relocation>
```
Contributor:

What is the purpose of moving the common bundle dependencies into each bundle's pom file?

Contributor Author:

We actually move it only to the bundles that will have Kryo included (the Spark and Flink bundles won't include Kryo).

@xushiyan (Member) left a comment:

lgtm

@xushiyan (Member)

> --conf spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar

But this is a very broad usability change; we should have brought this up for highlighting earlier.

@alexeykudinkin (Contributor Author)

> but this is a very broad usability change. we should have brought this up for highlighting earlier.

Agreed, not ideal, but unfortunately unavoidable -- without it we'd be passing around ~20-30% more dead-weight data, and in some cases it would actually lead to failures as well.

@apache apache deleted a comment from hudi-bot Jan 20, 2023
@alexeykudinkin (Contributor Author)

@hudi-bot run azure

@hudi-bot (Collaborator)

CI report:

@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build
@alexeykudinkin alexeykudinkin merged commit a70355f into apache:master Jan 21, 2023
lokeshj1703 added a commit to lokeshj1703/hudi that referenced this pull request Jan 23, 2023
@xushiyan xushiyan mentioned this pull request Jan 30, 2023
fengjian428 pushed a commit to fengjian428/hudi that referenced this pull request Jan 31, 2023
fengjian428 pushed a commit to fengjian428/hudi that referenced this pull request Apr 5, 2023