[SPARK-41423][CORE] Protobuf serializer for StageDataWrapper #39192
panbingkun wants to merge 17 commits into apache:master
Conversation
Waiting for me to add new UT.
repeated int64 rdd_ids = 43;
repeated AccumulableInfo accumulator_updates = 44;
map<int64, TaskData> tasks = 45;
optional map is not supported by pb
I see, hmm... should we encapsulate this map?
such as
optional TaskMap tasks = 45;
message TaskMap {
map<int64, TaskData> tasks = 1;
}
also cc @gengliangwang
Simply a map is OK here. An empty map should make no difference from None here.
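A minimal sketch of that convention on the deserialize side (the `TaskData` stand-in and `deserializeTasks` helper below are invented for illustration, not code from this PR):

```scala
import scala.collection.JavaConverters._

// Stand-in for the generated protobuf TaskData message.
case class TaskData(taskId: Long)

// Treat an empty protobuf map as None: a serialized None and a serialized
// empty map are indistinguishable on the wire, so this round-trips either way.
def deserializeTasks(
    tasksMap: java.util.Map[java.lang.Long, TaskData]): Option[Map[Long, TaskData]] = {
  if (tasksMap.isEmpty) {
    None
  } else {
    Some(tasksMap.asScala.map { case (id, task) => (id.toLong, task) }.toMap)
  }
}
```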
val description =
  getOptional(binary.hasDescription, () => weakIntern(binary.getDescription))
val accumulatorUpdates = Utils.deserializeAccumulableInfos(binary.getAccumulatorUpdatesList)
val tasks = MapUtils.isEmpty(binary.getTasksMap) match {
optional map is not supported by pb
Can one of the admins verify this patch?
also cc @techaddict
message StageData {
  enum StageStatus {
Why is StageStatus designed as an enum inside StageData?
If StageStatus is defined outside, the error message is as follows:

Then if StageStatus is defined as follows:
enum StageStatus {
  STAGE_STATUS_UNSPECIFIED = 0;
  STAGE_STATUS_ACTIVE = 1;
  STAGE_STATUS_COMPLETE = 2;
  STAGE_STATUS_FAILED = 3;
  STAGE_STATUS_PENDING = 4;
  STAGE_STATUS_SKIPPED = 5;
}
The code of the serializer and deserializer will be very ugly!
We would have to handle adding the prefix on serialize and stripping it on deserialize, as sketched below.
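To illustrate the concern, a hypothetical sketch of that prefix juggling (the object and method names here are invented for illustration, not code from this PR):

```scala
object StageStatusNames {
  private val Prefix = "STAGE_STATUS_"

  // Map the v1 enum name ("ACTIVE") to the prefixed protobuf value name
  // ("STAGE_STATUS_ACTIVE") and back by string surgery.
  def toBinaryName(status: String): String = Prefix + status

  def fromBinaryName(binaryName: String): String = binaryName.stripPrefix(Prefix)
}

// Every serialize/deserialize call site would need this extra translation:
//   StageStatusNames.toBinaryName("ACTIVE")                  == "STAGE_STATUS_ACTIVE"
//   StageStatusNames.fromBinaryName("STAGE_STATUS_ACTIVE")   == "ACTIVE"
```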
Similarly, wouldn't the enum definition of JobExecutionStatus be more reasonable inside JobData?
JobExecutionStatus is used in SQLExecutionUIData, so it can't be moved into JobData.
As described in https://github.com/apache/spark/pull/39270/files, UNSPECIFIED in StageStatus should be changed to STAGE_STATUS_UNSPECIFIED and moved out of StageData.
OK, let us follow the code style guide (https://developers.google.com/protocol-buffers/docs/style#enums):
[screenshot of the style guide's enum naming example]
New PR for JobExecutionStatus: #39286
@gengliangwang @LuciferYang
repeated int64 rdd_ids = 43;
repeated AccumulableInfo accumulator_updates = 44;
map<int64, TaskData> tasks = 45;
map<string, ExecutorStageSummary> executor_summary = 46;
stageData.rddIds.foreach(id => stageDataBuilder.addRddIds(id.toLong))
stageData.accumulatorUpdates.foreach { update =>
  stageDataBuilder.addAccumulatorUpdates(Utils.serializeAccumulableInfo(update))
}
I think there are 3 choices for the definition of the serializeAccumulableInfo function:
- Move it from the class TaskDataWrapperSerializer to the companion object TaskDataWrapperSerializer
- Move it from the class TaskDataWrapperSerializer to an object AccumulableInfoSerializer
- Keep the status quo and let StageDataWrapperSerializer hold a TaskDataWrapperSerializer instance

Similar suggestions apply to deserializeAccumulableInfo/serializeExecutorStageSummary/deserializeExecutorStageSummary, and I think Utils should hold more general functions.
+1 for AccumulableInfoSerializer
Ok, let me do it.
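The extracted helper then ends up looking roughly like this (a sketch with simplified stand-in types, not the merged code):

```scala
import java.util.{List => JList}
import scala.collection.JavaConverters._
import scala.collection.mutable.ArrayBuffer

// Stand-ins for org.apache.spark.status.api.v1.AccumulableInfo and the
// generated StoreTypes.AccumulableInfo.
case class V1AccumulableInfo(id: Long, name: String)
case class PbAccumulableInfo(id: Long, name: String)

// One shared object, so StageDataWrapperSerializer no longer needs to hold a
// TaskDataWrapperSerializer instance just for these helpers.
object AccumulableInfoSerializer {
  def serialize(input: V1AccumulableInfo): PbAccumulableInfo =
    PbAccumulableInfo(input.id, input.name)

  def deserialize(updates: JList[PbAccumulableInfo]): ArrayBuffer[V1AccumulableInfo] = {
    val accumulatorUpdates = new ArrayBuffer[V1AccumulableInfo](updates.size())
    updates.asScala.foreach(u => accumulatorUpdates += V1AccumulableInfo(u.id, u.name))
    accumulatorUpdates
  }
}
```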
This is a big one. @panbingkun Thanks for working on it!
object AccumulableInfoSerializer {

  private[protobuf] def serializeAccumulableInfo(
serializeAccumulableInfo -> serialize
    builder.build()
  }

  private[protobuf] def deserializeAccumulableInfos(
deserializeAccumulableInfos -> deserialize
nit: I prefer deserialize(info: AccumulableInfo), which looks more generic, but the current form is also OK
  private[protobuf] def deserializeAccumulableInfos(
      updates: JList[StoreTypes.AccumulableInfo]): ArrayBuffer[AccumulableInfo] = {
    val accumulatorUpdates = new ArrayBuffer[AccumulableInfo]()
override val supportClass: Class[_] = classOf[StageDataWrapper]

override def serialize(input: Any): Array[Byte] =
we can merge the two serialize methods into one
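That is, collapse the Any-typed override and the private typed helper into a single method. In miniature (the trait and classes below are simplified stand-ins, not Spark's actual ProtobufSerDe interface):

```scala
// Simplified stand-ins for the real interfaces.
trait SerDe {
  val supportClass: Class[_]
  def serialize(input: Any): Array[Byte]
}

case class StageDataWrapper(info: String)

object StageDataWrapperSerializer extends SerDe {
  override val supportClass: Class[_] = classOf[StageDataWrapper]

  // One serialize method: cast once and build the bytes inline, instead of
  // delegating to a second, typed serialize.
  override def serialize(input: Any): Array[Byte] =
    input.asInstanceOf[StageDataWrapper].info.getBytes(java.nio.charset.StandardCharsets.UTF_8)
}
```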
val executorSummary = MapUtils.isEmpty(binary.getExecutorSummaryMap) match {
  case true => None
  case _ => Some(binary.getExecutorSummaryMap.asScala.mapValues(
    ExecutorStageSummarySerializer.deserializeExecutorStageSummary(_)).toMap
this can be converted to a method value
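In other words, pass the method itself rather than wrapping it in a placeholder lambda. In miniature (`double` is a made-up example function):

```scala
def double(x: Int): Int = x * 2

val viaLambda = Seq(1, 2, 3).map(double(_)) // explicit placeholder lambda
val viaMethodValue = Seq(1, 2, 3).map(double) // method value, same result
```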
    case _ => Some(binary.getTasksMap.asScala.map(
      entry => (entry._1.toLong, deserializeTaskData(entry._2))).toMap)
  }
  val executorSummary = MapUtils.isEmpty(binary.getExecutorSummaryMap) match {
it's just true and false, so I prefer if {} else {}
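The suggestion in miniature (a toy map stands in for the protobuf one):

```scala
val tasksMap = Map(1L -> "task")

// Matching on a Boolean, as in the patch:
val viaMatch = tasksMap.isEmpty match {
  case true => None
  case _ => Some(tasksMap)
}

// The suggested plain conditional, same behavior:
val viaIf = if (tasksMap.isEmpty) None else Some(tasksMap)
```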
new ExecutorPeakMetricsDistributions(
  quantiles = binary.getQuantilesList.asScala.map(_.toDouble).toIndexedSeq,
  executorMetrics = binary.getExecutorMetricsList.asScala.map(
    ExecutorMetricsSerializer.deserialize(_)).toIndexedSeq
this can be converted to a method value as well
launchTime = new Date(binary.getLaunchTime),
resultFetchStart = resultFetchStart,
duration = duration,
executorId = weakIntern(binary.getExecutorId),
When should we use weakIntern? It seems not all serializers use it; will this affect performance?
For example, when constructing a new AccumulableInfo in AccumulableInfoSerializer, we didn't use weakIntern.
For consistency, we use weakIntern here.
As far as I know, it applies when the field is of type string (not including map<string, ...>).
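For context, weak interning deduplicates strings that repeat across many rows (executor IDs, host names) so they share one instance. A minimal sketch of the idea using Guava's weak interner (Spark's actual helper may differ):

```scala
import com.google.common.collect.Interners

object StringInterner {
  // Equal strings collapse to one shared instance; "weak" means the GC can
  // still reclaim entries that nothing else references.
  private val interner = Interners.newWeakInterner[String]()

  def weakIntern(s: String): String = interner.intern(s)
}

// Thousands of tasks from the same executor then share one String instance:
val a = StringInterner.weakIntern("executor-1")
val b = StringInterner.weakIntern("executor-1")
assert(a eq b)
```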
import org.apache.spark.status.api.v1.AccumulableInfo
import org.apache.spark.status.protobuf.Utils.getOptional

object AccumulableInfoSerializer {
Let's put the private[protobuf] before the object AccumulableInfoSerializer
So that we don't need to have private[protobuf] before each method
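In miniature, the suggested layout (the method bodies are placeholder stand-ins):

```scala
package org.apache.spark.status.protobuf

// One modifier on the object scopes everything inside it...
private[protobuf] object AccumulableInfoSerializer {
  // ...so the methods need no modifier of their own.
  def serialize(input: String): String = input
  def deserialize(binary: String): String = binary
}
```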
object ExecutorStageSummarySerializer {

  private[protobuf] def serialize(input: ExecutorStageSummary): StoreTypes.ExecutorStageSummary = {
import org.apache.spark.status.api.v1.StageStatus

object StageStatusSerializer {
  }
}

private def assert(result: TaskMetrics, input: TaskMetrics): Unit = {
nit: rename all the assert methods to checkAnswer(result, expected)?
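Something like this, for illustration (TaskMetrics simplified to one field):

```scala
case class TaskMetrics(executorRunTime: Long) // stand-in for the v1 API class

// Renamed from `assert`: the name and parameters now say which side is expected.
def checkAnswer(result: TaskMetrics, expected: TaskMetrics): Unit = {
  assert(result.executorRunTime == expected.executorRunTime)
}
```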
gengliangwang left a comment:
LGTM except a few minor comments
@panbingkun Thanks for the work, merging to master
…atorUpdates for Scala 2.13

### What changes were proposed in this pull request?
This PR is a followup of #39192 that excludes `StageData.rddIds` and `StageData.accumulatorUpdates` for Scala 2.13.

### Why are the changes needed?
To recover the Scala 2.13 build. It is currently broken (https://github.com/apache/spark/actions/runs/3824617107/jobs/6506925003):

```
[error] spark-core: Failed binary compatibility check against org.apache.spark:spark-core_2.13:3.3.0! Found 3 potential problems (filtered 997)
[error] * method rddIds()scala.collection.immutable.Seq in class org.apache.spark.status.api.v1.StageData has a different result type in current version, where it is scala.collection.Seq rather than scala.collection.immutable.Seq
[error] filter with: ProblemFilters.exclude[IncompatibleResultTypeProblem]("org.apache.spark.status.api.v1.StageData.rddIds")
[error] * method accumulatorUpdates()scala.collection.immutable.Seq in class org.apache.spark.status.api.v1.StageData has a different result type in current version, where it is scala.collection.Seq rather than scala.collection.immutable.Seq
[error] filter with: ProblemFilters.exclude[IncompatibleResultTypeProblem]("org.apache.spark.status.api.v1.StageData.accumulatorUpdates")
[error] * method this(org.apache.spark.status.api.v1.StageStatus,Int,Int,Int,Int,Int,Int,Int,Int,scala.Option,scala.Option,scala.Option,scala.Option,Long,Long,Long,Long,Long,Long,Long,Long,Long,Long,Long,Long,Long,Long,Long,Long,Long,Long,Long,Long,Long,Long,Long,Long,Long,java.lang.String,scala.Option,java.lang.String,java.lang.String,scala.collection.immutable.Seq,scala.collection.immutable.Seq,scala.Option,scala.Option,scala.Option,scala.collection.immutable.Map,Int,scala.Option,scala.Option,scala.Option)Unit in class org.apache.spark.status.api.v1.StageData's type is different in current version, where it is (org.apache.spark.status.api.v1.StageStatus,Int,Int,Int,Int,Int,Int,Int,Int,scala.Option,scala.Option,scala.Option,scala.Option,Long,Long,Long,Long,Long,Long,Long,Long,Long,Long,Long,Long,Long,Long,Long,Long,Long,Long,Long,Long,Long,Long,Long,Long,Long,java.lang.String,scala.Option,java.lang.String,java.lang.String,scala.collection.Seq,scala.collection.Seq,scala.Option,scala.Option,scala.Option,scala.collection.immutable.Map,Int,scala.Option,scala.Option,scala.Option)Unit instead of (org.apache.spark.status.api.v1.StageStatus,Int,Int,Int,Int,Int,Int,Int,Int,scala.Option,scala.Option,scala.Option,scala.Option,Long,Long,Long,Long,Long,Long,Long,Long,Long,Long,Long,Long,Long,Long,Long,Long,Long,Long,Long,Long,Long,Long,Long,Long,Long,java.lang.String,scala.Option,java.lang.String,java.lang.String,scala.collection.immutable.Seq,scala.collection.immutable.Seq,scala.Option,scala.Option,scala.Option,scala.collection.immutable.Map,Int,scala.Option,scala.Option,scala.Option)Unit
[error] filter with: ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.status.api.v1.StageData.this")
```

### Does this PR introduce _any_ user-facing change?
No, dev-only.

### How was this patch tested?
Manually tested.

Closes #39356 from HyukjinKwon/SPARK-41423.

Lead-authored-by: Hyukjin Kwon <gurwls223@apache.org>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
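The fix corresponds to the MiMa exclusions the error output asks for, roughly as they would be listed in project/MimaExcludes.scala (the exact surrounding context is assumed):

```scala
import com.typesafe.tools.mima.core._

// The three filters suggested verbatim by the MiMa error output above.
val stageDataExcludes = Seq(
  ProblemFilters.exclude[IncompatibleResultTypeProblem](
    "org.apache.spark.status.api.v1.StageData.rddIds"),
  ProblemFilters.exclude[IncompatibleResultTypeProblem](
    "org.apache.spark.status.api.v1.StageData.accumulatorUpdates"),
  ProblemFilters.exclude[IncompatibleMethTypeProblem](
    "org.apache.spark.status.api.v1.StageData.this")
)
```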
### What changes were proposed in this pull request?
Add a Protobuf serializer for StageDataWrapper.

### Why are the changes needed?
Support fast and compact serialization/deserialization for StageDataWrapper over RocksDB.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
New UT.