
Conversation

@gengliangwang (Member):

### What changes were proposed in this pull request?

Introducing a Protobuf serializer for the KV store, which is three times as fast as the default serializer according to an end-to-end benchmark against RocksDB.

| Serializer                       | Avg Write time (μs) | Avg Read time (μs) | RocksDB File Total Size (MB) | Result total size in memory (MB) |
|----------------------------------|---------------------|--------------------|------------------------------|----------------------------------|
| Spark’s KV Serializer (JSON+gzip) | 352.2              | 119.26             | 837                          | 868                              |
| Protobuf                         | 109.9               | 34.3               | 858                          | 2105                             |

To move fast and make PR review easier, this PR will:

* Cover the class `JobDataWrapper` only. We can handle more UI data later.
* Not add a configuration for setting the serializer in the SHS (Spark History Server). We will have that as a follow-up.
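For orientation, here is a minimal sketch of the new serializer's shape. It assumes the `StoreTypes` classes that protoc generates from this PR's `.proto` definitions; the `setInfo`/`setJobId` accessors and the `serializeJobData` helper shown here are illustrative, not the exact code in the diff:

```scala
import org.apache.spark.status.JobDataWrapper
import org.apache.spark.status.protobuf.StoreTypes

object JobDataWrapperSerializerSketch {
  // Map the UI data class onto the generated protobuf message and emit bytes.
  def serialize(j: JobDataWrapper): Array[Byte] =
    StoreTypes.JobDataWrapper.newBuilder()
      .setInfo(serializeJobData(j.info))   // copy fields onto the generated builder
      .build()
      .toByteArray                          // protobuf handles the binary encoding

  // Hypothetical helper: copies JobData fields onto the generated builder
  // (only one field shown; the real converter covers them all).
  private def serializeJobData(info: org.apache.spark.status.api.v1.JobData): StoreTypes.JobData =
    StoreTypes.JobData.newBuilder().setJobId(info.jobId.toLong).build()
}
```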

### Why are the changes needed?

A faster serializer for the KV store. It supports schema evolution, so the SHS can leverage it in the future as well.
More details in the SPIP: https://docs.google.com/document/d/1cuKnFwlTodyVhUQPMuakq2YDaLH05jaY9FRu_aD1zMo/edit

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

@gengliangwang (Member, Author):

cc @LuciferYang as well.

@gengliangwang gengliangwang changed the title [SPARK-41244][UI] Introducing Protobuf serializer for KV store [SPARK-41244][UI] Introducing a Protobuf serializer for UI data on KV store Nov 23, 2022
```xml
<hadoop.version>3.3.4</hadoop.version>
<protobuf.version>2.5.0</protobuf.version>
<!-- Protobuf version for building with Hadoop/Yarn dependencies -->
<protobuf.hadoopDependency.version>2.5.0</protobuf.hadoopDependency.version>
```
@gengliangwang (Member, Author):

I don't have a better solution than this...

@gengliangwang gengliangwang changed the title [SPARK-41244][UI] Introducing a Protobuf serializer for UI data on KV store [WIP][SPARK-41244][UI] Introducing a Protobuf serializer for UI data on KV store Nov 24, 2022
@gengliangwang (Member, Author):

Pending on #38783

```xml
  </executions>
</plugin>
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
```
Contributor:

SPARK-40593 and SPARK-41215 did some work to support compiling Spark 3.4 on CentOS 6 & 7. I think the core module needs similar work, though maybe not in this PR.

Contributor:

@WolverineJiang Are you interested in solving this problem after this PR is merged, just like what you did in SPARK-41215?

@WolverineJiang:

> @WolverineJiang Are you interested in solving this problem after this PR is merged, just like what you did in SPARK-41215?

More than willing, I'll start in a few days.


```scala
description = description,
submissionTime = submissionTime,
completionTime = completionTime,
stageIds = info.getStageIdsList.asScala.map(_.toInt),
```
Contributor:

This needs `toSeq` for Scala 2.13. Or should we explicitly declare `stageIds` in `JobData` as `scala.collection.Seq`?

@gengliangwang (Member, Author):

Added `toSeq`.
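For context on why the explicit conversion is needed: under Scala 2.13, `Seq` aliases `scala.collection.immutable.Seq`, while `asScala` over a Java list yields a mutable `Buffer`, so the mapped result no longer conforms to `Seq` without `.toSeq`. A standalone sketch (not code from this PR):

```scala
import scala.jdk.CollectionConverters._ // the Scala 2.13 converters

object ToSeqSketch {
  def main(args: Array[String]): Unit = {
    val fromJava: java.util.List[java.lang.Integer] = java.util.Arrays.asList(1, 2, 3)
    // asScala returns a mutable.Buffer; on 2.13 this line fails to compile
    // without the trailing .toSeq, because Seq means immutable.Seq there.
    val stageIds: Seq[Int] = fromJava.asScala.map(_.toInt).toSeq
    println(stageIds)
  }
}
```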

@gengliangwang gengliangwang changed the title [WIP][SPARK-41244][UI] Introducing a Protobuf serializer for UI data on KV store [SPARK-41244][UI] Introducing a Protobuf serializer for UI data on KV store Nov 24, 2022
```protobuf
}

enum JobExecutionStatus {
  RUNNING = 0;
```
Contributor:

Do we need to add `UNSPECIFIED` to follow the protobuf style guideline? cc @amaliujia

Contributor:

Probably it's fine, as these proto messages are only used internally by Spark.

@gengliangwang (Member, Author):

I didn't get it. There is an `UNKNOWN` status already, so a value is always set for `JobExecutionStatus`.

Contributor:

FYI, this comes from the official style guide: 0 is used for `UNSPECIFIED`, and `UNSPECIFIED` should be kept: https://developers.google.com/protocol-buffers/docs/style#enums

Contributor:

+1, please reserve 0 for unspecified to allow for evolution.

@gengliangwang (Member, Author):

Sure, updated
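For the record, the pitfall this avoids: under proto3 semantics, an enum field that was never set decodes to whichever value is numbered 0, so with `RUNNING = 0` an absent status would be indistinguishable from a genuinely running job. A sketch of the behavior, assuming the generated `StoreTypes.JobData` message; the `setJobId`/`getStatus` accessor names are assumptions:

```scala
import org.apache.spark.status.protobuf.StoreTypes

object EnumZeroValueSketch {
  def main(args: Array[String]): Unit = {
    // Build a JobData without ever setting the status field, then round-trip it.
    val bytes = StoreTypes.JobData.newBuilder().setJobId(1L).build().toByteArray
    val decoded = StoreTypes.JobData.parseFrom(bytes)
    // proto3 hands back the zero-numbered enum value for the unset field;
    // reserving 0 for UNSPECIFIED keeps "not set" distinguishable from RUNNING.
    assert(decoded.getStatus.getNumber == 0)
  }
}
```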


```protobuf
message JavaDate {
  // The number of milliseconds since January 1, 1970, 00:00:00 GMT
  optional int64 date = 1;
```
Contributor:

Shall we use `int64` directly instead of having `JavaDate`? Milliseconds since the epoch is a very common definition and is not tied to Java.

@gengliangwang (Member, Author):

We will need utility methods for encoding/decoding Java dates in the future anyway. I prefer the current implementation; it looks more graceful.

Contributor:

`JavaDate` is not really a data structure. Here it's more like a marker indicating that the deserializer should turn it into a Java `Date`. I think this should be the responsibility of `JobDataWrapperSerializer`, not the proto message.

@gengliangwang (Member, Author):

OK, changed to `int64`.
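The conversion that moves out of the proto message and into the serializer is small; a sketch (the helper names are illustrative):

```scala
import java.util.Date

object DateCodecSketch {
  // Encode java.util.Date as the int64 field: milliseconds since the epoch.
  def serializeDate(d: Date): Long = d.getTime
  // Decode the int64 field back into java.util.Date.
  def deserializeDate(millis: Long): Date = new Date(millis)
}
```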

```scala
lazy val settings = Seq(
  // Setting version for the protobuf compiler. This has to be propagated to every sub-project
  // even if the project is not using it.
  PB.protocVersion := BuildCommons.protoVersion,
```
@cloud-fan (Contributor), Nov 25, 2022:

Did we copy this code from somewhere?

@gengliangwang (Member, Author):

Yes, both the connect and protobuf connector modules have this code.

@mridulm (Contributor), Nov 28, 2022:

+CC @thejdeep, @shardulm94

@mridulm (Contributor) left a review:

Just a few queries.
Thanks for working on this @gengliangwang !

```protobuf
}

message JobData {
  int32 job_id = 1;
```
Contributor:

Discuss:

Since we are defining the schema here, do we want to make all IDs (rdd, stage, job, etc.) `int64` instead of `int32`? Unlike with JSON, this will be a pain to change in protobuf later if we need to.

(Example: see this PR.)

@gengliangwang (Member, Author):

OK, changed to `int64`.

@gengliangwang (Member, Author):

@mridulm I am back from vacation today. Shall we merge this one and move forward?

@mridulm (Contributor) left a review:

Thanks for working on this @gengliangwang !

@gengliangwang (Member, Author):

@LuciferYang @cloud-fan @mridulm @amaliujia @tgravescs Thanks for the review.
I am merging this one to master.

```xml
</plugin>
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
```
@ulysses-you (Contributor):

@gengliangwang It actually conflicts with the parent pom.xml, which causes jetty to not be shaded into spark-core. The current master branch cannot start the Spark UI; see the error message:

```
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/12/09 14:58:54 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
java.lang.NoClassDefFoundError: org/eclipse/jetty/util/thread/Scheduler
  at org.apache.spark.ui.SparkUI.<init>(SparkUI.scala:67)
  at org.apache.spark.ui.SparkUI$.create(SparkUI.scala:223)
  at org.apache.spark.SparkContext.<init>(SparkContext.scala:484)
  at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2739)
  at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:978)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:972)
  at org.apache.spark.repl.Main$.createSparkSession(Main.scala:106)
  ... 55 elided
Caused by: java.lang.ClassNotFoundException: org.eclipse.jetty.util.thread.Scheduler
  at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
  ... 63 more
<console>:14: error: not found: value spark
       import spark.implicits._
              ^
```

@gengliangwang (Member, Author):

@ulysses-you Thanks for reporting it. Do you know how to fix it?

Contributor:

Maybe we should add `combine.children="append"` to the `<relocations>` tag.

@pan3793 (Member), Dec 9, 2022:

Sent a PR to fix it. #38999

Member:

Yeah, I also saw this when I built distributions. Thank you, @pan3793!

dongjoon-hyun pushed a commit that referenced this pull request Dec 9, 2022
### What changes were proposed in this pull request?

Fix shading in core module.

### Why are the changes needed?

Fixed issue mentioned in #38779 (comment)

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

```
➜  dist git:(SPARK-41244-followup) bin/spark-shell
22/12/09 16:35:43 WARN Utils: Your hostname, Chengs-MacBook-Pro.local resolves to a loopback address: 127.0.0.1; using 10.221.96.92 instead (on interface en0)
22/12/09 16:35:43 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/12/09 16:35:46 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context Web UI available at http://10.221.96.92:4040
Spark context available as 'sc' (master = local[*], app id = local-1670574946548).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.4.0-SNAPSHOT
      /_/

Using Scala version 2.12.17 (OpenJDK 64-Bit Server VM, Java 1.8.0_332)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.sql("select 1").show
+---+
|  1|
+---+
|  1|
+---+

```
<img width="1604" alt="image" src="https://user-images.githubusercontent.com/26535726/206659907-95d95d7f-dce2-4a4f-9aef-e54fb4612aba.png">

Closes #38999 from pan3793/SPARK-41244-followup.

Authored-by: Cheng Pan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
```scala
object JobDataWrapperSerializer {
  def serialize(j: JobDataWrapper): Array[Byte] = {
    val jobData = serializeJobData(j.info)
    val builder = StoreTypes.JobDataWrapper.newBuilder()
```
Contributor:

Hi @gengliangwang, where can we find the `StoreTypes` object? I cannot locate it. Is it auto-generated?

@gengliangwang (Member, Author):

It is generated at compile time. It should be under `./core/target/scala-2.12/src_managed/main/org/apache/spark/status/protobuf`.
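For anyone else looking for it: once the module is compiled, the generated classes behave like any other protobuf-java bindings. A sketch of the read path; the `getInfo`/`getJobId` accessors follow from the `.proto` field names and are assumptions:

```scala
import org.apache.spark.status.protobuf.StoreTypes

object ReadPathSketch {
  def jobIdOf(bytes: Array[Byte]): Long = {
    // parseFrom is the standard protobuf-java entry point on a generated message.
    val wrapper = StoreTypes.JobDataWrapper.parseFrom(bytes)
    wrapper.getInfo.getJobId
  }
}
```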

Contributor:

Thanks

beliefer pushed a commit to beliefer/spark that referenced this pull request Dec 18, 2022
beliefer pushed a commit to beliefer/spark that referenced this pull request Dec 18, 2022
LuciferYang added a commit that referenced this pull request Sep 6, 2023
### What changes were proposed in this pull request?
This PR aims to set `shadeTestJar` of the `core` module to `false` to skip the shading process for `spark-core**-tests.jar`.

### Why are the changes needed?
Before SPARK-41244, the core module used the `maven-shade-plugin` configuration in the parent `pom.xml` for shading, where the `shadeTestJar` configuration item was left at its default of `false`.

In SPARK-41244 (#38779), the core module started using a separate `maven-shade-plugin` configuration and set the `shadeTestJar` configuration item to `true`.

When `shadeTestJar` is `true`, `maven-shade-plugin` always tries to find (and sometimes download) some non-existent jars:

```
[INFO] --- maven-shade-plugin:3.5.0:shade (default)  spark-core_2.12 ---
[INFO] Including org.eclipse.jetty:jetty-plus:jar:9.4.51.v20230217 in the shaded jar.
Downloading from gcs-maven-central-mirror: https://maven-central.storage-download.googleapis.com/maven2/org/eclipse/jetty/jetty-plus/9.4.51.v20230217/jetty-plus-9.4.51.v20230217-tests.jar
Downloading from central: https://repo.maven.apache.org/maven2/org/eclipse/jetty/jetty-plus/9.4.51.v20230217/jetty-plus-9.4.51.v20230217-tests.jar
[WARNING] Could not get tests for org.eclipse.jetty:jetty-plus:jar:9.4.51.v20230217:compile
[INFO] Including org.eclipse.jetty:jetty-security:jar:9.4.51.v20230217 in the shaded jar.
Downloading from gcs-maven-central-mirror: https://maven-central.storage-download.googleapis.com/maven2/org/eclipse/jetty/jetty-security/9.4.51.v20230217/jetty-security-9.4.51.v20230217-tests.jar
Downloading from central: https://repo.maven.apache.org/maven2/org/eclipse/jetty/jetty-security/9.4.51.v20230217/jetty-security-9.4.51.v20230217-tests.jar
[WARNING] Could not get tests for org.eclipse.jetty:jetty-security:jar:9.4.51.v20230217:compile
[INFO] Including org.eclipse.jetty:jetty-util:jar:9.4.51.v20230217 in the shaded jar.
Downloading from gcs-maven-central-mirror: https://maven-central.storage-download.googleapis.com/maven2/org/eclipse/jetty/jetty-util/9.4.51.v20230217/jetty-util-9.4.51.v20230217-tests.jar
Downloading from central: https://repo.maven.apache.org/maven2/org/eclipse/jetty/jetty-util/9.4.51.v20230217/jetty-util-9.4.51.v20230217-tests.jar
[WARNING] Could not get tests for org.eclipse.jetty:jetty-util:jar:9.4.51.v20230217:compile
[INFO] Including org.eclipse.jetty:jetty-server:jar:9.4.51.v20230217 in the shaded jar.
[INFO] Including org.eclipse.jetty:jetty-io:jar:9.4.51.v20230217 in the shaded jar.
Downloading from gcs-maven-central-mirror: https://maven-central.storage-download.googleapis.com/maven2/org/eclipse/jetty/jetty-io/9.4.51.v20230217/jetty-io-9.4.51.v20230217-tests.jar
Downloading from central: https://repo.maven.apache.org/maven2/org/eclipse/jetty/jetty-io/9.4.51.v20230217/jetty-io-9.4.51.v20230217-tests.jar
[WARNING] Could not get tests for org.eclipse.jetty:jetty-io:jar:9.4.51.v20230217:compile
[INFO] Including org.eclipse.jetty:jetty-http:jar:9.4.51.v20230217 in the shaded jar.
[INFO] Including org.eclipse.jetty:jetty-continuation:jar:9.4.51.v20230217 in the shaded jar.
Downloading from gcs-maven-central-mirror: https://maven-central.storage-download.googleapis.com/maven2/org/eclipse/jetty/jetty-continuation/9.4.51.v20230217/jetty-continuation-9.4.51.v20230217-tests.jar
Downloading from central: https://repo.maven.apache.org/maven2/org/eclipse/jetty/jetty-continuation/9.4.51.v20230217/jetty-continuation-9.4.51.v20230217-tests.jar
[WARNING] Could not get tests for org.eclipse.jetty:jetty-continuation:jar:9.4.51.v20230217:compile
[INFO] Including org.eclipse.jetty:jetty-servlet:jar:9.4.51.v20230217 in the shaded jar.
[INFO] Including org.eclipse.jetty:jetty-proxy:jar:9.4.51.v20230217 in the shaded jar.
Downloading from gcs-maven-central-mirror: https://maven-central.storage-download.googleapis.com/maven2/org/eclipse/jetty/jetty-proxy/9.4.51.v20230217/jetty-proxy-9.4.51.v20230217-tests.jar
Downloading from central: https://repo.maven.apache.org/maven2/org/eclipse/jetty/jetty-proxy/9.4.51.v20230217/jetty-proxy-9.4.51.v20230217-tests.jar
[WARNING] Could not get tests for org.eclipse.jetty:jetty-proxy:jar:9.4.51.v20230217:compile
[INFO] Including org.eclipse.jetty:jetty-client:jar:9.4.51.v20230217 in the shaded jar.
Downloading from gcs-maven-central-mirror: https://maven-central.storage-download.googleapis.com/maven2/org/eclipse/jetty/jetty-client/9.4.51.v20230217/jetty-client-9.4.51.v20230217-tests.jar
Downloading from central: https://repo.maven.apache.org/maven2/org/eclipse/jetty/jetty-client/9.4.51.v20230217/jetty-client-9.4.51.v20230217-tests.jar
[WARNING] Could not get tests for org.eclipse.jetty:jetty-client:jar:9.4.51.v20230217:compile
[INFO] Including org.eclipse.jetty:jetty-servlets:jar:9.4.51.v20230217 in the shaded jar.
Downloading from gcs-maven-central-mirror: https://maven-central.storage-download.googleapis.com/maven2/org/eclipse/jetty/jetty-servlets/9.4.51.v20230217/jetty-servlets-9.4.51.v20230217-tests.jar
Downloading from central: https://repo.maven.apache.org/maven2/org/eclipse/jetty/jetty-servlets/9.4.51.v20230217/jetty-servlets-9.4.51.v20230217-tests.jar
[WARNING] Could not get tests for org.eclipse.jetty:jetty-servlets:jar:9.4.51.v20230217:compile
[INFO] Including com.google.protobuf:protobuf-java:jar:3.23.4 in the shaded jar.
Downloading from gcs-maven-central-mirror: https://maven-central.storage-download.googleapis.com/maven2/com/google/protobuf/protobuf-java/3.23.4/protobuf-java-3.23.4-tests.jar
Downloading from central: https://repo.maven.apache.org/maven2/com/google/protobuf/protobuf-java/3.23.4/protobuf-java-3.23.4-tests.jar
[WARNING] Could not get tests for com.google.protobuf:protobuf-java:jar:3.23.4:compile
[INFO] Including org.spark-project.spark:unused:jar:1.0.0 in the shaded jar.
Downloading from gcs-maven-central-mirror: https://maven-central.storage-download.googleapis.com/maven2/org/spark-project/spark/unused/1.0.0/unused-1.0.0-tests.jar
Downloading from central: https://repo.maven.apache.org/maven2/org/spark-project/spark/unused/1.0.0/unused-1.0.0-tests.jar
[WARNING] Could not get tests for org.spark-project.spark:unused:jar:1.0.0:compile
```

Under poor network conditions, the time for Maven compilation will increase.

On the other hand, setting `shadeTestJar` back to `false` has not resulted in any Maven test failures, so this PR resets `shadeTestJar` to `false`.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Pass GitHub Action

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #42766 from LuciferYang/SPARK-45046.

Authored-by: yangjie01 <[email protected]>
Signed-off-by: yangjie01 <[email protected]>