[SPARK-41244][UI] Introducing a Protobuf serializer for UI data on KV store #38779
Conversation
cc @LuciferYang as well.
```xml
<hadoop.version>3.3.4</hadoop.version>
<protobuf.version>2.5.0</protobuf.version>
<!-- Protobuf version for building with Hadoop/Yarn dependencies -->
<protobuf.hadoopDependency.version>2.5.0</protobuf.hadoopDependency.version>
```
I don't have a better solution than this...
Pending on #38783
```xml
  </executions>
</plugin>
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
```
SPARK-40593 and SPARK-41215 did some work to support compiling Spark 3.4 on CentOS 6 & 7. I think the core module needs similar work, but that may not belong in this PR.
@WolverineJiang Are you interested in solving this problem after this PR is merged, just like what you did in SPARK-41215?
> @WolverineJiang Are you interested in solving this problem after this PR is merged, just like what you did in SPARK-41215?
More than willing, I'll start in a few days.
Thanks @WolverineJiang
```scala
description = description,
submissionTime = submissionTime,
completionTime = completionTime,
stageIds = info.getStageIdsList.asScala.map(_.toInt),
```
We need `toSeq` for Scala 2.13, or should we explicitly declare `stageIds` in `JobData` as `scala.collection.Seq`?
Added toSeq
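For context, a minimal sketch of the Scala 2.13 behavior being discussed (an illustrative snippet, not code from this PR):

```scala
import scala.jdk.CollectionConverters._

val ids: java.util.List[Integer] = java.util.Arrays.asList(1, 2, 3)

// asScala wraps the Java list as a mutable.Buffer, and mapping keeps it a Buffer.
// In Scala 2.13 the default `Seq` is scala.collection.immutable.Seq, so a Buffer
// no longer conforms to it and an explicit .toSeq is required.
val stageIds: Seq[Int] = ids.asScala.map(_.toInt).toSeq
```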
```proto
}

enum JobExecutionStatus {
  RUNNING = 0;
```
Do we need to add UNSPECIFIED to follow the protobuf style guideline? cc @amaliujia
Probably it's fine, as these proto messages are only used by Spark internally.
I didn't get it. There is an UNKNOWN status already, so there is always a value set for JobExecutionStatus.
FYI, the official style guide says that 0 should be used for UNSPECIFIED, and that the UNSPECIFIED value should be kept: https://developers.google.com/protocol-buffers/docs/style#enums
+1, please reserve 0 for unspecified to allow for evolution.
Sure, updated
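For illustration, a hedged sketch of what the Scala-side mapping can look like once 0 is reserved for an unspecified value (the generated enum value names below are assumptions, not the exact names in the merged proto):

```scala
import org.apache.spark.JobExecutionStatus

// Assumed generated proto enum values; with 0 reserved for *_UNSPECIFIED, the
// enum can evolve without an old reader misinterpreting an unset field as RUNNING.
def serializeStatus(s: JobExecutionStatus): StoreTypes.JobExecutionStatus = s match {
  case JobExecutionStatus.RUNNING   => StoreTypes.JobExecutionStatus.JOB_EXECUTION_STATUS_RUNNING
  case JobExecutionStatus.SUCCEEDED => StoreTypes.JobExecutionStatus.JOB_EXECUTION_STATUS_SUCCEEDED
  case JobExecutionStatus.FAILED    => StoreTypes.JobExecutionStatus.JOB_EXECUTION_STATUS_FAILED
  case JobExecutionStatus.UNKNOWN   => StoreTypes.JobExecutionStatus.JOB_EXECUTION_STATUS_UNKNOWN
}
```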
```proto
}

message JavaDate {
  // The number of milliseconds since January 1, 1970, 00:00:00 GMT
  optional int64 date = 1;
```
Shall we use int64 directly instead of having JavaDate? Milliseconds since epoch is a very common definition and not tied to Java.
We will need utility methods for encoding/decoding Java dates in the future anyway. I prefer the current implementation; it looks more graceful.
JavaDate is not really a data structure. Here it's more like a marker to indicate that the deserializer should turn it into a Java Date. I think this should be the responsibility of JobDataWrapperSerializer, not the proto message.
ok, changed to int64
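A minimal sketch of the conversion this implies, assuming a plain int64 of epoch milliseconds on the wire (helper names are illustrative):

```scala
import java.util.Date

// A java.util.Date is stored as epoch milliseconds (int64) and rebuilt on read.
def serializeDate(d: Date): Long = d.getTime

// An Option covers date fields that may be absent in the proto message.
def deserializeOptionalDate(millis: Option[Long]): Option[Date] =
  millis.map(new Date(_))
```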
```scala
lazy val settings = Seq(
  // Setting version for the protobuf compiler. This has to be propagated to every sub-project
  // even if the project is not using it.
  PB.protocVersion := BuildCommons.protoVersion,
```
Did we copy this code from somewhere?
Yes, both the connect and protobuf connector modules have this code.
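For reference, a hedged sketch of the shared sbt-protoc settings pattern those modules use (the object name and protoc version here are assumptions, not Spark's actual build code):

```scala
import sbt._
import sbt.Keys._
import sbtprotoc.ProtocPlugin.autoImport._

// Illustrative SparkBuild-style settings object; the real settings live in
// project/SparkBuild.scala and read the version from BuildCommons.
object ProtoSettings {
  lazy val settings = Seq(
    // Pin the protoc version once and propagate it to every sub-project,
    // even those that do not compile any .proto files themselves.
    PB.protocVersion := "3.21.9", // assumed value of BuildCommons.protoVersion
    Compile / PB.targets := Seq(
      PB.gens.java -> (Compile / sourceManaged).value
    )
  )
}
```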
+CC @thejdeep, @shardulm94
mridulm left a comment:
Just a few queries.
Thanks for working on this @gengliangwang !
```proto
}

message JobData {
  int32 job_id = 1;
```
Discuss: since we are defining the schema here, do we want to make all ids (rdd, stage, job, etc.) int64 instead of int32? This will be a pain to change in the future for protobuf if we need to, unlike JSON. (Example, see this PR.)
ok, changed to int64
@mridulm I am back from vacation today. Shall we merge this one and move forward?
mridulm left a comment:
Thanks for working on this @gengliangwang !
@LuciferYang @cloud-fan @mridulm @amaliujia @tgravescs Thanks for the review.
```xml
</plugin>
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
```
@gengliangwang It actually conflicts with the parent pom.xml, which causes jetty to not be shaded into spark-core. The current master branch cannot start the Spark UI; see the error message:
```
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/12/09 14:58:54 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
java.lang.NoClassDefFoundError: org/eclipse/jetty/util/thread/Scheduler
  at org.apache.spark.ui.SparkUI.<init>(SparkUI.scala:67)
  at org.apache.spark.ui.SparkUI$.create(SparkUI.scala:223)
  at org.apache.spark.SparkContext.<init>(SparkContext.scala:484)
  at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2739)
  at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:978)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:972)
  at org.apache.spark.repl.Main$.createSparkSession(Main.scala:106)
  ... 55 elided
Caused by: java.lang.ClassNotFoundException: org.eclipse.jetty.util.thread.Scheduler
  at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
  ... 63 more
<console>:14: error: not found: value spark
       import spark.implicits._
              ^
```
@ulysses-you Thanks for reporting it. Do you know how to fix it?
Maybe we should add `combine.children="append"` to the `<relocations>` tag.
Sent a PR to fix it. #38999
Ya, I also saw this when I built distributions. Thank you, @pan3793!
### What changes were proposed in this pull request?
Fix shading in core module.

### Why are the changes needed?
Fixed issue mentioned in #38779 (comment)

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
```
➜  dist git:(SPARK-41244-followup) bin/spark-shell
22/12/09 16:35:43 WARN Utils: Your hostname, Chengs-MacBook-Pro.local resolves to a loopback address: 127.0.0.1; using 10.221.96.92 instead (on interface en0)
22/12/09 16:35:43 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/12/09 16:35:46 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context Web UI available at http://10.221.96.92:4040
Spark context available as 'sc' (master = local[*], app id = local-1670574946548).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.4.0-SNAPSHOT
      /_/

Using Scala version 2.12.17 (OpenJDK 64-Bit Server VM, Java 1.8.0_332)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.sql("select 1").show
+---+
|  1|
+---+
|  1|
+---+
```

[Screenshot of the working Spark UI](https://user-images.githubusercontent.com/26535726/206659907-95d95d7f-dce2-4a4f-9aef-e54fb4612aba.png)

Closes #38999 from pan3793/SPARK-41244-followup.

Authored-by: Cheng Pan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
```scala
object JobDataWrapperSerializer {
  def serialize(j: JobDataWrapper): Array[Byte] = {
    val jobData = serializeJobData(j.info)
    val builder = StoreTypes.JobDataWrapper.newBuilder()
```
Where can we find the StoreTypes object? I cannot locate it; is it auto-generated?
It is generated at compile time. It should be under `./core/target/scala-2.12/src_managed/main/org/apache/spark/status/protobuf`.
Thanks
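Putting the pieces together, a hedged sketch completing the method shown in the diff above (the field accessors are assumptions based on the diff context, not the exact merged code):

```scala
object JobDataWrapperSerializer {
  def serialize(j: JobDataWrapper): Array[Byte] = {
    // StoreTypes is the Java class protoc generates from store_types.proto.
    val jobData = serializeJobData(j.info)
    val builder = StoreTypes.JobDataWrapper.newBuilder()
    builder.setInfo(jobData)
    j.skippedStages.foreach(builder.addSkippedStages(_))
    builder.build().toByteArray // the byte array written to the KV store
  }
}
```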
… store

### What changes were proposed in this pull request?
Introducing Protobuf serializer for KV store, which is 3 times as fast as the default serializer according to end-to-end benchmark against RocksDB.

| Serializer | Avg Write time(μs) | Avg Read time(μs) | RocksDB File Total Size(MB) | Result total size in memory(MB) |
|----------------------------------|--------------------|-------------------|-----------------------------|---------------------------------|
| Spark’s KV Serializer(JSON+gzip) | 352.2 | 119.26 | 837 | 868 |
| Protobuf | 109.9 | 34.3 | 858 | 2105 |

To move fast and make PR review easier, this PR will:
* Cover the class `JobDataWrapper` only. We can handle more UI data later.
* Not adding configuration for setting serializer in SHS. We will have it as a follow-up.

### Why are the changes needed?
A faster serializer for KV store. It supports schema evolution so that in the future SHS can leverage it as well. More details in the SPIP: https://docs.google.com/document/d/1cuKnFwlTodyVhUQPMuakq2YDaLH05jaY9FRu_aD1zMo/edit

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

Closes apache#38779 from gengliangwang/protobuf.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Gengliang Wang <[email protected]>
### What changes were proposed in this pull request?
This PR aims to set `shadeTestJar` of the `core` module to `false` to skip the shading of `spark-core**-tests.jar`.

### Why are the changes needed?
Before SPARK-41244, the core module used the `maven-shade-plugin` configuration in the parent pom.xml for shading, where the `shadeTestJar` configuration item was set to the default `false`. In SPARK-41244 (#38779), the core module started using a separate `maven-shade-plugin` configuration and set the `shadeTestJar` configuration item to `true`. And when `shadeTestJar` is true, `maven-shade-plugin` always tries to find (and sometimes download) some non-existent jars:
```
[INFO] --- maven-shade-plugin:3.5.0:shade (default) spark-core_2.12 ---
[INFO] Including org.eclipse.jetty:jetty-plus:jar:9.4.51.v20230217 in the shaded jar.
Downloading from gcs-maven-central-mirror: https://maven-central.storage-download.googleapis.com/maven2/org/eclipse/jetty/jetty-plus/9.4.51.v20230217/jetty-plus-9.4.51.v20230217-tests.jar
Downloading from central: https://repo.maven.apache.org/maven2/org/eclipse/jetty/jetty-plus/9.4.51.v20230217/jetty-plus-9.4.51.v20230217-tests.jar
[WARNING] Could not get tests for org.eclipse.jetty:jetty-plus:jar:9.4.51.v20230217:compile
[INFO] Including org.eclipse.jetty:jetty-security:jar:9.4.51.v20230217 in the shaded jar.
Downloading from gcs-maven-central-mirror: https://maven-central.storage-download.googleapis.com/maven2/org/eclipse/jetty/jetty-security/9.4.51.v20230217/jetty-security-9.4.51.v20230217-tests.jar
Downloading from central: https://repo.maven.apache.org/maven2/org/eclipse/jetty/jetty-security/9.4.51.v20230217/jetty-security-9.4.51.v20230217-tests.jar
[WARNING] Could not get tests for org.eclipse.jetty:jetty-security:jar:9.4.51.v20230217:compile
[INFO] Including org.eclipse.jetty:jetty-util:jar:9.4.51.v20230217 in the shaded jar.
Downloading from gcs-maven-central-mirror: https://maven-central.storage-download.googleapis.com/maven2/org/eclipse/jetty/jetty-util/9.4.51.v20230217/jetty-util-9.4.51.v20230217-tests.jar
Downloading from central: https://repo.maven.apache.org/maven2/org/eclipse/jetty/jetty-util/9.4.51.v20230217/jetty-util-9.4.51.v20230217-tests.jar
[WARNING] Could not get tests for org.eclipse.jetty:jetty-util:jar:9.4.51.v20230217:compile
[INFO] Including org.eclipse.jetty:jetty-server:jar:9.4.51.v20230217 in the shaded jar.
[INFO] Including org.eclipse.jetty:jetty-io:jar:9.4.51.v20230217 in the shaded jar.
Downloading from gcs-maven-central-mirror: https://maven-central.storage-download.googleapis.com/maven2/org/eclipse/jetty/jetty-io/9.4.51.v20230217/jetty-io-9.4.51.v20230217-tests.jar
Downloading from central: https://repo.maven.apache.org/maven2/org/eclipse/jetty/jetty-io/9.4.51.v20230217/jetty-io-9.4.51.v20230217-tests.jar
[WARNING] Could not get tests for org.eclipse.jetty:jetty-io:jar:9.4.51.v20230217:compile
[INFO] Including org.eclipse.jetty:jetty-http:jar:9.4.51.v20230217 in the shaded jar.
[INFO] Including org.eclipse.jetty:jetty-continuation:jar:9.4.51.v20230217 in the shaded jar.
Downloading from gcs-maven-central-mirror: https://maven-central.storage-download.googleapis.com/maven2/org/eclipse/jetty/jetty-continuation/9.4.51.v20230217/jetty-continuation-9.4.51.v20230217-tests.jar
Downloading from central: https://repo.maven.apache.org/maven2/org/eclipse/jetty/jetty-continuation/9.4.51.v20230217/jetty-continuation-9.4.51.v20230217-tests.jar
[WARNING] Could not get tests for org.eclipse.jetty:jetty-continuation:jar:9.4.51.v20230217:compile
[INFO] Including org.eclipse.jetty:jetty-servlet:jar:9.4.51.v20230217 in the shaded jar.
[INFO] Including org.eclipse.jetty:jetty-proxy:jar:9.4.51.v20230217 in the shaded jar.
Downloading from gcs-maven-central-mirror: https://maven-central.storage-download.googleapis.com/maven2/org/eclipse/jetty/jetty-proxy/9.4.51.v20230217/jetty-proxy-9.4.51.v20230217-tests.jar
Downloading from central: https://repo.maven.apache.org/maven2/org/eclipse/jetty/jetty-proxy/9.4.51.v20230217/jetty-proxy-9.4.51.v20230217-tests.jar
[WARNING] Could not get tests for org.eclipse.jetty:jetty-proxy:jar:9.4.51.v20230217:compile
[INFO] Including org.eclipse.jetty:jetty-client:jar:9.4.51.v20230217 in the shaded jar.
Downloading from gcs-maven-central-mirror: https://maven-central.storage-download.googleapis.com/maven2/org/eclipse/jetty/jetty-client/9.4.51.v20230217/jetty-client-9.4.51.v20230217-tests.jar
Downloading from central: https://repo.maven.apache.org/maven2/org/eclipse/jetty/jetty-client/9.4.51.v20230217/jetty-client-9.4.51.v20230217-tests.jar
[WARNING] Could not get tests for org.eclipse.jetty:jetty-client:jar:9.4.51.v20230217:compile
[INFO] Including org.eclipse.jetty:jetty-servlets:jar:9.4.51.v20230217 in the shaded jar.
Downloading from gcs-maven-central-mirror: https://maven-central.storage-download.googleapis.com/maven2/org/eclipse/jetty/jetty-servlets/9.4.51.v20230217/jetty-servlets-9.4.51.v20230217-tests.jar
Downloading from central: https://repo.maven.apache.org/maven2/org/eclipse/jetty/jetty-servlets/9.4.51.v20230217/jetty-servlets-9.4.51.v20230217-tests.jar
[WARNING] Could not get tests for org.eclipse.jetty:jetty-servlets:jar:9.4.51.v20230217:compile
[INFO] Including com.google.protobuf:protobuf-java:jar:3.23.4 in the shaded jar.
Downloading from gcs-maven-central-mirror: https://maven-central.storage-download.googleapis.com/maven2/com/google/protobuf/protobuf-java/3.23.4/protobuf-java-3.23.4-tests.jar
Downloading from central: https://repo.maven.apache.org/maven2/com/google/protobuf/protobuf-java/3.23.4/protobuf-java-3.23.4-tests.jar
[WARNING] Could not get tests for com.google.protobuf:protobuf-java:jar:3.23.4:compile
[INFO] Including org.spark-project.spark:unused:jar:1.0.0 in the shaded jar.
Downloading from gcs-maven-central-mirror: https://maven-central.storage-download.googleapis.com/maven2/org/spark-project/spark/unused/1.0.0/unused-1.0.0-tests.jar
Downloading from central: https://repo.maven.apache.org/maven2/org/spark-project/spark/unused/1.0.0/unused-1.0.0-tests.jar
[WARNING] Could not get tests for org.spark-project.spark:unused:jar:1.0.0:compile
```
Under poor network conditions, the time for Maven compilation will increase. On the other hand, setting `shadeTestJar` back to `false` has not resulted in any Maven test failures, so this PR resets `shadeTestJar` to `false`.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Pass GitHub Action

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #42766 from LuciferYang/SPARK-45046.

Authored-by: yangjie01 <[email protected]>
Signed-off-by: yangjie01 <[email protected]>
What changes were proposed in this pull request?
Introducing a Protobuf serializer for the KV store, which is 3 times as fast as the default serializer according to an end-to-end benchmark against RocksDB.
To move fast and make PR review easier, this PR will:
- Cover the class `JobDataWrapper` only. We can handle more UI data later.
- Not add configuration for setting the serializer in SHS. We will have it as a follow-up.

Why are the changes needed?
A faster serializer for the KV store. It supports schema evolution so that in the future SHS can leverage it as well.
More details in the SPIP: https://docs.google.com/document/d/1cuKnFwlTodyVhUQPMuakq2YDaLH05jaY9FRu_aD1zMo/edit
Does this PR introduce any user-facing change?
No
How was this patch tested?