Conversation

@jackye1995
Contributor

This is a draft prototype of approach 3, discussed on the mailing list, for supporting multiple Spark versions.

The code is organized in the following structure:

├── spark
│   ├── 2_4
│   │   ├── core
│   │   │   ├── benchmark
│   │   │   └── src
│   │   └── runtime
│   └── 3_0
│       ├── core
│       │   ├── benchmark
│       │   └── src
│       ├── extensions
│       │   └── src
│       └── runtime
│           └── src
├── spark-common
│   └── src
│       ├── jmh
│       │   └── java
│       ├── main
│       │   └── java
│       └── test
│           └── java

I introduce three system properties:

  • SparkBuildVersion: the actual Spark version to build against
  • SparkSrcVersion: the Spark source code version, which corresponds to the source directory to use
  • ScalaVersion: the Scala version to use

Then we can build against different Spark versions:

# 3.1.1
./gradlew clean && ./gradlew build
# 3.0.3
./gradlew clean && ./gradlew build -DsparkBuild=3.0.3
# 2.4.4
./gradlew clean && ./gradlew build -DsparkBuild=2.4.4 -DsparkSrc=2_4
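
A minimal sketch of how settings.gradle can consume these properties (the sparkSrc and projectDir lines match the snippets quoted in the review comments below; the property names and defaults for the other two are assumptions, not part of the draft):

// settings.gradle: resolve build parameters, falling back to the newest supported versions
var SparkBuildVersion = System.getProperty("sparkBuild") != null ? System.getProperty("sparkBuild") : '3.1.1'
var SparkSrcVersion = System.getProperty("sparkSrc") != null ? System.getProperty("sparkSrc") : '3_0'
var ScalaVersion = System.getProperty("scalaVersion") != null ? System.getProperty("scalaVersion") : '2.12'

// point the Spark core project at the selected source version directory
project(':spark').projectDir = file("spark/${SparkSrcVersion}/core")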

@aokolnychyi @RussellSpitzer @flyrain @szehon-ho @rdblue @openinx @stevenzwu

}

project(':spark-common').name = 'iceberg-spark-common'
project(':spark').projectDir = file("spark/${SparkSrcVersion}/core")
Contributor Author

Technically we want the name to be something like spark_3_0, which should match the jar we will publish to Maven. But I'm not sure if we should also include the Scala version here.
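
For illustration, a hypothetical way to fold both the source version and the Scala version into the module name (names here are illustrative, not what this draft does yet):

// settings.gradle: the project name would then match the published artifact, e.g. iceberg-spark_3_0_2.12
project(':spark').name = "iceberg-spark_${SparkSrcVersion}_${ScalaVersion}"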

throw new Exception("Expected java8 to build Spark 2.4, but got ${SparkSrcVersion}")
}

project(':spark-common').name = 'iceberg-spark-common'
Contributor Author

I renamed this module to common so that spark becomes the main directory containing all versions.

Contributor Author

The common package should contain only code that compiles across all versions. Then for every backwards-incompatible Spark version, we create a new source version directory, which is built against all subsequent Spark versions until one becomes incompatible. For example, source version 3_0 here is built against both 3.0.3 and 3.1.1. When we introduce 3_2, it will build against 3.2.x, and maybe even 3.3.x.

implementation "com.github.ben-manes.caffeine:caffeine"

compileOnly "org.apache.avro:avro"
compileOnly("org.apache.spark:spark-hive_2.11") {
Contributor Author

jackye1995 commented Oct 7, 2021

Ideally we want this to also build against the correct Scala and Spark versions, but currently the build does not pass when building against Spark 3. That should be fixed in the formal PR.
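
A hedged sketch of building this dependency against the parameterized Scala and Spark versions, reusing the ScalaVersion and SparkBuildVersion parameters from the description (the exclusions of the original dependency block are omitted here):

compileOnly("org.apache.spark:spark-hive_${ScalaVersion}:${SparkBuildVersion}") {
  // keep the exclusions from the original dependency block unchanged
}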

include 'spark-extensions'
include 'spark-runtime'

var SparkSrcVersion = System.getProperty("sparkSrc") != null ? System.getProperty("sparkSrc") : '3_0'
Contributor Author

Technically we don't have to expose this to users; instead we could keep a map of all build versions supported by each source version. But I am not sure yet which way is better.
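
A rough sketch of that map idea, covering only the versions mentioned in this prototype (the resolution logic is an assumption, not part of the draft):

// map each source version directory to the Spark builds it supports
def sparkVersionsBySrc = [
  '2_4': ['2.4.4'],
  '3_0': ['3.0.3', '3.1.1']
]
// derive the source version from the requested build version instead of exposing sparkSrc to users
def SparkSrcVersion = sparkVersionsBySrc.find { it.value.contains(SparkBuildVersion) }?.key ?: '3_0'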

@rdblue
Contributor

rdblue commented Oct 8, 2021

Thanks for getting this started, @jackye1995! I think this is a good first step, but there are a few things we should consider.

First, I agree that we probably want to parameterize the Scala version while we're doing this work. We are going to need to address it sooner or later and now seems like a good time to at least put in a parameter that is always 2.12 so that we can get a 2.13 build working later.

Next, I think we need to think about package naming. Right now, we produce iceberg-spark-runtime, iceberg-spark3-runtime, etc. I think this change produces just the iceberg-spark-runtime Jar and not the spark3 Jars, which could break scripts, because people depend on specific module names. We will also need to publish multiple Spark Jars, so I think we need to give them unique names. Scala has the convention of adding the Scala version to the module name, like _2.12. I think we should probably do the same and start publishing iceberg-spark-runtime_3.2_2.12 and iceberg-spark_3.2_2.12. We can continue to publish Spark 2.4 as iceberg-spark-runtime and Spark 3.0 as iceberg-spark3-runtime for consistency. Both of those have fixed Scala versions anyway, so that's good.

Since we probably want to be able to publish Jars for all Spark versions, I think it makes sense to be able to build the whole project with every Spark version. And since we will probably be embedding version strings in module names, I'm wondering whether the conditional project logic in the main build.gradle is still a good idea. Gradle now recommends a separate build.gradle for each subproject, which would work well here. We could have a build.gradle specific to each spark/v$major.$minor build directory and select whole projects using settings.gradle. That avoids big changes to build.gradle and makes it easy to evolve each Spark build independently.
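
A hypothetical sketch of that settings.gradle selection, with made-up project names and directory paths (see #3256 for the actual approach):

// settings.gradle: include a whole project per Spark version,
// each with its own build.gradle under spark/v$major.$minor
include ':iceberg-spark-3.2_2.12'
project(':iceberg-spark-3.2_2.12').projectDir = file('spark/v3.2/spark')

include ':iceberg-spark-runtime-3.2_2.12'
project(':iceberg-spark-runtime-3.2_2.12').projectDir = file('spark/v3.2/spark-runtime')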

I went ahead and tried out updating this build like this and the result is #3256.

@aokolnychyi
Contributor

Let me take a look at both PRs.
