Conversation

@jackye1995
Contributor

This is a draft prototype of approach 3, discussed on the mailing list, for supporting multiple Spark versions.

The code is organized in the following structure:

├── spark
│   ├── 2_4
│   │   ├── core
│   │   │   ├── benchmark
│   │   │   └── src
│   │   └── runtime
│   └── 3_0
│       ├── core
│       │   ├── benchmark
│       │   └── src
│       ├── extensions
│       │   └── src
│       └── runtime
│           └── src
├── spark-common
│   └── src
│       ├── jmh
│       │   └── java
│       ├── main
│       │   └── java
│       └── test
│           └── java

I introduce three system properties:

  • SparkBuildVersion: the actual Spark version to build against
  • SparkSrcVersion: the Spark source code version, which corresponds to the source directory to use
  • ScalaVersion: the Scala version to use

Then we can build against different Spark versions:

# 3.1.1
./gradlew clean && ./gradlew build
# 3.0.3
./gradlew clean && ./gradlew build -DsparkBuild=3.0.3
# 2.4.4
./gradlew clean && ./gradlew build -DsparkBuild=2.4.4 -DsparkSrc=2_4
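
A minimal sketch of how settings.gradle can consume these properties (the sparkSrc and projectDir lines match the snippets quoted in the review comments below; the property names and defaults for the other two are assumptions, not part of the draft):

// settings.gradle: resolve build parameters, falling back to the newest supported versions
var SparkBuildVersion = System.getProperty("sparkBuild") != null ? System.getProperty("sparkBuild") : '3.1.1'
var SparkSrcVersion = System.getProperty("sparkSrc") != null ? System.getProperty("sparkSrc") : '3_0'
var ScalaVersion = System.getProperty("scalaVersion") != null ? System.getProperty("scalaVersion") : '2.12'

// point the Spark core project at the selected source version directory
project(':spark').projectDir = file("spark/${SparkSrcVersion}/core")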

@aokolnychyi @RussellSpitzer @flyrain @szehon-ho @rdblue @openinx @stevenzwu

}

project(':spark-common').name = 'iceberg-spark-common'
project(':spark').projectDir = file("spark/${SparkSrcVersion}/core")
Contributor Author

Technically we want the name to be something like spark_3_0, which should match the jar we will publish to Maven. But I'm not sure if we should also include the Scala version here.
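
For illustration, a hypothetical way to fold both the source version and the Scala version into the module name (names here are illustrative, not what this draft does yet):

// settings.gradle: the project name would then match the published artifact, e.g. iceberg-spark_3_0_2.12
project(':spark').name = "iceberg-spark_${SparkSrcVersion}_${ScalaVersion}"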

throw new Exception("Expected java8 to build Spark 2.4, but got ${SparkSrcVersion}")
}

project(':spark-common').name = 'iceberg-spark-common'
Contributor Author

I renamed this module to common so that spark becomes the main directory containing all versions.

Contributor Author

The common package should contain only code that compiles across all versions. Then for every backwards-incompatible Spark version, we create a new source version directory, which is built against all subsequent Spark versions until one becomes incompatible. For example, source version 3_0 here is built against both 3.0.3 and 3.1.1. When we introduce 3_2, it will build against 3.2.x, and maybe even 3.3.x.

implementation "com.github.ben-manes.caffeine:caffeine"

compileOnly "org.apache.avro:avro"
compileOnly("org.apache.spark:spark-hive_2.11") {
Contributor Author

jackye1995 commented Oct 7, 2021

Ideally we want this to also build against the correct Scala and Spark versions, but currently the build does not pass when building against Spark 3. That should be fixed in the formal PR.
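
A hedged sketch of building this dependency against the parameterized Scala and Spark versions, reusing the ScalaVersion and SparkBuildVersion parameters from the description (the exclusions of the original dependency block are omitted here):

compileOnly("org.apache.spark:spark-hive_${ScalaVersion}:${SparkBuildVersion}") {
  // keep the exclusions from the original dependency block unchanged
}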

include 'spark-extensions'
include 'spark-runtime'

var SparkSrcVersion = System.getProperty("sparkSrc") != null ? System.getProperty("sparkSrc") : '3_0'
Contributor Author

Technically we don't have to expose this to users; instead we could keep a map of all build versions supported by each source version. But I am not sure yet which way is better.
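
A rough sketch of that map idea, covering only the versions mentioned in this prototype (the resolution logic is an assumption, not part of the draft):

// map each source version directory to the Spark builds it supports
def sparkVersionsBySrc = [
  '2_4': ['2.4.4'],
  '3_0': ['3.0.3', '3.1.1']
]
// derive the source version from the requested build version instead of exposing sparkSrc to users
def SparkSrcVersion = sparkVersionsBySrc.find { it.value.contains(SparkBuildVersion) }?.key ?: '3_0'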

@rdblue
Contributor

rdblue commented Oct 8, 2021

Thanks for getting this started, @jackye1995! I think this is a good first step, but there are a few things we should consider.

First, I agree that we probably want to parameterize the Scala version while we're doing this work. We are going to need to address it sooner or later and now seems like a good time to at least put in a parameter that is always 2.12 so that we can get a 2.13 build working later.

Next, I think we need to think about package naming. Right now, we produce iceberg-spark-runtime, iceberg-spark3-runtime, etc. I think this change produces just the iceberg-spark-runtime Jar and not the spark3 Jars, which could break scripts, because people depend on specific module names. We will also need to publish multiple Spark Jars, so I think we need to give them unique names. Scala has the convention of adding the Scala version to the module name, like _2.12. I think we should probably do the same and start publishing iceberg-spark-runtime_3.2_2.12 and iceberg-spark_3.2_2.12. We can continue to publish Spark 2.4 as iceberg-spark-runtime and Spark 3.0 as iceberg-spark3-runtime for consistency. Both of those have fixed Scala versions anyway, so that's good.

Since we probably want to be able to publish Jars for all Spark versions, I think it makes sense to be able to build the whole project with every Spark version. And since we will probably be embedding version strings in module names, I'm wondering whether the conditional project logic in the main build.gradle is still a good idea. Gradle now recommends a separate build.gradle for each subproject, which would work well here. We could have a build.gradle specific to each spark/v$major.$minor build directory and select whole projects using settings.gradle. That avoids big changes to build.gradle and makes it easy to evolve each Spark build independently.
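
A hypothetical sketch of that settings.gradle selection, with made-up project names and directory paths (see #3256 for the actual approach):

// settings.gradle: include a whole project per Spark version,
// each with its own build.gradle under spark/v$major.$minor
include ':iceberg-spark-3.2_2.12'
project(':iceberg-spark-3.2_2.12').projectDir = file('spark/v3.2/spark')

include ':iceberg-spark-runtime-3.2_2.12'
project(':iceberg-spark-runtime-3.2_2.12').projectDir = file('spark/v3.2/spark-runtime')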

I went ahead and tried out updating this build like this and the result is #3256.

@aokolnychyi
Contributor

Let me take a look at both PRs.
