[Feat][Spark] Run tests against multiple spark version && relax spark version requirements #320
Comments
Good suggestion. I think we can make this a tracking issue that keeps track of a list of tasks to support this feature.
Ok. It is not as easy as I thought. The problem is mostly in the constructor of ParquetPartitionReaderFactory, which changed between versions. Option one would be something like this:

// Look up the companion object of ParquetPartitionReaderFactory and its apply method
val parquetPartitionReaderFactoryCompanionClass = Class
  .forName(
    "org.apache.spark.sql.execution.datasources.v2.parquet.ParquetPartitionReaderFactory$"
  )
val parquetPartitionReaderFactoryCompanion =
  parquetPartitionReaderFactoryCompanionClass.getField("MODULE$").get(null)
val parquetPartitionReaderFactoryApply =
  parquetPartitionReaderFactoryCompanionClass.getMethods.find(_.getName == "apply").get

sparkSession.version match {
  case v if v.startsWith("3.2") =>
    parquetPartitionReaderFactoryApply
      .invoke(
        parquetPartitionReaderFactoryCompanion,
        sqlConf,
        broadcastedConf,
        dataSchema,
        readDataSchema,
        readPartitionSchema,
        pushedFilters,
        new ParquetOptions(
          options.asCaseSensitiveMap.asScala.toMap,
          sqlConf
        )
      )
      .asInstanceOf[ParquetPartitionReaderFactory]
  case v if v.startsWith("3.3") =>
    parquetPartitionReaderFactoryApply
      .invoke(
        parquetPartitionReaderFactoryCompanion,
        sqlConf,
        broadcastedConf,
        dataSchema,
        readDataSchema,
        readPartitionSchema,
        pushedFilters,
        None, // the newly added argument, which is an Option
        new ParquetOptions(
          options.asCaseSensitiveMap.asScala.toMap,
          sqlConf
        )
      )
      .asInstanceOf[ParquetPartitionReaderFactory]
}

(looks terrible) Option two would be something like having a separate version-specific implementation instead of one shared code path. @acezen What do you think about it? To be honest, I have not faced such a case in my experience and I do not know which would be the better option.
UPD: I had a talk with a couple of experienced Scala developers about it, and it looks like reflection is a valid solution. Even Spark itself uses reflection for working with different versions of Hive. So, I suggest going with reflection. I can do it at least for the scope of versions from 3.1.x to 3.4.x.
Reflection looks good to me. As you said, Spark uses reflection in the same way; we can follow that strategy.
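Since both sides agree on reflection, a minimal, self-contained sketch of that direction could be a small helper like the one below. The object name, its signature, and its placement are assumptions for illustration, not existing GraphAr code; it only shows how a companion object's apply can be invoked without compiling against a version-specific signature.

// Hypothetical helper; name and signature are assumptions, not GraphAr's actual code.
object ReflectionUtils {

  /** Invoke the `apply` method of the companion object of `className` with `args`. */
  def invokeCompanionApply(className: String, args: Seq[AnyRef]): AnyRef = {
    // A Scala companion object compiles to a class named `<ClassName>$` whose
    // singleton instance is exposed through a static MODULE$ field.
    val companionClass = Class.forName(className + "$")
    val companion = companionClass.getField("MODULE$").get(null)
    // Look `apply` up by name and arity rather than by exact parameter types,
    // so the same code works against both the 3.2 and the 3.3 signatures.
    val applyMethod = companionClass.getMethods
      .find(m => m.getName == "apply" && m.getParameterCount == args.length)
      .getOrElse(
        throw new NoSuchMethodException(s"$className.apply with ${args.length} arguments")
      )
    applyMethod.invoke(companion, args: _*)
  }
}

The caller would then assemble the 3.2 or the 3.3 argument list based on sparkSession.version and pass it in, which keeps the version match in a single place.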
Hi @SemyonSinchenko, I'm impressed with your proposals; they seem quite solid. I did notice, though, that certain features, such as ZSTD compression, are only supported since Spark v3.2. To maintain compatibility with Spark v3.1.x, could we look into making these particular features optional?
I would suggest just using a different compression codec for older versions of Spark. Or we can start with something like 3.2.x - 3.4.x and extend the support only if users report issues.
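A small sketch of that fallback, assuming the codec is picked from the running Spark version (the function name and the option key are illustrative assumptions, not existing code):

// Hypothetical helper: pick a Parquet compression codec supported by the running Spark.
def parquetCompressionFor(sparkVersion: String): String = sparkVersion match {
  // ZSTD for Parquet is only available from Spark 3.2 onwards (as noted above),
  // so fall back to Snappy for 3.0.x / 3.1.x.
  case v if v.startsWith("3.0") || v.startsWith("3.1") => "snappy"
  case _ => "zstd"
}

// Usage (sketch): df.write.option("compression", parquetCompressionFor(spark.version)).parquet(path)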
Bad news: I tried hard but failed to fix everything. There are a lot of changes from 3.2 to 3.3 in the filter push-down part. What do you think about splitting the datasource code by Spark version instead? For example, there are cases where the signature of the parent class was changed, and I got compilation errors because of it.
Yes, the datasource part may change a lot across different versions of Spark; feel free to change the solution and choose the most appropriate one for the moment.
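One possible shape for such a split is the shim pattern used by several Spark connectors: a small version-agnostic interface plus one implementation compiled per supported Spark line. The names below (DataSourceShim, Shim32, Shim33, the package) are hypothetical placeholders and only illustrate the idea; they are not existing GraphAr classes.

package org.example.graphar // placeholder package, not GraphAr's real one

// Version-dependent pieces (e.g. the filter push-down wiring that changed
// between 3.2 and 3.3) would live behind this interface.
trait DataSourceShim {
  def sparkVersionPrefix: String
}

// In practice each implementation would sit in its own source directory and be
// compiled only against the matching Spark dependency, e.g.:
//   datasources-32/.../Shim32.scala  (Spark 3.2.x)
//   datasources-33/.../Shim33.scala  (Spark 3.3.x)
class Shim32 extends DataSourceShim { override def sparkVersionPrefix: String = "3.2" }
class Shim33 extends DataSourceShim { override def sparkVersionPrefix: String = "3.3" }

object DataSourceShim {
  /** Load the shim matching the running Spark version reflectively, so only the
    * matching implementation has to be on the classpath. */
  def forVersion(sparkVersion: String): DataSourceShim = {
    val shimClassName =
      if (sparkVersion.startsWith("3.2")) "org.example.graphar.Shim32"
      else if (sparkVersion.startsWith("3.3")) "org.example.graphar.Shim33"
      else throw new UnsupportedOperationException(s"Spark $sparkVersion is not supported")
    Class.forName(shimClassName)
      .getDeclaredConstructor()
      .newInstance()
      .asInstanceOf[DataSourceShim]
  }
}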
Hi @SemyonSinchenko, if you have some work on the issue, you can just create a work-in-progress PR. I think splitting the issue into some small commits will help with back-tracing.
Is your feature request related to a problem? Please describe.
Currently only Spark 3.2 + Hadoop 3.2 + Scala 2.12 is supported. But at first glance the spark/pyspark code looks like it should work with all versions of Spark starting from 3.0, and with both Hadoop 2 and Hadoop 3. I also do not see any blockers for supporting Scala 2.13.
Describe the solution you'd like
Version-specific artifacts like graphar-spark-1.0-spark-3.1, graphar-spark-1.0-spark-3.2, etc. For me a nice example is aws deequ.
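A dependency declaration under that naming scheme could look roughly like this (sbt syntax; the group id and version number are placeholders, not published coordinates):

// Hypothetical coordinates following the per-Spark-version naming proposed above.
libraryDependencies += "org.example" % "graphar-spark-1.0-spark-3.2" % "1.0.0"
// or, on a Spark 3.1 cluster:
libraryDependencies += "org.example" % "graphar-spark-1.0-spark-3.1" % "1.0.0"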
Describe alternatives you've considered
I do not know of an alternative solution.
Additional context
In my experience, Spark packages are mostly used on pre-configured clusters with a fixed environment (like Databricks Runtimes). That's why it is important to relax the Spark version requirements as much as possible; otherwise, the package will be impossible to install simply due to dependency conflicts.