
fix: pin arrow version to 15.0 #203

Closed
hamersaw wants to merge 6 commits into lance-format:main from hamersaw:bug/arrow-version

Conversation

@hamersaw
Collaborator

@hamersaw hamersaw commented Feb 4, 2026

The recent upgrade of the arrow version (15.0 -> 18.3) in lance-core does not play well with Spark 3.5, resulting in failures within the Lance Spark connector for simple operations (ex. CREATE TABLE). In this PR we pin the arrow version to 15.0 (the version used for ~2 years) for all 3.5 and 3.4 profiles. The alternative approaches are to (1) downgrade the lance-core arrow dependency or (2) maintain separate lance-core branches. Pinning seems like the most reasonable approach.

Closes: #196
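A sketch of what the pinning could look like in the connector's POM. The profile IDs and the `arrow.version` property are illustrative assumptions, not the repo's actual names:

```xml
<!-- illustrative sketch: profile IDs and property name are assumptions -->
<properties>
  <!-- default (Spark 4.0) keeps the newer Arrow -->
  <arrow.version>18.3.0</arrow.version>
</properties>

<profiles>
  <profile>
    <id>spark-3.5</id>
    <properties>
      <!-- pin back to the version Spark 3.5 tolerates -->
      <arrow.version>15.0.0</arrow.version>
    </properties>
  </profile>
  <profile>
    <id>spark-3.4</id>
    <properties>
      <arrow.version>15.0.0</arrow.version>
    </properties>
  </profile>
</profiles>
```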

Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>
@github-actions github-actions Bot added the bug Something isn't working label Feb 4, 2026
@jackye1995
Contributor

I recall we discussed this in lance-format/lance#5565 (comment). Can we check whether this would still work with Java 21 and Spark 4.0? We are currently only testing Java 17: https://github.com/lance-format/lance-spark/blob/main/.github/workflows/spark.yml#L60

@hamersaw
Collaborator Author

hamersaw commented Feb 4, 2026

I recall we discussed this in lance-format/lance#5565 (comment). Can we check whether this would still work with Java 21 and Spark 4.0? We are currently only testing Java 17: https://github.com/lance-format/lance-spark/blob/main/.github/workflows/spark.yml#L60


I tested Arrow 15 on Spark 3.5 with Java 21 locally and everything worked. This PR doesn't change anything for our Lance Spark 4.0 story (it still uses arrow 18 on whatever Java is configured), but it downgrades the arrow version on 3.4 / 3.5. So if those work with newer Java versions, IIUC that covers our concerns, right?

@jackye1995
Contributor

Can we add Spark 4.0 + Java 21 in the CI matrix so we are certain it runs and passes?
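A hedged sketch of what extending the CI matrix could look like; the real job and key names in .github/workflows/spark.yml may differ:

```yaml
# illustrative sketch; actual job/key names in spark.yml may differ
jobs:
  test:
    strategy:
      matrix:
        include:
          - spark: "3.4"
            java: "17"
          - spark: "3.5"
            java: "17"
          - spark: "3.5"
            java: "21"
          - spark: "4.0"
            java: "21"
```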

@jackye1995
Contributor

Also, could you explain why the unit and docker tests work fine but it fails in the EMR environment for Spark 3.5? Ideally we should keep a higher version. If it's a problem with one particular platform, we should consider just providing a guide for using a lower arrow version on that platform.

@hamersaw
Collaborator Author

hamersaw commented Feb 5, 2026

Also, could you explain why the unit and docker tests work fine but it fails in the EMR environment for Spark 3.5? Ideally we should keep a higher version. If it's a problem with one particular platform, we should consider just providing a guide for using a lower arrow version on that platform.

Great question. I just opened a PR with integration tests on docker; AFAIK nothing in docker actually runs as part of CI (once I clean these up, we probably should add that). Running those tests on HEAD with Spark 3.5 fails with arrow version incompatibilities between lance-core and Spark, so I think this is not isolated to EMR. The error(s) look like:

java.lang.RuntimeException: LanceError(Arrow): C Data interface error: java.lang.NoSuchMethodError: 'java.util.List org.apache.arrow.vector.ipc.message.ArrowRecordBatch.getVariadicBufferCounts()'

The issue, as I understand it, is that Spark ships with a version of Arrow that is lower than the one in lance-core (ex. Spark 3.4 ships with Arrow ~14.x, Spark 3.5 with ~15.x/16.x, and Spark 4.0 with Arrow ~17.x). So when we include Arrow 18.3 in lance-core, the Java classloader shared between Spark and Lance Spark picks one of the two versions and causes problems.
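To make that failure mode concrete, here is a small standalone diagnostic (hypothetical, not part of the connector) showing why the symptom is a runtime NoSuchMethodError rather than a build-time conflict: reflection reveals which Arrow build the classloader actually resolved and whether it has the method the Lance JNI bridge calls.

```java
// Hypothetical diagnostic, not part of lance-spark: probes which Arrow
// build won the classpath race and whether it exposes the method that
// the error above complains about.
public class ArrowProbe {
    static String probe() {
        try {
            Class<?> batch = Class.forName(
                "org.apache.arrow.vector.ipc.message.ArrowRecordBatch");
            try {
                // getVariadicBufferCounts() only exists in newer Arrow;
                // if Spark's older bundled Arrow won, this lookup fails,
                // which surfaces as NoSuchMethodError inside the JNI bridge
                batch.getMethod("getVariadicBufferCounts");
                return "ok: " + batch.getProtectionDomain()
                    .getCodeSource().getLocation();
            } catch (NoSuchMethodException e) {
                return "older Arrow won the classpath race: "
                    + batch.getProtectionDomain().getCodeSource().getLocation();
            }
        } catch (ClassNotFoundException e) {
            return "Arrow not on classpath";
        }
    }

    public static void main(String[] args) {
        System.out.println(probe());
    }
}
```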

I attempted to circumvent this with Maven's shade plugin by relocating the Arrow dependency in lance-core to a different package (ex. org.lance.org.apache.arrow... rather than org.apache.arrow...), which would let lance-core use v18.3 while Spark uses whatever version it ships with. This doesn't work because shading does not cover native JNI code: the class names there are hardcoded bytes referencing org.apache.arrow... and cannot be rewritten.

So the ways I see to fix this:

  • Pin the Arrow version to match each Spark version to ensure compatibility. There may be issues as we iterate on lance-core, but at least they will be caught and known in CI.
  • Downgrade the lance-core Arrow dependency to v17 (the lowest version Spark 4.0 supports). This probably will not work because Spark 3.4 / 3.5 may not support Arrow 17.
  • Maintain lance-core Java packages built against different Arrow versions that we can pull. It does not feel like we are at that level yet.

@jackye1995
Contributor

Running these tests on HEAD with Spark 3.5 fails with arrow version incompatibilities between lance-core and Spark, so I think this is not isolated to EMR.

I see, thanks for the verification!

Spark ships with a version of Arrow that is lower than lance-core

If that is the case, should we just exclude arrow from the lance-core dependencies when it is imported in lance-spark? Something like

<exclusions>
  <exclusion>
    <groupId>org.apache.arrow</groupId>
    <artifactId>*</artifactId>
  </exclusion>
</exclusions>

Would that work, so we don't have to pin the arrow version and it just uses the one that comes with Spark?

@jackye1995
Contributor

Some reference for how Iceberg Spark does it:

How Iceberg Handles Arrow in Spark

  Iceberg declares its own Arrow dependency and excludes Spark's bundled Arrow, then shades it into the runtime JAR. Here's the breakdown:

  1. Own Arrow Version

  Iceberg pins Arrow 15.0.2 in gradle/libs.versions.toml:30, and brings it in via:

  - The iceberg-arrow module (all Spark versions depend on project(':iceberg-arrow'))
  - Direct implementation(libs.arrow.vector) in the Spark v3.4/v3.5/v4.0 build files

  2. Spark's Arrow is Excluded

  When depending on Spark itself, Iceberg explicitly excludes Spark's bundled Arrow. For example in spark/v3.5/build.gradle:

  compileOnly("org.apache.spark:spark-hive_${scalaVersion}:...") {
    exclude group: 'org.apache.arrow'
    // ...
  }

  The comment in the build file explains: "to make sure netty libs only come from project(':iceberg-arrow')" — they want full control over Arrow and its transitive dependencies (especially netty).

  3. Arrow is Shaded in the Runtime JAR

  In the shadow/runtime JAR config (spark/v3.5/build.gradle, etc.):

  relocate 'org.apache.arrow', 'org.apache.iceberg.shaded.org.apache.arrow'
  relocate 'io.netty', 'org.apache.iceberg.shaded.io.netty'
  relocate 'com.carrotsearch', 'org.apache.iceberg.shaded.com.carrotsearch'

  So at runtime, Iceberg's Arrow classes live under org.apache.iceberg.shaded.org.apache.arrow.*, completely isolated from whatever Arrow version Spark ships with.

  4. Netty is Also Carefully Managed

  Arrow depends on netty. Iceberg excludes netty from the arrow-vector dependency and instead provides its own controlled netty version through the iceberg-arrow module (build.gradle:941):

  implementation(libs.arrow.vector) {
    exclude group: 'io.netty', module: 'netty-buffer'
    exclude group: 'io.netty', module: 'netty-common'
  }
  // then in iceberg-arrow:
  runtimeOnly libs.netty.buffer

  Summary

  The strategy is: bring your own Arrow, exclude Spark's, shade everything. This gives Iceberg full control over the Arrow version and avoids classpath conflicts regardless of which Arrow version a given Spark release ships with.

@hamersaw
Collaborator Author

hamersaw commented Feb 5, 2026

Some reference for how Iceberg Spark does it:

... Java sorcery ...

@hamersaw
Collaborator Author

hamersaw commented Feb 6, 2026

Closing because these changes are included in #205

@hamersaw hamersaw closed this Feb 6, 2026
@hamersaw hamersaw deleted the bug/arrow-version branch February 6, 2026 22:26

Labels

bug Something isn't working


Development

Successfully merging this pull request may close these issues.

EMR Spark 3.5 Arrow version conflict
