Spark 4.0: Migrate Iceberg Stored Procedures to Spark built-in implementations #13106
```java
  }

  @TestTemplate
  @Disabled // Spark SQL does not support case insensitive for named arguments
```
Strictly speaking, it's a regression, but I think it's okay to respect the Spark SQL syntax.
oh, there is no spark config for this, right?
How do built-in functions work in Spark? Do procedures behave consistently with that?
If I remember correctly, Spark should share arg matching with functions.
@aokolnychyi @szehon-ho resolution of named args for both functions and procedures happens in `org.apache.spark.sql.catalyst.plans.logical.NamedParametersSupport`; it simply uses `parameterNamesSet.contains(parameterName)` to match the arg name.
A Spark code change is required if we want it to respect `spark.sql.caseSensitive`.
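The name-matching behavior described above can be sketched in a few lines. This is an illustrative stand-in, not Spark's actual code: the `caseSensitive == true` branch mirrors the `contains`-based matching Spark does today, and the other branch is the hypothetical change that would be needed to honor `spark.sql.caseSensitive`.

```java
import java.util.List;

public class Main {
  // Sketch: match a named argument against declared parameter names.
  // caseSensitive = true mirrors Spark's current contains()-based lookup.
  static boolean matches(List<String> parameterNames, String argName, boolean caseSensitive) {
    if (caseSensitive) {
      return parameterNames.contains(argName); // current Spark behavior
    }
    // hypothetical fix: fold case before comparing
    return parameterNames.stream().anyMatch(p -> p.equalsIgnoreCase(argName));
  }

  public static void main(String[] args) {
    List<String> params = List.of("table", "use_caching");
    System.out.println(matches(params, "tAbLe", true));  // rejected today
    System.out.println(matches(params, "tAbLe", false)); // accepted under the hypothetical fix
  }
}
```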
I think we should just follow what Spark does today.
If Spark decides to change this behavior both for functions and procedures in the future, that's fine. I don't think we have to worry about this in Iceberg. Also, this will be a brand-new Iceberg jar for Spark 4.0.
unfortunately spark 4.0 will be supported in Iceberg 1.10?
I vote to disable case insensitivity for now and merge this PR in 1.10 to avoid breaking things in 1.11.
```diff
  sql(
-     "CALL %s.system.rewrite_manifests(usE_cAcHiNg => false, tAbLe => '%s')",
+     "CALL %s.system.rewrite_manifests(use_caching => false, table => '%s')",
```
same as above, Spark does not support case insensitive for named args
```diff
  ImmutableList.of(),
  sql(
-     "CALL %s.system.remove_orphan_files(table => '%s', older_than => %dL, location => '%s')",
+     "CALL %s.system.remove_orphan_files(table => '%s', older_than => CAST(%dL AS TIMESTAMP), location => '%s')",
```
Implicit cast does not work here; LONG => TIMESTAMP requires an explicit cast.
@aokolnychyi this should work right?
oh nice, so disabling ANSI should make this case work?
unfortunately, no, it does not satisfy the implicit cast requirements
I see, too bad, that is another backward incompatibility.
Yeah, it's unfortunate, but it is a major version upgrade, so I think it's OK. I also don't think Iceberg should get too involved in trying to work around Spark behavior, especially for a major version upgrade.
I do think we should call these out in our procedure docs so that we make it easy for folks to figure out any gotchas when upgrading.
```java
  }

  @TestTemplate
  public void testInvalidAncestorOfCases() {
```
Though I kept and updated these tests, cases like missing/wrong/duplicated parameters are now properly handled by Spark Catalyst, so it's not necessary to verify them in Iceberg again.
Maybe we should remove those negative test cases.
I think it's fine to keep them; it's good to verify the behavior and have a reference.
CI is green now, cc @aokolnychyi @huaxingao @szehon-ho would you mind taking a look?

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that's incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.
```java
  @Override
  public String description() {
    return "RollbackToSnapshotProcedure";
```
We should probably improve the description, but that can be done in a separate PR.
I prefer to narrow the PR scope and do it later
```diff
-     .isInstanceOf(AnalysisException.class)
-     .hasMessage("Named and positional arguments cannot be mixed");
+     .isInstanceOf(RuntimeException.class)
+     .hasMessageStartingWith("Couldn't load table");
```
was this an error in the original test?
`Couldn't load table` doesn't seem like the correct error message here?
Actually, this becomes a legal case in Spark's named-args resolution.
Spark requires:
- positional arguments must not appear once named arguments start
- named arguments must not use duplicated names

That is to say:

```sql
-- illegal in Iceberg, legal in Spark
CALL catalog.system.fast_forward('test_table', branch => 'main', to => 'newBranch')
-- illegal in both Iceberg and Spark
CALL catalog.system.fast_forward(table => 'test_table', 'main', to => 'newBranch')
```

I had to change it to the latter to make the test meaningful.
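Spark's two legality rules for mixed argument lists can be sketched as a tiny validator. This is an illustrative reconstruction of the rules, not Spark's actual resolver; `null` stands in for a positional argument and a string for a named argument's name.

```java
import java.util.HashSet;
import java.util.Set;

public class Main {
  // Sketch: null marks a positional argument; a non-null string is a named argument's name.
  static boolean isLegalArgList(String[] argNames) {
    boolean sawNamed = false;
    Set<String> seen = new HashSet<>();
    for (String name : argNames) {
      if (name == null) {
        if (sawNamed) {
          return false; // positional argument after a named one: illegal
        }
      } else {
        sawNamed = true;
        if (!seen.add(name)) {
          return false; // duplicated argument name: illegal
        }
      }
    }
    return true;
  }

  public static void main(String[] args) {
    // CALL fast_forward('test_table', branch => 'main', to => 'newBranch')
    System.out.println(isLegalArgList(new String[] {null, "branch", "to"}));
    // CALL fast_forward(table => 'test_table', 'main', to => 'newBranch')
    System.out.println(isLegalArgList(new String[] {"table", null, "to"}));
  }
}
```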
```java
  }

  @TestTemplate
  @Disabled // Spark SQL does not support case insensitive for named arguments
```
unfortunately spark 4.0 will be supported in Iceberg 1.10?
```diff
  assertThatThrownBy(() -> sql("CALL %s.system.cherrypick_snapshot('t', 2.2)", catalogName))
-     .isInstanceOf(AnalysisException.class)
-     .hasMessageStartingWith("Wrong arg type for snapshot_id: cannot cast");
+     .isInstanceOf(RuntimeException.class)
```
can we make the test capture the original idea? (cast exception)
+1, this test doesn't make sense to me. Are we trying to test calling a procedure on a table that doesn't exist, or does the table exist and we want to fail on the cast of the invalid snapshot ID?
Addressed by changing `2.2` to `'2.2'`; now it throws `CAST_INVALID_INPUT`.
@amogh-jahagirdar @nastra @stevenzwu we will release 1.10 with Spark 4.0 support, right? This change won't make the release and will thus bring a bit of backward incompatibility, as CALL procedures will follow Spark behavior:
Any suggestions? Should we just document this for the next release? (or let me know if 1.10 will not have Spark 4.0 support)
@szehon-ho thanks for reviewing, all comments are addressed now.
szehon-ho left a comment
The rest of the suggestions can be handled in a separate PR (test only).
```java
  public StructType outputType() {
    return OUTPUT_TYPE;
  }

  public boolean isDeterministic() {
    return false;
```
This one should be deterministic though? Or is this just to avoid some other Spark behavior?
This hasn't been used by Spark yet.
I conservatively assume that all procedures are non-deterministic to avoid unexpected behavior. For this procedure, it can return a different result after the snapshot expires.
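The conservative choice discussed here can be sketched with a minimal stand-in for the procedure interface (names are illustrative, not Spark's actual API): report every procedure as non-deterministic so the engine never assumes a `CALL` returns the same rows on re-execution.

```java
// Sketch only: a simplified stand-in for Spark's bound-procedure interface.
interface SketchProcedure {
  // A hypothetical flag an engine could use to decide whether a call's
  // result may be cached or re-executed freely.
  default boolean isDeterministic() {
    return false; // conservative default: table state (e.g. snapshots) can change between calls
  }
}

// Example: expiring snapshots changes what later calls can see,
// so inheriting the conservative default is the safe choice.
class ExpireSnapshotsSketch implements SketchProcedure {}

public class Main {
  public static void main(String[] args) {
    SketchProcedure p = new ExpireSnapshotsSketch();
    System.out.println(p.isDeterministic());
  }
}
```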
```diff
  import scala.Option;

- abstract class BaseProcedure implements Procedure {
+ abstract class BaseProcedure implements BoundProcedure, UnboundProcedure {
```
I'm a little confused by this implementing both Bound and Unbound, and the `bind` calls returning nothing in the implementations. Could you explain what the intent is here?
IIUC the idea in Spark was that you can have different implementations of the procedure if you call it with different arguments (that's when you bind it), but for all practical purposes I think it's not so used in Iceberg.
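The two-phase idea described here can be sketched with illustrative interfaces (not Spark's real API): an unbound procedure is resolved by name, then `bind` selects a concrete implementation for the given input; in this PR the classes implement both sides, so `bind` effectively returns the procedure itself.

```java
// Sketch only: simplified stand-ins for the bound/unbound split.
interface Bound {
  String call(String input);
}

interface Unbound {
  // bind() could pick a different implementation per input shape
  Bound bind(String inputType);
}

// Mirrors the pattern in the PR: one class plays both roles and
// binding is a no-op because there is a single implementation.
class SelfBindingProcedure implements Bound, Unbound {
  @Override
  public Bound bind(String inputType) {
    return this; // same implementation regardless of argument types
  }

  @Override
  public String call(String input) {
    return "ran with " + input;
  }
}

public class Main {
  public static void main(String[] args) {
    Unbound proc = new SelfBindingProcedure();
    System.out.println(proc.bind("(table STRING)").call("tbl"));
  }
}
```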
RussellSpitzer left a comment
I don't want to hold this back. I'm +1 on all the changes, but I have a few things I think we should clean up before merging. I'm leaving my approval here so that others can feel free to merge once they are on board.

Things we need to clean up:
- There are a bunch of orphan tests now that just catch a generic error being thrown; for most of these I think we need to just remove the test case altogether if we aren't able to actually test it. See "Couldn't load ..."
- I'm a little confused about the Bound/Unbound thing. If this is the expected way to implement procedures then I don't have a problem, it just feels like we should only have to do one of them?
- Deterministic is set to false for a bunch of procedures which I think are deterministic? Is this to avoid some Spark behavior?

Anyway, those are all minor concerns; this seems like a pretty straightforward translation. Thanks @pan3793 for the work!
Ditto to basically everything @RussellSpitzer said. I think I'm fundamentally OK with this PR but it'd be good to make sure those tests are cleaned up and double check the deterministic API on those procedures before merging? It'd also be good to add a section in docs about some of these behavior changes around case sensitivity and type handling (but that can be in a separate PR)
@RussellSpitzer @amogh-jahagirdar thanks for the review.
```java
  @Override
  public StructType outputType() {
    return OUTPUT_TYPE;
  }

  public boolean isDeterministic() {
```
Maybe this method can go into `BaseProcedure`, since the return value is the same across all procedures.
I guess Spark just needs all Expressions to mark whether they are deterministic, for things like knowing whether you can use them in various places like filter, aggregate, merge condition, default value, etc. Not sure they all apply here :) but it makes sense for now that procedure output for Iceberg is not deterministic, as it doesn't make sense to use the output like that. As there are a lot of approvals here, I can wait a bit today for any more comments and merge if not, to unblock the 1.10 release.
Merged, thanks @pan3793 for the great work, and everyone for jumping on the reviews!
…entations (apache#13106) (apache#1611) Co-authored-by: Cheng Pan <[email protected]>
### What changes were proposed in this pull request?

As the title.

### Why are the changes needed?

The issue was originally found during

- apache/iceberg#13106

I don't see any special reason that named parameters should always be case sensitive. (correct me if I'm wrong)

I tested PostgreSQL, and the named parameters are case-insensitive by default.

```
psql (17.6 (Debian 17.6-1.pgdg13+1))
Type "help" for help.

postgres=# CREATE FUNCTION concat_lower_or_upper(a text, b text, uppercase boolean DEFAULT false)
RETURNS text
AS
$$ SELECT CASE WHEN $3 THEN UPPER($1 || ' ' || $2) ELSE LOWER($1 || ' ' || $2) END; $$
LANGUAGE SQL IMMUTABLE STRICT;
CREATE FUNCTION
postgres=# SELECT concat_lower_or_upper('Hello', 'World', true);
 concat_lower_or_upper
-----------------------
 HELLO WORLD
(1 row)

postgres=# SELECT concat_lower_or_upper(a => 'Hello', b => 'World');
 concat_lower_or_upper
-----------------------
 hello world
(1 row)

postgres=# SELECT concat_lower_or_upper(A => 'Hello', b => 'World');
 concat_lower_or_upper
-----------------------
 hello world
(1 row)
```

### Does this PR introduce _any_ user-facing change?

Yes, named parameters used by functions and procedures now respect `spark.sql.caseSensitive`, instead of always being case sensitive.

### How was this patch tested?

Added UT.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #52269 from pan3793/SPARK-53523.

Authored-by: Cheng Pan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
Migrate Iceberg Stored Procedures to Spark built-in implementations, since SPARK-44167 (4.0.0) adds Stored Procedure support to Spark.
Note: this change brings a bit of backward incompatibility, as CALL procedures will follow Spark behavior: