Spark 4.0: Refactor Spark procedures to consistently use ProcedureInput for parameter handling. #13913
Conversation
Force-pushed from 02799a1 to bcf5ba1 ("…t for parameter handling.")
dramaticlly left a comment:
Thanks for the refactoring; some nitpicks:
Resolved review comments (now outdated) on:
- ...4.0/spark/src/main/java/org/apache/iceberg/spark/procedures/CherrypickSnapshotProcedure.java
- ...k/v4.0/spark/src/main/java/org/apache/iceberg/spark/procedures/ExpireSnapshotsProcedure.java
- ...v4.0/spark/src/main/java/org/apache/iceberg/spark/procedures/FastForwardBranchProcedure.java
- ...k/v4.0/spark/src/main/java/org/apache/iceberg/spark/procedures/ExpireSnapshotsProcedure.java
- spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/procedures/ProcedureInput.java
```java
Integer specId = args.isNullAt(2) ? null : args.getInt(2);
ProcedureInput input = new ProcedureInput(spark(), tableCatalog(), PARAMETERS, args);
Identifier tableIdent = toIdentifier(input.asString(TABLE_PARAM), TABLE_PARAM.name());
Boolean useCaching = input.asBoolean(USE_CACHING_PARAM, false);
```
Let's use a primitive boolean here as well.
Resolved review comments (now outdated) on:
- ...4.0/spark/src/main/java/org/apache/iceberg/spark/procedures/RollbackToSnapshotProcedure.java
- ....0/spark/src/main/java/org/apache/iceberg/spark/procedures/RollbackToTimestampProcedure.java
- ...4.0/spark/src/main/java/org/apache/iceberg/spark/procedures/SetCurrentSnapshotProcedure.java
```java
public Long[] asLongArray(ProcedureParameter param) {
  Long[] value = asLongArray(param, null);
  Preconditions.checkArgument(value != null, "Parameter '%s' is not set", param.name());
  return value;
}
```
This doesn't look needed for this change; let's add it later when the need arises.
@dramaticlly Thank you very much for reviewing the code and for your valuable suggestions! I have adopted most of them. Regarding the suggestion to change the field types: this PR mainly focuses on improving input parameter parsing, and changing the field types might require adjustments to the execution logic, so to stay consistent with the original author's setup I would prefer not to make that change. Is that acceptable?
Thanks @slfan1989. Generally, splitting changes into multiple pull requests is acceptable. However, for Spark we want to keep the supported Spark versions in sync, so we will likely need backport PRs to make the same change for Spark 3.4 and 3.5, and many PRs with backports can fan out quickly. The suggested change from Boolean to boolean should be relatively straightforward, and our existing unit tests should catch any problem if coverage is right; otherwise there's a bigger problem in our procedure tests. Please let me know if this changes your mind.
@dramaticlly Thank you for your explanation! That makes sense; I will improve the code based on your suggestions.
@nastra Could you please help review this PR? Thank you very much! It unifies input parameter parsing using ProcedureInput to enhance consistency and maintainability. cc: @dramaticlly
@nastra Could you kindly review this PR? It standardizes input parameter parsing across all stored procedures, so parameter parsing will be consistent going forward. Thank you very much!
```java
  return value;
}

public Long asTimestampLong(ProcedureParameter param, Long defaultValue) {
```
This should be called asTimestampMillis.
Thank you for reviewing the code. I will make the adjustments based on your suggestions.
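As background on the rename, a minimal sketch of what a timestamp parameter helper typically does: it converts the user-supplied timestamp into epoch milliseconds, which is why asTimestampMillis names the behavior more precisely than asTimestampLong. The class and method here are illustrative, not Iceberg's actual implementation.

```java
import java.time.Instant;

public class TimestampParamDemo {

  // Hypothetical helper: parse an ISO-8601 timestamp argument into the
  // epoch-millisecond long that snapshot-expiration logic consumes.
  static long asTimestampMillis(String raw) {
    return Instant.parse(raw).toEpochMilli();
  }

  public static void main(String[] args) {
    // One second past the epoch is 1000 milliseconds.
    System.out.println(asTimestampMillis("1970-01-01T00:00:01Z")); // 1000
  }
}
```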
```java
  return args.isNullAt(ordinal) ? defaultValue : (Integer) args.getInt(ordinal);
}

public long asTimestampLong(ProcedureParameter param) {
```
Suggested change:
```diff
- public long asTimestampLong(ProcedureParameter param) {
+ public long asTimestampMillis(ProcedureParameter param) {
```
```java
Integer retainLastNum = input.asInt(RETAIN_LAST_PARAM, null);
Integer maxConcurrentDeletes = input.asInt(MAX_CONCURRENT_DELETES_PARAM, null);
boolean streamResult = input.asBoolean(STREAM_RESULTS_PARAM, false);
Long[] snapshotIds = input.asLongArray(SNAPSHOT_IDS_PARAM, null);
```
Why not make these return long[] instead of Long[]?
Thank you for your suggestion. I have adjusted the return value to long[] to maintain consistency with the original usage.
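The conversion under discussion can be sketched as a simple unboxing step: a boxed Long[] parameter is turned into the primitive long[] that downstream APIs expect. This is an illustrative stand-alone snippet, not the actual ProcedureInput code.

```java
import java.util.Arrays;

public class UnboxDemo {

  // Hypothetical helper: unbox Long[] to long[]. A null element would throw
  // a NullPointerException here, which is acceptable when the parameter is
  // validated to be fully populated before conversion.
  static long[] toPrimitive(Long[] boxed) {
    return Arrays.stream(boxed).mapToLong(Long::longValue).toArray();
  }

  public static void main(String[] args) {
    long[] ids = toPrimitive(new Long[] {1L, 2L, 3L});
    System.out.println(Arrays.toString(ids)); // [1, 2, 3]
  }
}
```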
```java
Integer maxConcurrentDeletes = input.asInt(MAX_CONCURRENT_DELETES_PARAM, null);
boolean streamResult = input.asBoolean(STREAM_RESULTS_PARAM, false);
Long[] snapshotIds = input.asLongArray(SNAPSHOT_IDS_PARAM, null);
boolean cleanExpiredMetadata = input.asBoolean(CLEAN_EXPIRED_METADATA_PARAM, false);
```
Generally speaking, we need to make sure we don't change return types, as this changes semantics. Previously the code expected a Boolean and now it gets a boolean. If the caller didn't define the parameter, we would rely on the API's default value instead of passing true/false explicitly. That's also why the null checks further below existed.
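The semantic difference being discussed can be sketched in isolation: with a boxed Boolean, null means "the caller didn't set the flag", so the procedure leaves the underlying API's own default untouched, whereas a primitive boolean with a false default always overrides it. The Action class below is a hypothetical stand-in, not Iceberg's actual API.

```java
public class BoxedFlagDemo {

  // Stand-in for a core action whose internal default must survive when the
  // user provides no input.
  static class Action {
    boolean streamResults = true; // the core API's own default

    Action streamResults(boolean value) {
      this.streamResults = value;
      return this;
    }
  }

  // Boxed Boolean: only forward an explicit user choice to the action.
  static Action configure(Boolean streamResults) {
    Action action = new Action();
    if (streamResults != null) {
      action.streamResults(streamResults);
    }
    return action;
  }

  public static void main(String[] args) {
    // Parameter unset: the core default (true) survives.
    System.out.println(configure(null).streamResults); // true
    // Explicit false: the caller's choice is forwarded.
    System.out.println(configure(false).streamResults); // false
  }
}
```

With a primitive boolean defaulting to false, the first case would silently flip to false, which is exactly the behavior change the review is flagging.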
Good catch on the boolean handling. I actually suggested @slfan1989 convert these to primitive booleans because all the procedures here use boolean parameters as explicit enablement flags, where null defaults to disabled. This gives us a chance to clean up the if (someBoolean != null) pattern below when setting the corresponding flags. Since we typically backport changes from the latest Spark version to older ones, making this change now should save us some backport work later, even though it means a bit more review effort upfront. Our unit tests should catch any behavioral changes from the semantic shift.
I would be in favor of using boolean instead of Boolean, but we do have cases where we don't want to pass an argument to the underlying core API when the user didn't provide any input, and thus rely on the core API's default behavior.
Thank you for the information! From my perspective, changing it to boolean is feasible. The unit tests have all passed, and the results are as expected. Shall we go ahead and accept this improvement?
We need to switch back to using Boolean where it was Boolean originally.
> but we do actually have cases where we want to not pass an argument to the underlying core API if the user didn't provide any input and thus rely on whatever the default behavior of the core API is

From my understanding, @nastra wants us to keep the original boxed Boolean to convey user intention: an unset value of null means the flag is not applied to the underlying Spark action, whereas a primitive boolean defaulting to false would always pass the flag to the action. Sorry I missed this initially; let's switch back to Boolean as before this change and keep the changes minimal to adopt ProcedureInput. It may also be worth adding a comment to convey this.
@nastra @dramaticlly Thank you for helping review the code! I have already reverted this part of the code back to Boolean.
```java
      defaultValue);
}

public DeleteOrphanFiles.PrefixMismatchMode asPrefixMismatchMode(ProcedureParameter param) {
```
I don't think we should put this method here, as it's only used in a single place.
This part of the code has already been improved.
```diff
- boolean prefixListing = args.isNullAt(9) ? false : args.getBoolean(9);
+ boolean prefixListing = input.asBoolean(PREFIX_LISTING_PARAM, false);

  Long finalOlderThanMillis = olderThanMillis;
```
Why is this one needed? It doesn't seem that olderThanMillis is being modified here.
I also suggested this earlier in #13913 (comment)
```java
Integer specId = args.isNullAt(2) ? null : args.getInt(2);
ProcedureInput input = new ProcedureInput(spark(), tableCatalog(), PARAMETERS, args);
Identifier tableIdent = input.ident(TABLE_PARAM);
boolean useCaching = input.asBoolean(USE_CACHING_PARAM, false);
```
Semantics are being changed here.
nastra left a comment:
The semantics of parameter types are being changed in a few places, and we need to restore the original behavior.
Resolved review comments on:
- spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/procedures/ProcedureInput.java
- .../v4.0/spark/src/main/java/org/apache/iceberg/spark/procedures/RewriteManifestsProcedure.java (outdated)
Commit "…ures/RewriteManifestsProcedure.java — Spark 4.0: Using the default value null." Co-authored-by: Eduard Tudenhoefner <[email protected]>
@nastra Thanks for reviewing and merging the code! @dramaticlly Thanks for taking the time to review it!
Merged: "…ut for parameter handling." (apache#13913)
Description

This pull request refactors the existing Spark procedures to consistently use ProcedureInput for parameter handling. Previously, many stored procedures manually handled parameter extraction from InternalRow, leading to inconsistent code across the project. By adopting ProcedureInput, we improve code uniformity and ensure that all stored procedures use a unified approach to parameter handling.

Changes:
- Adopted ProcedureInput in the relevant stored procedure code.
- Replaced manual parameter extraction with ProcedureInput methods like asString, asLong, asInt, etc.

Benefits:
- Consistent, maintainable parameter handling through ProcedureInput's built-in functions.
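The refactoring pattern described above can be sketched in miniature. This is an illustrative stand-in, not Iceberg's actual ProcedureInput (which resolves parameters against an InternalRow and a ProcedureParameter list): instead of every procedure indexing into the raw arguments by ordinal, a small wrapper resolves parameters by name, applies defaults, and validates required ones in one place.

```java
import java.util.HashMap;
import java.util.Map;

public class ProcedureInputSketch {
  // Named arguments replace ordinal-based access like args.getInt(2).
  private final Map<String, Object> argsByName = new HashMap<>();

  ProcedureInputSketch set(String name, Object value) {
    argsByName.put(name, value);
    return this;
  }

  // Required parameter: fail fast with a clear message when unset.
  String asString(String name) {
    Object value = argsByName.get(name);
    if (value == null) {
      throw new IllegalArgumentException("Parameter '" + name + "' is not set");
    }
    return (String) value;
  }

  // Optional parameter: fall back to the caller-supplied default.
  boolean asBoolean(String name, boolean defaultValue) {
    Object value = argsByName.get(name);
    return value == null ? defaultValue : (Boolean) value;
  }

  public static void main(String[] args) {
    ProcedureInputSketch input =
        new ProcedureInputSketch().set("table", "db.events");
    System.out.println(input.asString("table")); // db.events
    System.out.println(input.asBoolean("use_caching", false)); // false
  }
}
```

Centralizing the extraction this way is what makes the procedures uniform: each one declares what it needs by name and default, and the null-handling lives in a single class instead of being repeated per procedure.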