
Conversation

@slfan1989 (Contributor)

Description

This pull request refactors the existing Spark procedures to consistently use ProcedureInput for parameter handling. Previously, many stored procedures extracted parameters from InternalRow by hand, leading to inconsistent code across the project. Adopting ProcedureInput improves code uniformity and ensures that all stored procedures use a unified approach to parameter handling.

Changes:

  • Replaced manual parameter extraction with ProcedureInput in relevant stored procedure code.
  • Simplified parameter validation and conversion by using ProcedureInput methods like asString, asLong, asInt, etc.
  • Enhanced readability and maintainability by consolidating parameter handling logic into a centralized class.

Benefits:

  • Consistency: All stored procedures now handle parameters in a consistent and standardized way.
  • Readability: Code is more readable and easier to maintain by leveraging ProcedureInput's built-in functions.
  • Reduced Duplication: Avoids repeated parameter extraction and conversion logic throughout the project.
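For illustration, a minimal before/after sketch of the refactor (parameter names and ordinals are illustrative; the ProcedureInput calls mirror snippets quoted in the review below):

  // Before: each procedure extracted and converted arguments from InternalRow by hand.
  Identifier tableIdent = toIdentifier(args.getString(0), PARAMETERS[0].name());
  Long snapshotId = args.isNullAt(1) ? null : args.getLong(1);
  boolean useCaching = args.isNullAt(2) ? false : args.getBoolean(2);

  // After: a single ProcedureInput centralizes validation and conversion.
  ProcedureInput input = new ProcedureInput(spark(), tableCatalog(), PARAMETERS, args);
  Identifier tableIdent = input.ident(TABLE_PARAM);
  Long snapshotId = input.asLong(SNAPSHOT_ID_PARAM, null);
  boolean useCaching = input.asBoolean(USE_CACHING_PARAM, false);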

@github-actions github-actions bot added the spark label Aug 24, 2025
@github-actions github-actions bot added API and removed API labels Aug 31, 2025
@slfan1989 force-pushed the use-procedure-input-for-procedures branch from 02799a1 to bcf5ba1 on September 1, 2025 01:38
@slfan1989 changed the title from "Spark4.0: Refactor Spark procedures to consistently use ProcedureInput for parameter handling." to "Spark 4.0: Refactor Spark procedures to consistently use ProcedureInput for parameter handling." on Sep 1, 2025
@dramaticlly (Contributor) left a comment

Thanks for the refactoring, some nitpicks

Integer specId = args.isNullAt(2) ? null : args.getInt(2);
ProcedureInput input = new ProcedureInput(spark(), tableCatalog(), PARAMETERS, args);
Identifier tableIdent = toIdentifier(input.asString(TABLE_PARAM), TABLE_PARAM.name());
Boolean useCaching = input.asBoolean(USE_CACHING_PARAM, false);
Contributor

let's use primitive boolean as well
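For clarity, a sketch of what the suggested change would produce (matching the form used elsewhere in this PR):

  boolean useCaching = input.asBoolean(USE_CACHING_PARAM, false);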

Comment on lines 113 to 117
public Long[] asLongArray(ProcedureParameter param) {
Long[] value = asLongArray(param, null);
Preconditions.checkArgument(value != null, "Parameter '%s' is not set", param.name());
return value;
}
Contributor

looks like this is not needed for this change; let's add it later when the need comes

@slfan1989 (Contributor, Author)

@dramaticlly Thank you very much for reviewing the code and for your valuable suggestions! I have adopted most of your suggestions. Regarding the suggestion to modify the field types, my consideration is that this PR mainly focuses on improving input parameter parsing. Changing the field types might require adjustments to the execution logic, so in order to maintain consistency with the original author's setup, I would prefer not to make this change. Do you think this is acceptable?

@dramaticlly (Contributor) commented Sep 3, 2025

> @dramaticlly Thank you very much for reviewing the code and for your valuable suggestions! I have adopted most of your suggestions. Regarding the suggestion to modify the field types, my consideration is that this PR mainly focuses on improving input parameter parsing. Changing the field types might require adjustments to the execution logic, so in order to maintain consistency with the original author's setup, I would prefer not to make this change. Do you think this is acceptable?

Thanks @slfan1989, generally I think separating changes into multiple pulls is acceptable. However, for Spark we generally want to keep the supported Spark versions in sync, so we will likely need a backport PR to apply the same change to Spark 3.4 and 3.5. Many PRs, each with a backport, can fan out quickly.

This suggested change from Boolean to boolean should be relatively straightforward, and our existing unit tests should catch any problem given the right coverage; otherwise there's a bigger problem in our procedure tests. Please let me know if this changes your mind.

@slfan1989 (Contributor, Author)

> Thanks @slfan1989, generally I think separating changes into multiple pulls is acceptable. However, for Spark we generally want to keep the supported Spark versions in sync, so we will likely need a backport PR to apply the same change to Spark 3.4 and 3.5. Many PRs, each with a backport, can fan out quickly.
>
> This suggested change from Boolean to boolean should be relatively straightforward, and our existing unit tests should catch any problem given the right coverage; otherwise there's a bigger problem in our procedure tests. Please let me know if this changes your mind.

@dramaticlly Thank you for your explanation! I believe it makes sense. I will try to improve the code based on your suggestions.

@slfan1989 (Contributor, Author)

@nastra Could you please help review this PR? Thank you very much! This PR unifies input parameter parsing using ProcedureInput to enhance consistency and maintainability.

cc: @dramaticlly

@slfan1989 (Contributor, Author)

> @nastra Could you please help review this PR? Thank you very much! This PR unifies input parameter parsing using ProcedureInput to enhance consistency and maintainability.
>
> cc: @dramaticlly

@nastra Could you kindly review this PR? It standardizes input parameter parsing for all stored procedures, so going forward parameter parsing will be consistent across all of them. Thank you very much!

return value;
}

public Long asTimestampLong(ProcedureParameter param, Long defaultValue) {
Contributor

this should be called asTimestampMillis

Contributor Author

Thank you for reviewing the code. I will make the adjustments based on your suggestions.

return args.isNullAt(ordinal) ? defaultValue : (Integer) args.getInt(ordinal);
}

public long asTimestampLong(ProcedureParameter param) {
Contributor

Suggested change
- public long asTimestampLong(ProcedureParameter param) {
+ public long asTimestampMillis(ProcedureParameter param) {

Integer retainLastNum = input.asInt(RETAIN_LAST_PARAM, null);
Integer maxConcurrentDeletes = input.asInt(MAX_CONCURRENT_DELETES_PARAM, null);
boolean streamResult = input.asBoolean(STREAM_RESULTS_PARAM, false);
Long[] snapshotIds = input.asLongArray(SNAPSHOT_IDS_PARAM, null);
Contributor

why not make those return long[] instead of Long[]?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your suggestion. I have adjusted the return value to long[] to maintain consistency with the original usage.
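For illustration, a self-contained sketch of the unboxing such a change implies (hypothetical helper, not code from this PR):

  import java.util.Arrays;

  class UnboxExample {
    // Converts a boxed Long[] argument into the long[] the underlying API expects.
    // Assumes no null elements; a null snapshot ID would throw a NullPointerException.
    static long[] unbox(Long[] boxed) {
      return Arrays.stream(boxed).mapToLong(Long::longValue).toArray();
    }
  }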

Integer maxConcurrentDeletes = input.asInt(MAX_CONCURRENT_DELETES_PARAM, null);
boolean streamResult = input.asBoolean(STREAM_RESULTS_PARAM, false);
Long[] snapshotIds = input.asLongArray(SNAPSHOT_IDS_PARAM, null);
boolean cleanExpiredMetadata = input.asBoolean(CLEAN_EXPIRED_METADATA_PARAM, false);
@nastra (Contributor) commented Sep 19, 2025

Generally speaking, we need to make sure that we don't change return types, as this changes semantics: previously the code expected a Boolean and now it gets a boolean. If the parameter wasn't defined by the caller, we would rely on the API's default value (instead of always passing true/false). That's also why the null checks further below existed.
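A sketch of the semantic difference being described (names are illustrative, and it assumes an asBoolean overload that accepts and returns a boxed Boolean):

  // Boxed Boolean: null means the caller did not set the flag, so it is only
  // forwarded when the user provided a value; otherwise the core API's own
  // default behavior applies.
  Boolean streamResults = input.asBoolean(STREAM_RESULTS_PARAM, null);
  if (streamResults != null) {
    action.option("stream-results", streamResults.toString());
  }

  // Primitive boolean: some value is always forwarded, so the procedure-side
  // default (false) silently overrides whatever the core API would have chosen.
  boolean alwaysStream = input.asBoolean(STREAM_RESULTS_PARAM, false);
  action.option("stream-results", Boolean.toString(alwaysStream));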

Contributor

Good catch on the boolean handling. I actually suggested @slfan1989 convert these to primitive booleans because all the procedures here use boolean parameters as explicit enablement flags, where null defaults to disabled.

This gives us a chance to clean up the if (flag != null) pattern below when setting the corresponding flags.
Since we typically backport changes from the latest Spark version to older ones, making this change now should save us some backport work later, even though it means a bit more review effort upfront. Our unit tests should catch any behavioral changes from the semantic shift.

Contributor

I would be in favor of using boolean instead of Boolean, but we do actually have cases where we don't want to pass an argument to the underlying core API if the user didn't provide any input, and thus rely on whatever the core API's default behavior is.

Contributor Author

Thank you for the information! From my perspective, changing it to boolean is feasible. The unit tests have all passed, and the results are as expected. Shall we go ahead and accept this improvement?

Contributor

we need to switch back to using Boolean if it was Boolean originally

@dramaticlly (Contributor) commented Sep 22, 2025

> but we do actually have cases where we don't want to pass an argument to the underlying core API if the user didn't provide any input, and thus rely on whatever the core API's default behavior is

From my understanding, @nastra wants us to keep the original boxed Boolean to convey user intention: an unset default of null means the flag is not applied to the underlying Spark action at all. Now that we have switched to a primitive boolean defaulting to false, such a flag is always passed to the action.

Sorry I missed this initially. Let's switch back to Boolean, just as before this change, and keep to the minimal change necessary to adopt ProcedureInput. It may also be worth adding a code comment to convey this.

@slfan1989 (Contributor, Author) commented Sep 23, 2025

@nastra @dramaticlly Thank you for helping review the code! I have already reverted this part of the code back to Boolean.

defaultValue);
}

public DeleteOrphanFiles.PrefixMismatchMode asPrefixMismatchMode(ProcedureParameter param) {
Contributor

I don't think we should put this method here as it's only used in a single place

Contributor Author

This part of the code has already been improved.

- boolean prefixListing = args.isNullAt(9) ? false : args.getBoolean(9);
+ boolean prefixListing = input.asBoolean(PREFIX_LISTING_PARAM, false);

Long finalOlderThanMillis = olderThanMillis;
Contributor

why is this one needed? It doesn't seem that olderThanMillis is being modified here
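For context, this local-copy pattern is normally only needed when a variable is reassigned before being captured by a lambda, since Java requires captured locals to be effectively final; a minimal self-contained illustration (not code from this PR):

  import java.util.function.Supplier;

  class EffectivelyFinalExample {
    static Supplier<Long> capture(Long olderThanMillis) {
      if (olderThanMillis == null) {
        olderThanMillis = 0L; // reassignment: the parameter is no longer effectively final
      }
      // Without this effectively final copy, the lambda below would not compile.
      Long finalOlderThanMillis = olderThanMillis;
      return () -> finalOlderThanMillis;
    }
  }

If olderThanMillis is never reassigned beforehand, the copy is indeed redundant.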

Contributor

I also suggested this earlier in #13913 (comment)

Integer specId = args.isNullAt(2) ? null : args.getInt(2);
ProcedureInput input = new ProcedureInput(spark(), tableCatalog(), PARAMETERS, args);
Identifier tableIdent = input.ident(TABLE_PARAM);
boolean useCaching = input.asBoolean(USE_CACHING_PARAM, false);
Contributor

semantics are being changed here

@nastra (Contributor) left a comment

Semantics of parameter types are being changed in a few places, and we need to restore the original behavior.

…ures/RewriteManifestsProcedure.java


Spark4.0: Using the default value null.

Co-authored-by: Eduard Tudenhoefner <[email protected]>
@nastra nastra merged commit 16c9dd6 into apache:main Sep 24, 2025
27 checks passed
@slfan1989 (Contributor, Author)

@nastra Thanks for reviewing and merging the code! @dramaticlly Thanks for taking the time to review it!
