[SPARK-53112][SQL][PYTHON][CONNECT] Support TIME in the make_timestamp_ntz and try_make_timestamp_ntz functions in PySpark #51831
Conversation
uros-db left a comment:
Putting this PR on pause until the corresponding Scala PR is merged: #51828.
    def make_timestamp_ntz(  # type: ignore[misc]
        yearsOrDate: "ColumnOrName",
can we change the argument names?
does it break this?
make_timestamp_ntz(years=..., months=..., days=..., hours=..., mins=..., secs=...)
yeah, let's not change the name
Reverted this change; using keyword arguments for the new date and time arguments.
Adding @Yicong-Huang to take over this task and address lint failures & outstanding comments.
@zhengruifeng could you please review this again? I've updated it with setting
    def make_timestamp_ntz(  # type: ignore[misc]
        *args: "ColumnOrName",
shall we remove the *args: "ColumnOrName" and only support the following patterns? @HyukjinKwon
make_timestamp_ntz(years=y, months=mo, days=d, hours=h, mins=mi, secs=s)
make_timestamp_ntz(y, mo, d, h, mins=mi, secs=s)
make_timestamp_ntz(y, mo, d, h, mi, s)
make_timestamp_ntz(date=d, time=t) <- for the new code path
If we keep *args: "ColumnOrName", we need a helper function to preprocess the arguments so that it can be reused in all the places.
def _preprocess_make_timestamp_args(...) -> Tuple[ColumnOrName, ..., ColumnOrName], which returns an 8-tuple for the 8 arguments
we need the *args to match the positional arguments; those are from the original use case. If we remove it, I am afraid we'd cause a breaking change?
I think it won't break positional cases like the following?
make_timestamp_ntz(y, mo, d, h, mi, s)
make_timestamp_ntz(y, mo, d, h, mins=mi, secs=s)
I think you are right. I did one more refactor to remove *args and now we explicitly support the following 4 cases:
make_timestamp_ntz(years=y, months=mo, days=d, hours=h, mins=mi, secs=s)
make_timestamp_ntz(y, mo, d, h, mins=mi, secs=s)
make_timestamp_ntz(y, mo, d, h, mi, s)
make_timestamp_ntz(date=d, time=t)
Could you please check again?
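To make the four accepted shapes concrete, here is a minimal plain-Python model of the dispatch (no PySpark dependency; the function name, the "fields"/"date_time" return values, and the plain ValueError are illustrative stand-ins, not the real implementation):

```python
from typing import Any, Optional


def make_timestamp_ntz_dispatch(
    years: Optional[Any] = None,
    months: Optional[Any] = None,
    days: Optional[Any] = None,
    hours: Optional[Any] = None,
    mins: Optional[Any] = None,
    secs: Optional[Any] = None,
    *,
    date: Optional[Any] = None,
    time: Optional[Any] = None,
) -> str:
    """Toy model of the argument dispatch: returns which code path is taken."""
    field_args = [years, months, days, hours, mins, secs]
    if any(a is not None for a in field_args):
        # Mixing the 6-field form with date/time is rejected (CANNOT_SET_TOGETHER).
        if date is not None or time is not None:
            raise ValueError("years|months|days|hours|mins|secs and date|time")
        return "fields"
    return "date_time"


# The four supported shapes:
assert make_timestamp_ntz_dispatch(years=1, months=2, days=3, hours=4, mins=5, secs=6) == "fields"
assert make_timestamp_ntz_dispatch(1, 2, 3, 4, mins=5, secs=6) == "fields"
assert make_timestamp_ntz_dispatch(1, 2, 3, 4, 5, 6) == "fields"
assert make_timestamp_ntz_dispatch(date="d", time="t") == "date_time"
```

Any call that mixes the two families raises, which mirrors the CANNOT_SET_TOGETHER behavior discussed below.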
| "Value for `<arg_name>` must be between <lower_bound> and <upper_bound> (inclusive), got <actual>" | ||
| ] | ||
| }, | ||
| "WRONG_NUM_ARGS": { |
it seems we can reuse error class CANNOT_SET_TOGETHER:
    raise PySparkValueError(
        errorClass="CANNOT_SET_TOGETHER",
        messageParameters={"arg_list": "years|months|days|hours|mins|secs and date|time"},
    )
Changed!
    @overload
    def make_timestamp_ntz(
        *,
do we still need *?
Yes. This is to mark that date and time (anything after the *) are keyword-only. Otherwise they would be treated as positional as well, and mypy would complain.
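As a standalone illustration of what the bare * does in a signature (plain Python, unrelated to the Spark codebase; names here are made up for the example):

```python
def make_ts(years, months, *, date=None, time=None):
    # Parameters after the bare * are keyword-only.
    return (years, months, date, time)


assert make_ts(1, 2, date="d") == (1, 2, "d", None)

# Passing date positionally is rejected at call time:
try:
    make_ts(1, 2, "d")
    raise AssertionError("should have raised")
except TypeError:
    pass  # date is keyword-only
```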
        hours: Optional["ColumnOrName"] = None,
        mins: Optional["ColumnOrName"] = None,
        secs: Optional["ColumnOrName"] = None,
        *,
do we still need *?
ditto.
        and date is None
        and time is None
    ):
        from typing import cast
cast is already imported at the top of this file
Removed the redundant imports (cast, PySparkValueError).
        date: Optional["ColumnOrName"] = None,
        time: Optional["ColumnOrName"] = None,
    ) -> Column:
        # 6 positional/keyword arguments: years, months, days, hours, mins, secs
I think we can simplify it a bit:

    if years is not None:
        if any(arg is not None for arg in [date, time]):
            raise CANNOT_SET_TOGETHER
        return _invoke_function_over_columns("make_timestamp_ntz", years, ...)
    else:
        if any(arg is not None for arg in [years, months, days, hours, mins, secs]):
            raise CANNOT_SET_TOGETHER
        return _invoke_function_over_columns("make_timestamp_ntz", date, time)

_invoke_function_over_columns will raise NOT_COLUMN_OR_STR if an input is None
Got it. I had made it too complicated by checking all the arguments. I've followed your suggestion and pushed the other validations down to _invoke_function_over_columns.
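A plain-Python sketch of that delegation pattern (all names here are simplified stand-ins: _invoke_model mimics _invoke_function_over_columns rejecting None inputs, and plain ValueError/TypeError stand in for the PySpark error classes):

```python
from typing import Any, Optional


def _invoke_model(name: str, *cols: Any) -> str:
    # Stand-in for _invoke_function_over_columns: it rejects None inputs
    # itself, so the caller does not have to re-validate every argument.
    for c in cols:
        if c is None:
            raise TypeError("NOT_COLUMN_OR_STR")
    return f"{name}({', '.join(map(str, cols))})"


def make_timestamp_ntz_model(
    years: Optional[Any] = None,
    months: Optional[Any] = None,
    days: Optional[Any] = None,
    hours: Optional[Any] = None,
    mins: Optional[Any] = None,
    secs: Optional[Any] = None,
    *,
    date: Optional[Any] = None,
    time: Optional[Any] = None,
) -> str:
    if years is not None:
        # 6-field path: only reject mixing with date/time here.
        if date is not None or time is not None:
            raise ValueError("CANNOT_SET_TOGETHER")
        return _invoke_model("make_timestamp_ntz", years, months, days, hours, mins, secs)
    else:
        # date/time path: reject any leftover field arguments.
        if any(a is not None for a in [months, days, hours, mins, secs]):
            raise ValueError("CANNOT_SET_TOGETHER")
        return _invoke_model("make_timestamp_ntz", date, time)


assert make_timestamp_ntz_model(1, 2, 3, 4, 5, 6) == "make_timestamp_ntz(1, 2, 3, 4, 5, 6)"
assert make_timestamp_ntz_model(date="d", time="t") == "make_timestamp_ntz(d, t)"
```

The caller only decides which branch applies; per-argument None checks live in one place, inside the invoked helper.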
    # Invalid argument combinations - return NULL for try_ functions
    # For try_ functions, invalid inputs should return NULL
    from pyspark.sql.connect.functions import lit

    return lit(None)
I think we should not allow this
I've changed to use default values when inputs are None. Please check.
    _months = lit(0) if months is None else months
    _days = lit(0) if days is None else days
    _hours = lit(0) if hours is None else hours
    _mins = lit(0) if mins is None else mins
    _secs = lit(0) if secs is None else secs
why support None => lit(0) here?
merged to master
@Yicong-Huang some newly added tests actually depend on ANSI, and failed when ANSI is off. I am going to remove them in #52466
[SPARK-53112][SQL][PYTHON][CONNECT] Support TIME in the make_timestamp_ntz and try_make_timestamp_ntz functions in PySpark

### What changes were proposed in this pull request?
Implement the `make_timestamp_ntz` and `try_make_timestamp_ntz` functions in the PySpark & PySpark Connect API.

### Why are the changes needed?
Expand API support for the `make_timestamp_ntz` and `try_make_timestamp_ntz` functions.

### Does this PR introduce _any_ user-facing change?
Yes, the new functions are now available in the Python API.

### How was this patch tested?
Added appropriate Python function tests:
- pyspark.sql.tests.test_functions
- pyspark.sql.tests.connect.test_parity_functions

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#51831 from uros-db/python-try_make_timestamp_ntz.

Lead-authored-by: Yicong-Huang <[email protected]>
Co-authored-by: Uros Bojanic <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>