[SPARK-29776][SQL] rpad and lpad should return NULL when padstring parameter is empty #26477
Conversation
|
@HyukjinKwon, please verify this PR. |
|
@HyukjinKwon, can we fix this now? |
|
@HyukjinKwon, please verify this PR. |
|
I looked at other DBMSs and found that other projects, such as Hive, have already fixed this issue. |
|
Can you add end-to-end tests, too, before accepting tests? |
|
ok @maropu |
|
@maropu and @cloud-fan, I have modified the existing end-to-end test cases; please review again. |
sql/core/src/test/resources/sql-tests/results/postgreSQL/strings.sql.out
|
ok to test |
|
Also, you need to update the migration guide because this is a behaviour change. |
|
Sure, I will update the migration guide as well. |
sql/core/src/test/resources/sql-tests/results/postgreSQL/strings.sql.out
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
|
Can you update the description and the examples for |
|
cc: @HyukjinKwon |
|
Test build #115758 has finished for PR 26477 at commit
|
|
@maropu, I have addressed all the review comments; please check. |
|
Test build #115762 has finished for PR 26477 at commit
|
sql/core/src/test/resources/sql-tests/inputs/postgreSQL/strings.sql
|
Test build #115813 has finished for PR 26477 at commit
|
|
What about |
|
@srowen, for lpad there is a separate JIRA, SPARK-29853; after this one I will reopen the PR for lpad. |
|
These are so closely related that they should be one PR and JIRA |
|
OK, then I will combine both PRs. |
|
Test build #115818 has finished for PR 26477 at commit
|
|
Test build #115820 has finished for PR 26477 at commit
|
|
Test build #115821 has finished for PR 26477 at commit
|
|
Test build #115822 has finished for PR 26477 at commit
|
| assertEquals(fromString("数据砖头"), fromString("数据砖头").rpad(5, EMPTY_UTF8)); | ||
| assertEquals(fromString("数据砖"), fromString("数据砖头").rpad(3, EMPTY_UTF8)); | ||
| assertEquals(EMPTY_UTF8, EMPTY_UTF8.rpad(3, EMPTY_UTF8)); | ||
| assertEquals(fromString(null), fromString("数据砖头").rpad(5, EMPTY_UTF8)); |
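For readers without the Spark source at hand, here is a minimal sketch of the pre-change behaviour that the first three assertions above encode: an empty pad string leaves the input unpadded and truncated to at most `len` characters. It is not Spark's `UTF8String` code; it uses plain `java.lang.String`, and the class and method names (`PadKeepSketch`, `rpadKeep`) are hypothetical.

```java
// Hypothetical illustration only; Spark's real logic lives in UTF8String.
final class PadKeepSketch {
    // Right-pads str to len code points. With an empty pad there is nothing
    // to append, so the input is returned truncated to at most len code points.
    static String rpadKeep(String str, int len, String pad) {
        int n = str.codePointCount(0, str.length());
        if (len <= n || pad.isEmpty()) {
            int keep = Math.max(0, Math.min(len, n));
            return str.substring(0, str.offsetByCodePoints(0, keep));
        }
        int[] padCps = pad.codePoints().toArray();
        StringBuilder sb = new StringBuilder(str);
        for (int i = 0; i < len - n; i++) {
            // Cycle through the pad string until the target length is reached.
            sb.appendCodePoint(padCps[i % padCps.length]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(rpadKeep("数据砖头", 5, "")); // 数据砖头 (unpadded)
        System.out.println(rpadKeep("数据砖头", 3, "")); // 数据砖 (truncated)
        System.out.println(rpadKeep("hi", 5, "ab"));     // hiaba
    }
}
```

The fourth assertion above expects `null` for the same empty-pad call, which is the change the review comments below discuss.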
So, the previous behaviour was like PostgreSQL, but you propose to match it to Hive, MySQL and Oracle?
If that's the case, I am less sure why we should necessarily change it. The lpad and rpad behaviour seems to differ per DBMS. Spark's current behaviour has at least one reference and makes sense as well.
It is reasonable to change this because, as per the description of this function, in the case of an empty pad string the return value should be null.
@HyukjinKwon, I think it is reasonable because other DBMSs, such as Hive and MySQL, changed their handling once they found this problem.
This function's behaviour varies across DBMS implementations, and we have a reference, as you described: PostgreSQL. Can you elaborate on why the change looks reasonable to you? I don't see a strong reason to change the current behaviour.
I think NULL is the better choice; that way we don't return unpadded data.
NULL makes sense for invalid input.
Please refer to the JIRA for how Hive handled this:
https://issues.apache.org/jira/browse/HIVE-15792
So, in my view, we should follow Hive here.
|
I agree that NULL makes more sense in this case, but returning the original string is not a bad idea. I don't have a strong opinion on this one. cc @gatorsmile @viirya @dongjoon-hyun |
|
Same here, I guess I wouldn't change it if it's not clear whether before or after is better. |
|
I have the same opinion. This does not look like a strong reason to change. We can change it if there is a standard. There seem to be different implementations across different DBs. |
|
Thank you all for such a good discussion. If we cannot find a strong reason to change the old behaviour, then we can close this PR. |
|
Closing this as discussed. |
|
+1 for the close. |
What changes were proposed in this pull request?
rpad and lpad should return NULL when the pad string parameter is empty.
Why are the changes needed?
Returns str, right-padded or left-padded with pad to a length of length. If str is longer than length, the return value is shortened to length characters. In case of empty pad string, the return value is null.
Different behaviours of the rpad and lpad functions in the case of an empty pad string:
Note: the RPAD and LPAD functions are implemented as per the definition.
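The definition quoted above can be read as the following minimal sketch. It assumes plain `java.lang.String` in place of Spark's `UTF8String`, counts length in Unicode code points, and also returns null for null arguments, which is an assumption beyond what the description states. The names `NullOnEmptyPadSketch` and `padTo` are hypothetical; this is not the patch itself.

```java
// Hypothetical illustration of the proposed semantics, not the actual patch.
final class NullOnEmptyPadSketch {
    static String rpad(String str, int len, String pad) {
        return padTo(str, len, pad, false);
    }

    static String lpad(String str, int len, String pad) {
        return padTo(str, len, pad, true);
    }

    private static String padTo(String str, int len, String pad, boolean left) {
        // Proposed rule: an empty pad string yields null.
        // Treating null arguments as null is an extra assumption.
        if (str == null || pad == null || pad.isEmpty()) {
            return null;
        }
        int n = str.codePointCount(0, str.length());
        if (len <= n) {
            // Input is already long enough: shorten it to len code points.
            return str.substring(0, str.offsetByCodePoints(0, Math.max(0, len)));
        }
        int[] padCps = pad.codePoints().toArray();
        StringBuilder fill = new StringBuilder();
        for (int i = 0; i < len - n; i++) {
            fill.appendCodePoint(padCps[i % padCps.length]);
        }
        return left ? fill + str : str + fill;
    }
}
```

Under this sketch, `rpad("hi", 5, "ab")` gives `"hiaba"`, `lpad("hi", 5, "ab")` gives `"abahi"`, and either function returns `null` when called with `""` as the pad.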
Does this PR introduce any user-facing change?
No
How was this patch tested?
The existing unit tests were corrected as per this JIRA.