-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-29986][SQL] casting string to date/timestamp/interval should trim all whitespaces #26626
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
cc @cloud-fan @maropu @wangyum @HyukjinKwon thanks in advance. |
|
The thing we care is the SQL behavior, not java behavior. Can you check other databases and see if the |
Not changing string functions, just for other types listed in pr desc. postgres=# select date E'2019-01-01\t';
date
------------
2019-01-01
(1 row)
postgres=# select date E'2019-01-01\t';
date
------------
2019-01-01
(1 row)
postgres=# select cast(E'1\t' as boolean);
bool
------
t
(1 row)
postgres=# select timestamp E'2019-01-01\t';
timestamp
---------------------
2019-01-01 00:00:00
(1 row)hiveselect date('2019-10-10\t');
"_c0"
"2019-10-10"
|
|
can you try |
we are not changing string trim here, but here is the test result. postgres=# select length(trim(E'2019-10-10\t'));
length
--------
11
(1 row) |
|
Test build #114234 has finished for PR 26626 at commit
|
| } | ||
|
|
||
| /** | ||
| * Trims whitespaces (<= ASCII 32) from both ends of this string. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting, so SQL trim just removes space, and java.lang.String.trim() removes everything <= 32. Maybe we could refer to that in this doc, that this is the purpose of this additional trim method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea, let's mention it's the same as java.lang.String.trim
...c/main/scala/org/apache/spark/sql/catalyst/expressions/postgreSQL/PostgreCastToBoolean.scala
Show resolved
Hide resolved
| /** | ||
| * Trims whitespaces (<= ASCII 32) from both ends of this string. | ||
| * | ||
| * @return this string with no spaces at the start or end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: not just space
scala> " 234\t ".toDouble
res0: Double = 234.0
scala> java.lang.Double.valueOf(" 234\t ")
res1: Double = 234.0the way spark deal with casting string to approximate numeric - float and double also trim control characters silently. @cloud-fan I take a look at SQL standard about string trim
|
|
I also look up casting string to other types, the standard only says
It seem that most modern dbs are varying this rule. |
|
Test build #114243 has finished for PR 26626 at commit
|
|
From SQL standard some related information So space means However, seems most of the DBs don't follow the cast part, and we rely on |
|
Test build #114247 has finished for PR 26626 at commit
|
| // Nothing trimmed | ||
| return this; | ||
| } | ||
| return copyUTF8String(s, e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looking at the caller side, I think it's safe to not copy the data. We can add a caveat in the javadoc that: this method doesn't copy the data and the caller side should do copy themselves if they want to hold it for a while.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we not copy when it is an EMPTY_UTF8 either? A bit odd if we do it differently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah that's a good point. OK let's leave it.
| * Trims whitespaces (<= ASCII 32) from both ends of this string. | ||
| * | ||
| * Note that, this method is the same as java's {@link String#trim}, and different from | ||
| * {@link UTF8String#trim()} which only remove only spaces(= ASCII 32) from both ends. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: remove one "only"
| public UTF8String trimAll() { | ||
| int s = 0; | ||
| // skip all of the whitespaces (<=0x20) in the left side | ||
| while (s < this.numBytes && getByte(s) <= 0x20) s++; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we use ' ' instead of 0x20?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe also replace them in method trim? I don't know if whether ok or not I do so in this pr
|
Test build #114269 has finished for PR 26626 at commit
|
|
Test build #114276 has finished for PR 26626 at commit
|
|
retest this please |
|
Test build #114287 has finished for PR 26626 at commit
|
|
retest this please |
|
Test build #114292 has finished for PR 26626 at commit
|
|
LGTM, can we add a migration guide? |
|
migration guide added |
|
Test build #114303 has finished for PR 26626 at commit
|
|
thanks, merging to master! |
|
Thanks for merging and the more suitable PR title :) |
What changes were proposed in this pull request?
A java like string trim method trims all whitespaces that less or equal than 0x20. currently, our UTF8String handle the space =0x20 ONLY. This is not suitable for many cases in Spark, like trim for interval strings, date, timestamps, PostgreSQL like cast string to boolean.
Why are the changes needed?
improve the white spaces handling in UTF8String, also with some bugs fixed
Does this PR introduce any user-facing change?
yes,
string with
control characterat either end can be convert to date/timestamp and interval nowHow was this patch tested?
add ut