[SPARK-32559][SQL] Fix the trim logic in UTF8String.toInt/toLong that didn't handle non-ASCII characters correctly #29375
@cloud-fan @yaooqinn @gengliangwang Could you please help review this?
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
cf674a7 to 8509cff
retest this please
8509cff to a0adf3c
ok to test
…aracters correctly
a0adf3c to 5db4b41
Test build #127144 has finished for PR 29375 at commit
Test build #127148 has finished for PR 29375 at commit
thanks, merging to master!
@WangGuangxin can you open a backport PR for 3.0?
sure
…t handle non-ASCII characters correctly
### What changes were proposed in this pull request?
The trim logic in the Cast expression introduced in apache#26622 trims non-ASCII characters unexpectedly.
Before this patch
After this patch
### Why are the changes needed?
The behavior described above doesn't make sense. It is also inconsistent with the behavior when casting a string to double/float, and with the behavior of Hive.
### Does this PR introduce _any_ user-facing change?
Yes
### How was this patch tested?
Added more UTs.
Closes apache#29375 from WangGuangxin/cast-bugfix.
Authored-by: wangguangxin.cn <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
select cast(' 1' as float);
select cast(' 1 ' as DOUBLE);
select cast('1.0 ' as DEC);
select cast('1中文' as tinyint);
This is purely to educate me, but those characters are considered whitespace?
This kind of character needs multiple bytes, so getByte(s) <= ' ' may not work.
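A small sketch of why that comparison misfires, assuming the old check compared each signed Java byte against `' '` (ASCII 32). This is a Python model rather than Spark's Java code; `to_signed` and `naive_is_trimmable` are hypothetical helper names used only for illustration.

```python
def to_signed(b: int) -> int:
    """Reinterpret an unsigned byte (0..255) as a Java-style signed byte (-128..127)."""
    return b - 256 if b >= 128 else b


def naive_is_trimmable(b: int) -> bool:
    """Model of a signed `byte <= ' '` check (assumed form of the old logic)."""
    return to_signed(b) <= 32


# '中' needs three bytes in UTF-8: 0xE4 0xB8 0xAD.
encoded = "中".encode("utf-8")
signed = [to_signed(b) for b in encoded]
print(signed)  # [-28, -72, -83]

# Every byte has the high bit set, so as a signed value it is negative
# and therefore passes the `<= 32` test, just like a real space would.
print(all(naive_is_trimmable(b) for b in encoded))  # True
```

So under this assumption, every byte of a multi-byte character looks "smaller than a space" and gets trimmed, although none of them is whitespace.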
… didn't handle non-ASCII characters correctly
### What changes were proposed in this pull request?
This is a backport of #29375.
The trim logic in the Cast expression introduced in #26622 trims non-ASCII characters unexpectedly.
Before this patch
After this patch
### Why are the changes needed?
The behavior described above doesn't make sense. It is also inconsistent with the behavior when casting a string to double/float, and with the behavior of Hive.
### Does this PR introduce _any_ user-facing change?
Yes
### How was this patch tested?
Added more UTs.
Closes #29393 from WangGuangxin/cast-bugfix-branch-3.0.
Authored-by: wangguangxin.cn <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…r change of trimming characters for cast
### What changes were proposed in this pull request?
This PR modifies the comment for `UTF8String.trimAll` and `sql-migration-guide.md`.
The comment for `UTF8String.trimAll` reads as follows.
```
Trims whitespaces ({literal <=} ASCII 32) from both ends of this string.
```
Similarly, `sql-migration-guide.md` describes the behavior of `cast` as follows.
```
In Spark 3.0, when casting string value to integral types(tinyint, smallint, int and bigint),
datetime types(date, timestamp and interval) and boolean type,
the leading and trailing whitespaces (<= ASCII 32) will be trimmed before converted to these type values,
for example, `cast(' 1\t' as int)` results `1`, `cast(' 1\t' as boolean)` results `true`,
`cast('2019-10-10\t' as date)` results the date value `2019-10-10`.
In Spark version 2.4 and below, when casting string to integrals and booleans,
it does not trim the whitespaces from both ends; the foregoing results is `null`,
while to datetimes, only the trailing spaces (= ASCII 32) are removed.
```
However, SPARK-32559 (#29375) changed this behavior: since Spark 3.0.1, only ASCII whitespace characters are trimmed.
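The trimming rule described above can be sketched as follows. This is a minimal Python model, not Spark's Java implementation. It assumes the trimmable set is exactly the bytes 0 to 32 compared as unsigned values (the wording of the migration guide); the exact set in a given Spark version may differ. `trim_all_ascii` and `cast_to_int` are hypothetical names, and `None` stands in for SQL `NULL`.

```python
import re


def trim_all_ascii(data: bytes) -> bytes:
    """Strip leading/trailing bytes in 0..32 (unsigned); never touch non-ASCII bytes."""
    start, end = 0, len(data)
    while start < end and data[start] <= 0x20:
        start += 1
    while end > start and data[end - 1] <= 0x20:
        end -= 1
    return data[start:end]


def cast_to_int(text: str):
    """Trim, then accept only an optional sign followed by ASCII digits."""
    trimmed = trim_all_ascii(text.encode("utf-8"))
    if re.fullmatch(rb"[+-]?[0-9]+", trimmed):
        return int(trimmed)
    return None  # models a failed cast


print(cast_to_int(" 1\t"))    # 1
print(cast_to_int("1中文"))    # None
```

Because Python's `bytes` are unsigned, the multi-byte characters in `'1中文'` are left in place and the cast fails in this model instead of silently yielding `1`.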
### Why are the changes needed?
To follow the previous change.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Confirmed that the documentation builds with the following command.
```
SKIP_API=1 bundle exec jekyll build
```
Closes #33287 from sarutak/fix-utf8string-trim-issue.
Authored-by: Kousuke Saruta <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…acters correctly
### What changes were proposed in this pull request?
The trim logic in the Cast expression introduced in #29375 trims ASCII control characters unexpectedly.
Before this patch
And in Hive
### Why are the changes needed?
The behavior described above is inconsistent with the behavior of Hive.
### Does this PR introduce _any_ user-facing change?
Yes
### How was this patch tested?
Added a UT.
Closes #41535 from Kwafoor/trim_bugfix.
Lead-authored-by: wangjunbo <[email protected]>
Co-authored-by: Junbo wang <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
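To make the whitespace versus control-character distinction concrete, here is a rough Python sketch. It restates the ASCII portion of Java's `Character.isWhitespace` and `Character.isISOControl` definitions; whether Spark's trim logic uses exactly these predicates is an assumption, and the function names are hypothetical.

```python
# ASCII code points that Java's Character.isWhitespace accepts:
# \t \n \u000B \f \r, the separators U+001C..U+001F, and space.
JAVA_ASCII_WHITESPACE = {0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x1C, 0x1D, 0x1E, 0x1F, 0x20}


def is_java_whitespace_ascii(b: int) -> bool:
    return b in JAVA_ASCII_WHITESPACE


def is_iso_control_ascii(b: int) -> bool:
    # ASCII part of Character.isISOControl: U+0000..U+001F and U+007F.
    return b <= 0x1F or b == 0x7F


# Control characters that are NOT whitespace, e.g. NUL (0x00) or DEL (0x7F).
control_only = [
    b for b in range(128)
    if is_iso_control_ascii(b) and not is_java_whitespace_ascii(b)
]
print([hex(b) for b in control_only])
```

A trim that treats every ISO control character as trimmable would also strip the bytes in `control_only`, while a whitespace-only trim would leave them in place.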
### What changes were proposed in this pull request?
The trim logic in the Cast expression introduced in #26622 trims non-ASCII characters unexpectedly.
Before this patch
After this patch
### Why are the changes needed?
The behavior described above doesn't make sense. It is also inconsistent with the behavior when casting a string to double/float, and with the behavior of Hive.
### Does this PR introduce any user-facing change?
Yes
### How was this patch tested?
Added more UTs.