Skip to content

Conversation

@WangGuangxin
Copy link
Contributor

@WangGuangxin WangGuangxin commented Aug 6, 2020

What changes were proposed in this pull request?

The trim logic in Cast expression introduced in #26622 trim non-ASCII characters unexpectly.

Before this patch
image

After this patch
image

Why are the changes needed?

The behavior described above doesn't make sense, and also doesn't consistent with the behavior when cast a string to double/float, as well as doesn't consistent with the behavior of Hive

Does this PR introduce any user-facing change?

Yes

How was this patch tested?

Added more UT

@WangGuangxin
Copy link
Contributor Author

@cloud-fan @yaooqinn @gengliangwang Could you please help review this?

@yaooqinn
Copy link
Member

yaooqinn commented Aug 6, 2020

retest this please

@cloud-fan
Copy link
Contributor

ok to test

@WangGuangxin WangGuangxin changed the title [SPARK-32559][SQL]Fix the trim logic in UTF8String.toInt/toLong did't handle Chinese characters correctly [SPARK-32559][SQL]Fix the trim logic in UTF8String.toInt/toLong did't handle non-ASCII characters correctly Aug 6, 2020
@SparkQA
Copy link

SparkQA commented Aug 6, 2020

Test build #127144 has finished for PR 29375 at commit a0adf3c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 6, 2020

Test build #127148 has finished for PR 29375 at commit 5db4b41.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan cloud-fan closed this in 9a35b93 Aug 7, 2020
@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan
Copy link
Contributor

@WangGuangxin can you open a backport PR for 3.0?

@WangGuangxin
Copy link
Contributor Author

@WangGuangxin can you open a backport PR for 3.0?

sure

WangGuangxin added a commit to WangGuangxin/spark that referenced this pull request Aug 7, 2020
…t handle non-ASCII characters correctly

The trim logic in Cast expression introduced in apache#26622 trim non-ASCII characters unexpectly.

Before this patch
![image](https://user-images.githubusercontent.com/1312321/89513154-caad9b80-d806-11ea-9ebe-17c9e7d1b5b3.png)

After this patch
![image](https://user-images.githubusercontent.com/1312321/89513196-d731f400-d806-11ea-959c-6a7dc29dcd49.png)

The behavior described above doesn't make sense, and also doesn't consistent with the behavior when cast a string to double/float, as well as doesn't consistent with the behavior of Hive

Yes

Added more UT

Closes apache#29375 from WangGuangxin/cast-bugfix.

Authored-by: wangguangxin.cn <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
select cast(' 1' as float);
select cast(' 1 ' as DOUBLE);
select cast('1.0 ' as DEC);
select cast('1中文' as tinyint);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is purely to educate me, but those characters are considered whitespace?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This kind of character needs multiple bytes, so getByte(s) <= ' ' may not work.

dongjoon-hyun pushed a commit that referenced this pull request Aug 9, 2020
… did't handle non-ASCII characters correctly

### What changes were proposed in this pull request?

This is a backport of #29375
The trim logic in Cast expression introduced in #26622 trim non-ASCII characters unexpectly.

Before this patch
![image](https://user-images.githubusercontent.com/1312321/89513154-caad9b80-d806-11ea-9ebe-17c9e7d1b5b3.png)

After this patch
![image](https://user-images.githubusercontent.com/1312321/89513196-d731f400-d806-11ea-959c-6a7dc29dcd49.png)

### Why are the changes needed?
The behavior described above doesn't make sense, and also doesn't consistent with the behavior when cast a string to double/float, as well as doesn't consistent with the behavior of Hive

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
Added more UT

Closes #29393 from WangGuangxin/cast-bugfix-branch-3.0.

Authored-by: wangguangxin.cn <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
cloud-fan pushed a commit that referenced this pull request Jul 13, 2021
…r change of trimming characters for cast

### What changes were proposed in this pull request?

This PR modifies comment for `UTF8String.trimAll` and`sql-migration-guide.mld`.
The comment for `UTF8String.trimAll` says like as follows.
```
Trims whitespaces ({literal <=} ASCII 32) from both ends of this string.
```
Similarly, `sql-migration-guide.md` mentions about the behavior of `cast` like as follows.
```
In Spark 3.0, when casting string value to integral types(tinyint, smallint, int and bigint),
datetime types(date, timestamp and interval) and boolean type,
the leading and trailing whitespaces (<= ASCII 32) will be trimmed before converted to these type values,
for example, `cast(' 1\t' as int)` results `1`, `cast(' 1\t' as boolean)` results `true`,
`cast('2019-10-10\t as date)` results the date value `2019-10-10`.
In Spark version 2.4 and below, when casting string to integrals and booleans,
it does not trim the whitespaces from both ends; the foregoing results is `null`,
while to datetimes, only the trailing spaces (= ASCII 32) are removed.
```

But SPARK-32559 (#29375) changed the behavior and only whitespace ASCII characters will be trimmed since Spark 3.0.1.

### Why are the changes needed?

To follow the previous change.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Confirmed the document built by the following command.
```
SKIP_API=1 bundle exec jekyll build
```

Closes #33287 from sarutak/fix-utf8string-trim-issue.

Authored-by: Kousuke Saruta <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
cloud-fan pushed a commit that referenced this pull request Jul 13, 2021
…r change of trimming characters for cast

### What changes were proposed in this pull request?

This PR modifies comment for `UTF8String.trimAll` and`sql-migration-guide.mld`.
The comment for `UTF8String.trimAll` says like as follows.
```
Trims whitespaces ({literal <=} ASCII 32) from both ends of this string.
```
Similarly, `sql-migration-guide.md` mentions about the behavior of `cast` like as follows.
```
In Spark 3.0, when casting string value to integral types(tinyint, smallint, int and bigint),
datetime types(date, timestamp and interval) and boolean type,
the leading and trailing whitespaces (<= ASCII 32) will be trimmed before converted to these type values,
for example, `cast(' 1\t' as int)` results `1`, `cast(' 1\t' as boolean)` results `true`,
`cast('2019-10-10\t as date)` results the date value `2019-10-10`.
In Spark version 2.4 and below, when casting string to integrals and booleans,
it does not trim the whitespaces from both ends; the foregoing results is `null`,
while to datetimes, only the trailing spaces (= ASCII 32) are removed.
```

But SPARK-32559 (#29375) changed the behavior and only whitespace ASCII characters will be trimmed since Spark 3.0.1.

### Why are the changes needed?

To follow the previous change.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Confirmed the document built by the following command.
```
SKIP_API=1 bundle exec jekyll build
```

Closes #33287 from sarutak/fix-utf8string-trim-issue.

Authored-by: Kousuke Saruta <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 57a4f31)
Signed-off-by: Wenchen Fan <[email protected]>
cloud-fan pushed a commit that referenced this pull request Jul 13, 2021
…r change of trimming characters for cast

### What changes were proposed in this pull request?

This PR modifies comment for `UTF8String.trimAll` and`sql-migration-guide.mld`.
The comment for `UTF8String.trimAll` says like as follows.
```
Trims whitespaces ({literal <=} ASCII 32) from both ends of this string.
```
Similarly, `sql-migration-guide.md` mentions about the behavior of `cast` like as follows.
```
In Spark 3.0, when casting string value to integral types(tinyint, smallint, int and bigint),
datetime types(date, timestamp and interval) and boolean type,
the leading and trailing whitespaces (<= ASCII 32) will be trimmed before converted to these type values,
for example, `cast(' 1\t' as int)` results `1`, `cast(' 1\t' as boolean)` results `true`,
`cast('2019-10-10\t as date)` results the date value `2019-10-10`.
In Spark version 2.4 and below, when casting string to integrals and booleans,
it does not trim the whitespaces from both ends; the foregoing results is `null`,
while to datetimes, only the trailing spaces (= ASCII 32) are removed.
```

But SPARK-32559 (#29375) changed the behavior and only whitespace ASCII characters will be trimmed since Spark 3.0.1.

### Why are the changes needed?

To follow the previous change.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Confirmed the document built by the following command.
```
SKIP_API=1 bundle exec jekyll build
```

Closes #33287 from sarutak/fix-utf8string-trim-issue.

Authored-by: Kousuke Saruta <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 57a4f31)
Signed-off-by: Wenchen Fan <[email protected]>
cloud-fan pushed a commit that referenced this pull request Jul 13, 2021
…r change of trimming characters for cast

### What changes were proposed in this pull request?

This PR modifies comment for `UTF8String.trimAll` and`sql-migration-guide.mld`.
The comment for `UTF8String.trimAll` says like as follows.
```
Trims whitespaces ({literal <=} ASCII 32) from both ends of this string.
```
Similarly, `sql-migration-guide.md` mentions about the behavior of `cast` like as follows.
```
In Spark 3.0, when casting string value to integral types(tinyint, smallint, int and bigint),
datetime types(date, timestamp and interval) and boolean type,
the leading and trailing whitespaces (<= ASCII 32) will be trimmed before converted to these type values,
for example, `cast(' 1\t' as int)` results `1`, `cast(' 1\t' as boolean)` results `true`,
`cast('2019-10-10\t as date)` results the date value `2019-10-10`.
In Spark version 2.4 and below, when casting string to integrals and booleans,
it does not trim the whitespaces from both ends; the foregoing results is `null`,
while to datetimes, only the trailing spaces (= ASCII 32) are removed.
```

But SPARK-32559 (#29375) changed the behavior and only whitespace ASCII characters will be trimmed since Spark 3.0.1.

### Why are the changes needed?

To follow the previous change.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Confirmed the document built by the following command.
```
SKIP_API=1 bundle exec jekyll build
```

Closes #33287 from sarutak/fix-utf8string-trim-issue.

Authored-by: Kousuke Saruta <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 57a4f31)
Signed-off-by: Wenchen Fan <[email protected]>
flyrain pushed a commit to flyrain/spark that referenced this pull request Sep 21, 2021
…r change of trimming characters for cast

### What changes were proposed in this pull request?

This PR modifies comment for `UTF8String.trimAll` and`sql-migration-guide.mld`.
The comment for `UTF8String.trimAll` says like as follows.
```
Trims whitespaces ({literal <=} ASCII 32) from both ends of this string.
```
Similarly, `sql-migration-guide.md` mentions about the behavior of `cast` like as follows.
```
In Spark 3.0, when casting string value to integral types(tinyint, smallint, int and bigint),
datetime types(date, timestamp and interval) and boolean type,
the leading and trailing whitespaces (<= ASCII 32) will be trimmed before converted to these type values,
for example, `cast(' 1\t' as int)` results `1`, `cast(' 1\t' as boolean)` results `true`,
`cast('2019-10-10\t as date)` results the date value `2019-10-10`.
In Spark version 2.4 and below, when casting string to integrals and booleans,
it does not trim the whitespaces from both ends; the foregoing results is `null`,
while to datetimes, only the trailing spaces (= ASCII 32) are removed.
```

But SPARK-32559 (apache#29375) changed the behavior and only whitespace ASCII characters will be trimmed since Spark 3.0.1.

### Why are the changes needed?

To follow the previous change.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Confirmed the document built by the following command.
```
SKIP_API=1 bundle exec jekyll build
```

Closes apache#33287 from sarutak/fix-utf8string-trim-issue.

Authored-by: Kousuke Saruta <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 57a4f31)
Signed-off-by: Wenchen Fan <[email protected]>
yaooqinn pushed a commit that referenced this pull request Jun 13, 2023
…acters correctly

### What changes were proposed in this pull request?
The trim logic in Cast expression introduced in #29375 trim ASCII control characters unexpectly.

Before this patch
![image](https://github.com/apache/spark/assets/25627922/ca6a2fb1-2143-4264-84d1-70b6bb755ec7)
And hive
![image](https://github.com/apache/spark/assets/25627922/017aaa4a-133e-4396-9694-79f03f027bbe)

### Why are the changes needed?
The behavior described above doesn't consistent with the behavior of Hive

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
add ut

Closes #41535 from Kwafoor/trim_bugfix.

Lead-authored-by: wangjunbo <[email protected]>
Co-authored-by: Junbo wang <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
yaooqinn pushed a commit that referenced this pull request Jun 13, 2023
…acters correctly

### What changes were proposed in this pull request?
The trim logic in Cast expression introduced in #29375 trim ASCII control characters unexpectly.

Before this patch
![image](https://github.com/apache/spark/assets/25627922/ca6a2fb1-2143-4264-84d1-70b6bb755ec7)
And hive
![image](https://github.com/apache/spark/assets/25627922/017aaa4a-133e-4396-9694-79f03f027bbe)

### Why are the changes needed?
The behavior described above doesn't consistent with the behavior of Hive

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
add ut

Closes #41535 from Kwafoor/trim_bugfix.

Lead-authored-by: wangjunbo <[email protected]>
Co-authored-by: Junbo wang <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
(cherry picked from commit 80588e4)
Signed-off-by: Kent Yao <[email protected]>
yaooqinn pushed a commit that referenced this pull request Jun 13, 2023
…acters correctly

### What changes were proposed in this pull request?
The trim logic in Cast expression introduced in #29375 trim ASCII control characters unexpectly.

Before this patch
![image](https://github.com/apache/spark/assets/25627922/ca6a2fb1-2143-4264-84d1-70b6bb755ec7)
And hive
![image](https://github.com/apache/spark/assets/25627922/017aaa4a-133e-4396-9694-79f03f027bbe)

### Why are the changes needed?
The behavior described above doesn't consistent with the behavior of Hive

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
add ut

Closes #41535 from Kwafoor/trim_bugfix.

Lead-authored-by: wangjunbo <[email protected]>
Co-authored-by: Junbo wang <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
(cherry picked from commit 80588e4)
Signed-off-by: Kent Yao <[email protected]>
czxm pushed a commit to czxm/spark that referenced this pull request Jun 19, 2023
…acters correctly

### What changes were proposed in this pull request?
The trim logic in Cast expression introduced in apache#29375 trim ASCII control characters unexpectly.

Before this patch
![image](https://github.com/apache/spark/assets/25627922/ca6a2fb1-2143-4264-84d1-70b6bb755ec7)
And hive
![image](https://github.com/apache/spark/assets/25627922/017aaa4a-133e-4396-9694-79f03f027bbe)

### Why are the changes needed?
The behavior described above doesn't consistent with the behavior of Hive

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
add ut

Closes apache#41535 from Kwafoor/trim_bugfix.

Lead-authored-by: wangjunbo <[email protected]>
Co-authored-by: Junbo wang <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Jun 20, 2023
…acters correctly

### What changes were proposed in this pull request?
The trim logic in Cast expression introduced in apache#29375 trim ASCII control characters unexpectly.

Before this patch
![image](https://github.com/apache/spark/assets/25627922/ca6a2fb1-2143-4264-84d1-70b6bb755ec7)
And hive
![image](https://github.com/apache/spark/assets/25627922/017aaa4a-133e-4396-9694-79f03f027bbe)

### Why are the changes needed?
The behavior described above doesn't consistent with the behavior of Hive

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
add ut

Closes apache#41535 from Kwafoor/trim_bugfix.

Lead-authored-by: wangjunbo <[email protected]>
Co-authored-by: Junbo wang <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
(cherry picked from commit 80588e4)
Signed-off-by: Kent Yao <[email protected]>
GladwinLee pushed a commit to lyft/spark that referenced this pull request Oct 10, 2023
…acters correctly

### What changes were proposed in this pull request?
The trim logic in Cast expression introduced in apache#29375 trim ASCII control characters unexpectly.

Before this patch
![image](https://github.com/apache/spark/assets/25627922/ca6a2fb1-2143-4264-84d1-70b6bb755ec7)
And hive
![image](https://github.com/apache/spark/assets/25627922/017aaa4a-133e-4396-9694-79f03f027bbe)

### Why are the changes needed?
The behavior described above doesn't consistent with the behavior of Hive

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
add ut

Closes apache#41535 from Kwafoor/trim_bugfix.

Lead-authored-by: wangjunbo <[email protected]>
Co-authored-by: Junbo wang <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
(cherry picked from commit 80588e4)
Signed-off-by: Kent Yao <[email protected]>
catalinii pushed a commit to lyft/spark that referenced this pull request Oct 10, 2023
…acters correctly

### What changes were proposed in this pull request?
The trim logic in Cast expression introduced in apache#29375 trim ASCII control characters unexpectly.

Before this patch
![image](https://github.com/apache/spark/assets/25627922/ca6a2fb1-2143-4264-84d1-70b6bb755ec7)
And hive
![image](https://github.com/apache/spark/assets/25627922/017aaa4a-133e-4396-9694-79f03f027bbe)

### Why are the changes needed?
The behavior described above doesn't consistent with the behavior of Hive

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
add ut

Closes apache#41535 from Kwafoor/trim_bugfix.

Lead-authored-by: wangjunbo <[email protected]>
Co-authored-by: Junbo wang <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
(cherry picked from commit 80588e4)
Signed-off-by: Kent Yao <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants