@yaooqinn
Member

What changes were proposed in this pull request?

A Java-like string trim method trims all whitespace characters less than or equal to 0x20. Currently, our UTF8String handles only the space character (0x20). This is not suitable for many cases in Spark, such as trimming interval, date, and timestamp strings, or a PostgreSQL-like cast from string to boolean.
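
For illustration, a minimal Java sketch of the difference described above. The class and helper names are hypothetical and not part of this patch: `String.trim()` drops every leading and trailing character <= 0x20, while a space-only trim leaves a trailing tab in place.

```java
final class TrimComparison {
    // Hypothetical helper: removes only the space character (0x20) from both ends.
    static String trimSpacesOnly(String s) {
        int start = 0;
        int end = s.length();
        while (start < end && s.charAt(start) == ' ') start++;
        while (end > start && s.charAt(end - 1) == ' ') end--;
        return s.substring(start, end);
    }

    public static void main(String[] args) {
        String input = " 2019-01-01\t ";
        // String.trim() removes the leading space, the trailing space, and the tab.
        System.out.println(input.trim().length());          // 10
        // The space-only trim stops at the tab, so the tab stays in the result.
        System.out.println(trimSpacesOnly(input).length()); // 11
    }
}
```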

Why are the changes needed?

Improve whitespace handling in UTF8String, and fix some bugs along the way.

Does this PR introduce any user-facing change?

Yes.
Strings with control characters at either end can now be converted to date/timestamp and interval.

How was this patch tested?

Added unit tests.

@yaooqinn
Member Author

cc @cloud-fan @maropu @wangyum @HyukjinKwon thanks in advance.

@cloud-fan
Contributor

What we care about is the SQL behavior, not the Java behavior. Can you check other databases and see whether the trim function trims all chars whose ASCII code is <= 0x20?

@yaooqinn
Member Author

> What we care about is the SQL behavior, not the Java behavior. Can you check other databases and see whether the trim function trims all chars whose ASCII code is <= 0x20?

This does not change the string functions, only the casts for the other types listed in the PR description.
Besides, the former interval parser should be able to handle these control chars.

postgres=# select date E'2019-01-01\t';
    date
------------
 2019-01-01
(1 row)

postgres=# select cast(E'1\t' as boolean);
 bool
------
 t
(1 row)

postgres=# select timestamp E'2019-01-01\t';
      timestamp
---------------------
 2019-01-01 00:00:00
(1 row)

hive

select date('2019-10-10\t');
"_c0"
"2019-10-10"

@cloud-fan
Contributor

Can you try select trim('2019-10-10\t') in other databases?

@yaooqinn
Member Author

> Can you try select trim('2019-10-10\t') in other databases?

We are not changing string trim here, but here are the test results.

postgres=# select length(trim(E'2019-10-10\t'));
 length
--------
     11
(1 row)

presto> select length(trim('2019-10-10' || CHR(09)));
 _col0
-------
    10
(1 row)

hive

select length(trim('2019-10-10\t'))
"_c0"
"11"

@SparkQA

SparkQA commented Nov 21, 2019

Test build #114234 has finished for PR 26626 at commit 0bd8239.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}

/**
* Trims whitespaces (<= ASCII 32) from both ends of this string.

Member

Interesting, so SQL trim just removes space, and java.lang.String.trim() removes everything <= 32. Maybe we could refer to that in this doc, that this is the purpose of this additional trim method.

Contributor

Yeah, let's mention that it's the same as java.lang.String.trim.

/**
* Trims whitespaces (<= ASCII 32) from both ends of this string.
*
* @return this string with no spaces at the start or end

Member

Nit: not just space

@yaooqinn
Member Author

scala> " 234\t ".toDouble
res0: Double = 234.0

scala> java.lang.Double.valueOf(" 234\t ")
res1: Double = 234.0

The way Spark casts strings to approximate numeric types (float and double) also trims control characters silently. @cloud-fan @srowen
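
The same point as a small plain-Java sketch (the class and method names are made up for illustration). `Double.parseDouble` is documented to ignore leading and trailing whitespace as if by `String.trim()`, whereas `Integer.parseInt` rejects such input. This only shows JDK behavior and makes no claim about how Spark casts strings to integral types.

```java
final class NumericParseTrim {
    // Double.parseDouble strips leading/trailing characters <= 0x20 before parsing.
    static double parseApproximate(String s) {
        return Double.parseDouble(s);
    }

    // Integer.parseInt does no trimming, so surrounding whitespace is an error.
    static boolean parsesAsInt(String s) {
        try {
            Integer.parseInt(s);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(parseApproximate(" 234\t ")); // 234.0
        System.out.println(parsesAsInt(" 234\t "));      // false
        System.out.println(parsesAsInt("234"));          // true
    }
}
```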

I took a look at what the SQL standard says about string trim:

  1. If <trim character> is not specified, then ' ' is implicit.

b) If <trim character> is specified, then let SC be the value of <trim character>; otherwise, let SC be <space>.

@yaooqinn
Member Author

I also looked up casting strings to other types. The standard only says:

If SD is character string, then SV is replaced by TRIM ( BOTH ' ' FROM VE )

It seems that most modern databases deviate from this rule.

@SparkQA

SparkQA commented Nov 21, 2019

Test build #114243 has finished for PR 26626 at commit 41feca0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

From SQL standard

let SRC be <trim source>. TRIM ( SRC ) is equivalent to TRIM ( BOTH ' ' FROM SRC ).
cast specification
If SD is character string, then SV is replaced by SV with any leading or trailing <space>s removed.

some related information

L( <left bracket> <colon> SPACE <colon> <right bracket> )
is the set of all character strings of length 1 (one) that are the <space> character.
r) L( <left bracket> <colon> WHITESPACE <colon> <right bracket> )
is the set of all character strings of length 1 (one) that are white space characters.
...
white space
consecutive sequences of one or more characters that have no glyphs

So space means ' ', and white space means all chars whose ascii code <= 32. trim and cast should only remove spaces.

However, it seems most DBs don't follow the cast part, and since we rely on Double.valueOf, it's hard to change this behavior. I think it's OK to trim white spaces in cast.

@SparkQA

SparkQA commented Nov 21, 2019

Test build #114247 has finished for PR 26626 at commit 72a2065.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// Nothing trimmed
return this;
}
return copyUTF8String(s, e);

Contributor

looking at the caller side, I think it's safe to not copy the data. We can add a caveat in the javadoc that: this method doesn't copy the data and the caller side should do copy themselves if they want to hold it for a while.

Member Author

do we not copy when it is an EMPTY_UTF8 either? A bit odd if we do it differently.

Contributor

ah that's a good point. OK let's leave it.

* Trims whitespaces (<= ASCII 32) from both ends of this string.
*
* Note that, this method is the same as java's {@link String#trim}, and different from
* {@link UTF8String#trim()} which only remove only spaces(= ASCII 32) from both ends.

Contributor

nit: remove one "only"

public UTF8String trimAll() {
int s = 0;
// skip all of the whitespaces (<=0x20) in the left side
while (s < this.numBytes && getByte(s) <= 0x20) s++;

Contributor

shall we use ' ' instead of 0x20?

Member Author

Maybe also replace them in the trim method? I don't know whether it's OK to do that in this PR.
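
For readers following this thread, a self-contained sketch of the trimming loop under discussion, written against a plain `byte[]` rather than the real UTF8String (the class `TrimAllSketch` and its methods are hypothetical stand-ins). It writes the bound as `' '`, which is the same value as 0x20, and returns the input array itself when nothing is trimmed. Since Java's `byte` is signed, the sketch also checks the lower bound so that the bytes of multi-byte UTF-8 sequences, which are negative as signed values, are not treated as whitespace.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class TrimAllSketch {
    // A byte is trimmable when it lies in 0x00..0x20 (' ' == 0x20).
    // The b >= 0 check excludes bytes >= 0x80, which are negative as signed bytes.
    static boolean isTrimmable(byte b) {
        return b >= 0 && b <= ' ';
    }

    // Trims bytes <= 0x20 from both ends of a UTF-8 encoded byte array.
    static byte[] trimAll(byte[] bytes) {
        int s = 0;
        // Skip trimmable bytes on the left side.
        while (s < bytes.length && isTrimmable(bytes[s])) s++;
        if (s == bytes.length) {
            // Everything was trimmed.
            return new byte[0];
        }
        int e = bytes.length - 1;
        // Skip trimmable bytes on the right side.
        while (e > s && isTrimmable(bytes[e])) e--;
        if (s == 0 && e == bytes.length - 1) {
            // Nothing trimmed: hand back the same array without copying.
            return bytes;
        }
        return Arrays.copyOfRange(bytes, s, e + 1);
    }

    // Convenience wrapper for trying the sketch with String inputs.
    static String trimAllString(String str) {
        byte[] trimmed = trimAll(str.getBytes(StandardCharsets.UTF_8));
        return new String(trimmed, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println("[" + trimAllString("\t 2019-01-01 \n") + "]"); // [2019-01-01]
    }
}
```

Whether the trimmed case should copy or share the underlying data is the design question raised in the review above; this sketch copies only when something was actually trimmed.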

@SparkQA

SparkQA commented Nov 22, 2019

Test build #114269 has finished for PR 26626 at commit 79abc93.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 22, 2019

Test build #114276 has finished for PR 26626 at commit 773eb4b.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yaooqinn
Member Author

retest this please

@SparkQA

SparkQA commented Nov 22, 2019

Test build #114287 has finished for PR 26626 at commit 773eb4b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yaooqinn
Member Author

retest this please

@SparkQA

SparkQA commented Nov 22, 2019

Test build #114292 has finished for PR 26626 at commit 773eb4b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

LGTM, can we add a migration guide?

@yaooqinn
Member Author

Migration guide added.

@SparkQA

SparkQA commented Nov 22, 2019

Test build #114303 has finished for PR 26626 at commit 3cd5433.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan changed the title from "[SPARK-29986][SQL] Introduce java like string trim to UTF8String" to "[SPARK-29986][SQL] casting string to date/timestamp/interval should trim all whitespaces" on Nov 25, 2019
@cloud-fan closed this in de21f28 on Nov 25, 2019
@cloud-fan
Contributor

thanks, merging to master!

@yaooqinn
Member Author

Thanks for merging and the more suitable PR title :)

@yaooqinn deleted the SPARK-29986 branch on November 25, 2019 06:39