Conversation

@MaxGekk
Member

@MaxGekk MaxGekk commented Oct 25, 2019

What changes were proposed in this pull request?

In this PR, I propose a new function, stringToInterval(), in IntervalUtils for converting a UTF8String to a CalendarInterval. The function is used when casting a STRING column to an INTERVAL column.

Why are the changes needed?

The proposed implementation is ~10 times faster. For example, parsing 9 interval units on JDK 8:
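Before:

                                          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
9 units w/ interval                               14004          14125         116          0.1       14003.6       0.0X
9 units w/o interval                              13785          14056         290          0.1       13784.9       0.0X

After:

                                          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
9 units w/ interval                                1343           1344           1          0.7        1343.0       0.3X
9 units w/o interval                               1345           1349           8          0.7        1344.6       0.3X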
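
As a quick sanity check on the "~10 times faster" claim, the ratio can be computed from the first numeric column of the rows above (assumed here to be the best time in milliseconds). The names `SpeedupCheck` and `speedup` are illustrative only and are not part of the PR.

```scala
object SpeedupCheck {
  // Timings (ms) copied from the first numeric column of the benchmark rows.
  val beforeWithPrefix = 14004.0
  val afterWithPrefix = 1343.0
  val beforeWithoutPrefix = 13785.0
  val afterWithoutPrefix = 1345.0

  // Speedup is the old time divided by the new time.
  def speedup(before: Double, after: Double): Double = before / after

  def main(args: Array[String]): Unit = {
    println(f"w/ interval:  ${speedup(beforeWithPrefix, afterWithPrefix)}%.1fx")
    println(f"w/o interval: ${speedup(beforeWithoutPrefix, afterWithoutPrefix)}%.1fx")
  }
}
```

Both ratios come out slightly above 10, which is consistent with the stated speedup.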

Does this PR introduce any user-facing change?

No

How was this patch tested?

  • By new tests for stringToInterval in IntervalUtilsSuite
  • By existing tests

@SparkQA

SparkQA commented Oct 25, 2019

Test build #112677 has finished for PR 26256 at commit 479d5bd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented Oct 25, 2019

jenkins, retest this, please

@SparkQA

SparkQA commented Oct 25, 2019

Test build #112684 has finished for PR 26256 at commit 479d5bd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk MaxGekk changed the title [WIP][SPARK-29605][SQL] Optimize string to interval casting [SPARK-29605][SQL] Optimize string to interval casting Oct 25, 2019
@MaxGekk
Member Author

MaxGekk commented Oct 25, 2019

@cloud-fan @srowen @dongjoon-hyun @maropu May I ask you to review this PR?

private final val minuteStr = UTF8String.fromString("minute")
private final val secondStr = UTF8String.fromString("second")
private final val millisStr = UTF8String.fromString("millisecond")
private final val microsStr = UTF8String.fromString("microsecond")
Member

I know you added final for performance reasons, but do we have actual performance diffs for the benchmarks below with and without this final? (Just out of curiosity...)

Member Author

I re-ran IntervalBenchmark without the final modifiers; there is actually no difference.

private final val millisStr = UTF8String.fromString("millisecond")
private final val microsStr = UTF8String.fromString("microsecond")

def stringToInterval(input: UTF8String): Option[CalendarInterval] = {
Member

Why did you use Option here? If this method had the same return type as safeFromString, the diff in Cast.scala could be much smaller.

Member

nit: safeFromString -> safeAndFastFromString? Also, could you leave some comments here like This is the fast version of safeFromString...?

Member Author

Why did you use Option here?

The same reason stringToDate and stringToTimestamp return Option[...]. I believe all 3 functions should follow a similar convention: None for errors and Some[] for valid values.

Returning null instead of None looks slightly ugly in Scala but this should produce less garbage. Let me think of ...
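
To make the trade-off concrete, here is a minimal sketch of the two conventions side by side. `Days`, `parseOpt`, `parseOrNull` and `render` are hypothetical names used only for illustration; they are not the actual IntervalUtils or Cast code.

```scala
object OptionConventionSketch {
  // Hypothetical stand-in for a parsed value such as CalendarInterval.
  final case class Days(n: Int)

  // Convention discussed above: None for errors, Some(...) for valid input.
  def parseOpt(s: String): Option[Days] =
    try Some(Days(s.trim.toInt))
    catch { case _: NumberFormatException => None }

  // Alternative: return null on error, which skips allocating the Some wrapper.
  def parseOrNull(s: String): Days =
    try Days(s.trim.toInt)
    catch { case _: NumberFormatException => null }

  // A caller can bridge the two styles with Option(...), which maps null to None.
  def render(s: String): String =
    Option(parseOrNull(s)).map(d => s"${d.n} days").getOrElse("NULL")
}
```

The Option variant is the more idiomatic Scala API, while the null variant saves one small allocation per successfully parsed value; whether that matters in practice would need to be measured.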

Member Author

nit: safeFromString -> safeAndFastFromString?

I would keep the same name for consistency with stringToDate and stringToTimestamp, or rename it to safeFromUTF8String.

@cloud-fan
Contributor

This might be different from INTERVAL'...'. Does casting a string to the interval type allow a leading "interval" in the string?

@MaxGekk
Member Author

MaxGekk commented Oct 28, 2019

Does casting a string to the interval type allow a leading "interval" in the string?

yes, it does
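
For illustration, accepting an optional leading "interval" could look like the sketch below. This is not the PR's implementation: `stripIntervalPrefix` is a hypothetical helper, and the case-insensitive match and the requirement of a space after the keyword are assumptions made for the example.

```scala
object IntervalPrefixSketch {
  private val Prefix = "interval"

  // Drops an optional leading "interval " keyword (assumed case-insensitive)
  // and returns the remaining text with surrounding spaces trimmed.
  def stripIntervalPrefix(s: String): String = {
    val trimmed = s.trim
    val hasPrefix =
      trimmed.length > Prefix.length &&
        trimmed.regionMatches(true, 0, Prefix, 0, Prefix.length) &&
        trimmed.charAt(Prefix.length) == ' '
    if (hasPrefix) trimmed.substring(Prefix.length).trim else trimmed
  }
}
```

With such a helper, both `'interval 1 day'` and `'1 day'` would reach the unit parser as `1 day`.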

@cloud-fan
Contributor

cloud-fan commented Oct 28, 2019

Yes, Spark allows it now, but I'm asking whether other systems allow it too.

I checked pgsql

cloud0fan=# select 'interval 1 day'::interval;
ERROR:  invalid input syntax for type interval: "interval 1 day"
LINE 1: select 'interval 1 day'::interval;
               ^
cloud0fan=# select '1 day'::interval;
 interval 
----------
 1 day
(1 row)

Can you check more?

@MaxGekk
Member Author

MaxGekk commented Oct 28, 2019

Can you check more?

I can check, but the purpose of this PR is to optimize existing functionality. I don't think changing behavior here is the right choice.

@cloud-fan
Contributor

If we need to change behavior, I'd suggest we wait for it to happen before writing a new implementation to optimize it.

@cloud-fan
Contributor

I've opened #26283 to change the behavior.

…nterval

# Conflicts:
#	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/IntervalUtils.scala
#	sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/IntervalUtilsSuite.scala
@SparkQA

SparkQA commented Oct 29, 2019

Test build #112865 has finished for PR 26256 at commit d68f41e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class _ImputerParams(HasInputCol, HasInputCols, HasOutputCol, HasOutputCols):
  • class _OneHotEncoderParams(HasInputCol, HasInputCols, HasOutputCol, HasOutputCols,
  • abstract class BitAggregate extends DeclarativeAggregate with ExpectsInputTypes
  • case class BitAndAgg(child: Expression) extends BitAggregate
  • case class BitOrAgg(child: Expression) extends BitAggregate
  • case class BitXorAgg(child: Expression) extends BitAggregate
  • case class Version() extends LeafExpression with CodegenFallback
  • case class AlterTableRecoverPartitionsStatement(
  • case class DropNamespaceStatement(
  • case class LoadDataStatement(
  • case class ShowCreateTableStatement(tableName: Seq[String]) extends ParsedStatement
  • case class UncacheTableStatement(
  • case class DropNamespace(
  • case class LocalShuffleReaderExec(child: QueryStageExec) extends UnaryExecNode
  • case class DropNamespaceExec(

…nterval

# Conflicts:
#	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/IntervalUtils.scala
#	sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/IntervalUtilsSuite.scala
@SparkQA

SparkQA commented Nov 2, 2019

Test build #113124 has finished for PR 26256 at commit 9dfb45d.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented Nov 2, 2019

jenkins, retest this, please

case _ if '0' <= b && b <= '9' && fractionScale > 0 =>
fraction = Math.addExact(fraction, Math.multiplyExact(fractionScale, (b - '0')))
fractionScale /= 10
case ' ' =>
Contributor

how about 1. hour and 1. second?

Member Author

spark-sql> select cast('1. seconds' as interval);
interval 1 seconds
spark-sql> select cast('1. days' as interval);
interval 1 days

@SparkQA

SparkQA commented Nov 6, 2019

Test build #113313 has finished for PR 26256 at commit 107d16c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 6, 2019

Test build #113312 has finished for PR 26256 at commit 98dd44f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

case ' ' =>
state = BEGIN_UNIT_NAME
case '.' =>
fractionScale = 100000
Contributor

@cloud-fan cloud-fan Nov 6, 2019

The ANTLR version (IntervalUtils.parseNanos) supports up to 9 digits in the fractional part. Shall we follow it?

Contributor

It seems the behavior is:

  1. If there are at most 9 digits, take the first 6 and parse them as microseconds.
  2. If there are more than 9 digits, fail.
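
The two rules above can be sketched as follows. This is only an illustration of the described behavior, not the code from IntervalUtils; `fractionToMicros` is a hypothetical helper, and treating an empty fraction (as in `1.`) as zero is an assumption.

```scala
object FractionRuleSketch {
  // Converts the digits after the decimal point of a seconds value to microseconds.
  // Up to 9 digits are accepted but only the first 6 contribute; longer input fails.
  def fractionToMicros(digits: String): Option[Long] = {
    val allDigits = digits.forall(c => c >= '0' && c <= '9')
    if (!allDigits || digits.length > 9) {
      None
    } else {
      // Right-pad to nanosecond precision, then truncate to microseconds.
      val nanos = (digits + "000000000").substring(0, 9).toLong
      Some(nanos / 1000L)
    }
  }
}
```

Under these assumptions, `fractionToMicros("111111111")` keeps only the first six digits, and a 10-digit fraction is rejected.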

Member Author

Changed the behavior

val input = "+1 year -1 day"
val result = new CalendarInterval(12, -1, 0)
assert(fromString(input) == result)
test("string to interval: special cases") {
Contributor

can we also test 1. second, 1. hour and 1.111111111 seconds here?

Contributor

@cloud-fan cloud-fan left a comment

LGTM except 2 comments.

@SparkQA

SparkQA commented Nov 6, 2019

Test build #113324 has finished for PR 26256 at commit 8dd9518.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 6, 2019

Test build #113325 has finished for PR 26256 at commit 464eacc.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

case FRACTIONAL_PART =>
b match {
case _ if '0' <= b && b <= '9' && fractionScale > 0 =>
fraction = Math.addExact(fraction, Math.multiplyExact(fractionScale, (b - '0')))
Member Author

@MaxGekk MaxGekk Nov 6, 2019

The fractional part cannot overflow. The maximum value, 999999999, fits in an Int. I will remove the exact functions.
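
A small sketch of why plain Int arithmetic is enough here. It assumes the scale starts at 100000000 (nine fractional digits), which is inferred from the 999999999 maximum mentioned above; `accumulate` is a hypothetical helper, not the PR's code.

```scala
object FractionOverflowSketch {
  // Assumed starting scale: the first fractional digit is worth 10^8 nanoseconds.
  val InitialScale = 100000000

  // Accumulates fractional digits with plain Int arithmetic.
  // Digits beyond the ninth are ignored in this sketch because the scale reaches 0.
  def accumulate(digits: String): Int = {
    var fraction = 0
    var scale = InitialScale
    var i = 0
    while (i < digits.length && scale > 0) {
      fraction += scale * (digits.charAt(i) - '0') // largest single term is 9 * 10^8
      scale /= 10
      i += 1
    }
    fraction
  }
}
```

The worst case is nine 9s, which sums to 999999999 and stays below Int.MaxValue (2147483647), so Math.addExact and Math.multiplyExact are not needed for this step.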

@MaxGekk
Member Author

MaxGekk commented Nov 6, 2019

jenkins, retest this, please

@SparkQA

SparkQA commented Nov 6, 2019

Test build #113339 has finished for PR 26256 at commit 2222f13.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 6, 2019

Test build #113341 has finished for PR 26256 at commit 2222f13.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 29dc59a Nov 7, 2019
cloud-fan pushed a commit that referenced this pull request Nov 11, 2019
… signs and values

### What changes were proposed in this pull request?

With the latest string-to-interval optimization (#26256), some interval strings cannot be cast when there are spaces between the sign and the unit value. After the `PARSE_SIGN` state, the parser goes directly to `PARSE_UNIT_VALUE`, which treats a space character as the end of the value. So when whitespace precedes the actual unit value, parsing fails. We should add a new state such as `TRIM_VALUE` to skip these spaces.

How to reproduce (this applies to revisions after #26256 was merged):

```sql
select cast(v as interval) from values ('+     1 second') t(v);
select cast(v as interval) from values ('-     1 second') t(v);
```
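
The idea of an extra whitespace-skipping state can be sketched with a much smaller state machine that parses only a sign, optional spaces and an integer. The state names mirror those in the description, but `parseSignedValue` and everything else here is a hypothetical simplification rather than the actual fix; overflow handling is deliberately left out.

```scala
object TrimValueSketch {
  private sealed trait State
  private case object ParseSign extends State
  private case object TrimValue extends State
  private case object ParseValue extends State

  // Parses inputs such as "+     1" or "-1" into a signed number.
  def parseSignedValue(s: String): Option[Long] = {
    var state: State = ParseSign
    var negative = false
    var value = 0L
    var sawDigit = false
    var i = 0
    while (i < s.length) {
      val c = s.charAt(i)
      state match {
        case ParseSign =>
          // Consume an optional sign, then always move on to skipping spaces.
          if (c == '-') { negative = true; i += 1 }
          else if (c == '+') { i += 1 }
          state = TrimValue
        case TrimValue =>
          // The extra state: skip any spaces between the sign and the digits.
          if (c == ' ') i += 1 else state = ParseValue
        case ParseValue =>
          if (c >= '0' && c <= '9') {
            value = value * 10 + (c - '0')
            sawDigit = true
            i += 1
          } else {
            return None
          }
      }
    }
    if (sawDigit) Some(if (negative) -value else value) else None
  }
}
```

Without the `TrimValue` step, the space after the sign would reach `ParseValue` and be rejected, which corresponds to the failure described above.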

### Why are the changes needed?

Bug fix.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

1. Unit tests
2. A new benchmark test

Closes #26449 from yaooqinn/SPARK-29605.

Authored-by: Kent Yao
Signed-off-by: Wenchen Fan
@MaxGekk MaxGekk deleted the string-to-interval branch June 5, 2020 19:41