Skip to content

Conversation

@MaxGekk
Copy link
Member

@MaxGekk MaxGekk commented Oct 20, 2019

What changes were proposed in this pull request?

In the PR, I propose to replace existing implementation of CalendarInterval.fromCaseInsensitiveString based on a regexp by another implementation using finite state machine. The existing regex assumes particular order of interval units from YEAR to MICROSECOND. Modification of the regexp becomes very hard to support new features.

Why are the changes needed?

  • This improves Spark SQL UX by allowing users to specify interval units in any order
  • Existing regex is hard to extend
  • To maintain feature parity with PostgreSQL

Does this PR introduce any user-facing change?

No

How was this patch tested?

  • By existing tests in CalendarIntervalSuite
  • Add new test for invalid unit dday
  • By new tests for uniqueness and unordered interval units in CalendarIntervalSuite

@SparkQA
Copy link

SparkQA commented Oct 20, 2019

Test build #112339 has finished for PR 26180 at commit b054346.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Copy link
Member Author

MaxGekk commented Oct 20, 2019

@cloud-fan Please, take a look at it.

@cloud-fan
Copy link
Contributor

cloud-fan commented Oct 21, 2019

Let's wait for #26190 first. I'm not sure it's the right direction to keep improving the hand-written regex-based parser instead of using antlr.

@MaxGekk
Copy link
Member Author

MaxGekk commented Oct 21, 2019

@cloud-fan WDYT about modifying it to support UTF8String as the input and use UTF8String methods inside. This can be similar to what we have for timestamps and dates - DateTimeUtils.stringToTimestamp and stringToDate. And this should be faster, and could be used from the Cast expression.

@cloud-fan
Copy link
Contributor

that sounds good!

@MaxGekk
Copy link
Member Author

MaxGekk commented Oct 22, 2019

ok. I will close this PR, and will try to implement interval parsing from UTF8String in the future.

@MaxGekk MaxGekk closed this Oct 22, 2019
@MaxGekk
Copy link
Member Author

MaxGekk commented Oct 25, 2019

@cloud-fan Here is the PR #26256 for UTF8String to CalendarInterval.

@MaxGekk MaxGekk deleted the interval-units-order branch June 5, 2020 19:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants