Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: switch to using proper Substrait types for IntervalYearMonth and IntervalDayTime #11471

Merged
merged 1 commit into from
Jul 16, 2024

Conversation

Blizzara
Copy link
Contributor

Which issue does this PR close?

Closes #.

Rationale for this change

I'm looking at supporting date-time stuff properly through Substrait and DataFusion, part of that is introducing (🤞) a MonthDayNano type into Substrait. While looking into it, I noticed there already are types for YearMonth and DayTime, so we could (and should!) be using those types instead of the custom UDTs introduced in #10646 (cc @waynexia )

What changes are included in this PR?

Add consuming support for IntervalYear and IntervalDay types as well as IntervalYearToMonth literals (IntervalDayToSecond already existed). Keep existing consuming support for the UDT for both for backwards compatibility (can probs remove this soon as well, but keeping it for now allows reading plans created by earlier version of DF, in case that's useful?).

Switch producer to produce the Substrait types and literals instead of UDTs for IntervalYearMonth and IntervalDayTime

Remove the parameters from the UDT type that's still kept for IntervalMonthDayNano - parameters are for things like decimal precision and scale, ie. parameters modifying the type itself; we don't need that for the UDT here.

Are these changes tested?

Existing unit tests (in producer.rs)

Are there any user-facing changes?

Start producing Substrait's interval types instead of custom UDT types, where possible.

…nd IntervalDayTime

also clean up IntervalMonthDayNano type - the type itself needs no parameters
@Blizzara Blizzara force-pushed the avo/substrait-cleanup-intervals branch from da02530 to 488f6b7 Compare July 15, 2024 08:58
@Blizzara Blizzara marked this pull request as ready for review July 15, 2024 08:58
@Blizzara Blizzara changed the title chore: switch to using proper Substrait types for IntervalYearMonth and IntervalDayTime feat: switch to using proper Substrait types for IntervalYearMonth and IntervalDayTime Jul 15, 2024
false, // The inner map field is always non-nullable (Arrow #1697),
)), false))
},
Ok(DataType::Map(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

auto-formatting

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like an improvement to me -- less code and more features 👍

cc @waynexia let us know if you want a chance to review

@alamb alamb merged commit 169a0d3 into apache:main Jul 16, 2024
24 checks passed
@alamb
Copy link
Contributor

alamb commented Jul 16, 2024

Thanks again @Blizzara

xinlifoobar pushed a commit to xinlifoobar/datafusion that referenced this pull request Jul 17, 2024
…nd IntervalDayTime (apache#11471)

also clean up IntervalMonthDayNano type - the type itself needs no parameters
xinlifoobar pushed a commit to xinlifoobar/datafusion that referenced this pull request Jul 18, 2024
…nd IntervalDayTime (apache#11471)

also clean up IntervalMonthDayNano type - the type itself needs no parameters
@Blizzara Blizzara deleted the avo/substrait-cleanup-intervals branch July 19, 2024 12:29
wiedld pushed a commit to influxdata/arrow-datafusion that referenced this pull request Jul 31, 2024
…nd IntervalDayTime (apache#11471)

also clean up IntervalMonthDayNano type - the type itself needs no parameters
@waynexia
Copy link
Member

waynexia commented Aug 2, 2024

Sorry for the huge delay! (I'm catching up on things in past weeks)

I used to use custom interval types to match our definition of intervals. Because of two reasons:

  • Substrait has restricted the value range of its two interval types to [-10,000..10,000] years (ref)
  • Our interval types have different ranges with substraits. E.g., IntervalDayTime may overflow here because it only has one i32 as millisecond for substrait's two i32s (second and millisecond).

For the consumer side, we can support both types and report errors when substrait's overflow for wider scenarios. But I'm not sure if we should only produce substrait types for the above reasons. Given this kind of type information should be static, we cannot decide which type to use depending on the value at runtime.

One possible solution is not to emit IntervalYearMonth and IntervalDayTime, which may overflow. And use IntervalMonthDayNano instead as it's the widest one. But this is not ideal as well. This seems to be a nontrivial task that leaps the gap between two type systems while keeping the difference of the type system between our system and substrait as small as possible.

@Blizzara
Copy link
Contributor Author

Blizzara commented Aug 20, 2024

Ah, thanks for the notes @waynexia !

Substrait has restricted the value range of its two interval types to [-10,000..10,000] years (ref)

Yeah.. however no-one cares about those limits currently afaik, and I don't know if it's good enough reason to reduce interoperability for values that fall within that (reasonably large :D) range. As the limit is just a written text, DF -> DF roundtrip would anyways work outside of those limits too.

  • Our interval types have different ranges with substraits. E.g., IntervalDayTime may overflow here because it only has one i32 as millisecond for substrait's two i32s (second and millisecond).

Good catch - however, if I think this right, that means we may fail to consume some Substrait plans - but that's true anyways, even if we were to use Substrait UDTs for intervals, right? Now we fail to consume the too-large values, while with UDTs we'd fail to consume everything. Since it's DF that has a more restricted type, any DF-created plan is guaranteed to be within the limits of not overflowing. As such, I think it makes sense to both produce and consume the Substrait DayTime interval type, given it's able to represent all DF DayTime intervals. Did I miss something? 😅

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants