Implicit casts/promotions in functions within Substrait's core extensions #251

jvanstraten · 2022-07-20T09:14:00Z

This has come up in a number of PRs now, and it's becoming a bit of a mess. I feel like we need to come to some collective decision about how to approach this, and then document clearly what Substrait's stance on this is:

Given for example something like multiply(i32, i32), should it A) return i32 for consistency, or B) promote to i64 because i32 can overflow?

Note that most examples are far more subtle, such as a sum(i8) -> i64 aggregate or power(fp32, fp32) -> fp64, because they can still overflow or lose precision but are just less likely to do so with the larger return type.

I personally feel quite strongly that functions should never promote, i.e. option A, no exceptions. This is how virtually all programming languages behave: you define only (i8, i8) -> i8, (i16, i16) -> i16, ..., (i64, i64) -> i64, and then have a separate mechanism to promote or convert between representations. In most languages that promotion is implicit (in C in particular integers are usually i32 by default, so i8 + literal -> i32 and thus you might not notice), in Substrait all such promotions must be explicit via a cast operation (which is much easier on consumers). The net result is the same. Just as it is up to the producer to make sure that there are no implicit casts, it should also be up to the producer (or user) to promote if needed.

Here's a non-exhaustive list of problems and inconsistencies I see with option B:

What if we later add i128, fp80/fp128, bigint, etc? Do we then impose a breaking change for basically all of Substrait to maintain consistency, or do we make new functions? If the latter, why not also versions that promote only up to 32-bit values?
What if a particular consumer adds those types as extensions instead?
What about add(i8, i8)? The correct result type would be i9.
What about aggregations, which can overflow no matter how large you make the result type? (unless we would introduce bigints and arbitrary-precision floats)
What if a consumer doesn't support the larger types for whatever reason? (unlikely with the types currently defined by Substrait, just pointing out the inconsistencies here)

An argument in favor of option B that I've seen is that first upcasting and then performing the operation is somehow less performant. I agree that might be the case if implemented naively, due to cache locality and memory-boundedness if the intermediate results are actually materialized in memory or transferred over network before being passed to the operation. However, to be blunt: that's the consumer's problem. Nothing in Substrait is mandating that everything must map to kernels in the consumer one-to-one; in fact it is highly unlikely for this to be the case in general, because that would mean that every consumer out there is already implementing things in exactly the same way. A Substrait plan intended to be compatible between consumers defines only the functionality of the plan. It's then up to each consumer to implement that functionality as efficiently as possible with their internal representation, insofar as that differs from Substrait. So, if a consumer that has a sum_i64(i8, i8) -> i64 kernel internally encounters a sum(cast<i64>(i8), cast<i64>(i8)) -> i64 pattern in an incoming plan, it should use it rather than naively doing the casts first just because Substrait writes it that way. That adds complexity to consumers, sure, but you can't have both interoperability between consumers and one-to-one mapping for each consumer.

As a stopgap solution until such optimization logic exists, and in order for Substrait to also be able to represent consumer-specific optimized plans (or hand-optimized plans a la inline assembly), extensions exist: if a consumer supports that sum_i64(i8, i8) -> i64 operation, that consumer is free to and IMO should define that function in their own extension set, and a producer that knows about the consumer it's talking to is free to use it. But such compound functions do not belong in Substrait's core function set, if only to make it very clear that a plan with such "inline assembly" operations is no longer expected to consumer-agnostic.

Again, this is just my opinion. I'm opening this issue to have a centralized place to discuss this topic in.

The text was updated successfully, but these errors were encountered:

westonpace · 2022-07-20T23:12:42Z

+1

jvanstraten · 2022-07-21T09:18:49Z

Something I'm less sure about is how to deal with awkward promotions from varchar/fixedchar to string when the size can't be determined a priori, such as in many of the functions proposed in #248. Following above writeup to the letter (which I'm not sure we should do to this extent) even concat would be concat(varchar<N>, varchar<N>) -> varchar<N> rather than concat(varchar<N>, varchar<M>) -> varchar<N+M>, but that then begs the question how you'd do that for fixedchar. I'm inclined to say promotions within a type class are okay, but beyond a type class (like concat(varchar<N>, varchar<M>) -> string) isn't; however, I don't really have strong arguments to back that up.

Especially for varchar/fixedchar interactions I have no idea what the de facto standard behavior is. I'd be inclined to say that we simply shouldn't define functions on fixedchar that can manipulate length in a variable way, and require the producer to emit a cast from fixedchar<N> to varchar<N> or string first. AFAICT, fixedchar and varchar are useful more so for storage than for computation, so I would think this isn't that big of a deal in practice, but I could be wrong. I do think, though, that if someone is intentionally using varchar over string in a computation, they probably have a good reason for that, and we shouldn't promote to string.

The same applies to decimal. I think the de facto standard there is to automatically increase precision and scale, at least within the confines of the 38-digit limit of decimal128 (and rounding to some arbitrary precision of I believe 6 digits after the decimal separator...), so we should probably follow that.

On the other hand... we can also decide to be strict and require casts up front everywhere. That would significantly reduce the complexity of the YAMLs: we could then probably get away with throwing type derivation expressions out the window entirely and limit the syntax to pattern matching and substitution only. concat<fixedchar<N>, fixedchar<M>> -> fixedchar<M+N> is a possible counterexample to that though, if we would really want to keep that in (I think that's a pretty special case though).

- coin the term "core extensions" and describe why they (and extensions in general) exist - document guidelines from substrait-io#251 and substrait-io#307 - document common options in core extensions (substrait-io#254) - better integrate the function documentation generator output into the website

jvanstraten · 2022-09-05T15:40:47Z

Since no one seems to have any problems with this, I've documented this in #301 (mostly by just referring to this issue, though).

- coin the term "core extensions" and describe why they (and extensions in general) exist - document guidelines from substrait-io#251 and substrait-io#307 - document common options in core extensions (substrait-io#254) - better integrate the function documentation generator output into the website

westonpace · 2023-03-01T21:57:09Z

It appears this discussion has reached its conclusion. I am closing this.

This PR removes some implicit casts that exist in the trig functions. More discussion here: #251 Most of the trig functions already follow this behavior. This should handle the remaining ones.

This was referenced Jul 20, 2022

feat: add product function #250

Merged

feat: add functions for arithmetic, rounding, logarithmic, and string transformations #245

Merged

feat: add trigonometry functions #241

Merged

This was referenced Jul 21, 2022

feat: add string trimming and padding functions #248

Merged

feat: add string containment functions #256

Merged

feat: add standard deviation functions #257

Merged

jvanstraten mentioned this issue Aug 15, 2022

feat: add median and count_distinct aggregation functions #278

Merged

richtia mentioned this issue Aug 16, 2022

feat: add center function #282

Merged

jvanstraten mentioned this issue Aug 17, 2022

docs: add mkdocs plugin to generate markdown from yaml doc for website #277

Merged

thisisnic mentioned this issue Aug 17, 2022

feat: add hyperbolic functions #290

Merged

jvanstraten mentioned this issue Sep 5, 2022

docs: extend extension documentation #310

Closed

jvanstraten mentioned this issue Sep 5, 2022

Establish intended scope of functions in extensions/*.yaml #307

Closed

jvanstraten mentioned this issue Sep 19, 2022

fix: type promotion issue in aggregate function sum #335

Closed

richtia mentioned this issue Oct 20, 2022

feat: update ascii , char_length and strpos functions definitions. #360

Closed

westonpace closed this as completed Mar 1, 2023

richtia mentioned this issue Apr 4, 2024

fix: remove implicit casts in trig extension functions #620

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implicit casts/promotions in functions within Substrait's core extensions #251

Implicit casts/promotions in functions within Substrait's core extensions #251

jvanstraten commented Jul 20, 2022

westonpace commented Jul 20, 2022

jvanstraten commented Jul 21, 2022

jvanstraten commented Sep 5, 2022

westonpace commented Mar 1, 2023

Implicit casts/promotions in functions within Substrait's core extensions #251

Implicit casts/promotions in functions within Substrait's core extensions #251

Comments

jvanstraten commented Jul 20, 2022

westonpace commented Jul 20, 2022

jvanstraten commented Jul 21, 2022

jvanstraten commented Sep 5, 2022

westonpace commented Mar 1, 2023