Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implicit casts/promotions in functions within Substrait's core extensions #251

Closed
jvanstraten opened this issue Jul 20, 2022 · 4 comments
Closed

Comments

@jvanstraten
Copy link
Contributor

This has come up in a number of PRs now, and it's becoming a bit of a mess. I feel like we need to come to some collective decision about how to approach this, and then document clearly what Substrait's stance on this is:

Given for example something like multiply(i32, i32), should it A) return i32 for consistency, or B) promote to i64 because i32 can overflow?

Note that most examples are far more subtle, such as a sum(i8) -> i64 aggregate or power(fp32, fp32) -> fp64, because they can still overflow or lose precision but are just less likely to do so with the larger return type.

I personally feel quite strongly that functions should never promote, i.e. option A, no exceptions. This is how virtually all programming languages behave: you define only (i8, i8) -> i8, (i16, i16) -> i16, ..., (i64, i64) -> i64, and then have a separate mechanism to promote or convert between representations. In most languages that promotion is implicit (in C in particular integers are usually i32 by default, so i8 + literal -> i32 and thus you might not notice), in Substrait all such promotions must be explicit via a cast operation (which is much easier on consumers). The net result is the same. Just as it is up to the producer to make sure that there are no implicit casts, it should also be up to the producer (or user) to promote if needed.

Here's a non-exhaustive list of problems and inconsistencies I see with option B:

  • What if we later add i128, fp80/fp128, bigint, etc? Do we then impose a breaking change for basically all of Substrait to maintain consistency, or do we make new functions? If the latter, why not also versions that promote only up to 32-bit values?
  • What if a particular consumer adds those types as extensions instead?
  • What about add(i8, i8)? The correct result type would be i9.
  • What about aggregations, which can overflow no matter how large you make the result type? (unless we would introduce bigints and arbitrary-precision floats)
  • What if a consumer doesn't support the larger types for whatever reason? (unlikely with the types currently defined by Substrait, just pointing out the inconsistencies here)

An argument in favor of option B that I've seen is that first upcasting and then performing the operation is somehow less performant. I agree that might be the case if implemented naively, due to cache locality and memory-boundedness if the intermediate results are actually materialized in memory or transferred over network before being passed to the operation. However, to be blunt: that's the consumer's problem. Nothing in Substrait is mandating that everything must map to kernels in the consumer one-to-one; in fact it is highly unlikely for this to be the case in general, because that would mean that every consumer out there is already implementing things in exactly the same way. A Substrait plan intended to be compatible between consumers defines only the functionality of the plan. It's then up to each consumer to implement that functionality as efficiently as possible with their internal representation, insofar as that differs from Substrait. So, if a consumer that has a sum_i64(i8, i8) -> i64 kernel internally encounters a sum(cast<i64>(i8), cast<i64>(i8)) -> i64 pattern in an incoming plan, it should use it rather than naively doing the casts first just because Substrait writes it that way. That adds complexity to consumers, sure, but you can't have both interoperability between consumers and one-to-one mapping for each consumer.

As a stopgap solution until such optimization logic exists, and in order for Substrait to also be able to represent consumer-specific optimized plans (or hand-optimized plans a la inline assembly), extensions exist: if a consumer supports that sum_i64(i8, i8) -> i64 operation, that consumer is free to and IMO should define that function in their own extension set, and a producer that knows about the consumer it's talking to is free to use it. But such compound functions do not belong in Substrait's core function set, if only to make it very clear that a plan with such "inline assembly" operations is no longer expected to consumer-agnostic.

Again, this is just my opinion. I'm opening this issue to have a centralized place to discuss this topic in.

@westonpace
Copy link
Member

+1

@jvanstraten
Copy link
Contributor Author

Something I'm less sure about is how to deal with awkward promotions from varchar/fixedchar to string when the size can't be determined a priori, such as in many of the functions proposed in #248. Following above writeup to the letter (which I'm not sure we should do to this extent) even concat would be concat(varchar<N>, varchar<N>) -> varchar<N> rather than concat(varchar<N>, varchar<M>) -> varchar<N+M>, but that then begs the question how you'd do that for fixedchar. I'm inclined to say promotions within a type class are okay, but beyond a type class (like concat(varchar<N>, varchar<M>) -> string) isn't; however, I don't really have strong arguments to back that up.

Especially for varchar/fixedchar interactions I have no idea what the de facto standard behavior is. I'd be inclined to say that we simply shouldn't define functions on fixedchar that can manipulate length in a variable way, and require the producer to emit a cast from fixedchar<N> to varchar<N> or string first. AFAICT, fixedchar and varchar are useful more so for storage than for computation, so I would think this isn't that big of a deal in practice, but I could be wrong. I do think, though, that if someone is intentionally using varchar over string in a computation, they probably have a good reason for that, and we shouldn't promote to string.

The same applies to decimal. I think the de facto standard there is to automatically increase precision and scale, at least within the confines of the 38-digit limit of decimal128 (and rounding to some arbitrary precision of I believe 6 digits after the decimal separator...), so we should probably follow that.

On the other hand... we can also decide to be strict and require casts up front everywhere. That would significantly reduce the complexity of the YAMLs: we could then probably get away with throwing type derivation expressions out the window entirely and limit the syntax to pattern matching and substitution only. concat<fixedchar<N>, fixedchar<M>> -> fixedchar<M+N> is a possible counterexample to that though, if we would really want to keep that in (I think that's a pretty special case though).

jvanstraten added a commit to jvanstraten/substrait that referenced this issue Sep 5, 2022
 - coin the term "core extensions" and describe why they (and extensions
   in general) exist
 - document guidelines from substrait-io#251 and substrait-io#307
 - document common options in core extensions (substrait-io#254)
 - better integrate the function documentation generator output into the
   website
@jvanstraten
Copy link
Contributor Author

Since no one seems to have any problems with this, I've documented this in #301 (mostly by just referring to this issue, though).

jvanstraten added a commit to jvanstraten/substrait that referenced this issue Sep 27, 2022
 - coin the term "core extensions" and describe why they (and extensions
   in general) exist
 - document guidelines from substrait-io#251 and substrait-io#307
 - document common options in core extensions (substrait-io#254)
 - better integrate the function documentation generator output into the
   website
jvanstraten added a commit to jvanstraten/substrait that referenced this issue Sep 27, 2022
 - coin the term "core extensions" and describe why they (and extensions
   in general) exist
 - document guidelines from substrait-io#251 and substrait-io#307
 - document common options in core extensions (substrait-io#254)
 - better integrate the function documentation generator output into the
   website
@westonpace
Copy link
Member

It appears this discussion has reached its conclusion. I am closing this.

richtia added a commit that referenced this issue Apr 12, 2024
This PR removes some implicit casts that exist in the trig functions.

More discussion here:
#251

Most of the trig functions already follow this behavior. This should
handle the remaining ones.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants