Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-39024: [Compute] Allow implicitly casting extension to storage types in compute functions #39200

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

js8544
Copy link
Collaborator

@js8544 js8544 commented Dec 12, 2023

Rationale for this change

Extension types are only logical, i.e. each extension type has a underlying native storage type. We can allow implicit casting from extension types to their storage types. I believe this can make ExtensionTypes much easier to use.

What changes are included in this PR?

  1. In DispatchBest, if no matching kernels exist, allow extension types to be casted to their storage types and match again. The kernel with the minimum number of casts is selected. If there are multiple best kernels, report an error. The dispatching algo is similar to C++'s.
  2. In functions that overwrite DispatchBest, call EnsureExtensionToStorage to simply replace extension types by their storage types, because these functions do not provide native kernels for extension types so we don't need to count the number of casts needed.

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

Copy link

⚠️ GitHub issue #39024 has been automatically assigned in GitHub to PR creator.

@js8544 js8544 changed the title GH-39024: [Compute] Allow implicitly cast extension to storage types in compute functions GH-39024: [Compute] Allow implicitly casting extension to storage types in compute functions Dec 12, 2023
@pitrou
Copy link
Member

pitrou commented Dec 12, 2023

I am not sure it's a good idea to do this automatically and exhaustively. Some operations will depend on the logical meaning of the types: you probably don't want the same casting rules to apply to UUID than to fixed_size_binary<16>, for example.

This seems reasonable for functions that merely forward one of their arguments, though, such as if_else.

@js8544
Copy link
Collaborator Author

js8544 commented Dec 12, 2023

I see. I think there are two things we can do do mitigate this problem:

  1. Implicit casts should be disabled for Cast functions.
  2. Each ExtensionType can have an allow_implicit_cast_ member to opt-in this feature.

@js8544
Copy link
Collaborator Author

js8544 commented Dec 12, 2023

There's also a more passive yet controlled approach: We add a std::vector<std::string> allowed_implicit_casting_functions_ member to each ExtensionType and we implicit cast it only for those functions.

@pitrou
Copy link
Member

pitrou commented Dec 12, 2023

There's also a more passive yet controlled approach: We add a std::vector<std::string> allowed_implicit_casting_functions_ member to each ExtensionType and we implicit cast it only for those functions.

That would work but that means a lot of work for each extension type to follow additions to the range of default compute functions.

I would suggest we perhaps need a more general semantic description of storage type equivalence.

Draft:

class ExtensionType {
 public:
  // Storage equivalence for equality testing and hashing
  static constexpr uint32_t kEquality = 1;
  // Storage equivalence for ordered comparisons
  static constexpr uint32_t kOrdering = 2;
  // Storage equivalence for selections (filter, take, etc.)
  static constexpr uint32_t kSelection = 4;
  // Storage equivalence for arithmetic
  static constexpr uint32_t kArithmetic = 8;
  // Storage equivalence for explicit casts
  static constexpr uint32_t kCasting = 16;
  // Storage equivalence for all operations
  static constexpr uint32_t kAny = std::numeric_limits<uint32_t>::max();

  // By default, an extension type can be implicitly handled as its storage type
  // for selections, equality testing and hashing.
  virtual uint32_t storage_equivalence() const { return kEquality | kSelection; }

@felipecrv WDYT?

@alamb @tustvold How do you handle extension types in Datafusion?

@alamb
Copy link
Contributor

alamb commented Dec 12, 2023

@alamb @tustvold How do you handle extension types in Datafusion?

TLDR is that we don't currently handle them

We have discussed how to do so in apache/datafusion#7923 and apache/arrow-rs#4472 but I would say we have not yet reached a consensus

cc @wjones127 and @yukkit who I think may be interested

@bkietz
Copy link
Member

bkietz commented Dec 12, 2023

I would suggest we perhaps need a more general semantic description of storage type equivalence.

Another way to handle this would be with a property on kernels (or even functions) instead of on ExtensionType:

class ARROW_EXPORT InputType {
 public:
  bool accepts_extension_types_if_storage_matches = false;
  // ...

    // ...
    filter_kernel->signature->in_types()[0].accepts_extension_types_if_storage_matches = true;

This would have the effect of disabling operation on extension arrays' storage by default, but allow enabling it per kernel or function. (The filter for accepted extension types could of course also be more articulated than a simple bool.)

I think this would put configurability where it needs to be since the specific categories selection, equality comparison, and arithmetic have differing levels of difficulty in support:

  • I can't think of a type which would not be selectable by operating on its storage unless the semantics of an array slot depended on its position in the array or on its neighbors somehow, which seems like something incompatible with arrow arrays anyway.
  • Equality comparison for extension types would have to be superset of equality comparison for their storage; if the storage slots are identical then whatever the extension type the corresponding slot in the extension array must be identical also. However it would be possible to define an extension of fixed size list wherein the semantic value of a slot is the sum of list elements from the storage's slot, in which case [1, 2] would be equal to [1, 2] but also equal to [3, 0].
  • Arithmetic is much more complex since even if a type supports addition it might not support multiplication and it might only support an addend of a specific other type (as timestamps can be added to durations but not to other timestamps).

Given that supporting various operations on the stored types is so nuanced, it seems we'd need to reserve handling of it for a system which can express those nuances - like kernel dispatch.


Another solution would be to formalize what we (de facto) have done thus far: allow casting to/from storage types, allowing operation on storage only when explicitly requested.

@felipecrv
Copy link
Contributor

+1 on what @bkietz said: the casting behavior should be declared on the operation, not on the type itself.

Ideally we would have a declarative way of defining more elaborate type constraints on kernels and a decidable solver that dispatches to the appropriate implementation. If we implement that we have to write a checker that ensures there is always only one solution (ordering of declaration of the kernel implementations shouldn't matter).

@pitrou
Copy link
Member

pitrou commented Dec 12, 2023

Ok, say I have a IPv4 extension type backed by a UInt32 storage type. I want to declare that ordering on IPv4 extension values is the same as on the storage type (so that, e.g., sorting would work automatically). How does that work with the accepts_extension_types_if_storage_matches proposal?

@felipecrv
Copy link
Contributor

Ok, say I have a IPv4 extension type backed by a UInt32 storage type. I want to declare that ordering on IPv4 extension values is the same as on the storage type (so that, e.g., sorting would work automatically). How does that work with the accepts_extension_types_if_storage_matches proposal?

The operation declares that it can receive int32 and all logical types defined in terms of int32 (like IPv4).

sort(array<int32>) can also work as sort(array<ipv4>), but sum(int32, int32) can't.

But to be fair, I would have to think more carefully about these things before accepting any magical implicit behavior. See how much pain the C implicit casts (that look inoffensive) have created in the world.

@pitrou
Copy link
Member

pitrou commented Dec 12, 2023

The operation declares that it can receive int32 and all logical types defined in terms of int32 (like IPv4).

sort(array<int32>) can also work as sort(array<ipv4>), but sum(int32, int32) can't.

Yes, but the problem here is another extension type based on uint32 would also inherit the sort behavior even though it might want a different definition of ordering.

@tustvold
Copy link
Contributor

tustvold commented Dec 12, 2023

FWIW the design in arrow-rs currently is that extension types only exist in the metadata. The rationale being that by default kernels should treat them as the storage type. Imo it goes against the whole purpose of extension types if the only way to support them is for every kernel to be aware of all possible extension types.

That's not to say there can't be special kernels for certain extension types, potentially selected as part of a query engine's planning, but I can't help feeling if an array can't be correctly treated as its physical type, it isn't really an extension type but something else entirely...

That's all to say I agree with coercing extension arrays to their physical types, arrow-rs goes even further by not having an extension array at all.

However, this is just my view on this issue, and I suspect others will feel differently

@yukkit
Copy link

yukkit commented Dec 13, 2023

Mark, to be addressed later

@paleolimbot
Copy link
Member

I quite like @pitrou's description of equivalence between a type and its storage, which lets extension type authors get a lot of mileage out of existing internals for simple cases. For example, you probably want group_by(<some_uuid>) + aggregate to "just work" and it's unrealistic for extension type authors to remember or define the appropriate internals to make that happen (if it's even possible today).

Allowing an implicit or automatic cast to storage seems like a unsafe precedent; however, allowing I can't currently think of an example where an explicit Cast(<some extension array>, <its own storage type>) would be inappropriate (maybe that works today, I haven't tried).

Imo it goes against the whole purpose of extension types if the only way to support them is for every kernel to be aware of all possible extension types.

I think it is up to extension type authors to decide what logical manipulation can and cannot be performed on an array. The extension type purpose (IMO) is that implementations take care of the physical manipulations (filter, take, slice, concatenate, read/write files). What I like about Antoine's suggestion is that it makes opting in to more storage behaviour very easy (since many extension types probably want to opt in to some or all storage behaviour) but is safe by default.

@bkietz
Copy link
Member

bkietz commented Dec 13, 2023

I think this conversation needs to move to the mailing list. I'll open a thread with a summary

@bkietz
Copy link
Member

bkietz commented Dec 13, 2023

@mapleFU
Copy link
Member

mapleFU commented Dec 30, 2023

I think another problem it's that maybe it's hard to define some logic in extension type without or even with these casts.

For example, if I want to define a "cast_extension_to_string", I cannot add my logic to the MetaFunction "Cast", do we have some solving for this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[C++] How to use CallFunction() when arg is a ExtensionScalar(ExtensionType)
9 participants