Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Casting to and from unions #6247

Open
samuelcolvin opened this issue Aug 14, 2024 · 6 comments
Open

Casting to and from unions #6247

samuelcolvin opened this issue Aug 14, 2024 · 6 comments

Comments

@samuelcolvin
Copy link
Contributor

Continuing from #6218 (review) — I thought it worth creating a dedicated issue to discuss this before writing any more code.

Well pyarrow doesn't help much (or maybe it helps a lot by giving us flexibility!)

All four cases fail:

Unsupported cast to sparse_union<int_field: int32=0, string_field: string=1> from int32
Unsupported cast to dense_union<int_field: int32=0, string_field: string=1> from int32
Unsupported cast from sparse_union<0: int32=0, 1: string=1> to int32 using function cast_int32
Unsupported cast from dense_union<0: int64=0, 1: bool=1> to int32 using function cast_int32
Python Code
import pyarrow as pa

int_array = pa.array([1, 2, 3, 4, 5], type=pa.int32())

union_fields = [
    pa.field('int_field', pa.int32()),
    pa.field('string_field', pa.string())
]
try:
    print(int_array.cast(pa.union(union_fields, mode='sparse')))
except Exception as e:
    print(e)
else:
    print('success')
try:
    print(int_array.cast(pa.union(union_fields, mode='dense')))
except Exception as e:
    print(e)
else:
    print('success')


sparse_indices = pa.array([0, 1, 0, 1, 0], type=pa.int8())
sparse_children = [
    pa.array([1, None, 3, None, None], type=pa.int32()),
    pa.array([None, 'a', None, 'b', None], type=pa.string()),
]
sparse_union_array = pa.UnionArray.from_sparse(sparse_indices, sparse_children)
# print(sparse_union_array)

try:
    print(sparse_union_array.cast(pa.int32()))
except Exception as e:
    print(e)
else:
    print('success')

dense_types = pa.array([0, 1, 1, 0, 0], type=pa.int8())
dense_offsets = pa.array([0, 0, 1, 1, 2], type=pa.int32())
dense_children = [
    pa.array([5, 6, 7]),
    pa.array([False, True]),
]
dense_union_array = pa.UnionArray.from_dense(dense_types, dense_offsets, dense_children)
# print(dense_union_array)
try:
    print(dense_union_array.cast(pa.int32()))
except Exception as e:
    print(e)
else:
    print('success')

Here's my proposal for what we support and don't support (yet):

Casting to sparse and dense union

We choose the most appropriate child to cast to using the current logic - choose the exact matching type, otherwise the first type you can cast to, left to right.

I think this is fairly simple, uncontroversial and already implemented in #6218.

Casting from sparse and dense unions

I think we can support both sparse and dense using either zip, interleave or take — any suggestion on which will be fastest much appreciated.

We can do this, either:

  1. requiring one or more fields to be castable to the output type, and just casting those children, leaving values associated with other children null
  2. or, requiring all fields to be castable

I think @alamb suggested he'd prefer 2., I started implementing 1. in #6218 — this is so we can use this union cast logic for datafusion-functions-json, to match postgres behaviour.

When the user queries:

select count(*) from foo where (thing->'field')::int=4

The value returned from thing->'field' is a JsonUnion, hence I need that to be cast to an int even though that union includes stuff like string, object and array that can't be cast to an int.

(I'm trying to roughly match PostgreSQL where select ('{"foo": 123}'::jsonb->'foo')::int is valid)

If we go with route 2. above, this expression would raise an error.

Note: for the above case of (thing->'field')::int, we already do an optimisation pass where we convert json_get_union(thing, 'field')::int to json_get_int(thing, 'field') and therefore avoid this problem. My reason for implementing casting from unions in the first place was to support expression where JsonUnion is compared to values, but the optimization won't or can't work, e.g. if thing->'field' is in a CTE, then used later.

I guess if we decide that route 2. is correct, I have a few options:

  • I might able to use query rewriting to rewrite all cases of casting from JsonUnion, e.g. replace all casts in the query with a UDF that does something custom for JsonUnion
  • I could wait for logical types [Proposal] Decouple logical from physical types datafusion#11513 and use them to control casting?
  • We could introduce config on a union to control casting behaviour, that seems like an extension of arrow and therefore unlikely to happen
@samuelcolvin
Copy link
Contributor Author

Much as route 1. (very lax casting of unions) would simplify my use case, in writing this up I realised that probably doesn't make much sense in general.

@alamb
Copy link
Contributor

alamb commented Aug 15, 2024

Much as route 1. (very lax casting of unions) would simplify my use case, in writing this up I realised that probably doesn't make much sense in general.

Yeah -- I think of a Union as the way in arrow to represent a dynamically typed value: each row of a union can be one of a set of different types

So I guess if we are casting from a union array to another type, I would as a user expect that each row of the union (regardless of what variant it was) would be cast to the target type

@gstvg
Copy link
Contributor

gstvg commented Aug 22, 2024

Hey @samuelcolvin, if I get it right, your first option looks like union_extract from DuckDB.
Example usage

I'm trying to implement it at apache/datafusion#12116 in case it helps

DuckDB union cast docs may also be of interest

@samuelcolvin
Copy link
Contributor Author

interesting, I don't know what @alamb things, but I'd say it would be best to implement it in this repo rather than datafusion.

@alamb
Copy link
Contributor

alamb commented Aug 23, 2024

I think implementing a union_extract kernel in arrow-rs makes sense to me. Starting it in datafusion and then porting it to arrow-rs also would make sense

@gstvg
Copy link
Contributor

gstvg commented Aug 23, 2024

I can port the PR here, it will take a few days. Hopefully this would avoid duplicate review work, especially since most of the tests should be rewritten from sqllogictests to unit tests. Do you agree?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants