-
Notifications
You must be signed in to change notification settings - Fork 1.1k
[Variant] Harmonize primitive conversions variant unshredding and casting #8521
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Draft
scovich
wants to merge
5
commits into
apache:main
Choose a base branch
from
scovich:harmonize-unshred-and-cast
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Draft
Changes from all commits
Commits
Show all changes
5 commits
Select commit
Hold shift + click to select a range
0662e12
[Variant] Define and use unshred_variant function
scovich 62b6299
address reviews
scovich 29cdbad
Merge remote-tracking branch 'oss/main' into unshred-variant
scovich 2831e34
remove stale TODO
scovich 509d432
[Variant] Harmonize primitive conversions in cast_to_variant and unsh…
scovich File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing I don't love is we now have two complex macros that feel kind of similar. This new macro simplified the
unshred_variantcode but has ~no effect on thecast_to_variantcode in this file. In fact, the builder definition didn't even change (L397 on the left side of the diff):So in retrospect, I'm not sure what duplication this PR actually eliminated?
Seems like it just moved some definitions around.
Any ideas on a better approach?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree this PR doesn't seem to have eliminated much yet
There seems to be two major pieces of functionality
It seems to me like
arrow_to_variant.rsandunshred_variant.rsstill have non trivial overlap as you mention. What I was hoping was that we could somehow use one set of traits / structs for both those operations which would mean we would getunshredsupport "for free" for other types likeUnionArraysThat was the dream at anyways.
After I spent some time reviewing this PR I think I would like to propose some naming consolidation as we have two functions that are inverses of each other that are confusingly named;
cast_to_variantwhich casts arrow arrays to variantvariant_to_arrowwhich casts variants to arrow arraysI will make a PR to propose
cast_arrow_to_variantandcast_variant_to_arrowfor the kernel names and we can keep the modulesvariant_to_arrowandarrow_to_variantwith the actual codeThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cast_to_variantis a function that leverages thearrow_to_variantmodule.variant_to_arrowis a module that thevariant_getfunction relies on.I guess
variant_getcould be seen as a superset of a hypotheticalcast_variant_to_arrowfunction that is an inverse ofcast_[arrow_]to_variant?That said, I do think it's a good idea to step back and take a hard look at naming conventions as the code matures and the interactions (or lack thereof) become clearer:
cast_to_variant- converts fully strongly-typed data to binary variantarrow_to_variantmoduleshred_variant- shreds a binary variant input according to the requested shredding schemaunshred_variant- converts shredded variant back to binary variantvariant_get- can be used to extract fully strongly-typed data from variant (shredded or not)variant_to_arrowmodule when necessaryAnd, for good measure, some of the conversions probably should move to
type_conversionsmodule, and I don't know where theListLikeArraytrait should live?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think that is a good way to think about it. I don't think we should add
cast_to_variant_to_arrowat this time (I was very confused)Or maybe we could consolidate the conversions into
arrow_to_variantorvariant_to_arrow🤔There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree this is the intuitive view, and arrow_to_variant and unshred_variant both fall loosely in category 1/.
So this is tricky -- shredded variant
typed_valuecolumns must be one of the supported variant shredding types defined by the shredding spec.Unsigned integer types are not on that list, so we'll never need to unshred them (even tho we can shred them, and can even
variant_getthem by converting back).Complex types like Union and Map are also not on that list and so we'll never need to unshred them. But we can still convert them to variant: Whatever union branch is active for each row gets converted to variant, which works fine; maps are trickier -- I think our current code forcibly converts the map key column to string (with a
cast🙀) and then converts the result to variant object.FixedLenBinary is also tricky, because it's not a valid shredded variant type, but UUID uses
FixedLenBinary(16)as its physical type. So when converting to variant, all binary types (including fixed len) convert toVariant::Binaryexcept thatFixedLenBinary(16)with the UUID extension type would convert toVariant::Uuid.So one immediate problem is that the two operations have an overlapping but not equivalent set of types. And some physical types that seem the same have different interpretation/semantics. Even the simplest -- the NULL builder -- has different semantics between the two operations.
Another problem is that the definition of "primitive type" differs between the two modules:
unshred_varianttakes the variant perspective, so string, binary, and boolean arrays can all implement the genericAppendToVariantBuildertrait thatUnshredPrimitiveRowBuilderrelies on. But timestamps, which need extra state (timezone info) are not primitive and need their own builder.arrow_to_varianttakes the arrow perspective, so string, binary and boolean are not primitive types and thus need their own builder implementations. Additionally, all decimal and temporal types need special treatment and they get customer builders as well.Another problem is the need to handle
valuecolumn when unshredding, which is not needed when converting strongly typed data to variant.Overall... I couldn't find a way to slice this better, in spite of my intuition screaming it should be possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, and casting failures also... converting arrow value to variant can fail, and cast options decides whether the failure produces
Variant::Nullor an error. In theory, unshredding is infallible and any failure there is due to invalid data (so should always produce an error).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's an LLM-generated compare/contrast, in case that's helpful:
Details
Analysis of Current Consolidation Effort
Current State
The diff shows an attempt to consolidate primitive type handling by:
ArrowToVarianttrait - A zero-cost trait that both modules can usedefine_arrow_to_variantmacro - A simpler macro for basic type conversionsshared_timestamp_to_variantfunctionKey Differences Between Approaches
arrow_to_variant.rs (Cast semantics)
define_row_builder!- Complex, feature-rich macro supporting:CastOptions,scale)Option<T>return typesT::Native: Into<Variant>constraintVariant::Binaryunshred_variant.rs (Unshred semantics)
define_arrow_to_variant!- Simpler macro supporting:ArrowToVarianttraitVariant::UuidAreas of Redundancy
1. Primitive Type Implementations
Both modules have nearly identical logic for:
2. Enum Variants and Match Arms
Both have large enums with similar variants:
3. Factory Pattern Logic
Both have similar DataType matching logic to create appropriate builders.
Semantic Differences That Prevent Full Sharing
FixedSizeBinary Handling:
Variant::BinaryVariant::UuidError Handling Philosophy:
Variant::NullDecimal Support:
Method Signatures:
append_row(builder, index)append_row(builder, metadata, index)- needs metadata for unshredded valuesThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is my updated suggestion (basically just update comments)