@sdf-jkl sdf-jkl commented Sep 16, 2025

Which issue does this PR close?

Rationale for this change

We should be able to read lists using variant_get

What changes are included in this PR?

Are these changes tested?

I'm trying to start with some basic tests to do some TDD.

Are there any user-facing changes?

@scovich scovich left a comment

Left a couple comments that are hopefully helpful.

Also, we should (eventually) support nesting -- arrays and structs inside arrays.
Let's get simple lists of primitives working first, tho!

Comment on lines 1100 to 1103
let main_struct = crate::variant_array::StructArrayBuilder::new()
    .with_field("metadata", Arc::new(metadata_array))
    .with_field("value", Arc::new(value_array))
    .with_field("typed_value", Arc::new(typed_value_array))

Contributor

Check the variant shredding spec for arrays -- the typed_value for a shredded variant array is a non-nullable group called element, with child fields typed_value and value for shredded and unshredded list elements, respectively.

And then we'll need to build an appropriate GenericListArray out of this string array you built, which gives the offsets for each sub-list.
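
To make the layout concrete, here is a minimal sketch in plain Rust (standard library only, not the Arrow or parquet-variant API). The names `Element`, `ShreddedList`, `sublist` and `example` are hypothetical, and the unshredded bytes are placeholders rather than a real variant encoding. It only illustrates the shape described above: a flat array of `element` entries plus offsets that delimit each row's sub-list.

```rust
/// Hypothetical stand-in for one `element` group. For a given list element
/// we expect the shredded child or the unshredded child to be populated.
#[derive(Debug, Clone, PartialEq)]
pub struct Element {
    /// Unshredded element, kept as raw variant bytes.
    pub value: Option<Vec<u8>>,
    /// Shredded element, assumed here to be shredded as a string.
    pub typed_value: Option<String>,
}

/// Hypothetical stand-in for the list-typed `typed_value` column:
/// a flat child array plus offsets delimiting each row's sub-list.
#[derive(Debug, Clone, PartialEq)]
pub struct ShreddedList {
    /// `offsets[i]..offsets[i + 1]` is the element range for row `i`.
    pub offsets: Vec<i32>,
    pub elements: Vec<Element>,
}

impl ShreddedList {
    /// Number of rows (sub-lists).
    pub fn len(&self) -> usize {
        self.offsets.len().saturating_sub(1)
    }

    /// Returns the elements that belong to row `row`.
    pub fn sublist(&self, row: usize) -> &[Element] {
        let start = self.offsets[row] as usize;
        let end = self.offsets[row + 1] as usize;
        &self.elements[start..end]
    }
}

/// Builds two rows, `["comedy", "drama"]` and `["horror", 123]`,
/// leaving `123` unshredded (placeholder bytes only).
pub fn example() -> ShreddedList {
    let shredded = |s: &str| Element { value: None, typed_value: Some(s.to_string()) };
    ShreddedList {
        offsets: vec![0, 2, 4],
        elements: vec![
            shredded("comedy"),
            shredded("drama"),
            shredded("horror"),
            Element { value: Some(vec![0u8]), typed_value: None },
        ],
    }
}
```

In the real arrays the offsets would live in the GenericListArray and the two children would be columns of a StructArray; the sketch flattens that into plain vectors.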

Contributor Author

Thanks for this too. I was under the wrong impression that the metadata encoding stores the offsets for the actual values. After reading your #8359 and rereading the Variant Encoding spec, I see that the value offsets are within the value encoding itself.

So the outermost typed_value should be a GenericListArray of element entries, each a VariantObject with value and typed_value fields?

Contributor

Yes, exactly! And element is non-nullable (**), while the two children are nullable.

(**) As always in Arrow, it can still have null entries, but only if its parent is already NULL for the same row (so nobody can ever observe a null element).

@scovich scovich left a comment

I'm not sure I understand how these unit tests will translate to variant_get?

sdf-jkl commented Sep 19, 2025

> I'm not sure I understand how these unit tests will translate to variant_get?

Could you elaborate please?

I am currently trying to build just the shredded list VariantArray test case, and while doing so, learning how we could build such arrays in shred_variant later. Once I have a good way of building a simple shredded list VariantArray, it will be easy to work on the rest of the unit tests for variant_get.

scovich commented Sep 19, 2025

> > I'm not sure I understand how these unit tests will translate to variant_get?
>
> Could you elaborate please?
>
> I am currently trying to build just the shredded list VariantArray test case, and while doing so, learning how we could build such arrays in shred_variant later. Once I have a good way of building a simple shredded list VariantArray, it will be easy to work on the rest of the unit tests for variant_get.

No worries -- the current iteration does look like it produces a correct shredded variant containing a list, so I should probably just be patient and let you finish!

sdf-jkl commented Sep 23, 2025

Hey @scovich, I see that in your current implementation of follow_shredded_path_element, when following the shredded path for VariantPathElement::Field succeeds, it returns a ShreddedPathStep::Success(field.shredding_state()). That holds a ShreddingState::Typed, which in turn holds a reference to the typed_value array (which we later use for the next steps).

My question is: does ShreddedPathStep::Success() have to take the ShreddingState by reference?

The reason I am asking is that since we use the output of follow_shredded_path_element to get the values from the shredded VariantArray, shouldn't we be free to drop the outer array once we extract the relevant typed_value?

The only way I have come up with so far to work with list arrays is to build new arrays with arrow_select::take, combining the path index and the GenericListArray offsets.
But with this method we create new arrays within the scope of the function, so we can't put a reference to the array in ShreddedPathStep::Success.
(I just pushed a commit with a non-working implementation of the idea.)

Should we instead look for a way to represent a resulting array consisting of slices?

I just saw #8392.

Comment on lines 135 to 152
// Build the list of indices to take
let mut take_indices = Vec::with_capacity(list_len);
for i in 0..list_len {
    let start = offsets[i] as usize;
    let end = offsets[i + 1] as usize;
    let len = end - start;

    if *index < len {
        take_indices.push(Some((start + index) as u32));
    } else {
        take_indices.push(None);
    }
}

let index_array = UInt32Array::from(take_indices);

// Use Arrow compute kernel to gather elements
let taken = take(field_array, &index_array, None)?;

Contributor Author

You can see the basic idea here
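
For readers without the Arrow crates at hand, the same idea can be sketched with slices only. This is an illustration under assumptions, not the PR's code: `take_indices` and `gather` are hypothetical helper names, offsets are assumed to be non-negative and non-decreasing, and `gather` only mimics what a take kernel does.

```rust
/// For every row described by `offsets`, computes the position in the flat
/// child array of that row's `index`-th element, or `None` when the row's
/// list is too short. Assumes offsets are non-negative and non-decreasing.
pub fn take_indices(offsets: &[i32], index: usize) -> Vec<Option<u32>> {
    offsets
        .windows(2)
        .map(|w| {
            let start = w[0] as usize;
            let len = (w[1] - w[0]) as usize;
            if index < len {
                Some((start + index) as u32)
            } else {
                None
            }
        })
        .collect()
}

/// Gathers values at the computed positions into a new vector, similar in
/// spirit to a take kernel: a `None` index produces a `None` output slot.
pub fn gather<T: Clone>(values: &[T], indices: &[Option<u32>]) -> Vec<Option<T>> {
    indices
        .iter()
        .map(|&i| i.map(|pos| values[pos as usize].clone()))
        .collect()
}
```

With offsets `[0, 2, 4]` and index 0, the computed positions are 0 and 2, so the gathered output holds the first element of each sub-list. The output is always a freshly allocated vector, which is the reason a borrowed result is not available on this path.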

sdf-jkl commented Sep 25, 2025

Hey @scovich, I made it work for one of the simple tests, but the second one does not pass because Variant to Arrow does not support utf8 yet.

Do we have an issue tracking type support in variant_to_arrow? If not, I can make one.

scovich commented Sep 26, 2025

> I made it work for one of the simple tests, but the second one does not pass because Variant to Arrow does not support utf8 yet.
>
> Do we have an issue tracking type support in variant_to_arrow? If not, I can make one.

I'm not sure we have a tracking issue for utf8 support in variant_to_arrow, but I've also noticed that it's an annoying gap for unit testing (we all seem to reach for string values...)

@github-actions github-actions bot added the arrow Changes to the arrow crate label Oct 22, 2025
@sdf-jkl sdf-jkl left a comment

Building on top of the utf8 variant_to_arrow support PR.
The changes in generic_bytes_builder.rs, generic_bytes_view_builder.rs and variant_to_arrow.rs are unrelated to this PR.
Some changes in variant_get.rs and variant_array.rs also come from the utf8 PR, so they can be safely skipped.
The main changes are:

  • Adding ShreddingStateCow enum
  • Adding VariantPathElement::Index support for unnested List VariantArray

Comment on lines +106 to 164
            let Some(list_array) = typed_value.as_any().downcast_ref::<GenericListArray<i32>>()
            else {
                // Downcast failure - if strict cast options are enabled, this should be an error
                if !cast_options.safe {
                    return Err(ArrowError::CastError(format!(
                        "Cannot access index '{}' on non-list type: {}",
                        index,
                        typed_value.data_type()
                    )));
                }
                // With safe cast options, return NULL (missing_path_step)
                return Ok(missing_path_step());
            };

            let offsets = list_array.offsets();
            let values = list_array.values(); // This is a StructArray

            let Some(struct_array) = values.as_any().downcast_ref::<StructArray>() else {
                return Ok(missing_path_step());
            };

            let Some(typed_array) = struct_array.column_by_name("typed_value") else {
                return Ok(missing_path_step());
            };

            // Build the list of indices to take
            let mut take_indices = Vec::with_capacity(list_array.len());
            for i in 0..list_array.len() {
                let start = offsets[i] as usize;
                let end = offsets[i + 1] as usize;
                let len = end - start;

                if *index < len {
                    take_indices.push(Some((start + index) as u32));
                } else {
                    take_indices.push(None);
                }
            }

            let index_array = UInt32Array::from(take_indices);

            // Use Arrow compute kernel to gather elements
            let taken = take(typed_array, &index_array, None)?;

            let metadata_array = BinaryViewArray::from_iter_values(std::iter::repeat_n(
                EMPTY_VARIANT_METADATA_BYTES,
                taken.len(),
            ));

            let struct_array = &StructArrayBuilder::new()
                .with_field("metadata", Arc::new(metadata_array), false)
                .with_field("typed_value", taken, true)
                .build();

            let state = ShreddingState::try_from(struct_array)?;
            Ok(ShreddedPathStep::Success(state.into()))
        }
    }
}

Contributor Author

When we use variant_get on a struct VariantArray, it's relatively easy to extract the typed_value. For example, extracting a.b is straightforward because on the inside it's just:

VariantArray{
    StructArray{
        "typed_value": 
            StructArray{
                "typed_value": PrimitiveArray,  <- We can directly borrow the value into
                                                ShreddedPathStep::Success() because the needed values in the array are contiguous 
                "value": VariantArray

But if we try to extract "typed_value" from a List VariantArray it gets more complicated. For example, extracting 0.0:

VariantArray{
    StructArray{
        "typed_value": 
            ListArray{
                Offsets
                StructArray{
                    "typed_value": PrimitiveArray,  <- but the values are now not contiguous, and the
                                                   output array can only be extracted using offsets, no borrow available
                    "value": VariantArray

Because of this, the ShreddedPathStep::Success returned by follow_shredded_path_element can end up holding either a BorrowedShreddingState or an owned ShreddingState.

To make this work I added a ShreddingStateCow enum and made it the ShreddedPathStep::Success input.
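
The borrowed-or-owned split can be illustrated with the standard library's `std::borrow::Cow`, used here as a stand-in for the custom ShreddingStateCow enum (whose definition is not shown in this thread). `State`, `step_field` and `step_index` are hypothetical names for this sketch only.

```rust
use std::borrow::Cow;

/// Hypothetical stand-in for a shredding state: just a `typed_value` column.
#[derive(Debug, Clone, PartialEq)]
pub struct State {
    pub typed_value: Vec<Option<i64>>,
}

/// Field access: the child column already exists inside the parent,
/// so the step can hand back a borrow.
pub fn step_field<'a>(child: &'a State) -> Cow<'a, State> {
    Cow::Borrowed(child)
}

/// Index access: the selected elements are scattered across the flat child
/// array, so the step has to build a new column and return it owned.
pub fn step_index<'a>(offsets: &[i32], child: &'a State, index: usize) -> Cow<'a, State> {
    let typed_value: Vec<Option<i64>> = offsets
        .windows(2)
        .map(|w| {
            let (start, end) = (w[0] as usize, w[1] as usize);
            if index < end - start {
                child.typed_value[start + index]
            } else {
                None
            }
        })
        .collect();
    Cow::Owned(State { typed_value })
}
```

Both functions return the same type, so a caller can treat the two path element kinds uniformly and only pays for an allocation on the index path.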

          );
      }
-     shredding_state = state;
+     shredding_state = ShreddingStateCow::Owned(state.into_owned());

Contributor Author

Here I could not come up with a way to make the shredding_state for the next path_element be either borrowed or owned depending on the follow_shredded_path_element output.

Made it into_owned() just to pass the borrow checker.
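
A small sketch of the pattern, using `std::borrow::Cow` and a hypothetical `Node` tree instead of the real shredding types: each step's result borrows from the loop variable itself, so it cannot be assigned back to that variable while the borrow is alive. Converting to an owned value first ends the borrow, at the cost of a clone per step.

```rust
use std::borrow::Cow;

/// Hypothetical tree node standing in for a nested shredding state.
#[derive(Debug, Clone, PartialEq)]
pub struct Node {
    pub label: String,
    pub children: Vec<Node>,
}

/// One path step: borrows the child at `index` from `node`, if present.
pub fn step<'a>(node: &'a Node, index: usize) -> Option<Cow<'a, Node>> {
    node.children.get(index).map(Cow::Borrowed)
}

/// Walks `path` starting at `root`. `current` starts as a borrow of `root`,
/// but every later value is made owned before it is assigned back, because
/// the step result borrows from `current` and would otherwise outlive it.
pub fn follow<'a>(root: &'a Node, path: &[usize]) -> Option<Cow<'a, Node>> {
    let mut current: Cow<'a, Node> = Cow::Borrowed(root);
    for &index in path {
        let next = step(&current, index)?.into_owned();
        current = Cow::Owned(next);
    }
    Some(current)
}
```

An empty path returns the original borrow untouched; any non-empty path returns an owned clone, which matches the trade-off described in the comment above.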

sdf-jkl commented Oct 24, 2025

Hey @scovich, I'm ready for another go when you are available, thanks.

@github-actions github-actions bot removed the arrow Changes to the arrow crate label Nov 11, 2025
@sdf-jkl sdf-jkl requested a review from scovich November 11, 2025 03:24
@sdf-jkl sdf-jkl marked this pull request as ready for review December 3, 2025 19:37
sdf-jkl commented Dec 3, 2025

@alamb @klion26 can you take a look?

klion26 commented Dec 4, 2025

> @alamb @klion26 can you take a look?

Thanks for the reminder, I'll check it later today or tomorrow.

@klion26 klion26 left a comment

@sdf-jkl I've done a first round of review and left some comments. Please take a look.

assert_eq!(result_variant.len(), 2);

// Row 0: expect 0 index = "comedy"
assert_eq!(result_variant.value(0), Variant::from("comedy"));

Member

Do we need to cover the case where list[index] is located in the value field of the input variant (such as VariantPath::from(1) here for ["horror", 123])?

Maybe we can add some more test cases:

  • the requested index value is located in the typed_value column
  • the requested index value is located in the value column
  • some nested structures (list in struct, or struct in list)
  • ...

let values = list_array.values(); // This is a StructArray

let Some(struct_array) = values.as_any().downcast_ref::<StructArray>() else {
    return Ok(missing_path_step());

Member

Double-checking here:
According to the variant shredding spec, element is a required group inside the list. Does this return Ok(missing_path_step()); mean the input is not a valid variant?

optional group tags (VARIANT) {
  required binary metadata;
  optional binary value;
  optional group typed_value (LIST) {   # must be optional to allow a null list
    repeated group list {
      required group element {          # shredded element
        optional binary value;
        optional binary typed_value (STRING);
      }
    }
  }
}

- Err(ArrowError::NotYetImplemented(
-     "Pathing into shredded variant array index".into(),
- ))
+ let Some(list_array) = typed_value.as_any().downcast_ref::<GenericListArray<i32>>()

Member

Is there any chance the list length exceeds i32?

Contributor Author

Yes, I wanted to add two cases for i32 and i64 lists.
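
One way to avoid duplicating the index logic for the two offset widths is to make it generic over the offset type. The sketch below uses only the standard library; `take_indices_generic` is a hypothetical name, and in Arrow itself the two cases correspond to GenericListArray<i32> and GenericListArray<i64>.

```rust
use std::convert::TryInto;

/// Computes flat child positions for the `index`-th element of every row,
/// for any offset type convertible to `usize` (for example `i32` or `i64`).
/// Returns `None` for the whole call if an offset is negative, does not fit
/// in `usize`, or the offsets are decreasing.
pub fn take_indices_generic<O>(offsets: &[O], index: usize) -> Option<Vec<Option<usize>>>
where
    O: Copy + TryInto<usize>,
{
    let mut out = Vec::with_capacity(offsets.len().saturating_sub(1));
    for w in offsets.windows(2) {
        let start: usize = w[0].try_into().ok()?;
        let end: usize = w[1].try_into().ok()?;
        let len = end.checked_sub(start)?;
        out.push(if index < len { Some(start + index) } else { None });
    }
    Some(out)
}
```

The positions are returned as `usize` rather than `u32` so that the 64-bit case does not truncate; a real implementation would still have to pick an index array type that the take kernel accepts.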

let end = offsets[i + 1] as usize;
let len = end - start;

if *index < len {

Member

Please correct me if I'm wrong. Here we assume that all the values will be in the typed_value column, and use the indices collected here to retrieve the final value.

What if the value is located in the value column instead of the typed_value column? (Changing the test test_shredded_list_as_string from VariantPath::from(0) to VariantPath::from(1) shows this.)

Contributor Author

This test runs variant_get with the as_type parameter, which specifies the type we are looking for.

The example VariantArray is shredded as String, so a String value cannot be outside the typed_value column, and anything else comes back as null in this test case.

Given the ["comedy", "drama"], ["horror", 123] ListArray, if we try to take the value at index 1 instead of 0, we will get:

thread 'variant_get::test::test_shredded_list_as_string' (51616) panicked at parquet-variant-compute\src\variant_get.rs:1735:9:
assertion `left == right` failed
  left: StringArray
[
  "drama",
  null,
]
 right: StringArray
[
  "comedy",
  "horror",
]

@klion26 klion26 Dec 11, 2025

Sorry for not describing it clearly.

The data ["comedy", "drama"], ["horror", 123], translated into a variant, will have

  • comedy, drama and horror in the typed_value column,
  • and 123 in the value column (it has an incompatible type).

Here, we retrieve all the results from the typed_value column (take in line 148), but ["horror", 123][1] (the second item in the list) should return null (if CastOptions::safe = true) or Err (if CastOptions::safe = false) -- currently, we return null in both cases.

It seems there may be something trickier here (maybe we need a design note for this, like this comment), such as:

  • If the target_type we request here is not a list/struct, then we can use logic like this and respect CastOptions::safe.
  • Do we need to handle variant nesting here or somewhere else?
    • Here, I don't have any answer yet. I'll try to find some time next week for this.
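
The behavior asked for above can be sketched per element as follows. This is a simplified illustration, not the PR's implementation: `ListElement` and `get_string` are hypothetical, the unshredded value is modelled as a plain integer, and the sketch treats any unshredded value as incompatible with the requested string type.

```rust
/// Hypothetical, simplified list element for a list shredded as strings.
#[derive(Debug, Clone, PartialEq)]
pub struct ListElement {
    /// Unshredded element; modelled as an integer for this example.
    pub value: Option<i64>,
    /// Shredded string element.
    pub typed_value: Option<String>,
}

/// Resolves one element when the caller asked for a string.
/// - shredded string present   -> Ok(Some(..))
/// - neither child present     -> Ok(None)
/// - only an unshredded value  -> Ok(None) when `safe`, Err otherwise
pub fn get_string(element: &ListElement, safe: bool) -> Result<Option<String>, String> {
    match (&element.typed_value, element.value) {
        (Some(s), _) => Ok(Some(s.clone())),
        (None, None) => Ok(None),
        (None, Some(_)) if safe => Ok(None),
        (None, Some(v)) => Err(format!("cannot cast {} to a string", v)),
    }
}
```

For the second row of the example, index 1 holds 123 in the value child, so the sketch yields null with `safe = true` and an error with `safe = false`.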

Contributor

NOTE: Other than (partially shredded) object fields, the shredding spec doesn't actually require any other type to shred merely because a compatible typed_value column exists. We have to assume that e.g. value could contain i8, i32, and i64 values even if typed_value is a 64-bit int. And AFAIK, we also have to assume that value could contain a variant array even if typed_value is a list. Super annoying.

Maybe this code already handles that case, but I wanted to make sure to flag it.
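
As an illustration of that fallback for the integer case mentioned above, here is a minimal sketch. `RawInt` and `read_i64` are hypothetical names, and the unshredded value is modelled as an already-decoded integer rather than variant bytes.

```rust
/// Hypothetical decoded form of an unshredded integer variant value.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum RawInt {
    Int8(i8),
    Int32(i32),
    Int64(i64),
}

/// Reads an `i64` from a row whose `typed_value` column is a 64-bit integer.
/// The shredded column is consulted first; if it is null, the unshredded
/// `value` may still hold an integer of any width, which widens losslessly.
pub fn read_i64(typed_value: Option<i64>, value: Option<RawInt>) -> Option<i64> {
    typed_value.or_else(|| {
        value.map(|raw| match raw {
            RawInt::Int8(v) => i64::from(v),
            RawInt::Int32(v) => i64::from(v),
            RawInt::Int64(v) => v,
        })
    })
}
```

The same two-step lookup would apply to lists: a null typed_value for a row does not by itself mean the row has no list, since the list could still be sitting unshredded in value.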

let index_array = UInt32Array::from(take_indices);

// Use Arrow compute kernel to gather elements
let taken = take(typed_array, &index_array, None)?;

Member

It seems this will create a new array. I'm not sure whether we can use some kind of "view" here to avoid that.

Contributor Author

Yes, I believe this is the inconvenient part of working with non-contiguous data.

if *index < len {
take_indices.push(Some((start + index) as u32));
} else {
take_indices.push(None);

Member

Does this mean the index is out of bounds for the current list? Currently we return null; I'm not sure whether we need to return an error in this case. Whether or not we return an error, maybe we can note the behavior somewhere.

@sdf-jkl sdf-jkl Dec 10, 2025

Lists can be variable length within a ListArray, so I don't think it should be an error if the list is not long enough for an index.

(GenericListArray docs)
For example, the ListArray shown in the following diagram stores lists of strings. Note that [] represents an empty (length 0), but non NULL list.

┌─────────────┐
│   [A,B,C]   │
├─────────────┤
│     []      │
├─────────────┤
│    NULL     │
├─────────────┤
│     [D]     │
├─────────────┤
│  [NULL, F]  │
└─────────────┘

Member

Yes, the lists can have different lengths, but do we need to return Err when an out-of-bounds access occurs (as with CastOptions::safe = false), so that the caller knows about it?
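
The two options discussed here can be written down as a single function with a flag. This is only a sketch of one possible policy, not something the thread settled on; `resolve_index` is a hypothetical helper, and `safe` stands in for CastOptions::safe.

```rust
/// Resolves the flat position of element `index` in the list spanning
/// `start..end` of the child array. When the list is too short, `safe`
/// decides between a null result and an error.
pub fn resolve_index(
    start: usize,
    end: usize,
    index: usize,
    safe: bool,
) -> Result<Option<usize>, String> {
    let len = end.saturating_sub(start);
    if index < len {
        Ok(Some(start + index))
    } else if safe {
        Ok(None)
    } else {
        Err(format!(
            "index {} out of bounds for list of length {}",
            index, len
        ))
    }
}
```

With the `[A,B,C]` and `[]` rows from the diagram above, index 1 resolves for the first row, while the empty row yields null in safe mode and an error otherwise.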

sdf-jkl commented Dec 30, 2025

@liamzwbao taking over with #8082

@sdf-jkl sdf-jkl closed this Dec 30, 2025
@sdf-jkl sdf-jkl deleted the shredded_list_support branch January 3, 2026 00:14