Skip to content

Conversation

@scovich
Copy link
Contributor

@scovich scovich commented Jul 3, 2025

Which issue does this PR close?

Rationale for this change

Reviews and testing on #7843 exposed the fact that creating a variant list or object builder has side effects that leave the parent in an inconsistent/invalid state if the child builder is never finalized. Rework the finalization logic to be more direct so that child builders have no effect on their parents before their finish method is called.

What changes are included in this PR?

  • Define a new ParentState enum that tracks the necessary information for a child to fully finalize its parent.
  • Remove the pending machinery from builders

Are these changes tested?

Existing unit tests mostly cover this change.

Added new tests to verify that failing to call finish leaves the parent unmodified.

Are there any user-facing changes?

No.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jul 3, 2025
cur_offset += key.len();
}
// Write final offset
write_offset(&mut metadata, cur_offset, offset_size);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NOTE: We can't use append_offset_array because the final offset isn't known until after we've written all the other offsets. Also, because metadata isn't a ValueBuffer.

Copy link
Contributor

@friendlymatthew friendlymatthew left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really nice. The pending machinery was definitely too eager.

Plus, the offset handling looks much smoother

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks great -- thank you @scovich and @friendlymatthew

FYI @viirya

let obj = variant.as_object().unwrap();
assert_eq!(obj.len(), 2);
assert_eq!(obj.get("first"), Some(Variant::Int8(1)));
assert_eq!(obj.get("second"), Some(Variant::Int8(2)));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Comment on lines +588 to +589
let (parent_state, validate_unique_fields) = self.parent_state();
ListBuilder::new(parent_state, validate_unique_fields)
Copy link
Contributor

@alamb alamb Jul 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want to avoid returningvalidate_unique_fields in a local field you could probably copy it by value with something like

Suggested change
let (parent_state, validate_unique_fields) = self.parent_state();
ListBuilder::new(parent_state, validate_unique_fields)
let validate_unique_fields = self.validate_unique_fields;
let parent_state = self.parent_state();
ListBuilder::new(parent_state, validate_unique_fields)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's what I was doing at first, but it was extra code duplicated at every call site.

I also tried introducing a ProvidesParentState type trait, that the builder new methods could take &mut impl of... but that also didn't work great because parent_state for object builder takes a param while this one does not.

So yeah... the current PR state is the least bad option I thought of.

let obj_builder = ObjectBuilder::new(&mut self.buffer, self.metadata_builder)
.with_validate_unique_fields(self.validate_unique_fields);
self.pending = true;
// Returns validate_unique_fields because we can no longer reference self once this method returns.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment as above -- I think you can remove the need to return validate_unique_fields if you want

self.pending = Some((key, field_start));

obj_builder
// Returns validate_unique_fields because we can no longer reference self once this method returns.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I didn't run this code, but why cannot reference self after this returns? I don't see parent_state consume self.

Copy link
Member

@viirya viirya Jul 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, it is mutably borrowed by the parent state.

}

#[test]
fn test_variant_builder_to_object_builder_no_finish() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm what the difference before and after this change? Without this PR, this test also passes without issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a builder didn't finish, the next append to its parent would finalize the orphaned pending stuff. For a list, that meant an extra offset (as if there were a zero-sized value there, without even a header). For an object, it would associate two different fields with the same value bytes.

That's a good question, why the tests pass without these fixes tho?

Copy link
Contributor Author

@scovich scovich Jul 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just tried cherry picking the tests to main branch and all four failed in the expected ways (empty values for lists; extra fields for objects)?

% cargo test --no-fail-fast -- no_finish
  ...
failures:

---- builder::tests::test_list_builder_to_list_builder_no_finish stdout ----
thread 'builder::tests::test_list_builder_to_list_builder_no_finish' panicked at parquet-variant/src/builder.rs:1545:59:
called `Result::unwrap()` on an `Err` value: InvalidArgumentError("Received empty bytes")
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

---- builder::tests::test_list_builder_to_object_builder_no_finish stdout ----
thread 'builder::tests::test_list_builder_to_object_builder_no_finish' panicked at parquet-variant/src/builder.rs:1566:59:
called `Result::unwrap()` on an `Err` value: InvalidArgumentError("Received empty bytes")

---- builder::tests::test_object_builder_to_object_builder_no_finish stdout ----
thread 'builder::tests::test_object_builder_to_object_builder_no_finish' panicked at parquet-variant/src/builder.rs:1610:9:
assertion `left == right` failed
  left: 3
 right: 2

---- builder::tests::test_object_builder_to_list_builder_no_finish stdout ----
thread 'builder::tests::test_object_builder_to_list_builder_no_finish' panicked at parquet-variant/src/builder.rs:1589:9:
assertion `left == right` failed
  left: 3
 right: 2

failures:
    builder::tests::test_list_builder_to_list_builder_no_finish
    builder::tests::test_list_builder_to_object_builder_no_finish
    builder::tests::test_object_builder_to_list_builder_no_finish
    builder::tests::test_object_builder_to_object_builder_no_finish

test result: FAILED. 2 passed; 4 failed; 0 ignored; 0 measured; 90 filtered out; finished in 0.00s

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh! You meant the other two tests, that exercise the top-level variant builder:

running 6 tests
test builder::tests::test_variant_builder_to_list_builder_no_finish ... ok
test builder::tests::test_variant_builder_to_object_builder_no_finish ... ok
test builder::tests::test_object_builder_to_object_builder_no_finish ... FAILED
test builder::tests::test_list_builder_to_list_builder_no_finish ... FAILED
test builder::tests::test_list_builder_to_object_builder_no_finish ... FAILED
test builder::tests::test_object_builder_to_list_builder_no_finish ... FAILED

The variant builder didn't have any pending mechanism, and failing to finish its child builder already had no side effects. We just didn't have a test that verified this before.


/// Finalizes this object and appends it to its parent, which otherwise remains unmodified.
pub fn finish(mut self) -> Result<(), ArrowError> {
let metadata_builder = self.parent_state.metadata_builder();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct me if I miss anything. The only major difference after this change is the metadata builder and buffer of parent builder are pulled into parent state.

But I took at the current codebase, parent_buffer is also only modified in finish. So for this modification, before and after this change has no difference.

For metadata_builder, looks like it is the major change. Now check_pending_field is removed so any pending nested builder won't be updated to metadata builder when we create another new nested builder.

But seems it is only limited to if the pending nested builder hasn't insert any field yet? If we has inserted fields to the pending nested builder, metadata builder is already updated even the pending nested builder is not finalized.

Maybe it is why all un-finialized nested builders in these new "no finish" tests are without any "insert` calls?

If so, this seems not resolve the issue @alamb raised at the beginning? Because I guess the case @alamb concern is like

let mut builder = VariantBuilder::new();
...
{
  let _object_builder = builder.new_object();
  _object_builder.insert(xxx);
  _object_builder.insert(yyy);
  _object_builder.insert(zzz);
  // _object_builder doesn't been finalized.
}
...
builder.finish();

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's easy enough to add the nested insert(s), but I don't think I understand "the issue" you refer to, that the current code does not fix?

The new code doesn't insert a new object field's name to the metadata dictionary until the builder's finishmethod is called. You are correct, that a scenario like the above could leave extra/unused field names in the metadata dictionary (because the field names were added when the field finished inserting, but then the object that contains those fields was never actually added to its parent). However, spurious names in the metadata dictionary are harmless -- the spec does not require that every metadata entry be referenced as a field id in the corresponding value. In fact, the variant shredding spec requires the metadata dictionary to have "extra" field names, corresponding to fields that were shredded out. That way, a reader who wants to unshred the those fields back to normal variant values doesn't have to rewrite the metadata dictionary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... I split the four inner/outer builder tests to not finish exactly one of outer or inner... and the outer not-finish cases all break with "Received empty bytes". So it looks like there IS some problem still.

Copy link
Contributor Author

@scovich scovich Jul 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ohh.... this relates to

If we didn't finish the outer builder, the top-level variant builder didn't contain any value at all, and the resulting value buffer is empty.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If so, this seems not resolve the issue @alamb raised at the beginning? Because I guess the case @alamb concern is like

let mut builder = VariantBuilder::new();
...
{
  let _object_builder = builder.new_object();
  _object_builder.insert(xxx);
  _object_builder.insert(yyy);
  _object_builder.insert(zzz);
  // _object_builder doesn't been finalized.
}
...
builder.finish();

My initial concern was that it was too easy to forget to call finish and thus lose the structure

Along the way @scovich convinced me that this was acceptable (and actually preferrable) behavior rather than auto completing builders on drop

THe rationale is that there are other ways that drop gets called other than forgetting to call build -- for example an exit on error when using ? where the object probably shouldn't get dropped

I think after this PR the behavior makes much more sense and is now documented

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I don't argue that we should do finalization in drop. I meant, is this "no effect until finalized" only effective for builders which are added but are called "insert" yet?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I saw these tests are updated to include insert, looks like they are good now?

Copy link
Member

@viirya viirya Jul 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new code doesn't insert a new object field's name to the metadata dictionary until the builder's finishmethod is called.

Doesn't it? I saw that insert already modify the metadata builder. It is happened before finish is called, right?

pub fn insert<'m, 'd, T: Into<Variant<'m, 'd>>>(&mut self, key: &str, value: T) {
  let metadata_builder = self.parent_state.metadata_builder();
  let field_id = metadata_builder.upsert_field_name(key); // <- it is inserted into the metadata builder
}

The only difference to previous code, is check_pending_field is removed now so if it is not finished, the field won't be updated to fields and no effect to finish.

@scovich
Copy link
Contributor Author

scovich commented Jul 4, 2025

Hmm, when trying to rebase on top of #7843, I'm getting compilation failures such as:

error[E0509]: cannot move out of type `builder::ListBuilder<'_>`, which implements the `Drop` trait
   --> parquet-variant/src/builder.rs:697:43
    |
697 |         parent_buffer.append_offset_array(self.offsets, Some(data_size), offset_size);
    |                                           ^^^^^^^^^^^^
    |                                           |
    |                                           cannot move out of here
    |                                           move occurs because `self.offsets` has type `std::vec::Vec<usize>`, which does not implement the `Copy` trait

That's not such a nice side effect of the empty impl Drop... but I guess easy enough to solve with std::mem::take.

pub fn insert<'m, 'd, T: Into<Variant<'m, 'd>>>(&mut self, key: &str, value: T) {
self.check_pending_field();
// Get metadata_builder from parent state
let metadata_builder = self.parent_state.metadata_builder();
Copy link
Member

@viirya viirya Jul 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, so modification to metadata builder from a nested builder won't affect the parent builder if the nested builder doesn't finalize in the end ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems so, looks like the metadata builder doesn't affect the finalization too much. It is just used to provide unique field id.

@alamb alamb merged commit 213d3be into apache:main Jul 7, 2025
12 checks passed
@alamb
Copy link
Contributor

alamb commented Jul 7, 2025

Thanks again @scovich and @viirya and @friendlymatthew

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parquet Changes to the parquet crate

Projects

None yet

4 participants