-
Notifications
You must be signed in to change notification settings - Fork 1.1k
[Variant] List and object builders have no effect until finalized #7865
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| cur_offset += key.len(); | ||
| } | ||
| // Write final offset | ||
| write_offset(&mut metadata, cur_offset, offset_size); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NOTE: We can't use append_offset_array because the final offset isn't known until after we've written all the other offsets. Also, because metadata isn't a ValueBuffer.
friendlymatthew
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really nice. The pending machinery was definitely too eager.
Plus, the offset handling looks much smoother
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this looks great -- thank you @scovich and @friendlymatthew
FYI @viirya
| let obj = variant.as_object().unwrap(); | ||
| assert_eq!(obj.len(), 2); | ||
| assert_eq!(obj.get("first"), Some(Variant::Int8(1))); | ||
| assert_eq!(obj.get("second"), Some(Variant::Int8(2))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
| let (parent_state, validate_unique_fields) = self.parent_state(); | ||
| ListBuilder::new(parent_state, validate_unique_fields) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you want to avoid returningvalidate_unique_fields in a local field you could probably copy it by value with something like
| let (parent_state, validate_unique_fields) = self.parent_state(); | |
| ListBuilder::new(parent_state, validate_unique_fields) | |
| let validate_unique_fields = self.validate_unique_fields; | |
| let parent_state = self.parent_state(); | |
| ListBuilder::new(parent_state, validate_unique_fields) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's what I was doing at first, but it was extra code duplicated at every call site.
I also tried introducing a ProvidesParentState type trait, that the builder new methods could take &mut impl of... but that also didn't work great because parent_state for object builder takes a param while this one does not.
So yeah... the current PR state is the least bad option I thought of.
| let obj_builder = ObjectBuilder::new(&mut self.buffer, self.metadata_builder) | ||
| .with_validate_unique_fields(self.validate_unique_fields); | ||
| self.pending = true; | ||
| // Returns validate_unique_fields because we can no longer reference self once this method returns. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment as above -- I think you can remove the need to return validate_unique_fields if you want
| self.pending = Some((key, field_start)); | ||
|
|
||
| obj_builder | ||
| // Returns validate_unique_fields because we can no longer reference self once this method returns. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I didn't run this code, but why cannot reference self after this returns? I don't see parent_state consume self.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, it is mutably borrowed by the parent state.
| } | ||
|
|
||
| #[test] | ||
| fn test_variant_builder_to_object_builder_no_finish() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm what the difference before and after this change? Without this PR, this test also passes without issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If a builder didn't finish, the next append to its parent would finalize the orphaned pending stuff. For a list, that meant an extra offset (as if there were a zero-sized value there, without even a header). For an object, it would associate two different fields with the same value bytes.
That's a good question, why the tests pass without these fixes tho?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just tried cherry picking the tests to main branch and all four failed in the expected ways (empty values for lists; extra fields for objects)?
% cargo test --no-fail-fast -- no_finish
...
failures:
---- builder::tests::test_list_builder_to_list_builder_no_finish stdout ----
thread 'builder::tests::test_list_builder_to_list_builder_no_finish' panicked at parquet-variant/src/builder.rs:1545:59:
called `Result::unwrap()` on an `Err` value: InvalidArgumentError("Received empty bytes")
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
---- builder::tests::test_list_builder_to_object_builder_no_finish stdout ----
thread 'builder::tests::test_list_builder_to_object_builder_no_finish' panicked at parquet-variant/src/builder.rs:1566:59:
called `Result::unwrap()` on an `Err` value: InvalidArgumentError("Received empty bytes")
---- builder::tests::test_object_builder_to_object_builder_no_finish stdout ----
thread 'builder::tests::test_object_builder_to_object_builder_no_finish' panicked at parquet-variant/src/builder.rs:1610:9:
assertion `left == right` failed
left: 3
right: 2
---- builder::tests::test_object_builder_to_list_builder_no_finish stdout ----
thread 'builder::tests::test_object_builder_to_list_builder_no_finish' panicked at parquet-variant/src/builder.rs:1589:9:
assertion `left == right` failed
left: 3
right: 2
failures:
builder::tests::test_list_builder_to_list_builder_no_finish
builder::tests::test_list_builder_to_object_builder_no_finish
builder::tests::test_object_builder_to_list_builder_no_finish
builder::tests::test_object_builder_to_object_builder_no_finish
test result: FAILED. 2 passed; 4 failed; 0 ignored; 0 measured; 90 filtered out; finished in 0.00s
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh! You meant the other two tests, that exercise the top-level variant builder:
running 6 tests
test builder::tests::test_variant_builder_to_list_builder_no_finish ... ok
test builder::tests::test_variant_builder_to_object_builder_no_finish ... ok
test builder::tests::test_object_builder_to_object_builder_no_finish ... FAILED
test builder::tests::test_list_builder_to_list_builder_no_finish ... FAILED
test builder::tests::test_list_builder_to_object_builder_no_finish ... FAILED
test builder::tests::test_object_builder_to_list_builder_no_finish ... FAILED
The variant builder didn't have any pending mechanism, and failing to finish its child builder already had no side effects. We just didn't have a test that verified this before.
|
|
||
| /// Finalizes this object and appends it to its parent, which otherwise remains unmodified. | ||
| pub fn finish(mut self) -> Result<(), ArrowError> { | ||
| let metadata_builder = self.parent_state.metadata_builder(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct me if I miss anything. The only major difference after this change is the metadata builder and buffer of parent builder are pulled into parent state.
But I took at the current codebase, parent_buffer is also only modified in finish. So for this modification, before and after this change has no difference.
For metadata_builder, looks like it is the major change. Now check_pending_field is removed so any pending nested builder won't be updated to metadata builder when we create another new nested builder.
But seems it is only limited to if the pending nested builder hasn't insert any field yet? If we has inserted fields to the pending nested builder, metadata builder is already updated even the pending nested builder is not finalized.
Maybe it is why all un-finialized nested builders in these new "no finish" tests are without any "insert` calls?
If so, this seems not resolve the issue @alamb raised at the beginning? Because I guess the case @alamb concern is like
let mut builder = VariantBuilder::new();
...
{
let _object_builder = builder.new_object();
_object_builder.insert(xxx);
_object_builder.insert(yyy);
_object_builder.insert(zzz);
// _object_builder doesn't been finalized.
}
...
builder.finish();
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's easy enough to add the nested insert(s), but I don't think I understand "the issue" you refer to, that the current code does not fix?
The new code doesn't insert a new object field's name to the metadata dictionary until the builder's finishmethod is called. You are correct, that a scenario like the above could leave extra/unused field names in the metadata dictionary (because the field names were added when the field finished inserting, but then the object that contains those fields was never actually added to its parent). However, spurious names in the metadata dictionary are harmless -- the spec does not require that every metadata entry be referenced as a field id in the corresponding value. In fact, the variant shredding spec requires the metadata dictionary to have "extra" field names, corresponding to fields that were shredded out. That way, a reader who wants to unshred the those fields back to normal variant values doesn't have to rewrite the metadata dictionary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm... I split the four inner/outer builder tests to not finish exactly one of outer or inner... and the outer not-finish cases all break with "Received empty bytes". So it looks like there IS some problem still.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ohh.... this relates to
If we didn't finish the outer builder, the top-level variant builder didn't contain any value at all, and the resulting value buffer is empty.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If so, this seems not resolve the issue @alamb raised at the beginning? Because I guess the case @alamb concern is like
let mut builder = VariantBuilder::new(); ... { let _object_builder = builder.new_object(); _object_builder.insert(xxx); _object_builder.insert(yyy); _object_builder.insert(zzz); // _object_builder doesn't been finalized. } ... builder.finish();
My initial concern was that it was too easy to forget to call finish and thus lose the structure
Along the way @scovich convinced me that this was acceptable (and actually preferrable) behavior rather than auto completing builders on drop
THe rationale is that there are other ways that drop gets called other than forgetting to call build -- for example an exit on error when using ? where the object probably shouldn't get dropped
I think after this PR the behavior makes much more sense and is now documented
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I don't argue that we should do finalization in drop. I meant, is this "no effect until finalized" only effective for builders which are added but are called "insert" yet?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I saw these tests are updated to include insert, looks like they are good now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new code doesn't insert a new object field's name to the metadata dictionary until the builder's finishmethod is called.
Doesn't it? I saw that insert already modify the metadata builder. It is happened before finish is called, right?
pub fn insert<'m, 'd, T: Into<Variant<'m, 'd>>>(&mut self, key: &str, value: T) {
let metadata_builder = self.parent_state.metadata_builder();
let field_id = metadata_builder.upsert_field_name(key); // <- it is inserted into the metadata builder
}The only difference to previous code, is check_pending_field is removed now so if it is not finished, the field won't be updated to fields and no effect to finish.
|
Hmm, when trying to rebase on top of #7843, I'm getting compilation failures such as: That's not such a nice side effect of the empty |
| pub fn insert<'m, 'd, T: Into<Variant<'m, 'd>>>(&mut self, key: &str, value: T) { | ||
| self.check_pending_field(); | ||
| // Get metadata_builder from parent state | ||
| let metadata_builder = self.parent_state.metadata_builder(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, so modification to metadata builder from a nested builder won't affect the parent builder if the nested builder doesn't finalize in the end ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems so, looks like the metadata builder doesn't affect the finalization too much. It is just used to provide unique field id.
|
Thanks again @scovich and @viirya and @friendlymatthew |
Which issue does this PR close?
ObjectBuilder::finalizeis not called, the resulting Variant object is malformed. #7863Rationale for this change
Reviews and testing on #7843 exposed the fact that creating a variant list or object builder has side effects that leave the parent in an inconsistent/invalid state if the child builder is never finalized. Rework the finalization logic to be more direct so that child builders have no effect on their parents before their
finishmethod is called.What changes are included in this PR?
ParentStateenum that tracks the necessary information for a child to fully finalize its parent.pendingmachinery from buildersAre these changes tested?
Existing unit tests mostly cover this change.
Added new tests to verify that failing to call
finishleaves the parent unmodified.Are there any user-facing changes?
No.