Conversation


@friendlymatthew friendlymatthew commented Jul 7, 2025

Rationale for this change

This PR contains algorithmic modifications to the validation logic and the associated benchmarks, specifically targeting complex object and list validation.

Previously, the approach involved iterating over each element and repeatedly fetching the same slice of the backing buffer, then slicing into that buffer again for each individual element. This led to redundant buffer access.

The new validation approach works in multiple passes that take advantage of the variant's memory layout. For example, dictionary field names are stored contiguously; instead of checking each field name for valid UTF-8 separately, we now validate the entire field-name buffer in a single pass.
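
As an illustration, here is a minimal sketch of the single-pass idea (the helper name, error text, and the arrow_schema::ArrowError import are assumptions for this example, not the exact code in the diff):

use arrow_schema::ArrowError;

// Sketch: validate every dictionary field name with one UTF-8 pass over the
// contiguous string region, then only slice the already-validated &str.
fn validate_field_names(string_region: &[u8], offsets: &[usize]) -> Result<(), ArrowError> {
    // Single UTF-8 check over the whole buffer instead of one per field name.
    let value_str = std::str::from_utf8(string_region)
        .map_err(|e| ArrowError::InvalidArgumentError(format!("invalid UTF-8: {e}")))?;

    // Per-field work is now just checked slicing (this also rejects offsets
    // that split a multi-byte character).
    for window in offsets.windows(2) {
        value_str
            .get(window[0]..window[1])
            .ok_or_else(|| ArrowError::InvalidArgumentError("invalid field-name offset".into()))?;
    }
    Ok(())
}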

The benchmark cases were adapted from the test_json_to_variant_object_very_large, test_json_to_variant_object_complex, and test_json_to_variant_array_nested_large test cases.

Compared to #7871, we observe a significant improvement in performance:

[benchmark comparison screenshot]

@scovich @alamb

@github-actions github-actions bot added the parquet (Changes to the parquet crate) label Jul 7, 2025
@friendlymatthew friendlymatthew force-pushed the friendlymatthew/meta-validation branch from f2ff2a3 to 1a9ff6c Compare July 7, 2025 16:02

friendlymatthew commented Jul 7, 2025

Note: this is still a POC. There's some code movement to do, but I figured it would be best to leave the diff like this to make it more readable.

But the core concept is fully fleshed out.

@friendlymatthew friendlymatthew force-pushed the friendlymatthew/meta-validation branch from 1a9ff6c to b6af863 Compare July 7, 2025 18:12

alamb commented Jul 7, 2025

BTW I ran the test from this ticket locally before this PR

Previously

        PASS [  21.886s] parquet-variant::test_json_to_variant test_json_to_variant_object_very_large
        PASS [  15.711s] parquet-variant::test_json_to_variant test_json_to_variant_object_very_large

With this PR

        PASS [   6.570s] parquet-variant::test_json_to_variant test_json_to_variant_object_very_large

Only 6.5 seconds 🤗 -- nice work

@scovich scovich left a comment

Very interesting ideas. Made a partial pass with some ideas for improved control flow.

Contributor

Is this faster and/or simpler than try_into from slice to array?

Comment on lines 241 to 234
Contributor

Why not Iterator::is_sorted_by?

let offsets_monotonically_increasing = offset_chunks
    .map(...)
    .is_sorted_by(...);

(again below)
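
For reference, a self-contained sketch of that shape (the inline chunk decoding stands in for map_bytes_to_offsets and assumes 1-4 byte offsets; Iterator::is_sorted_by is stable since Rust 1.82):

fn offsets_strictly_increasing(offset_buffer: &[u8], offset_size: usize) -> bool {
    offset_buffer
        .chunks_exact(offset_size) // assumes 1 <= offset_size <= 4
        .map(|chunk| {
            // Stand-in for map_bytes_to_offsets: widen little-endian bytes to u32.
            let mut le = [0u8; 4];
            le[..chunk.len()].copy_from_slice(chunk);
            u32::from_le_bytes(le)
        })
        .is_sorted_by(|a, b| a < b) // strictly increasing
}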

Contributor

Update: The code below uses the materialized offsets.

But for large objects, wouldn't it be cheaper (and simpler) to combine the checks instead of materializing the offset array? Most likely inside a helper method so the code can leverage ?:

    let mut offsets = offset_chunks.map(|chunk| match chunk {
        ...
    });
    let Some(mut prev_offset) = offsets.next().transpose()? else {
        return Ok(());
    };
    for offset in offsets {
        let offset = offset?;
        if prev_offset >= offset {
            return Err(... not monotonic ...);
        }

        let value_bytes = slice_from_slice(value_buffer, prev_offset..offset)?;
        prev_offset = offset;
        let _ = Variant::try_new_with_metadata(self.metadata, value_bytes)?;
    };
    Ok(())
}

Comment on lines 269 to 256
Contributor

As above, no need to materialize the iterator if we combine the checks into a single loop?
(but also see comment below)

Contributor

As above, I expect it will be straightforward to avoid materializing these iterators
(again below)

Contributor

I agree -- it seems unnecessary to allocate and copy all the values into a Vec here

Contributor Author

Hm, happy to change, but I think in this case the extra allocation isn’t hurting perf very much, and it makes the code a lot simpler to reason about because we can split up the following checks better.

I could inline calls to map_bytes_to_offsets everywhere we loop over field_ids, but that doesn’t seem much better than doing a single allocation

Contributor

I could inline calls to map_bytes_to_offsets everywhere we loop over field_ids, but that doesn’t seem much better than doing a single allocation

I only see one pass over field_ids? Two possible paths, but each is single-pass?
(the not-sorted-metadata case makes a second pass, but I'm pretty sure it's redundant and will never find a failure -- see other comment)

I think in this case the extra allocation isn’t hurting perf very much

Seems like something we could (should) quantify? For a not-small dictionary I would expect materialization cost to be quite noticeable. Anything big enough to fall out of L1 cache would likely see a significant slowdown due to a wave of cache misses on every pass. And that includes not just the offset vec itself (easy enough to prefetch), but also the bytes of the strings and values we examine.

and it makes the code a lot simpler to reason about

I don't think that's necessarily true for the metadata and list cases above (see other comments).

For this case, it would look something like:

let mut field_ids = map_bytes_to_offsets(field_id_buffer, self.header.field_id_size);
if let Some(mut field_id) = field_ids.next() {
    if self.metadata.is_sorted() {
        // Since the metadata dictionary has unique and sorted field names...
        for next_field_id in field_ids {
            if next_field_id <= field_id {
                return Err(... field names not sorted ...);
            }
            field_id = next_field_id;
        }
    } else {
        // The metadata dictionary can't guarantee uniqueness... 
        ... very similar to the metadata sortedness check: scan + validate_fallible_iterator ...
        ... probably could even factor out a helper function that both sites can call ...
    }
    if field_id >= self.metadata.dictionary_size() {
        return Err(... field id is not valid ...);
    }
}

@friendlymatthew friendlymatthew force-pushed the friendlymatthew/meta-validation branch from 7a581d1 to 1f461da Compare July 8, 2025 02:39
@friendlymatthew friendlymatthew marked this pull request as draft July 8, 2025 12:27
@friendlymatthew friendlymatthew force-pushed the friendlymatthew/meta-validation branch 4 times, most recently from 26399ea to 9628c83 Compare July 9, 2025 00:08
@friendlymatthew friendlymatthew marked this pull request as ready for review July 9, 2025 00:11
@friendlymatthew friendlymatthew force-pushed the friendlymatthew/meta-validation branch 3 times, most recently from 63fa322 to 646f8a4 Compare July 9, 2025 00:19
@friendlymatthew

Hi @viirya, I would appreciate your thoughts on this PR as well

@alamb alamb left a comment

Thanks @friendlymatthew -- this is making progress

I am not quite sure about the benchmarks

Also, have you had a chance to incorporate @scovich's comment in https://github.com/apache/arrow-rs/pull/7878/files#r2190963902?

Contributor

this comment seems incorrect

Contributor Author

Oof, that is what I get from lifting this code from test cases

Contributor

I ran these benchmarks locally

cargo bench --bench variant_validation

Here is some output:

Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 33.9s, or reduce sample count to 10.
Benchmarking bench_validate_large_object: Collecting 100 samples in estimated 33.859 s (100 iterations)

Contributor

I think a non-trivial amount of the time being measured by this benchmark is the creation of the JSON string...

@friendlymatthew friendlymatthew Jul 10, 2025

See comment

Member

Can we move the generation of the large object out of the benchmark, so we only benchmark validation?

Contributor

Yeah, I think that would be useful. I did some profiling and more than half the time of this benchmark is spent parsing JSON:

[profiling screenshot]

Contributor Author

Arg -- that is embarrassing. I will push up a fix soon

Contributor

I found these benchmarks somewhat confusing -- they are mostly testing json --> Variant conversion I think.

While those are useful benchmarks, they probably shouldn't be called variant_validation

So I suggest you split this benchmark up into two:

  1. parquet-variant/benches/variant_validation -- creates the metadata and value programmatically with VariantBuilder once and then benchmarks calling Variant::try_new on that (pre-created) metadata/value
  2. parquet-variant-json/benches/variant_json.rs -- creates a JSON string once and then benchmarks how fast calling json_to_variant is

Contributor Author

I found these benchmarks somewhat confusing -- they are mostly testing json --> Variant conversion I think.

While those are useful benchmarks, they probably shouldn't be called variant_validation

Hi, the following benchmarks use iter_batched which doesn't include the startup time (in this case, building up the variant). The benchmark runs should only involve the validation code.

Reading this issue, it seems like iter_batched_ref is a better function to use.

As for the json_to_variant, I could've built up the variant manually via the builder, but found the json example quite convenient. Since we moved json_to_variant to a separate crate, I had no choice but to move the validation benchmarks here.

But I agree, we should move this back to parquet-variant. It is a bit unfortunate, but we can manually build up the objects.
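
A rough sketch of that shape with iter_batched_ref (build_large_object is a hypothetical setup helper, and the exact Variant::try_new signature is assumed; this is not the PR's benchmark code):

use criterion::{criterion_group, criterion_main, BatchSize, Criterion};
use parquet_variant::Variant;

// Hypothetical setup helper: build the (metadata, value) buffers, e.g. with
// VariantBuilder. Construction happens outside the timed closure.
fn build_large_object() -> (Vec<u8>, Vec<u8>) {
    unimplemented!("construct metadata/value with VariantBuilder")
}

fn bench_validate_large_object(c: &mut Criterion) {
    c.bench_function("bench_validate_large_object", |b| {
        b.iter_batched_ref(
            build_large_object, // setup: not measured
            |buffers| {
                // measured: full validation only
                let (metadata, value) = (&buffers.0, &buffers.1);
                std::hint::black_box(Variant::try_new(metadata, value).unwrap());
            },
            BatchSize::LargeInput,
        )
    });
}

criterion_group!(benches, bench_validate_large_object);
criterion_main!(benches);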

Contributor

I wonder how much improvement the new crate adds?

One thing I thought was that if we create a Variant from JSON we already know all the values were UTF-8, so it makes sense to skip validating it again before reading.

Member

What is the difference between the currently used str::from_utf8 and this simdutf8::basic::from_utf8?

Contributor Author

Yeah, this isn't the bottleneck, so I'm fine with using str::from_utf8. I am a big fan of SIMD and am a sucker for potential vectorized operations.

Contributor

Parquet already has simdutf8 as an (optional) dependency, so I think it makes sense to use it if enabled.

simdutf8 = { version = "0.1.5", optional = true, default-features = false }
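
A small sketch of that optional-dependency pattern (the feature name mirrors the parquet crate's optional simdutf8 dependency; the error mapping here is illustrative):

use arrow_schema::ArrowError;

// Validate a byte slice as UTF-8, using simdutf8 when the feature is enabled.
// Note: simdutf8::basic::from_utf8 is the fast API that reports no error details.
#[cfg(feature = "simdutf8")]
fn validate_utf8(bytes: &[u8]) -> Result<&str, ArrowError> {
    simdutf8::basic::from_utf8(bytes)
        .map_err(|_| ArrowError::InvalidArgumentError("invalid UTF-8".to_string()))
}

#[cfg(not(feature = "simdutf8"))]
fn validate_utf8(bytes: &[u8]) -> Result<&str, ArrowError> {
    std::str::from_utf8(bytes)
        .map_err(|e| ArrowError::InvalidArgumentError(format!("invalid UTF-8: {e}")))
}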

Contributor

I agree -- it seems unnecessary to allocate and copy all the values into a Vec here

Contributor

I don't understand why this is faster than simply checking that all the field_ids are less than self.metadata.dictionary_size -- finding the max requires looking at all the items anyway

Contributor

it seems like there is a lot of duplicated code that does about the same thing: check that a slice of offsets are all

  1. (potentially) sorted
  2. less than some max offset
  3. pointing into a valid sub-variant

I wonder if it is possible to factor it all out into a function like

fn validate_offsets(offset_buffer: &[u8], num_offsets: usize, offset_size: OffsetSize, max_valid_offset: usize) {
...
}

Or something 🤔
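
For illustration, such a helper might look roughly like this (signature, names, and error messages are illustrative; offsets are assumed to be decoded already, and the per-range check is passed in by the caller):

use arrow_schema::ArrowError;

// Check a decoded offset sequence in one pass: monotonic, bounded by
// `max_valid_offset`, and each start..end range passes `validate_range`
// (e.g. "is a valid sub-variant" or "is a valid UTF-8 field name").
fn validate_offsets<F>(
    offsets: impl IntoIterator<Item = usize>,
    max_valid_offset: usize,
    mut validate_range: F,
) -> Result<(), ArrowError>
where
    F: FnMut(usize, usize) -> Result<(), ArrowError>,
{
    let mut iter = offsets.into_iter();
    let Some(mut prev) = iter.next() else {
        return Ok(());
    };
    for offset in iter {
        if offset < prev {
            return Err(ArrowError::InvalidArgumentError(
                "offsets are not monotonically increasing".to_string(),
            ));
        }
        if offset > max_valid_offset {
            return Err(ArrowError::InvalidArgumentError(
                "offset exceeds value buffer length".to_string(),
            ));
        }
        validate_range(prev, offset)?;
        prev = offset;
    }
    Ok(())
}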

@friendlymatthew friendlymatthew Jul 11, 2025

I plan on following up with a PR that removes this check since we can validate the monotonicity of offsets when accessing variants

Member

Hmm, is this a new check? Seems the existing VariantMetadata's with_full_validation doesn't have this check?

Contributor Author

Yeah, the current validation was written before we supported sorted dictionaries

Comment on lines 254 to 256
Contributor

nit:

Suggested change:

    - .map(|i| {
    -     let field_range = offsets[i - 1]..offsets[i];
    -     value_str.get(field_range)
    - })
    + .map(|i| value_str.get(offsets[i - 1]..offsets[i]))

@scovich scovich left a comment

I would prefer to not materialize id and offset arrays during validation -- there's no way it's free -- but maybe that can be a fast-follow PR?

Also -- some of the efficient iteration techniques the validation uses should probably be backported to the actual iterators?

Comment on lines +222 to +227
Contributor

We don't need this check -- the loop below does offsets[i-1]..offsets[i] for every i in 1..offsets.len(), and any non-monotonic offset would cause slice_from_slice to return an Err like:

Tried to extract byte(s) 42..25 from 100-byte buffer

Is it worth making an extra pass over the offsets (which requires materializing them) just to have a slightly nicer error message?

Contributor

The simplified version would be something like:

let mut offsets = map_bytes_to_offsets(offset_buffer, self.header.offset_size);
if let Some(mut start) = offsets.next() {
    for end in offsets {
          ... validate start..end ...
        start = end;
    }
}

Comment on lines +239 to +244
Contributor

Again, I'm not sure it's worth paying an extra pass for a monotonicity check, when the slicing operations that follow will naturally fail in the presence of non-monotonic offsets?

let mut offsets = map_bytes_to_offsets(offset_bytes, self.header.offset_size);
if let Some(first_offset) = offsets.next() {
    // Create an iterator over the strings. We must consume the iterator to validate it.
    let mut strings = offsets.scan(first_offset, |start, end| {
        let s = slice_from_slice(value_buffer, *start..end);
        *start = end;
        Some(s)
    });
    if self.header.is_sorted {
        // verify the strings are sorted and unique, in addition to being individually valid
        if let Some(mut a) = strings.next().transpose()? {
            for b in strings {
                let b = b?;
                if a >= b {
                    return Err(... dictionary values are not unique and ordered ...);
                }
                a = b;
            }
        }
    } else {
        // Just verify the strings are all individually valid
        validate_fallible_iterator(strings)?;
    }
}

This time, an iterator scan works well because we naturally unpack the errors later.

Note: The above does require making slice_from_slice fully generic:

pub(crate) fn slice_from_slice<T, I: SliceIndex<[T]> + Clone + Debug>(
    bytes: &[T],
    index: I,
) -> Result<&I::Output, ArrowError> {
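
For reference, a hedged completion of that generic signature could look like the following (the error text mirrors the message quoted earlier in this review; this is a sketch, not necessarily the crate's exact implementation):

use std::fmt::Debug;
use std::slice::SliceIndex;

use arrow_schema::ArrowError;

pub(crate) fn slice_from_slice<T, I: SliceIndex<[T]> + Clone + Debug>(
    bytes: &[T],
    index: I,
) -> Result<&I::Output, ArrowError> {
    // Clone the index so it can still be reported if the lookup fails.
    bytes.get(index.clone()).ok_or_else(|| {
        ArrowError::InvalidArgumentError(format!(
            "Tried to extract byte(s) {index:?} from {}-byte buffer",
            bytes.len(),
        ))
    })
}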

Comment on lines 202 to 203
Contributor

Needs a doc comment explaining what it does and the conditions under which it's panic-free?

Contributor Author

Sounds good. I took a look at the codegen; the compiler optimizes out the bounds checks, so it cannot panic under any circumstance:

[codegen screenshot]

Contributor

rustc optimizes and specializes super aggressively -- the specialization you show is for a single-element iterator -- it doesn't even have a loop. But agree that using chunks (rather than chunks_exact) means it shouldn't be able to panic.

Contributor Author

Oof, the screenshot should've shown the codegen for chunks_exact.

chunks_exact would never panic. If the slice length we want to chunk over is not evenly divided by the chunk size, it omits the remainder elements, which can be retrieved by calling remainder() on the iterator.

More formally, ChunksExact will yield array.len() / chunk_length chunks, and ChunksExact::remainder will return a slice with array.len() % chunk_length elements.

chunks would risk a panic if the slice length is not evenly divided by the chunk size, since the remainder is directly appended as the last slice in the iterator.
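
A tiny standard-library demonstration of that behavior (illustrative values only):

fn main() {
    let bytes = [1u8, 2, 3, 4, 5, 6, 7]; // 7 bytes, chunk size 3
    let mut chunks = bytes.chunks_exact(3);

    // Yields 7 / 3 = 2 full chunks; every yielded chunk has exactly 3 elements.
    assert_eq!(chunks.next(), Some(&[1u8, 2, 3][..]));
    assert_eq!(chunks.next(), Some(&[4u8, 5, 6][..]));
    assert_eq!(chunks.next(), None);

    // The 7 % 3 = 1 leftover byte is only reachable via remainder().
    assert_eq!(chunks.remainder(), &[7u8]);

    // chunks(), in contrast, appends the short remainder as the final slice,
    // so downstream code assuming a fixed chunk length could panic on it.
    assert_eq!(bytes.chunks(3).last(), Some(&[7u8][..]));
}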

Comment on lines +222 to +223
Contributor

clever!

Comment on lines +254 to +265
Contributor

We don't need this check. If any field id were out of bounds, the collect at L245 would have already failed.

Contributor

Shallow validation already ensured the last offset is in bounds, correct?

Comment on lines +282 to +283
Contributor

does this not work for some reason?

Suggested change:

    - Ok::<_, ArrowError>(())
    + Ok(())

(try_for_each does seem to have a pretty weird collect-like signature, but I would have hoped that the previous ? would have fixed the error type...)


@viirya viirya Jul 10, 2025

Hmm, if there are values like ..., None, Some(a), None, Some(b)..., is it possible that an unordered case like a >= b cannot be detected?

Member

One question: if is_sorted is false, does it mean the dictionary cannot guarantee either sortedness or uniqueness? If so, is it still a "dictionary"?

And if so, the name is_sorted is also confusing, as it doesn't convey only sortedness.

Member

You're right, but all this naming comes from the variant spec 🤷

Contributor Author

Hi @viirya, good question. From the specification:

If sorted_strings is set to 1, strings in the dictionary must be unique and sorted in lexicographic order. If the value is set to 0, readers may not make any assumptions about string order or uniqueness.

Member

One question: do field ids start from 0, and are all field ids contiguous?

Contributor Author

Yes, per the spec:

A field_id is an index into the dictionary in the metadata

Member

Btw, the existing full validation of VariantObject doesn't check whether its field names are sorted. Was that check missing previously?

Contributor

I think you're right. The old validation was only ensuring individual field names were valid, without comparing them for sortedness.

Member

Field-name sortedness is different from field-id sortedness, right? We can still check whether field_ids is sorted. If it is sorted, we can just check the last field id?


alamb commented Jul 10, 2025

I have sort of lost track of the current state of this PR. Is it something we want to merge and then do follow ons? If so, can we file some tickets to track those follow ons? Or do we want to keep working on this PR?

I would like to improve the benchmarks to more cleanly separate the validation benchmarks from JSON parsing benchmarks.


friendlymatthew commented Jul 10, 2025

I have sort of lost track of the current state of this PR. Is it something we want to merge and then do follow ons? If so, can we file some tickets to track those follow ons? Or do we want to keep working on this PR?

I would like to improve the benchmarks to more cleanly separate the validation benchmarks from JSON parsing benchmarks.

Hi, I would like to spend a bit more time on this. @scovich raised some good points about redundant checks which will remove the need to collect offsets.

Here is my checklist for this PR:

  • avoid materializing offsets + remove redundant checks
  • better documentation
  • fix the validation benchmarks and move them back to parquet-variant
  • make the offset validation DRY by moving it to a freestanding method

The simdutf8 stuff can be incorporated as a follow-up PR.


scovich commented Jul 10, 2025

avoid materializing offsets + remove redundant checks

Honestly, this one feels like a good fast-follow, rather than something that would block the merge? Unless it's super easy/fast to clean up. Perfection as enemy of good and all that. This PR is already a drastic improvement over the current validation. And as @viirya pointed out, this PR adds some validations that were just plain missing before.

Also, as a separate PR we could hopefully even see benchmark results quantifying the impact of the change.

@friendlymatthew

avoid materializing offsets + remove redundant checks

Honestly, this one feels like a good fast-follow, rather than something that would block the merge? Unless it's super easy/fast to clean up. Perfection as enemy of good and all that. This PR is already a drastic improvement over the current validation. And as @viirya pointed out, this PR adds some validations that were just plain missing before.

Also, as a separate PR we could hopefully even see benchmark results quantifying the impact of the change.

Sounds good. I'll get the benchmarks cleaned up and kick the rest over for another PR

@friendlymatthew friendlymatthew force-pushed the friendlymatthew/meta-validation branch 2 times, most recently from 2fc9ed9 to 97610fc Compare July 11, 2025 01:10
@friendlymatthew friendlymatthew force-pushed the friendlymatthew/meta-validation branch from 97610fc to d7b2596 Compare July 11, 2025 01:12

alamb commented Jul 11, 2025

Looks like there is one final CI check to fix and then I can merge this PR in https://github.com/apache/arrow-rs/actions/runs/16209770665/job/45767524794?pr=7878

I can't push to your PR directly because I don't have permissions in the pydantic fork

If you make PRs from your own fork in the future, I can make changes directly if you think that will help

@alamb alamb left a comment

Thanks again @friendlymatthew -- I think this is an improvement and it gets us set up for more improvements (the benchmark, etc)

Thus I think we should merge it as soon as we can get the CI to pass

Thanks again @scovich @viirya and @Dandandan for all the help

@friendlymatthew

Looks like there is one final CI check to fix and then I can merge this PR in https://github.com/apache/arrow-rs/actions/runs/16209770665/job/45767524794?pr=7878

I can't push to your PR directly because I don't have permissions in the pydantic fork

If you make PRs from your own fork in the future, I can make changes directly if you think that will help

Hi, I've been trying to figure out why CI is failing. Looking at the warnings, a lot of them are coming from the parquet crate.

The warnings related to parquet-variant come from builder.rs, which this PR doesn't touch.

Fwiw, running cargo doc on main also seems to trigger different warnings.

@friendlymatthew

For example, the latest CI run displays errors like:

Documenting arrow-ord v55.2.0 (/__w/arrow-rs/arrow-rs/arrow-ord)
error: unnecessary parentheses around closure body
--> arrow-ord/src/cmp.rs:276:43
|
276 | let c = |((l, r), n)| ((l ^ r) | (l & r & n));
| ^ ^
|
= note: -D unused-parens implied by -D warnings
= help: to override -D warnings add #[allow(unused_parens)]
help: remove these parentheses
|
276 - let c = |((l, r), n)| ((l ^ r) | (l & r & n));
276 + let c = |((l, r), n)| (l ^ r) | (l & r & n);
|

error: could not document arrow-ord


alamb commented Jul 11, 2025

For example, the latest CI run displays errors like:

Documenting arrow-ord v55.2.0 (/__w/arrow-rs/arrow-rs/arrow-ord)
error: unnecessary parentheses around closure body
--> arrow-ord/src/cmp.rs:276:43
|
276 | let c = |((l, r), n)| ((l ^ r) | (l & r & n));
| ^ ^
|
= note: -D unused-parens implied by -D warnings
= help: to override -D warnings add #[allow(unused_parens)]
help: remove these parentheses
|
276 - let c = |((l, r), n)| ((l ^ r) | (l & r & n));
276 + let c = |((l, r), n)| (l ^ r) | (l & r & n);
|

error: could not document arrow-ord

Looks to me like the latest nightly Rust got more strict, and for some reason we build docs with nightly in this crate

So to reproduce locally I bet you can do

rustup toolchain install nightly
cargo +nightly doc  --document-private-items --no-deps --workspace --all-features

So in other words it is not related to your PR and CI would likely fail on main if we re-ran it


alamb commented Jul 11, 2025

@viirya fixed the docs CI failure in this PR ❤️

So I am merging this PR in to keep the code flowing

@alamb alamb merged commit b2583b0 into apache:main Jul 11, 2025
12 of 13 checks passed

alamb commented Jul 11, 2025

Thanks again. @friendlymatthew can you make sure any follow-on tasks we identified are properly tracked with tickets?

Thank you again @scovich @viirya and @Dandandan -- this one took a bit but I think we are on the right track

Successfully merging this pull request may close these issues.

  • [Variant] test_json_to_variant_object_very_large takes over 20s
  • [Variant] Converting variant to JSON string seems slow
