
Fast utf8 validation when loading string view from parquet #6009

Merged (3 commits) on Jul 8, 2024

Conversation

XiangpengHao (Contributor)

Which issue does this PR close?

Closes #5995.

Rationale for this change

The current UTF-8 validation for string views is very slow, even slower than reading StringArray from Parquet, which copies and consolidates the strings into a new buffer.

It does not have to be slow, but making it fast requires a bit of art. I tried many approaches, and this PR is the simplest and easiest to understand.

Please check the comments to see if they make sense (and help in understanding what's going on).

To run the benchmark:

cargo bench --bench arrow_reader --features="arrow test_common experimental" "arrow_array_reader/String.*Array/plain"

We will get this:

arrow_array_reader/StringViewArray/plain encoded, mandatory, no NULLs
                        time:   [287.81 µs 288.21 µs 288.66 µs]
                        change: [-84.308% -84.292% -84.273%] (p = 0.00 < 0.05)
                        Performance has improved.

arrow_array_reader/StringViewArray/plain encoded, optional, no NULLs
                        time:   [289.25 µs 289.60 µs 289.95 µs]
                        change: [-84.303% -84.286% -84.268%] (p = 0.00 < 0.05)
                        Performance has improved.

arrow_array_reader/StringViewArray/plain encoded, optional, half NULLs
                        time:   [207.22 µs 207.39 µs 207.54 µs]
                        change: [-78.958% -78.940% -78.924%] (p = 0.00 < 0.05)
                        Performance has improved.

Not only is loading StringViewArray 5x faster than the previous implementation, it is also almost 2x faster than loading StringArray:

arrow_array_reader/StringArray/plain encoded, mandatory, no NULLs
                        time:   [345.03 µs 345.54 µs 346.20 µs]

arrow_array_reader/StringArray/plain encoded, optional, no NULLs
                        time:   [366.46 µs 368.51 µs 370.30 µs]

arrow_array_reader/StringArray/plain encoded, optional, half NULLs
                        time:   [427.10 µs 427.21 µs 427.32 µs]

What changes are included in this PR?

The code change is quite small: just a few tweaks to adjust the timing of UTF-8 validation.

Are there any user-facing changes?

github-actions bot added the parquet label (Changes to the parquet crate) on Jul 5, 2024
XiangpengHao (Contributor, Author):

I believe the CI failure is not from us...

// The implementation keeps a watermark `utf8_validation_begin` to track the beginning of the buffer region that has not yet been validated.
// If the length is smaller than 128, we continue to the next string.
// If the length is 128 or larger, we validate the buffer up to the length bytes and move the watermark to the beginning of the next string.
if len < 128 {
Member:

Hi, is there a reason to write the if statement this way?

XiangpengHao (Author):

I agree it's a bit awkward... I just wanted to place some comments under if len < 128.
Do you think it would look better to just have if len >= 128?

Member:

Yes, an empty if block makes me spend some time checking if I missed something. But anyway, it's not a big issue. It's fine to keep as-is.
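For context, the batched validation under discussion might be sketched as follows. This is a hypothetical simplification, not the PR's actual code: it assumes a plain-encoded Parquet byte buffer where each string is prefixed by a 4-byte little-endian length, and the function name `validate_plain_buffer` is illustrative.

```rust
// Illustrative sketch of watermark-based batched UTF-8 validation.
// Assumption: buf = [len0 (4 bytes LE)][str0][len1 (4 bytes LE)][str1]...
fn validate_plain_buffer(buf: &[u8]) -> Result<(), std::str::Utf8Error> {
    let mut offset = 0;
    // Watermark: start of the region that has not been validated yet.
    let mut utf8_validation_begin = 0;
    while offset + 4 <= buf.len() {
        let len =
            u32::from_le_bytes(buf[offset..offset + 4].try_into().unwrap()) as usize;
        if len >= 128 {
            // One of the four length bytes may not be ASCII, so flush:
            // validate everything before this length prefix in one pass...
            std::str::from_utf8(&buf[utf8_validation_begin..offset])?;
            // ...and move the watermark past the length bytes.
            utf8_validation_begin = offset + 4;
        }
        // For len < 128, all four little-endian length bytes are < 0x80
        // (ASCII), so they can be swept up in a later batch validation.
        offset += 4 + len;
    }
    // Validate the remaining tail in one pass.
    std::str::from_utf8(&buf[utf8_validation_begin..])?;
    Ok(())
}

fn main() {
    let mut buf = Vec::new();
    buf.extend_from_slice(&5u32.to_le_bytes());
    buf.extend_from_slice(b"hello");
    assert!(validate_plain_buffer(&buf).is_ok());
}
```

The common case (short strings) never calls the validator per string; validation runs over long contiguous spans, which is exactly the property the benchmark below measures.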

Xuanwo (Member) commented Jul 5, 2024:

I believe the CI failure is not from us...

Yep, docs build failed for #6008

Comment on lines 339 to 340
// (1) Validating one 100-byte chunk of UTF-8 is much faster than validating ten 10-byte chunks.
// Potentially because of compiler auto-vectorization (SIMD); someone please confirm this :)
Contributor:

The Rust standard library, at least, used to have a lot of manual SIMD trickery to make UTF-8 validation fast.

// I.e., the validation cannot check the buffer in one pass; instead, it validates the strings chunk by chunk.
//
// Given the above observations, the goal is to do batch validation as much as possible.
// The key idea is that if the length is smaller than 128 (99% of cases), then the length bytes are valid UTF-8, because they are ASCII.
Contributor:

This is a very clever idea
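The "length bytes are ASCII" observation can be spelled out mechanically. This snippet is illustrative, not from the PR:

```rust
fn main() {
    // For any length below 128, all four little-endian length bytes are
    // below 0x80, i.e. plain ASCII, and therefore valid UTF-8 on their own.
    for len in 0u32..128 {
        assert!(len.to_le_bytes().iter().all(|&b| b < 0x80));
    }
    // Once len >= 128, the low byte can have its top bit set, so the
    // length prefix can no longer be assumed to be valid UTF-8 by itself.
    assert_eq!(200u32.to_le_bytes()[0], 200); // 200 >= 0x80
}
```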

tustvold (Contributor) left a comment:

It isn't sufficient to just check that the string buffer is valid UTF-8; you must also validate that the offsets don't split a UTF-8 codepoint. Now, it might be that the length-checking logic already does this in effect, as a UTF-8 continuation byte is ≥ 128, but at the very least we should justify this aspect.

XiangpengHao (Author):

I ran a tiny benchmark to show the speed of UTF-8 validation versus the chunk size. With the same total number of bytes in the buffer, a bigger chunk size gives faster performance. I hope this backs the hypothesis of this PR. I'll probably write a small blog post with a more detailed analysis.

Code:

use criterion::{black_box, criterion_group, criterion_main, BenchmarkId, Criterion};

fn make_string(n: usize) -> Vec<u8> {
    // A buffer of n ASCII 'A' bytes.
    vec![b'A'; n]
}

fn validate_utf8(c: &mut Criterion) {
    let s = make_string(2048);
    for chunk in [8, 16, 32, 64, 128, 256, 512, 1024, 2048] {
        c.bench_with_input(
            BenchmarkId::new("validate 2048 byte buffer", chunk),
            &chunk,
            |b, c| {
                b.iter(|| {
                    for c in s.chunks(*c) {
                        black_box(std::str::from_utf8(c).unwrap());
                    }
                })
            },
        );
    }
}

criterion_group!(benches, validate_utf8);
criterion_main!(benches);

Result: (plot omitted; with the total byte count fixed, larger chunk sizes validate markedly faster.)


alamb (Contributor) commented Jul 6, 2024:

It isn't sufficient to just check that the string buffer is valid UTF-8; you must also validate that the offsets don't split a UTF-8 codepoint. Now, it might be that the length-checking logic already does this in effect, as a UTF-8 continuation byte is ≥ 128, but at the very least we should justify this aspect.

I was thinking about constructing an invalid case.

Since the lengths are little-endian, it would be possible for an unterminated UTF-8 sequence to be followed by a length byte which, in theory, could complete the UTF-8 sequence.

However, as @tustvold points out, all bytes of a multi-byte UTF-8 sequence have a 1 in their top bit (and thus a u8 representation of at least 128), so in this case the length byte cannot be misinterpreted as part of the multi-byte sequence.

I would be happy to make some ascii art diagrams if that would help
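That argument can be checked directly. A small illustrative snippet (not from the PR): every byte of a multi-byte UTF-8 sequence, lead or continuation, has its top bit set, so an ASCII length byte can never complete a truncated sequence.

```rust
fn main() {
    // Every byte of a multi-byte UTF-8 encoding has its top bit set:
    // lead bytes are 0xC2..=0xF4 and continuation bytes are 0x80..=0xBF.
    for ch in ['é', '€', '😀'] {
        let mut bytes = [0u8; 4];
        let encoded = ch.encode_utf8(&mut bytes);
        assert!(encoded.bytes().all(|b| b >= 0x80));
    }
    // So an ASCII length byte (< 0x80) can never be read as the missing
    // tail of a sequence: 0xE2 0x82 are the first two bytes of '€'
    // (0xE2 0x82 0xAC), and following them with an ASCII byte is invalid.
    assert!(std::str::from_utf8(&[0xE2, 0x82, b'A']).is_err());
}
```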

alamb (Contributor) left a comment:

This is an amazing PR -- thank you @XiangpengHao

I think the mark of great code is when it is very simple, and the rationale well commented.

I do agree it would be nice to add some additional justification about how this will work even with data that splits the pages.

Maybe some fuzz testing / additional unit tests (e.g. ones that take a UTF-8 string and split it) might help too, but I am not sure it is required.

Thank you @mapleFU @Xuanwo and @tustvold for the reviews and comments

parquet/src/arrow/array_reader/byte_view_array.rs (outdated review thread, resolved)
alamb (Contributor) commented Jul 8, 2024:

🚀

Linked issue: Fast UTF-8 validation when reading StringViewArray from Parquet