Fast utf8 validation when loading string view from parquet #6009
Conversation
I believe the CI failure is not from us...
```rust
// The implementation keeps a watermark `utf8_validation_begin` to track the
// beginning of the buffer that has not been validated yet.
// If the length is smaller than 128, we continue to the next string.
// If the length is 128 or larger, we validate the buffer before the length
// bytes and move the watermark to the beginning of the next string.
if len < 128 {
```
Hi, is there a reason to write the `if` case in this way?
I agree it's a bit awkward... I just want to place some comments under `if len < 128`.
Do you think it will look better to just have `if len >= 128`?
Yes, an empty `if` block makes me spend some time checking whether I missed something. But anyway, it's not a big issue. It's fine to keep as-is.
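To make the thread easier to follow outside the diff view, here is a minimal sketch of the deferred-validation scheme the quoted comments describe. It is not the PR's actual code: the function, the loop shape, and the `[len: u32 LE][data]` layout are my assumptions; only the `utf8_validation_begin` watermark and the `len < 128` threshold (including the deliberately empty `if` branch discussed above) come from the diff.

```rust
/// Hypothetical sketch of batched utf-8 validation over a plain-encoded
/// buffer laid out as [len: u32 LE][data][len: u32 LE][data]...
/// Assumes the buffer is well-formed (every length stays in bounds).
fn validate_deferred(buf: &[u8]) -> Result<(), std::str::Utf8Error> {
    // Watermark: start of the region that has not been validated yet.
    let mut utf8_validation_begin = 0;
    let mut offset = 0;
    while offset + 4 <= buf.len() {
        let len =
            u32::from_le_bytes(buf[offset..offset + 4].try_into().unwrap()) as usize;
        if len < 128 {
            // All four little-endian length bytes are < 128, i.e. ASCII and
            // therefore valid utf-8: leave them in the pending region and
            // defer validation to a later, bigger batch.
        } else {
            // The low length byte may be >= 128 and is not ASCII. Validate
            // everything accumulated so far in one shot, then move the
            // watermark past the length bytes.
            std::str::from_utf8(&buf[utf8_validation_begin..offset])?;
            utf8_validation_begin = offset + 4;
        }
        offset += 4 + len;
    }
    // Validate the final pending batch.
    std::str::from_utf8(&buf[utf8_validation_begin..])?;
    Ok(())
}

fn main() {
    // Two short strings: [len=2]["hi"][len=3]["abc"]
    let mut buf = Vec::new();
    buf.extend_from_slice(&2u32.to_le_bytes());
    buf.extend_from_slice(b"hi");
    buf.extend_from_slice(&3u32.to_le_bytes());
    buf.extend_from_slice(b"abc");
    assert!(validate_deferred(&buf).is_ok());
}
```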
Yep, docs build failed for #6008
```rust
// (1) Validating one 100-byte utf-8 string is much faster than validating
//     ten 10-byte utf-8 strings, potentially because of auto SIMD by the
//     compiler. Someone please confirm this :)
```
The Rust standard library at least used to have a lot of manual SIMD trickery to make UTF-8 validation fast.
```rust
// I.e., the validation cannot cover the whole buffer in one pass; instead,
// it validates the strings chunk by chunk.
//
// Given the above observations, the goal is to do batch validation as much
// as possible. The key idea is that if the length is smaller than 128 (99%
// of the cases), then the length bytes are valid utf-8 (because they are
// ASCII).
```
This is a very clever idea.
It isn't sufficient to just check that the string buffer is valid UTF-8; you must also validate that the offsets don't split a UTF-8 codepoint. Now it might be that the length-checking logic already does this in effect, as a UTF-8 continuation byte always has its high bit set (its value is at least 128), but at the very least we should justify this aspect.
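A quick property check of the UTF-8 encoding to back this up (my illustration, not from the PR): a continuation byte always matches the bit pattern `0b10xx_xxxx`, so its value is at least `0x80` (128), and a length byte below 128 can never be one.

```rust
/// True iff `b` is a UTF-8 continuation byte (bit pattern 0b10xx_xxxx).
fn is_continuation(b: u8) -> bool {
    (b & 0b1100_0000) == 0b1000_0000
}

fn main() {
    // No byte below 128 can continue a multi-byte sequence...
    assert!((0u8..128).all(|b| !is_continuation(b)));
    // ...and every continuation byte is in 0x80..=0xBF.
    assert!((0x80u8..=0xBF).all(is_continuation));
}
```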
I ran a tiny benchmark to show the speed of utf-8 validation versus the string length. With the same total number of bytes in the buffer, a bigger chunk size gives faster performance. I hope this backs the hypothesis of this PR. I'll probably write a small blog post with a more detailed analysis. Code:

```rust
use criterion::{black_box, criterion_group, criterion_main, BenchmarkId, Criterion};

/// Build an `n`-byte all-ASCII buffer.
fn make_string(n: usize) -> Vec<u8> {
    let mut s = Vec::with_capacity(n);
    for _ in 0..n {
        s.push(b'A');
    }
    s
}

fn validate_utf8(c: &mut Criterion) {
    let s = make_string(2048);
    for chunk in [8, 16, 32, 64, 128, 256, 512, 1024, 2048] {
        c.bench_with_input(
            BenchmarkId::new("validate 2048 byte buffer", chunk),
            &chunk,
            |b, c| {
                b.iter(|| {
                    // Validate the same 2048 bytes, `*c` bytes at a time.
                    for c in s.chunks(*c) {
                        black_box(std::str::from_utf8(c).unwrap());
                    }
                })
            },
        );
    }
}

criterion_group!(benches, validate_utf8);
criterion_main!(benches);
```
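To reproduce this locally (the setup details are my assumption, not stated in the comment): save the snippet as a criterion benchmark, e.g. `benches/validate_utf8.rs`, add `criterion` as a dev-dependency with a matching `[[bench]]` entry setting `harness = false`, and run `cargo bench`.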
I was thinking about constructing an invalid case. Since the lengths are little-endian, it would be possible for an unterminated UTF-8 sequence to be followed by a length byte which, in theory, could complete the utf8 sequence. However, as @tustvold points out, all multi-byte utf-8 values have a set high bit, so a length byte below 128 can never act as the missing continuation byte. I would be happy to make some ASCII art diagrams if that would help.
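In lieu of those diagrams, a small illustration (mine, not from the thread; the concrete bytes are arbitrary) of why a small little-endian length can never rescue an unterminated sequence:

```rust
fn main() {
    // "€" encodes as 0xE2 0x82 0xAC; pretend an offset split it after 0xE2,
    // leaving a dangling lead byte at the end of one string.
    let mut buf = vec![0xE2u8];
    // Next comes the little-endian length prefix of the following string,
    // here len = 5 (< 128), i.e. the bytes [0x05, 0x00, 0x00, 0x00].
    buf.extend_from_slice(&5u32.to_le_bytes());
    buf.extend_from_slice(b"hello");
    // 0x05 has its high bit clear, so it is not a continuation byte and the
    // dangling 0xE2 can never be "completed" by the length bytes: batch
    // validation over the whole region still reports the error.
    assert!(std::str::from_utf8(&buf).is_err());
}
```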
This is an amazing PR -- thank you @XiangpengHao
I think the mark of great code is when it is very simple, and the rationale well commented.
I do agree it would be nice to add some additional justification about how this will work even with data that splits the pages.
Maybe some fuzz testing / additional unit tests (e.g. tests that take a utf8 string and split it) might help too, but I am not sure it is required.
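A sketch of the kind of unit test suggested here (hypothetical, not part of the PR): split a utf-8 string at every byte index and check that prefix validation succeeds exactly when the cut lands on a char boundary.

```rust
#[test]
fn splitting_utf8_at_every_byte() {
    let s = "héllo, 世界"; // mixes 1-, 2-, and 3-byte codepoints
    let bytes = s.as_bytes();
    for i in 0..=bytes.len() {
        // A prefix is valid utf-8 iff the cut falls on a char boundary.
        let prefix_ok = std::str::from_utf8(&bytes[..i]).is_ok();
        assert_eq!(prefix_ok, s.is_char_boundary(i));
    }
}
```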
Thank you @mapleFU, @Xuanwo, and @tustvold for the reviews and comments.
Co-authored-by: Andrew Lamb <[email protected]>
🚀
Which issue does this PR close?
Closes #5995.
Rationale for this change
The current utf-8 validation for string view is very slow, even slower than reading StringArray from parquet, which copies and consolidates the strings into a new buffer.
It does not have to be slow, but making it fast requires a bit of art. I tried many approaches, and this PR is the simplest and easiest one to understand.
Please check the comments to see if they make sense (and help with understanding what's going on).
To run the benchmark:
We will get this:
Loading StringViewArray is not only 5x faster than the previous implementation, but also almost 2x faster than loading StringArray:
What changes are included in this PR?
The code change is quite small, just some small tweaks to adjust the timing of utf-8 validation.
Are there any user-facing changes?