Fast utf8 validation when loading string view from parquet #6009
Conversation
I believe the CI failure is not from us...
```rust
// The implementation keeps a watermark `utf8_validation_begin` to track the
// beginning of the buffer that has not been validated yet.
// If the length is smaller than 128, we continue to the next string.
// If the length is 128 or larger, we validate the buffer before the length
// bytes and move the watermark to the beginning of the next string.
if len < 128 {
```
Hi, is there a reason to write the `if` case in this way?
I agree it's a bit awkward... I just want to place some comments under `if len < 128`.
Do you think it will look better to just have `if len >= 128`?
Yes, an empty `if` block makes me spend some time checking whether I missed something. But anyway, it's not a big issue. It's fine to keep as-is.
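To make the thread easier to follow outside the diff view, here is a minimal sketch of the deferred-validation scheme the quoted comments describe. It is not the PR's actual code: the function, the loop shape, and the `[len: u32 LE][data]` layout are my assumptions; only the `utf8_validation_begin` watermark and the `len < 128` threshold (including the deliberately empty `if` branch discussed above) come from the diff.

```rust
/// Hypothetical sketch of batched utf-8 validation over a plain-encoded
/// buffer laid out as [len: u32 LE][data][len: u32 LE][data]...
/// Assumes the buffer is well-formed (every length stays in bounds).
fn validate_deferred(buf: &[u8]) -> Result<(), std::str::Utf8Error> {
    // Watermark: start of the region that has not been validated yet.
    let mut utf8_validation_begin = 0;
    let mut offset = 0;
    while offset + 4 <= buf.len() {
        let len =
            u32::from_le_bytes(buf[offset..offset + 4].try_into().unwrap()) as usize;
        if len < 128 {
            // All four little-endian length bytes are < 128, i.e. ASCII and
            // therefore valid utf-8: leave them in the pending region and
            // defer validation to a later, bigger batch.
        } else {
            // The low length byte may be >= 128 and is not ASCII. Validate
            // everything accumulated so far in one shot, then move the
            // watermark past the length bytes.
            std::str::from_utf8(&buf[utf8_validation_begin..offset])?;
            utf8_validation_begin = offset + 4;
        }
        offset += 4 + len;
    }
    // Validate the final pending batch.
    std::str::from_utf8(&buf[utf8_validation_begin..])?;
    Ok(())
}

fn main() {
    // Two short strings: [len=2]["hi"][len=3]["abc"]
    let mut buf = Vec::new();
    buf.extend_from_slice(&2u32.to_le_bytes());
    buf.extend_from_slice(b"hi");
    buf.extend_from_slice(&3u32.to_le_bytes());
    buf.extend_from_slice(b"abc");
    assert!(validate_deferred(&buf).is_ok());
}
```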
Yep, docs build failed for #6008
```rust
// (1) Validating one 100-byte utf-8 string is much faster than validating
//     ten 10-byte utf-8 strings, potentially because of auto SIMD by the
//     compiler. Someone please confirm this :)
```
The Rust standard library at least used to have a lot of manual SIMD trickery to make UTF-8 validation fast.
```rust
// I.e., the validation cannot cover the whole buffer in one pass; instead,
// it validates the strings chunk by chunk.
//
// Given the above observations, the goal is to do batch validation as much
// as possible. The key idea is that if the length is smaller than 128 (99%
// of the cases), then the length bytes are valid utf-8 (because they are
// ASCII).
```
This is a very clever idea.
It isn't sufficient to just check that the string buffer is valid UTF-8; you must also validate that the offsets don't split a UTF-8 codepoint. Now it might be that the length-checking logic already does this in effect, as a UTF-8 continuation byte always has its high bit set (its value is at least 128), but at the very least we should justify this aspect.
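A quick property check of the UTF-8 encoding to back this up (my illustration, not from the PR): a continuation byte always matches the bit pattern `0b10xx_xxxx`, so its value is at least `0x80` (128), and a length byte below 128 can never be one.

```rust
/// True iff `b` is a UTF-8 continuation byte (bit pattern 0b10xx_xxxx).
fn is_continuation(b: u8) -> bool {
    (b & 0b1100_0000) == 0b1000_0000
}

fn main() {
    // No byte below 128 can continue a multi-byte sequence...
    assert!((0u8..128).all(|b| !is_continuation(b)));
    // ...and every continuation byte is in 0x80..=0xBF.
    assert!((0x80u8..=0xBF).all(is_continuation));
}
```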
I ran a tiny benchmark to show the speed of utf-8 validation versus the string length. With the same total number of bytes in the buffer, a bigger chunk size gives faster performance. I hope this backs the hypothesis of this PR. I'll probably write a small blog post with a more detailed analysis. Code:

```rust
use criterion::{black_box, criterion_group, criterion_main, BenchmarkId, Criterion};

/// Build an `n`-byte all-ASCII buffer.
fn make_string(n: usize) -> Vec<u8> {
    let mut s = Vec::with_capacity(n);
    for _ in 0..n {
        s.push(b'A');
    }
    s
}

fn validate_utf8(c: &mut Criterion) {
    let s = make_string(2048);
    for chunk in [8, 16, 32, 64, 128, 256, 512, 1024, 2048] {
        c.bench_with_input(
            BenchmarkId::new("validate 2048 byte buffer", chunk),
            &chunk,
            |b, c| {
                b.iter(|| {
                    // Validate the same 2048 bytes, `*c` bytes at a time.
                    for c in s.chunks(*c) {
                        black_box(std::str::from_utf8(c).unwrap());
                    }
                })
            },
        );
    }
}

criterion_group!(benches, validate_utf8);
criterion_main!(benches);
```
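To reproduce this locally (the setup details are my assumption, not stated in the comment): save the snippet as a criterion benchmark, e.g. `benches/validate_utf8.rs`, add `criterion` as a dev-dependency with a matching `[[bench]]` entry setting `harness = false`, and run `cargo bench`.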
I was thinking about constructing an invalid case. Since the lengths are little-endian, it would be possible for an unterminated UTF-8 sequence to be followed by a length byte which, in theory, could complete the utf8 sequence. However, as @tustvold points out, all multi-byte utf-8 values have a set high bit, so a length byte below 128 can never act as the missing continuation byte. I would be happy to make some ASCII art diagrams if that would help.
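In lieu of those diagrams, a small illustration (mine, not from the thread; the concrete bytes are arbitrary) of why a small little-endian length can never rescue an unterminated sequence:

```rust
fn main() {
    // "€" encodes as 0xE2 0x82 0xAC; pretend an offset split it after 0xE2,
    // leaving a dangling lead byte at the end of one string.
    let mut buf = vec![0xE2u8];
    // Next comes the little-endian length prefix of the following string,
    // here len = 5 (< 128), i.e. the bytes [0x05, 0x00, 0x00, 0x00].
    buf.extend_from_slice(&5u32.to_le_bytes());
    buf.extend_from_slice(b"hello");
    // 0x05 has its high bit clear, so it is not a continuation byte and the
    // dangling 0xE2 can never be "completed" by the length bytes: batch
    // validation over the whole region still reports the error.
    assert!(std::str::from_utf8(&buf).is_err());
}
```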
This is an amazing PR -- thank you @XiangpengHao
I think the mark of great code is when it is very simple, and the rationale well commented.
I do agree it would be nice to add some additional justification about how this will work even with data that splits the pages.
Maybe some fuzz testing / additional unit tests (e.g. tests that take a utf8 string and split it) might help too, but I am not sure it is required.
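A sketch of the kind of unit test suggested here (hypothetical, not part of the PR): split a utf-8 string at every byte index and check that prefix validation succeeds exactly when the cut lands on a char boundary.

```rust
#[test]
fn splitting_utf8_at_every_byte() {
    let s = "héllo, 世界"; // mixes 1-, 2-, and 3-byte codepoints
    let bytes = s.as_bytes();
    for i in 0..=bytes.len() {
        // A prefix is valid utf-8 iff the cut falls on a char boundary.
        let prefix_ok = std::str::from_utf8(&bytes[..i]).is_ok();
        assert_eq!(prefix_ok, s.is_char_boundary(i));
    }
}
```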
Thank you @mapleFU, @Xuanwo, and @tustvold for the reviews and comments.
Co-authored-by: Andrew Lamb <[email protected]>
🚀
Which issue does this PR close?
Closes #5995.
Rationale for this change
The current utf-8 validation for string view is very slow, even slower than reading StringArray from parquet, which copies and consolidates the strings into a new buffer.
It does not have to be slow, but making it fast requires a bit of art. I tried many approaches, and this PR is the simplest and easiest one to understand.
Please check the comments to see if they make sense (and help with understanding what's going on).
To run the benchmark:
We will get this:
Loading StringViewArray is not only 5x faster than the previous implementation, but also almost 2x faster than loading StringArray:
What changes are included in this PR?
The code change is quite small, just some small tweaks to adjust the timing of utf-8 validation.
Are there any user-facing changes?