Improve performance of constructing ByteViews for small strings #6034

Closed
Tracked by #5374
alamb opened this issue Jul 10, 2024 · 15 comments · Fixed by #6102
Labels
arrow (Changes to the arrow crate), enhancement (Any new improvement worthy of an entry in the changelog), parquet (Changes to the parquet crate)

Comments

@alamb
Contributor

alamb commented Jul 10, 2024

Part of #5374

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Note: This is copied from the wonderful description on #6031 from @XiangpengHao and @aoli-al

In our benchmark, every string is larger than 12 bytes, which means that when making the views, we always fall into this branch. What is special about this branch is that it only reads the first 4 bytes of the string, and LLVM is smart enough to inline loading the values (i.e., it avoids calling into copy_non_overlapping).

Is that a big deal?

Unfortunately, yes. If you change the benchmark string to something very small, e.g., "test", you should be able to see (using the same benchmark script as above) that the performance of loading the ViewArray drops significantly, which doesn't make sense -- how could building StringViewArrays with shorter strings be so much slower than building them with longer strings?

The root cause is that when the string is short, we need to memcpy the bytes into the view. Since the compiler has no idea how long the string is, it cannot inline the copy and has to call memcpy, which is slow.
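
For context, here is a minimal sketch of what the short-string path amounts to before this change (a simplified illustration, not the exact upstream code): the copy length is data.len(), a value only known at runtime, so the compiler cannot specialize the copy and emits a call to memcpy.

/// Sketch of the pre-fix short-string path (simplified; not the exact
/// upstream code). The copy length is a runtime value, so the compiler
/// falls back to a memcpy call.
fn make_view_short(data: &[u8]) -> u128 {
    debug_assert!(data.len() <= 12);
    let mut view_buffer = [0; 16];
    view_buffer[0..4].copy_from_slice(&(data.len() as u32).to_le_bytes());
    // `data.len()` is not a compile-time constant, so this copy lowers to memcpy
    view_buffer[4..4 + data.len()].copy_from_slice(data);
    u128::from_le_bytes(view_buffer)
}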

Describe the solution you'd like

I would like to further improve the performance of StringView creation.

Describe alternatives you've considered

Here's a specialized variant of make_view: you can replace this function with the following new implementation, and the performance improves.

This new implementation makes 13 copies of make_view_inner; each copy is specialized for a length known to the compiler (LEN is const), so the compiler can optimize each implementation as needed. Looking at the assembly code, there is indeed no call to memcpy.

/// Creates a view from a fixed length input (the compiler can generate
/// specialized code for this)
fn make_view_inner<const LEN: usize>(data: &[u8]) -> u128 {
    let mut view_buffer = [0; 16];
    view_buffer[0..4].copy_from_slice(&(LEN as u32).to_le_bytes());
    view_buffer[4..4 + LEN].copy_from_slice(&data[..LEN]);
    u128::from_le_bytes(view_buffer)
}

/// Create a view based on the given data, block id and offset
#[inline(always)]
pub fn make_view(data: &[u8], block_id: u32, offset: u32) -> u128 {
    let len = data.len() as u32;
    // Generate specialized code for each potential small string length
    // to improve performance
    match len {
        0 => make_view_inner::<0>(data),
        1 => make_view_inner::<1>(data),
        2 => make_view_inner::<2>(data),
        3 => make_view_inner::<3>(data),
        4 => make_view_inner::<4>(data),
        5 => make_view_inner::<5>(data),
        6 => make_view_inner::<6>(data),
        7 => make_view_inner::<7>(data),
        8 => make_view_inner::<8>(data),
        9 => make_view_inner::<9>(data),
        10 => make_view_inner::<10>(data),
        11 => make_view_inner::<11>(data),
        12 => make_view_inner::<12>(data),
        _ => {
            let view = ByteView {
                length: len,
                prefix: u32::from_le_bytes(data[0..4].try_into().unwrap()),
                buffer_index: block_id,
                offset,
            };
            view.into()
        }
    }
}

(Special thanks to @aoli-al for triangulating the root cause and prototyping the fix.)

While the above is somewhat non-intuitive, if it improves performance and is sufficiently commented, it would be a good thing to add.

Additional context

@alamb
Contributor Author

alamb commented Jul 10, 2024

For anyone of the compiler persuasion, I think the code above is basically a form of loop unrolling (the loop is in memcpy) for certain constants that we know will be hit
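
For illustration, substituting LEN = 4 into make_view_inner gives the compiler a body like the sketch below; both copy sizes are compile-time constants, so they can be lowered to fixed-width loads/stores instead of a memcpy call (a hand-written sketch, not actual compiler output).

// What make_view_inner::<4> effectively monomorphizes to: both copies have
// compile-time-known sizes, so no memcpy call is needed.
fn make_view_inner_4(data: &[u8]) -> u128 {
    let mut view_buffer = [0u8; 16];
    view_buffer[0..4].copy_from_slice(&4u32.to_le_bytes());
    view_buffer[4..8].copy_from_slice(&data[..4]);
    u128::from_le_bytes(view_buffer)
}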

@jhorstmann
Contributor

Another pattern that works well for small copies and generates much smaller code than full specialization is divide and conquer: copy as large a chunk as possible, followed by successively smaller chunks. It would look something like this in a simplified example:

pub fn make_view2(mut data: &[u8], block_id: u32, offset: u32) -> u128 {
    let len = data.len() as u32;
    let mut tmp = [0_u8; 16];
    tmp[..4].copy_from_slice(&len.to_le_bytes());
    let mut i = 4_usize;

    if data.len() >= 8 {
        tmp[i..i+8].copy_from_slice(&data[..8]);
        i += 8;
        data = &data[8..];
    }
    if data.len() >= 4 {
        // unsafe { std::intrinsics::assume(i+4 <= tmp.len()); }
        tmp[i..i+4].copy_from_slice(&data[..4]);
        i += 4;
        data = &data[4..];
    }
    if data.len() >= 2 {
        // unsafe { std::intrinsics::assume(i+2 <= tmp.len()); }
        tmp[i..i+2].copy_from_slice(&data[..2]);
        i += 2;
        data = &data[2..];
    }
    if data.len() >= 1 {
        // unsafe { std::intrinsics::assume(i < tmp.len()); }
        tmp[i] = data[0];
        i += 1;
        data = &data[1..];
    }

    u128::from_le_bytes(tmp)
}

This also avoids bounds checks on data, but unfortunately not on the last accesses to the tmp array.
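
For reference, the commented-out assume hints above could likely be expressed with std::hint::assert_unchecked, which was stabilized in later Rust releases; a minimal sketch of the idea (hypothetical helper, not a drop-in change):

// Sketch only: telling the optimizer the index is in range so it can drop
// the bounds check. The caller must guarantee the condition, otherwise this
// is undefined behavior.
fn copy_chunk4(tmp: &mut [u8; 16], data: &[u8], i: usize) {
    // SAFETY: callers only pass i <= 12 here, so i + 4 <= tmp.len()
    unsafe { std::hint::assert_unchecked(i + 4 <= tmp.len()) };
    tmp[i..i + 4].copy_from_slice(&data[..4]);
}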

@XiangpengHao
Contributor

XiangpengHao commented Jul 22, 2024

Update here: I benchmarked @jhorstmann's implementation through #6101, but the performance did not improve much. I dug slightly deeper into this and convinced myself that the overhead is the function call to ptr::copy_non_overlapping.
An experiment with a test string length of 1 shows the performance is still very slow, indicating that loop unrolling is not the bottleneck here.

So I filed #6102 with the original implementation (with minor tweaks).

@mapleFU
Member

mapleFU commented Jul 24, 2024

I was wondering whether this problem is due to some optimization or another reason, so I tested some similar code in C++: https://godbolt.org/z/czePsTofc. I found that makeInline2 works better when strings have the same length, but when the string length changes frequently it runs slower.

@XiangpengHao
Contributor

but when the string length changes frequently it runs slower.

Can you elaborate a bit more on this? @mapleFU

My understanding is that some string lengths are fast, e.g., 4 and 8, because they align with the register width. Some string lengths need more instructions, e.g., 7 and 11, because they need to load more data and then shift to clear out the unused bits.
Because of that, testing with fixed 4-byte strings can be quite a bit faster than testing with varying lengths.
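
A minimal sketch of how one might check that, assuming the make_view from the issue description (and its ByteView dependency) is in scope; this is a hypothetical harness, not the benchmark actually used:

// Crude timing comparison: build views from fixed-length vs. mixed-length
// inputs. black_box keeps the compiler from constant-folding the inputs away.
use std::hint::black_box;
use std::time::Instant;

fn bench(label: &str, inputs: &[&[u8]]) {
    let start = Instant::now();
    let mut acc = 0u128;
    for _ in 0..1_000_000 {
        for &data in inputs {
            acc ^= make_view(black_box(data), 0, 0);
        }
    }
    println!("{label}: {:?} (checksum {acc})", start.elapsed());
}

fn main() {
    bench("fixed 4-byte lengths", &[b"abcd", b"efgh", b"ijkl"]);
    bench("mixed lengths", &[b"a", b"abcdefg", b"hello world!"]);
}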

@mapleFU
Member

mapleFU commented Jul 24, 2024

https://godbolt.org/z/czePsTofc: makeInline2 generates a lot of branching via the switch (which helps if all strings tend to have the same length), whereas the previous version compiles to memset & memcpy:

        call    memset@PLT
        mov     rdi, r14
        mov     rsi, rbx
        mov     rdx, r15
        call    memcpy@PLT

On the other side, makeInline2 generates a different code path depending on the width. For example:

.LBB1_2:
        movzx   esi, byte ptr [rdi]
        xor     edx, edx
        shl     rsi, 32
        or      rax, rsi
        ret
.LBB1_3:
        movzx   esi, word ptr [rdi]
        xor     edx, edx
        shl     rsi, 32
        or      rax, rsi
        ret

If branch prediction works well, the same code path is taken again and again.

some string lengths are fast, e.g., 4 and 8, because they align with the register width.

AFAIK, if the same instruction is used and the access does not cross a cache line, a memory-copy instruction should not suffer from unaligned access? (Maybe I'm wrong.)

Updated: with the help of a compiler assume, the short makeInline might generate efficient code without memcpy, but clang doesn't apply that optimization: https://godbolt.org/z/YM58jhE5T

@mapleFU
Member

mapleFU commented Jul 24, 2024

My understanding is that some string lengths are fast, e.g., 4 and 8, because they align with the register width.

Oh, you're right here, since some widths get shorter instruction sequences. Thanks for the idea!

@XiangpengHao
Contributor

If branch prediction works well, the same code path is taken again and again.

Did you observe that branch prediction suffers on strings with varying lengths? That's very interesting! I would love to look at the benchmark code!

I think the Rust implementation only has one conditional branch, i.e., len > 12. If len is at most 12, it calculates the offset and unconditionally jumps to the corresponding target, which should not cause pressure on the branch predictor.

with the help of a compiler assume

That's a very cool idea! That would lead to cleaner and more efficient code. But it seems that Rust doesn't support it yet.

@mapleFU
Member

mapleFU commented Jul 24, 2024

Did you observe that branch prediction suffers on strings with varying lengths?

Yeah, I'm interested in this optimization, so I tried it in C++ code but found that it may not do so well if the string length changes frequently. But anyway, it's a bit faster than calling memcpy on Clang.

And it just occurred to me that maybe the performance for different string lengths would not differ that much because of store-buffer forwarding 🤔 Some analysis might be required here.

which should not cause pressure on the branch predictor.

Maybe I'll try looking at the code generated for match len; I'm not familiar with Rust, so let me have a try.

@mapleFU
Member

mapleFU commented Jul 24, 2024

I've tried it here (I'm not familiar with Rust, so maybe I made a mistake): https://godbolt.org/z/ejWG65bGf

@XiangpengHao
Contributor

XiangpengHao commented Jul 24, 2024

The compiler needs the -C opt-level=3 flag to see the optimized code.

@mapleFU
Member

mapleFU commented Jul 24, 2024

Oh, thanks. So it generates nearly the same code as the C++ switch: https://godbolt.org/z/c9Exx3cbd

@alamb added the parquet (Changes to the parquet crate) label Jul 24, 2024
@alamb
Contributor Author

alamb commented Jul 24, 2024

label_issue.py automatically added labels {'parquet'} from #6101

@alamb added the arrow (Changes to the arrow crate) label Jul 24, 2024
@alamb
Contributor Author

alamb commented Jul 24, 2024

label_issue.py automatically added labels {'arrow'} from #6102

@mapleFU
Member

mapleFU commented Sep 1, 2024

I opened an issue in LLVM: llvm/llvm-project#106895
I believe the gcc-optimized code is also good; I'll wait for the fix to be released.

This issue was closed.