There's something I've noticed using cargo-show-asm in the past 2-3 months.
There are occasional cases where building a lib crate with LTO doesn't produce the same results as building a bin/bench crate for a given function. (Note: Turning off LTO makes the code match on lib crates).
Below is a simple example based on an in-progress repo of mine:
    /// The implementation of a generic histogram, storing the count for each byte using type `T`.
    /// `T` should be a type that can be incremented.
    #[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
    pub struct Histogram<T> {
        pub counter: [T; 256],
    }

    /// Implementation of a histogram using unsigned 32 bit integers as the counter.
    #[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Default)]
    pub struct Histogram32 {
        pub inner: Histogram<u32>,
    }

    impl Default for Histogram<u32> {
        // Defaults to a zero'd array.
        fn default() -> Self {
            Histogram { counter: [0; 256] }
        }
    }

    const NUM_SLICES: usize = 4;
    const SLICE_SIZE_U32S: usize = 256;

    pub fn histogram_nonaliased_withruns_core(data: &[u8]) -> Histogram32 {
        // 4K on stack (4 slices x 256 x u32), should be good.
        let mut histogram = [Histogram32::default(); NUM_SLICES];

        unsafe {
            let mut ptr = data.as_ptr();
            let end = ptr.add(data.len());
            let current_ptr = histogram[0].inner.counter.as_mut_ptr();

            if data.len() > 24 {
                let aligned_end = end.sub(24);
                let mut current = (ptr as *const u64).read_unaligned();

                while ptr < aligned_end {
                    // Prefetch next 1 iteration.
                    let next = (ptr.add(8) as *const u64).read_unaligned();

                    if current == next {
                        // Check if all bytes are the same within 'current'.
                        // With a XOR, we can check every byte (except byte 0)
                        // with its predecessor. If our value is <256,
                        // then all bytes are the same value.
                        let shifted = current << 8;
                        if (shifted ^ current) < 256 {
                            // All bytes same - increment single bucket by 16
                            // (current is all same byte and current equals next)
                            *current_ptr.add((current & 0xFF) as usize) += 16;
                        } else {
                            // Same 8 bytes twice - sum with INC2
                            sum8(current_ptr, current, 2);
                        }
                    } else {
                        // Process both 8-byte chunks with INC1
                        sum8(current_ptr, current, 1);
                        sum8(current_ptr, next, 1);
                    }

                    current = ((ptr.add(16)) as *const u64).read_unaligned();
                    ptr = ptr.add(16);
                }
            }

            while ptr < end {
                let byte = *ptr;
                *current_ptr.add(byte as usize) += 1;
                ptr = ptr.add(1);
            }

            // Sum up all bytes
            // Vectorization-friendly summation
            if NUM_SLICES <= 1 {
                histogram[0]
            } else {
                let mut result = histogram[0];

                for x in (0..256).step_by(4) {
                    let mut sum0 = 0_u32;
                    let mut sum1 = 0_u32;
                    let mut sum2 = 0_u32;
                    let mut sum3 = 0_u32;

                    // Changing to suggested code breaks.
                    #[allow(clippy::needless_range_loop)]
                    for slice in 0..NUM_SLICES {
                        sum0 += histogram[slice].inner.counter[x];
                        sum1 += histogram[slice].inner.counter[x + 1];
                        sum2 += histogram[slice].inner.counter[x + 2];
                        sum3 += histogram[slice].inner.counter[x + 3];
                    }

                    result.inner.counter[x] = sum0;
                    result.inner.counter[x + 1] = sum1;
                    result.inner.counter[x + 2] = sum2;
                    result.inner.counter[x + 3] = sum3;
                }

                result
            }
        }
    }

    #[inline(always)]
    unsafe fn sum8(current_ptr: *mut u32, mut value: u64, increment: u32) {
        for index in 0..8 {
            let byte = (value & 0xFF) as usize;
            let slice_offset = (index % NUM_SLICES) * SLICE_SIZE_U32S;
            let write_ptr = current_ptr.add(slice_offset + byte);
            let current = (write_ptr as *const u32).read_unaligned();
            (write_ptr).write_unaligned(current + increment);
            value >>= 8;
        }
    }
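For readers skimming the snippet, two of its ideas can be sketched in safe Rust: the XOR-based run check, and the split of counts across several tables that are summed at the end. This is only a minimal sketch, not the code from the repo. `all_bytes_equal` and `histogram_sliced` are hypothetical names introduced for illustration, and the sliced version picks a table by byte position, which is an assumption that only roughly mirrors what `sum8` does.

```rust
// Number of independent counter tables (same value as in the snippet above).
const NUM_SLICES: usize = 4;

/// Returns true when all eight bytes of `word` hold the same value.
/// XOR-ing the word with itself shifted left by one byte compares every
/// byte (except byte 0) with its predecessor; byte 0 passes through as-is,
/// so the result stays below 256 only if bytes 1..=7 all match their predecessor.
pub fn all_bytes_equal(word: u64) -> bool {
    ((word << 8) ^ word) < 256
}

/// Counts byte occurrences using `NUM_SLICES` separate tables, then sums them.
/// Hypothetical safe counterpart, not the author's implementation.
pub fn histogram_sliced(data: &[u8]) -> [u32; 256] {
    let mut slices = [[0u32; 256]; NUM_SLICES];
    for (i, &byte) in data.iter().enumerate() {
        // Neighbouring bytes go to different tables, so equal neighbouring
        // values do not increment the same counter back to back.
        slices[i % NUM_SLICES][byte as usize] += 1;
    }

    // Final reduction: add up the per-table counts for each bucket.
    let mut result = [0u32; 256];
    for (bucket, total) in result.iter_mut().enumerate() {
        *total = slices.iter().map(|s| s[bucket]).sum();
    }
    result
}
```

Calling `histogram_sliced(b"aaabbc")` should report 3, 2 and 1 for the buckets of `a`, `b` and `c`, and `all_bytes_equal` should accept a word such as `0x4141_4141_4141_4141` while rejecting one where any single byte differs.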
Apologies for the long assembly, it's the best example I have on hand that I can think of off the top of my head.
The code respects opt-level, but certain optimisations are missed; typically auto-vectorization, in the one or two times I've run into this issue. Host is Linux x86-64, but neither OS nor target-cpu seems to have any impact here.
This isn't a help request or anything of the sort; I was just wondering if this behaviour is worth documenting somewhere.
The whole point of LTO is to optimize the code in a context that is not available when you compile parts separately. By default, cargo-show-asm asks rustc to dump assembly (llvm, etc.) as text and parses that output. This happens before LTO.
You can enable the disasm feature and pass --disasm when compiling your binary, or even pass the path to an executable file. This way cargo-show-asm will look at the binary code after all possible optimization steps (LTO, BOLT, etc.).
Or if you'd prefer repo and commit, here.
Will need to add no_mangle as usual. Building with LTO enabled for release in Cargo.toml gives: