Fix CpuWriteGpuReadBuffer producing unaligned gpu buffer offsets (#1716)
Turns out our eagerness to acquire aligned pointers for fast copy operations backfired and got us into an impossible situation: by offsetting staging buffers to ensure CPU pointer alignment, we sometimes choose offsets that aren't allowed for copy operations. E.g. we get back a buffer whose pointer has an alignment of only 2 (that happens -.-), so we offset the pointer by 14 (our min alignment is 16!). We can now copy data into the buffer quickly and safely, but when scheduling e.g. `copy_buffer_to_texture` we get a wgpu crash! Wgpu requires the offset (we put 14) to be:
* a multiple of `wgpu::COPY_BUFFER_ALIGNMENT`
* a multiple of the texel block size
Neither of which is true for an offset of 14! You might ask why wgpu hands out such oddly aligned buffers to begin with; the answer is sadly that the WebGL implementation has issues and the spec doesn't guarantee any alignment, so this is strictly speaking valid (although most other backends will hand out 16-byte-aligned pointers). See gfx-rs/wgpu#3508. The sketch below illustrates the conflict.
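A minimal sketch of the conflict (hypothetical pointer value, not the actual belt code; assumes the `wgpu` crate is available): padding the buffer offset so that the mapped CPU pointer becomes 16-byte aligned can produce a GPU-side offset that violates wgpu's copy alignment rules.

```rust
fn main() {
    const MIN_ALIGNMENT: u64 = 16;

    // Suppose wgpu maps the staging buffer at a CPU pointer that is only 2-byte aligned
    // (hypothetical address, purely for illustration):
    let mapped_ptr: u64 = 0x1000_0002;

    // Old approach: pad the buffer offset so the *CPU pointer* becomes 16-byte aligned.
    let padding = wgpu::util::align_to(mapped_ptr, MIN_ALIGNMENT) - mapped_ptr;
    assert_eq!(padding, 14);

    // But that padding also ends up as the *GPU buffer offset* used for copy commands,
    // and 14 is neither a multiple of COPY_BUFFER_ALIGNMENT (4) nor of any texel block size,
    // so e.g. `copy_buffer_to_texture` with this offset makes wgpu panic.
    assert_ne!(padding % wgpu::COPY_BUFFER_ALIGNMENT, 0);
}
```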

Long story short, I changed (and simplified) the way we go about alignment on `CpuWriteGpuReadBelt`. The CPU pointer no longer has *any* alignment guarantees, and the buffer offsets now fulfill the requirements above. This is _ok_ since we already wrap all accesses to the CPU pointer, so we can fall back to byte writes. The big drawback is of course that `copy_from_slice` now has to do the heavy lifting: checking for alignment and picking the right instructions wherever that is worthwhile (i.e. what `memcpy` does when it deals with raw byte pointers). A simplified sketch of the new write path is shown below.
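A simplified sketch of the byte-wise write path (not the exact belt code; the helper name and the `main` driver are illustrative only): the mapped staging memory is treated as plain `&mut [u8]`, so no CPU-pointer alignment is ever assumed.

```rust
use bytemuck::Pod;

/// Copy one element's bytes into the unwritten part of the mapped staging memory.
/// Works for any CPU-pointer alignment since it only ever deals in bytes;
/// `copy_from_slice` (i.e. `memcpy`) internally picks wider/SIMD copies where it can.
fn push_element<T: Pod>(unwritten: &mut [u8], element: &T) -> usize {
    let bytes = bytemuck::bytes_of(element);
    unwritten[..bytes.len()].copy_from_slice(bytes);
    bytes.len()
}

fn main() {
    // Pretend this is the mapped (and possibly oddly aligned) staging memory.
    let mut staging = vec![0u8; 64];
    let written = push_element(&mut staging[..], &[1.0f32, 2.0, 3.0, 4.0]);
    assert_eq!(written, 16);
}
```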

Testing: confirmed the fix with the crashing repro on the web, then ran `just py-run-all` natively, and the renderer samples both locally and on the web.
I have not checked whether this has any practical perf impact. Luckily, our interface makes this very much an "optimize later" problem (copy operations within `CpuWriteGpuReadBuffer` can be made cleverer in the future if need be; to be fair, that's unlikely to be necessary).
Wumpf authored Mar 27, 2023
1 parent 8a37e50 commit f0285c7
136 changes: 54 additions & 82 deletions crates/re_renderer/src/allocator/cpu_write_gpu_read_belt.rs
@@ -1,4 +1,4 @@
use std::{num::NonZeroU32, ops::DerefMut, sync::mpsc};
use std::{num::NonZeroU32, sync::mpsc};

use crate::wgpu_resources::{texture_row_data_info, BufferDesc, GpuBuffer, GpuBufferPool};

@@ -34,48 +34,50 @@ impl<T> CpuWriteGpuReadBuffer<T>
where
T: bytemuck::Pod + 'static,
{
/// Memory as slice of T.
/// Memory as slice.
///
/// Note that we can't rely on any alignment guarantees here!
/// We could offset the mapped CPU-sided memory, but then the GPU offset won't be aligned anymore.
/// There's no way we can meet conflicting alignment requirements, so we need to work with unaligned bytes instead.
/// See [this comment on this wgpu issue](https://github.com/gfx-rs/wgpu/issues/3508#issuecomment-1485044324) about what we tried before.
///
/// Once wgpu has some alignment guarantees, we might be able to use this here to allow faster copies!
/// (copies of larger blocks are likely less affected as `memcpy` typically does dynamic check/dispatching for SIMD based copies)
///
/// Do *not* make this public as we need to guarantee that the memory is *never* read from!
#[inline(always)]
fn as_slice(&mut self) -> &mut [T] {
fn as_slice(&mut self) -> &mut [u8] {
// TODO(andreas): Is this access slow given that it internally goes through a trait interface? Should we keep the pointer around?
// `write_view` may have padding at the end that isn't a multiple of T's size.
// Bytemuck get's unhappy about that, so cast the correct range.
bytemuck::cast_slice_mut(
&mut self.write_view[self.unwritten_element_range.start * std::mem::size_of::<T>()
..self.unwritten_element_range.end * std::mem::size_of::<T>()],
)
&mut self.write_view[self.unwritten_element_range.start * std::mem::size_of::<T>()
..self.unwritten_element_range.end * std::mem::size_of::<T>()]
}

/// Pushes a slice of elements into the buffer.
///
/// Panics if the data no longer fits into the buffer.
#[inline]
pub fn extend_from_slice(&mut self, elements: &[T]) {
self.as_slice()[..elements.len()].copy_from_slice(elements);
let bytes = bytemuck::cast_slice(elements);
self.as_slice()[..bytes.len()].copy_from_slice(bytes);
self.unwritten_element_range.start += elements.len();
}

/// Pushes several elements into the buffer.
///
/// Extends until either running out of space or elements.
/// Panics if there are more elements than there is space in the buffer.
#[inline]
pub fn extend(&mut self, elements: impl Iterator<Item = T>) {
let mut num_elements = 0;
for (target, source) in self.as_slice().iter_mut().zip(elements) {
*target = source;
num_elements += 1;
for element in elements {
self.push(element);
}
self.unwritten_element_range.start += num_elements;
}

/// Pushes a single element into the buffer and advances the write pointer.
///
/// Panics if the data no longer fits into the buffer.
#[inline]
pub fn push(&mut self, element: T) {
self.as_slice()[0] = element;
self.as_slice()[..std::mem::size_of::<T>()].copy_from_slice(bytemuck::bytes_of(&element));
self.unwritten_element_range.start += 1;
}

@@ -165,47 +167,22 @@ impl Chunk {
self.buffer.size() - self.unused_offset
}

fn required_padding(write_view: &mut wgpu::BufferViewMut<'_>, alignment: u64) -> u64 {
// Use deref_mut explicitly because wgpu warns otherwise that read access is slow.
let ptr = write_view.deref_mut().as_ptr() as u64;
wgpu::util::align_to(ptr, alignment) - ptr
}

/// Caller needs to make sure that there is enough space plus potential padding.
fn allocate_aligned<T: bytemuck::Pod>(
/// Caller needs to make sure that there is enough space.
fn allocate<T: bytemuck::Pod>(
&mut self,
num_elements: usize,
size_in_bytes: u64,
alignment: u64,
) -> CpuWriteGpuReadBuffer<T> {
debug_assert!(num_elements * std::mem::size_of::<T>() <= size_in_bytes as usize);

// Optimistic first mapping attempt.
let mut start_offset = self.unused_offset;
let mut end_offset = start_offset + size_in_bytes;
let mut buffer_slice = self.buffer.slice(start_offset..end_offset);
let mut write_view = buffer_slice.get_mapped_range_mut();

// Check if it fulfills the requested alignment.
let required_padding = Self::required_padding(&mut write_view, alignment);
if required_padding != 0 {
// Undo mapping and try again with padding!
re_log::trace!(
"CpuWriteGpuReadBuffer::allocate alignment requirement not fulfilled. Need to add {required_padding} for alignment of {alignment}"
);

drop(write_view);

start_offset = self.unused_offset + required_padding;
end_offset = start_offset + size_in_bytes;
buffer_slice = self.buffer.slice(start_offset..end_offset);
write_view = buffer_slice.get_mapped_range_mut();

let required_padding = Self::required_padding(&mut write_view, alignment);
debug_assert_eq!(required_padding, 0);
}
let byte_offset_in_chunk_buffer = self.unused_offset;
let end_offset = byte_offset_in_chunk_buffer + size_in_bytes;

debug_assert!(byte_offset_in_chunk_buffer % CpuWriteGpuReadBelt::MIN_OFFSET_ALIGNMENT == 0);
debug_assert!(end_offset <= self.buffer.size());

let buffer_slice = self.buffer.slice(byte_offset_in_chunk_buffer..end_offset);
let write_view = buffer_slice.get_mapped_range_mut();
self.unused_offset = end_offset;

#[allow(unsafe_code)]
@@ -226,7 +203,7 @@ impl Chunk {

CpuWriteGpuReadBuffer {
chunk_buffer: self.buffer.clone(),
byte_offset_in_chunk_buffer: start_offset,
byte_offset_in_chunk_buffer,
write_view,
unwritten_element_range: 0..num_elements,
_type: std::marker::PhantomData,
@@ -279,11 +256,15 @@ impl CpuWriteGpuReadBelt {
/// Requiring a minimum alignment means we need to pad less often.
/// Also, it has the potential of making memcpy operations faster.
///
/// Align to 4xf32. Should be enough for most usecases!
/// Needs to be larger or equal than [`wgpu::MAP_ALIGNMENT`].
/// Needs to be larger or equal than [`wgpu::MAP_ALIGNMENT`], [`wgpu::COPY_BUFFER_ALIGNMENT`]
/// and the largest possible texel block footprint (since offsets for texture copies require this)
///
/// For alignment requirements in `WebGPU` in general, refer to
/// [the specification on alignment-class limitations](https://www.w3.org/TR/webgpu/#limit-class-alignment)
pub const MIN_ALIGNMENT: u64 = 16;
///
/// Note that this does NOT mean that the CPU memory has *any* alignment.
/// See this issue about [lack of CPU memory alignment](https://github.com/gfx-rs/wgpu/issues/3508) in wgpu/WebGPU.
const MIN_OFFSET_ALIGNMENT: u64 = 16;

/// Create a cpu-write & gpu-read staging belt.
///
@@ -299,11 +280,21 @@
/// TODO(andreas): Adaptive chunk sizes
/// TODO(andreas): Shrinking after usage spikes?
pub fn new(chunk_size: wgpu::BufferSize) -> Self {
static_assertions::const_assert!(wgpu::MAP_ALIGNMENT <= CpuWriteGpuReadBelt::MIN_ALIGNMENT);
static_assertions::const_assert!(
wgpu::MAP_ALIGNMENT <= CpuWriteGpuReadBelt::MIN_OFFSET_ALIGNMENT
);
static_assertions::const_assert!(
wgpu::COPY_BUFFER_ALIGNMENT <= CpuWriteGpuReadBelt::MIN_OFFSET_ALIGNMENT
);
// Largest uncompressed texture format (btw. many compressed texture format have the same block size!)
debug_assert!(
wgpu::TextureFormat::Rgba32Uint.describe().block_size as u64
<= CpuWriteGpuReadBelt::MIN_OFFSET_ALIGNMENT
);

let (sender, receiver) = mpsc::channel();
CpuWriteGpuReadBelt {
chunk_size: wgpu::util::align_to(chunk_size.get(), Self::MIN_ALIGNMENT),
chunk_size: wgpu::util::align_to(chunk_size.get(), Self::MIN_OFFSET_ALIGNMENT),
active_chunks: Vec::new(),
closed_chunks: Vec::new(),
free_chunks: Vec::new(),
@@ -313,8 +304,6 @@ impl CpuWriteGpuReadBelt {
}

/// Allocates a cpu writable buffer for `num_elements` instances of type `T`.
///
/// Handles alignment requirements automatically, allowing arbitrarily aligned types without issues.
pub fn allocate<T: bytemuck::Pod>(
&mut self,
device: &wgpu::Device,
@@ -325,32 +314,17 @@

debug_assert!(num_elements > 0, "Cannot allocate zero-sized buffer");

// Potentially overestimate alignment with Self::MIN_ALIGNMENT, see Self::MIN_ALIGNMENT doc string.
let alignment = (std::mem::align_of::<T>() as wgpu::BufferAddress).max(Self::MIN_ALIGNMENT);
// Pad out the size of the used buffer to a multiple of Self::MIN_ALIGNMENT.
// This increases our chance of having back to back allocations within a chunk.
// Potentially overestimate size with Self::MIN_ALIGNMENT, see Self::MIN_ALIGNMENT doc string.
let size = wgpu::util::align_to(
(std::mem::size_of::<T>() * num_elements) as wgpu::BufferAddress,
Self::MIN_ALIGNMENT,
Self::MIN_OFFSET_ALIGNMENT,
);

// We need to be super careful with alignment since today wgpu
// has no guarantees on how pointers to mapped memory are aligned!
// For all we know, pointers might be 1 aligned, causing even a u32 write to crash the process!
//
// For details and (as of writing) ongoing discussion see https://github.com/gfx-rs/wgpu/issues/3508
//
// To work around this, we ask for a bigger size, so we can safely pad out
// if the returned pointer is not correctly aligned.
// (i.e. we will use _up to_ `required_size` bytes, but at least `size`)]
let maximum_padding = alignment - 1;
let max_required_size = size + maximum_padding;

// Try to find space in any of the active chunks first.
let mut chunk = if let Some(index) = self
.active_chunks
.iter_mut()
.position(|chunk| chunk.remaining_capacity() >= max_required_size)
.position(|chunk| chunk.remaining_capacity() >= size)
{
self.active_chunks.swap_remove(index)
} else {
Expand All @@ -360,15 +334,13 @@ impl CpuWriteGpuReadBelt {
if let Some(index) = self
.free_chunks
.iter()
.position(|chunk| chunk.remaining_capacity() >= max_required_size)
.position(|chunk| chunk.remaining_capacity() >= size)
{
self.free_chunks.swap_remove(index)
} else {
// Allocation might be bigger than a chunk!
let buffer_size = wgpu::util::align_to(
self.chunk_size.max(max_required_size),
Self::MIN_ALIGNMENT,
);
let buffer_size =
wgpu::util::align_to(self.chunk_size.max(size), Self::MIN_OFFSET_ALIGNMENT);
// Happens relatively rarely, this is a noteworthy event!
re_log::debug!("Allocating new CpuWriteGpuReadBelt chunk of size {buffer_size}");
let buffer = buffer_pool.alloc(
Expand All @@ -388,7 +360,7 @@ impl CpuWriteGpuReadBelt {
}
};

let cpu_buffer_view = chunk.allocate_aligned(num_elements, size, alignment);
let cpu_buffer_view = chunk.allocate(num_elements, size);
self.active_chunks.push(chunk);
cpu_buffer_view
}

1 comment on commit f0285c7

@github-actions


Rust Benchmark

| Benchmark suite | Current: f0285c7 | Previous: 8a37e50 | Ratio |
|---|---|---|---|
| datastore/insert/batch/rects/insert | 605813 ns/iter (± 2916) | 607999 ns/iter (± 3501) | 1.00 |
| datastore/latest_at/batch/rects/query | 1853 ns/iter (± 9) | 1847 ns/iter (± 7) | 1.00 |
| datastore/latest_at/missing_components/primary | 287 ns/iter (± 1) | 289 ns/iter (± 5) | 0.99 |
| datastore/latest_at/missing_components/secondaries | 438 ns/iter (± 2) | 444 ns/iter (± 2) | 0.99 |
| datastore/range/batch/rects/query | 151846 ns/iter (± 1511) | 152928 ns/iter (± 776) | 0.99 |
| mono_points_arrow/generate_message_bundles | 46513917 ns/iter (± 622483) | 43674525 ns/iter (± 2094613) | 1.07 |
| mono_points_arrow/generate_messages | 123500647 ns/iter (± 1110759) | 122673947 ns/iter (± 1466966) | 1.01 |
| mono_points_arrow/encode_log_msg | 156246217 ns/iter (± 984562) | 153248336 ns/iter (± 1169571) | 1.02 |
| mono_points_arrow/encode_total | 327457744 ns/iter (± 1935988) | 322940902 ns/iter (± 2003700) | 1.01 |
| mono_points_arrow/decode_log_msg | 179138730 ns/iter (± 1094507) | 176465542 ns/iter (± 1591891) | 1.02 |
| mono_points_arrow/decode_message_bundles | 54574594 ns/iter (± 644866) | 53000928 ns/iter (± 931076) | 1.03 |
| mono_points_arrow/decode_total | 231466995 ns/iter (± 1461577) | 227298270 ns/iter (± 1971502) | 1.02 |
| batch_points_arrow/generate_message_bundles | 286488 ns/iter (± 2428) | 285537 ns/iter (± 2423) | 1.00 |
| batch_points_arrow/generate_messages | 6111 ns/iter (± 43) | 6018 ns/iter (± 80) | 1.02 |
| batch_points_arrow/encode_log_msg | 385179 ns/iter (± 2164) | 376585 ns/iter (± 2817) | 1.02 |
| batch_points_arrow/encode_total | 694213 ns/iter (± 4347) | 690810 ns/iter (± 3373) | 1.00 |
| batch_points_arrow/decode_log_msg | 353613 ns/iter (± 1783) | 350176 ns/iter (± 1466) | 1.01 |
| batch_points_arrow/decode_message_bundles | 1567 ns/iter (± 9) | 1613 ns/iter (± 33) | 0.97 |
| batch_points_arrow/decode_total | 357799 ns/iter (± 1669) | 358649 ns/iter (± 2838) | 1.00 |
| arrow_mono_points/insert | 6204320834 ns/iter (± 17905183) | 6120830809 ns/iter (± 19096063) | 1.01 |
| arrow_mono_points/query | 1724079 ns/iter (± 17922) | 1721371 ns/iter (± 18940) | 1.00 |
| arrow_batch_points/insert | 3033644 ns/iter (± 15959) | 3013481 ns/iter (± 30029) | 1.01 |
| arrow_batch_points/query | 15406 ns/iter (± 99) | 15257 ns/iter (± 209) | 1.01 |
| arrow_batch_vecs/insert | 42562 ns/iter (± 216) | 43113 ns/iter (± 440) | 0.99 |
| arrow_batch_vecs/query | 479101 ns/iter (± 5927) | 469903 ns/iter (± 5425) | 1.02 |
| tuid/Tuid::random | 34 ns/iter (± 0) | 34 ns/iter (± 0) | 1 |

This comment was automatically generated by workflow using github-action-benchmark.
