Fix CpuWriteGpuReadBuffer producing unaligned gpu buffer offsets (#1716)
Turns out our eagerness to acquire aligned pointers for fast copy operations backfired and got us into an impossible situation: by offsetting staging buffers to ensure CPU pointer alignment, we sometimes choose offsets that aren't allowed for copy operations. E.g. we get back a buffer whose pointer has an alignment of only 2 (that happens -.-), so we offset the pointer by 14 (our min alignment is 16!). We can now copy data into the buffer quickly and safely, but when scheduling e.g. `copy_buffer_to_texture` we get a wgpu crash! Wgpu requires the offset (we put 14) to be:
* a multiple of `wgpu::COPY_BUFFER_ALIGNMENT`
* a multiple of the texel block size
Neither of which is true for an offset of 14! You might ask why wgpu hands out such oddly aligned buffers to begin with; the answer is sadly that the WebGL implementation has issues and the spec doesn't guarantee any alignment, so this is strictly speaking valid (although most other backends will hand out 16-byte-aligned pointers). See gfx-rs/wgpu#3508. The sketch below illustrates the conflict.
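A minimal sketch of the conflict (hypothetical pointer value, not the actual belt code; assumes the `wgpu` crate is available): padding the buffer offset so that the mapped CPU pointer becomes 16-byte aligned can produce a GPU-side offset that violates wgpu's copy alignment rules.

```rust
fn main() {
    const MIN_ALIGNMENT: u64 = 16;

    // Suppose wgpu maps the staging buffer at a CPU pointer that is only 2-byte aligned
    // (hypothetical address, purely for illustration):
    let mapped_ptr: u64 = 0x1000_0002;

    // Old approach: pad the buffer offset so the *CPU pointer* becomes 16-byte aligned.
    let padding = wgpu::util::align_to(mapped_ptr, MIN_ALIGNMENT) - mapped_ptr;
    assert_eq!(padding, 14);

    // But that padding also ends up as the *GPU buffer offset* used for copy commands,
    // and 14 is neither a multiple of COPY_BUFFER_ALIGNMENT (4) nor of any texel block size,
    // so e.g. `copy_buffer_to_texture` with this offset makes wgpu panic.
    assert_ne!(padding % wgpu::COPY_BUFFER_ALIGNMENT, 0);
}
```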

Long story short, I changed (and simplified) the way we go about alignment on `CpuWriteGpuReadBelt`. The CPU pointer no longer has *any* alignment guarantees, and the buffer offsets now fulfill the requirements above. This is _ok_ since we already wrap all accesses to the CPU pointer, so we can fall back to byte writes. The big drawback is of course that `copy_from_slice` now has to do the heavy lifting: checking for alignment and picking the right instructions wherever that is worthwhile (i.e. what `memcpy` does when it deals with raw byte pointers). A simplified sketch of the new write path is shown below.
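A simplified sketch of the byte-wise write path (not the exact belt code; the helper name and the `main` driver are illustrative only): the mapped staging memory is treated as plain `&mut [u8]`, so no CPU-pointer alignment is ever assumed.

```rust
use bytemuck::Pod;

/// Copy one element's bytes into the unwritten part of the mapped staging memory.
/// Works for any CPU-pointer alignment since it only ever deals in bytes;
/// `copy_from_slice` (i.e. `memcpy`) internally picks wider/SIMD copies where it can.
fn push_element<T: Pod>(unwritten: &mut [u8], element: &T) -> usize {
    let bytes = bytemuck::bytes_of(element);
    unwritten[..bytes.len()].copy_from_slice(bytes);
    bytes.len()
}

fn main() {
    // Pretend this is the mapped (and possibly oddly aligned) staging memory.
    let mut staging = vec![0u8; 64];
    let written = push_element(&mut staging[..], &[1.0f32, 2.0, 3.0, 4.0]);
    assert_eq!(written, 16);
}
```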

Testing: confirmed the fix with the crashing repro on the web, then ran `just py-run-all` natively, and the renderer samples both locally and on the web.
I have not checked whether this has any practical perf impact. Luckily, our interface makes this very much an "optimize later" problem (copy operations within `CpuWriteGpuReadBuffer` can be made cleverer in the future if need be; to be fair, that's unlikely to be necessary).
Wumpf authored Mar 27, 2023
1 parent 8a37e50 commit f0285c7
136 changes: 54 additions & 82 deletions crates/re_renderer/src/allocator/cpu_write_gpu_read_belt.rs
@@ -1,4 +1,4 @@
use std::{num::NonZeroU32, ops::DerefMut, sync::mpsc};
use std::{num::NonZeroU32, sync::mpsc};

use crate::wgpu_resources::{texture_row_data_info, BufferDesc, GpuBuffer, GpuBufferPool};

@@ -34,48 +34,50 @@ impl<T> CpuWriteGpuReadBuffer<T>
where
T: bytemuck::Pod + 'static,
{
/// Memory as slice of T.
/// Memory as slice.
///
/// Note that we can't rely on any alignment guarantees here!
/// We could offset the mapped CPU-sided memory, but then the GPU offset won't be aligned anymore.
/// There's no way we can meet conflicting alignment requirements, so we need to work with unaligned bytes instead.
/// See [this comment on this wgpu issue](https://github.com/gfx-rs/wgpu/issues/3508#issuecomment-1485044324) about what we tried before.
///
/// Once wgpu has some alignment guarantees, we might be able to use this here to allow faster copies!
/// (copies of larger blocks are likely less affected as `memcpy` typically does dynamic check/dispatching for SIMD based copies)
///
/// Do *not* make this public as we need to guarantee that the memory is *never* read from!
#[inline(always)]
fn as_slice(&mut self) -> &mut [T] {
fn as_slice(&mut self) -> &mut [u8] {
// TODO(andreas): Is this access slow given that it internally goes through a trait interface? Should we keep the pointer around?
// `write_view` may have padding at the end that isn't a multiple of T's size.
// Bytemuck get's unhappy about that, so cast the correct range.
bytemuck::cast_slice_mut(
&mut self.write_view[self.unwritten_element_range.start * std::mem::size_of::<T>()
..self.unwritten_element_range.end * std::mem::size_of::<T>()],
)
&mut self.write_view[self.unwritten_element_range.start * std::mem::size_of::<T>()
..self.unwritten_element_range.end * std::mem::size_of::<T>()]
}

/// Pushes a slice of elements into the buffer.
///
/// Panics if the data no longer fits into the buffer.
#[inline]
pub fn extend_from_slice(&mut self, elements: &[T]) {
self.as_slice()[..elements.len()].copy_from_slice(elements);
let bytes = bytemuck::cast_slice(elements);
self.as_slice()[..bytes.len()].copy_from_slice(bytes);
self.unwritten_element_range.start += elements.len();
}

/// Pushes several elements into the buffer.
///
/// Extends until either running out of space or elements.
/// Panics if there are more elements than there is space in the buffer.
#[inline]
pub fn extend(&mut self, elements: impl Iterator<Item = T>) {
let mut num_elements = 0;
for (target, source) in self.as_slice().iter_mut().zip(elements) {
*target = source;
num_elements += 1;
for element in elements {
self.push(element);
}
self.unwritten_element_range.start += num_elements;
}

/// Pushes a single element into the buffer and advances the write pointer.
///
/// Panics if the data no longer fits into the buffer.
#[inline]
pub fn push(&mut self, element: T) {
self.as_slice()[0] = element;
self.as_slice()[..std::mem::size_of::<T>()].copy_from_slice(bytemuck::bytes_of(&element));
self.unwritten_element_range.start += 1;
}

@@ -165,47 +167,22 @@ impl Chunk {
self.buffer.size() - self.unused_offset
}

fn required_padding(write_view: &mut wgpu::BufferViewMut<'_>, alignment: u64) -> u64 {
// Use deref_mut explicitly because wgpu warns otherwise that read access is slow.
let ptr = write_view.deref_mut().as_ptr() as u64;
wgpu::util::align_to(ptr, alignment) - ptr
}

/// Caller needs to make sure that there is enough space plus potential padding.
fn allocate_aligned<T: bytemuck::Pod>(
/// Caller needs to make sure that there is enough space.
fn allocate<T: bytemuck::Pod>(
&mut self,
num_elements: usize,
size_in_bytes: u64,
alignment: u64,
) -> CpuWriteGpuReadBuffer<T> {
debug_assert!(num_elements * std::mem::size_of::<T>() <= size_in_bytes as usize);

// Optimistic first mapping attempt.
let mut start_offset = self.unused_offset;
let mut end_offset = start_offset + size_in_bytes;
let mut buffer_slice = self.buffer.slice(start_offset..end_offset);
let mut write_view = buffer_slice.get_mapped_range_mut();

// Check if it fulfills the requested alignment.
let required_padding = Self::required_padding(&mut write_view, alignment);
if required_padding != 0 {
// Undo mapping and try again with padding!
re_log::trace!(
"CpuWriteGpuReadBuffer::allocate alignment requirement not fulfilled. Need to add {required_padding} for alignment of {alignment}"
);

drop(write_view);

start_offset = self.unused_offset + required_padding;
end_offset = start_offset + size_in_bytes;
buffer_slice = self.buffer.slice(start_offset..end_offset);
write_view = buffer_slice.get_mapped_range_mut();

let required_padding = Self::required_padding(&mut write_view, alignment);
debug_assert_eq!(required_padding, 0);
}
let byte_offset_in_chunk_buffer = self.unused_offset;
let end_offset = byte_offset_in_chunk_buffer + size_in_bytes;

debug_assert!(byte_offset_in_chunk_buffer % CpuWriteGpuReadBelt::MIN_OFFSET_ALIGNMENT == 0);
debug_assert!(end_offset <= self.buffer.size());

let buffer_slice = self.buffer.slice(byte_offset_in_chunk_buffer..end_offset);
let write_view = buffer_slice.get_mapped_range_mut();
self.unused_offset = end_offset;

#[allow(unsafe_code)]
@@ -226,7 +203,7 @@ impl Chunk {

CpuWriteGpuReadBuffer {
chunk_buffer: self.buffer.clone(),
byte_offset_in_chunk_buffer: start_offset,
byte_offset_in_chunk_buffer,
write_view,
unwritten_element_range: 0..num_elements,
_type: std::marker::PhantomData,
@@ -279,11 +256,15 @@ impl CpuWriteGpuReadBelt {
/// Requiring a minimum alignment means we need to pad less often.
/// Also, it has the potential of making memcpy operations faster.
///
/// Align to 4xf32. Should be enough for most usecases!
/// Needs to be larger or equal than [`wgpu::MAP_ALIGNMENT`].
/// Needs to be larger or equal than [`wgpu::MAP_ALIGNMENT`], [`wgpu::COPY_BUFFER_ALIGNMENT`]
/// and the largest possible texel block footprint (since offsets for texture copies require this)
///
/// For alignment requirements in `WebGPU` in general, refer to
/// [the specification on alignment-class limitations](https://www.w3.org/TR/webgpu/#limit-class-alignment)
pub const MIN_ALIGNMENT: u64 = 16;
///
/// Note that this does NOT mean that the CPU memory has *any* alignment.
/// See this issue about [lack of CPU memory alignment](https://github.com/gfx-rs/wgpu/issues/3508) in wgpu/WebGPU.
const MIN_OFFSET_ALIGNMENT: u64 = 16;

/// Create a cpu-write & gpu-read staging belt.
///
@@ -299,11 +280,21 @@
/// TODO(andreas): Adaptive chunk sizes
/// TODO(andreas): Shrinking after usage spikes?
pub fn new(chunk_size: wgpu::BufferSize) -> Self {
static_assertions::const_assert!(wgpu::MAP_ALIGNMENT <= CpuWriteGpuReadBelt::MIN_ALIGNMENT);
static_assertions::const_assert!(
wgpu::MAP_ALIGNMENT <= CpuWriteGpuReadBelt::MIN_OFFSET_ALIGNMENT
);
static_assertions::const_assert!(
wgpu::COPY_BUFFER_ALIGNMENT <= CpuWriteGpuReadBelt::MIN_OFFSET_ALIGNMENT
);
// Largest uncompressed texture format (btw. many compressed texture format have the same block size!)
debug_assert!(
wgpu::TextureFormat::Rgba32Uint.describe().block_size as u64
<= CpuWriteGpuReadBelt::MIN_OFFSET_ALIGNMENT
);

let (sender, receiver) = mpsc::channel();
CpuWriteGpuReadBelt {
chunk_size: wgpu::util::align_to(chunk_size.get(), Self::MIN_ALIGNMENT),
chunk_size: wgpu::util::align_to(chunk_size.get(), Self::MIN_OFFSET_ALIGNMENT),
active_chunks: Vec::new(),
closed_chunks: Vec::new(),
free_chunks: Vec::new(),
@@ -313,8 +304,6 @@ impl CpuWriteGpuReadBelt {
}

/// Allocates a cpu writable buffer for `num_elements` instances of type `T`.
///
/// Handles alignment requirements automatically, allowing arbitrarily aligned types without issues.
pub fn allocate<T: bytemuck::Pod>(
&mut self,
device: &wgpu::Device,
@@ -325,32 +314,17 @@

debug_assert!(num_elements > 0, "Cannot allocate zero-sized buffer");

// Potentially overestimate alignment with Self::MIN_ALIGNMENT, see Self::MIN_ALIGNMENT doc string.
let alignment = (std::mem::align_of::<T>() as wgpu::BufferAddress).max(Self::MIN_ALIGNMENT);
// Pad out the size of the used buffer to a multiple of Self::MIN_ALIGNMENT.
// This increases our chance of having back to back allocations within a chunk.
// Potentially overestimate size with Self::MIN_ALIGNMENT, see Self::MIN_ALIGNMENT doc string.
let size = wgpu::util::align_to(
(std::mem::size_of::<T>() * num_elements) as wgpu::BufferAddress,
Self::MIN_ALIGNMENT,
Self::MIN_OFFSET_ALIGNMENT,
);

// We need to be super careful with alignment since today wgpu
// has no guarantees on how pointers to mapped memory are aligned!
// For all we know, pointers might be 1 aligned, causing even a u32 write to crash the process!
//
// For details and (as of writing) ongoing discussion see https://github.com/gfx-rs/wgpu/issues/3508
//
// To work around this, we ask for a bigger size, so we can safely pad out
// if the returned pointer is not correctly aligned.
// (i.e. we will use _up to_ `required_size` bytes, but at least `size`)]
let maximum_padding = alignment - 1;
let max_required_size = size + maximum_padding;

// Try to find space in any of the active chunks first.
let mut chunk = if let Some(index) = self
.active_chunks
.iter_mut()
.position(|chunk| chunk.remaining_capacity() >= max_required_size)
.position(|chunk| chunk.remaining_capacity() >= size)
{
self.active_chunks.swap_remove(index)
} else {
Expand All @@ -360,15 +334,13 @@ impl CpuWriteGpuReadBelt {
if let Some(index) = self
.free_chunks
.iter()
.position(|chunk| chunk.remaining_capacity() >= max_required_size)
.position(|chunk| chunk.remaining_capacity() >= size)
{
self.free_chunks.swap_remove(index)
} else {
// Allocation might be bigger than a chunk!
let buffer_size = wgpu::util::align_to(
self.chunk_size.max(max_required_size),
Self::MIN_ALIGNMENT,
);
let buffer_size =
wgpu::util::align_to(self.chunk_size.max(size), Self::MIN_OFFSET_ALIGNMENT);
// Happens relatively rarely, this is a noteworthy event!
re_log::debug!("Allocating new CpuWriteGpuReadBelt chunk of size {buffer_size}");
let buffer = buffer_pool.alloc(
Expand All @@ -388,7 +360,7 @@ impl CpuWriteGpuReadBelt {
}
};

let cpu_buffer_view = chunk.allocate_aligned(num_elements, size, alignment);
let cpu_buffer_view = chunk.allocate(num_elements, size);
self.active_chunks.push(chunk);
cpu_buffer_view
}

1 comment on commit f0285c7

@github-actions


Rust Benchmark

| Benchmark suite | Current: f0285c7 | Previous: 8a37e50 | Ratio |
|---|---|---|---|
| datastore/insert/batch/rects/insert | 605813 ns/iter (± 2916) | 607999 ns/iter (± 3501) | 1.00 |
| datastore/latest_at/batch/rects/query | 1853 ns/iter (± 9) | 1847 ns/iter (± 7) | 1.00 |
| datastore/latest_at/missing_components/primary | 287 ns/iter (± 1) | 289 ns/iter (± 5) | 0.99 |
| datastore/latest_at/missing_components/secondaries | 438 ns/iter (± 2) | 444 ns/iter (± 2) | 0.99 |
| datastore/range/batch/rects/query | 151846 ns/iter (± 1511) | 152928 ns/iter (± 776) | 0.99 |
| mono_points_arrow/generate_message_bundles | 46513917 ns/iter (± 622483) | 43674525 ns/iter (± 2094613) | 1.07 |
| mono_points_arrow/generate_messages | 123500647 ns/iter (± 1110759) | 122673947 ns/iter (± 1466966) | 1.01 |
| mono_points_arrow/encode_log_msg | 156246217 ns/iter (± 984562) | 153248336 ns/iter (± 1169571) | 1.02 |
| mono_points_arrow/encode_total | 327457744 ns/iter (± 1935988) | 322940902 ns/iter (± 2003700) | 1.01 |
| mono_points_arrow/decode_log_msg | 179138730 ns/iter (± 1094507) | 176465542 ns/iter (± 1591891) | 1.02 |
| mono_points_arrow/decode_message_bundles | 54574594 ns/iter (± 644866) | 53000928 ns/iter (± 931076) | 1.03 |
| mono_points_arrow/decode_total | 231466995 ns/iter (± 1461577) | 227298270 ns/iter (± 1971502) | 1.02 |
| batch_points_arrow/generate_message_bundles | 286488 ns/iter (± 2428) | 285537 ns/iter (± 2423) | 1.00 |
| batch_points_arrow/generate_messages | 6111 ns/iter (± 43) | 6018 ns/iter (± 80) | 1.02 |
| batch_points_arrow/encode_log_msg | 385179 ns/iter (± 2164) | 376585 ns/iter (± 2817) | 1.02 |
| batch_points_arrow/encode_total | 694213 ns/iter (± 4347) | 690810 ns/iter (± 3373) | 1.00 |
| batch_points_arrow/decode_log_msg | 353613 ns/iter (± 1783) | 350176 ns/iter (± 1466) | 1.01 |
| batch_points_arrow/decode_message_bundles | 1567 ns/iter (± 9) | 1613 ns/iter (± 33) | 0.97 |
| batch_points_arrow/decode_total | 357799 ns/iter (± 1669) | 358649 ns/iter (± 2838) | 1.00 |
| arrow_mono_points/insert | 6204320834 ns/iter (± 17905183) | 6120830809 ns/iter (± 19096063) | 1.01 |
| arrow_mono_points/query | 1724079 ns/iter (± 17922) | 1721371 ns/iter (± 18940) | 1.00 |
| arrow_batch_points/insert | 3033644 ns/iter (± 15959) | 3013481 ns/iter (± 30029) | 1.01 |
| arrow_batch_points/query | 15406 ns/iter (± 99) | 15257 ns/iter (± 209) | 1.01 |
| arrow_batch_vecs/insert | 42562 ns/iter (± 216) | 43113 ns/iter (± 440) | 0.99 |
| arrow_batch_vecs/query | 479101 ns/iter (± 5927) | 469903 ns/iter (± 5425) | 1.02 |
| tuid/Tuid::random | 34 ns/iter (± 0) | 34 ns/iter (± 0) | 1 |

This comment was automatically generated by workflow using github-action-benchmark.
