Dynamic ring buffer sizing for BZip2 and ZStandard on non-seekable streams#1269
Dynamic ring buffer sizing for BZip2 and ZStandard on non-seekable streams#1269
Conversation
Agent-Logs-Url: https://github.com/adamhathcock/sharpcompress/sessions/2d1412f8-34f8-4a32-8802-e52770342940 Co-authored-by: adamhathcock <527620+adamhathcock@users.noreply.github.com>
| // BZip2 decompresses in whole blocks; the compressed size of the first block | ||
| // can be close to the uncompressed maximum (9 × 100 000 = 900 000 bytes). | ||
| // The ring buffer must hold all compressed bytes read during format detection. | ||
| minimumRewindBufferSize: BZip2Constants.baseBlockSize * 9 |
There was a problem hiding this comment.
@copilot are there more kinds than just BZip2 to worry about? Zstandard?
There was a problem hiding this comment.
Only BZip2 needs the enlarged buffer. The other formats all decompress incrementally:
- ZStandard reads from the source in chunks of
ZSTD_DStreamInSize() ≈ 128 KBand produces output from each chunk before reading more. The default 160 KB ring buffer is larger, so no special treatment is needed. - GZip (DEFLATE), LZip (LZMA), XZ (LZMA2), and LZW are all streaming/dictionary-based decompressors that produce output while consuming input — well within the default 160 KB buffer.
BZip2 is unique in that it must buffer and decode an entire compressed block before producing any output, and at compression level 9 that block can be up to 900 KB of compressed data.
…(minBufferSize) Agent-Logs-Url: https://github.com/adamhathcock/sharpcompress/sessions/47398db2-828b-472c-a213-6d5865cb0082 Co-authored-by: adamhathcock <527620+adamhathcock@users.noreply.github.com>
Agent-Logs-Url: https://github.com/adamhathcock/sharpcompress/sessions/5c7bd74e-7315-4dfd-a5a4-e8fd2804f7be Co-authored-by: adamhathcock <527620+adamhathcock@users.noreply.github.com>
|
@copilot revert the default buffer size to 80k and use the same mechanism for BZip2 for Zstandard to be what is needed |
Code Review SummaryStatus: No Issues Found | Recommendation: Merge Overview
Changes Verified (3 files)
Implementation VerificationThe PR owner request has been fully implemented:
The implementation is correct:
Files Reviewed
|
There was a problem hiding this comment.
Pull request overview
This PR addresses rewind buffer overflows during TAR format detection on non-seekable .tar.bz2 streams by ensuring the rewind ring buffer is large enough to accommodate BZip2’s large first-block reads during probing.
Changes:
- Adds an optional minimum buffer size parameter to
SharpCompressStream.StartRecording(...)to allow upfront ring-buffer sizing. - Introduces
TarWrapper.MinimumRewindBufferSizeplus a computedTarWrapper.MaximumRewindBufferSize(BZip2 sets 900,000 bytes) and wires this into TAR reader/detection paths. - Adds regression and sizing tests for TAR+BZip2 non-seekable scenarios and
StartRecording(minBufferSize)behavior.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/SharpCompress.Test/Tar/TarReaderTests.cs | Adds regression test for non-seekable .tar.bz2 and asserts wrapper minimum rewind sizes. |
| tests/SharpCompress.Test/Streams/SharpCompressStreamSeekTest.cs | Adds tests intended to cover StartRecording(minBufferSize) and default behavior. |
| src/SharpCompress/Readers/Tar/TarReader.Factory.cs | Ensures TAR reader stream is created with a rewind buffer sized to at least the wrapper maximum. |
| src/SharpCompress/packages.lock.json | Updates locked versions for Microsoft.NET.ILLink.Tasks for net8.0/net10.0 targets. |
| src/SharpCompress/IO/SharpCompressStream.cs | Extends StartRecording with optional minimum buffer size and allocates ring buffer accordingly. |
| src/SharpCompress/IO/SeekableSharpCompressStream.cs | Updates override signature for the new StartRecording(int? ...) API. |
| src/SharpCompress/Factories/TarWrapper.cs | Adds per-wrapper minimum rewind sizing and computes a global maximum across wrappers. |
| src/SharpCompress/Factories/TarFactory.cs | Uses TarWrapper.MaximumRewindBufferSize when starting recording for TAR detection. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| @@ -190,10 +200,14 @@ public virtual void StartRecording() | |||
| ); | |||
| } | |||
|
|
|||
| // Ensure ring buffer exists | |||
| // Allocate ring buffer with the requested minimum size (at least the global default). | |||
| if (_ringBuffer is null) | |||
| { | |||
| _ringBuffer = new RingBuffer(Constants.RewindableBufferSize); | |||
| var size = | |||
| minBufferSize.GetValueOrDefault() > Constants.RewindableBufferSize | |||
| ? minBufferSize.GetValueOrDefault() | |||
| : Constants.RewindableBufferSize; | |||
| _ringBuffer = new RingBuffer(size); | |||
| } | |||
There was a problem hiding this comment.
The new StartRecording(minBufferSize) only allocates a ring buffer when _ringBuffer is null. For non-seekable streams created via SharpCompressStream.Create(...), _ringBuffer is typically preallocated in the factory constructor, so a larger minBufferSize passed here will be silently ignored even if the existing buffer is smaller than the requested minimum. Consider either (a) validating that the existing ring buffer length meets the requested minimum and throwing a clear exception if not, or (b) ensuring callers that need a larger buffer always pass it at Create-time and documenting that StartRecording cannot increase an already-allocated buffer.
| // because EnsureMinimumRewindBufferSize expands the ring buffer | ||
| // to hold the BZip2 block before calling IsTarFile. |
There was a problem hiding this comment.
This regression test comment references EnsureMinimumRewindBufferSize expanding the ring buffer, but there is no such API in the codebase and the implementation appears to rely on allocating the correct buffer size up-front (e.g., via TarWrapper.MaximumRewindBufferSize / StartRecording(minBufferSize)). Please update the comment to match the actual mechanism so the test remains an accurate explanation of the failure mode and fix.
| // because EnsureMinimumRewindBufferSize expands the ring buffer | |
| // to hold the BZip2 block before calling IsTarFile. | |
| // because the rewind ring buffer is allocated with enough capacity | |
| // up front (e.g. via MaximumRewindBufferSize / StartRecording) to hold | |
| // the BZip2 block before calling IsTarFile. |
| var stream = SharpCompressStream.Create(nonSeekableMs, largeSize); | ||
|
|
||
| // Pass the required size upfront — no expansion needed later |
There was a problem hiding this comment.
This test doesn’t currently exercise the new StartRecording(minBufferSize) allocation path: SharpCompressStream.Create(nonSeekableMs, largeSize) already preallocates the ring buffer with largeSize, so StartRecording(largeSize) is effectively redundant. To actually validate the new behavior, construct the stream in a way that leaves _ringBuffer null until StartRecording (or start with a smaller buffer and assert the expected behavior when a larger minimum is requested).
| var stream = SharpCompressStream.Create(nonSeekableMs, largeSize); | |
| // Pass the required size upfront — no expansion needed later | |
| var stream = SharpCompressStream.Create(nonSeekableMs); | |
| // Pass the required size upfront — forces allocation to at least largeSize |
| bufferSize: Math.Max( | ||
| readerOptions.RewindableBufferSize ?? 0, | ||
| TarWrapper.MaximumRewindBufferSize | ||
| ) | ||
| ); |
There was a problem hiding this comment.
TarReader.Factory now forces the rewind buffer size to at least TarWrapper.MaximumRewindBufferSize (currently 900,000 bytes) on non-seekable streams, even when the caller explicitly sets ReaderOptions.RewindableBufferSize smaller. This is a noticeable memory increase for non-BZip2 tar streams (e.g., tar.gz). If possible, consider honoring a smaller caller-specified size for formats that don’t require the maximum, or document that TarReader will always allocate at least this amount for non-seekable streams.
| bufferSize: Math.Max( | |
| readerOptions.RewindableBufferSize ?? 0, | |
| TarWrapper.MaximumRewindBufferSize | |
| ) | |
| ); | |
| bufferSize: readerOptions.RewindableBufferSize ?? TarWrapper.MaximumRewindBufferSize | |
| ); |
…(DStreamInSize=131075) Agent-Logs-Url: https://github.com/adamhathcock/sharpcompress/sessions/d5a696df-2906-4262-acf9-06cc174e91fc Co-authored-by: adamhathcock <527620+adamhathcock@users.noreply.github.com>
Done in 456a7df. |
|
Closing in favor of #1273 |
BZip2 decompresses entire blocks before producing output. At compression level 9 the first block can hold up to 900 KB uncompressed; with incompressible data the compressed bytes approach that size. Similarly, ZStandard blocks can be up to 128 KB (ZSTD_BLOCKSIZE_MAX). The default 80 KB ring buffer was too small for both formats, causing a rewind overflow during
IsTarFileformat detection on non-seekable streams.Changes
SharpCompressStream.StartRecording(int? minBufferSize = null)— the ring buffer is now allocated with the correct size atStartRecordingtime and never reallocated. Passing aminBufferSizegreater thanConstants.RewindableBufferSizeallocates a larger buffer upfront. No-op for seekable streams (which use native seeking).Constants.RewindableBufferSize— reverted to 80 KB (81920 bytes). Formats that require larger buffers declare their own minimum viaTarWrapper.MinimumRewindBufferSize.TarWrapper.MinimumRewindBufferSize— new property (opt-in via constructor param) declaring how large a ring buffer format detection requires for a given wrapper:BZip2Constants.baseBlockSize * 9 = 900 000 bytesZstandardConstants.DStreamInSize = 131 075 bytes(ZSTD_BLOCKSIZE_MAX + ZSTD_blockHeaderSize)Constants.RewindableBufferSizeZstandardConstants.BlockSizeMax/DStreamInSize— new constants matchingZSTD_BLOCKSIZE_MAX(131072) andZSTD_DStreamInSize(131075), used to size the ring buffer for ZStandard format detection.TarWrapper.MaximumRewindBufferSize— new static property: the pre-computed maximumMinimumRewindBufferSizeacross all wrappers (900 000 bytes, driven by BZip2). Used as the buffer size at stream construction so no format-specific branching is needed at call sites.TarFactory/TarReader.Factory— passTarWrapper.MaximumRewindBufferSizewhen creating or starting recording on the stream, ensuring the buffer is correctly sized before any reads occur. Applies to both sync and async paths.The buffer is sized correctly at construction, so seekable streams are unaffected and non-seekable streams always have enough capacity without any post-construction reallocation.