
feat(5851): ArrowWriter memory usage #5967

Merged
merged 15 commits into apache:master on Jul 2, 2024

Conversation

Contributor

@wiedld wiedld commented Jun 27, 2024

Which issue does this PR close?

Closes #5851
(Although follow-up PRs may add more bytes to the accounting.)

Rationale for this change

We have several profiling test cases that compare DataFusion's tracked MemoryReservations with the actual peak memory usage. The largest single difference was in the tracking of memory used during parquet encoding (via the ArrowWriter). Here is a summary of the discrepancy per test case:

| Use case | Root caller | Profiled heap peak (actual used) | Peak reserved bytes (DataFusion estimate) | Difference (actual - estimate) |
|---|---|---|---|---|
| test case 1, incl 61 columns | TrackedMemoryArrowWriter | 798.9 MB | 49.4 MB | -749.5 MB |
| test case 2, incl 4869 columns | TrackedMemoryArrowWriter | 4.1 GB | 220.1 MB | -3.9 GB |
| test case 3, incl 1042 columns | TrackedMemoryArrowWriter | 3.3 GB | 98.7 MB | -3.2 GB |

These results provided significant motivation to fulfill the existing upstream feature request for an ArrowWriter API exposing the memory used during encoding (refer to #5851). Until now, we have been reserving memory based on the anticipated encoded (compressed) size, as that was the only API available on the ArrowWriter.

This PR introduces a new memory_size() API, defined as the already-encoded size plus the uncompressed/unflushed bytes held in buffers. For now, we limit the accounting of unflushed bytes to the DictEncoder (although future PRs may expand this accounting). This change alone had a significant impact on test case 3:

| Use case | Root caller | Profiled heap peak (actual used) | Peak reserved bytes (DataFusion estimate) | Difference (actual - estimate) |
|---|---|---|---|---|
| test case 3, CONTROL | TrackedMemoryArrowWriter | 3.3 GB | 98.7 MB | -3.2 GB |
| test case 3, WITH THESE CHANGES | TrackedMemoryArrowWriter | 3.3 GB | 2.2 GB | -1.1 GB |

Accounting for the DictEncoder's unflushed bytes improved our memory tracking by ~2 GB in this test case. We anticipate follow-up PRs that expand the memory_size() accounting to cover the other test cases as well.

What changes are included in this PR?

  • Delineate two APIs: the existing method for the anticipated encoded size versus the new API for the memory size during encoding.
  • Implement the new memory_size() API to include accounting for unflushed DictEncoder bytes.

Are there any user-facing changes?

Yes, the new ArrowWriter::memory_size() API.
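
As a rough illustration of how a caller might track a memory budget with the new API, here is a minimal sketch. The grow_reservation() hook is a hypothetical stand-in for the caller's own accounting (e.g. a DataFusion MemoryReservation); the rest uses the ArrowWriter API described above:

    use arrow_array::RecordBatch;
    use arrow_schema::SchemaRef;
    use parquet::arrow::ArrowWriter;
    use parquet::errors::Result;

    /// Hypothetical hook into the caller's memory accounting.
    fn grow_reservation(_additional_bytes: usize) {}

    fn write_with_tracking(schema: SchemaRef, batches: &[RecordBatch]) -> Result<Vec<u8>> {
        let mut buffer = Vec::new();
        let mut writer = ArrowWriter::try_new(&mut buffer, schema, None)?;
        let mut reserved = 0usize;
        for batch in batches {
            writer.write(batch)?;
            // memory_size() covers bytes already written plus the current
            // in-memory size of unflushed encoder buffers (e.g. the DictEncoder).
            let used = writer.memory_size();
            if used > reserved {
                grow_reservation(used - reserved);
                reserved = used;
            }
        }
        writer.close()?;
        Ok(buffer)
    }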

@github-actions bot added the parquet label (Changes to the parquet crate) Jun 27, 2024
@wiedld wiedld marked this pull request as ready for review June 27, 2024 16:32
Contributor

@alamb alamb left a comment

Thanks @wiedld -- this is looking great

My biggest comment/confusion about this PR is that, as written, it was somewhat unclear that ArrowWriter::memory_size includes both the size of in-progress buffers (not yet tracked by in_progress_size) AND in_progress_size. It was then hard for me to convince myself that the code correctly accounts for both (as the memory use calculation needs to be a superset of in_progress_size plus other memory).

I think as written and explained this is confusing. I think we should:

  1. Update the documentation to make it clear that ArrowWriter::memory_size includes both calculations (for example, explicitly say that the internal buffer size is memory_size() - in_progress_size())
  2. Change the implementation of ArrowWriter::memory_size to explicitly add the in progress estimated_total_bytes and memory size so it is easier to verify (I left a more specific suggestion inline)

In addition, can we update the docs here:
https://docs.rs/parquet/latest/parquet/arrow/arrow_writer/struct.ArrowWriter.html#memory-limiting

to mention estimated_memory_size as well?

Something like: "The writer itself has internal buffers which can consume substantial amounts of memory, especially for data that encodes very efficiently. ArrowWriter::memory_size can be used to track the size of these internal memory buffers."

Comment on lines +211 to +213
    match &self.in_progress {
        Some(in_progress) => in_progress.writers.iter().map(|x| x.memory_size()).sum(),
        None => 0,
Contributor

It took me a while to understand that this calculation actually includes the in_progress_size too.

What do you think about making this more explicit like

Suggested change

    // from:
    match &self.in_progress {
        Some(in_progress) => in_progress.writers.iter().map(|x| x.memory_size()).sum(),
        None => 0,

    // to:
    match &self.in_progress {
        Some(in_progress) => in_progress.writers.iter().map(|x| x.memory_size() + x.get_estimated_total_bytes()).sum(),
        None => 0,

And then change code like

    /// Returns the estimated total memory usage.
    ///
    /// Unlike [`Self::get_estimated_total_bytes`] this is an estimate
    /// of the current memory usage and not its anticipated encoded size.
    #[cfg(feature = "arrow")]
    pub(crate) fn memory_size(&self) -> usize {
        self.column_metrics.total_bytes_written as usize + self.encoder.estimated_memory_size()
    }

to only include the estimated memory size:

    /// Returns the estimated total memory buffer usage.
    #[cfg(feature = "arrow")]
    pub(crate) fn memory_size(&self) -> usize {
        self.encoder.estimated_memory_size()
    }

Contributor Author

@wiedld wiedld Jun 28, 2024

The in_progress_size includes the flushed_bytes + unflushed_bytes.
The memory_size includes flushed_bytes + unflushed_bytes + processing_size.

At first glance, it looks like we could do memory_size = in_progress_size + processing_size. But the calculation actually is:

    in_progress_size = flushed_bytes + unflushed_bytes_anticipated_size_after_flush
    memory_size = flushed_bytes + unflushed_bytes_current_mem_size + processing_size

Per our approach to memory limiting, the unflushed bytes (in buffer) should already be encoded. However, I believe that's not the case for the indices on the dict encoder? As such, the accounting in this PR specifically considers unflushed_bytes separately -- and pushes the new memory_size() interface down to where we can delineate the DictEncoder from other encoders.

I added a commit to help explain. Please let me know if I misunderstood. 🙏🏼
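
As a schematic of the distinction being drawn here (illustrative only; this is not the actual parquet DictEncoder): a dictionary encoder buffers raw indices that are only bit-packed at flush time, so their current in-memory size can be much larger than their anticipated size after flush:

    // Illustrative sketch, not the real parquet internals.
    struct SketchDictEncoder {
        indices: Vec<usize>, // buffered, un-encoded dictionary indices
        bit_width: u8,       // width used when the indices are bit-packed at flush
    }

    impl SketchDictEncoder {
        /// What the buffered indices occupy in memory right now.
        fn unflushed_bytes_current_mem_size(&self) -> usize {
            self.indices.capacity() * std::mem::size_of::<usize>()
        }

        /// What the same indices are anticipated to occupy once bit-packed
        /// and flushed -- typically far smaller than the in-memory size.
        fn unflushed_bytes_anticipated_size_after_flush(&self) -> usize {
            (self.indices.len() * self.bit_width as usize).div_ceil(8)
        }
    }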

Contributor

@alamb alamb Jul 1, 2024

I am thinking from a user's point of view (e.g. our use case in InfluxDB).

If I want to know how much memory the ArrowWriter is using (so I can track it, against a budget for example) what API should I use?

I was worried that I couldn't use

  1. ArrowWriter::memory_size, because that does not include the estimated data page sizes, AND I couldn't use
  2. ArrowWriter::memory_size() + ArrowWriter::in_progress_size(), because that would double count previously buffered data.

However, after reviewing the code more closely, I see the difference is that ArrowWriter::in_progress_size includes an estimate of how large the parquet-encoded data will be after flush (which is not actually memory currently used), and which presumably in most cases will be smaller than the actual bytes used. I will try to update the comments as well to clarify this.

parquet/src/column/writer/encoder.rs (resolved)
parquet/src/arrow/arrow_writer/byte_array.rs (outdated, resolved)
@wiedld wiedld marked this pull request as draft June 28, 2024 17:27
@wiedld wiedld force-pushed the 5851/arrow-writer-memory-usage branch from b88ac6f to 25738dc June 28, 2024 17:53
@wiedld wiedld marked this pull request as ready for review June 28, 2024 18:09
@wiedld wiedld marked this pull request as draft June 28, 2024 18:14
@wiedld wiedld force-pushed the 5851/arrow-writer-memory-usage branch from 25738dc to dcefe9e June 28, 2024 19:13
@wiedld wiedld marked this pull request as ready for review June 28, 2024 19:26
Comment on lines 243 to 244
    // Whereas for all other encoders the buffer contains encoded bytes.
    // Therefore, we can use the estimated_data_encoded_size.
Contributor Author

Maybe the phrase "all other encoders" should be changed to "it's presumed for all other encoders", since this is a moving target.

Contributor

I reviewed this carefully and I think it is worth not intermixing the encoded estimate with the memory usage, so I took the liberty of implementing estimated_memory_size for Encoder as well

@leoyvens

leoyvens commented Jul 1, 2024

For benchmarking, you may want to include allocators that are less prone to memory fragmentation like snmalloc (rust crate).

@alamb alamb marked this pull request as ready for review July 1, 2024 16:54
@@ -334,6 +338,10 @@ impl DictEncoder {
        num_required_bits(length.saturating_sub(1) as u64)
    }

    fn estimated_memory_size(&self) -> usize {
        self.interner.storage().page.len() + self.indices.len() * 8
Contributor

I think we also need to account for the interner's dedup hash table. I added some code to do this in XX

Also, in general a more accurate estimate of memory usage is capacity() * element_size
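
For illustration, a capacity-based variant of the estimate in this diff (a sketch; it assumes the interner exposes an estimated_memory_size() method, as later commits in this PR add, and keeps the 8-byte index size implied by the `* 8` above):

    fn estimated_memory_size(&self) -> usize {
        // capacity() * element_size reflects the actual allocation,
        // even when the Vec is not yet full.
        self.interner.estimated_memory_size()
            + self.indices.capacity() * std::mem::size_of::<usize>()
    }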

@alamb
Contributor

alamb commented Jul 1, 2024

Sorry, I lost some of my in-progress comments -- I spent some time trying to clarify the documentation for this PR along with the implementation (I threaded through some additional memory calculations for the encoder). Let me know what you think @wiedld

    /// Estimated memory usage, in bytes, of this `ArrowWriter`
    ///
    /// See [ArrowWriter::memory_size] for more information.
    pub fn memory_size(&self) -> usize {
Contributor

I also added these wrappers for symmetry with the other AsyncWriter methods
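
A sketch of such a delegating wrapper (the sync_writer field name is an assumption about the async writer's internals):

    /// Estimated memory usage, in bytes, of this `AsyncArrowWriter`
    ///
    /// See [ArrowWriter::memory_size] for more information.
    pub fn memory_size(&self) -> usize {
        // Delegate to the wrapped synchronous ArrowWriter, which owns
        // the encoders and their buffers.
        self.sync_writer.memory_size()
    }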

@@ -32,6 +32,9 @@ pub trait Storage {

    /// Adds a new element, returning the key
    fn push(&mut self, value: &Self::Value) -> Self::Key;

    /// Return an estimate of the memory used in this storage, in bytes
    fn estimated_memory_size(&self) -> usize;
Contributor

Despite this being 'pub trait' it is not pub outside the module: https://docs.rs/parquet/latest/parquet/?search=storage

@@ -93,10 +93,17 @@ pub trait ColumnValueEncoder {
    /// Returns true if this encoder has a dictionary page
Contributor

likewise, while this trait is marked pub it is not exposed outside the crate: https://docs.rs/parquet/latest/parquet/?search=ColumnValueEncoder

Comment on lines 387 to 389
    /// Return the total in memory size of this bloom filter in bytes
    pub(crate) fn memory_size(&self) -> usize {
        self.0.capacity() * std::mem::size_of::<Block>()
Contributor Author

👍🏼

Comment on lines 311 to 315

    fn estimated_memory_size(&self) -> usize {
        self.page.capacity() * std::mem::size_of::<u8>()
            + self.values.capacity() * std::mem::size_of::<std::ops::Range<usize>>()
    }
Contributor Author

👍🏼

Comment on lines +89 to +92
    pub fn estimated_memory_size(&self) -> usize {
        self.storage.estimated_memory_size() +
            // estimate size of dedup hashmap as just the size of the keys
            self.dedup.capacity() * std::mem::size_of::<S::Key>()
Contributor Author

👍🏼

@alamb
Contributor

alamb commented Jul 2, 2024

I plan to merge this PR in an hour or two unless anyone else would like time to review

Contributor Author

@wiedld wiedld left a comment

I cannot approve my own PR; so consider this an approval of @alamb 's changes 😆 .

@alamb alamb merged commit 5c6f857 into apache:master Jul 2, 2024
16 checks passed
@alamb
Contributor

alamb commented Jul 2, 2024

Thanks again @wiedld

This pull request was closed.
Labels
parquet Changes to the parquet crate

Successfully merging this pull request may close these issues.

API to get memory usage for parquet ArrowWriter