Avoid over-retaining memory in PartitioningExchanger #9327

Merged: sopel39 merged 1 commit into trinodb:master from pettyjamesm:compact-partitioned-pages on Sep 24, 2021
Conversation


@pettyjamesm pettyjamesm commented Sep 21, 2021

Compact pages inside of PartitioningExchanger when all rows go to the same partition so that the behavior is consistent with when pages get sub-partitioned and page.copyPositions is called.

The reason for doing this is a bit non-local to the PartitioningExchanger implementation. Before the recent change that special-cased the "all positions go to the same partition" scenario, PartitioningExchanger would call copyPositions unconditionally. That inefficiency was incidentally mitigating another problem:

  1. VariableWidthBlockEncoding is implemented such that it simply refers to the SerializedPage#slice it is deserialized from (ie: it does not copy the contents into a new Slice). In the case of a remote exchange, that is a Slice covering all of the page's data, including data that was already copied out into new buffers (eg: LongArrayBlock instances). This means that the retained size of deserialized VariableWidthBlock instances is always higher than the size of the block after a remote exchange or after being unspilled from disk.
  2. Page#getRetainedSize simply sums the retained size of each block, but when multiple VariableWidthBlock columns are deserialized from the same input slice after a remote exchange, each one will report that it retains the entire (same) input slice. This makes the page retained size effectively <input slice size> * <number of variable width columns>, which can be a huge over-report of retained memory at the page level.
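To make point 2 concrete, here is a minimal, self-contained Java sketch of the over-counting; the types and method are hypothetical stand-ins, not Trino's real Block/Slice/Page classes:

```java
// Hypothetical stand-in illustrating the over-counting above; not Trino's real API.
public class NaiveRetainedSize
{
    // Each deserialized variable-width block reports that it retains the entire
    // shared input slice, so a per-block sum multiplies the slice size by the
    // number of variable-width columns.
    public static long naiveRetainedSize(int sliceSizeBytes, int variableWidthColumns)
    {
        long total = 0;
        for (int i = 0; i < variableWidthColumns; i++) {
            total += sliceSizeBytes;
        }
        return total;
    }

    public static void main(String[] args)
    {
        // A 1 MB serialized page with 3 variable-width columns reports 3 MB retained
        System.out.println(naiveRetainedSize(1_000_000, 3));
    }
}
```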

This was identified as an issue in Presto and, as a result, the change comparable to the one that introduced the pass-through behavior in Trino was reverted there. Now, the options to resolve this are basically:

  1. Copy VariableWidthBlock values out of their input slice as part of deserializing. This solves the problem, because each block then retains its own separate compact slice after deserialization, but it comes with a performance penalty.
  2. Refactor getRetainedSize to be aware of previously accounted for object instances (ie: Slice identity) so that shared slices backing multiple columns are only counted once (I'm not sure this is straightforward to implement correctly)
  3. Reintroduce something equivalent to the copyPositions logic (ie: Page#compact()) to avoid that problem here, so that the behavior is at least consistent between the pass-through and partitioning scenarios in PartitioningExchanger (this only reduces the blast radius of the problem above, it does not address it fully).
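For option 2, a hedged sketch of identity-aware accounting could look like the following; the Block type and method are hypothetical illustrations, and Trino's actual getRetainedSize does not work this way today:

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

// Sketch of option 2: count a slice shared by multiple blocks only once, by
// tracking previously seen instances with identity semantics. Hypothetical types.
public class IdentityAwareRetainedSize
{
    static class Block
    {
        final byte[] backingSlice; // possibly shared across blocks

        Block(byte[] backingSlice)
        {
            this.backingSlice = backingSlice;
        }
    }

    public static long retainedSize(List<Block> blocks)
    {
        // Identity set: the same slice instance contributes exactly once
        Set<byte[]> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        long total = 0;
        for (Block block : blocks) {
            if (seen.add(block.backingSlice)) {
                total += block.backingSlice.length;
            }
        }
        return total;
    }

    public static void main(String[] args)
    {
        byte[] shared = new byte[1_000_000];
        // Two columns backed by the same slice: counted once, not twice
        System.out.println(retainedSize(List.of(new Block(shared), new Block(shared))));
    }
}
```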

@sopel39 sopel39 Sep 22, 2021

I don't think we should compact data eagerly here. These pages will usually land in PagesIndex anyway, which will compact pages (possibly eagerly); see io.trino.operator.PagesIndex#compact.

pettyjamesm (Member, Author):

Having PagesIndex#compact handle this case will actually prevent HashBuilderOperator from spilling to disk, because HashBuilderOperator#startMemoryRevoke() decides that a reduction in retained size of 20% or better from compaction is sufficient to avoid starting to spill.
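The interaction can be sketched as follows; this is a simplified illustration of the 20% heuristic described above with an assumed method shape, not the actual HashBuilderOperator code:

```java
// Simplified illustration of the revoke decision described above; the real
// HashBuilderOperator#startMemoryRevoke logic is more involved.
public class MemoryRevokeHeuristic
{
    // If compaction would reclaim at least 20% of retained memory, prefer
    // compacting over spilling to disk.
    public static boolean compactInsteadOfSpill(long retainedBytes, long compactedBytes)
    {
        long reclaimable = retainedBytes - compactedBytes;
        return reclaimable * 5 >= retainedBytes; // reclaimable >= 20% of retained
    }

    public static void main(String[] args)
    {
        // An over-retaining page: reports 10 MB retained but compacts to 4 MB,
        // so compaction looks highly effective and spilling is skipped
        System.out.println(compactInsteadOfSpill(10_000_000, 4_000_000));
        // An already-compact page: compaction reclaims little, so spilling proceeds
        System.out.println(compactInsteadOfSpill(10_000_000, 9_500_000));
    }
}
```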

@pettyjamesm pettyjamesm force-pushed the compact-partitioned-pages branch from b9d753a to 463e30a on September 22, 2021 17:00
sopel39 (Member):

Copying is as expensive as compaction. I think we should:

  1. either perform the positions copy outside of the lock too, or
  2. just do compaction here.

I think 2) is better, and then we can potentially fix it in a subsequent PR.

pettyjamesm (Member, Author):

Unfortunately copyPositions can't be performed outside of the lock because the lock is protecting the IntArrayList[] positionAssignments (and the inner IntArrayList entries) from another thread entering the partitioning loop and overwriting the position values before it has finished using them.

The single partition case is special in that, once we've detected it, we know we're done with the positions lists and it's safe for other threads to proceed. That said, compacting the page inside the synchronized block is still better than not compacting at all, so I'm ok with leaving the compaction inside the synchronized region for now (although I would still plan to change it back afterwards). Let me know if you'd still prefer to fix the issue with approach 2 for now.

sopel39 (Member):

> Unfortunately copyPositions can't be performed outside of the lock because the lock is protecting the IntArrayList[] positionAssignments (and the inner IntArrayList entries) from another thread entering the partitioning loop and overwriting the position values before it has finished using them.

Is it really faster to reuse those arrays?

> That said, compacting the page inside of the synchronized block is still better compared to not, so I'm ok with leaving the compaction inside the synchronized region for now (although I would still plan to change it back afterwards). Let me know if you'd still prefer to fix the issue with approach 2 for now.

Let's split it into two PRs.

@pettyjamesm pettyjamesm Sep 23, 2021

> Let's split it into two PRs

Sure, will do.

> Is it really faster to reuse those arrays?

I think this is more of an allocation-rate and total-footprint consideration. If each driver feeding the LocalExchangeSink had its own partitioning arrays, that would be task_concurrency * task_concurrency instances of IntArrayList, each retaining an int[] sized to the largest input page position count seen. By sharing the IntArrayList[] partitionAssignments, that becomes a one-time scratch structure with access guarded by a mutex. I think there should be a way to do the initial page partition assignments outside of the critical section to reduce how long the lock is held, but that's a more dramatic experimental change to make.
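The footprint argument above can be sketched numerically; this is a back-of-the-envelope illustration with hypothetical sizes, not measured Trino numbers:

```java
// Back-of-the-envelope comparison of scratch-array footprint, per the comment above.
public class ScratchFootprint
{
    // Per-driver scratch: each of taskConcurrency sinks would keep taskConcurrency
    // position lists, each retaining an int[] sized to the largest page seen.
    public static long perDriverFootprintBytes(int taskConcurrency, int maxPagePositions)
    {
        return (long) taskConcurrency * taskConcurrency * maxPagePositions * Integer.BYTES;
    }

    // Shared scratch guarded by a mutex: one set of position lists for the exchange.
    public static long sharedFootprintBytes(int taskConcurrency, int maxPagePositions)
    {
        return (long) taskConcurrency * maxPagePositions * Integer.BYTES;
    }

    public static void main(String[] args)
    {
        // task_concurrency = 16, largest page seen = 8192 positions
        System.out.println(perDriverFootprintBytes(16, 8192)); // 8388608 (8 MiB)
        System.out.println(sharedFootprintBytes(16, 8192));    // 524288 (512 KiB)
    }
}
```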

@pettyjamesm pettyjamesm force-pushed the compact-partitioned-pages branch from 463e30a to 57ed2e9 on September 23, 2021 14:39
sopel39 (Member):

I think it's a bug. You use positions.elements() after you run positions.clear();

pettyjamesm (Member, Author):

Internally, IntArrayList.clear() just sets IntArrayList#size = 0 but does not modify the int[] elements array. Resetting the size to 0 immediately after the if (size == 0) check simplifies the logic once the single partition case is refactored to return early (in the next PR), since otherwise you would have to ensure the clear() call happens in both branches.
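To illustrate why this is safe under the current fastutil behavior, here is a minimal stand-in class mimicking IntArrayList's clear() semantics; this is an illustration, not fastutil itself:

```java
import java.util.Arrays;

// Minimal stand-in mimicking fastutil IntArrayList's clear() semantics: clear()
// resets size but leaves the backing array untouched, so elements() still
// exposes the previously written values. Illustration only, not fastutil.
public class ReusableIntList
{
    private int[] elements = new int[16];
    private int size;

    public void add(int value)
    {
        if (size == elements.length) {
            elements = Arrays.copyOf(elements, elements.length * 2); // grow by doubling
        }
        elements[size++] = value;
    }

    public int size()
    {
        return size;
    }

    public int[] elements()
    {
        return elements; // the backing array itself, not a copy
    }

    public void clear()
    {
        size = 0; // does not shrink or zero the backing array
    }

    public static void main(String[] args)
    {
        ReusableIntList positions = new ReusableIntList();
        positions.add(7);
        positions.add(42);
        int[] elements = positions.elements(); // capture before clear(), as in the PR
        positions.clear();
        // The data written before clear() is still present in the backing array
        System.out.println(elements[0] + "," + elements[1]); // 7,42
    }
}
```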

sopel39 (Member):

> Internally IntArrayList.clear() just sets IntArrayList#size = 0; but does not modify the int[] elements array

I know, but it doesn't look good, and you don't know that the implementation of clear() won't change in the future.

@pettyjamesm pettyjamesm Sep 23, 2021

Per the IntArrayList javadoc:

> This class implements a lightweight, fast, open, optimized, reuse-oriented version of array-based lists. Instances of this class represent a list with an array that is enlarged as needed when new entries are created (by doubling its current length), but is never made smaller (even on a clear()). A family of trimming methods lets you control the size of the backing array; this is particularly useful if you reuse instances of this class.

I've made the following changes in case these are enough to satisfy you on the justification of calling clear() in advance of using the array:

  • Enhance the comment to mention this assumption
  • Store a local variable int[] elements = positions.elements() before calling clear()

If you still have some concerns I can drop that change out of this PR since it's not strictly required for this PR, and we can resolve how best to handle it as part of the second PR.

Compact pages inside of PartitioningExchanger when all rows go to
the same partition so that the behavior is consistent with when
pages get sub-partitioned and to avoid over-retaining memory on
the page.
@pettyjamesm pettyjamesm force-pushed the compact-partitioned-pages branch from 57ed2e9 to 3beed17 on September 23, 2021 17:07
@sopel39 sopel39 merged commit b7418a9 into trinodb:master Sep 24, 2021
@sopel39 sopel39 mentioned this pull request Sep 24, 2021
@github-actions github-actions bot added this to the 363 milestone Sep 24, 2021
@pettyjamesm pettyjamesm deleted the compact-partitioned-pages branch September 28, 2021 15:20