Skip to content

Fix memory accounting leak in aggregation#12333

Merged
arhimondr merged 6 commits intotrinodb:masterfrom
arhimondr:fix-memory-accounting-leak
May 13, 2022
Merged

Fix memory accounting leak in aggregation#12333
arhimondr merged 6 commits intotrinodb:masterfrom
arhimondr:fix-memory-accounting-leak

Conversation

@arhimondr
Copy link
Copy Markdown
Contributor

Description

All the memory pool related failures from #11275 I checked are INSERT queries with a memory leak reported in either an AggregationOperator or an HashAggregationOperator.

I managed to reproduce it locally only once by running the INSERT INTO invalid_partition_value VALUES (4, 'test' || chr(13)) query in a loop and checking whether cluster memory pools are free upon competition.

The issue is incredibly difficult to reproduce as there are two things that must go wrong:

  1. A TableWriterOperator instance must happen to get used after close (fixed by Ensure no methods are called on operator after close)
  2. The TableWriterOperator operator must trigger an internal aggregation state update to cause it's memory context to get updated

Under normal circumstances the OperatorContext is destroyed during an operator close what prevents any further memory allocations for a given operator. However the OperatorContext for a nested operator within the TableWriterOperator wasn't being properly closed (fixed by Destroy OperatorContext for nested operators).

Also i found that memory can be still be reserved after memory context close (by using tryReserve). Fixed by Ensure no memory can be allocated after memory context close

Is this change a fix, improvement, new feature, refactoring, or other?

Fix

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

Core

How would you describe this change to a non-technical end user or system administrator?

Under rare circumstances DML queries (INSERT,CTAS, etc.) could not release reserved memory what could potentially result in cluster eventually running out of memory.

Related issues, pull requests, and links

Fixes #11275

Documentation

(x) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

(x) No release notes entries required.
( ) Release notes entries required with the following suggested text:

# Section
* Fix some things. ({issue}`issuenumber`)

@cla-bot cla-bot bot added the cla-signed label May 11, 2022
@arhimondr arhimondr requested review from linzebing and losipiuk May 11, 2022 00:04
@findepi findepi changed the title Fix memory accounting leak Fix memory accounting leak in aggregation May 11, 2022
@findepi
Copy link
Copy Markdown
Member

findepi commented May 11, 2022

Failure may be related

Error:  Failures: 
Error:    TestDriver.testBrokenOperatorAddSource:300 
Expecting a cause with type:
  <"io.trino.spi.TrinoException">
and message:
  <"Driver was interrupted">
but type was:
  <"java.lang.RuntimeException">
and message was:
  <"Interrupted">.

Copy link
Copy Markdown
Member

@losipiuk losipiuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM % CI failure

@arhimondr arhimondr force-pushed the fix-memory-accounting-leak branch from 21ae538 to 479c5a1 Compare May 12, 2022 22:49
@arhimondr
Copy link
Copy Markdown
Contributor Author

@findepi @losipiuk I had to add two more commits (Unify Driver#process method,Make sure Driver interrupt always handled properly) to fix the test failure. Please take a look.

@findepi findepi requested review from dain and sopel39 May 13, 2022 07:22
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: This is always called with 1; maybe replace with processOnce()?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about that. It feels like processOnce is less intuitive. Though I don't have a strong opinion here.

arhimondr added 6 commits May 13, 2022 11:20
Nested operators are used to compute column level statistics on write
An interrupt might occur when isFinishedInternal is executed. If it
occurs at that step it should still be post-processed correctly.
Some tests are written to fail with an external failure what causes
tasks being retried several times.
@arhimondr arhimondr force-pushed the fix-memory-accounting-leak branch from 479c5a1 to e2c54df Compare May 13, 2022 15:30
@arhimondr
Copy link
Copy Markdown
Contributor Author

CI: #12385

@arhimondr arhimondr merged commit cfb0e34 into trinodb:master May 13, 2022
@arhimondr arhimondr deleted the fix-memory-accounting-leak branch May 13, 2022 17:36
@github-actions github-actions bot added this to the 381 milestone May 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

Flaky tests when asserting memory pools are empty upon QueryRunner termination in integration tests

4 participants