Skip to content

Support for killing individual task if cluster is out of memory#11129

Merged
losipiuk merged 5 commits intotrinodb:masterfrom
losipiuk:lo/kill-by-task
Mar 8, 2022
Merged

Support for killing individual task if cluster is out of memory#11129
losipiuk merged 5 commits intotrinodb:masterfrom
losipiuk:lo/kill-by-task

Conversation

@losipiuk
Copy link
Copy Markdown
Member

@losipiuk losipiuk commented Feb 21, 2022

Description

Support for killing individual task if cluster is out of memory.
Individual tasks will be killed for queries which are run with task-level retries.
Currently only supported if default low memory killer policy is used (total-reservation-on-blocked-nodes).

Is this change a fix, improvement, new feature, refactoring, or other?

improvement to new feature

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

core query engine

How would you describe this change to a non-technical end user or system administrator?

N/A

Documentation

(x) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

( ) No release notes entries required.
(x) Release notes entries required with the following suggested text:

# Section
* Add support for killing individual tasks of queries run with ``retry-mode`` set to ``TASK`` low memory killer.
  Requires ``query.low-memory-killer.policy`` config option to be set to ``total-reservation-on-blocked-nodes``. ({issue}`11129`)

@cla-bot cla-bot bot added the cla-signed label Feb 21, 2022
@linzebing linzebing self-requested a review March 1, 2022 18:04
@losipiuk losipiuk force-pushed the lo/kill-by-task branch 2 times, most recently from c818b5a to d868290 Compare March 1, 2022 22:33
@losipiuk losipiuk marked this pull request as ready for review March 1, 2022 22:33
Copy link
Copy Markdown
Contributor

@arhimondr arhimondr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me % comments

@losipiuk
Copy link
Copy Markdown
Member Author

losipiuk commented Mar 2, 2022

AC

@losipiuk losipiuk force-pushed the lo/kill-by-task branch 2 times, most recently from e0545dd to b9797dd Compare March 2, 2022 15:09
@losipiuk losipiuk requested a review from martint March 2, 2022 19:28
@losipiuk
Copy link
Copy Markdown
Member Author

losipiuk commented Mar 2, 2022

@martint could you maybe take a look?

@losipiuk
Copy link
Copy Markdown
Member Author

losipiuk commented Mar 3, 2022

Major changes:

  • Rebased on top of Support for killing individual task if cluster is out of memory #11129.
  • Dropped mechanism of per-task memory limit. We are not enforcing limits right now. We are estimating tasks size, and bin-packing based on that, but we allow it to go above. If memory pool is exhaust memory kill will intervene and kill one of the tasks
  • I dropped integration test. With memory-killer approach I was not able to make it work (deterministically enforce that one of the tasks would require full not allocation)

@arhimondr PTAL once again

NVM - comment was meant for different PR

losipiuk added 5 commits March 7, 2022 13:52
Extend MemoryInfo data structure passed from workers to coordinator
via `v1/memory` resource with memory reservation for all tasks which
are running as part of queries with task-level retries enabled.

The extra information is needed for implementing task aware cluster
out of memory killer.
@losipiuk
Copy link
Copy Markdown
Member Author

losipiuk commented Mar 8, 2022

CI: #10631 (Table '...' not found)

@losipiuk losipiuk merged commit 14f01f2 into trinodb:master Mar 8, 2022
@github-actions github-actions bot added this to the 373 milestone Mar 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

3 participants