Worker Low Heap Memory Task Killer#21254
Conversation
|
How is this feature being tested ? Can we also add some monitoring / metrics ? |
There was a problem hiding this comment.
Can this be put into the GcStatsMonitor itself that already exists ?
There was a problem hiding this comment.
Introduce a new class due the existing one is for monitoring the GC activily.
There was a problem hiding this comment.
Pushing back a bit: this one does that too except it acts on the gc signals
There was a problem hiding this comment.
Should we only kill when we haven't been able to reclaim for a while. Like for a couple of times ?
There was a problem hiding this comment.
Keeping track of the memory state is not accurate with this functionality as we won't have visibility into between last full GC and current one if the worker able to free up memory or not.
There was a problem hiding this comment.
I am not following. Why can you not keep the previous bytes after full gc in a class instance variable along with timestamp ?
There was a problem hiding this comment.
I am slightly confused: lowMemoryTaskKillThreshhold is used to multiply with max memory in isLowMemory method and is also used as an absolute value in the RHS ? Is it a ratio ? If not, why is GC bytes compared with it ?
There was a problem hiding this comment.
lowMemoryTaskKillThreshhold is meant to provide the memory threshold that if full GC able to reclaim then the task killer won't kick in. The other threshold is lowMemoryTaskKillerMemoryThreshold is used to mark when do we want to consider worker is running low memory.
There was a problem hiding this comment.
some more stats in this error msg ?
There was a problem hiding this comment.
Should we have some kind of a "quiet period" after it has killed a query to observe the effects ? Like how do we prevent this from going rogue ?
There was a problem hiding this comment.
Full GC won't happen once the query has been killed unless there are more memory hungry queries running and consuming memory resulting in the same scenario. And in that case we want it to kick in and kill them to avoid jvm oom.
There was a problem hiding this comment.
Can you use maxBy or such instead of sorting ?
There was a problem hiding this comment.
Should we add stats.getRevocableMemoryReservationInBytes() too?
94e8205 to
38fb6d3
Compare
NikhilCollooru
left a comment
There was a problem hiding this comment.
Can we add some unit tests ?
presto-main/src/main/java/com/facebook/presto/memory/HighMemoryTaskKillerStrategy.java
Outdated
Show resolved
Hide resolved
38fb6d3 to
ed9516f
Compare
There was a problem hiding this comment.
Can this be enabled is the other two configs are set appropriately? Trying to avoid a third Boolean config about this features
There was a problem hiding this comment.
log.error(ignored, msg) to log the stack trace maybe
There was a problem hiding this comment.
Should the check for info.isMajorGc be done inside here ? Otherwise this method is only called for full gcs.
There was a problem hiding this comment.
Should this be called FREE_MEMORY_... or just TRIGGER_ON_ or even KILL_ON_... ? It doesn't free any memory explicitly other than failing queries.
There was a problem hiding this comment.
can you use currentGarbageCollectedBytes here ... ?
Which strategy are we going to roll out with initially ?
There was a problem hiding this comment.
Why not put the .reversed here ?
There was a problem hiding this comment.
due to having static vs non static method, this is not straight moving the reversed at the top so going to address in followup PR.
There was a problem hiding this comment.
Should the check for isLowMemory be done common to both the approaches ?
There was a problem hiding this comment.
how is the initial case of lastFullGCTimestamp = 0 handled ?
There was a problem hiding this comment.
This second approach of FREE_MEM_ON_FULL_GC is very similar to the first one (even with the reclaimMemoryThreshhold check -- its just time invariant in that it does not look at the prev full gc.
Not a big deal, but can we commonify some code ?
presto-main/src/main/java/com/facebook/presto/memory/HighMemoryTaskKiller.java
Outdated
Show resolved
Hide resolved
ed9516f to
b07023b
Compare
In Meta, we noticed instances where if we run worker on a low memory machine and it runs high memory intensive workload, there are times when worker is running low memory and full GC not able to free up enough memory. This result in worker OOMs when worker continue progressing with the work trying to capture more memory. With this experimental feature, we want worker to kill tasks in case its running in low memory (configurable) and during full GC not able to free up enough memory (configurable). For such scenario worker identifies task consuming the most memory and kill it. There are two strategies implemented in the change when task killer kicked in 1. During full GC if worker is not able to free enough memory (configured) and heap usage is above configured threshold 2. During frequent full GCs (configured) if worker not able to reclaim enough memory (configured)
b07023b to
a89e614
Compare
Description
With this experimental feature, we want worker to kill tasks in case its running in low memory
(configurable) and during full GC not able to free up enough memory (configurable). For such scenario
worker identifies task consuming the most memory and kill it.
There are two strategies implemented in the change when task killer kicked in
Motivation and Context
n Meta, we noticed instances where if we run worker on a low memory machine and
it runs high memory intensive workload, there are times when worker is running low
memory and full GC not able to free up enough memory. This result in worker OOMs when
worker continue progressing with the work trying to capture more memory.
Impact
Able to prevent worker OOM and able to run queries successfully instead of killing all running queries.
Test Plan
Ran shadow in presto verifier where with the feature disable all queries died due to REMOTE_TASK_ERROR or REMOTE_HOST_GONE due to worker OOM. With the feature enabled, subset of the queries failed with INSUFFICIENT_RESOURCES due to worker heap memory is high. And subset able to finish successfully.
Contributor checklist
Release Notes
Please follow release notes guidelines and fill in the release notes below.