Add support to perform heapDump on exceeded memory limit failures#16669
Add support to perform heapDump on exceeded memory limit failures#16669highker merged 1 commit intoprestodb:masterfrom
Conversation
aweisberg
left a comment
There was a problem hiding this comment.
One nit regarding the name of the option. Otherwise LGTM. Thank you!
There was a problem hiding this comment.
Borderline it might be less useful to do this for revocable memory, but since this is a debug option you enable it seems like it should be fine since exceeding revocable limit is pretty rare.
There was a problem hiding this comment.
This is specific to exceeding the memory limit so the name should probably reflect that.
There was a problem hiding this comment.
I'm fine with this being null given the specialized nature of the whole thing, but if someone picky comes along to commit they might want this to be Optional.
446d8eb to
4e7fc30
Compare
There was a problem hiding this comment.
Does it make sense to check whether the heapDumpFilePath is a valid path?
There was a problem hiding this comment.
That is a good point!
This is a session property per query and not a configuration at startup. It will fail if the path is bad here anyways, but it might make sense to do something like https://github.com/prestodb/presto/blob/master/presto-main/src/main/java/com/facebook/presto/spiller/FileSingleStreamSpillerFactory.java#L101 ?
I am torn on whether the query should fail fast at startup if the file can't be created vs just letting it fail to create the heap dump. It's a bit finicky because we would need to check for each query whether the file can be created.
This is for debug so the temptation is to just do what is simple.
There was a problem hiding this comment.
I did thought of adding some kind of valid path check but did not pursued on it since this is more of a debugging feature. If heapDump is enabled and heapDumpPath is not specified, a tmp path is automatically constructed for heapDump. Users have the ability to override the heapDumpPath, if needed and they are expected to provide a valid path.
Also, the heapDump is a best effort feature and will not fail the query on exceptions. If any exception is thrown, it will be swallowed and query execution will fail with proper EXCEEDED_MEMORY_LIMIT exception.
souravpal
left a comment
There was a problem hiding this comment.
I suspect this will be one of the most useful debug features. Ship it!
There was a problem hiding this comment.
let's move this to com.facebook.presto.util;
There was a problem hiding this comment.
do we need synchronized method if we use AtomicBoolean for isHeapDumpTriggered?
There was a problem hiding this comment.
If we use AtomicBoolean, then we need to first set isHeapDumpTriggered and then trigger the heap dump. While the heapDump is in progress, other threads will skip this logic and throw the exceeded memory exception immediately, which should be OK.
There was a problem hiding this comment.
inline bean
return ManagementFactory.newPlatformMXBeanProxy(server, HOTSPOT_BEAN_NAME, HotSpotDiagnosticMXBean.class);There was a problem hiding this comment.
It's kinda weird having a dump inside an exception and do it synchronically.... Can we at the callsite, when memory exceeds, we throw the error and asynchronically use an executor to do the dump?
There was a problem hiding this comment.
This is a basically a catch-all method for any exceeded memory OOM. If we add this logic at callsite, it will be spread throughout the code and might not look clean. Presto exceeded memory limit failure is like an OOM so getting a heapdump here makes most sense for debugging purpose.
Also, this is a debugging feature and will be disabled in prod. I didn't wanted to over-engineer the solution since this logic will be triggered in control environment while testing a specific query for OOM failures.
There was a problem hiding this comment.
It's important the heap dump be triggered synchronously with the out of memory error because the purpose of the entire thing is capture the state of the heap when the error occurs. If it's asynchronous it allows time for queries to fail and cleanup to occur both in this query and in other queries that are running concurrently.
WDYT?
There was a problem hiding this comment.
If that is the case, shall we add comments to this method (and other's in this class) to indicate this needed and only for debugging purpose.
There was a problem hiding this comment.
we probably don't wanna expose this session property to users..... let's make it hidden; same for the other one
There was a problem hiding this comment.
for the default value, we usually specify it in a config; for this case, featuresConfig would be a good one. Check the other examples in this class.
There was a problem hiding this comment.
I explicitly did not add a Config property for this since we don't expect it to be enabled in prod. This feature will only be enabled by session property when specified. Is it always the case that we should specify a corresponding config property for every session property?
There was a problem hiding this comment.
When I considered it I decided that the way it is structured it doesn't make much sense as a config option. It's a pretty niche usage and it will only dump the heap once per process (there is a guard in heap dumper).
I'm not sure how applicable this is for production clusters given that restriction. It seems like a good restriction just because it limits the blast radius of accidentally enabling the option and blasting it at a production cluster (yikes!).
Just thinking about it makes me think we actually want to prohibit this by default unless the cluster is started with an option permitting it that can't be set via session property (so features config).
There was a problem hiding this comment.
Also, add a comment to these two places to explain the rationale why we don't want a config
There was a problem hiding this comment.
probably we wanna specify the directory path instead of the file name. Then we can encode our own file name like <query_id>_<stage_id>.hprof
There was a problem hiding this comment.
The idea was to keep this simple but this suggestion also sounds good. Lets do this.
There was a problem hiding this comment.
can we make the two files immutable and pass them in throw the constructor? We will have a lot of tasks per query and it's not necessary setting them all the time
There was a problem hiding this comment.
I had asked the similar question in the past in one of the PR and I got this response for why we should not pass it in constructor: #16297 (comment)
There was a problem hiding this comment.
I ran into the same thing. https://github.com/prestodb/presto/blob/master/presto-main/src/main/java/com/facebook/presto/execution/SqlTaskManager.java#L409
QueryContext is constructed from a LoadingCache and when the LoadingCache is constructed there is no session. Nested loading caches in fact. I think there are races where the Task and associated QueryContext need to be retrieved before the session is available because messages for a task can arrive before the associated session information does.
Seems like we could simplify by having one setter with two params though?
There was a problem hiding this comment.
they could be final if we get them from constructor
There was a problem hiding this comment.
- this could be
private static final - call it
IS_HEAPDUMP_TRIGGEREDand move it closer toHOTSPOT_BEAN_NAME
There was a problem hiding this comment.
Do we need a
finally {
isHeapDumpTriggered.set(false);
}There was a problem hiding this comment.
We dont. We want to trigger heapdump only once. As soon as the first thread sets the atomic boolean to TRUE, no other thread should be able to trigger a heapdump after that.
There was a problem hiding this comment.
If that is the case, shall we add comments to this method (and other's in this class) to indicate this needed and only for debugging purpose.
There was a problem hiding this comment.
Also, add a comment to these two places to explain the rationale why we don't want a config
|
Could you update the |
We don't have any easy way to get heapdump on exceeded_memory_limit failures. Having an ability to trigger a heapdump in such cases will greatly improve debugging experiences around query OOMs.
3908230 to
e4ed693
Compare
We don't have any easy way to get heapdump on exceeded_memory_limit
failures. Having an ability to trigger a heapdump in such cases will
greatly improve debugging experiences around query OOMs.
Test plan - Manually ran queries with session property enabled and heap snapshot was dumped into the specified file