-
Notifications
You must be signed in to change notification settings - Fork 5.5k
[native] Fix crash in Taskmanager getResults. #18751
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@amitkdutta CAn you please doublecheck that it is impossible to have this situation:
- getData() request comes and either it creates the new PrestoTask or some other thread does.
- That new PrestoTask does not have Velox Task (it is null).
- We throw this error.
- Velox Task is created for this PrestoTask.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is this special scenario that creates an empty PrestoTask without Velox Task in it. This scenario happens when error occurs while parsing or doing any protocol translation in task update endpoint. The empty shell PrestoTask was created to contain that error so that following getInfo()/getStatus() calls on that task can know what's going on. I think the crash here might be related to taht.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was clearly a race here. Check line 630 of this file. There is an explicit check on error in PrestoTask and by that line prestoTask does not contain an error yet. But here it does at the current line.
It indicates an error PrestoTask being constructed at line 630 and when it gets here the error is populated. I think a blame from another angle would be unsafe usage of prestoTask (without lock).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, this fix will not necessarily will fix the problem.
By that time we might still not have error in the task.
@tanjialiang Can the task in this special case be created with error from the start to avoid the race?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A way to verify is to print out the contained error to see if it is an error related to parsing/json/protocol etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, this fix will not necessarily will fix the problem. By that time we might still not have error in the task. @tanjialiang Can the task in this special case be created with error from the start to avoid the race?
The task(PrestoTask) is created from the start with Error in it. We cannot create a Velox task because error happens while creating that velox task.
The issue here is, while doing checks here and for lines above we never acquire a lock from presto task (idk why, but it seems to be intentional). And presto task state changes while we use stuff inside this presto task.
xiaoxmeng
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@amitkdutta Good catch. Thanks for the fix. Consider to add a unit test to cover this!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is some race conditions in prestoTask fields. So consider to change 'taskStarted' and 'error' to an atomic which seems to be necessary if we enabling tsan on presto cpp tests which might help us to catch this sort of condition in tests.
// NOTE: 'error' in PrestoTask might have been set here by async query failure code path even though we have checked 'error' right above this loop. Also createOrUpdateErrorTask sets both 'error' and 'taskStarted' but leave velox 'task' empty in async query failure case.
Also consider to create a unit test using testValue facility provided by Velox.
|
@amitkdutta , @tanjialiang , @xiaoxmeng
to |
ef85e66 to
7c62943
Compare
7c62943 to
d8e12cb
Compare
@spershin @tanjialiang Addressed changes as you suggested. This solves the crash as well. Thanks for the suggestion. |
spershin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good.
Recently after shadowing a lot of traffic (where query failure rate is high), we found a crash which has following stack trace while looking in gdb:
Now, moving further in frame #5, we find that velox task is null
Hence the line
prestoTask->task->state()crashes. Actually we always check Prestotask's error status before accessing it's velox task, this is one place where it was missing and causing random crashes. This PR fixes it, and put proper comments for taskstarted field of PrestoTask.