-
Notifications
You must be signed in to change notification settings - Fork 440
TEZ-4349. DAGClient gets stuck with invalid cached DAGStatus #161
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
🎊 +1 overall
This message was automatically generated. |
| dagClientRpc.setAMProxy(createMockProxy(DAGStatusStateProto.DAG_SUCCEEDED, 1000l)); | ||
| dagClientRpc.injectAMFault(new IOException("injected AM Fault")); | ||
| dagClient.resetCounters(); | ||
| dagClientRpc.resetCountesr(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you please fix this typo?
(I know it's not introduced with this patch, but it's worth fixing as you already working on this area)
| if (cachedDAG != null) { | ||
| // could not get from AM (not reachable/ was killed). return cached status. | ||
| return cachedDagStatus; | ||
| return cachedDAG; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
am I right to assume this is the codepath where the original issue happened? could you please clarify how can we indefinitely stuck here?
I mean, we can only hit this part if getDAGStatusViaAM returns null but dagCompleted is not true, so when we hit this again and again in getDAGStatusViaAM :
} catch (TezException e) {
// can be either due to a n/w issue of due to AM completed.
} catch (IOException e) {
// can be either due to a n/w issue of due to AM completed.
}
also getApplicationReportInternal keeps returning null in checkAndSetDagCompletionStatus
was it the case for you?
if so, does it make sense to put at least debug level log messages to the silent catch branches?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the feedback.
Yes, considering the implementation of TezJob.run()-Line222 in Pig:
Pig polls on the DAGStatus inside an infinite loop:
while (true) {
try {
dagStatus = dagClient.getDAGStatus(null);
} catch (Exception e) {
log.info("Cannot retrieve DAG status", e);
break;
}
if (dagStatus.isCompleted()) {
// do something
// break;
}
sleep(1000);
} Let's assume the following scenario on Tez Side:
- Pig first iteration calls
getDAGStatusViaAM()which successfully pulls the DAGStatus and updates the cachedDAGStatus to running. - Pig sleeps 1000
- second call from Pig calls
getDAGStatusViaAM()which encountersTezExceptionorIOException. The call would return the last cachedDAGStatus (which is running), instead of null. - Since the status is
running, the Pig-thread sleeps - This will keep going as long as the
getDAGStatusViaAM()fails, and the last valid DAGStatus is still cached.
The problem in this corner case is that the Pig client will keep looping indefinitely as long as it does not receive a null or dagClient.getDAGStatus(null) does not throw an exception.
From a client perspective, it is better to fail early in order to recover faster.
|
🎊 +1 overall
This message was automatically generated. |
| public static final String TEZ_CLIENT_DAG_STATUS_CACHE_TIMEOUT_MINUTES = TEZ_PREFIX | ||
| + "client.dag.status.cache.timeout-minutes"; | ||
| // Default timeout is 5 minutes. | ||
| public static final long TEZ_CLIENT_DAG_STATUS_CACHE_TIMEOUT_MINUTES_DEFAULT = 5; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can lower this to 1 minute, especially because in case of query status stuck (that the patch is about to address) this default value will be the minimum time in which the client has the chance to realize the stuck cached progress
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering if we actually would prefer seconds. I can see use cases that would benefit from a lower time than 1 minute.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am fine with seconds. However, I am not so sure about the impact of setting the expiration to a small value.
For example, it is possible to have some delay in the AM executing the RPC call (or even a timeout at one time). If the expiration of the cache is smaller than the total of ("AM RPC with failure" + "wait for the client to retry" + "AM RPC"), then the cache won't serve its purpose.
If I understand correctly, cachedDAGStatus purpose is to protect the client from falling too soon to the RM. Correct me if I misunderstand the purpose of the cachedDAGStatus.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The latest commit changes the TimeUnit to seconds and the default to 60.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @amahussein, I'm assuming the default value 60s would work properly
|
🎊 +1 overall
This message was automatically generated. |
abstractdog
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
TEZ-4349 DAGClient gets stuck with invalid cached DAGStatu
The
cachedDagStatusshould be valid for a certain amount of time, or certain number of retires.When the
cachedDAGStatusexpires, the DAGClient tries to pull from AM or the RM.An error in fetching the status from both AM and RM, would return null to the caller.
TezConfiguration.TEZ_CLIENT_DAG_STATUS_CACHE_TIMEOUT_MINUTEStez.client.dag.status.cache.timeout-minutes, and the default is 5. ThetimeUnitof the expiration is minutes.TestDAGClient.testGetDagStatusWithCachedStatusExpirationmvn test -Dtest=TestDAGClient,TestTezClient,TestMockDAGAppMaster