@amahussein
Contributor

TEZ-4349 DAGClient gets stuck with invalid cached DAGStatus

The cachedDagStatus should be valid only for a certain amount of time, or for a certain number of retries.

When the cachedDagStatus expires, the DAGClient tries to pull the status from the AM or the RM.
If fetching the status from both the AM and the RM fails, null is returned to the caller.

  • The expiration time can be configured using TezConfiguration.TEZ_CLIENT_DAG_STATUS_CACHE_TIMEOUT_MINUTES (tez.client.dag.status.cache.timeout-minutes). The value is in minutes and defaults to 5.
  • Added a new unit test, TestDAGClient.testGetDagStatusWithCachedStatusExpiration.
  • Ran the following unit tests: mvn test -Dtest=TestDAGClient,TestTezClient,TestMockDAGAppMaster
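
The time-based expiration described above could look roughly like the following. This is a minimal sketch, not the code in the patch: the class name ExpiringValue and the injectable clock are hypothetical and exist only to make the idea testable without waiting.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

/** Hypothetical holder: the value is treated as absent once it is older than the timeout. */
final class ExpiringValue<T> {
  private final long timeoutNanos;
  private final LongSupplier nanoClock; // injectable so tests can control time
  private T value;
  private long storedAtNanos;

  ExpiringValue(long timeout, TimeUnit unit, LongSupplier nanoClock) {
    this.timeoutNanos = unit.toNanos(timeout);
    this.nanoClock = nanoClock;
  }

  synchronized void set(T newValue) {
    value = newValue;
    storedAtNanos = nanoClock.getAsLong();
  }

  /** Returns the cached value, or null if nothing is cached or the entry has expired. */
  synchronized T get() {
    if (value != null && nanoClock.getAsLong() - storedAtNanos >= timeoutNanos) {
      value = null; // drop the stale entry so the caller falls through to a real fetch
    }
    return value;
  }
}

final class ExpiringValueDemo {
  public static void main(String[] args) {
    long[] now = {0L}; // fake clock, in nanoseconds
    ExpiringValue<String> cache = new ExpiringValue<>(5, TimeUnit.MINUTES, () -> now[0]);
    cache.set("RUNNING");
    System.out.println(cache.get()); // entry is still fresh
    now[0] = TimeUnit.MINUTES.toNanos(6);
    System.out.println(cache.get()); // entry has expired, so this is null
  }
}
```

With a holder like this, a caller that gets null back knows it has neither a fresh status nor an unexpired cached one, and can report the failure instead of reusing stale data.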

@tez-yetus

🎊 +1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 0m 29s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 1 new or modified test files.
_ master Compile Tests _
+1 💚 mvninstall 13m 13s master passed
+1 💚 compile 0m 34s master passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04
+1 💚 compile 0m 34s master passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10
+1 💚 checkstyle 1m 4s master passed
+1 💚 javadoc 0m 50s master passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04
+1 💚 javadoc 0m 38s master passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10
+0 🆗 spotbugs 1m 34s Used deprecated FindBugs config; considering switching to SpotBugs.
+1 💚 findbugs 1m 31s master passed
_ Patch Compile Tests _
+1 💚 mvninstall 0m 23s the patch passed
+1 💚 compile 0m 24s the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04
+1 💚 javac 0m 24s the patch passed
+1 💚 compile 0m 20s the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10
+1 💚 javac 0m 20s the patch passed
-0 ⚠️ checkstyle 0m 15s tez-api: The patch generated 16 new + 103 unchanged - 0 fixed = 119 total (was 103)
+1 💚 whitespace 0m 0s The patch has no whitespace issues.
+1 💚 javadoc 0m 24s the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04
+1 💚 javadoc 0m 25s the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10
+1 💚 findbugs 1m 3s the patch passed
_ Other Tests _
+1 💚 unit 2m 7s tez-api in the patch passed.
+1 💚 asflicense 0m 14s The patch does not generate ASF License warnings.
25m 25s
Subsystem Report/Notes
Docker ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-161/1/artifact/out/Dockerfile
GITHUB PR #161
Optional Tests dupname asflicense javac javadoc unit spotbugs findbugs checkstyle compile
uname Linux 8779a8e83e2d 4.15.0-112-generic #113-Ubuntu SMP Thu Jul 9 23:41:39 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality personality/tez.sh
git revision master / f39a51e
Default Java Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10
checkstyle https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-161/1/artifact/out/diff-checkstyle-tez-api.txt
Test Results https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-161/1/testReport/
Max. process+thread count 263 (vs. ulimit of 5500)
modules C: tez-api U: tez-api
Console output https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-161/1/console
versions git=2.25.1 maven=3.6.3 findbugs=3.0.1
Powered by Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

dagClientRpc.setAMProxy(createMockProxy(DAGStatusStateProto.DAG_SUCCEEDED, 1000l));
dagClientRpc.injectAMFault(new IOException("injected AM Fault"));
dagClient.resetCounters();
dagClientRpc.resetCountesr();
Contributor

Could you please fix this typo?
(I know it was not introduced by this patch, but it is worth fixing since you are already working in this area.)

  if (cachedDAG != null) {
    // could not get from AM (not reachable/ was killed). return cached status.
-   return cachedDagStatus;
+   return cachedDAG;
Contributor

@abstractdog abstractdog Nov 14, 2021

Am I right to assume this is the code path where the original issue happened? Could you please clarify how we can get stuck here indefinitely?
I mean, we can only hit this part if getDAGStatusViaAM returns null but dagCompleted is not true. So we would have to hit this again and again in getDAGStatusViaAM:

    } catch (TezException e) {
      // can be either due to a n/w issue of due to AM completed.
    } catch (IOException e) {
      // can be either due to a n/w issue of due to AM completed.
    }

and getApplicationReportInternal would also have to keep returning null in checkAndSetDagCompletionStatus.

Was that the case for you?
If so, does it make sense to add at least debug-level log messages to the silent catch branches?
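
As an illustration of the suggestion above, a catch branch can record the failure at debug level before returning null. This is only a sketch under assumptions: QuietFetch and fetchOrNull are hypothetical names, java.util.logging (where FINE plays the role of debug) stands in for whatever logger the project uses, and a generic Exception stands in for TezException.

```java
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical helper showing non-silent catch branches. */
final class QuietFetch {
  private static final Logger LOG = Logger.getLogger(QuietFetch.class.getName());

  /** Runs the fetch; on failure logs at debug level (FINE) and returns null. */
  static <T> T fetchOrNull(Callable<T> fetch) {
    try {
      return fetch.call();
    } catch (IOException e) {
      // May be a network issue or the AM having completed; keep a trace either way.
      LOG.log(Level.FINE, "Failed to fetch status via AM", e);
    } catch (Exception e) {
      // Stand-in for the second exception type handled by the real method.
      LOG.log(Level.FINE, "Unexpected failure while fetching status via AM", e);
    }
    return null;
  }

  public static void main(String[] args) {
    String ok = fetchOrNull(() -> "RUNNING");
    String failed = fetchOrNull(() -> { throw new IOException("injected AM fault"); });
    System.out.println(ok + " / " + failed);
  }
}
```

The return value is unchanged from the silent version; the only difference is that repeated failures become visible when debug logging is enabled.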

Contributor Author

Thanks for the feedback.

Yes. Consider the implementation of TezJob.run() (line 222) in Pig, which polls the DAGStatus inside an infinite loop:

        while (true) {
            try {
                dagStatus = dagClient.getDAGStatus(null);
            } catch (Exception e) {
                log.info("Cannot retrieve DAG status", e);
                break;
            }
            if (dagStatus.isCompleted()) {
                // do something
                // break;
            }
            sleep(1000);
        }

Let's assume the following scenario on the Tez side:

  • On the first iteration, Pig calls getDAGStatusViaAM(), which successfully pulls the DAGStatus and updates the cachedDAGStatus to running.
  • Pig calls sleep(1000).
  • On the second iteration, getDAGStatusViaAM() encounters a TezException or an IOException. The call returns the last cachedDAGStatus (which is running) instead of null.
  • Since the status is running, the Pig thread sleeps again.
  • This keeps going for as long as getDAGStatusViaAM() fails and the last valid DAGStatus is still cached.

The problem in this corner case is that the Pig client keeps looping indefinitely: the loop only ends if dagClient.getDAGStatus(null) returns null or throws an exception, and neither happens while the stale status is served from the cache.
From a client perspective, it is better to fail early in order to recover faster.
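
To make the "fail early" point concrete, here is a minimal sketch of a polling loop that stops as soon as the status source returns null. It is not Pig or Tez code: FailFastPoller, its Status enum, and the maxPolls bound are hypothetical, and the sleep between polls is omitted.

```java
import java.util.function.Supplier;

/** Hypothetical client-side poller that gives up instead of spinning on a missing status. */
final class FailFastPoller {
  /** Stand-ins for the real DAG states. */
  enum Status { RUNNING, SUCCEEDED }

  /**
   * Polls until the status is SUCCEEDED and returns the number of polls used.
   * Throws as soon as the source returns null, i.e. when neither a fresh
   * nor an unexpired cached status is available.
   */
  static int pollUntilDone(Supplier<Status> source, int maxPolls) {
    for (int polls = 1; polls <= maxPolls; polls++) {
      Status status = source.get();
      if (status == null) {
        throw new IllegalStateException("status unavailable after " + polls + " polls");
      }
      if (status == Status.SUCCEEDED) {
        return polls;
      }
      // A real client would sleep here between polls.
    }
    throw new IllegalStateException("gave up after " + maxPolls + " polls");
  }

  public static void main(String[] args) {
    int[] calls = {0};
    // Reports RUNNING twice, then SUCCEEDED.
    int used = pollUntilDone(() -> ++calls[0] < 3 ? Status.RUNNING : Status.SUCCEEDED, 10);
    System.out.println("completed after " + used + " polls");
  }
}
```

Once the cached status can expire, a null reaches a loop like this one and the client can surface the failure rather than sleeping forever on a stale running status.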

@tez-yetus

🎊 +1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 0m 33s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 1 new or modified test files.
_ master Compile Tests _
+1 💚 mvninstall 13m 14s master passed
+1 💚 compile 0m 39s master passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04
+1 💚 compile 0m 37s master passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10
+1 💚 checkstyle 1m 6s master passed
+1 💚 javadoc 0m 50s master passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04
+1 💚 javadoc 0m 38s master passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10
+0 🆗 spotbugs 1m 35s Used deprecated FindBugs config; considering switching to SpotBugs.
+1 💚 findbugs 1m 32s master passed
_ Patch Compile Tests _
+1 💚 mvninstall 0m 22s the patch passed
+1 💚 compile 0m 24s the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04
+1 💚 javac 0m 24s the patch passed
+1 💚 compile 0m 20s the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10
+1 💚 javac 0m 20s the patch passed
-0 ⚠️ checkstyle 0m 14s tez-api: The patch generated 2 new + 103 unchanged - 0 fixed = 105 total (was 103)
+1 💚 whitespace 0m 0s The patch has no whitespace issues.
+1 💚 javadoc 0m 24s the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04
+1 💚 javadoc 0m 23s the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10
+1 💚 findbugs 1m 3s the patch passed
_ Other Tests _
+1 💚 unit 2m 4s tez-api in the patch passed.
+1 💚 asflicense 0m 14s The patch does not generate ASF License warnings.
25m 31s
Subsystem Report/Notes
Docker ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-161/2/artifact/out/Dockerfile
GITHUB PR #161
Optional Tests dupname asflicense javac javadoc unit spotbugs findbugs checkstyle compile
uname Linux d275826a7234 4.15.0-112-generic #113-Ubuntu SMP Thu Jul 9 23:41:39 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality personality/tez.sh
git revision master / f39a51e
Default Java Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10
checkstyle https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-161/2/artifact/out/diff-checkstyle-tez-api.txt
Test Results https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-161/2/testReport/
Max. process+thread count 262 (vs. ulimit of 5500)
modules C: tez-api U: tez-api
Console output https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-161/2/console
versions git=2.25.1 maven=3.6.3 findbugs=3.0.1
Powered by Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

public static final String TEZ_CLIENT_DAG_STATUS_CACHE_TIMEOUT_MINUTES = TEZ_PREFIX
+ "client.dag.status.cache.timeout-minutes";
// Default timeout is 5 minutes.
public static final long TEZ_CLIENT_DAG_STATUS_CACHE_TIMEOUT_MINUTES_DEFAULT = 5;
Contributor

I think we can lower this to 1 minute. When the query status is stuck (the case this patch addresses), this default is the minimum time before the client has a chance to notice the stale cached progress.

Contributor

I'm wondering if we actually would prefer seconds. I can see use cases that would benefit from a lower time than 1 minute.

Contributor Author

I am fine with seconds. However, I am not sure about the impact of setting the expiration to a small value.
For example, the AM may be slow to execute the RPC call (or the call may occasionally time out). If the cache expiration is smaller than the total of ("AM RPC with failure" + "wait for the client to retry" + "AM RPC"), then the cache won't serve its purpose.
If I understand correctly, the purpose of cachedDAGStatus is to protect the client from falling back to the RM too soon. Correct me if I misunderstand it.
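
The condition above can be written down as a small check. This is a sketch only: CacheTimeoutCheck and coversOneRetry are hypothetical names, and the durations in main are made-up numbers for illustration, not measurements.

```java
import java.time.Duration;

/** Hypothetical helper for reasoning about the cache timeout. */
final class CacheTimeoutCheck {
  /**
   * True if a cached status outlives one failed AM call, the client's
   * wait before retrying, and one more AM call.
   */
  static boolean coversOneRetry(Duration cacheTimeout, Duration failedRpc,
                                Duration retryWait, Duration nextRpc) {
    Duration retryWindow = failedRpc.plus(retryWait).plus(nextRpc);
    return cacheTimeout.compareTo(retryWindow) > 0;
  }

  public static void main(String[] args) {
    // Made-up numbers: a 20 s failed call, a 1 s client wait, a 20 s retry (41 s in total).
    Duration failed = Duration.ofSeconds(20);
    Duration wait = Duration.ofSeconds(1);
    Duration retry = Duration.ofSeconds(20);
    System.out.println(coversOneRetry(Duration.ofSeconds(60), failed, wait, retry));
    System.out.println(coversOneRetry(Duration.ofSeconds(30), failed, wait, retry));
  }
}
```

Under these assumed numbers a 60-second timeout leaves room for one retry against the AM, while a 30-second timeout would expire the cache before the retry completes.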

Contributor Author

The latest commit changes the TimeUnit to seconds and the default to 60.

Contributor

thanks @amahussein, I'm assuming the default value 60s would work properly

@tez-yetus

🎊 +1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 0m 39s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 1 new or modified test files.
_ master Compile Tests _
+1 💚 mvninstall 14m 44s master passed
+1 💚 compile 0m 41s master passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04
+1 💚 compile 0m 37s master passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10
+1 💚 checkstyle 1m 5s master passed
+1 💚 javadoc 0m 57s master passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04
+1 💚 javadoc 0m 42s master passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10
+0 🆗 spotbugs 1m 58s Used deprecated FindBugs config; considering switching to SpotBugs.
+1 💚 findbugs 1m 56s master passed
_ Patch Compile Tests _
+1 💚 mvninstall 0m 34s the patch passed
+1 💚 compile 0m 37s the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04
+1 💚 javac 0m 38s the patch passed
+1 💚 compile 0m 32s the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10
+1 💚 javac 0m 32s the patch passed
+1 💚 checkstyle 0m 19s the patch passed
+1 💚 whitespace 0m 0s The patch has no whitespace issues.
+1 💚 javadoc 0m 38s the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04
+1 💚 javadoc 0m 32s the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10
+1 💚 findbugs 1m 28s the patch passed
_ Other Tests _
+1 💚 unit 2m 21s tez-api in the patch passed.
+1 💚 asflicense 0m 15s The patch does not generate ASF License warnings.
29m 47s
Subsystem Report/Notes
Docker ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-161/3/artifact/out/Dockerfile
GITHUB PR #161
Optional Tests dupname asflicense javac javadoc unit spotbugs findbugs checkstyle compile
uname Linux 7dbdbb5f65d4 4.15.0-112-generic #113-Ubuntu SMP Thu Jul 9 23:41:39 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality personality/tez.sh
git revision master / f39a51e
Default Java Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10
Test Results https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-161/3/testReport/
Max. process+thread count 264 (vs. ulimit of 5500)
modules C: tez-api U: tez-api
Console output https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-161/3/console
versions git=2.25.1 maven=3.6.3 findbugs=3.0.1
Powered by Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

@abstractdog abstractdog self-requested a review December 25, 2021 13:48
Contributor

@abstractdog abstractdog left a comment

+1

4 participants