@mudit-97 mudit-97 commented Feb 13, 2023

When Tez DAG recovery fails for some reason on a retry attempt of a Tez AM, a corner case can leave the DAG state set to IDLE.

Once the DAG state is IDLE, checkAndHandleSessionTimeout() fires and the Tez AM tries to shut down the DAG. Because recovery failed, there are no running DAGs at that point.

If there are no RUNNING DAGs and the DAG state is IDLE, the AM sets the status to SUCCEEDED by default.

This can cause problems in dependent systems such as Hive, which move ahead with later tasks in the pipeline on the assumption that the DAG succeeded. For example, Hive may end up moving empty data.

This PR proposes a patch to Tez that introduces a config. When the config is set and recovery data is missing on an attempt > 1, the DAG is failed.

Raised JIRA for the same: https://issues.apache.org/jira/browse/TEZ-4474
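
For readers skimming the thread, the behavior described above can be sketched as a small standalone model. This is not the code from the patch: the class, enum, and method names are hypothetical, and only the meaning of the flag (fail when recovery data is missing on an attempt > 1, otherwise keep the old SUCCEEDED default) is taken from the description.

```java
// Illustrative model only; none of these names come from the Tez code base.
public class MissingRecoveryDataSketch {

    /** Hypothetical stand-in for the final status the AM reports on shutdown. */
    public enum FinalStatus { SUCCEEDED, FAILED }

    /**
     * Decides the status reported when the AM shuts down in the IDLE state
     * with no running DAGs.
     *
     * @param attemptNumber             AM attempt number (1 = first attempt)
     * @param recoveryDataFound         whether recovery data could be read
     * @param failOnMissingRecoveryData assumed value of the new config flag
     */
    public static FinalStatus statusOnIdleShutdown(int attemptNumber,
                                                   boolean recoveryDataFound,
                                                   boolean failOnMissingRecoveryData) {
        // Recovery is only expected on a retry, i.e. attempt > 1.
        boolean recoveryExpected = attemptNumber > 1;
        if (failOnMissingRecoveryData && recoveryExpected && !recoveryDataFound) {
            return FinalStatus.FAILED;
        }
        // Behavior described in the PR for the unset flag: IDLE with no running DAGs.
        return FinalStatus.SUCCEEDED;
    }

    public static void main(String[] args) {
        System.out.println(statusOnIdleShutdown(2, false, false)); // SUCCEEDED (flag off)
        System.out.println(statusOnIdleShutdown(2, false, true));  // FAILED (flag on)
        System.out.println(statusOnIdleShutdown(1, false, true));  // SUCCEEDED (first attempt)
    }
}
```

The second call is the case the PR targets: a downstream system such as Hive would see FAILED instead of a misleading SUCCEEDED.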

@tez-yetus

This comment was marked as outdated.

Contributor

@shameersss1 shameersss1 left a comment

The changes LGTM, +1. I have one minor nit comment; please also check the checkstyle violations: https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-266/10/artifact/out/diff-checkstyle-tez-dag.txt

*/
@ConfigurationScope(Scope.AM)
@ConfigurationProperty(type="boolean")
public static final String TEZ_AM_FAILURE_ON_MISSING_RECOVERY =
Contributor

nit: can we have a better config name, e.g. TEZ_AM_FAILURE_ON_MISSING_RECOVERY_DATA?

Contributor Author

ack, done

@shameersss1
Contributor

@abstractdog Could you please review the PR?

Contributor

@shameersss1 shameersss1 left a comment

Can we change the PR title to match the latest approach?

@mudit-97 mudit-97 changed the title from "TEZ-4474: Added config to fail the DAG status when shutdown called with no current running DAGs" to "TEZ-4474: Added config to fail the DAG status when recovery data is missing" Feb 17, 2023
@abstractdog
Contributor

@mudit-97, @shameersss1: thanks for working on this so far, let me find some time next week to discover and review this scenario and patch

field.set(dam, spyRecoveryFs);

verify(dam.mockScheduler).setShouldUnregisterFlag();
verify(dam.mockShutdown).shutdown();
Contributor

we're testing the new feature here, but I'm missing something:

  1. at first sight it's not clear which part of the spy fs caused the expected behavior (returning with ERROR)
  2. could this method be extended to reflect what happens when TEZ_AM_FAILURE_ON_MISSING_RECOVERY_DATA=false?

Contributor Author

@abstractdog, enhanced the test case:

  1. For the spy, I now check the exact number of invocations and capture the arguments of each invocation to confirm that it happened only in the recovery flow, specifically during the summary file fetch
  2. I created a separate test case to cover the scenario where TEZ_AM_FAILURE_ON_MISSING_RECOVERY_DATA is false

Contributor

thanks @mudit-97, just a few questions:

  1. I'm not 100% sure, but do we really need a spy here? As far as I know, we use spies when we really need an instance created by us instead of a mocked one. Would you consider using a mock here, which might be much simpler?
  2. if we end up using a spy here, please fix the method argument names. I know this might look like nitpicking, but even if we don't use them now, args like "boolean b, int i, short i1, long l" are always a code smell

Contributor Author

@abstractdog, converted it to a mock and removed the spy, please check

Contributor

very cool, thanks, one more comment :)

Contributor Author

@abstractdog, made that change as well, please check

}

@Override
public FSDataOutputStream create(Path path, FsPermission fsPermission, boolean b,
Contributor

this looks like an IDE-generated method, please use the correct parameter names (even if they are not used), e.g. here:

      public FSDataOutputStream create(Path f, FsPermission permission, boolean overwrite, int bufferSize,
          short replication, long blockSize, Progressable progress) throws IOException {

please check other methods too

Contributor Author

@abstractdog, I tried to create a spy FileSystem object. FileSystem is an abstract Hadoop class and it had similar variable names, so I kept them the same. I created these methods only as placeholders because I needed an instance of FileSystem, and kept them in parity with the main class. If needed, I will replace them with other names.

ApplicationId appId = ApplicationId.newInstance(1, 1);
ApplicationAttemptId attemptId = ApplicationAttemptId.newInstance(appId, 2);

FileSystem spyRecoveryFs = spy(new FileSystem() {
Contributor

is this spy reusable in other test methods in this class? If so, refactor it into a class field; if not, make it obvious why it is special (with the variable name and/or comments on the implemented methods)

Contributor Author

made it common for the class

@mudit1289 mudit1289 force-pushed the TEZ-4474 branch 2 times, most recently from 6b01129 to fdbaf21 Compare February 23, 2023 09:29

@tez-yetus

🎊 +1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 0m 37s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 1 new or modified test files.
_ master Compile Tests _
+0 🆗 mvndep 5m 47s Maven dependency ordering for branch
+1 💚 mvninstall 10m 35s master passed
+1 💚 compile 1m 14s master passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu222.04
+1 💚 compile 1m 4s master passed with JDK Private Build-1.8.0_352-8u352-ga-1~22.04-b08
+1 💚 checkstyle 1m 5s master passed
+1 💚 javadoc 1m 15s master passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu222.04
+1 💚 javadoc 1m 7s master passed with JDK Private Build-1.8.0_352-8u352-ga-1~22.04-b08
+0 🆗 spotbugs 1m 18s Used deprecated FindBugs config; considering switching to SpotBugs.
+1 💚 findbugs 2m 38s master passed
_ Patch Compile Tests _
+0 🆗 mvndep 0m 9s Maven dependency ordering for patch
+1 💚 mvninstall 0m 48s the patch passed
+1 💚 compile 0m 53s the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu222.04
+1 💚 javac 0m 53s the patch passed
+1 💚 compile 0m 46s the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~22.04-b08
+1 💚 javac 0m 46s the patch passed
+1 💚 checkstyle 0m 30s the patch passed
+1 💚 whitespace 0m 0s The patch has no whitespace issues.
+1 💚 javadoc 0m 46s the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu222.04
+1 💚 javadoc 0m 48s the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~22.04-b08
+1 💚 findbugs 2m 17s the patch passed
_ Other Tests _
+1 💚 unit 2m 8s tez-api in the patch passed.
+1 💚 unit 5m 4s tez-dag in the patch passed.
+1 💚 asflicense 0m 20s The patch does not generate ASF License warnings.
40m 56s
Subsystem Report/Notes
Docker ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-266/16/artifact/out/Dockerfile
GITHUB PR #266
JIRA Issue TEZ-4474
Optional Tests dupname asflicense javac javadoc unit spotbugs findbugs checkstyle compile
uname Linux b1b691cfb22f 4.15.0-200-generic #211-Ubuntu SMP Thu Nov 24 18:16:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality personality/tez.sh
git revision master / be99489
Default Java Private Build-1.8.0_352-8u352-ga-1~22.04-b08
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu222.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_352-8u352-ga-1~22.04-b08
Test Results https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-266/16/testReport/
Max. process+thread count 389 (vs. ulimit of 5500)
modules C: tez-api tez-dag U: .
Console output https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-266/16/console
versions git=2.34.1 maven=3.6.3 findbugs=3.0.1
Powered by Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

private static final String CLASS_SUFFIX = "_CLASS";
private static final File TEST_DIR = new File(System.getProperty("test.build.data"),
TestDAGAppMaster.class.getName()).getAbsoluteFile();
private final FileSystem mockFs = mock(FileSystem.class);
Contributor

as this is just a single line now, it doesn't have to be a field, you can have it in the methods

Contributor Author

done

Contributor

thanks, approved

Contributor

@abstractdog abstractdog left a comment

+1 pending tests

@tez-yetus

🎊 +1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 0m 38s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 1 new or modified test files.
_ master Compile Tests _
+0 🆗 mvndep 7m 27s Maven dependency ordering for branch
+1 💚 mvninstall 10m 38s master passed
+1 💚 compile 1m 15s master passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu222.04
+1 💚 compile 1m 7s master passed with JDK Private Build-1.8.0_352-8u352-ga-1~22.04-b08
+1 💚 checkstyle 1m 7s master passed
+1 💚 javadoc 1m 23s master passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu222.04
+1 💚 javadoc 1m 12s master passed with JDK Private Build-1.8.0_352-8u352-ga-1~22.04-b08
+0 🆗 spotbugs 1m 29s Used deprecated FindBugs config; considering switching to SpotBugs.
+1 💚 findbugs 3m 1s master passed
_ Patch Compile Tests _
+0 🆗 mvndep 0m 8s Maven dependency ordering for patch
+1 💚 mvninstall 0m 57s the patch passed
+1 💚 compile 1m 7s the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu222.04
+1 💚 javac 1m 7s the patch passed
+1 💚 compile 0m 55s the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~22.04-b08
+1 💚 javac 0m 55s the patch passed
-0 ⚠️ checkstyle 0m 20s tez-dag: The patch generated 6 new + 54 unchanged - 0 fixed = 60 total (was 54)
+1 💚 whitespace 0m 0s The patch has no whitespace issues.
+1 💚 javadoc 0m 57s the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu222.04
+1 💚 javadoc 0m 57s the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~22.04-b08
+1 💚 findbugs 2m 47s the patch passed
_ Other Tests _
+1 💚 unit 2m 19s tez-api in the patch passed.
+1 💚 unit 5m 19s tez-dag in the patch passed.
+1 💚 asflicense 0m 20s The patch does not generate ASF License warnings.
45m 24s
Subsystem Report/Notes
Docker ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-266/17/artifact/out/Dockerfile
GITHUB PR #266
JIRA Issue TEZ-4474
Optional Tests dupname asflicense javac javadoc unit spotbugs findbugs checkstyle compile
uname Linux 95c06f27dbd3 4.15.0-200-generic #211-Ubuntu SMP Thu Nov 24 18:16:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality personality/tez.sh
git revision master / db9eb1e
Default Java Private Build-1.8.0_352-8u352-ga-1~22.04-b08
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu222.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_352-8u352-ga-1~22.04-b08
checkstyle https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-266/17/artifact/out/diff-checkstyle-tez-dag.txt
Test Results https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-266/17/testReport/
Max. process+thread count 492 (vs. ulimit of 5500)
modules C: tez-api tez-dag U: .
Console output https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-266/17/console
versions git=2.34.1 maven=3.6.3 findbugs=3.0.1
Powered by Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

@tez-yetus

🎊 +1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 0m 40s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 1 new or modified test files.
_ master Compile Tests _
+0 🆗 mvndep 6m 6s Maven dependency ordering for branch
+1 💚 mvninstall 10m 41s master passed
+1 💚 compile 1m 20s master passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu222.04
+1 💚 compile 1m 13s master passed with JDK Private Build-1.8.0_352-8u352-ga-1~22.04-b08
+1 💚 checkstyle 1m 8s master passed
+1 💚 javadoc 1m 22s master passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu222.04
+1 💚 javadoc 1m 13s master passed with JDK Private Build-1.8.0_352-8u352-ga-1~22.04-b08
+0 🆗 spotbugs 1m 34s Used deprecated FindBugs config; considering switching to SpotBugs.
+1 💚 findbugs 3m 11s master passed
_ Patch Compile Tests _
+0 🆗 mvndep 0m 9s Maven dependency ordering for patch
+1 💚 mvninstall 0m 56s the patch passed
+1 💚 compile 1m 5s the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu222.04
+1 💚 javac 1m 5s the patch passed
+1 💚 compile 0m 56s the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~22.04-b08
+1 💚 javac 0m 56s the patch passed
-0 ⚠️ checkstyle 0m 22s tez-dag: The patch generated 6 new + 54 unchanged - 0 fixed = 60 total (was 54)
+1 💚 whitespace 0m 0s The patch has no whitespace issues.
+1 💚 javadoc 0m 56s the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu222.04
+1 💚 javadoc 0m 56s the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~22.04-b08
+1 💚 findbugs 2m 28s the patch passed
_ Other Tests _
+1 💚 unit 2m 9s tez-api in the patch passed.
+1 💚 unit 5m 12s tez-dag in the patch passed.
+1 💚 asflicense 0m 20s The patch does not generate ASF License warnings.
43m 48s
Subsystem Report/Notes
Docker ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-266/18/artifact/out/Dockerfile
GITHUB PR #266
JIRA Issue TEZ-4474
Optional Tests dupname asflicense javac javadoc unit spotbugs findbugs checkstyle compile
uname Linux 6fc6576785ed 4.15.0-200-generic #211-Ubuntu SMP Thu Nov 24 18:16:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality personality/tez.sh
git revision master / db9eb1e
Default Java Private Build-1.8.0_352-8u352-ga-1~22.04-b08
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu222.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_352-8u352-ga-1~22.04-b08
checkstyle https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-266/18/artifact/out/diff-checkstyle-tez-dag.txt
Test Results https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-266/18/testReport/
Max. process+thread count 391 (vs. ulimit of 5500)
modules C: tez-api tez-dag U: .
Console output https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-266/18/console
versions git=2.34.1 maven=3.6.3 findbugs=3.0.1
Powered by Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

@mudit-97
Contributor Author

@abstractdog, the tests have completed, can you please approve and merge?

@abstractdog abstractdog merged commit 3e194cb into apache:master Feb 24, 2023