-
Notifications
You must be signed in to change notification settings - Fork 440
TEZ-4474: Added config to fail the DAG status when recovery data is missing #266
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
shameersss1
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes LGTM +1. I have some one minor nit comment and please check the checkstyle violation : https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-266/10/artifact/out/diff-checkstyle-tez-dag.txt
| */ | ||
| @ConfigurationScope(Scope.AM) | ||
| @ConfigurationProperty(type="boolean") | ||
| public static final String TEZ_AM_FAILURE_ON_MISSING_RECOVERY = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can we have a better config name? TEZ_AM_FAILURE_ON_MISSING_RECOVERY_DATA
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ack, done
|
@abstractdog Could you please review the PR? |
This comment was marked as outdated.
This comment was marked as outdated.
shameersss1
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we change the PR title to adhere with the latest approach
|
@mudit-97, @shameersss1: thanks for working on this so far, let me find some time next week to discover and review this scenario and patch |
| field.set(dam, spyRecoveryFs); | ||
|
|
||
| verify(dam.mockScheduler).setShouldUnregisterFlag(); | ||
| verify(dam.mockShutdown).shutdown(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we're testing the new feature here, I'm missing something here:
- it's not clear for the first sight which part of the spy fs caused this expected behavior? (returning with ERROR)
- is there a chance to extend this method to reflect what happens in case of TEZ_AM_FAILURE_ON_MISSING_RECOVERY_DATA=false?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@abstractdog , enhanced the test case
- For the spy, now I am checking the exact number of invocations + I am capturing the values in each invocation to confirm this happened during recovery flow only and that too during summary file fetch
- I created a separate test case to capture that scenario also when TEZ_AM_FAILURE_ON_MISSING_RECOVERY_DATA is false
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @mudit-97 , just a few questions:
- I'm not 100% sure, but do we really need spy here? as far as I know we use spies when really need an instance created by us instead of a mocked one, would you consider trying to use a mock here, which might be much simpler?
- if we end up using a spy here, please fix the method arguments' name...I know, this might look nitpicking, but even if we don't use them now, it's always a code smell to have args like "boolean b, int i, short i1, long l"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@abstractdog , converted it to mock and removed spy, please check
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
very cool, thanks, one more comment :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@abstractdog done that change also, please check
| } | ||
|
|
||
| @Override | ||
| public FSDataOutputStream create(Path path, FsPermission fsPermission, boolean b, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this looks like an IDE generated method, please use the correct variable names (even if they are not used), e.g. here:
public FSDataOutputStream create(Path f, FsPermission permission, boolean overwrite, int bufferSize,
short replication, long blockSize, Progressable progress) throws IOException {
please check other methods too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@abstractdog , I tried to create a spy filesystem object, FileSystem was an abstract hadoop class and it had the similar variable names so I kept them same, these I just created for placeholder because I needed to create an instance of Filesystem and kept in parity with main class, if needed I will replace these with some other names
| ApplicationId appId = ApplicationId.newInstance(1, 1); | ||
| ApplicationAttemptId attemptId = ApplicationAttemptId.newInstance(appId, 2); | ||
|
|
||
| FileSystem spyRecoveryFs = spy(new FileSystem() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this spy reusable in other testing methods in this class? if so, refactor it to class field, if it's not, make it obvious why is this special (with variable name and/or comment on implemented methods)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
made it common for the class
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
6b01129 to
fdbaf21
Compare
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
|
🎊 +1 overall
This message was automatically generated. |
| private static final String CLASS_SUFFIX = "_CLASS"; | ||
| private static final File TEST_DIR = new File(System.getProperty("test.build.data"), | ||
| TestDAGAppMaster.class.getName()).getAbsoluteFile(); | ||
| private final FileSystem mockFs = mock(FileSystem.class); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as this is just a single line now, it doesn't have to be a field, you can have it in the methods
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, approved
abstractdog
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 pending tests
|
🎊 +1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |
|
@abstractdog , the tests are completed, can you please approve and merge |
When Tez DAG recovery is failed because of some reason in the second retry of any Tez AM, then in corner case scenario, Tez Job sets DAG state to IDLE
Once the DAG state is set to IDLE, then after checkAndHandleSessionTimeout(), Tez AM will try to shutdown the DAG, and since recovery was failed so there will not be any running DAGs
If there are no RUNNING DAGs and state of DAG is IDLE, then by default AM sets the status to SUCCEEDED
This can result in issues in dependent systems like Hive which will move ahead with other tasks in pipeline assuming the DAG was success, this can result in moving empty data in Hive
As part of this PR, we are proposing to introduce a patch in TEZ, which introduces a config, which when set, then in case of recovery missing in attempts > 1, it fails the DAG
Raised JIRA for the same: https://issues.apache.org/jira/browse/TEZ-4474