-
Notifications
You must be signed in to change notification settings - Fork 9.2k
HDFS-16143. Add Timer in EditLogTailer and de-flake TestEditLogTailer#testStandbyTriggersLogRollsWhenTailInProgressEdits #3235
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This comment has been minimized.
This comment has been minimized.
7143f8c to
244b00f
Compare
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
244b00f to
3a731fa
Compare
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
99c7d0e to
1c2efbf
Compare
…nTailInProgressEdits
1c2efbf to
b5554b6
Compare
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
|
@aajisaka @jojochuang @tasanuma Could you please review this PR? After the latest revision, 4 builds were triggered and we don't see failures in |
jojochuang
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'm not confident in my ability to review this... just a question
| new PermissionStatus("test", "test", | ||
| new FsPermission((short)00755)), true); | ||
| // reset lastRollTimeMs in EditLogTailer. | ||
| active.getNamesystem().getEditLogTailer().resetLastRollTimeMs(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mind to explain when this is needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking a look @jojochuang.
EditLogTailer has a thread that keeps running to identify when is the right time to trigger log rolling by calling Active Namenode's rollEditLog() API.
private void doWork() {
long currentSleepTimeMs = sleepTimeMs;
while (shouldRun) {
long editsTailed = 0;
try {
// There's no point in triggering a log roll if the Standby hasn't
// read any more transactions since the last time a roll was
// triggered.
boolean triggeredLogRoll = false;
if (tooLongSinceLastLoad() &&
lastRollTriggerTxId < lastLoadedTxnId) {
triggerActiveLogRoll();
triggeredLogRoll = true;
}
...
...
What happens with this test is that by the time we create new dirs in this for loop, this active thread would keep checking and intermittently keep triggering log roll by making RPC calls to Active Namenode, and hence this test would become flaky because the test expects Standby Namenode's last applied txn id to be less than active Namenode's last written txn id within a specific time duration (this is the only reason behind it's flakiness). When it comes to how long EditLogTailer's thread keeps waiting to trigger log roll, it depends on lastRollTimeMs.
In the above code, tooLongSinceLastLoad() refers to:
/**
* @return true if the configured log roll period has elapsed.
*/
private boolean tooLongSinceLastLoad() {
return logRollPeriodMs >= 0 &&
(monotonicNow() - lastRollTimeMs) > logRollPeriodMs;
}
Hence, until lastRollTimeMs worth of time is elapsed, log roll would not be tailed, however, this always tends to be flaky because we have no control over how much time mkdir calls in this for loop is going to take and in that meantime, lastRollTimeMs worth of time can be elapsed easily, hence this test is flaky. When we expect Standby Namenode's txnId to be less than that of Active Namenode, it is not the case because log is rolled by above thread in EditLogTailer.
Hence, it is important for this test to keep resetting lastRollTimeMs while mkdir calls are getting executed so that we don't give chance for tooLongSinceLastLoad() to be successful until we want it to be successful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
xkrogen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for trying to fix this flaky test! I haven't 100% understood the current approach yet but I think I understand the gist. It seems reasonable, but it the real root cause here is that we're using real timing info in a multi-threaded test. It seems it would be better to allow for passing in a mock Timer implementation to EditLogTailer so that we can have full control over the timing information used, eliminating this whole class of problem and making it deterministic instead of applying a band-aid by tweaking the timing a bit. WDYT?
| if (future != null) { | ||
| future.cancel(true); | ||
| } | ||
| future.cancel(true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is the null-check removed here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because future will never be null here. The only way we can reach here is by catching TimeoutException and TimeoutException can only occur here because of future.get(rollEditsTimeoutMs, TimeUnit.MILLISECONDS), hence we don't need null-check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Both ExecutionException and InterruptedException will also only be thrown by Future#get, so I think the try-catch should probably look like:
Future<Void> future = rollEditsRpcExecutor.submit(getNameNodeProxy());
try {
future.get(....)
...
} catch (...) {
...
}
This will make it more clear why future will not be null when we get into the catch block.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But also this is unrelated to this JIRA, can you put up a separate JIRA/PR for it? Particular since this JIRA is just aimed at fixing test flakiness, it's better to minimize any production code changes.
| } | ||
|
|
||
| @VisibleForTesting | ||
| public void resetLastRollTimeMs() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
package-private instead of public?
| long curSegmentTxId = active.getNamesystem().getFSImage().getEditLog() | ||
| .getCurSegmentTxId(); | ||
| return (origTxId != curSegmentTxId); | ||
| }, 500, TimeUnit.SECONDS.toMillis(maxWaitSec)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is the check interval increased from 100ms to 500ms?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think checking above condition every 100ms is too aggressive, keeping it to 500ms is less aggressive and quite enough for both of our tests: a) timeout during verification b) successful verification of Standby NN's txnId.
However, now that we are going to add Timer implementation, it's better to keep it as is.
| // so that our timeout test provided just below can take advantage | ||
| // of validation: (monotonicNow() - lastRollTimeMs) > logRollPeriodMs | ||
| // provided in EditLogTailer#tooLongSinceLastLoad(). | ||
| active.getNamesystem().getEditLogTailer().resetLastRollTimeMs(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you just updated the last roll time on L444 above, why do we need to do it again?
virajjasani
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems it would be better to allow for passing in a mock Timer implementation to EditLogTailer so that we can have full control over the timing information used, eliminating this whole class of problem and making it deterministic instead of applying a band-aid by tweaking the timing a bit.
Nice suggestion, updated the PR. Thanks for the review @xkrogen.
| if (future != null) { | ||
| future.cancel(true); | ||
| } | ||
| future.cancel(true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because future will never be null here. The only way we can reach here is by catching TimeoutException and TimeoutException can only occur here because of future.get(rollEditsTimeoutMs, TimeUnit.MILLISECONDS), hence we don't need null-check.
| long curSegmentTxId = active.getNamesystem().getFSImage().getEditLog() | ||
| .getCurSegmentTxId(); | ||
| return (origTxId != curSegmentTxId); | ||
| }, 500, TimeUnit.SECONDS.toMillis(maxWaitSec)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think checking above condition every 100ms is too aggressive, keeping it to 500ms is less aggressive and quite enough for both of our tests: a) timeout during verification b) successful verification of Standby NN's txnId.
However, now that we are going to add Timer implementation, it's better to keep it as is.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
| if (future != null) { | ||
| future.cancel(true); | ||
| } | ||
| future.cancel(true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Both ExecutionException and InterruptedException will also only be thrown by Future#get, so I think the try-catch should probably look like:
Future<Void> future = rollEditsRpcExecutor.submit(getNameNodeProxy());
try {
future.get(....)
...
} catch (...) {
...
}
This will make it more clear why future will not be null when we get into the catch block.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO it's better to add a new constructor to avoid changes in the calling classes:
public EditLogTailer(FSNamesystem namesystem, Configuration conf) {
this(namesystem, conf, new Timer());
}
EditLogTailer(FSNamesystem namesystem, Configuration conf,
Timer timer) {
this.tailerThread = ...
this.conf = conf;
this.namesystem = namesystem;
this.timer = timer;
...
}
But upon looking further, it seems that we never actually pass in a custom timer to the constructor; we only change it via setTimer, so I think we can get rid of the version that has a custom timer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we already have a class called ManualTimer or FakeTimer, can you check again?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes we have FakeTimer, however, it's c'tor allows only fixed values initially, which can only be incremented later (IMHO, not efficient for our test, we just require setting to specific value):
public FakeTimer() {
// Initialize with a non-trivial value.
now = 1577836800000L; // 2020-01-01 00:00:00,000+0000
nowNanos = TimeUnit.MILLISECONDS.toNanos(1000);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the common faker timer would be better if we create a new public constructor? Using advance would be more clearer when reading.
FakeTimer fakeTimer = new FakeTimer(standby.getNamesystem().getEditLogTailer().getTimer().monotonicNow());
standby.getNamesystem().getEditLogTailer().setTimerForTest(fakerTimer);
...
fakeTimer.advance(inSufficientTimeForLogRoll);
try {
checkForLogRoll(active, origTxId, noLogRollWaitTime);
fail("Expected to timeout");
}
....
fakeTimer.advance(sufficientTimeForLogRoll);
checkForLogRoll(active, origTxId, logRollWaitTime);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you change the name of the parameter (e.g. newTimer) instead of suppressing the warning?
| if (future != null) { | ||
| future.cancel(true); | ||
| } | ||
| future.cancel(true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But also this is unrelated to this JIRA, can you put up a separate JIRA/PR for it? Particular since this JIRA is just aimed at fixing test flakiness, it's better to minimize any production code changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of changing the timer instance multiple times, can you store a reference to the timer, then adjust the time returned by the timer?
TestTimer timer = new TestTimer(...);
standby.getNamesystem().getEditLogTailer().setTimerForTest(timer);
timer.setTime(...);
...
timer.setTime(...);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we simplify this comment to just:
// move time forward by 1 second, which is not enough to trigger a log roll
Better yet, we could save the log roll time in line 405 above:
long logRollPeriod = standbyCatchupWaitTime + noLogRollWaitTime + 1;
and then set the time to curTime + TimeUnit.SECONDS.toMillis(logRollPeriod / 2) -- I think this should be sufficiently self-explanatory to not require a comment, since you can clearly see we're only moving time by half the log-roll period.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems we don't need to change this anymore now that we're using the fake timer?
This comment has been minimized.
This comment has been minimized.
|
@xkrogen @aajisaka @jojochuang I have addressed all the comments and test is no longer flaky after introducing custom Timer, and it is more simplified now with minor changes in source code. Please take a look when you get some time. |
|
💔 -1 overall
This message was automatically generated. |
|
💔 -1 overall
This message was automatically generated. |
liuml07
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Took a quick look.
- @virajjasani I see some nice discussions about the failure and the approach are in the format of comments. Could you explain in the PR description section what is the intention of the test, what makes it flaky, and what is the solution to fix the flaky test? We can borrow previous discussion comments. That will be both helpful for review and revisit efforts. Specially, it is worth to clarify that only the test if flaky; the production code in log tailer works just fine.
- I usually prefer a less intrusive code change for fixing flaky tests. But here I understand the benefit of fine control timer would make the test more reliable. Also I see other places have been separating the timer for easing tests as well.
This current PR looks good to me, but I think it would be better to have another vote before commit. CC: @jojochuang @xkrogen
| long curTime = standby.getNamesystem().getEditLogTailer().getTimer() | ||
| .monotonicNow(); | ||
| long inSufficientTimeForLogRoll = | ||
| TimeUnit.SECONDS.toMillis(logRollPeriod / 3); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few places changing this logRollPeriod from seconds to milliseconds. It seems clearer to jut rename it to logRollPeriodMs and use values in milliseconds.
| } | ||
| assertTrue(exceptionThrown); | ||
|
|
||
| long curTimeNew = standby.getNamesystem().getEditLogTailer().getTimer() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It this the testTimer.monotinicNow()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the common faker timer would be better if we create a new public constructor? Using advance would be more clearer when reading.
FakeTimer fakeTimer = new FakeTimer(standby.getNamesystem().getEditLogTailer().getTimer().monotonicNow());
standby.getNamesystem().getEditLogTailer().setTimerForTest(fakerTimer);
...
fakeTimer.advance(inSufficientTimeForLogRoll);
try {
checkForLogRoll(active, origTxId, noLogRollWaitTime);
fail("Expected to timeout");
}
....
fakeTimer.advance(sufficientTimeForLogRoll);
checkForLogRoll(active, origTxId, logRollWaitTime);
|
Thank you @liuml07 for the review. Addressed the comments and updated PR description with detailed explanation of the changes. |
|
Cool I will hold one week for 2nd vote before committing. |
|
🎊 +1 overall
This message was automatically generated. |
|
💔 -1 overall
This message was automatically generated. |
|
💔 -1 overall
This message was automatically generated. |
|
💔 -1 overall
This message was automatically generated. |
|
💔 -1 overall
This message was automatically generated. |
|
💔 -1 overall
This message was automatically generated. |
|
💔 -1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |
tasanuma
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this, @virajjasani.
Most of them look good to me. I left some minor comments.
...t/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestEditLogTailer.java
Outdated
Show resolved
Hide resolved
...t/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestEditLogTailer.java
Show resolved
Hide resolved
| long inSufficientTimeForLogRoll = logRollPeriodMs / 3; | ||
| final FakeTimer testTimer = | ||
| new FakeTimer(curTime + inSufficientTimeForLogRoll); | ||
| standby.getNamesystem().getEditLogTailer().setTimerForTest(testTimer); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just a thought. it would be great if we can refactor the MiniDfsCluster, the NameNode, FSNamesystem and EditLogTailer such that they take a FakeTimer as a parameter during initialization. If all the tests adopt the way of FakeTimer we wouldn't have so many flaky tests. But I reckon it's out of scope of this change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice idea, I think we can target this as follow up work. Similar to EditLogTailer, we should introduce Timer instance such that we keep using Timer's default version of now, monotonicNow etc utilities but tests would get a way to inject FakeTimer.
|
💔 -1 overall
This message was automatically generated. |
|
@liuml07 There are three +1s. Can we merge it? |
|
Thank you very much for your contribution. Thank you everyone else for reviewing! |
|
Thanks everyone for the in-depth reviews! |
…#testStandbyTriggersLogRollsWhenTailInProgressEdits (apache#3235) Contributed by Viraj Jasani. Signed-off-by: Mingliang Liu <[email protected]> Signed-off-by: Takanobu Asanuma <[email protected]> Signed-off-by: Wei-Chiu Chuang <[email protected]>
The intention here is to de-flake test TestEditLogTailer#testStandbyTriggersLogRollsWhenTailInProgressEdits.
The purpose of this test is to ensure that EditLogTailer is able to perform log roll within a specific period of time. Test sets
dfs.ha.tail-edits.periodas 1 sec anddfs.ha.log-roll.periodas 5 sec. Hence, the thread responsible for checking whether we should tail edit journals sleeps for 1 sec in an never-ending loop and it can trigger active log roll iftimer.monotonicNow() - lastRollTimeMsis greater than 5 sec. Unless we can provide custom Timer to tweakmonotonicNow(), it is almost impossible to guarantee that above condition will be true. So this test has been flaky.In order to de-flake it, we have introduced
TimerinEditLogTailerto replace all instances ofTime.monotonicNow()totimer.monotonicNow()where timer isTimerinstance. This part of the code doesn't change any behaviour in source code ofEditLogTailer. On the other hand, making this change allows this test to provide it's own custom timer instance and properly deal with above condition ((timer.monotonicNow() - lastRollTimeMs) > logRollPeriodMs) such that it will be false when we setinSufficientTimeForLogRollinFakeTimerand the condition will be true when we advance the timer withsufficientTimeForLogRolland hence it will trigger the log rolling and we can assert the same.