Commit 08bcd83

[ML] Reduce persistent tasks periodic reassignment interval in ... (#36845)
... MlDistributedFailureIT.testLoseDedicatedMasterNode.

An intermittent failure has been observed in `MlDistributedFailureIT.testLoseDedicatedMasterNode`. The test launches a cluster composed of a dedicated master node and a data and ML node. It creates a job and a datafeed and starts them. It then shuts down and restarts the master node. Finally, the test asserts that the two tasks have been reassigned within 10s.

The intermittent failure occurs because the assertion that the tasks have been reassigned fails: the `assertBusy` that performs it times out. Furthermore, it appears that the job task is not reassigned because the memory tracking info is stale. Memory tracking info is refreshed asynchronously when reassignment of a job is attempted. Reassignment is attempted either in response to a relevant cluster state change or periodically. The periodic interval is controlled by a cluster setting called `cluster.persistent_tasks.allocation.recheck_interval` and defaults to 30s. What seems to be happening in this test is that if all cluster state changes after the master node is restarted come through before the async memory info refresh completes, then it might take up to 30s until reassignment of the job is attempted again. Thus the `assertBusy` times out.

This commit changes the test to reduce the interval of the periodic check that reassigns persistent tasks to `200ms`. If the above theory is correct, this should eliminate those failures.

Closes #36760
1 parent 0f2f00a commit 08bcd83
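
The timing argument in the commit message can be illustrated with a small, self-contained sketch. It is not Elasticsearch code: the class and method names are hypothetical, the 500 ms refresh duration is an invented value, and it assumes periodic rechecks fire at exact multiples of the interval counted from the master restart. It only shows why a 30s interval can push the next reassignment attempt past the 10s assertion window while a 200ms interval does not.

```java
// Hypothetical model of the race described in the commit message.
// Assumption: periodic rechecks fire at multiples of intervalMillis after the restart.
final class RecheckTimingSketch {

    /**
     * Returns the time (in ms after the master restart) of the first periodic
     * recheck that runs at or after the asynchronous memory refresh completes.
     */
    static long firstRecheckAfterRefreshMillis(long refreshDoneMillis, long intervalMillis) {
        if (intervalMillis <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        // Ceiling division: number of whole intervals needed to reach the refresh time.
        long ticks = (refreshDoneMillis + intervalMillis - 1) / intervalMillis;
        // The first periodic recheck never happens before one full interval has passed.
        return Math.max(1, ticks) * intervalMillis;
    }

    public static void main(String[] args) {
        long assertBusyTimeoutMillis = 10_000; // the 10s window from the commit message
        long refreshDoneMillis = 500;          // invented refresh duration, for illustration only
        for (long interval : new long[] {30_000, 200}) {
            long attempt = firstRecheckAfterRefreshMillis(refreshDoneMillis, interval);
            System.out.println("interval=" + interval + "ms -> next attempt at " + attempt
                + "ms, within timeout: " + (attempt <= assertBusyTimeoutMillis));
        }
    }
}
```

Under these assumptions, the 30s interval gives a first attempt at 30,000 ms, well past the 10s window, whereas the 200ms interval gives a first attempt at 600 ms.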

2 files changed: +15 -2 lines changed

server/src/main/java/org/elasticsearch/persistent/PersistentTasksClusterService.java

Lines changed: 2 additions & 1 deletion
@@ -73,7 +73,8 @@ public PersistentTasksClusterService(Settings settings, PersistentTasksExecutorR
                 this::setRecheckInterval);
     }

-    void setRecheckInterval(TimeValue recheckInterval) {
+    // visible for testing only
+    public void setRecheckInterval(TimeValue recheckInterval) {
         periodicRechecker.setInterval(recheckInterval);
     }
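
For readers unfamiliar with the pattern, the following is a minimal sketch of a periodic task whose interval can be changed at runtime, which is the kind of behaviour `setRecheckInterval` exposes to the test. It is an assumption-laden illustration built on `java.util.concurrent`, not the actual `periodicRechecker` implementation in Elasticsearch; `AdjustableRechecker` and its methods are hypothetical names.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a periodic rechecker; not the Elasticsearch class.
final class AdjustableRechecker implements AutoCloseable {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final Runnable recheck;
    private ScheduledFuture<?> scheduled;
    private long intervalMillis;

    AdjustableRechecker(Runnable recheck, long intervalMillis) {
        this.recheck = recheck;
        setIntervalMillis(intervalMillis);
    }

    /** Cancels the pending run and reschedules the recheck with the new interval. */
    synchronized void setIntervalMillis(long newIntervalMillis) {
        if (newIntervalMillis <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        if (scheduled != null) {
            scheduled.cancel(false); // let a run that is already executing finish
        }
        intervalMillis = newIntervalMillis;
        scheduled = scheduler.scheduleWithFixedDelay(
            recheck, newIntervalMillis, newIntervalMillis, TimeUnit.MILLISECONDS);
    }

    synchronized long getIntervalMillis() {
        return intervalMillis;
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

In this sketch, lowering the interval takes effect immediately because the pending run is cancelled and rescheduled, so a test does not have to wait out the old, longer interval first.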

x-pack/plugin/ml/src/test/java/org/elasticsearch/xpack/ml/integration/MlDistributedFailureIT.java

Lines changed: 13 additions & 1 deletion
@@ -15,12 +15,14 @@
 import org.elasticsearch.common.settings.Settings;
 import org.elasticsearch.common.unit.ByteSizeUnit;
 import org.elasticsearch.common.unit.ByteSizeValue;
+import org.elasticsearch.common.unit.TimeValue;
 import org.elasticsearch.common.xcontent.DeprecationHandler;
 import org.elasticsearch.common.xcontent.NamedXContentRegistry;
 import org.elasticsearch.common.xcontent.XContentHelper;
 import org.elasticsearch.common.xcontent.XContentParser;
 import org.elasticsearch.common.xcontent.XContentType;
 import org.elasticsearch.index.query.QueryBuilders;
+import org.elasticsearch.persistent.PersistentTasksClusterService;
 import org.elasticsearch.persistent.PersistentTasksCustomMetaData;
 import org.elasticsearch.persistent.PersistentTasksCustomMetaData.PersistentTask;
 import org.elasticsearch.test.junit.annotations.TestLogging;
@@ -72,7 +74,6 @@ public void testFailOver() throws Exception {
         });
     }

-    @AwaitsFix(bugUrl = "https://github.com/elastic/elasticsearch/issues/32905")
     public void testLoseDedicatedMasterNode() throws Exception {
         internalCluster().ensureAtMostNumDataNodes(0);
         logger.info("Starting dedicated master node...");
@@ -290,6 +291,17 @@ private void run(String jobId, CheckedRunnable<Exception> disrupt) throws Except
         client().admin().indices().prepareSyncedFlush().get();

         disrupt.run();
+
+        PersistentTasksClusterService persistentTasksClusterService =
+            internalCluster().getInstance(PersistentTasksClusterService.class, internalCluster().getMasterName());
+        // Speed up rechecks to a rate that is quicker than what settings would allow.
+        // The tests would work eventually without doing this, but the assertBusy() below
+        // would need to wait 30 seconds, which would make the suite run very slowly.
+        // The 200ms refresh puts a greater burden on the master node to recheck
+        // persistent tasks, but it will cope in these tests as it's not doing anything
+        // else.
+        persistentTasksClusterService.setRecheckInterval(TimeValue.timeValueMillis(200));
+
         assertBusy(() -> {
             ClusterState clusterState = client().admin().cluster().prepareState().get().getState();
             PersistentTasksCustomMetaData tasks = clusterState.metaData().custom(PersistentTasksCustomMetaData.TYPE);
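
The `assertBusy` call in the hunk above polls a condition until it holds or a timeout expires. The sketch below shows that general polling idea in plain Java. It is not the Elasticsearch test framework's implementation: `BusyWaitSketch.waitUntil`, the doubling back-off, and the one-second cap on sleeps are assumptions made for illustration.

```java
import java.util.function.BooleanSupplier;

// Hypothetical polling helper illustrating the idea behind assertBusy-style checks.
final class BusyWaitSketch {

    /**
     * Polls the condition with exponentially growing sleeps until it returns true
     * or timeoutMillis has elapsed. Returns whether the condition was met in time.
     */
    static boolean waitUntil(BooleanSupplier condition, long timeoutMillis) throws InterruptedException {
        long deadlineNanos = System.nanoTime() + timeoutMillis * 1_000_000L;
        long sleepMillis = 1;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            long remainingMillis = (deadlineNanos - System.nanoTime()) / 1_000_000L;
            if (remainingMillis <= 0) {
                return false; // timed out without the condition becoming true
            }
            Thread.sleep(Math.min(sleepMillis, remainingMillis));
            sleepMillis = Math.min(sleepMillis * 2, 1_000); // back off, capped at 1s per sleep
        }
    }
}
```

With a helper like this, a condition that can only become true after a 30s periodic recheck fails a 10s wait, which matches the timeout described in the commit message.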
