[ML] Reduce persistent tasks periodic reassignment interval in ... (#36845)

dimitris-athanasiou · web-flow · commit 08bcd8375783 · 2018-12-20T14:53:36.000+02:00
... MlDistributedFailureIT.testLoseDedicatedMasterNode.

An intermittent failure has been observed in `MlDistributedFailureIT.testLoseDedicatedMasterNode`. The test launches a cluster comprising a dedicated master node and a data and ML node. It creates a job and a datafeed and starts them. It then shuts down and restarts the master node. Finally, the test asserts that the two tasks have been reassigned within 10s.

The intermittent failure occurs because the `assertBusy` that checks the tasks have been reassigned times out. Investigation suggests that the job task is not reassigned because the memory tracking info is stale. Memory tracking info is refreshed asynchronously when an attempt is made to reassign a job. Reassignment is attempted either in response to a relevant cluster state change or periodically. The periodic interval is controlled by a cluster setting called `cluster.persistent_tasks.allocation.recheck_interval` and defaults to 30s.

What seems to be happening in this test is that, if all cluster state changes after the master node restart come through before the async memory info refresh completes, it can take up to 30s before another attempt is made to reassign the job. Thus the `assertBusy` times out.

This commit changes the test to reduce the interval of the periodic check that reassigns persistent tasks to `200ms`. If the above theory is correct, this should eradicate those failures.

Closes #36760
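The timing argument in the commit message can be illustrated with a small, self-contained simulation. This is only a sketch under simplifying assumptions and is not Elasticsearch code: the class `RecheckTimingSketch`, the method `reassignedAtMillis`, and the example timings (state changes at 100, 150 and 200 ms, a 500 ms refresh) are all hypothetical. It assumes periodic ticks run at fixed multiples of the interval and that the first failed attempt starts a single refresh.

```java
import java.util.List;

public class RecheckTimingSketch {

    /**
     * Simulated time (ms after the master restart) of the first successful reassignment.
     * Attempts happen at each cluster state change and at each periodic tick.
     * An attempt succeeds only once the async memory refresh has completed;
     * the first failed attempt starts that refresh.
     * stateChangeTimes must be sorted in ascending order.
     */
    static long reassignedAtMillis(List<Long> stateChangeTimes,
                                   long refreshDurationMillis,
                                   long recheckIntervalMillis) {
        long refreshDoneAt = -1;               // -1 means no refresh has been started yet
        int next = 0;                          // index of the next cluster state change
        long nextTick = recheckIntervalMillis; // time of the next periodic recheck
        while (true) {
            long now;
            if (next < stateChangeTimes.size() && stateChangeTimes.get(next) <= nextTick) {
                now = stateChangeTimes.get(next++); // attempt triggered by a state change
            } else {
                now = nextTick;                     // attempt triggered by the periodic recheck
                nextTick += recheckIntervalMillis;
            }
            if (refreshDoneAt >= 0 && now >= refreshDoneAt) {
                return now;                         // memory info is fresh: reassignment succeeds
            }
            if (refreshDoneAt < 0) {
                refreshDoneAt = now + refreshDurationMillis; // stale info: start the refresh
            }
        }
    }

    public static void main(String[] args) {
        List<Long> stateChanges = List.of(100L, 150L, 200L);
        // All state changes arrive before the refresh finishes at 600 ms.
        System.out.println("30s interval:   " + reassignedAtMillis(stateChanges, 500, 30_000) + " ms");
        System.out.println("200ms interval: " + reassignedAtMillis(stateChanges, 500, 200) + " ms");
    }
}
```

In this model the default 30s interval delays reassignment well past the test's 10s window, while a 200ms interval picks the task up almost as soon as the refresh finishes.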
diff --git a/server/src/main/java/org/elasticsearch/persistent/PersistentTasksClusterService.java b/server/src/main/java/org/elasticsearch/persistent/PersistentTasksClusterService.java
@@ -73,7 +73,8 @@ public PersistentTasksClusterService(Settings settings, PersistentTasksExecutorR
             this::setRecheckInterval);
     }
 
-    void setRecheckInterval(TimeValue recheckInterval) {
+    // visible for testing only
+    public void setRecheckInterval(TimeValue recheckInterval) {
         periodicRechecker.setInterval(recheckInterval);
     }
 
diff --git a/x-pack/plugin/ml/src/test/java/org/elasticsearch/xpack/ml/integration/MlDistributedFailureIT.java b/x-pack/plugin/ml/src/test/java/org/elasticsearch/xpack/ml/integration/MlDistributedFailureIT.java
@@ -15,12 +15,14 @@
 import org.elasticsearch.common.settings.Settings;
 import org.elasticsearch.common.unit.ByteSizeUnit;
 import org.elasticsearch.common.unit.ByteSizeValue;
+import org.elasticsearch.common.unit.TimeValue;
 import org.elasticsearch.common.xcontent.DeprecationHandler;
 import org.elasticsearch.common.xcontent.NamedXContentRegistry;
 import org.elasticsearch.common.xcontent.XContentHelper;
 import org.elasticsearch.common.xcontent.XContentParser;
 import org.elasticsearch.common.xcontent.XContentType;
 import org.elasticsearch.index.query.QueryBuilders;
+import org.elasticsearch.persistent.PersistentTasksClusterService;
 import org.elasticsearch.persistent.PersistentTasksCustomMetaData;
 import org.elasticsearch.persistent.PersistentTasksCustomMetaData.PersistentTask;
 import org.elasticsearch.test.junit.annotations.TestLogging;
@@ -72,7 +74,6 @@ public void testFailOver() throws Exception {
         });
     }
 
-    @AwaitsFix(bugUrl = "https://github.com/elastic/elasticsearch/issues/32905")
     public void testLoseDedicatedMasterNode() throws Exception {
         internalCluster().ensureAtMostNumDataNodes(0);
         logger.info("Starting dedicated master node...");
@@ -290,6 +291,17 @@ private void run(String jobId, CheckedRunnable<Exception> disrupt) throws Except
         client().admin().indices().prepareSyncedFlush().get();
 
         disrupt.run();
+
+        PersistentTasksClusterService persistentTasksClusterService =
+            internalCluster().getInstance(PersistentTasksClusterService.class, internalCluster().getMasterName());
+        // Speed up rechecks to a rate that is quicker than what settings would allow.
+        // The tests would work eventually without doing this, but the assertBusy() below
+        // would need to wait 30 seconds, which would make the suite run very slowly.
+        // The 200ms refresh puts a greater burden on the master node to recheck
+        // persistent tasks, but it will cope in these tests as it's not doing anything
+        // else.
+        persistentTasksClusterService.setRecheckInterval(TimeValue.timeValueMillis(200));
+
         assertBusy(() -> {
             ClusterState clusterState = client().admin().cluster().prepareState().get().getState();
             PersistentTasksCustomMetaData tasks = clusterState.metaData().custom(PersistentTasksCustomMetaData.TYPE);
