Fix failing tests on feature/desired-balance-allocator branch #86429

@idegtiarenko

The following tests are failing and need to be fixed before DesiredBalanceShardsAllocator can be merged to master.

  • DesiredBalanceServiceTests.* (Fix DesiredBalanceServiceTests #86435)
  • DesiredBalanceReconcilerTests.* (Fix DesiredBalanceReconcilerTests#testFailsNewPrimariesIfNoDataNodes #86432)
  • ClusterAllocationExplainIT.testAllocationFilteringOnIndexCreation
  • ClusterHealthIT.testHealthOnIndexCreation
  • CorruptedFileIT.testCorruptionOnNetworkLayer (replica shard is not started and remains in error state)
  • CorruptedFileIT.testReplicaCorruption
  • FilteringAllocationIT.testDecommissionNodeNoReplicas
  • IndexFoldersDeletionListenerIT.testListenersInvokedWhenIndexHasLeftOverShard (small probability of getting stuck after logger.debug("--> creating a new index [{}]", indexName);)
  • IndexRecoveryIT.testCancelNewShardRecoveryAndUsesExistingShardCopy
  • IndexRecoveryIT.testDoNotInfinitelyWaitForMapping (timed out waiting for green state: ALLOCATION_FAILED, failed shard on node [y0Jt0-QrSu29efTbYU8AdQ]: failed to create index, failure org.elasticsearch.index.mapper.MapperParsingException: simulate mapping parsing error)
  • IndexRecoveryIT.testCancelRecoveryWithAutoExpandReplicas (gets stuck after creating the [0-all] index in a cluster with a single master and no data nodes)
  • RareClusterStateIT.testDeleteCreateInOneBulk (consistently times out when creating an index with a 0s timeout)
  • RecoveryFromGatewayIT.testSingleNodeNoFlush
  • ReplicaShardAllocatorIT.testDoNotCancelRecoveryForBrokenNode (timed out waiting for green state: ALLOCATION_FAILED, failed recovery, failure org.elasticsearch.indices.recovery.RecoveryFailedException)
  • ReplicaShardAllocatorIT.testPreferCopyCanPerformNoopRecovery
  • ReplicaShardAllocatorIT.testPreferCopyWithHighestMatchingOperations
  • ReplicaShardAllocatorIT.testPeerRecoveryForClosedIndices (<5% probability)
  • ReplicaShardAllocatorSyncIdIT.testPreferCopyCanPerformNoopRecovery
  • SimpleIndexStateIT.testFastCloseAfterCreateContinuesCreateAfterOpen (~50% failure rate with Expected: <RED> but: was <YELLOW> when creating an index that could not be allocated)
  • TransportSearchFailuresIT.testFailedSearchWithWrongQuery (~1% probability of timing out at logger.info("Done Cluster Health, status {}", clusterHealth.getStatus());, closer to ~5% with -Dtests.seed=F9E8E5F50A9C9B21)
  • UpdateShardAllocationSettingsIT.testUpdateSameHostSetting
  • ClusterRerouteIT.testDelayWithALargeAmountOfShards (rarely times out: the shard balance does not converge for 250 shards over 3 data nodes, with ~5% probability). Related to: BalancedShardsAllocator rebalancing might move shards but not improve the balance #88384
  • GetGlobalCheckpointsActionIT.testWaitOnIndexCreated (repeatedly failing)
  • GetGlobalCheckpointsActionIT#testWaitOnPrimaryShardThrottled (cluster.routing.allocation.node_initial_primaries_recoveries=0 prevents balance from converging)
  • org.elasticsearch.datastreams.DataStreamMigrationIT.testBasicMigration (times out when executing the migration; the listener is not called in the else branch)
  • NodeShutdownShardsIT.testNodeReplacementOnlyAllowsShardsFromReplacedNode
  • test {yaml=indices.split/30_copy_settings/Copy settings during split index}
  • test {yaml=indices.shrink/30_copy_settings/Copy settings during shrink index}
  • TransformAuditorIT.testAliasCreatedforBWCIndexes

org.elasticsearch.action.admin.indices.shrink.TransportResizeAction

  • ShrinkIndexIT.testCreateShrinkIndexToN ([NO(initial allocation of the shrunken index is only allowed on nodes [_id:"hg09_hMfS3uDUfv93xggmA"] that hold a copy of every shard in the index)])
  • ShrinkIndexIT.testShrinkThenSplitWithFailedNode (NO(initial allocation of the shrunken index is only allowed on nodes [_id:"eh7a8csCQzOwCePaFlw9xA"] that hold a copy of every shard in the index))
  • SplitIndexIT.testCreateSplitIndexToN (NO(source primary is allocated on another node))
  • SplitIndexIT.testSplitFromOneToN (NO(source primary is allocated on another node))
  • SplitIndexIT.testSplitIndexPrimaryTerm (NO(source primary is allocated on another node))
  • PartitionedRoutingIT.testShrinking

HasFrozenCacheAllocationDecider

  • various searchable snapshot test failures due to throttling when xpack.searchable.snapshot.shared_cache.size is not yet reported
  • FrozenExistenceDeciderIT.testZeroToOne fails for the same reason

MoveAllocationCommand usage

  • ClusterRerouteIT.testClusterRerouteWithBlocks (uses MoveAllocationCommand)
  • IndexPrimaryRelocationIT.testPrimaryRelocationWhileIndexing (uses MoveAllocationCommand)
  • IndexRecoveryIT.testRerouteRecovery
  • IndicesStoreIntegrationIT.testIndexCleanup (~10% chance of getting stuck when run individually)
  • RelocationIT.testRelocationWhileIndexingRandom (uses MoveAllocationCommand)
  • RelocationIT.testRelocationWhileRefreshing (uses MoveAllocationCommand)

Not retrying shard allocation after an error

setWaitForNoRelocatingShards(true) should wait for desired balance to converge

  • AwarenessAllocationIT.testAwarenessZonesIncrementalNodes (a health request with setWaitForNoRelocatingShards(true) does not wait for a pending desired balance computation; ~10% chance)
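
The gap described in this section can be pictured with a small model. This is only a sketch under assumptions: the class, fields, and method names below are hypothetical and are not Elasticsearch code. It contrasts a wait condition that looks only at relocating shards with one that also requires no desired balance computation to be pending.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical model of the wait condition; not part of Elasticsearch. */
final class HealthWaitSketch {
    // Number of shards currently relocating (stand-in for the routing table).
    final AtomicInteger relocatingShards = new AtomicInteger();
    // True while a desired balance computation is requested but not yet applied.
    final AtomicBoolean balanceComputationPending = new AtomicBoolean();

    /** Condition as the issue describes it today: only relocations are checked. */
    boolean noRelocatingShards() {
        return relocatingShards.get() == 0;
    }

    /** Condition the issue asks for: additionally require a converged balance. */
    boolean noRelocatingShardsAndBalanceConverged() {
        return relocatingShards.get() == 0 && balanceComputationPending.get() == false;
    }
}
```

In this model, a test that adds a node and immediately waits on the first condition can proceed before any relocation has started, because the computation that would schedule it is still pending. The second condition closes that window.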

Snapshot-related tests

  • AbortedRestoreIT.testAbortedRestoreAlsoAbortFileRestores
  • BlobStoreIncrementalityIT.testIncrementalBehaviorOnPrimaryFailover (20% chance of failure with timed out waiting for green state)
  • FsBlobStoreRepositoryIntegTests.testSnapshotAndRestore
  • IndicesOptionsIntegrationIT.testWildcardBehaviourSnapshotRestore
  • MetadataLoadingDuringSnapshotRestoreIT.testWhenMetadataAreLoaded
  • ConcurrentSnapshotsIT.testConcurrentRestoreDeleteAndClone
  • CorruptedBlobStoreRepositoryIT.*
  • DataStreamsSnapshotsIT.*
  • DedicatedClusterSnapshotRestoreIT.*
  • DiskThresholdDeciderIT.testRestoreSnapshotAllocationDoesNotExceedWatermark
  • RestoreSnapshotIT.*
  • SharedClusterSnapshotRestoreIT.testUnrestorableIndexDuringRestore (this test gets stuck when run individually)
  • SnapshotCustomPluginStateIT.testIncludeGlobalState
  • SnapshotStressTestsIT.testRandomActivities
  • SystemDataStreamSnapshotIT.*
  • SystemIndicesSnapshotIT.*

4308 integration tests passed.

ESAllocationTestCase related unit test failures when using desired balance allocator

Labels

  • :Distributed Coordination/Allocation: All issues relating to the decision making around placing a shard (both master logic & on the nodes)
  • Team:Distributed (Obsolete): Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination.
