Add INDEX_REFRESH_BLOCK #117543

Merged

tlrx merged 16 commits into elastic:main from tlrx:2024/11/26/ES-10131 on Nov 29, 2024

Conversation

@tlrx (Member) commented Nov 26, 2024

This pull request adds a new ClusterBlockLevel called REFRESH. This level is used in a new ClusterBlock.INDEX_REFRESH_BLOCK which is automatically added to new indices that are created from an empty store, with replicas, and only on serverless deployments that have a feature flag enabled. The block is also only added when all nodes of the cluster are on a recent enough transport version.

If for some reason the new ClusterBlock is sent over the wire to a node with an old transport version, the REFRESH cluster block level will be removed from the set of blocked levels. I expect this not to be an issue, as nodes with old transport versions should not make any use of ClusterBlock.INDEX_REFRESH_BLOCK. Still, I think it is worth backporting ClusterBlockLevel.REFRESH, ClusterBlock.INDEX_REFRESH_BLOCK, and the serialization changes to v8.18.
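
To make the compatibility handling concrete, here is a minimal sketch of what stripping the level on the wire could look like. It assumes StreamOutput#writeEnumSet and the TransportVersions.NEW_REFRESH_CLUSTER_BLOCK constant referenced later in this PR; it is an illustration, not the PR's exact code.

    // Illustrative sketch, not the PR's exact code: drop the REFRESH level
    // when serializing to a node that predates the new transport version.
    private static void writeLevels(StreamOutput out, EnumSet<ClusterBlockLevel> levels) throws IOException {
        if (out.getTransportVersion().onOrAfter(TransportVersions.NEW_REFRESH_CLUSTER_BLOCK)) {
            out.writeEnumSet(levels);
        } else {
            // Older nodes do not know ClusterBlockLevel.REFRESH, so strip it
            // before writing; they never use INDEX_REFRESH_BLOCK anyway.
            EnumSet<ClusterBlockLevel> compatible = EnumSet.copyOf(levels);
            compatible.remove(ClusterBlockLevel.REFRESH);
            out.writeEnumSet(compatible);
        }
    }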

In the future, the REFRESH cluster block will be used:

  • to block refreshes on shards until an unpromotable shard is started
  • to allow skipping shards when searching

Relates ES-10131
@tlrx tlrx added the >non-issue, :Distributed/Recovery, and v9.0.0 labels Nov 26, 2024
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Indexing (obsolete) label Nov 26, 2024
@elasticsearchmachine (Collaborator) commented:

Pinging @elastic/es-distributed-indexing (Team:Distributed Indexing)

@elasticsearchmachine elasticsearchmachine added the serverless-linked label Nov 26, 2024
@tlrx tlrx marked this pull request as draft November 26, 2024 12:24
@tlrx tlrx marked this pull request as ready for review November 27, 2024 13:39
@tlrx tlrx requested a review from a team as a code owner November 27, 2024 13:39

    private static <E extends Enum<E>> boolean assertEnumToWrite(E enumValue, TransportVersion version) {
        assert enumValue instanceof XContentType == false : "XContentHelper#writeTo should be used for XContentType serialisation";
        assert enumValue != ClusterBlockLevel.REFRESH || version.onOrAfter(TransportVersions.NEW_REFRESH_CLUSTER_BLOCK)
Member:

This class should not have a reference to a specific enum value and transport version - this should be handled by the class doing the serialization

Contributor:

I think there is no other way to do this check. Notice that the class using this enum does do the right thing. I think we can either have the assertion or remove it. I would prefer to have it, similar to how we have the XContentType assertion. We could have a separate task to introduce the infrastructure to declare a minimum transport version on an enum, so that this is verified by the infrastructure itself.

Member Author:

I added this assertion to ensure that we'll be informed in case I missed a place where the ClusterBlockLevel is serialized. I agree it's not great, but it helps me be more confident in the change, as some parts of this PR will be backported to 8.x (at least the new cluster block and the serialization changes).

I would prefer to keep this assertion until the backport is done and CI has run for a couple of weeks before removing it, if that's possible.

Contributor:

I think I'm with @thecoop here - writeEnum is only appropriate for the (very common) case that enum values map exactly to their ordinals in the wire protocol. In all other cases the caller should define its own mapping between enum values and wire representations, calling writeVInt() and readVInt() itself to deal with older versions that relied on writeEnum() and readEnum().
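
As a hedged sketch of that pattern (illustrative only; the wire ids below are chosen to match the historical writeEnum() ordinals, and only StreamOutput#writeVInt / StreamInput#readVInt are taken from the comment above):

    // Illustrative: the caller owns the enum <-> wire-id mapping instead of
    // relying on writeEnum()'s ordinal-based encoding.
    static void writeLevel(StreamOutput out, ClusterBlockLevel level) throws IOException {
        out.writeVInt(switch (level) {
            case READ -> 0;
            case WRITE -> 1;
            case METADATA_READ -> 2;
            case METADATA_WRITE -> 3;
            case REFRESH -> 4; // new value appended; ids 0-3 match the old ordinals
        });
    }

    static ClusterBlockLevel readLevel(StreamInput in) throws IOException {
        return switch (in.readVInt()) {
            case 0 -> ClusterBlockLevel.READ;
            case 1 -> ClusterBlockLevel.WRITE;
            case 2 -> ClusterBlockLevel.METADATA_READ;
            case 3 -> ClusterBlockLevel.METADATA_WRITE;
            case 4 -> ClusterBlockLevel.REFRESH;
            default -> throw new IllegalArgumentException("unknown ClusterBlockLevel wire id");
        };
    }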

Member Author:

I pushed 00d3dce to remove that change in StreamOutput.

@thecoop (Member) left a comment:

StreamOutput should not be modified for a specific enum value

@henningandersen (Contributor) left a comment:

LGTM

      METADATA_READ,
    - METADATA_WRITE;
    + METADATA_WRITE,
    + REFRESH;
Contributor:

I did not check, but I wonder if there are any greater-than or less-than comparisons against the "level", since the wording signals some order. Perhaps you can take a look (if you have not already)?

Member Author:

I checked and saw no comparisons based on the ordinal (and did not expect to see one either).

        EnumSet.of(ClusterBlockLevel.WRITE)
    );
    public static final ClusterBlock INDEX_REFRESH_BLOCK = new ClusterBlock(
        14,
Contributor:

What is with the hole here? Just avoiding the unfortunate 13, or leaving room for one to go in between?

Member Author:

13 is assigned to CLUSTER_READ_ONLY_ALLOW_DELETE_BLOCK in the Metadata class.

I was thinking of declaring all ids in the same constants class (all blocks are defined in server) as a possible follow-up.
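
A hypothetical sketch of that follow-up (class and constant names invented; 13 and 14 are the ids discussed above):

    // Hypothetical follow-up: declare every block id in one place so gaps
    // like 13 (CLUSTER_READ_ONLY_ALLOW_DELETE_BLOCK) vs 14 are visible.
    public final class ClusterBlockIds {
        public static final int CLUSTER_READ_ONLY_ALLOW_DELETE_BLOCK_ID = 13;
        public static final int INDEX_REFRESH_BLOCK_ID = 14;

        private ClusterBlockIds() {}
    }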

    if (useRefreshBlock(settings) == false) {
        return (clusterBlocks, indexMetadata, minClusterTransportVersion) -> {};
    }
    logger.info("applying refresh block on index creation");
Contributor:

Should we move this to debug?

Member Author:

Sure, I pushed d838465.

    private static boolean applyRefreshBlock(IndexMetadata indexMetadata, TransportVersion minClusterTransportVersion) {
        return 0 < indexMetadata.getNumberOfReplicas() // index has replicas
            && indexMetadata.getResizeSourceIndex() == null // index is not a split/shrink index
            && indexMetadata.getInSyncAllocationIds().values().stream().allMatch(Set::isEmpty) // index is a new index
Contributor:

Can this ever not be true? Fine to keep, ofc.

Member Author:

I expect this to always be true, but decided to copy the conditions from org.elasticsearch.cluster.routing.IndexRoutingTable.Builder#initializeEmpty for extra safety.

    private final boolean forbidPrivateIndexSettings;
    private final Set<IndexSettingProvider> indexSettingProviders;
    private final ThreadPool threadPool;
    private final @Nullable ClusterBlocksTransformer blocksTransformerUponIndexCreation;
Contributor:

This no longer looks nullable?

Member Author:

Removed in ccee9d1


@kingherc (Contributor) left a comment:

Nice idea for serverless!

> added to new indices that are created from an empty store

Why not all indices?

> I think that it is worth backporting

Should you add the v8.x and auto-backport labels?

> to block refreshes on shards until an unpromotable shard is started

I am a bit unfamiliar with cluster blocks, but just to confirm, the block will be at the index level, not the whole cluster, right? I think the clusterBlocks.addIndexBlock() shows that, but would like you to confirm.

@kingherc (Contributor) left a comment:

Will wait for the rest of my comments' answers before approving, but in general this looks good to me.

@tlrx (Member Author) commented Nov 28, 2024

> Why not all indices?

The change targets searches during the rollover of data streams, when the new write index can accept writes but unpromotable shards are not ready yet. Do you think it would be valuable for other indices, like ones restored from a snapshot?

> Should you add the v8.x and auto-backport labels?

I'll add the label, but I expect to only backport the serialization changes with the new cluster block (i.e., not the changes to apply the block during index creation).

> I am a bit unfamiliar with cluster blocks, but just to confirm, the block will be at the index level, not the whole cluster, right? I think the clusterBlocks.addIndexBlock() shows that, but would like you to confirm.

Yes. A ClusterBlockLevel can be used at both the global level and the index level, but this change applies the new refresh block at the index level.
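
A minimal sketch of that index-level application, assuming the ClusterBlocks.Builder#addIndexBlock API mentioned above and that the new block is declared on IndexMetadata (illustrative, not the PR's exact code):

    // Illustrative: the refresh block is added for a single index, not globally.
    ClusterBlocks.Builder blocks = ClusterBlocks.builder().blocks(currentState.blocks());
    blocks.addIndexBlock(indexMetadata.getIndex().getName(), IndexMetadata.INDEX_REFRESH_BLOCK);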

@kingherc (Contributor) left a comment:

LGTM. Please consider my approval "light" in the sense that I'm not too familiar with cluster blocks.

> The change targets searches during the rollover of data streams, when the new write index can accept writes but unpromotable shards are not ready yet. Do you think it would be valuable for other indices, like ones restored from a snapshot?

Because you mentioned you may use it to allow skipping shards when searching, it might not be applicable to all indices. However, I was thinking that the PR's aspect of blocking until search shards are ready may be useful for all indices, in lieu of a "wait for index green health" step.

Specifically, PR #117486 came to mind, where we (with @carlosdelest) were struggling to find a way to have a "wait for at least one search shard to be ready" step so that searches do not fail. We do not have such an API yet. The idea that refreshes or searches simply wait for the search shard(s) to be ready makes sense to me. I'm not sure a cluster block is the perfect way for this, but I was intrigued by the idea and thought I'd mention it here for your thoughts on whether this might be reused.

@tlrx tlrx requested a review from thecoop November 29, 2024 08:36
@carlosdelest (Member) commented:

> I'm not sure a cluster block is the perfect way for this, but I was intrigued by the idea and thought I'd mention it here for your thoughts on whether this might be reused.

@kingherc , thanks for the ping!

Would searches fail during the INDEX_REFRESH_BLOCK? In that case, I'm not sure how we would be able to reuse this for the "wait for a search shard" use case 🤔. How do you envision using this block for it?

@tlrx (Member Author) commented Nov 29, 2024

> Would searches fail during the INDEX_REFRESH_BLOCK? In that case, I'm not sure how we would be able to reuse this for the "wait for a search shard" use case 🤔. How do you envision using this block for it?

We plan to modify the search logic so that it skips search shards if the index has a refresh block and no search shard copy is started. The goal here is to avoid the search returning a shard failure (and therefore 503s).

We also plan to modify the refresh logic so that it blocks until a search shard is started (that's why it is called a refresh block). This way, if a bulk request with ?refresh=immediate|wait_for is executed, it will block until a search shard is started and a subsequent search is guaranteed to see the indexed document.

Note that ES-10131 has more context too.
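
A hedged sketch of that planned search-side check (illustrative; searchShardCopyStarted is a made-up helper, and IndexMetadata.INDEX_REFRESH_BLOCK assumes where the block is declared):

    // Illustrative: skip the shard (rather than fail the search with a 503)
    // while its index carries the refresh block and no search copy is started.
    static boolean skipShard(ClusterState state, ShardRouting shard) {
        boolean refreshBlocked = state.blocks()
            .hasIndexBlock(shard.getIndexName(), IndexMetadata.INDEX_REFRESH_BLOCK);
        // searchShardCopyStarted is a made-up helper standing in for "is any
        // search (unpromotable) copy of this shard started yet?"
        return refreshBlocked && searchShardCopyStarted(state, shard.shardId()) == false;
    }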

@kingherc (Contributor) commented:

Specifically, @carlosdelest, I was thinking that if INDEX_REFRESH_BLOCK could apply to synonyms, we could try having the tests simply refresh the synonyms index, which would block until at least one search shard is ready. But what Tanguy also described about skipping empty, not-ready search shards could work as well if the synonyms are empty.

Note that for this PR to help with the synonyms test situation, the synonyms index in the tests would need to be newly created and empty, so that the INDEX_REFRESH_BLOCK is applied.

@carlosdelest (Member) commented:

Got it @kingherc , I will give it a try when this PR is merged. Thanks!

@tlrx (Member Author) commented Nov 29, 2024

> Got it @kingherc, I will give it a try when this PR is merged. Thanks!

There are more changes needed after this PR before you can use it. Also, it will only work in serverless.

@tlrx tlrx merged commit 045f6a3 into elastic:main Nov 29, 2024
@tlrx (Member Author) commented Nov 29, 2024

Thanks all for your feedback.

I merged the PR with the assertion removed as requested by @thecoop, who is on PTO today.

tlrx added a commit to tlrx/elasticsearch that referenced this pull request Nov 29, 2024
This is the backport of elastic#117543 for 8.18. It contains the cluster block and cluster block level, the transport version, and the serialization changes.

It does NOT contain the MetadataCreateIndexService logic to apply the block.
tlrx added a commit that referenced this pull request Nov 29, 2024
This is the backport of #117543 for 8.18. It contains the cluster block and cluster block level, the transport version, and the serialization changes.

It does NOT contain the MetadataCreateIndexService logic to apply the block.
craigtaverner pushed a commit to craigtaverner/elasticsearch that referenced this pull request Dec 2, 2024
This change adds a new ClusterBlockLevel called REFRESH. This level is used in a new ClusterBlock.INDEX_REFRESH_BLOCK which is automatically added to new indices that are created from an empty store, with replicas, and only on serverless deployments that have a feature flag enabled. The block is also only added when all nodes of the cluster are on a recent enough transport version.

If for some reason the new ClusterBlock is sent over the wire to a node with an old transport version, the REFRESH cluster block level will be removed from the set of blocked levels.

In the future, the REFRESH cluster block will be used:
  • to block refreshes on shards until an unpromotable shard is started
  • to allow skipping shards when searching

Relates ES-10131
benchaplin added a commit that referenced this pull request Jul 21, 2025
#117543 introduced a cluster block that is added to new indices in
stateless and removed when at least one replica is ready. A search
against those indices should be skipped during that time.
breskeby pushed a commit to breskeby/elasticsearch that referenced this pull request Feb 11, 2026
…c#3208)

This pull request introduces a feature flag setting that enables the addition of the INDEX_REFRESH_BLOCK upon new index creation (see elastic#117543).

It also adds integration tests related to the automatic addition of the refresh block.

Relates
breskeby pushed a commit to breskeby/elasticsearch that referenced this pull request Feb 11, 2026
elastic#117543 introduced a ClusterBlock which is applied to new indices in Serverless which do not yet have search shards up. We should skip searches for indices with this block in order to avoid meaningless 503s.

Labels

  • :Distributed/Recovery (Anything around constructing a new shard, either from a local or a remote source.)
  • >non-issue
  • serverless-linked (Added by automation, don't add manually)
  • Team:Distributed Indexing (obsolete) (Meta label for Distributed Indexing team. Obsolete. Please do not use.)
  • v8.18.0
  • v9.0.0


7 participants