Avoid loading shard metadata while closing #29140
DaveCTurner wants to merge 32 commits into elastic:master
Conversation
If `ShardStateMetaData.FORMAT.loadLatestState` is called while a shard is closing, the shard metadata directory may be deleted after its existence has been checked but before the Lucene `Directory` has been created. When the `Directory` is created, the just-deleted directory is brought back into existence. There are three places where `loadLatestState` is called in a manner that leaves it open to this race. This change ensures that these calls occur either under a `ShardLock` or else while holding a reference to the existing `Store`. In either case, this protects the shard metadata directory from concurrent deletion. Cf elastic#19338, elastic#21463, elastic#25335 and https://issues.apache.org/jira/browse/LUCENE-7375
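For illustration, a minimal sketch of the pattern this change applies, using only calls that appear elsewhere in this discussion (nodeEnv, shardId and namedXContentRegistry are assumed from the surrounding code, not quoted from the actual diff):

try (ShardLock ignored = nodeEnv.shardLock(shardId, TimeUnit.SECONDS.toMillis(5))) {
    // While the lock is held the shard's data directories cannot be deleted,
    // so the existence check inside loadLatestState cannot race against a
    // deletion that the subsequent Directory creation would resurrect.
    return ShardStateMetaData.FORMAT.loadLatestState(logger, namedXContentRegistry,
        nodeEnv.availableShardPaths(shardId));
}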
Pinging @elastic/es-distributed
Note to reviewers: I have assumed a certain amount of consistency between …. I also don't have a good plan for testing this. Pointers appreciated.
@bleskes, any thoughts here?
Maybe it's a naive solution, but isn't it enough to just make sure all access in the … ?
We discussed this on Zoom, and decided that it'd be more appropriate to ask the …. NB the …. Within ….
I tried this. I don't particularly like having the call to ….
Why don't you like it? IndexShard is already the one that writes it. Alternatively we can keep an in-memory copy of it, though I personally don't feel it's needed.
Really, just that it involved importing things that weren't already there, which hinted that something was wrong. If you're good with it then that's enough. Next up is to try and get a failing test for this. |
I think I'm missing something - ….
I added a test that fails occasionally on ….
    throw new AlreadyClosedException(shardId + " can't load shard state metadata - shard is closed");
}
return ShardStateMetaData.FORMAT.loadLatestState(logger, namedXContentRegistry, dataLocations);
Very useful, thanks. This makes things much simpler. I pushed 3eff6c9.
public ShardStateMetaData loadShardStateMetaDataIfOpen(NamedXContentRegistry namedXContentRegistry, Path[] dataLocations)
        throws IOException {
    synchronized (mutex) {
        if (state == IndexShardState.CLOSED) {
This check is not needed if we make our own ShardStateMetaData, so I will remove it.
public ShardStateMetaData loadShardStateMetaDataIfOpen(NamedXContentRegistry namedXContentRegistry, Path[] dataLocations)
As per the comment below, this is not needed since we can make our own ShardStateMetaData.
public ShardStateMetaData loadShardStateMetaDataIfOpen(NamedXContentRegistry namedXContentRegistry, Path[] dataLocations)
It was, I think, because otherwise it was possible we'd get hold of an IndexShard while it was closing and then fail to load the metadata since it'd already been deleted. However, as per the comment below, we don't need to touch the disk here.
  final IndexSettings indexSettings = indexService != null ? indexService.getIndexSettings() : new IndexSettings(metaData, settings);
- final ShardPath shardPath = ShardPath.loadShardPath(logger, nodeEnv, shardId, indexSettings);
+ final ShardPath shardPath;
+ try (ShardLock ignored = nodeEnv.shardLock(shardId, TimeUnit.SECONDS.toMillis(5))) {
I looked at how we could be in a situation in which the shard lock is unavailable for a long time. This'd be the case if the shard was open, but that means there's an IndexShard so we don't get here. More precisely, there are some circumstances in which we could get here and then fail to get the shard lock because the shard is now open, but retrying is the thing to do here.
All the other usages of the shard lock seem short-lived. They protect some IO (e.g. deleting the shards, etc) so may take some time, but not infinitely long.
Also, we obtain the same shard lock a few lines down, in Store.readMetadataSnapshot, unless ShardPath.loadShardPath returns null.
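Roughly, the structure in question looks like the following sketch. It assumes Store.readMetadataSnapshot takes a ShardLocker, as the nodeEnv::shardLock usage elsewhere in this PR suggests; it is not the exact diff:

try (ShardLock ignored = nodeEnv.shardLock(shardId, TimeUnit.SECONDS.toMillis(5))) {
    shardPath = ShardPath.loadShardPath(logger, nodeEnv, shardId, indexSettings);
}
// ... a few lines down, the same lock is acquired a second time via the
// ShardLocker that is passed to readMetadataSnapshot:
return new StoreFilesMetaData(shardId,
    Store.readMetadataSnapshot(shardPath.resolveIndex(), shardId, nodeEnv::shardLock, logger));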
Could you clarify, @ywelsch?
- ShardStateMetaData shardStateMetaData = ShardStateMetaData.FORMAT.loadLatestState(logger, NamedXContentRegistry.EMPTY,
-     nodeEnv.availableShardPaths(request.shardId));
+ ShardStateMetaData shardStateMetaData = safelyLoadLatestState(shardId);
Ok, I moved this code around in 7f835cc. I'm not 100% comfortable with the changes made since I'm unfamiliar with all the invariants that may or may not hold here - please tread carefully.
try {
    IndexSettings indexSettings = new IndexSettings(metaData, settings);
    shardPath = ShardPath.loadShardPath(logger, nodeEnv, shardId, indexSettings);
    try (ShardLock ignored = nodeEnv.shardLock(shardId, TimeUnit.SECONDS.toMillis(5))) {
We obtain the same shard lock a few lines down, in Store.tryOpenIndex(...), unless ShardPath.loadShardPath returns null in which case we throw a different exception.
listingThread.start();
}
// Deleting an index asserts that it really is gone from disk, so no other assertions are necessary here.
final ShardStateMetaData shardStateMetaData = indexShard.getShardStateMetaData();
final String allocationId = shardStateMetaData.allocationId != null ?
    shardStateMetaData.allocationId.getId() : null;
logger.debug("{} shard state info found: [{}]", shardId, shardStateMetaData);
this can be chatty. Can we move back to trace?
final IndexShard indexShard = indicesService.getShardOrNull(shardId);
if (indexShard != null) {
    final ShardStateMetaData shardStateMetaData = indexShard.getShardStateMetaData();
    final String allocationId = shardStateMetaData.allocationId != null ?
allocationIds have been around since I don't know how long. When can this be null?
Its declaration says this:
https://github.com/elastic/elasticsearch/blob/65385426033fe105df8aee61d97d7d92b4ab0ecf/server/src/main/java/org/elasticsearch/index/shard/ShardStateMetaData.java#L44-L45
There are lots of other null checks too. Maybe worth addressing separately?
I'm good with doing this in a different PR.
ywelsch left a comment:
I've left a few more asks and comments.
public ShardStateMetaData getShardStateMetaData() {
    synchronized (mutex) {
We can avoid the mutex here: just do a one-time volatile read of shardRouting (which is an immutable object). indexSettings is a final field and the UUID it returns is immutable.
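A sketch of that suggestion, under the assumption that ShardStateMetaData's constructor takes the primary flag, index UUID and allocation id (as its fields suggest):

public ShardStateMetaData getShardStateMetaData() {
    // One-time volatile read; ShardRouting is immutable, so no mutex is needed.
    final ShardRouting shardRouting = this.shardRouting;
    // indexSettings is a final field and the index UUID never changes.
    return new ShardStateMetaData(shardRouting.primary(), indexSettings.getUUID(),
        shardRouting.allocationId());
}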
  final IndexSettings indexSettings = indexService != null ? indexService.getIndexSettings() : new IndexSettings(metaData, settings);
- final ShardPath shardPath = ShardPath.loadShardPath(logger, nodeEnv, shardId, indexSettings);
+ final ShardPath shardPath;
+ try (ShardLock ignored = nodeEnv.shardLock(shardId, TimeUnit.SECONDS.toMillis(5))) {
In TransportNodesListGatewayStartedShards and in Store.readMetadataSnapshot, which we call below, we catch the ShardLockObtainFailedException and treat it either as an empty store (in the case of TransportNodesListShardStoreMetaData) or as an OK target for primary allocation (see TransportNodesListGatewayStartedShards and PrimaryShardAllocator.buildNodeShardsResult). We've made sure not to end up in a situation where the master goes into a potentially long retry loop (which causes a reroute storm on the master). I don't want to open this Pandora's box here, so my suggestion is to add

} catch (ShardLockObtainFailedException ex) {
    logger.info(() -> new ParameterizedMessage("{}: failed to obtain shard lock", shardId), ex);
    return new StoreFilesMetaData(shardId, Store.MetadataSnapshot.EMPTY);
}

here so as not to mess with existing behavior.
if (shardPath == null) {
    throw new IllegalStateException(shardId + " no shard path found");
}
Store.tryOpenIndex(shardPath.resolveIndex(), shardId, nodeEnv::shardLock, logger);
Instead of acquiring the shard lock for a second time, I would prefer if we would do it once, and move this call under that lock and just rename tryOpenIndex to tryOpenIndexUnderLock, removing the locking mechanism from it.
Same thing for TransportNodesListShardStoreMetaData. You can then also remove the ShardLocker interface, which irked me for a while.
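A sketch of what that refactoring might look like (tryOpenIndexUnderLock is the proposed rename; its exact signature is an assumption here):

try (ShardLock ignored = nodeEnv.shardLock(shardId, TimeUnit.SECONDS.toMillis(5))) {
    shardPath = ShardPath.loadShardPath(logger, nodeEnv, shardId, indexSettings);
    if (shardPath == null) {
        throw new IllegalStateException(shardId + " no shard path found");
    }
    // The caller already holds the ShardLock, so this variant needs no
    // ShardLocker to acquire it itself.
    Store.tryOpenIndexUnderLock(shardPath.resolveIndex(), shardId, logger);
}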
final ShardStateMetaData shardStateMetaData;
try (ShardLock ignored = nodeEnv.shardLock(shardId, TimeUnit.SECONDS.toMillis(5))) {
Hmm, I just spotted this - there are still two calls to nodeEnv.shardLock here. TBH I don't know what we should be doing on failure of this one.
This PR represents an actual issue, and all the other issues that point to it were closed in its favour, but the consequences of …. I would like to explore the idea of loading the metadata of every on-disk index much earlier in the lifecycle of a node, avoiding these concurrency issues (of course introducing different ones in their place, but perhaps the new ones will be less tricky).
I think it makes sense to explore alternative ways of coordinating the loading of shard state metadata. We have fixed the current test failures by weakening the assertions on the existence of a shard folder after clean-up. As there is no immediate plan to work on this, I'm closing this one out. |