Skip to content

Make CacheService.get() throws AlreadyClosedException when service is stopped#122006

Merged
tlrx merged 3 commits intoelastic:mainfrom
tlrx:2025/02/07/fix-121927
Feb 10, 2025
Merged

Make CacheService.get() throws AlreadyClosedException when service is stopped#122006
tlrx merged 3 commits intoelastic:mainfrom
tlrx:2025/02/07/fix-121927

Conversation

@tlrx
Copy link
Copy Markdown
Member

@tlrx tlrx commented Feb 7, 2025

This is caught thanks to #121210: if shard files are verified/checksumed
while the node is stopping, an IllegalStateException is throw by
CacheService.get() when it attempts to read data from the cache. This
exception later caused the verification to fail and then the Lucene
index to be marked as corrupted (which nows fails for searchable
snapshots shards that are read-only and should not be corrupted at
all).

This pull request changes ensureLifecycleStarted(), which is called
during CacheService.get(), to throw an AlreadyClosedException when
the service is stopped (note that ACE extends IllegalStateException,
which is convenient here). This ACE will be later specially handlded
in the checksumIndex method to not mark the shard as corrupted (see
#121210).

Example of stack trace failure:

1> [2025-02-06T21:09:16,741][WARN ][o.e.i.s.IndexShard       ] [node_s1] [ctytqrbncy][0] check index [failure]: checksum failed on [_1_2_Lucene90_0.dvd]
  1> java.io.IOException: failed to read data from cache for [CacheFileReference{cacheKey='CacheKey[snapshotUUID=N7dEq0_hRhuSftJZCNqSnw, snapshotIndexName=ctytqrbncy, shardId=[ctytqrbncy][0], fileName=_1_2_Lucene90_0.dvd]', fileLength=235, acquired=false}]
  1> 	at org.elasticsearch.xpack.searchablesnapshots.store.input.MetadataCachingIndexInput.readDirectlyIfAlreadyClosed(MetadataCachingIndexInput.java:512) ~[main/:?]
  1> 	at org.elasticsearch.xpack.searchablesnapshots.store.input.MetadataCachingIndexInput.readInternal(MetadataCachingIndexInput.java:618) ~[main/:?]
  1> 	at org.elasticsearch.blobcache.common.BlobCacheBufferedIndexInput.refill(BlobCacheBufferedIndexInput.java:304) ~[blob-cache-9.1.0-SNAPSHOT.jar:9.1.0-SNAPSHOT]
  1> 	at org.elasticsearch.blobcache.common.BlobCacheBufferedIndexInput.readBytes(BlobCacheBufferedIndexInput.java:114) ~[blob-cache-9.1.0-SNAPSHOT.jar:9.1.0-SNAPSHOT]
  1> 	at org.elasticsearch.blobcache.common.BlobCacheBufferedIndexInput.readBytes(BlobCacheBufferedIndexInput.java:92) ~[blob-cache-9.1.0-SNAPSHOT.jar:9.1.0-SNAPSHOT]
  1> 	at org.apache.lucene.store.BufferedChecksumIndexInput.readBytes(BufferedChecksumIndexInput.java:45) ~[lucene-core-10.1.0.jar:10.1.0 884954006de769dc43b811267230d625886e6515 - 2024-12-17 16:15:44]
  1> 	at org.apache.lucene.store.DataInput.readBytes(DataInput.java:73) ~[lucene-core-10.1.0.jar:10.1.0 884954006de769dc43b811267230d625886e6515 - 2024-12-17 16:15:44]
  1> 	at org.apache.lucene.store.ChecksumIndexInput.skipByReading(ChecksumIndexInput.java:79) ~[lucene-core-10.1.0.jar:10.1.0 884954006de769dc43b811267230d625886e6515 - 2024-12-17 16:15:44]
  1> 	at org.apache.lucene.store.ChecksumIndexInput.seek(ChecksumIndexInput.java:64) ~[lucene-core-10.1.0.jar:10.1.0 884954006de769dc43b811267230d625886e6515 - 2024-12-17 16:15:44]
  1> 	at org.apache.lucene.codecs.CodecUtil.checksumEntireFile(CodecUtil.java:619) ~[lucene-core-10.1.0.jar:10.1.0 884954006de769dc43b811267230d625886e6515 - 2024-12-17 16:15:44]
  1> 	at org.elasticsearch.index.store.Store.checkIntegrity(Store.java:584) ~[elasticsearch-9.1.0-SNAPSHOT.jar:9.1.0-SNAPSHOT]
  1> 	at org.elasticsearch.index.shard.IndexShard.doCheckIndex(IndexShard.java:3244) ~[elasticsearch-9.1.0-SNAPSHOT.jar:9.1.0-SNAPSHOT]
  1> 	at org.elasticsearch.index.shard.IndexShard.checkIndex(IndexShard.java:3210) ~[elasticsearch-9.1.0-SNAPSHOT.jar:9.1.0-SNAPSHOT]
  1> 	at org.elasticsearch.index.shard.IndexShard.maybeCheckIndex(IndexShard.java:3200) ~[elasticsearch-9.1.0-SNAPSHOT.jar:9.1.0-SNAPSHOT]
  1> 	at org.elasticsearch.index.shard.IndexShard.openEngineAndRecoverFromTranslog(IndexShard.java:2125) ~[elasticsearch-9.1.0-SNAPSHOT.jar:9.1.0-SNAPSHOT]
  1> 	at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromRepository$17(StoreRecovery.java:595) ~[elasticsearch-9.1.0-SNAPSHOT.jar:9.1.0-SNAPSHOT]
  1> 	at org.elasticsearch.action.ActionListenerImplementations$ResponseDroppingActionListener.onResponse(ActionListenerImplementations.java:274) ~[elasticsearch-9.1.0-SNAPSHOT.jar:9.1.0-SNAPSHOT]
  1> 	at org.elasticsearch.action.support.SubscribableListener$SuccessResult.complete(SubscribableListener.java:387) ~[elasticsearch-9.1.0-SNAPSHOT.jar:9.1.0-SNAPSHOT]
  1> 	at org.elasticsearch.action.support.SubscribableListener.tryComplete(SubscribableListener.java:307) ~[elasticsearch-9.1.0-SNAPSHOT.jar:9.1.0-SNAPSHOT]
  1> 	at org.elasticsearch.action.support.SubscribableListener.setResult(SubscribableListener.java:336) ~[elasticsearch-9.1.0-SNAPSHOT.jar:9.1.0-SNAPSHOT]
  1> 	at org.elasticsearch.action.support.SubscribableListener.onResponse(SubscribableListener.java:250) ~[elasticsearch-9.1.0-SNAPSHOT.jar:9.1.0-SNAPSHOT]
  1> 	at org.elasticsearch.action.ActionListenerImplementations$DelegatingResponseActionListener.onResponse(ActionListenerImplementations.java:184) ~[elasticsearch-9.1.0-SNAPSHOT.jar:9.1.0-SNAPSHOT]
  1> 	at org.elasticsearch.action.ActionListenerImplementations$RunBeforeActionListener.onResponse(ActionListenerImplementations.java:336) ~[elasticsearch-9.1.0-SNAPSHOT.jar:9.1.0-SNAPSHOT]
  1> 	at org.elasticsearch.action.ActionListener$3.onResponse(ActionListener.java:400) ~[elasticsearch-9.1.0-SNAPSHOT.jar:9.1.0-SNAPSHOT]
  1> 	at org.elasticsearch.repositories.blobstore.FileRestoreContext.lambda$restore$1(FileRestoreContext.java:167) ~[elasticsearch-9.1.0-SNAPSHOT.jar:9.1.0-SNAPSHOT]
  1> 	at org.elasticsearch.action.ActionListenerImplementations$ResponseWrappingActionListener.onResponse(ActionListenerImplementations.java:247) ~[elasticsearch-9.1.0-SNAPSHOT.jar:9.1.0-SNAPSHOT]
  1> 	at org.elasticsearch.repositories.blobstore.BlobStoreRepository$11.restoreFiles(BlobStoreRepository.java:3602) ~[elasticsearch-9.1.0-SNAPSHOT.jar:9.1.0-SNAPSHOT]
  1> 	at org.elasticsearch.repositories.blobstore.FileRestoreContext.restore(FileRestoreContext.java:163) ~[elasticsearch-9.1.0-SNAPSHOT.jar:9.1.0-SNAPSHOT]
  1> 	at org.elasticsearch.repositories.blobstore.BlobStoreRepository.lambda$restoreShard$58(BlobStoreRepository.java:3707) ~[elasticsearch-9.1.0-SNAPSHOT.jar:9.1.0-SNAPSHOT]
  1> 	at org.elasticsearch.action.ActionRunnable$4.doRun(ActionRunnable.java:101) ~[elasticsearch-9.1.0-SNAPSHOT.jar:9.1.0-SNAPSHOT]
  1> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:1044) ~[elasticsearch-9.1.0-SNAPSHOT.jar:9.1.0-SNAPSHOT]
  1> 	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:27) ~[elasticsearch-9.1.0-SNAPSHOT.jar:9.1.0-SNAPSHOT]
  1> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
  1> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
  1> 	at java.lang.Thread.run(Thread.java:1575) ~[?:?]
  1> Caused by: java.lang.IllegalStateException: Failed to read data from cache: cache service is not started [CLOSED]
  1> 	at org.elasticsearch.xpack.searchablesnapshots.cache.full.CacheService.ensureLifecycleStarted(CacheService.java:248) ~[main/:?]
  1> 	at org.elasticsearch.xpack.searchablesnapshots.cache.full.CacheService.get(CacheService.java:277) ~[main/:?]
  1> 	at org.elasticsearch.xpack.searchablesnapshots.store.SearchableSnapshotDirectory.getCacheFile(SearchableSnapshotDirectory.java:360) ~[main/:?]
  1> 	at org.elasticsearch.xpack.searchablesnapshots.store.input.MetadataCachingIndexInput$CacheFileReference.get(MetadataCachingIndexInput.java:741) ~[main/:?]
  1> 	at org.elasticsearch.xpack.searchablesnapshots.store.input.MetadataCachingIndexInput.readWithBlobCache(MetadataCachingIndexInput.java:263) ~[main/:?]
  1> 	at org.elasticsearch.xpack.searchablesnapshots.store.input.MetadataCachingIndexInput.readInternal(MetadataCachingIndexInput.java:613) ~[main/:?]
  1> 	... 33 more

Closes #121927

… stopped

This is caught thanks to elastic#121210: if shard files are verified/checksumed
while the node is stopping, an IllegalStateException is throw by
 CacheService.get() when it attempts to read data from the cache. This
 exception later caused the verification to fail and then the Lucene
 index to be marked as corrupted (which nows fails for searchable
 snapshots shards that are read-only and should not be corrupted at
 all).

 This pull request changes ensureLifecycleStarted(), which is called
 during CacheService.get(), to throw an AlreadyClosedException when
 the service is stopped (note that ACE extends IllegalStateException,
 which is convenient here). This ACE will be later specially handlded
 in the checksumIndex method to not mark the shard as corrupted (see
  elastic#121210).

Closes elastic#121927
@tlrx tlrx added >non-issue :Distributed/Store Issues around managing unopened Lucene indices. If it touches Store.java, this is a likely label. v9.1.0 labels Feb 7, 2025
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Indexing (obsolete) Meta label for Distributed Indexing team. Obsolete. Please do not use. label Feb 7, 2025
@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-distributed-indexing (Team:Distributed Indexing)

Copy link
Copy Markdown
Contributor

@fcofdez fcofdez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Good catch

@tlrx tlrx merged commit 26f6acf into elastic:main Feb 10, 2025
@tlrx tlrx deleted the 2025/02/07/fix-121927 branch February 10, 2025 13:15
@tlrx
Copy link
Copy Markdown
Member Author

tlrx commented Feb 10, 2025

Thanks Francisco

arteam added a commit to arteam/elasticsearch that referenced this pull request Feb 18, 2025
…alSearchableSnapshot`

The underlying failure `java.lang.AssertionError: Searchable snapshot directory does not support the operation [createOutput`
was fixed in elastic#122006.

The automation bot was too aggressive in re-opening this issue.

Resolve elastic#122693
arteam added a commit that referenced this pull request Feb 18, 2025
…alSearchableSnapshot` (#122831)

* Unmute `FrozenSearchableSnapshotsIntegTests#testCreateAndRestorePartialSearchableSnapshot`

The underlying failure `java.lang.AssertionError: Searchable snapshot directory does not support the operation [createOutput`
was fixed in #122006.

The automation bot was too aggressive in re-opening this issue.

Resolve #122693

* Add a check for the CLOSED state along with STOPPED

* Update x-pack/plugin/searchable-snapshots/src/main/java/org/elasticsearch/xpack/searchablesnapshots/cache/full/CacheService.java

Co-authored-by: Tanguy Leroux <tlrx.dev@gmail.com>

---------

Co-authored-by: Tanguy Leroux <tlrx.dev@gmail.com>
pxsalehi added a commit that referenced this pull request Mar 31, 2026
…e to shutdown (#145209)

This is kind of similar to previous fixes related to this test (e.g. #122006) where restarting nodes cause exceptions that leads to the upper layers during recovery thinking the data is corrupted as we are not able to read from/via the cache.
ncordon pushed a commit to ncordon/elasticsearch that referenced this pull request Apr 1, 2026
…e to shutdown (elastic#145209)

This is kind of similar to previous fixes related to this test (e.g. elastic#122006) where restarting nodes cause exceptions that leads to the upper layers during recovery thinking the data is corrupted as we are not able to read from/via the cache.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Distributed/Store Issues around managing unopened Lucene indices. If it touches Store.java, this is a likely label. >non-issue Team:Distributed Indexing (obsolete) Meta label for Distributed Indexing team. Obsolete. Please do not use. v9.1.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[CI] FrozenSearchableSnapshotsIntegTests testCreateAndRestorePartialSearchableSnapshot failing

3 participants