[Inference API] Fix flaky AuthorizationTaskExecutorIT tests by jonathan-buttner · Pull Request #139978 · elastic/elasticsearch

jonathan-buttner · 2025-12-23T21:05:17Z

This PR tries to address an org.elasticsearch.action.search.SearchPhaseExecutionException: all shards failed exception.

I think the "all shards failed" message indicates that the indices do not exist yet.

With modelRegistry.getAllModels(true, listener) now passing true, the ModelRegistry should persist the default endpoints and therefore create the inference indices.

Failure issue: #138012

Stack trace

    <failure message="Failed to execute phase [query], all shards failed; shardFailures {[_na_][.inference][0]: org.elasticsearch.action.NoShardAvailableActionException&#10;}" type="org.elasticsearch.action.search.SearchPhaseExecutionException">Failed to execute phase [query], all shards failed; shardFailures {[_na_][.inference][0]: org.elasticsearch.action.NoShardAvailableActionException
}
	at app//org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:723)
	at app//org.elasticsearch.action.search.SearchQueryThenFetchAsyncAction.onPhaseFailure(SearchQueryThenFetchAsyncAction.java:83)
	at app//org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:347)
	at app//org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseDone(AbstractSearchAsyncAction.java:769)
	at app//org.elasticsearch.action.search.AbstractSearchAsyncAction.finishOneShard(AbstractSearchAsyncAction.java:452)
	at app//org.elasticsearch.action.search.AbstractSearchAsyncAction.onShardFailure(AbstractSearchAsyncAction.java:444)
	at app//org.elasticsearch.action.search.AbstractSearchAsyncAction.failOnUnavailable(AbstractSearchAsyncAction.java:315)
	at app//org.elasticsearch.action.search.SearchQueryThenFetchAsyncAction.doRun(SearchQueryThenFetchAsyncAction.java:482)
	at app//org.elasticsearch.action.search.AbstractSearchAsyncAction.run(AbstractSearchAsyncAction.java:253)
	at app//org.elasticsearch.action.search.AbstractSearchAsyncAction.executePhase(AbstractSearchAsyncAction.java:398)
	at app//org.elasticsearch.action.search.AbstractSearchAsyncAction.start(AbstractSearchAsyncAction.java:239)
	at app//org.elasticsearch.action.search.TransportSearchAction$AsyncSearchActionProvider.runNewSearchPhase(TransportSearchAction.java:2051)
	at app//org.elasticsearch.action.search.TransportSearchAction.executeSearch(TransportSearchAction.java:1786)
	at app//org.elasticsearch.action.search.TransportSearchAction.executeLocalSearch(TransportSearchAction.java:1505)
	at app//org.elasticsearch.action.search.TransportSearchAction.lambda$executeRequest$9(TransportSearchAction.java:505)
	at app//org.elasticsearch.action.ActionListenerImplementations$ResponseWrappingActionListener.onResponse(ActionListenerImplementations.java:261)
	at app//org.elasticsearch.index.query.Rewriteable.rewriteAndFetch(Rewriteable.java:109)
	at app//org.elasticsearch.index.query.Rewriteable.rewriteAndFetch(Rewriteable.java:77)
	at app//org.elasticsearch.action.search.TransportSearchAction.executeRequest(TransportSearchAction.java:661)
	at app//org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:349)
	at app//org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:137)
	at app//org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:135)
	at app//org.elasticsearch.action.support.MappedActionFilters$MappedFilterChain.proceed(MappedActionFilters.java:71)
	at app//org.elasticsearch.action.support.MappedActionFilters.apply(MappedActionFilters.java:49)
	at app//org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:132)
	at app//org.elasticsearch.action.support.TransportAction.handleExecution(TransportAction.java:96)
	at app//org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:59)
	at app//org.elasticsearch.tasks.TaskManager.registerAndExecute(TaskManager.java:216)
	at app//org.elasticsearch.client.internal.node.NodeClient.executeLocally(NodeClient.java:107)
	at app//org.elasticsearch.client.internal.node.NodeClient.doExecute(NodeClient.java:85)
	at app//org.elasticsearch.client.internal.support.AbstractClient.execute(AbstractClient.java:160)
	at app//org.elasticsearch.client.internal.FilterClient.doExecute(FilterClient.java:57)
	at app//org.elasticsearch.client.internal.OriginSettingClient.doExecute(OriginSettingClient.java:44)
	at app//org.elasticsearch.client.internal.support.AbstractClient.execute(AbstractClient.java:160)
	at app//org.elasticsearch.client.internal.support.AbstractClient.search(AbstractClient.java:295)
	at app//org.elasticsearch.xpack.inference.registry.ModelRegistry.getAllModels(ModelRegistry.java:421)
	at app//org.elasticsearch.xpack.inference.integration.AuthorizationTaskExecutorIT.getEisEndpoints(AuthorizationTaskExecutorIT.java:198)
	at app//org.elasticsearch.xpack.inference.integration.AuthorizationTaskExecutorIT.lambda$assertChatCompletionEndpointExists$9(AuthorizationTaskExecutorIT.java:270)
	at app//org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:1610)
	at app//org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:1594)
	at app//org.elasticsearch.xpack.inference.integration.AuthorizationTaskExecutorIT.assertChatCompletionEndpointExists(AuthorizationTaskExecutorIT.java:269)
	at app//org.elasticsearch.xpack.inference.integration.AuthorizationTaskExecutorIT.assertChatCompletionEndpointExists(AuthorizationTaskExecutorIT.java:265)
	at app//org.elasticsearch.xpack.inference.integration.AuthorizationTaskExecutorIT.testCreatesChatCompletion_AndThenCreatesTextEmbedding(AuthorizationTaskExecutorIT.java:293)


elasticsearchmachine · 2026-01-05T19:40:27Z

Pinging @elastic/search-inference-team (Team:Search - Inference)

DonalEvans · 2026-01-05T20:00:52Z

The test failure happened when calling assertChatCompletionEndpointExists(), which is called after calling assertNoAuthorizedEisEndpoints() earlier in the test. Both those methods call getEisEndpoints() so I'm confused why the first call would work but the second wouldn't if the issue is the indices not existing.

jonathan-buttner · 2026-01-05T21:04:20Z

Hmm

Looking at the logs closer...

[2025-12-13T00:02:23,901][INFO ][o.e.x.i.i.AuthorizationTaskExecutorIT][testCreatesChatCompletion_AndThenCreatesTextEmbedding] before test
[2025-12-13T00:02:23,924][INFO ][o.e.x.i.s.e.a.AuthorizationTaskExecutor][node_s_0][inference_utility][T#1] Started authorization poller task with id 42 
[2025-12-13T00:02:23,929][INFO ][o.e.x.i.s.e.a.AuthorizationTaskExecutor][node_s_0][masterService#updateTask][T#1] Finished creating authorization poller task, id eis-authorization-poller
[2025-12-13T00:02:23,966][ERROR][o.e.t.h.MockWebServer    ][[HTTP-Dispatcher]] failed to respond to request [GET /api/v2/authorizations]
java.lang.NullPointerException: Cannot invoke "org.elasticsearch.test.http.MockResponse.getBeforeReplyDelay()" because "response" is null

That part looks ok so far. The "response" is null is confusing but expected because initially we haven't queued a response for the mock webserver. So when the auth task makes a request, the webserver throws an error.

We get three of those because we retry on that exception.

Then we get:

The auth task is cancelled and restarted to force a new auth request to occur (after we've enqueued a response this time)
[2025-12-13T00:02:24,369][INFO ][o.e.x.i.s.e.a.AuthorizationTaskExecutor][node_s_0][inference_utility][T#1] Started authorization poller task with id 73


[2025-12-13T00:02:24,372][INFO ][o.e.x.i.s.e.a.AuthorizationTaskExecutor][node_s_0][masterService#updateTask][T#1] Finished creating authorization poller task, id eis-authorization-poller
[2025-12-13T00:02:24,385][INFO ][o.e.x.i.s.e.a.AuthorizationPoller][node_s_0][inference_response][T#1] Storing new EIS preconfigured inference endpoints with inference IDs [.rainbow-sprinkles-elastic]
[2025-12-13T00:02:24,416][INFO ][o.e.c.m.MetadataCreateIndexService][node_s_0][masterService#updateTask][T#1] creating index [.secrets-inference] in project [default], cause [auto(bulk api)], templates [], shards [1]/[1]
[2025-12-13T00:02:24,474][INFO ][o.e.c.m.MetadataCreateIndexService][node_s_0][masterService#updateTask][T#1] creating index [.inference] in project [default], cause [auto(bulk api)], templates [], shards [1]/[1]
[2025-12-13T00:02:24,498][INFO ][o.e.c.r.a.AllocationService][node_s_0][masterService#updateTask][T#1] in project [default] updating number_of_replicas to [0] for indices [.secrets-inference, .inference]


Test begins shutting down 🤔. This is odd. My guess is that this occurs because of the all shards failed exception, but it's weird that we don't see that here.
[2025-12-13T00:02:24,714][INFO ][o.e.x.i.i.AuthorizationTaskExecutorIT][testCreatesChatCompletion_AndThenCreatesTextEmbedding] after test



[2025-12-13T00:02:24,767][INFO ][o.e.c.r.a.AllocationService][node_s_0][masterService#updateTask][T#1] current.health="GREEN" message="Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.secrets-inference][0], [.inference][0]]])." previous.health="YELLOW" reason="shards started [[.secrets-inference][0], [.inference][0]]"
[2025-12-13T00:02:24,849][INFO ][o.e.x.i.i.AuthorizationTaskExecutorIT][testCreatesChatCompletion_AndThenCreatesTextEmbedding] --> waiting for all free_context tasks to complete within a reasonable time
[2025-12-13T00:02:24,864][INFO ][o.e.c.m.MetadataDeleteIndexService][node_s_0][masterService#updateTask][T#1] [.inference/05q3RZujRu-ur3ptCi6Bzg] deleting index
[2025-12-13T00:02:24,865][INFO ][o.e.c.m.MetadataDeleteIndexService][node_s_0][masterService#updateTask][T#1] [.secrets-inference/yQVC-fX2T_GsaeC_aDYreA] deleting index
[2025-12-13T00:02:24,912][INFO ][o.e.n.Node               ][testCreatesChatCompletion_AndThenCreatesTextEmbedding] stopping ...
[2025-12-13T00:02:24,923][WARN ][o.e.x.i.r.ModelRegistry  ][node_s_0][system_write][T#1] Failed to store document id: [model_.rainbow-sprinkles-elastic] inference id: [.rainbow-sprinkles-elastic] index: [.inference] bulk failure message [[.inference/05q3RZujRu-ur3ptCi6Bzg] org.elasticsearch.index.IndexNotFoundException: no such index [.inference]]
[2025-12-13T00:02:24,924][WARN ][o.e.x.i.r.ModelRegistry  ][node_s_0][system_write][T#1] Failed to store document id: [model_.rainbow-sprinkles-elastic] inference id: [.rainbow-sprinkles-elastic] index: [.secrets-inference] bulk failure message [[.secrets-inference/yQVC-fX2T_GsaeC_aDYreA] org.elasticsearch.index.IndexNotFoundException: no such index [.secrets-inference]]
[2025-12-13T00:02:24,924][WARN ][o.e.x.i.r.ModelRegistry  ][node_s_0][system_write][T#1] Failed to add minimal service settings to cluster state for inference endpoints []
org.elasticsearch.cluster.NotMasterException: node closed
Caused by: org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: master service is in state [STOPPED]
[2025-12-13T00:02:24,927][WARN ][o.e.x.i.s.e.a.AuthorizationPoller][node_s_0][system_write][T#1] Failed to store new EIS preconfigured inference endpoints [[org.elasticsearch.xpack.inference.services.elastic.completion.ElasticInferenceServiceCompletionModel@d6cb32ed]]
org.elasticsearch.ElasticsearchStatusException: Failed to add the inference endpoints []. The service may be in an inconsistent state. Please try deleting and re-adding the endpoints.

It's odd that we don't see the all shards failed exception in those logs; it's further up in the output.

I wonder if the issue is that we're performing a search while we're trying to create the indices 🤔. That's what differs from the first call to getEisEndpoints. The first call performs a search, but the indices don't exist yet and the search won't create them. The second call to getEisEndpoints isn't creating them either; the creation is done on a separate thread by the persistent task, so the search and the index creation happen concurrently.

I think it's still worth a shot trying to ensure that the indices are created ahead of time to see if that helps here. In this PR, the first call to getEisEndpoints will perform a search and when it gets the response back it'll persist the default endpoints (not the EIS specific ones).

So maybe that'll help 🤷‍♂️.
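A minimal sketch of the suspected race, assuming a search against a not-yet-created index simply returns nothing (as the first getEisEndpoints call appears to), while a search against an index whose shard has not started fails. In the logs above, the indices are created at 00:02:24,474, the test ends at 00:02:24,714, and the shards only start at 00:02:24,767. All names below are hypothetical and are not Elasticsearch APIs.

```java
import java.util.List;

// Hypothetical, simplified model of the index lifecycle visible in the logs.
// None of these names are Elasticsearch APIs.
final class InferenceIndexRaceSketch {

    enum IndexState { MISSING, CREATED_SHARD_NOT_STARTED, SHARD_STARTED }

    // Stand-in for the NoShardAvailableActionException seen in the stack trace.
    static final class NoShardAvailableSketchException extends RuntimeException {
        NoShardAvailableSketchException(String message) {
            super(message);
        }
    }

    // Models what a search would observe in each state (an assumption, not verified behavior).
    static List<String> search(IndexState state, List<String> storedEndpoints) {
        switch (state) {
            case MISSING:
                // First call: nothing to search, so no endpoints and no failure.
                return List.of();
            case CREATED_SHARD_NOT_STARTED:
                // Second call: the index exists but its shard is not ready yet.
                throw new NoShardAvailableSketchException("all shards failed");
            default:
                // Shard started: the search sees whatever was stored.
                return storedEndpoints;
        }
    }
}
```

Under this model, creating the indices before the persistent task starts storing endpoints narrows the window in which a search can land in the middle state.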

The other thing this reveals is that we could handle updating the cluster state a little better.

Failed to add minimal service settings to cluster state for inference endpoints []

Specifically this means that we're passing an empty set of ids to the cluster state update. I think this is happening because when the test completes we delete all the indices, so there's probably a race condition when a failure occurs. We can probably skip the cluster state update when no endpoints were created successfully.
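The guard suggested above could look roughly like the following. This is only a sketch: the class, method, and parameter names are hypothetical, and the real ModelRegistry code path is not shown in this thread.

```java
import java.util.Set;
import java.util.function.Consumer;

// Hypothetical sketch of skipping the cluster state update when nothing was stored.
final class ClusterStateUpdateGuardSketch {

    // Returns true if the updater was invoked, false if the update was skipped.
    static boolean updateIfAnyStored(Set<String> storedInferenceIds, Consumer<Set<String>> clusterStateUpdater) {
        if (storedInferenceIds.isEmpty()) {
            // No endpoint was stored successfully, so there is nothing to publish.
            return false;
        }
        clusterStateUpdater.accept(storedInferenceIds);
        return true;
    }
}
```

With a guard like this, the "Failed to add minimal service settings to cluster state for inference endpoints []" warning would not be logged for an empty set, because the update would never be submitted.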

DonalEvans · 2026-01-05T19:51:07Z

muted-tests.yml

+- class: org.elasticsearch.xpack.esql.ccq.MultiClusterSpecIT
+  method: test {csv-spec:spatial.ConvertFromStringParseError}
+  issue: https://github.com/elastic/elasticsearch/issues/139213


Is this test mute being added intentionally?

Oops nope, I'll remove it. Merge conflict issue.

DonalEvans · 2026-01-05T22:53:30Z

The "response" is null is confusing but expected because initially we haven't queued a response for the mock webserver.

The test queues a response in the web server in the @BeforeClass initClass() method, which causes the first test to run to not encounter the NPEs or ConnectionClosedException seen in the logs. Maybe that should be moved to the @Before createComponents() method so it's done for each test case and we can avoid creating unnecessary noise in the test?

I think it's still worth a shot trying to ensure that the indices are created ahead of time to see if that helps here. In this PR, the first call to getEisEndpoints will perform a search and when it gets the response back it'll persist the default endpoints (not the EIS specific ones).

Agreed, making sure the indices get created seems like a sensible choice here. I do wonder if this race between inference index creation and running a query on the index is something a customer could hit and what action they should take if they do, though.

jonathan-buttner · 2026-01-06T14:20:10Z

The test queues a response in the web server in the @BeforeClass initClass() method, which causes the first test to run to not encounter the NPEs or ConnectionClosedException seen in the logs. Maybe that should be moved to the @Before createComponents() method so it's done for each test case and we can avoid creating unnecessary noise in the test?

Yeah good point, I don't know why I put that in the initClass() method 😆
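The difference between queuing once per class and once per test can be sketched with a plain queue. This assumes each test's first request consumes exactly one queued response and that an empty queue yields null (which is what surfaces as the "response" is null NPE in the logs). The class below is a hypothetical stand-in, not the real MockWebServer.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical stand-in for a mock web server's response queue.
final class ResponseQueueSketch {

    private final Deque<String> responses = new ArrayDeque<>();

    void enqueue(String body) {
        responses.add(body);
    }

    // Returns null when nothing is queued, mirroring the null "response" in the log.
    String nextResponse() {
        return responses.poll();
    }

    // Counts how many tests would see a missing response on their first request.
    static int countMissingResponses(boolean enqueuePerTest, int tests) {
        ResponseQueueSketch server = new ResponseQueueSketch();
        if (!enqueuePerTest) {
            server.enqueue("{}"); // once per class, like @BeforeClass
        }
        int missing = 0;
        for (int i = 0; i < tests; i++) {
            if (enqueuePerTest) {
                server.enqueue("{}"); // once per test, like @Before
            }
            if (server.nextResponse() == null) {
                missing++;
            }
        }
        return missing;
    }
}
```

For three tests, countMissingResponses(false, 3) reports two tests without a response, while countMissingResponses(true, 3) reports none.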


elasticsearchmachine · 2026-01-06T19:47:47Z

💔 Backport failed

Status	Branch	Result
❌	9.3	Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 139978


jonathan-buttner · 2026-01-07T18:52:00Z

💚 All backports created successfully

Status	Branch	Result
✅	9.3

Questions?

Please refer to the Backport tool documentation


Fix flaky with no shards available exception

847f31c

jonathan-buttner added the >test, auto-backport, :SearchOrg/Inference, Team:Search - Inference, v9.3.0, and v9.3.1 labels Dec 23, 2025

elasticsearchmachine added the v9.4.0 label Dec 23, 2025

jonathan-buttner and others added 2 commits January 5, 2026 11:49

Merge branch 'main' of github.com:elastic/elasticsearch into ia-fix-flaky-auth-test-no-shards

e78fa9e

Merge branch 'main' into ia-fix-flaky-auth-test-no-shards

3de1f40

jonathan-buttner marked this pull request as ready for review January 5, 2026 19:40

jonathan-buttner requested a review from DonalEvans January 5, 2026 19:40

Merge branch 'main' into ia-fix-flaky-auth-test-no-shards

52c13f8

DonalEvans approved these changes Jan 5, 2026


jonathan-buttner added 2 commits January 6, 2026 11:45

Fixing merge and adding empty response before tests

0882db3

Merge branch 'main' of github.com:elastic/elasticsearch into ia-fix-flaky-auth-test-no-shards

1f80c34

jonathan-buttner merged commit 92c8d08 into elastic:main Jan 6, 2026
35 checks passed

elasticsearchmachine added the backport pending label Jan 6, 2026

jonathan-buttner mentioned this pull request Jan 7, 2026

[9.3] [Inference API] Fix flaky AuthorizationTaskExecutorIT tests (#139978) #140306

Merged

sidosera pushed a commit to sidosera/elasticsearch that referenced this pull request Jan 7, 2026

[Inference API] Fix flaky AuthorizationTaskExecutorIT tests (elastic#139978)

e14d56d

* Fix flaky with no shards available exception
* Fixing merge and adding empty response before tests

jonathan-buttner deleted the ia-fix-flaky-auth-test-no-shards branch January 14, 2026 20:37

jonathan-buttner merged 6 commits into elastic:main from jonathan-buttner:ia-fix-flaky-auth-test-no-shards
