[Inference API] Fix flaky AuthorizationTaskExecutorIT tests#139978

Merged
jonathan-buttner merged 6 commits into elastic:main from jonathan-buttner:ia-fix-flaky-auth-test-no-shards
Jan 6, 2026

@jonathan-buttner
Contributor

This PR tries to address an org.elasticsearch.action.search.SearchPhaseExecutionException: all shards failed exception.

I think the all shards failed error indicates that the indices do not exist yet.

By passing true to modelRegistry.getAllModels(true, listener), the ModelRegistry should persist the default endpoints and therefore create the inference indices.

Failure issue: #138012

Stack trace
    <failure message="Failed to execute phase [query], all shards failed; shardFailures {[_na_][.inference][0]: org.elasticsearch.action.NoShardAvailableActionException&#10;}" type="org.elasticsearch.action.search.SearchPhaseExecutionException">Failed to execute phase [query], all shards failed; shardFailures {[_na_][.inference][0]: org.elasticsearch.action.NoShardAvailableActionException
}
	at app//org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:723)
	at app//org.elasticsearch.action.search.SearchQueryThenFetchAsyncAction.onPhaseFailure(SearchQueryThenFetchAsyncAction.java:83)
	at app//org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:347)
	at app//org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseDone(AbstractSearchAsyncAction.java:769)
	at app//org.elasticsearch.action.search.AbstractSearchAsyncAction.finishOneShard(AbstractSearchAsyncAction.java:452)
	at app//org.elasticsearch.action.search.AbstractSearchAsyncAction.onShardFailure(AbstractSearchAsyncAction.java:444)
	at app//org.elasticsearch.action.search.AbstractSearchAsyncAction.failOnUnavailable(AbstractSearchAsyncAction.java:315)
	at app//org.elasticsearch.action.search.SearchQueryThenFetchAsyncAction.doRun(SearchQueryThenFetchAsyncAction.java:482)
	at app//org.elasticsearch.action.search.AbstractSearchAsyncAction.run(AbstractSearchAsyncAction.java:253)
	at app//org.elasticsearch.action.search.AbstractSearchAsyncAction.executePhase(AbstractSearchAsyncAction.java:398)
	at app//org.elasticsearch.action.search.AbstractSearchAsyncAction.start(AbstractSearchAsyncAction.java:239)
	at app//org.elasticsearch.action.search.TransportSearchAction$AsyncSearchActionProvider.runNewSearchPhase(TransportSearchAction.java:2051)
	at app//org.elasticsearch.action.search.TransportSearchAction.executeSearch(TransportSearchAction.java:1786)
	at app//org.elasticsearch.action.search.TransportSearchAction.executeLocalSearch(TransportSearchAction.java:1505)
	at app//org.elasticsearch.action.search.TransportSearchAction.lambda$executeRequest$9(TransportSearchAction.java:505)
	at app//org.elasticsearch.action.ActionListenerImplementations$ResponseWrappingActionListener.onResponse(ActionListenerImplementations.java:261)
	at app//org.elasticsearch.index.query.Rewriteable.rewriteAndFetch(Rewriteable.java:109)
	at app//org.elasticsearch.index.query.Rewriteable.rewriteAndFetch(Rewriteable.java:77)
	at app//org.elasticsearch.action.search.TransportSearchAction.executeRequest(TransportSearchAction.java:661)
	at app//org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:349)
	at app//org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:137)
	at app//org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:135)
	at app//org.elasticsearch.action.support.MappedActionFilters$MappedFilterChain.proceed(MappedActionFilters.java:71)
	at app//org.elasticsearch.action.support.MappedActionFilters.apply(MappedActionFilters.java:49)
	at app//org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:132)
	at app//org.elasticsearch.action.support.TransportAction.handleExecution(TransportAction.java:96)
	at app//org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:59)
	at app//org.elasticsearch.tasks.TaskManager.registerAndExecute(TaskManager.java:216)
	at app//org.elasticsearch.client.internal.node.NodeClient.executeLocally(NodeClient.java:107)
	at app//org.elasticsearch.client.internal.node.NodeClient.doExecute(NodeClient.java:85)
	at app//org.elasticsearch.client.internal.support.AbstractClient.execute(AbstractClient.java:160)
	at app//org.elasticsearch.client.internal.FilterClient.doExecute(FilterClient.java:57)
	at app//org.elasticsearch.client.internal.OriginSettingClient.doExecute(OriginSettingClient.java:44)
	at app//org.elasticsearch.client.internal.support.AbstractClient.execute(AbstractClient.java:160)
	at app//org.elasticsearch.client.internal.support.AbstractClient.search(AbstractClient.java:295)
	at app//org.elasticsearch.xpack.inference.registry.ModelRegistry.getAllModels(ModelRegistry.java:421)
	at app//org.elasticsearch.xpack.inference.integration.AuthorizationTaskExecutorIT.getEisEndpoints(AuthorizationTaskExecutorIT.java:198)
	at app//org.elasticsearch.xpack.inference.integration.AuthorizationTaskExecutorIT.lambda$assertChatCompletionEndpointExists$9(AuthorizationTaskExecutorIT.java:270)
	at app//org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:1610)
	at app//org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:1594)
	at app//org.elasticsearch.xpack.inference.integration.AuthorizationTaskExecutorIT.assertChatCompletionEndpointExists(AuthorizationTaskExecutorIT.java:269)
	at app//org.elasticsearch.xpack.inference.integration.AuthorizationTaskExecutorIT.assertChatCompletionEndpointExists(AuthorizationTaskExecutorIT.java:265)
	at app//org.elasticsearch.xpack.inference.integration.AuthorizationTaskExecutorIT.testCreatesChatCompletion_AndThenCreatesTextEmbedding(AuthorizationTaskExecutorIT.java:293)

@jonathan-buttner jonathan-buttner added >test Issues or PRs that are addressing/adding tests auto-backport Automatically create backport pull requests when merged :SearchOrg/Inference Label for the Search Inference team Team:Search - Inference v9.3.0 v9.3.1 labels Dec 23, 2025
@jonathan-buttner jonathan-buttner marked this pull request as ready for review January 5, 2026 19:40
@elasticsearchmachine
Collaborator

Pinging @elastic/search-inference-team (Team:Search - Inference)

@DonalEvans
Contributor

The test failure happened when calling assertChatCompletionEndpointExists(), which is called after calling assertNoAuthorizedEisEndpoints() earlier in the test. Both those methods call getEisEndpoints() so I'm confused why the first call would work but the second wouldn't if the issue is the indices not existing.

@jonathan-buttner
Contributor Author

jonathan-buttner commented Jan 5, 2026

Hmm

Looking at the logs closer...

[2025-12-13T00:02:23,901][INFO ][o.e.x.i.i.AuthorizationTaskExecutorIT][testCreatesChatCompletion_AndThenCreatesTextEmbedding] before test
[2025-12-13T00:02:23,924][INFO ][o.e.x.i.s.e.a.AuthorizationTaskExecutor][node_s_0][inference_utility][T#1] Started authorization poller task with id 42 
[2025-12-13T00:02:23,929][INFO ][o.e.x.i.s.e.a.AuthorizationTaskExecutor][node_s_0][masterService#updateTask][T#1] Finished creating authorization poller task, id eis-authorization-poller
[2025-12-13T00:02:23,966][ERROR][o.e.t.h.MockWebServer    ][[HTTP-Dispatcher]] failed to respond to request [GET /api/v2/authorizations]
java.lang.NullPointerException: Cannot invoke "org.elasticsearch.test.http.MockResponse.getBeforeReplyDelay()" because "response" is null

That part looks ok so far. The "response" is null is confusing but expected because initially we haven't queued a response for the mock webserver. So when the auth task makes a request, the webserver throws an error.

We get three of those because we retry on that exception.

Then we get:

The auth task is cancelled and restarted to force a new auth request to occur (after we've enqueued a response this time)
[2025-12-13T00:02:24,369][INFO ][o.e.x.i.s.e.a.AuthorizationTaskExecutor][node_s_0][inference_utility][T#1] Started authorization poller task with id 73


[2025-12-13T00:02:24,372][INFO ][o.e.x.i.s.e.a.AuthorizationTaskExecutor][node_s_0][masterService#updateTask][T#1] Finished creating authorization poller task, id eis-authorization-poller
[2025-12-13T00:02:24,385][INFO ][o.e.x.i.s.e.a.AuthorizationPoller][node_s_0][inference_response][T#1] Storing new EIS preconfigured inference endpoints with inference IDs [.rainbow-sprinkles-elastic]
[2025-12-13T00:02:24,416][INFO ][o.e.c.m.MetadataCreateIndexService][node_s_0][masterService#updateTask][T#1] creating index [.secrets-inference] in project [default], cause [auto(bulk api)], templates [], shards [1]/[1]
[2025-12-13T00:02:24,474][INFO ][o.e.c.m.MetadataCreateIndexService][node_s_0][masterService#updateTask][T#1] creating index [.inference] in project [default], cause [auto(bulk api)], templates [], shards [1]/[1]
[2025-12-13T00:02:24,498][INFO ][o.e.c.r.a.AllocationService][node_s_0][masterService#updateTask][T#1] in project [default] updating number_of_replicas to [0] for indices [.secrets-inference, .inference]


Test begins shutting down 🤔. This is odd. My guess is this occurs because of the all shards failed exception, but it's weird that we don't see that here.
[2025-12-13T00:02:24,714][INFO ][o.e.x.i.i.AuthorizationTaskExecutorIT][testCreatesChatCompletion_AndThenCreatesTextEmbedding] after test



[2025-12-13T00:02:24,767][INFO ][o.e.c.r.a.AllocationService][node_s_0][masterService#updateTask][T#1] current.health="GREEN" message="Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.secrets-inference][0], [.inference][0]]])." previous.health="YELLOW" reason="shards started [[.secrets-inference][0], [.inference][0]]"
[2025-12-13T00:02:24,849][INFO ][o.e.x.i.i.AuthorizationTaskExecutorIT][testCreatesChatCompletion_AndThenCreatesTextEmbedding] --> waiting for all free_context tasks to complete within a reasonable time
[2025-12-13T00:02:24,864][INFO ][o.e.c.m.MetadataDeleteIndexService][node_s_0][masterService#updateTask][T#1] [.inference/05q3RZujRu-ur3ptCi6Bzg] deleting index
[2025-12-13T00:02:24,865][INFO ][o.e.c.m.MetadataDeleteIndexService][node_s_0][masterService#updateTask][T#1] [.secrets-inference/yQVC-fX2T_GsaeC_aDYreA] deleting index
[2025-12-13T00:02:24,912][INFO ][o.e.n.Node               ][testCreatesChatCompletion_AndThenCreatesTextEmbedding] stopping ...
[2025-12-13T00:02:24,923][WARN ][o.e.x.i.r.ModelRegistry  ][node_s_0][system_write][T#1] Failed to store document id: [model_.rainbow-sprinkles-elastic] inference id: [.rainbow-sprinkles-elastic] index: [.inference] bulk failure message [[.inference/05q3RZujRu-ur3ptCi6Bzg] org.elasticsearch.index.IndexNotFoundException: no such index [.inference]]
[2025-12-13T00:02:24,924][WARN ][o.e.x.i.r.ModelRegistry  ][node_s_0][system_write][T#1] Failed to store document id: [model_.rainbow-sprinkles-elastic] inference id: [.rainbow-sprinkles-elastic] index: [.secrets-inference] bulk failure message [[.secrets-inference/yQVC-fX2T_GsaeC_aDYreA] org.elasticsearch.index.IndexNotFoundException: no such index [.secrets-inference]]
[2025-12-13T00:02:24,924][WARN ][o.e.x.i.r.ModelRegistry  ][node_s_0][system_write][T#1] Failed to add minimal service settings to cluster state for inference endpoints []
org.elasticsearch.cluster.NotMasterException: node closed
Caused by: org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: master service is in state [STOPPED]
[2025-12-13T00:02:24,927][WARN ][o.e.x.i.s.e.a.AuthorizationPoller][node_s_0][system_write][T#1] Failed to store new EIS preconfigured inference endpoints [[org.elasticsearch.xpack.inference.services.elastic.completion.ElasticInferenceServiceCompletionModel@d6cb32ed]]
org.elasticsearch.ElasticsearchStatusException: Failed to add the inference endpoints []. The service may be in an inconsistent state. Please try deleting and re-adding the endpoints.

It's odd that we don't see the all shards failed exception in those logs; it's further up in the output.

I wonder if the issue is that we're performing a search while we're trying to create the indices 🤔. That's what is different between the first and second calls to getEisEndpoints. The first call performs a search, but the indices won't exist and won't be created. The second call to getEisEndpoints isn't creating them either; the creation is done on a separate thread by the persistent task, but the search and the creation are happening concurrently.

I think it's still worth a shot trying to ensure that the indices are created ahead of time to see if that helps here. In this PR, the first call to getEisEndpoints will perform a search and when it gets the response back it'll persist the default endpoints (not the EIS specific ones).

So maybe that'll help 🤷‍♂️.
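
To illustrate why a search that races with index creation fails the test outright instead of being retried, here is a minimal sketch. It assumes (as far as I understand the helper) that assertBusy re-runs its block only when the block throws an AssertionError, so any other exception escapes on the first attempt. Every name below (BusyRetrySketch, ShardsNotReadyException, retryOnAssertionError, searchEndpoints, tolerant) is hypothetical and stands in for the real test code; none of it is Elasticsearch API.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class BusyRetrySketch {

    /** Hypothetical stand-in for a search failing while the shard is still starting. */
    static final class ShardsNotReadyException extends RuntimeException {
        ShardsNotReadyException(String message) {
            super(message);
        }
    }

    /**
     * Simplified busy-wait (an assumption about how such helpers behave):
     * re-runs the check only when it throws AssertionError.
     * Any other exception propagates on the first attempt.
     */
    static void retryOnAssertionError(Runnable check, int maxAttempts) {
        AssertionError last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                check.run();
                return;
            } catch (AssertionError e) {
                last = e; // remember the failure and try again
            }
        }
        throw last != null ? last : new AssertionError("no attempts were made");
    }

    /** Hypothetical search: throws until the simulated shard has started. */
    static int searchEndpoints(AtomicInteger calls, int readyAfterCalls) {
        if (calls.incrementAndGet() <= readyAfterCalls) {
            throw new ShardsNotReadyException("all shards failed");
        }
        return 1; // one endpoint found
    }

    /** Wraps the search so a transient shard failure becomes a retryable AssertionError. */
    static Runnable tolerant(AtomicInteger calls, int readyAfterCalls) {
        return () -> {
            try {
                if (searchEndpoints(calls, readyAfterCalls) != 1) {
                    throw new AssertionError("endpoint missing");
                }
            } catch (ShardsNotReadyException e) {
                throw new AssertionError("shards not ready yet", e);
            }
        };
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        retryOnAssertionError(tolerant(calls, 2), 5);
        System.out.println("search attempts: " + calls.get());
    }
}
```

Creating the indices ahead of time, as this PR does, avoids the window entirely; the wrapper above is only one alternative way to make such a check tolerant of it.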

The other thing this reveals is that we could handle updating the cluster state a little better.

Failed to add minimal service settings to cluster state for inference endpoints []

Specifically this means that we're passing an empty set of ids to the cluster state update. I think this is happening because we delete all the indices when the test completes, so there's probably a race condition when a failure occurs. We can probably skip the cluster state update when no endpoints were successfully created.
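
A rough sketch of that guard, assuming the store step reports a per-endpoint success flag. SkipEmptyUpdateSketch, StoreResult, successfulIds, and maybeUpdateClusterState are hypothetical names for illustration only and are not taken from ModelRegistry.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;

final class SkipEmptyUpdateSketch {

    /** Hypothetical per-endpoint outcome of a bulk store request. */
    static final class StoreResult {
        final String inferenceId;
        final boolean stored;

        StoreResult(String inferenceId, boolean stored) {
            this.inferenceId = inferenceId;
            this.stored = stored;
        }
    }

    /** Collects the ids that were actually stored, preserving order. */
    static Set<String> successfulIds(List<StoreResult> results) {
        Set<String> ids = new LinkedHashSet<>();
        for (StoreResult result : results) {
            if (result.stored) {
                ids.add(result.inferenceId);
            }
        }
        return ids;
    }

    /**
     * Submits the update only when at least one endpoint was stored.
     * Returns true if an update was submitted, false if it was skipped.
     */
    static boolean maybeUpdateClusterState(List<StoreResult> results, Consumer<Set<String>> submitUpdate) {
        Set<String> ids = successfulIds(results);
        if (ids.isEmpty()) {
            return false; // nothing was stored, so there is nothing to publish
        }
        submitUpdate.accept(ids);
        return true;
    }

    public static void main(String[] args) {
        List<StoreResult> allFailed = new ArrayList<>();
        allFailed.add(new StoreResult(".rainbow-sprinkles-elastic", false));
        boolean submitted = maybeUpdateClusterState(allFailed, ids -> System.out.println("updating " + ids));
        System.out.println("update submitted: " + submitted);
    }
}
```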

muted-tests.yml Outdated
Comment on lines 321 to 323
- class: org.elasticsearch.xpack.esql.ccq.MultiClusterSpecIT
method: test {csv-spec:spatial.ConvertFromStringParseError}
issue: https://github.com/elastic/elasticsearch/issues/139213
Contributor

Is this test mute being added intentionally?

Contributor Author

Oops nope, I'll remove it. Merge conflict issue.

@DonalEvans
Contributor

The "response" is null is confusing but expected because initially we haven't queued a response for the mock webserver.

The test queues a response in the web server in the @BeforeClass initClass() method, which causes the first test to run to not encounter the NPEs or ConnectionClosedException seen in the logs. Maybe that should be moved to the @Before createComponents() method so it's done for each test case and we can avoid creating unnecessary noise in the test?
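
The difference between the two setups can be sketched with a toy queue-backed server. This is not MockWebServer or JUnit; QueuedResponseSketch, ToyMockServer, and run are hypothetical names, and the loop only imitates a once-per-class versus once-per-test setup hook, assuming each test case makes one request.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class QueuedResponseSketch {

    /** Toy server: each request consumes one queued response, or gets null if none is queued. */
    static final class ToyMockServer {
        private final Deque<String> responses = new ArrayDeque<>();

        void enqueue(String body) {
            responses.addLast(body);
        }

        String handleRequest() {
            return responses.pollFirst(); // null when the queue is empty
        }
    }

    /** Runs the given number of test cases, each making one request, and records what each received. */
    static List<String> run(int tests, boolean enqueuePerTest) {
        ToyMockServer server = new ToyMockServer();
        if (!enqueuePerTest) {
            server.enqueue("{}"); // imitates a once-per-class setup
        }
        List<String> seen = new ArrayList<>();
        for (int i = 0; i < tests; i++) {
            if (enqueuePerTest) {
                server.enqueue("{}"); // imitates a per-test setup
            }
            seen.add(server.handleRequest());
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println("once per class: " + run(3, false));
        System.out.println("once per test:  " + run(3, true));
    }
}
```

With the once-per-class setup only the first test case finds a queued response; later ones receive null, which corresponds to the "response" is null errors in the logs.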

I think it's still worth a shot trying to ensure that the indices are created ahead of time to see if that helps here. In this PR, the first call to getEisEndpoints will perform a search and when it gets the response back it'll persist the default endpoints (not the EIS specific ones).

Agreed, making sure the indices get created seems like a sensible choice here. I do wonder if this race between inference index creation and running a query on the index is something a customer could hit and what action they should take if they do, though.

@jonathan-buttner
Contributor Author

The test queues a response in the web server in the @BeforeClass initClass() method, which causes the first test to run to not encounter the NPEs or ConnectionClosedException seen in the logs. Maybe that should be moved to the @Before createComponents() method so it's done for each test case and we can avoid creating unnecessary noise in the test?

Yeah good point, I don't know why I put that in the initClass() method 😆

@jonathan-buttner jonathan-buttner merged commit 92c8d08 into elastic:main Jan 6, 2026
35 checks passed
@elasticsearchmachine
Collaborator

💔 Backport failed

Status Branch Result
9.3 Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 139978

szybia added a commit to szybia/elasticsearch that referenced this pull request Jan 7, 2026
* upstream/main: (191 commits)
  Overall Decision for Deciders prioritizes THROTTLE (elastic#140237)
  Apply group by all logic not only to top-level aggregates (elastic#140248)
  [ES|QL] Refactor MV_UNION and MV_INTERSECTION to use shared set operation helper (elastic#139982)
  Avoid reading entire bloom filter file on reader open (elastic#139374)
  Mark bloom filter files for random access (elastic#139375)
  Ensure that the buffer used for ES93BloomFilterStoredFieldsFormat is zeroed (elastic#139034)
  Add busy assertion to avoid race condition for testStalledShardMigrationProperlyDetected (elastic#140230)
  Remove line number check for testTransitiveFindsDeepCallChain (elastic#140228)
  Allow a slight difference in rescored docs (elastic#139931)
  Mute org.elasticsearch.xpack.inference.integration.AuthorizationTaskExecutorIT testCreatesEisChatCompletion_DoesNotRemoveEndpointWhenNoLongerAuthorized elastic#138480
  Start exchange sink fetchers concurrently (elastic#140196)
  Allow allocation to replacement target node on vacate completion (elastic#140150)
  Ignore JNA cleaner threads in SecureHdfsRepositoryAnalysisRestIT (elastic#139925)
  DeterministicQueue refactor and enhancement (elastic#140151)
  Always error out if CCS expression shows up when CCS is not supported (elastic#139009)
  Use IllegalArgumentException over RepositoryException for readonly-repository checks (elastic#140200)
  Guard promql capabilities in AnalyzerTests (elastic#140232)
  [Inference API] Fix flaky AuthorizationTaskExecutorIT tests (elastic#139978)
  Cleaning up exitable vector value impls (elastic#140190)
  [Inference API] Fix auth exception listener not called bug (elastic#139966)
  ...
@jonathan-buttner
Contributor Author

💚 All backports created successfully

Status Branch Result
9.3


jonathan-buttner added a commit to jonathan-buttner/elasticsearch that referenced this pull request Jan 7, 2026
…139978)

* Fix flaky with no shards available exception

* Fixing merge and adding empty response before tests

(cherry picked from commit 92c8d08)

# Conflicts:
#	muted-tests.yml
sidosera pushed a commit to sidosera/elasticsearch that referenced this pull request Jan 7, 2026
…139978)

* Fix flaky with no shards available exception

* Fixing merge and adding empty response before tests
elasticsearchmachine pushed a commit that referenced this pull request Jan 7, 2026
…39978) (#140306)

* [Inference API] Fix flaky AuthorizationTaskExecutorIT tests (#139978)

* Fix flaky with no shards available exception

* Fixing merge and adding empty response before tests

(cherry picked from commit 92c8d08)

# Conflicts:
#	muted-tests.yml

* Fixing formatting
Labels

auto-backport Automatically create backport pull requests when merged backport pending :SearchOrg/Inference Label for the Search Inference team Team:Search - Inference >test Issues or PRs that are addressing/adding tests v9.3.0 v9.3.1 v9.4.0
