@sgup432 sgup432 commented Aug 1, 2025

Description

This fixes field data cache cleanup when an index is removed. Currently, during any index removal, the associated field data cache is cleaned up by iterating over ALL the cache keys, irrespective of whether they belong to the index being removed, and this happens on the cluster applier thread.

In some cases, due to a large number of entries in the fieldDataCache, this removal flow takes a long time; the data node is then unable to apply the cluster state and is eventually removed from the cluster for lagging.

A sample CPU profile looks like the following:

100.4% (501.8ms out of 500ms) cpu usage by thread 'opensearch[46a4b13b820a8bcf60ac8f1de15cee14][clusterApplierService#updateTask][T#1]'
     10/10 snapshots sharing following 22 elements
       app//org.opensearch.indices.fielddata.cache.IndicesFieldDataCache$IndexFieldCache.clear(IndicesFieldDataCache.java:219)
       app//org.opensearch.index.fielddata.IndexFieldDataService.clear(IndexFieldDataService.java:107)
       app//org.opensearch.index.fielddata.IndexFieldDataService.close(IndexFieldDataService.java:179)
       app//org.opensearch.core.internal.io.IOUtils.close(IOUtils.java:87)
       app//org.opensearch.core.internal.io.IOUtils.close(IOUtils.java:129)
       app//org.opensearch.core.internal.io.IOUtils.close(IOUtils.java:79)
       app//org.opensearch.index.IndexService.close(IndexService.java:362)
       app//org.opensearch.indices.IndicesService.removeIndex(IndicesService.java:887)
       app//org.opensearch.indices.cluster.IndicesClusterStateService.removeIndices(IndicesClusterStateService.java:410)
       app//org.opensearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:255)
       app//org.opensearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:591)
       app//org.opensearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:578)
       app//org.opensearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:546)
       app//org.opensearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:469)
       app//org.opensearch.cluster.service.ClusterApplierService.access$000(ClusterApplierService.java:81)
       app//org.opensearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:180)

Related issue, which this partially solves: #13862.

A follow-up PR will refactor the field data cache and further improve its data structures and iteration/removal flow.
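
For readers skimming the thread, here is a minimal, hedged sketch of the general idea (not the actual diff in this PR): the expensive full-cache iteration is handed off to a background executor so the cluster applier thread is not blocked. The class name and the raw executor are illustrative stand-ins; OpenSearch itself would go through its ThreadPool abstraction.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative sketch only, not the PR's diff. The idea: do not run the slow
// field data cache cleanup inline on the cluster applier thread; hand it to a
// background thread and let cluster state application proceed.
class AsyncFieldDataCacheClose {

    private final ExecutorService background = Executors.newCachedThreadPool();

    /** Called when an index is removed from this node. */
    void onIndexRemoved(Runnable clearFieldDataCacheForIndex) {
        // Before: clearFieldDataCacheForIndex.run() executed inline and could
        // block the applier thread long enough for the node to be marked as lagging.
        background.execute(clearFieldDataCacheForIndex);
    }
}
```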

Related Issues

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • [ ] Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Sagar Upadhyaya <[email protected]>

github-actions bot commented Aug 1, 2025

❌ Gradle check result for 392f6f2: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Sagar Upadhyaya <[email protected]>

github-actions bot commented Aug 1, 2025

✅ Gradle check result for 7e3c6d0: SUCCESS


codecov bot commented Aug 1, 2025

Codecov Report

❌ Patch coverage is 66.66667% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.93%. Comparing base (9b22c9b) to head (33ecf13).
⚠️ Report is 8 commits behind head on main.

Files with missing lines | Patch % | Lines
...nsearch/index/fielddata/IndexFieldDataService.java | 62.50% | 3 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #18888      +/-   ##
============================================
+ Coverage     72.77%   72.93%   +0.16%     
- Complexity    68690    68847     +157     
============================================
  Files          5582     5590       +8     
  Lines        315456   315816     +360     
  Branches      45778    45829      +51     
============================================
+ Hits         229568   230337     +769     
+ Misses        67290    66802     -488     
- Partials      18598    18677      +79     


@jainankitk jainankitk left a comment


Thanks @sgup432 for taking a stab at this issue. The code change mostly looks good to me, except for the case when clear throws an error. Currently, if clear throws an error it will prevent the cluster state from getting applied, and it will eventually get retried in the next publication attempt. With this code change, the cluster state might be applied irrespective of whether clear completed successfully. Maybe I am missing something that might still cause the clear to get retried even with this change?


sgup432 commented Aug 1, 2025

Currently, if clear throws an error it will prevent the cluster state from getting applied, and it will eventually get retried in the next publication attempt.

In my opinion, retrying the publication attempt just because of a field data cache error doesn't seem like a good idea, unless you think otherwise. I think we can at least catch the exception here, log it, and move on for now.

If clear() throws an exception in the modified logic, the stale index entries will linger in the cache for a while until they are removed automatically by size-based eviction. So in the worst case it might evict some valid entries and cause some performance impact, but I think that is still a better tradeoff than node drops. The chances of that happening also seem much lower to me than the chances of a node drop.
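
As a rough illustration of the "catch, log, and move on" idea (assuming a plain java.util.logging logger; the actual change would use OpenSearch's own logging), the clear call could be wrapped like this:

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hedged sketch: a failed clear() is logged and swallowed so it cannot block
// cluster state application; the stale entries it leaves behind are reclaimed
// later by size-based eviction.
class SafeCacheClear {

    private static final Logger LOGGER = Logger.getLogger(SafeCacheClear.class.getName());

    static void clearQuietly(String index, Runnable clear) {
        try {
            clear.run();
        } catch (Exception e) {
            LOGGER.log(Level.WARNING, "Failed to clear field data cache for index [" + index + "]", e);
        }
    }
}
```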


github-actions bot commented Aug 1, 2025

❌ Gradle check result for f281b98: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@jainankitk

Seems like a flaky test, retrying gradle check:

[Test Result](https://build.ci.opensearch.org/job/gradle-check/61448/testReport/) (1 failure / +1)

    [org.opensearch.arrow.flight.bootstrap.FlightClientManagerTests.testGetFlightClientLocationExecutionError](https://build.ci.opensearch.org/job/gradle-check/61448/testReport/junit/org.opensearch.arrow.flight.bootstrap/FlightClientManagerTests/testGetFlightClientLocationExecutionError/)

@jainankitk

In my opinion, retrying the publication attempt just because of a field data cache error doesn't seem like a good idea, unless you think otherwise. I think we can at least catch the exception here, log it, and move on for now.

I was primarily concerned about the impact of clear not running successfully and leaving unused entries in the field data cache. But as you mentioned below, firstly this is a cache so its allocated size on the JVM is bounded, and secondly those unused entries will eventually be cleaned up.

If clear() throws an exception in the modified logic, the stale index entries will linger in the cache for a while until they are removed automatically by size-based eviction. So in the worst case it might evict some valid entries and cause some performance impact, but I think that is still a better tradeoff than node drops. The chances of that happening also seem much lower to me than the chances of a node drop.

Also, I don't see this running into errors often, especially since the clear invocation does not do any disk/network-related operations. But I wanted to understand the implications if, for some reason, it does fail.


github-actions bot commented Aug 1, 2025

✅ Gradle check result for f281b98: SUCCESS


sohami commented Aug 1, 2025

I think we should not use the generic threadpool here, as that changes the close() semantics. Earlier it was single-threaded, with the cluster applier thread in the picture (and also via the scheduler threadpool); now we are making it multi-threaded with the generic pool. If multiple shards (for different indexes) are being removed from the node, then each cleanup will happen on a different generic thread, which can block those threads for longer.


sohami commented Aug 1, 2025

I think we should not use the generic threadpool here, as that changes the close() semantics. Earlier it was single-threaded, with the cluster applier thread in the picture (and also via the scheduler threadpool); now we are making it multi-threaded with the generic pool. If multiple shards (for different indexes) are being removed from the node, then each cleanup will happen on a different generic thread, which can block those threads for longer.

I was looking more into this. During the IndicesFieldDataCache$IndexFieldCache.clear(IndicesFieldDataCache.java:219) call, it does two things: a) invalidate the entries for that index and its fields, and b) call refresh on the global field data cache. Step b) happens under the global lock, as it mutates the cache, and is expensive. Step a) is not under any lock and is not expensive either. So the multi-threaded nature here will not be very performant anyway. Based on this, I think we can consider moving the clear call on close to the scheduler thread as well.

The other option I can think of is:

Today the scheduler already takes care of evicting invalidated entries on some schedule. So as part of IndexService::close we could just invalidate the keys and avoid the refresh. Once the keys are invalidated, the scheduler will take the global lock and, in a single iteration, remove all the stale entries. This would avoid refreshing the global cache on every shard movement or IndexService::close call.
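
A self-contained sketch of this alternative, with made-up names and key layout (the real field data cache keys also carry field and reader information): IndexService::close only marks keys as stale, and a single scheduled pass later removes them under the global lock.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustrative only -- not OpenSearch's IndicesFieldDataCache.
class InvalidateOnCloseCache {

    private final Map<String, byte[]> cache = new ConcurrentHashMap<>();
    private final Set<String> invalidated = ConcurrentHashMap.newKeySet();
    private final ScheduledExecutorService cleaner = Executors.newSingleThreadScheduledExecutor();
    private final Object globalLock = new Object();

    InvalidateOnCloseCache() {
        // One periodic pass removes everything marked stale, so a per-index
        // close never pays the full compaction cost itself.
        cleaner.scheduleAtFixedRate(this::removeInvalidatedEntries, 1, 1, TimeUnit.MINUTES);
    }

    /** Called from IndexService::close -- cheap: only marks keys, no compaction. */
    void onIndexClose(String indexName) {
        cache.keySet().forEach(key -> {
            if (key.startsWith(indexName + "|")) {   // toy key layout: "index|field|segment"
                invalidated.add(key);
            }
        });
    }

    private void removeInvalidatedEntries() {
        synchronized (globalLock) {   // stand-in for the cache's global lock
            invalidated.forEach(cache::remove);
            invalidated.clear();
        }
    }
}
```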


sgup432 commented Aug 1, 2025

If multiple shards (for different indexes) are being removed from the node, then each cleanup will happen on a different generic thread, which can block those threads for longer.

Yeah it will, but I guess that is a known tradeoff we are taking here, as our objective for now is to avoid node drops in the worst case. The underlying problem, i.e. the inefficient removal flow, is still a problem, so moving this logic to any other thread would still block that thread.

Today the scheduler already takes care of evicting invalidated entries on some schedule.

I don't think that's the case. The CacheCleaner (scheduler thread) only calls this.cache.getCache().refresh(), which only tries to evict items if the cache has exceeded its size or the items have expired.

Invalidation is done either through an explicit IndexService.close or when the underlying indexReader closes (in which case we call cache.invalidate(key) directly).
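
For readers new to this code path, here is a self-contained toy model of the distinction being made (not OpenSearch's actual Cache class, which is considerably more involved): invalidate(key) removes an entry immediately, while refresh() only reclaims entries once the cache is over capacity (expiry is omitted for brevity), so entries of a deleted index are not treated specially.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model of the two cleanup paths discussed above; illustration only.
class ToyFieldDataCache {

    private final Map<String, byte[]> entries = new LinkedHashMap<>();
    private final int maxEntries;

    ToyFieldDataCache(int maxEntries) {
        this.maxEntries = maxEntries;
    }

    void put(String key, byte[] value) {
        entries.put(key, value);
    }

    /** Explicit invalidation: what IndexService.close / reader close trigger. */
    void invalidate(String key) {
        entries.remove(key);
    }

    /** What the periodic cleaner's refresh amounts to here: evict only while
     *  over capacity; entries for a deleted index are not special-cased. */
    void refresh() {
        Iterator<Map.Entry<String, byte[]>> it = entries.entrySet().iterator();
        while (entries.size() > maxEntries && it.hasNext()) {
            it.next();
            it.remove();   // drops oldest-inserted entries first
        }
    }
}
```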


sohami commented Aug 1, 2025

Yeah it will, but I guess that is a known tradeoff we are taking here, as our objective for now is to avoid node drops in the worst case. The underlying problem, i.e. the inefficient removal flow, is still a problem, so moving this logic to any other thread would still block that thread.

I understand the main reason is to avoid node drops, but I was wondering if it could cause other issues with the generic pool. It seems the generic pool is used for other background tasks as well, like recovery, etc., so it should be fine here.

I don't think that's the case. The CacheCleaner (scheduler thread) only calls this.cache.getCache().refresh(), which only tries to evict items if the cache has exceeded its size or the items have expired.

You are right; my assumption was that invalidation makes an entry eligible for eviction, since refresh was also called later in clear. But that is not the case.


github-actions bot commented Aug 1, 2025

✅ Gradle check result for 33ecf13: SUCCESS

@jainankitk jainankitk merged commit 4cd6962 into opensearch-project:main Aug 1, 2025
30 of 31 checks passed
sunqijun1 pushed a commit to sunqijun1/OpenSearch that referenced this pull request Aug 4, 2025
Signed-off-by: Sagar Upadhyaya <[email protected]>
Signed-off-by: Sagar <[email protected]>
Signed-off-by: sunqijun.jun <[email protected]>
tandonks pushed a commit to tandonks/OpenSearch that referenced this pull request Aug 5, 2025
@sgup432 sgup432 deleted the field_data_cache_async_close branch August 12, 2025 18:20
vinaykpud pushed a commit to vinaykpud/OpenSearch that referenced this pull request Sep 26, 2025