@sgup432 sgup432 commented Aug 1, 2025

Description

This fixes field data cache cleanup when an index is removed. Currently, during any index removal, the associated field data cache is cleaned up by iterating over ALL the cache keys, irrespective of whether they belong to the index being removed, and this happens on the cluster applier thread.

In some cases, due to a large number of entries in the fieldDataCache, this removal flow takes a long time; the data node is then unable to apply the cluster state and is eventually removed from the cluster for lagging.

A sample CPU profile looks like the following:

100.4% (501.8ms out of 500ms) cpu usage by thread 'opensearch[46a4b13b820a8bcf60ac8f1de15cee14][clusterApplierService#updateTask][T#1]'
     10/10 snapshots sharing following 22 elements
       app//org.opensearch.indices.fielddata.cache.IndicesFieldDataCache$IndexFieldCache.clear(IndicesFieldDataCache.java:219)
       app//org.opensearch.index.fielddata.IndexFieldDataService.clear(IndexFieldDataService.java:107)
       app//org.opensearch.index.fielddata.IndexFieldDataService.close(IndexFieldDataService.java:179)
       app//org.opensearch.core.internal.io.IOUtils.close(IOUtils.java:87)
       app//org.opensearch.core.internal.io.IOUtils.close(IOUtils.java:129)
       app//org.opensearch.core.internal.io.IOUtils.close(IOUtils.java:79)
       app//org.opensearch.index.IndexService.close(IndexService.java:362)
       app//org.opensearch.indices.IndicesService.removeIndex(IndicesService.java:887)
       app//org.opensearch.indices.cluster.IndicesClusterStateService.removeIndices(IndicesClusterStateService.java:410)
       app//org.opensearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:255)
       app//org.opensearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:591)
       app//org.opensearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:578)
       app//org.opensearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:546)
       app//org.opensearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:469)
       app//org.opensearch.cluster.service.ClusterApplierService.access$000(ClusterApplierService.java:81)
       app//org.opensearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:180)

Related issue, which this partially solves: #13862.

A follow-up PR will refactor the field data cache and further improve its data structures and iteration/removal flow.
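
For readers skimming the thread, here is a minimal, hedged sketch of the general idea (not the actual diff in this PR): the expensive full-cache iteration is handed off to a background executor so the cluster applier thread is not blocked. The class name and the raw executor are illustrative stand-ins; OpenSearch itself would go through its ThreadPool abstraction.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative sketch only, not the PR's diff. The idea: do not run the slow
// field data cache cleanup inline on the cluster applier thread; hand it to a
// background thread and let cluster state application proceed.
class AsyncFieldDataCacheClose {

    private final ExecutorService background = Executors.newCachedThreadPool();

    /** Called when an index is removed from this node. */
    void onIndexRemoved(Runnable clearFieldDataCacheForIndex) {
        // Before: clearFieldDataCacheForIndex.run() executed inline and could
        // block the applier thread long enough for the node to be marked as lagging.
        background.execute(clearFieldDataCacheForIndex);
    }
}
```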

Related Issues

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • [ ] Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Sagar Upadhyaya <[email protected]>

github-actions bot commented Aug 1, 2025

❌ Gradle check result for 392f6f2: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Sagar Upadhyaya <[email protected]>

github-actions bot commented Aug 1, 2025

✅ Gradle check result for 7e3c6d0: SUCCESS


codecov bot commented Aug 1, 2025

Codecov Report

❌ Patch coverage is 66.66667% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.93%. Comparing base (9b22c9b) to head (33ecf13).
⚠️ Report is 8 commits behind head on main.

Files with missing lines | Patch % | Lines
...nsearch/index/fielddata/IndexFieldDataService.java | 62.50% | 3 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #18888      +/-   ##
============================================
+ Coverage     72.77%   72.93%   +0.16%     
- Complexity    68690    68847     +157     
============================================
  Files          5582     5590       +8     
  Lines        315456   315816     +360     
  Branches      45778    45829      +51     
============================================
+ Hits         229568   230337     +769     
+ Misses        67290    66802     -488     
- Partials      18598    18677      +79     


@jainankitk jainankitk left a comment


Thanks @sgup432 for taking a stab at this issue. The code change mostly looks good to me, except for the case when clear throws an error. Currently, if clear throws an error it will prevent the cluster state from getting applied, and it will eventually get retried in the next publication attempt. With this code change, the cluster state might be applied irrespective of whether clear completed successfully. Maybe I am missing something that might still cause the clear to get retried even with this change?


sgup432 commented Aug 1, 2025

Currently, if clear throws an error it will prevent the cluster state from getting applied, and it will eventually get retried in the next publication attempt.

In my opinion, retrying the publication attempt just because of a field data cache error doesn't seem like a good idea, unless you think otherwise. I think we can at least catch the exception here, log it, and move on for now.

If clear() throws an exception in the modified logic, the stale index entries will linger in the cache for a while until they are removed automatically by size-based eviction. So in the worst case it might evict some valid entries and cause some performance impact, but I think that is still a better tradeoff than node drops. The chances of that happening also seem much lower to me than the chances of a node drop.
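
As a rough illustration of the "catch, log, and move on" idea (assuming a plain java.util.logging logger; the actual change would use OpenSearch's own logging), the clear call could be wrapped like this:

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hedged sketch: a failed clear() is logged and swallowed so it cannot block
// cluster state application; the stale entries it leaves behind are reclaimed
// later by size-based eviction.
class SafeCacheClear {

    private static final Logger LOGGER = Logger.getLogger(SafeCacheClear.class.getName());

    static void clearQuietly(String index, Runnable clear) {
        try {
            clear.run();
        } catch (Exception e) {
            LOGGER.log(Level.WARNING, "Failed to clear field data cache for index [" + index + "]", e);
        }
    }
}
```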


github-actions bot commented Aug 1, 2025

❌ Gradle check result for f281b98: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@jainankitk

Seems like a flaky test, retrying gradle check:

[Test Result](https://build.ci.opensearch.org/job/gradle-check/61448/testReport/) (1 failure / +1)

    [org.opensearch.arrow.flight.bootstrap.FlightClientManagerTests.testGetFlightClientLocationExecutionError](https://build.ci.opensearch.org/job/gradle-check/61448/testReport/junit/org.opensearch.arrow.flight.bootstrap/FlightClientManagerTests/testGetFlightClientLocationExecutionError/)

@jainankitk

In my opinion, retrying the publication attempt just because of a field data cache error doesn't seem like a good idea, unless you think otherwise. I think we can at least catch the exception here, log it, and move on for now.

I was primarily concerned about the impact of clear not running successfully and leaving unused entries in the field data cache. But as you mentioned below, firstly this is a cache so its allocated size on the JVM is bounded, and secondly those unused entries will eventually be cleaned up.

If clear() throws an exception in the modified logic, the stale index entries will linger in the cache for a while until they are removed automatically by size-based eviction. So in the worst case it might evict some valid entries and cause some performance impact, but I think that is still a better tradeoff than node drops. The chances of that happening also seem much lower to me than the chances of a node drop.

Also, I don't see this running into errors often, especially since the clear invocation does not do any disk/network-related operations. But I wanted to understand the implications if, for some reason, it does fail.


github-actions bot commented Aug 1, 2025

✅ Gradle check result for f281b98: SUCCESS


sohami commented Aug 1, 2025

I think we should not use the generic threadpool here, as that changes the close() semantics. Earlier it was single-threaded, with the cluster applier thread in the picture (and also via the scheduler threadpool); now we are making it multi-threaded with the generic pool. If multiple shards (for different indexes) are being removed from the node, then each cleanup will happen on a different generic thread, which can block those threads for longer.


sohami commented Aug 1, 2025

I think we should not use the generic threadpool here, as that changes the close() semantics. Earlier it was single-threaded, with the cluster applier thread in the picture (and also via the scheduler threadpool); now we are making it multi-threaded with the generic pool. If multiple shards (for different indexes) are being removed from the node, then each cleanup will happen on a different generic thread, which can block those threads for longer.

I was looking more into this. During the IndicesFieldDataCache$IndexFieldCache.clear(IndicesFieldDataCache.java:219) call, it does two things: a) invalidate the entries for that index and its fields, and b) call refresh on the global field data cache. Step b) happens under the global lock, as it mutates the cache, and is expensive. Step a) is not under any lock and is not expensive either. So the multi-threaded nature here will not be very performant anyway. Based on this, I think we can consider moving the clear call on close to the scheduler thread as well.

The other option I can think of is:

Today the scheduler already takes care of evicting invalidated entries on some schedule. So as part of IndexService::close we could just invalidate the keys and avoid the refresh. Once the keys are invalidated, the scheduler will take the global lock and, in a single iteration, remove all the stale entries. This would avoid refreshing the global cache on every shard movement or IndexService::close call.
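
A self-contained sketch of this alternative, with made-up names and key layout (the real field data cache keys also carry field and reader information): IndexService::close only marks keys as stale, and a single scheduled pass later removes them under the global lock.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustrative only -- not OpenSearch's IndicesFieldDataCache.
class InvalidateOnCloseCache {

    private final Map<String, byte[]> cache = new ConcurrentHashMap<>();
    private final Set<String> invalidated = ConcurrentHashMap.newKeySet();
    private final ScheduledExecutorService cleaner = Executors.newSingleThreadScheduledExecutor();
    private final Object globalLock = new Object();

    InvalidateOnCloseCache() {
        // One periodic pass removes everything marked stale, so a per-index
        // close never pays the full compaction cost itself.
        cleaner.scheduleAtFixedRate(this::removeInvalidatedEntries, 1, 1, TimeUnit.MINUTES);
    }

    /** Called from IndexService::close -- cheap: only marks keys, no compaction. */
    void onIndexClose(String indexName) {
        cache.keySet().forEach(key -> {
            if (key.startsWith(indexName + "|")) {   // toy key layout: "index|field|segment"
                invalidated.add(key);
            }
        });
    }

    private void removeInvalidatedEntries() {
        synchronized (globalLock) {   // stand-in for the cache's global lock
            invalidated.forEach(cache::remove);
            invalidated.clear();
        }
    }
}
```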


sgup432 commented Aug 1, 2025

If multiple shards (for different indexes) are being removed from the node, then each cleanup will happen on a different generic thread, which can block those threads for longer.

Yeah it will, but I guess that is a known tradeoff we are taking here, as our objective for now is to avoid node drops in the worst case. The underlying problem, i.e. the inefficient removal flow, is still a problem, so moving this logic to any other thread would still block that thread.

Today the scheduler already takes care of evicting invalidated entries on some schedule.

I don't think that's the case. The CacheCleaner (scheduler thread) only calls this.cache.getCache().refresh(), which only tries to evict items if the cache has exceeded its size or the items have expired.

Invalidation is done either through an explicit IndexService.close or when the underlying indexReader closes (in which case we call cache.invalidate(key) directly).
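
For readers new to this code path, here is a self-contained toy model of the distinction being made (not OpenSearch's actual Cache class, which is considerably more involved): invalidate(key) removes an entry immediately, while refresh() only reclaims entries once the cache is over capacity (expiry is omitted for brevity), so entries of a deleted index are not treated specially.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model of the two cleanup paths discussed above; illustration only.
class ToyFieldDataCache {

    private final Map<String, byte[]> entries = new LinkedHashMap<>();
    private final int maxEntries;

    ToyFieldDataCache(int maxEntries) {
        this.maxEntries = maxEntries;
    }

    void put(String key, byte[] value) {
        entries.put(key, value);
    }

    /** Explicit invalidation: what IndexService.close / reader close trigger. */
    void invalidate(String key) {
        entries.remove(key);
    }

    /** What the periodic cleaner's refresh amounts to here: evict only while
     *  over capacity; entries for a deleted index are not special-cased. */
    void refresh() {
        Iterator<Map.Entry<String, byte[]>> it = entries.entrySet().iterator();
        while (entries.size() > maxEntries && it.hasNext()) {
            it.next();
            it.remove();   // drops oldest-inserted entries first
        }
    }
}
```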


sohami commented Aug 1, 2025

Yeah it will, but I guess that is a known tradeoff we are taking here, as our objective for now is to avoid node drops in the worst case. The underlying problem, i.e. the inefficient removal flow, is still a problem, so moving this logic to any other thread would still block that thread.

I understand the main reason is to avoid node drops, but I was wondering if it could cause other issues with the generic pool. It seems the generic pool is used for other background tasks as well, like recovery, etc., so it should be fine here.

I don't think that's the case. The CacheCleaner (scheduler thread) only calls this.cache.getCache().refresh(), which only tries to evict items if the cache has exceeded its size or the items have expired.

You are right; my assumption was that invalidation makes an entry eligible for eviction, since refresh was also called later in clear. But that is not the case.


github-actions bot commented Aug 1, 2025

✅ Gradle check result for 33ecf13: SUCCESS

@jainankitk jainankitk merged commit 4cd6962 into opensearch-project:main Aug 1, 2025
30 of 31 checks passed
sunqijun1 pushed a commit to sunqijun1/OpenSearch that referenced this pull request Aug 4, 2025
Signed-off-by: Sagar Upadhyaya <[email protected]>
Signed-off-by: Sagar <[email protected]>
Signed-off-by: sunqijun.jun <[email protected]>
tandonks pushed a commit to tandonks/OpenSearch that referenced this pull request Aug 5, 2025
@sgup432 sgup432 deleted the field_data_cache_async_close branch August 12, 2025 18:20
vinaykpud pushed a commit to vinaykpud/OpenSearch that referenced this pull request Sep 26, 2025