Close IndexFieldDataService asynchronously #18888
Conversation
Signed-off-by: Sagar Upadhyaya <[email protected]>
Signed-off-by: Sagar Upadhyaya <[email protected]>
❌ Gradle check result for 392f6f2: FAILURE. Please examine the workflow log, locate and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
Signed-off-by: Sagar Upadhyaya <[email protected]>
Codecov Report: ❌ Patch coverage is

Additional details and impacted files

```
@@             Coverage Diff              @@
##               main   #18888      +/-   ##
============================================
+ Coverage     72.77%   72.93%   +0.16%
- Complexity    68690    68847     +157
============================================
  Files          5582     5590       +8
  Lines        315456   315816     +360
  Branches      45778    45829      +51
============================================
+ Hits         229568   230337     +769
+ Misses        67290    66802     -488
- Partials      18598    18677      +79
```

☔ View full report in Codecov by Sentry.
Thanks @sgup432 for taking a stab at this issue. The code change mostly looks good to me, except for when clear throws an error. Currently, if clear throws an error it prevents the cluster state from being applied, and the update eventually gets retried in the next publication attempt. With this code change, the cluster state might be applied irrespective of whether clear completed successfully. Maybe I am missing something that might still cause the clear to get retried even with this change?
In my opinion, retrying the publication attempt just because of a field data cache error doesn't seem like a good idea either, unless you think otherwise. I think we can at least catch the exception here, log it, and move on for now.
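A minimal sketch of that suggestion, assuming a hypothetical `clearSafely` helper and a plain `java.util.logging` logger (the real code would use the project's own logger and the actual `clear()` call):

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Illustrative only, not the PR's actual code: swallow and log any
// failure from the field data cache clear so that cluster state
// application is never blocked by it.
final class SafeCacheClear {
    private static final Logger logger = Logger.getLogger(SafeCacheClear.class.getName());

    static void clearSafely(Runnable clear, String indexName) {
        try {
            clear.run(); // e.g. the IndexFieldDataService clear for this index
        } catch (Exception e) {
            // Log and move on; the scheduled cache cleaner will eventually
            // evict any stale entries anyway.
            logger.log(Level.WARNING,
                "failed to clear field data cache for index [" + indexName + "]", e);
        }
    }
}
```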
Signed-off-by: Sagar Upadhyaya <[email protected]>
❌ Gradle check result for f281b98: FAILURE. Please examine the workflow log, locate and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
Seems like a flaky test, retrying gradle check.
I was primarily concerned about the impact of
Also, I don't see this running into errors much, especially since the
I think we should not use
Was looking more into this: during the
Another option I can think of is: today the scheduler anyway takes care of evicting the invalidated entries on some schedule. So as part of
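A rough sketch of that alternative, with purely illustrative names (a `ConcurrentHashMap` stands in for the field data cache, and a `ScheduledExecutorService` for the existing cleanup schedule):

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustrative only: index removal just records the index name (O(1) on
// the cluster applier thread); the expensive key iteration happens later
// on the scheduled cleanup thread, mirroring the existing CacheCleaner.
final class DeferredFieldDataEviction {
    private final Map<String, Object> cache = new ConcurrentHashMap<>(); // key: "index/field"
    private final Set<String> removedIndices = ConcurrentHashMap.newKeySet();
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();

    DeferredFieldDataEviction() {
        scheduler.scheduleAtFixedRate(this::sweep, 1, 1, TimeUnit.MINUTES);
    }

    void onIndexRemoved(String indexName) {
        removedIndices.add(indexName); // cheap; no iteration here
    }

    private void sweep() {
        for (String index : removedIndices) {
            cache.keySet().removeIf(key -> key.startsWith(index + "/"));
            removedIndices.remove(index);
        }
    }
}
```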
Yeah, it will, but I guess that is a known tradeoff we are taking here? Our objective for now was to avoid node drops in the worst case. The underlying problem, i.e. the inefficient removal flow, is still a problem, so moving this logic to any other thread would still block it.
I don't think that's the case? CacheCleaner (the scheduler thread) only calls
Invalidation is either done through an explicit IndexService.close or when the underlying indexReader closes (in which case we call
I understand the main reason is to avoid node drops, but I was wondering if it can cause other issues with
You are right, my assumption was that invalidation would make it eligible for eviction. Since
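For reference, a tiny demo of the invalidation-vs-eviction distinction, using Guava's cache purely as a stand-in for the field data cache: `invalidate()` removes the entry eagerly (with cause `EXPLICIT`) rather than merely marking it for a later sweep.

```java
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.RemovalListener;

public final class InvalidationDemo {
    public static void main(String[] args) {
        RemovalListener<String, String> listener =
            n -> System.out.println("removed: " + n.getKey() + " cause=" + n.getCause());
        Cache<String, String> cache = CacheBuilder.newBuilder()
            .removalListener(listener)
            .build();

        cache.put("idx/field", "fielddata");
        cache.invalidate("idx/field"); // removal happens now, cause=EXPLICIT
        System.out.println(cache.getIfPresent("idx/field")); // null: already gone
    }
}
```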
Signed-off-by: Sagar <[email protected]>
Signed-off-by: Sagar Upadhyaya <[email protected]> Signed-off-by: Sagar <[email protected]> Signed-off-by: sunqijun.jun <[email protected]>
Signed-off-by: Sagar Upadhyaya <[email protected]> Signed-off-by: Sagar <[email protected]>
Signed-off-by: Sagar Upadhyaya <[email protected]> Signed-off-by: Sagar <[email protected]>
Description
This fixes the field data cache cleanup when an index is removed. Currently, during any index removal, the associated field data cache is cleaned up by iterating over ALL the keys, irrespective of whether they belong to the index being removed, and this happens on the cluster applier thread.
In some cases, due to a large number of entries in the fieldDataCache, this removal flow takes a lot of time, the data node isn't able to apply the cluster state, and it is eventually removed from the cluster due to lagging. Sample CPU profiles look something like the below.
Related issue, which this partially solves: #13862.
There will be a follow-up PR to refactor the field data cache and further improve its data structures and iteration/removal flow.
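For context, a minimal sketch of the approach the PR title describes, assuming a plain single-thread executor where the real change would use one of OpenSearch's thread pools; all names here are illustrative:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative only: hand the slow, full-key-iteration cache clear off to
// a background thread so the cluster applier thread returns immediately
// and the node can keep applying cluster states.
final class AsyncFieldDataClose {
    private final ExecutorService background = Executors.newSingleThreadExecutor();

    void closeAsync(Runnable clearFieldDataCache, String indexName) {
        background.execute(() -> {
            try {
                clearFieldDataCache.run(); // potentially slow: iterates all cache keys
            } catch (Exception e) {
                // A cache error must never surface on the applier path.
                System.err.println("field data cache clear failed for [" + indexName + "]: " + e);
            }
        });
    }
}
```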
Related Issues
Check List
[ ] Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following the Developer Certificate of Origin and signing off your commits, please check here.