Increase concurrent request of opening point-in-time #96782
Conversation
Hi @dnhatn, I've created a changelog YAML for you.
Pinging @elastic/es-search (Team:Search)
I follow your reasoning @dnhatn, my only question is how we can verify that 20 is e.g. better than 5 and better than 100. Is it a matter of benchmarking PIT? Or did you do some estimation to come up with this number? cc @henningandersen as we recently discussed concurrency aspects in the search distributed execution.
@javanna Thank you for looking.
Opening a point-in-time is lightweight both in terms of the shard requests themselves and their execution. Therefore, its concurrency level can be higher than that of the search request, which defaults to 5. Previously, we didn't throttle the can_match requests, which are heavier than opening point-in-time requests. However, I think it would be safer to limit the number of outstanding requests per node to avoid exhausting resources. I estimate 20 as a suitable value, but I believe we can increase it further.
I'm not sure if benchmarking would provide helpful insights.
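The throttling model under discussion (a fixed cap on outstanding shard requests per node) can be sketched with a plain semaphore. This is a minimal standalone illustration, not Elasticsearch's actual transport code; the class and method names here are made up for the sketch:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

public class ShardRequestThrottle {
    // Cap on concurrent in-flight shard requests (the value debated in this PR).
    private final Semaphore permits;

    public ShardRequestThrottle(int maxConcurrentShardRequests) {
        this.permits = new Semaphore(maxConcurrentShardRequests);
    }

    // Submits each shard request to the executor, but never allows more than
    // maxConcurrentShardRequests of them to be in flight at once.
    public void run(ExecutorService executor, Runnable shardRequest) throws InterruptedException {
        permits.acquire();
        executor.execute(() -> {
            try {
                shardRequest.run();
            } finally {
                permits.release();
            }
        });
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(8);
        ShardRequestThrottle throttle = new ShardRequestThrottle(5);
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger maxObserved = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(40);
        for (int i = 0; i < 40; i++) {
            throttle.run(pool, () -> {
                int now = inFlight.incrementAndGet();
                maxObserved.accumulateAndGet(now, Math::max);
                try { Thread.sleep(10); } catch (InterruptedException ignored) {}
                inFlight.decrementAndGet();
                done.countDown();
            });
        }
        done.await();
        pool.shutdown();
        // maxObserved never exceeds the configured cap of 5
        System.out.println("max concurrent = " + maxObserved.get());
    }
}
```

The trade-off the thread is debating is purely the permit count: 1 (the accidental current value) serializes everything, while 5 or 20 bounds resource use without serializing.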
I wonder if just increasing to 5 would relieve the pain of this enough until we can batch by node? Notice that opening a PIT triggers refresh for search idle shards, in particular after #96321.
Can you explain the problem with the last part? Is this in relation to CCS? If so, it seems like a problem that will go away and thus we should not care about too deeply? We could make it configurable to help ease the transition though that comes with some future burden.
Thanks @henningandersen.
We need to add a new node-level action, which won't be available until both the remote cluster and local cluster are upgraded.
I initially thought about 5, but I have seen a case where opening a PIT across clusters took more than an hour and then failed because of ILM. If we go with 5, that request will still take at least 12 minutes, assuming shards are evenly distributed across the target nodes. However, I am fine with proceeding with 5 if we aren't confident in a higher number.
Yes, but I am not sure if it should be a request parameter or a node setting. It will take some time for Kibana to use the new request parameter. Are we okay with a node setting, which defaults to 5? I am prototyping enabling minimize_round_trips when opening a PIT; then I think 5 should be good enough.
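The 12-minute figure above follows from simple scaling: a request that takes roughly an hour at per-node concurrency 1 takes roughly 60/5 = 12 minutes at concurrency 5, assuming evenly distributed shards and uniform per-shard cost. A throwaway back-of-the-envelope calculation (numbers taken from this thread, not measured):

```java
public class PitConcurrencyEstimate {
    // Rough lower bound: total time scales inversely with the per-node
    // concurrency, assuming evenly distributed shards and uniform cost.
    static double estimatedMinutes(double minutesAtConcurrency1, int concurrency) {
        return minutesAtConcurrency1 / concurrency;
    }

    public static void main(String[] args) {
        System.out.println(estimatedMinutes(60, 1));   // current accidental throttle of 1
        System.out.println(estimatedMinutes(60, 5));   // search's default of 5
        System.out.println(estimatedMinutes(60, 20));  // the initially proposed 20
    }
}
```

This is only a lower bound: skewed shard distribution or slow nodes stretch the tail, which is part of the argument for a value above 5.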
@javanna @henningandersen Enabling minimize_round_trips is more complicated than I thought. Are we okay with a node setting, which defaults to 5?
@dnhatn shall we do the same that we have in _search, meaning a request parameter (
Thinking about this more. I think it's fine to add only the request parameter. I will work with Kibana to add this parameter to the CSV download. |
LGTM, thanks a lot for all the additional tests!
    private String[] indices;
    private IndicesOptions indicesOptions = DEFAULT_INDICES_OPTIONS;
    private TimeValue keepAlive;

    private int maxConcurrentShardRequests = 5;
nit: shall we link to the default constant for the same parameter in search, given the two have the same default value for now?
Yes, I pushed a252899.
@javanna @henningandersen Thanks so much for the reviews and feedback.
@@ -87,6 +87,7 @@ public class SearchRequest extends ActionRequest implements IndicesRequest.Repla
    private int batchedReduceSize = DEFAULT_BATCHED_REDUCE_SIZE;

    private int maxConcurrentShardRequests = 0;
    public static final int DEFAULT_MAX_CONCURRENT_SHARD_REQUESTS = 5;
oh boy, I thought we had this constant already, thanks for adding it.
Today, we mistakenly throttle the opening point-in-time API to 1 request per node. As a result, when attempting to open a point-in-time across large clusters, it can take a significant amount of time and eventually fails due to relocated target shards or deleted target indices managed by ILM. Ideally, we should batch the requests per node and eliminate this throttle completely. However, this requires all clusters to be on the latest version. This PR increases the number of concurrent requests from 1 to 5, which is the default of search.
This PR increases the number of concurrent requests from 1 to 20. This default is higher than search's default of 5 because opening a point-in-time is a lightweight operation: it doesn't perform any I/O and is executed directly on the network threads.
Any suggestions are welcome.