
Increase concurrent request of opening point-in-time #96782

Merged (10 commits) on Jun 20, 2023

Conversation

@dnhatn (Member) commented Jun 12, 2023

Today, we mistakenly throttle the opening point-in-time API to 1 request per node. As a result, when attempting to open a point-in-time across large clusters, it can take a significant amount of time and may eventually fail due to target shards being relocated or target indices being deleted by ILM. Ideally, we should batch the requests per node and eliminate this throttle completely. However, this requires all clusters to be on the latest version.

This PR increases the number of concurrent requests from 1 to 20. This default is higher than the search default of 5 because opening a point-in-time is a lightweight operation: it doesn't perform any I/O and is executed directly on the network threads.

Edit: following the discussion below, this PR now increases the number of concurrent requests from 1 to 5, which is the default for search, and exposes the limit as a request parameter.

Any suggestions are welcome.
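
For context, this is the API being discussed: opening a point-in-time fans out one shard-level request to every shard of the target indices, and the limit in question caps how many of those requests can be outstanding per node. A minimal example of the call (index pattern and keep-alive are illustrative):

POST /my-index-*/_pit?keep_alive=5m

The response contains a PIT id that later _search requests reference, so a slow or failing open call holds up the search workflow behind it.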

@dnhatn dnhatn added :Search/Search Search-related issues that do not fall into other categories >bug v8.8.1 v8.8.2 v7.17.11 and removed v8.8.1 labels Jun 12, 2023
@elasticsearchmachine (Collaborator)

Hi @dnhatn, I've created a changelog YAML for you.

@dnhatn dnhatn requested a review from javanna June 12, 2023 21:18
@dnhatn dnhatn marked this pull request as ready for review June 12, 2023 21:18
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-search (Team:Search)

@elasticsearchmachine elasticsearchmachine added the Team:Search Meta label for search team label Jun 12, 2023
@javanna (Member) left a comment

I follow your reasoning @dnhatn, my only question is how we can verify that 20 is e.g. better than 5 and better than 100. Is it a matter of benchmarking pit? Or did you do some estimation to come up with this number? cc @henningandersen as we recently discussed concurrency aspects in the search distributed execution.

@dnhatn (Member, Author) commented Jun 13, 2023

@javanna Thank you for looking.

my only question is how we can verify that 20 is e.g. better than 5 and better than 100

Opening a point-in-time is lightweight both in terms of the shard requests themselves and their execution. Therefore, its concurrency level can be higher than that of the search request, which defaults to 5. Previously, we didn't throttle the can_match requests, which are heavier than opening point-in-time requests. However, I think it would be safer to limit the number of outstanding requests per node to avoid exhausting resources. I estimate 20 to be a suitable value, but I believe we could increase it further.

Is it a matter of benchmarking pit?

I'm not sure if benchmarking would provide helpful insights.

@henningandersen (Contributor)

I wonder if just increasing to 5 would relieve the pain of this enough until we can batch by node? Notice that opening a PIT triggers refresh for search idle shards, in particular after #96321.

Ideally, we should batch the requests per node and eliminate this throttle completely. However, this requires all clusters to be on the latest version.

Can you explain the problem with the last part? Is this in relation to CCS? If so, it seems like a problem that will go away and thus we should not care about too deeply?

We could make it configurable to help ease the transition though that comes with some future burden.

@dnhatn (Member, Author) commented Jun 14, 2023

Thanks @henningandersen.

Ideally, we should batch the requests per node and eliminate this throttle completely. However, this requires all clusters to be on the latest version.

Can you explain the problem with the last part? Is this in relation to CCS? If so, it seems like a problem that will go away and thus we should not care about too deeply?

We need to add a new node-level action, which won't be available until both the remote cluster and local cluster are upgraded.

I wonder if just increasing to 5 would relieve the pain of this enough until we can batch by node? Notice that opening a PIT triggers refresh for search idle shards, in particular after #96321.

I initially thought about 5; I have seen a case where opening a PIT across clusters took more than an hour and then failed because of ILM. If we go with 5, that request would still take at least 12 minutes (roughly the one hour at a concurrency of 1, divided by 5), assuming shards are evenly distributed across target nodes. However, I am fine with proceeding with 5 if we aren't confident in a higher number.

We could make it configurable to help ease the transition though that comes with some future burden.

Yes, but I am not sure whether it should be a request parameter or a node setting. It will take some time for Kibana to adopt a new request parameter. Are we okay with a node setting that defaults to 5?

I am prototyping enabling minimized_round_trips when opening a PIT; with that, I think 5 should be good enough.

@dnhatn (Member, Author) commented Jun 14, 2023

@javanna @henningandersen Enabling minimized_round_trips is more complicated than I thought. Are we okay with a node setting that defaults to 5?

@javanna (Member) commented Jun 15, 2023

@dnhatn shall we do the same as we have in _search, meaning a request parameter (max_concurrent_shard_requests)? Would we still need a node setting in that case?

@dnhatn (Member, Author) commented Jun 19, 2023

Thinking about this more, I think it's fine to add only the request parameter. I will work with Kibana to add this parameter to the CSV download.
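
For illustration, a sketch of what the resulting request could look like once the parameter is exposed, assuming it keeps the same name and default as the search API's max_concurrent_shard_requests (index pattern and values are made up):

POST /my-index-*/_pit?keep_alive=5m&max_concurrent_shard_requests=5

Kibana could then raise the value for large cross-cluster exports without any node-level configuration.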

@dnhatn dnhatn requested a review from javanna June 19, 2023 21:18
@javanna (Member) left a comment

LGTM, thanks a lot for all the additional tests!

private String[] indices;
private IndicesOptions indicesOptions = DEFAULT_INDICES_OPTIONS;
private TimeValue keepAlive;

private int maxConcurrentShardRequests = 5;

Member

nit: shall we link to the default constant for the same parameter in search, given the two have the same default value for now?

Member Author

Yes, I pushed a252899.
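
For reference, a minimal sketch of the kind of change the nit suggests (the actual commit a252899 is not reproduced here): the open point-in-time request defaults to the shared constant added to SearchRequest in this PR instead of a hard-coded 5.

// In the open point-in-time request class: reuse the search default rather than a literal,
// so the two APIs cannot silently drift apart.
private int maxConcurrentShardRequests = SearchRequest.DEFAULT_MAX_CONCURRENT_SHARD_REQUESTS;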

@dnhatn (Member, Author) commented Jun 20, 2023

@javanna @henningandersen Thanks so much for the reviews and feedback.

@@ -87,6 +87,7 @@ public class SearchRequest extends ActionRequest implements IndicesRequest.Repla
private int batchedReduceSize = DEFAULT_BATCHED_REDUCE_SIZE;

private int maxConcurrentShardRequests = 0;
public static final int DEFAULT_MAX_CONCURRENT_SHARD_REQUESTS = 5;

Member

oh boy, I thought we had this constant already, thanks for adding it.

@dnhatn dnhatn added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Jun 20, 2023
@elasticsearchmachine elasticsearchmachine merged commit a8fbd24 into elastic:main Jun 20, 2023
@dnhatn dnhatn deleted the point-in-time branch June 20, 2023 15:03
dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Jun 20, 2023
Today, we mistakenly throttle the opening point-in-time API to 1 request
per node. As a result, when attempting to open a point-in-time across
large clusters, it can take a significant amount of time and eventually
fails due to relocated target shards or deleted target indices managed
by ILM. Ideally, we should batch the requests per node and eliminate
this throttle completely. However, this requires all clusters to be on
the latest version.

This PR increases the number of concurrent requests from 1 to 5, which
is the default of search.
elasticsearchmachine pushed a commit that referenced this pull request Jun 20, 2023
dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Jun 20, 2023
…lastic#96957)
elasticsearchmachine pushed a commit that referenced this pull request Jun 20, 2023
* Increase concurrent request of opening point-in-time (#96782) (#96957)

* Fix tests

* Fix tests
Labels
auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) >bug :Search/Search Search-related issues that do not fall into other categories Team:Search Meta label for search team v7.17.11 v8.8.2 v8.9.0