
Increase concurrent request of opening point-in-time #96782

Merged (10 commits) on Jun 20, 2023

Conversation

@dnhatn (Member) commented Jun 12, 2023

Today, we mistakenly throttle the opening point-in-time API to 1 request per node. As a result, when attempting to open a point-in-time across large clusters, it can take a significant amount of time and may eventually fail due to target shards being relocated or target indices being deleted by ILM. Ideally, we should batch the requests per node and eliminate this throttle completely. However, this requires all clusters to be on the latest version.

This PR increases the number of concurrent requests from 1 to 20. This default is higher than the search default of 5 because opening a point-in-time is a lightweight operation: it doesn't perform any I/O and is executed directly on the network threads.

Edit: following the discussion below, this PR now increases the number of concurrent requests from 1 to 5, which is the default for search, and exposes the limit as a request parameter.

Any suggestions are welcome.
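
For context, this is the API being discussed: opening a point-in-time fans out one shard-level request to every shard of the target indices, and the limit in question caps how many of those requests can be outstanding per node. A minimal example of the call (index pattern and keep-alive are illustrative):

POST /my-index-*/_pit?keep_alive=5m

The response contains a PIT id that later _search requests reference, so a slow or failing open call holds up the search workflow behind it.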

@dnhatn dnhatn added :Search/Search Search-related issues that do not fall into other categories >bug v8.8.1 v8.8.2 v7.17.11 and removed v8.8.1 labels Jun 12, 2023
@elasticsearchmachine (Collaborator)

Hi @dnhatn, I've created a changelog YAML for you.

@dnhatn dnhatn requested a review from javanna June 12, 2023 21:18
@dnhatn dnhatn marked this pull request as ready for review June 12, 2023 21:18
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-search (Team:Search)

@elasticsearchmachine elasticsearchmachine added the Team:Search Meta label for search team label Jun 12, 2023
@javanna (Member) left a comment

I follow your reasoning @dnhatn, my only question is how we can verify that 20 is e.g. better than 5 and better than 100. Is it a matter of benchmarking pit? Or did you do some estimation to come up with this number? cc @henningandersen as we recently discussed concurrency aspects in the search distributed execution.

@dnhatn (Member, Author) commented Jun 13, 2023

@javanna Thank you for looking.

my only question is how we can verify that 20 is e.g. better than 5 and better than 100

Opening a point-in-time is lightweight both in terms of the shard requests themselves and their execution. Therefore, its concurrency level can be higher than that of the search request, which defaults to 5. Previously, we didn't throttle the can_match requests, which are heavier than opening point-in-time requests. However, I think it would be safer to limit the number of outstanding requests per node to avoid exhausting resources. I estimate 20 to be a suitable value, but I believe we could increase it further.

Is it a matter of benchmarking pit?

I'm not sure if benchmarking would provide helpful insights.

@henningandersen (Contributor)

I wonder if just increasing to 5 would relieve the pain of this enough until we can batch by node? Notice that opening a PIT triggers refresh for search idle shards, in particular after #96321.

Ideally, we should batch the requests per node and eliminate this throttle completely. However, this requires all clusters to be on the latest version.

Can you explain the problem with the last part? Is this in relation to CCS? If so, it seems like a problem that will go away and thus we should not care about too deeply?

We could make it configurable to help ease the transition though that comes with some future burden.

@dnhatn (Member, Author) commented Jun 14, 2023

Thanks @henningandersen.

Ideally, we should batch the requests per node and eliminate this throttle completely. However, this requires all clusters to be on the latest version.

Can you explain the problem with the last part? Is this in relation to CCS? If so, it seems like a problem that will go away and thus we should not care about too deeply?

We need to add a new node-level action, which won't be available until both the remote cluster and local cluster are upgraded.

I wonder if just increasing to 5 would relieve the pain of this enough until we can batch by node? Notice that opening a PIT triggers refresh for search idle shards, in particular after #96321.

I initially thought about 5; I have seen a case where opening a PIT across clusters took more than an hour and then failed because of ILM. If we go with 5, that request would still take at least 12 minutes (roughly the one hour at a concurrency of 1, divided by 5), assuming shards are evenly distributed across target nodes. However, I am fine with proceeding with 5 if we aren't confident in a higher number.

We could make it configurable to help ease the transition though that comes with some future burden.

Yes, but I am not sure whether it should be a request parameter or a node setting. It will take some time for Kibana to adopt a new request parameter. Are we okay with a node setting that defaults to 5?

I am prototyping enabling minimized_round_trips when opening a PIT; with that, I think 5 should be good enough.

@dnhatn (Member, Author) commented Jun 14, 2023

@javanna @henningandersen Enabling minimized_round_trips is more complicated than I thought. Are we okay with a node setting that defaults to 5?

@javanna (Member) commented Jun 15, 2023

@dnhatn shall we do the same as we have in _search, meaning a request parameter (max_concurrent_shard_requests)? Would we still need a node setting in that case?

@dnhatn (Member, Author) commented Jun 19, 2023

Thinking about this more, I think it's fine to add only the request parameter. I will work with Kibana to add this parameter to the CSV download.
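
For illustration, a sketch of what the resulting request could look like once the parameter is exposed, assuming it keeps the same name and default as the search API's max_concurrent_shard_requests (index pattern and values are made up):

POST /my-index-*/_pit?keep_alive=5m&max_concurrent_shard_requests=5

Kibana could then raise the value for large cross-cluster exports without any node-level configuration.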

@dnhatn dnhatn requested a review from javanna June 19, 2023 21:18
@javanna (Member) left a comment

LGTM, thanks a lot for all the additional tests!

private String[] indices;
private IndicesOptions indicesOptions = DEFAULT_INDICES_OPTIONS;
private TimeValue keepAlive;

private int maxConcurrentShardRequests = 5;

Member

nit: shall we link to the default constant for the same parameter in search, given the two have the same default value for now?

Member Author

Yes, I pushed a252899.
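
For reference, a minimal sketch of the kind of change the nit suggests (the actual commit a252899 is not reproduced here): the open point-in-time request defaults to the shared constant added to SearchRequest in this PR instead of a hard-coded 5.

// In the open point-in-time request class: reuse the search default rather than a literal,
// so the two APIs cannot silently drift apart.
private int maxConcurrentShardRequests = SearchRequest.DEFAULT_MAX_CONCURRENT_SHARD_REQUESTS;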

@dnhatn (Member, Author) commented Jun 20, 2023

@javanna @henningandersen Thanks so much for the reviews and feedback.

@@ -87,6 +87,7 @@ public class SearchRequest extends ActionRequest implements IndicesRequest.Repla
private int batchedReduceSize = DEFAULT_BATCHED_REDUCE_SIZE;

private int maxConcurrentShardRequests = 0;
public static final int DEFAULT_MAX_CONCURRENT_SHARD_REQUESTS = 5;

Member

oh boy, I thought we had this constant already, thanks for adding it.

@dnhatn dnhatn added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Jun 20, 2023
@elasticsearchmachine elasticsearchmachine merged commit a8fbd24 into elastic:main Jun 20, 2023
@dnhatn dnhatn deleted the point-in-time branch June 20, 2023 15:03
dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Jun 20, 2023
Today, we mistakenly throttle the opening point-in-time API to 1 request
per node. As a result, when attempting to open a point-in-time across
large clusters, it can take a significant amount of time and eventually
fails due to relocated target shards or deleted target indices managed
by ILM. Ideally, we should batch the requests per node and eliminate
this throttle completely. However, this requires all clusters to be on
the latest version.

This PR increases the number of concurrent requests from 1 to 5, which
is the default of search.
elasticsearchmachine pushed a commit that referenced this pull request Jun 20, 2023
dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Jun 20, 2023
…lastic#96957)
elasticsearchmachine pushed a commit that referenced this pull request Jun 20, 2023
* Increase concurrent request of opening point-in-time (#96782) (#96957)

* Fix tests

* Fix tests
Labels
auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) >bug :Search/Search Search-related issues that do not fall into other categories Team:Search Meta label for search team v7.17.11 v8.8.2 v8.9.0