Send periodic query progress events#23195

Merged
pgupta2 merged 1 commit into prestodb:master from pgupta2:periodic_update_event
Aug 7, 2024

Conversation

@pgupta2
Contributor

@pgupta2 pgupta2 commented Jul 12, 2024

Description

Add support for sending periodic progress events for Queued/Running queries using EventListener interface.

Motivation and Context

This change enables coordinators to push the current state of queued/running queries as progress events, which can be logged to different sinks for analysis. A progress event is sent every minute for each queued/running query.

Impact

None

Test Plan

Added a unit test to check for queryProgress events.
Existing tests

Contributor checklist

  • Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with their default values), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.
== RELEASE NOTES ==
General Changes
* Add a new configuration property ``event.query-progress-publish-interval`` to specify the time interval at which progress events should be generated. Default is every 1 minute. :pr:`23195`

@pgupta2 pgupta2 marked this pull request as ready for review July 12, 2024 20:20
@pgupta2 pgupta2 requested a review from a team as a code owner July 12, 2024 20:20
Contributor

@prithvip prithvip left a comment

It might be better (single responsibility principle) if there was a new singleton class that calls DispatchManager::getQueries periodically and sends the heartbeat, and we can wire in any heartbeat-specific configuration there. "Query heartbeat" is overloaded in the DispatchManager context, since DispatchQuery has recordHeartbeat(), which is used for purging abandoned queries, and that's a totally different purpose from this "heartbeat". "sendQueryStateInfos" might be better terminology and would match the QueryStateInfoResource::getQueryStateInfos API.
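A minimal sketch of the suggested single-responsibility design (all names here are illustrative, not Presto's actual classes): a small dedicated class owns the timer and pulls the query list, so DispatchManager never learns about progress publication. String stands in for BasicQueryInfo to keep the sketch self-contained.

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;
import java.util.function.Supplier;

// Illustrative single-responsibility publisher, per the review suggestion.
// The supplier stands in for DispatchManager::getQueries and the sink for
// an EventListener call; neither name is from the actual PR.
public class QueryProgressPublisher
{
    private final Supplier<List<String>> queriesSupplier;
    private final Consumer<String> eventSink;
    private final ScheduledExecutorService executor =
            Executors.newSingleThreadScheduledExecutor();

    public QueryProgressPublisher(Supplier<List<String>> queriesSupplier, Consumer<String> eventSink)
    {
        this.queriesSupplier = queriesSupplier;
        this.eventSink = eventSink;
    }

    // Schedule one publication round per interval.
    public void start(long intervalMillis)
    {
        executor.scheduleWithFixedDelay(this::publishOnce, intervalMillis, intervalMillis, TimeUnit.MILLISECONDS);
    }

    // One publication round; extracted so it can be exercised directly in tests.
    void publishOnce()
    {
        for (String query : queriesSupplier.get()) {
            eventSink.accept(query);
        }
    }

    public void stop()
    {
        executor.shutdownNow();
    }
}
```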

this.securityConfig = requireNonNull(securityConfig, "securityConfig is null");

this.queryMonitor = requireNonNull(queryMonitor, "queryMonitor is null");
this.queryHeartbeatExecutor = Executors.newSingleThreadScheduledExecutor(threadsNamed("query-heartbeat-management-%s"));
Contributor

Cleaner to instantiate this in CoordinatorModule and bind it to an annotated ScheduledExecutorService here; see CoordinatorModule.java for how we do this in other cases.
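A hedged sketch of that wiring, assuming Guice (which CoordinatorModule uses); the module and annotation names are hypothetical, not Presto's actual code. The coordinator module provides the annotated executor once, and DispatchManager injects it instead of constructing its own.

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;

import com.google.inject.Binder;
import com.google.inject.BindingAnnotation;
import com.google.inject.Module;
import com.google.inject.Provides;
import com.google.inject.Singleton;

// Illustrative module: binds a dedicated, annotated ScheduledExecutorService
// so consumers can inject it rather than creating their own thread pools.
public class QueryHeartbeatModule
        implements Module
{
    // Custom binding annotation to distinguish this executor from others.
    @Retention(RetentionPolicy.RUNTIME)
    @BindingAnnotation
    public @interface ForQueryHeartbeat {}

    @Override
    public void configure(Binder binder) {}

    @Provides
    @Singleton
    @ForQueryHeartbeat
    public ScheduledExecutorService createQueryHeartbeatExecutor()
    {
        // Presto would name the thread via threadsNamed("query-heartbeat-management-%s").
        return Executors.newSingleThreadScheduledExecutor();
    }
}
```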

catch (Throwable e) {
log.error(e, "Error sending Heartbeat Event");
}
}, 1, 1, TimeUnit.MINUTES);
Contributor

Make this configurable

sendQueryHeartbeats();
}
catch (Throwable e) {
log.error(e, "Error sending Heartbeat Event");
Contributor

Issues in sending the heartbeat should be caught and handled by QueryMonitor, DispatchManager shouldn't worry about that or log anything.


private void sendQueryHeartbeats()
{
List<BasicQueryInfo> basicQueryInfoList = queryTracker.getAllQueries().stream().map(q -> q.getBasicQueryInfo()).collect(Collectors.toList());
Contributor

DispatchManager::getQueries already does this, let's use that.

@pgupta2 pgupta2 changed the title Send periodic query heartbeat events Send periodic query progress events Jul 16, 2024
@pgupta2 pgupta2 force-pushed the periodic_update_event branch 2 times, most recently from f0e0fd6 to 3b20785 Compare July 16, 2024 18:52
@tdcmeehan tdcmeehan self-assigned this Jul 16, 2024
@pgupta2 pgupta2 requested a review from prithvip July 16, 2024 20:01
prithvip
prithvip previously approved these changes Jul 18, 2024
@pgupta2
Contributor Author

pgupta2 commented Jul 18, 2024

@prithvip @tdcmeehan : Gentle reminder for the review! Thanks.

ofEpochMilli(queryInfo.getQueryStats().getCreateTime().getMillis())));
}
catch (Throwable e) {
log.error(e, "Error sending Query Progress Event");
Contributor

I don't think we need a try-catch here, because any issues should be handled by the specific EventListener implementation, which does not throw checked exceptions. The try-catch should happen there.

@jaystarshot
Member

Can you add how progress events can potentially be analyzed?
I feel that there is more value in adding whole cluster level events instead of individual query events

Contributor

@tdcmeehan tdcmeehan left a comment

Can you please add some unit tests?

@pgupta2
Contributor Author

pgupta2 commented Jul 20, 2024

Can you add how progress events can potentially be analyzed? I feel that there is more value in adding whole cluster level events instead of individual query events

Query progress events can be used to figure out the workload distribution by user or workload type, which enables more real-time quota enforcement. Another use case is getting insights into the queuing state of the cluster in real time and identifying the workloads/users contributing to it. Query progress events allow us to aggregate data across different dimensions as needed. Cluster-level events are very coarse and will not expose such detailed information.

@pgupta2
Contributor Author

pgupta2 commented Jul 20, 2024

Can you please add some unit tests?

@tdcmeehan : Modified existing test to check for progress events and updated the PR. Thanks.

@pgupta2 pgupta2 requested a review from tdcmeehan July 20, 2024 01:15
@pgupta2 pgupta2 force-pushed the periodic_update_event branch from 48b34c6 to d1d0747 Compare July 20, 2024 06:47
@pgupta2 pgupta2 requested a review from prithvip July 20, 2024 06:47
@pgupta2 pgupta2 force-pushed the periodic_update_event branch from d1d0747 to 9460ae2 Compare July 21, 2024 19:01
@PostConstruct
public void start()
{
queryProgressMonitorExecutor.scheduleWithFixedDelay(
Contributor

For a heavily contended cluster, I think this would be pretty wasteful, because most queries will publish identical information each time even though they're queued, and it won't change until they dequeue. What do you think about making it based on query state transitions? I.e., it publishes once per state transition, and when a query transitions into EXECUTING, we start a periodic timer?

Contributor Author

I feel that the decision to process or ignore events for QUEUED queries should be left to the consumer itself. It also depends on the use case for which these events are being utilized. For instance, within Meta, these events will be used for two different purposes:

  1. Resource utilization of in-progress queries, to perform quota enforcement and throttling. This consumer simply ignores events for QUEUED queries and only logs events for running queries.

  2. Getting the current queuing state of the cluster, to power certain alerts/dashboards. This requires getting events for queued queries at a periodic interval in order to generate a recent snapshot. Another problem I see is that if a query fails after queueing and before logging a query-completed event (in case of a coordinator crash or cluster restart), then we will not be able to figure out whether the query was in the QUEUED state because it was waiting for resources or whether it died for some other reason. So, having periodic samples will surface a more accurate snapshot of the cluster state.


import static java.util.Objects.requireNonNull;

public class QueryProgressEvent
Contributor

I think what makes this different from similar classes in this package is that it is designed to publish multiple events, instead of an event per state transition or at query start/finish. If you imagine that someone might put this into a queue that has at-least-once semantics, then it might be possible for messages to arrive out of order or duplicated. That's not a problem for peer classes in this package, because you know the state transition order, or you know that you only need to look at one query-completed event even if there's a duplicate, but for this class, it's impossible to infer correct ordering. This may lead to strange queries or incorrect analysis of the events. I'd recommend you include a monotonically increasing identifier that downstream systems can use to infer ordering and to de-duplicate at the reader side.
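A hedged sketch of the suggested scheme (class and method names are hypothetical, not the PR's code): the producer stamps each progress event with a per-query monotonically increasing id, and a reader keeps the highest id seen per query to drop duplicates and out-of-order deliveries.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative producer/consumer pair for ordering progress events.
public class ProgressEventSequencer
{
    private final Map<String, AtomicLong> nextIdByQuery = new ConcurrentHashMap<>();

    // Producer side: stamp the next event id for a query (0, 1, 2, ...).
    public long nextEventId(String queryId)
    {
        return nextIdByQuery.computeIfAbsent(queryId, key -> new AtomicLong()).getAndIncrement();
    }

    // Consumer side: a single-threaded reader that drops duplicated or
    // out-of-order events by tracking the highest id seen per query.
    public static final class Deduplicator
    {
        private final Map<String, Long> maxSeen = new HashMap<>();

        public boolean shouldProcess(String queryId, long eventId)
        {
            Long previous = maxSeen.get(queryId);
            if (previous != null && eventId <= previous) {
                return false; // duplicate or stale event
            }
            maxSeen.put(queryId, eventId);
            return true;
        }
    }
}
```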

Contributor Author

@tdcmeehan : Added an increasing event Id field in the QueryProgressEvent class.

@tdcmeehan
Contributor

Can you add how progress events can potentially be analyzed? I feel that there is more value in adding whole cluster level events instead of individual query events

Query progress events can be used to figure out the workload distribution by user or workload type, which enables more real-time quota enforcement. Another use case is getting insights into the queuing state of the cluster in real time and identifying the workloads/users contributing to it. Query progress events allow us to aggregate data across different dimensions as needed. Cluster-level events are very coarse and will not expose such detailed information.

Shouldn't higher level quota enforcement work hand-in-hand with local resource management in the resource group? If so, wouldn't it be better to publish these events at the granularity of resource group IDs? If not, how do you ensure that local resources are in line with global quota?

Likewise, wouldn't events at the resource group level give you the insights needed to identify the workloads and users contributing to the queueing state?

@pgupta2
Contributor Author

pgupta2 commented Jul 22, 2024

Shouldn't higher level quota enforcement work hand-in-hand with local resource management in the resource group? If so, wouldn't it be better to publish these events at the granularity of resource group IDs? If not, how do you ensure that local resources are in line with global quota?

My understanding might be wrong, but I feel that higher-level quota enforcement can be different from the local resource management provided by resource groups. AFAIK, RGs provide quota management on concurrency and memory share. So, if a resource group is configured with a 50% memory share and a concurrency of 100, it can run 100 queries in parallel, which can consume a total of 50% of cluster memory. Global quota management could happen at the team, org, or product level, wherein a team might have a quota to use X memory resources per day; once they have consumed these resources, their workload will get blocked/throttled. This enforcement cannot be achieved at the RG level IMO. Also, an org might have multiple teams using different resource groups for their workload, and we need to aggregate resource usage at the org level to perform such quota enforcement.

Likewise, wouldn't events at the resource group level give you the insights needed to identify the workloads and users contributing to the queueing state?

Yes, we should be able to collect this data at the RG level. However, even at the RG level, we will be collecting BasicQueryInfos for all running queries and publishing them at a periodic interval. In terms of information published, it will most likely still be the same, just grouped at the RG level.

@pgupta2 pgupta2 force-pushed the periodic_update_event branch from 9460ae2 to ec4ad6a Compare July 22, 2024 23:59
@pgupta2 pgupta2 requested a review from tdcmeehan July 23, 2024 16:13
@pgupta2
Contributor Author

pgupta2 commented Jul 25, 2024

@tdcmeehan : Gentle reminder for the review! Thanks.

@pgupta2
Contributor Author

pgupta2 commented Jul 29, 2024

@tdcmeehan : Another gentle reminder for the review!! :)

@steveburnett
Contributor

Suggest adding the PR number to the release note entry per the Release Notes Guidelines.

== RELEASE NOTES ==
General Changes
* Add a new configuration property ``event.query-progress-publish-interval`` to specify the time interval at which progress events should be generated. Default is every 1 minute. :pr:`23195`

@pgupta2
Contributor Author

pgupta2 commented Aug 5, 2024

@tdcmeehan : Ping.

prithvip
prithvip previously approved these changes Aug 5, 2024
private void publishQueryProgressEvent()
{
for (BasicQueryInfo basicQueryInfo : dispatchManager.getQueries()) {
if (basicQueryInfo.getState() != FINISHED && basicQueryInfo.getState() != FAILED) {
Contributor

Suggested change
if (basicQueryInfo.getState() != FINISHED && basicQueryInfo.getState() != FAILED) {
if (!basicQueryInfo.getState().isDone()) {

public class QueryMonitorConfig
{
private DataSize maxOutputStageJsonSize = new DataSize(16, Unit.MEGABYTE);
private Duration queryProgressPublishInterval = new Duration(1, MINUTES);
Contributor

Can this be defaulted to 0 (don't publish events)?

@PostConstruct
public void start()
{
queryProgressMonitorExecutor.scheduleWithFixedDelay(
Contributor

Can you check if the duration is 0, and if so, make this a no-op? And also make the executor loaded lazily?
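A sketch of the requested behavior under hypothetical names (not the PR's actual QueryProgressMonitor): a zero interval means publishing is disabled, and the executor is only created once start() decides it is needed.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustrative monitor with a zero-interval "disabled" mode and a lazily
// created executor.
public class LazyProgressMonitor
{
    private final long intervalMillis;
    private final Runnable publishTask;
    private ScheduledExecutorService executor; // created lazily in start()

    public LazyProgressMonitor(long intervalMillis, Runnable publishTask)
    {
        this.intervalMillis = intervalMillis;
        this.publishTask = publishTask;
    }

    // Returns false when publishing is disabled; no executor is ever created then.
    public synchronized boolean start()
    {
        if (intervalMillis <= 0) {
            return false;
        }
        if (executor == null) {
            executor = Executors.newSingleThreadScheduledExecutor();
            executor.scheduleWithFixedDelay(publishTask, intervalMillis, intervalMillis, TimeUnit.MILLISECONDS);
        }
        return true;
    }

    public synchronized void stop()
    {
        if (executor != null) {
            executor.shutdownNow();
        }
    }
}
```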

@tdcmeehan
Contributor

Yes, we should be able to collect this data at the RG level. However, even at the RG level, we will be collecting BasicQueryInfos for all running queries and publishing them at a periodic interval. In terms of information published, it will most likely still be the same, just grouped at the RG level.

From your message earlier, I read two use cases: more real-time quota enforcement, and identifying sources of queueing. For both of these, I'm wondering if you get sufficient information from resource group nodes. Instead of publishing BasicQueryInfos, I'm wondering if you can publish ResourceGroupInfos, perhaps with an additional set of QueryIds if you want to drill down into particular queries. Overall, I'm just wondering how practical this change is for most people, since it seems like a ton of information, and presumably this sort of information is spooled somewhere, which would require potentially costly online storage to provide that spooling. And if the main thing you need is to aggregate it to the resource group level, why not just publish at the resource group level?

@pgupta2
Contributor Author

pgupta2 commented Aug 6, 2024

Instead of publishing BasicQueryInfos, I'm wondering if you can publish ResourceGroupInfos, perhaps with additional set of QueryIds if you want to drill down into particular queries.

Just to confirm my understanding: are you suggesting that we publish ResourceGroupInfos, each containing information about all running queries in the given resource group, as a single event instead of publishing information about each query as a separate event?

ResourceGroupInfo only contains information about running queries, so we might have to extend it to expose queued-query information as well. It can be done, but first I want to make sure the understanding above is correct.

perhaps with additional set of QueryIds if you want to drill down into particular queries.

Drilling down is not possible for the real-time quota enforcement use case, given that the enforcement system will consume these raw events to make quota decisions in real time. There is no offline processing that will happen later to get more information about these queries using queryIds. So, we need to expose detailed information in the event itself.

And if the main things you need are all for the purposes of aggregating it to the resource group level, why not just publish at the resource group level?

I think I mentioned above that aggregation will happen at the level of teams, products, and orgs, which can map to several RGs in a cluster. So, we need to log detailed information at the query level and perform aggregation based on these dimensions.

I feel that if we really want to publish events at the RG level, we can still do that and push the logic for publishing more detailed information (at the query level, perhaps) to the event consumer itself. Given that ResourceGroupInfo contains information about queued/running queries, they can be iterated over and published as needed.

@pgupta2 pgupta2 force-pushed the periodic_update_event branch 2 times, most recently from 8acbcd1 to 4eef95e Compare August 6, 2024 16:02
@pgupta2 pgupta2 requested a review from tdcmeehan August 6, 2024 16:03
Contributor

@tdcmeehan tdcmeehan left a comment

I think I understand better now the intended use case, and I'm fine with the direction of these changes. Thanks @pgupta2. Just one final nit.

Comment on lines +40 to +56
private ScheduledExecutorService queryProgressMonitorExecutor;

@Inject
public QueryProgressMonitor(
QueryMonitor queryMonitor,
DispatchManager dispatchManager,
QueryMonitorConfig queryMonitorConfig,
ResourceGroupManager resourceGroupManager)
{
resourceGroupManager.getRootResourceGroups();
this.queryMonitor = requireNonNull(queryMonitor, "queryMonitor is null");
this.dispatchManager = requireNonNull(dispatchManager, "dispatchManager is null");
this.queryProgressPublishInterval = requireNonNull(queryMonitorConfig, "queryMonitorConfig is null").getQueryProgressPublishInterval();
}

@PostConstruct
public void start()
Contributor

Suggested change
private ScheduledExecutorService queryProgressMonitorExecutor;
@Inject
public QueryProgressMonitor(
QueryMonitor queryMonitor,
DispatchManager dispatchManager,
QueryMonitorConfig queryMonitorConfig,
ResourceGroupManager resourceGroupManager)
{
resourceGroupManager.getRootResourceGroups();
this.queryMonitor = requireNonNull(queryMonitor, "queryMonitor is null");
this.dispatchManager = requireNonNull(dispatchManager, "dispatchManager is null");
this.queryProgressPublishInterval = requireNonNull(queryMonitorConfig, "queryMonitorConfig is null").getQueryProgressPublishInterval();
}
@PostConstruct
public void start()
@GuardedBy("this")
private ScheduledExecutorService queryProgressMonitorExecutor;
@Inject
public QueryProgressMonitor(
QueryMonitor queryMonitor,
DispatchManager dispatchManager,
QueryMonitorConfig queryMonitorConfig,
ResourceGroupManager resourceGroupManager)
{
resourceGroupManager.getRootResourceGroups();
this.queryMonitor = requireNonNull(queryMonitor, "queryMonitor is null");
this.dispatchManager = requireNonNull(dispatchManager, "dispatchManager is null");
this.queryProgressPublishInterval = requireNonNull(queryMonitorConfig, "queryMonitorConfig is null").getQueryProgressPublishInterval();
}
@PostConstruct
public synchronized void start()

}

@Config("event.query-progress-publish-interval")
@ConfigDescription("How frequently to publish query progress events")
Contributor

Suggested change
@ConfigDescription("How frequently to publish query progress events")
@ConfigDescription("How frequently to publish query progress events. 0ms disables the publication of these events.")

This change enables coordinators to push the current
state of queued/running queries as progress events,
which can be logged to different sinks for analysis.
A progress event is sent every minute for each
queued/running query.
@pgupta2 pgupta2 force-pushed the periodic_update_event branch from 4eef95e to 1285c88 Compare August 6, 2024 23:31
@pgupta2 pgupta2 requested a review from tdcmeehan August 7, 2024 01:52
@steveburnett
Contributor

Please add documentation for this new configuration property to https://github.com/prestodb/presto/blob/master/presto-docs/src/main/sphinx/admin/properties.rst .
