Skip to content

Conversation

@amybubu
Copy link
Contributor

@amybubu amybubu commented Apr 15, 2025

Description

As discussed in Issue #649 we would like to add a built-in activation status metric to Trino Gateway, to improve telemetry. This metric has a value of 1 if activated and a value of 0 if deactivated. The metric is populated with values polled directly from the database. Each cluster has it's own BackendClusterMetricsStats that can keep track of multiple metrics as we add in the future. When a backend cluster is added/deleted, an associated metric is registered/unregistered.

To keep this information up to date, the backends table in the DB is polled every 30 seconds to check if any changes have been made. We can't rely on triggering a metrics registry refresh within the add/delete backend methods because only one instance gets this API call. So if there are multiple instances, they won't all know to update their respective list of metrics. NOTE: Because these are periodically updated every 30 seconds, there may be a slight delay in new metrics showing up and old metrics disappearing. For deleted backends, their activation status metric value will be -1 until it is unregistered next refresh period.

Testing

  • built and ran locally
  • saw new metrics in /v1/jmx (NOTE: added testMetric to show proof of concept of adding other types of metrics, but removed from code before publishing PR)
Screenshot 2025-04-07 at 5 38 33 PM
  • deactivated and activated cluster via UI and curl calls, and saw changes reflected in metrics after refreshing v1/jmx
  • added, updated, and deleted backends, and saw associated metrics added, updated and deleted respectively (with expected slight delay due to refresh period)

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( x ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text:

* Fix some things. ({issue}`issuenumber`)

@cla-bot cla-bot bot added the cla-signed label Apr 15, 2025
Copy link
Member

@xkrogen xkrogen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a great framework for us to build on top of for collecting various backend-specific metrics. Thanks @amybubu !

@andythsu
Copy link
Member

LGTM!

@andythsu
Copy link
Member

pinging @vishalya

private final MonitorConfiguration monitorConfiguration;
// MBeanExporter uses weak references, so statsMap is needed to maintain strong references to metric objects to prevent garbage collection
private final Map<String, ClusterMetricsStats> statsMap = new HashMap<>();
private final ScheduledExecutorService scheduledExecutor = Executors.newSingleThreadScheduledExecutor();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 - Single thread should be good.

@vishalya
Copy link
Member

vishalya commented Apr 30, 2025

LGTM me as well.
This exporter has it's own executor/thread which exports the metrics, would like to get an opinion from @oneonestar, if that's ok, We should try to avoid multiple of these threads.

@xkrogen
Copy link
Member

xkrogen commented Apr 30, 2025

LGTM me as well. This exporter has it's own executor/thread which exports the metrics, would like to get an onion from @oneonestar, if that's ok, We should try to avoid multiple of these threads.

One thing to point out here -- the executor service isn't actually used to export the metrics. It's used to refresh the list of backends which are registered as having metrics (MBeanExporter#exportWithGeneratedName()/MBeanExporter#unexportWithGeneratedName()). Then the normal MBeanExporter thread handles the actual exporting of metrics.

For a little background on why this is needed, we started with just updating the map on any changes to the backends (add/delete/etc), but of course with a multi-Gateway setup this falls out of sync when changes are made on other instances. It's essentially the same problem we've discussed in other contexts; I think I can remember specifically that we have this split-brain with the health check logic.

Long-term, probably the right place to be in is that we have one central "state sync" which propagates any DB state changes back into the Gateway's local in-memory state, and any kind of triggers involved with state changes (including metrics registration) could happen there.

Copy link
Member

@vishalya vishalya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we are aware of the limitation due to the lack of the central state.
Approved.

Copy link
Member

@ebyhr ebyhr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a test?

@amybubu amybubu requested a review from ebyhr June 3, 2025 16:48
@vishalya
Copy link
Member

vishalya commented Jun 9, 2025

@ebyhr - does this look good to you? I am fine merging it.

.thenReturn(List.of(cluster1, cluster2)); // Then return with 2 clusters to simulate addition

statsExporter.start();
Thread.sleep(2000);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then we can use the exposed method instead for testing.

Suggested change
Thread.sleep(2000);
statsExporter.updateClustersMetricRegistry();

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the updating of the clusters metrics registry is done on a periodic basis, I think it makes sense to just use sleep rather than explicitly call the update method to test the refresh logic. The statsExporter we create for this test has a set refresh period of 1 second, so sleeping 2 seconds would simulate going through at least one refresh cycle.

@Chaho12
Copy link
Member

Chaho12 commented Jun 9, 2025

ebyhr - does this look good to you? I am fine merging it.

Don't forget to squash commits 😁

Copy link
Member

@ebyhr ebyhr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please rebase (not merge) on master and squash commits into one.

@mosabua
Copy link
Member

mosabua commented Jun 18, 2025

@ebyhr @vishalya @Chaho12 @amybubu - can you please collaborate to figure out next step here

@amybubu amybubu force-pushed the amybubu/add-activation-status-metrics branch from 8332e4b to dc1a18f Compare June 18, 2025 17:28
@ebyhr ebyhr merged commit 420f6e0 into trinodb:main Jun 19, 2025
3 checks passed
@github-actions github-actions bot added this to the 16 milestone Jun 19, 2025
@mosabua
Copy link
Member

mosabua commented Jun 19, 2025

Thank you @ebyhr @amybubu @Chaho12 and @vishalya

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

7 participants