-
Notifications
You must be signed in to change notification settings - Fork 3.4k
Expose /v1/integrations/gateway/metrics #26548
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
8f50337 to
9d43271
Compare
| double systemCpuLoad = 0.0; | ||
| if (OPERATING_SYSTEM_MX_BEAN instanceof com.sun.management.OperatingSystemMXBean) { | ||
| systemCpuLoad = ((com.sun.management.OperatingSystemMXBean) OPERATING_SYSTEM_MX_BEAN).getCpuLoad(); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looking for suggestions here. This could lead to false metrics if OPERATING_SYSTEM_MX_BEAN instanceof com.sun.management.OperatingSystemMXBean is always false. The systemCpuLoad will always be 0.0, and trino-gateway will think there's no load on this cluster
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could log a warning
| } | |
| } else { | |
| // Log a warning, or throw if this metric is critical | |
| log.warn("Could not retrieve system CPU load: OperatingSystemMXBean is not an instance of com.sun.management.OperatingSystemMXBean"); | |
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could consider making this a JVM requirement we verify in TrinoSystemRequirements. @electrum what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternatively, we could make this Optional everywhere, and then the client of this endpoint would need to decide what to do when the statistic is not available.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's better to make it a jvm requirement if possible. It's also how it's used here:
| if (ManagementFactory.getOperatingSystemMXBean() instanceof OperatingSystemMXBean) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That seems fine to me. We already require com.sun.management.UnixOperatingSystemMXBean and Trino has other implicit requirements on OpenJDK.
b9a76ec to
11317f9
Compare
|
Why another endpoint if public |
|
11317f9 to
5e13008
Compare
|
@wendigo as far as I understand |
|
@andythsu how this PR addresses this need? |
5e13008 to
0aa7d25
Compare
|
@wendigo all nodes are available on the coordinator node. We can easily sum all node's free bytes up |
|
|
||
| @GET | ||
| @Path("metrics") | ||
| @ResourceSecurity(MANAGEMENT_READ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if we want to use MANAGEMENT_READ because this endpoint will be hit by a dummy user from gateway. Most likely it'll be a role account. We may not want this role account to have management privileges.
Since this endpoint is for integrations, maybe we can have something like user A needs to have <integration> privilege in order to call this endpoint. We can easily use this access type for future integrations as well. Not sure if any of the existing access types can achieve such thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my opinion the 'MANAGEMENT_READ' is totally fine :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We added management read for just this kind of thing. I allows the caller to read data that some may find sensitive, like the IPs of workers. Honestly I would have make this stuff public, but people get concerned.
|
|
||
| @GET | ||
| @Path("metrics") | ||
| @ResourceSecurity(MANAGEMENT_READ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my opinion the 'MANAGEMENT_READ' is totally fine :)
|
|
||
| import static io.trino.server.security.ResourceSecurity.AccessType.MANAGEMENT_READ; | ||
|
|
||
| @Path("/v1/integrations/trinoGateway") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe use kebab-case or snake_case for paths ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually we should use a path that is more "generic" as this endpoint could be used for other user cases as well. These kind of endpoints over time become a de facto part of the protocol so it's design and shape should be considered carefully.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
other endpoints in trino codebase use camel case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wendigo there was a discussion on slack about making a special endpoint for the gateway like we have for the UI. The implication being that changes to the endpoint would be decided by the gateway team, and could be backwards incompatible if they decided, which they likely will not as they need compatiblity, but it would be their choice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's make the endpont call just gateway
| @ResourceSecurity(MANAGEMENT_READ) | ||
| public ClusterMetrics clusterMetrics() | ||
| { | ||
| long totalFreeBytes = clusterMemoryManager.getAllNodesMemoryInfo() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're calling clusterMemoryManager.getAllNodesMemoryInfo() twice, which results in unnecessary overhead and potential inconsistency if the state changes between calls.
PLS cache the result in a variable
| .map(MemoryInfo::getPool) | ||
| .mapToLong(MemoryPoolInfo::getFreeBytes) | ||
| .sum(); | ||
| double totalSystemLoad = clusterMemoryManager.getAllNodesMemoryInfo() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| double totalSystemLoad = clusterMemoryManager.getAllNodesMemoryInfo() | |
| double aggregatedSystemLoad = clusterMemoryManager.getAllNodesMemoryInfo() |
| .values() | ||
| .stream() | ||
| .flatMap(Optional::stream) | ||
| .map(MemoryInfo::getPool) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can getPool() be null? if so: wrap with a null check to avoid potential NullPointerException
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, pool won't be null.
| this.pool = requireNonNull(pool, "pool is null"); |
| return new ClusterMetrics(totalFreeBytes, totalSystemLoad); | ||
| } | ||
|
|
||
| public record ClusterMetrics(long totalFreeBytes, double totalSystemLoad) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps add documentation to clarify what each field means
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure
| @GET | ||
| @Path("metrics") | ||
| @ResourceSecurity(MANAGEMENT_READ) | ||
| public ClusterMetrics clusterMetrics() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rename to getClusterMetrics (as it is actionable)
e3e042f to
c4121d7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This all seems fine to me, but @wendigo should signoff.
| double systemCpuLoad = 0.0; | ||
| if (OPERATING_SYSTEM_MX_BEAN instanceof com.sun.management.OperatingSystemMXBean) { | ||
| systemCpuLoad = ((com.sun.management.OperatingSystemMXBean) OPERATING_SYSTEM_MX_BEAN).getCpuLoad(); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could consider making this a JVM requirement we verify in TrinoSystemRequirements. @electrum what do you think?
| double systemCpuLoad = 0.0; | ||
| if (OPERATING_SYSTEM_MX_BEAN instanceof com.sun.management.OperatingSystemMXBean) { | ||
| systemCpuLoad = ((com.sun.management.OperatingSystemMXBean) OPERATING_SYSTEM_MX_BEAN).getCpuLoad(); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternatively, we could make this Optional everywhere, and then the client of this endpoint would need to decide what to do when the statistic is not available.
|
|
||
| import static io.trino.server.security.ResourceSecurity.AccessType.MANAGEMENT_READ; | ||
|
|
||
| @Path("/v1/integrations/trinoGateway") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wendigo there was a discussion on slack about making a special endpoint for the gateway like we have for the UI. The implication being that changes to the endpoint would be decided by the gateway team, and could be backwards incompatible if they decided, which they likely will not as they need compatiblity, but it would be their choice.
|
|
||
| @GET | ||
| @Path("metrics") | ||
| @ResourceSecurity(MANAGEMENT_READ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We added management read for just this kind of thing. I allows the caller to read data that some may find sensitive, like the IPs of workers. Honestly I would have make this stuff public, but people get concerned.
| public record ClusterMetrics(long totalFreeBytes, double aggregatedSystemLoad) | ||
| {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be on one line
c4121d7 to
b0cadb6
Compare
1e354ce to
4481544
Compare
|
@wendigo Could you take another look at this PR? |
4481544 to
20371e6
Compare
| boolean coordinator = buildConfigObject(ServerConfig.class).isCoordinator(); | ||
| if (coordinator) { | ||
| jaxrsBinder(binder).bind(AnnounceNodeResource.class); | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
undo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm.. not sure what changed here because it looks fine in my local
| newExporter(binder).export(ClusterMemoryManager.class).withGeneratedName(); | ||
|
|
||
| // metrics used by Trino Gateway | ||
| jaxrsBinder(binder).bind(TrinoGatewayResource.class); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
drop comment. rename class to GatewayResource.class
| @Inject | ||
| public TrinoGatewayResource(ClusterMemoryManager clusterMemoryManager) | ||
| { | ||
| this.clusterMemoryManager = clusterMemoryManager; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
requireNonNull(clusterMemoryManager, "clusterMemoryManager is null")
| * | ||
| * @param totalFreeBytes the sum of free memory from each node in the cluster | ||
| * @param aggregatedSystemLoad the sum of system load from each node in the cluster | ||
| */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
record definition doesn't return anything so it should be rather "represents"
core/trino-main/src/main/java/io/trino/memory/LocalMemoryManager.java
Outdated
Show resolved
Hide resolved
4530594 to
5853788
Compare
|
@wendigo this is ready for another review! |
5853788 to
aae18d3
Compare
|
@andythsu I applied some changes, take a look and if you are ok with it, I can merge this PR |
|
@wendigo is this the only change you made? if so, I'm happy with this change. TIL |
|
@wendigo thanks! huh weird I couldn't find that new commit. The diff LGTM! |
|
The PR title and description are stale - the actual endpoint is different. Please update it. |
Description
Expose cpu usage and memory from Trino cluster on an endpoint
Additional context and related issues
Resolves #26549
Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(X) Release notes are required, with the following suggested text: