Conversation
✅ Docs preview has no changesThe preview was not built because there were no changes. Build ID: 7643b6341b64a907fc66240e |
This comment has been minimized.
This comment has been minimized.
bonnici
left a comment
There was a problem hiding this comment.
The parts that I understand are looking good to me.
| StaticInstrument::Histogram( | ||
| meter | ||
| .f64_histogram(APOLLO_ROUTER_OPERATIONS_FETCH_DURATION) | ||
| .with_unit("s") |
There was a problem hiding this comment.
I guess seconds is the standard unit to use? Milliseconds or nanoseconds makes more sense to me but again I'm new to this code.
There was a problem hiding this comment.
We actually don't have control over this for CustomHistogram.
Some(instant.elapsed().as_secs_f64())In fact, all histogram refs I could find in the router were all using f64 to measure seconds.
There was a problem hiding this comment.
I suppose if it was possible for this to be milliseconds maybe we would be able to use a 32 bit value and save some space over the wire. I'd need to also handle this on the ingestion side since we don't respect the unit yet and just assume that everything is in seconds. https://github.com/open-telemetry/opentelemetry-proto/blob/main/opentelemetry/proto/metrics/v1/metrics.proto#L198 - not sure that this would work since the protobuf specifies a double type for the bounds.
There was a problem hiding this comment.
As far as I understand it is indeed the standard unit to use for all time measurements, but anyone displaying it would convert the unit to a most-appropriate scale at that point
There was a problem hiding this comment.
@timbotnik as I mentioned, CustomHistogram controls the unit recorded to the underlying histogram. We could likely add a way to specify a unit in its constructor, but if that's the route we want I'd prefer to add it as a follow-up to this PR.
There was a problem hiding this comment.
IMO there's not a great reason to change this now (an argument about precision would be the only reason, but the numbers we're dealing with are precise enough). Let's make a tech debt ticket which would include updating the ingestion side to fail gracefully on different units (or support them!)
There was a problem hiding this comment.
apollo-router/src/plugins/telemetry/config_new/apollo/instruments.rs
Outdated
Show resolved
Hide resolved
| attributes: SubgraphAttributes::builder() | ||
| .subgraph_name(StandardAttribute::Bool(true)) | ||
| .graphql_operation_type(StandardAttribute::Aliased { | ||
| alias: "operation.kind".to_string(), |
There was a problem hiding this comment.
We're looking for graphql.operation.type on the ingestion side so this probably should come from the supergraph operation type and shouldn't require an alias. Then again, I wonder if there are any use-cases where the supergraph operation type and the subgraph operation type can be different? Actually in the case of federation or subscriptions perhaps it can be different... in which case we might actually want to track both the supergraph type and subgraph type. Let's ask someone from the Router team to confirm that it's possible.
There was a problem hiding this comment.
The implementation here actually notes that it should always be the same:
That said, I may as well pull it from the context to make it "future proof" in case that changes in the future.
There was a problem hiding this comment.
Asked about whether they can differ here: https://apollograph.slack.com/archives/C02UX05LF4K/p1753980692829289?thread_ts=1753971835.882889&cid=C02UX05LF4K
…vel-subgraph-fetch-historgram
apollo-router/tests/integration/telemetry/apollo_otel_metrics.rs
Outdated
Show resolved
Hide resolved
apollo-router/src/plugins/telemetry/config_new/apollo/instruments.rs
Outdated
Show resolved
Hide resolved
apollo-router/src/plugins/telemetry/config_new/apollo/instruments.rs
Outdated
Show resolved
Hide resolved
…vel-subgraph-fetch-historgram
…ph-fetch-historgram' into rreg/PULSR-1673/top-level-subgraph-fetch-historgram
Description
This change adds a new, experimental histogram to capture subgraph fetch duration for Apollo Studio. The instrument,
apollo.router.operations.fetch.durationhas the following attributes:apollo.client.nameapollo.client.versionhas_errorsapollo.operation.idgraphql.operation.kindgraphql.operation.namesubgraph.nameThis can be toggled on using a new boolean config flag:
The instrument is currently only sent to GraphOS and is not available in 3rd-party OTel export targets. It is not
user customizable. For this purpose, users can take advantage of the existing customizable instrument
http.client.request.durationmeasuring the same value.Implementation
This is implemented using the custom instrument framework. These are Apollo-controlled metrics, so rather than allowing for user customization like existing use cases, we hardcode the
CustomHistogramwith specific attributes. This allows us to add more easily add apollo metrics in the future and means we can use the shared standard ways of pulling data off of the request/response/context.Cardinality is high, so the new metric follows the
realtime_metricspath which allows for thescheduled_delayconfig to change how often the metric is sent.Checklist
Complete the checklist (and note appropriate exceptions) before the PR is marked ready-for-review.
Exceptions
Note any exceptions here
Notes
Footnotes
It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this. ↩
Configuration is an important part of many changes. Where applicable please try to document configuration examples. ↩
A lot of (if not most) features benefit from built-in observability and
debug-level logs. Please read this guidance on metrics best-practices. ↩Tick whichever testing boxes are applicable. If you are adding Manual Tests, please document the manual testing (extensively) in the Exceptions. ↩