Conversation
✅ Docs preview readyThe preview is ready to be viewed. View the preview File Changes 2 new, 4 changed, 0 removedBuild ID: b00c2b4a87dac6ca3f7372a3 URL: https://www.apollographql.com/docs/deploy-preview/b00c2b4a87dac6ca3f7372a3 |
This comment has been minimized.
This comment has been minimized.
e2d282d to
606b2e2
Compare
606b2e2 to
0f3090f
Compare
| @@ -832,26 +840,15 @@ where | |||
| operation_name: Option<String>, | |||
| ) -> broadcast::Receiver<()> { | |||
| let (closing_signal_tx, closing_signal_rx) = broadcast::channel(1); | |||
There was a problem hiding this comment.
@goto-bus-stop I've not changed the logic here but I am uncertain as to why creating a topic should overwrite the subscription. Should it not keep the subscription if it already exists?
|
Moving back to draft as there is some sort of issue with the pipelines metric. |
e9fd533 to
c528de6
Compare
… original instrument This refactor makes updown counters finally useful as in combination with the temporality fix to make them always aggregate and the fix to only reload metrics if needed they will not accurately increment and decrement reliably.
Moved tests for otlp into directory
f1ddee4 to
168c1c7
Compare
168c1c7 to
11a3cd9
Compare
11a3cd9 to
d61025d
Compare
|
OK, tests passing, I think we're good to review. |
3132d40 to
e1430e1
Compare
913afed to
e1430e1
Compare
goto-bus-stop
left a comment
There was a problem hiding this comment.
Looks great overall, some questions/nits
| let instrument = create_instrument_fn(meter); | ||
| let attrs : &[opentelemetry::KeyValue] = &$attrs; | ||
| instrument.$mutation($value, attrs); | ||
| $guard::new(std::sync::Arc::new(instrument.clone()), $value, attrs) |
There was a problem hiding this comment.
Can this Arc::new() ever happen with the NoopGuard? That would suck a bit (it wouldnt make or break performance I imagine but still an unnecessary heap allocation).
There was a problem hiding this comment.
c1b06cb I've added an explanation for this. The code is actually not generally run for #cfg[not(test)], it's there to prevent our unit tests from interfering with each other. We do need a test for callsite caching though, so the other arm is run for a specific test hence why the entire block is not under cfg test.
It no longer adds value over just having a guard during the job
| /// with labels that are unique (like launch_id) can accumulate as zeroed | ||
| /// out series that will never be incremented again. | ||
| /// | ||
| /// The true solution for this is to drop Prometheus support as this has been dropped in upstream OTEL. |
There was a problem hiding this comment.
Users would probably not be happy about that, maybe we could use a third party prometheus exporter (eg https://lib.rs/crates/opentelemetry-prometheus-text-exporter) when the time comes, but 🤷🏻♀️ just because it isn't a first party piece of the SDK doesn't mean it can't be done ;)
There was a problem hiding this comment.
Life finds a way....
| fn setup_public_metrics(&mut self) -> Result<(), BoxError> { | ||
| if self.is_metrics_config_changed::<metrics::prometheus::Config>() | ||
| || self.is_metrics_config_changed::<otlp::Config>() | ||
| || self.prometheus_random_change() |
There was a problem hiding this comment.
I guess I'd like it on the record that I don't love this solution but I don't have a better idea, so let's just do ittt
There was a problem hiding this comment.
I also do not like this. Theres no other way though.....
There was a problem hiding this comment.
I changed this to at least be deterministic. It reloads every 10 times rather than randomly.
Previously this was random, but this is bad for deterministic testing.
This PR changes the nature of UpDownCounters within the router and update a few metrics.
UpDownCounters are mostly good for tracking resources. They transmit an aggregate value rather than a delta, so each updown instrument sends the latest value.
Previously we were manually incrementing and decrementing updown counters, however there is nuance to this:
As we don't actually use UpDown counters for anything else (and there seems to be little reason to) the macro has been modified to return a RAII guard that will automatically decrement by the value that was incremented when it is dropped. It also stores the original instrument that was used to increment so that it is guaranteed to decrement on the same instrument that it incremented on.
Notes
Issues
It has been noticed that the Promethues registry will NEVER drop a series. This means that over time cardinality will build up and consume memory. I've put a hacky workaround e1430e1. This will cause prom to be reinitialized 1 in 10 reloads. This may mean a minor number of metrics lost, but it's better than just accumulating forever.
Promethues is deprecated in the latest version of otel rust, and this isn't a problem for otlp. In the next major version of the router we should consider removing Prometheus support.
Checklist
Complete the checklist (and note appropriate exceptions) before the PR is marked ready-for-review.
Exceptions
Note any exceptions here
Notes
Footnotes
It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this. ↩
Configuration is an important part of many changes. Where applicable please try to document configuration examples. ↩
A lot of (if not most) features benefit from built-in observability and
debug-level logs. Please read this guidance on metrics best-practices. ↩Tick whichever testing boxes are applicable. If you are adding Manual Tests, please document the manual testing (extensively) in the Exceptions. ↩