rfc: OpenTelemetry observability for wallet-service#105
andreabadesso merged 10 commits into master
Proposes adding distributed tracing, metrics, and structured logging to the wallet-service monorepo using OpenTelemetry.

Co-Authored-By: Andre Cardoso <andre@hathor.network>
Resolve the backend choice: Grafana Tempo for traces (S3-backed), Prometheus for span-derived metrics via the OTel Collector's Span Metrics Connector. Integrates with our existing Grafana instance.

- Add OTel Collector config with Span Metrics Connector
- Add Tempo + Prometheus architecture diagram
- Update rollout plan with infra and dashboard phases
- Remove backend choice from unresolved questions
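To make the idea of span-derived metrics concrete, here is a minimal sketch of the kind of aggregation a span metrics stage performs: finished spans are grouped by service and span name, then reduced to request counts, error counts, and a cumulative duration histogram. This is an illustration only, not the Collector's implementation. The type names, the `aggregate` function, and the bucket bounds are all hypothetical.

```typescript
// Illustrative only: in the proposed setup this aggregation happens inside the OTel Collector.
interface FinishedSpan {
  service: string;
  name: string;
  durationMs: number;
  isError: boolean;
}

interface SpanMetrics {
  calls: number;
  errors: number;
  // Cumulative histogram: bucketCounts[i] counts spans with durationMs <= BOUNDS_MS[i].
  bucketCounts: number[];
  sumMs: number;
}

// Hypothetical bucket bounds, chosen only for this example.
const BOUNDS_MS = [10, 100, 1000];

function emptyMetrics(): SpanMetrics {
  return { calls: 0, errors: 0, bucketCounts: BOUNDS_MS.map(() => 0), sumMs: 0 };
}

function aggregate(spans: FinishedSpan[]): Map<string, SpanMetrics> {
  const out = new Map<string, SpanMetrics>();
  for (const span of spans) {
    // One metric series per (service, span name) pair.
    const key = `${span.service}|${span.name}`;
    const metrics = out.get(key) ?? emptyMetrics();
    out.set(key, metrics);

    metrics.calls += 1;
    if (span.isError) {
      metrics.errors += 1;
    }
    metrics.sumMs += span.durationMs;
    BOUNDS_MS.forEach((bound, i) => {
      if (span.durationMs <= bound) {
        metrics.bucketCounts[i] += 1;
      }
    });
  }
  return out;
}
```

Because the metrics are computed from spans, services need no separate metrics instrumentation for request rate, error rate, and latency. The trade-off is that those metrics are only as complete as the spans that reach the collector.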
- Fix architecture diagram to show separate ADOT embedded collector for Lambdas instead of sharing the K8s sidecar collector
- Note Tempo is VPC-internal only (no internet exposure needed)
- Switch from Grafana Alerting to Alertmanager rules (existing on-call setup)
- Recalibrate alert severities per on-call guide (Major/Medium/Low)
- Remove PagerDuty reference (not used), routing handled by Alertmanager
- Add trace retention note (compactor.block_retention, 14 days default)
- Add S3 bucket creation and IAM permissions step (Pod Workload Identity)
- Note to evaluate pre-built OTel dashboards before building from scratch

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
pedroferreira1 left a comment
This is approved from my side; I just have some questions:
Storage costs. Traces generate significant data volume. Without sampling, costs can grow quickly. Tail-based sampling mitigates this but adds collector complexity.
Do we have an estimate of how much this will increase storage costs?
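For context on the tail-based sampling mentioned in the quoted passage, the decision can be sketched as follows: the collector waits until a trace is complete, always keeps traces that contain an error or a slow span, and keeps only a fraction of the rest. This is a simplified illustration, not the Collector's tail sampling processor. `keepTrace`, `TailPolicy`, and the threshold values are hypothetical.

```typescript
interface SpanSummary {
  durationMs: number;
  isError: boolean;
}

interface TailPolicy {
  slowThresholdMs: number; // traces with any span at or above this are always kept
  baselineRate: number;    // fraction (0..1) of unremarkable traces to keep
}

// Decide whether to keep a trace once all of its spans have been collected.
// `random` is injectable so the decision can be tested deterministically.
function keepTrace(
  spans: SpanSummary[],
  policy: TailPolicy,
  random: () => number = Math.random,
): boolean {
  if (spans.some((s) => s.isError)) {
    return true; // always keep failing traces
  }
  if (spans.some((s) => s.durationMs >= policy.slowThresholdMs)) {
    return true; // always keep slow traces
  }
  return random() < policy.baselineRate; // sample the remainder
}
```

The added collector complexity comes from the buffering: the decision needs every span of a trace, so spans must be held in memory until the trace is considered complete. Stored volume then scales roughly with the baseline rate plus the share of error and slow traces.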
ADOT layer cold start impact in our specific Lambdas. The 200-800ms range is from AWS documentation. We need to benchmark with our actual deployment packages to get exact numbers.
We should implement this feature behind a simple feature flag (even if it's just an env var) so that it can be turned off easily if we feel it's adding a lot of overhead, and not only to the cold start.
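An env-var kill switch of this kind could look like the minimal sketch below. It uses `OTEL_SDK_DISABLED`, the variable the OpenTelemetry specification defines for disabling the SDK, but whether the ADOT Lambda layer honors it in our setup is an assumption that would need checking. `isTelemetryDisabled` and `initTelemetry` are hypothetical helper names, and the `start` callback stands in for the real SDK setup.

```typescript
type Env = Record<string, string | undefined>;

// Returns true when telemetry should be skipped for this process.
// Following the OpenTelemetry env var conventions, only a case-insensitive "true" disables it.
function isTelemetryDisabled(env: Env): boolean {
  return (env.OTEL_SDK_DISABLED ?? '').trim().toLowerCase() === 'true';
}

// Runs `start` (the actual SDK setup, supplied by the caller) unless the flag is set.
// Returns whether telemetry was started.
function initTelemetry(env: Env, start: () => void): boolean {
  if (isTelemetryDisabled(env)) {
    return false; // flag is set: skip all SDK setup
  }
  start();
  return true;
}
```

At service startup this would be called as `initTelemetry(process.env, startSdk)`, where `startSdk` is whatever function configures the exporters. Keeping the check in one place means a single environment change turns instrumentation off without a code deploy.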
It would be great to have, for each alert created in Alertmanager, an analysis of how to improve the query, API, or error being alerted on.
Rendered
Summary
Adds distributed tracing, metrics, and structured logging to the wallet-service monorepo using OpenTelemetry.
Key points:
Open questions