rfc: OpenTelemetry observability for wallet-service#105

Merged
andreabadesso merged 10 commits into master from docs/wallet-service-opentelemetry-observability
Mar 24, 2026

@andreabadesso andreabadesso commented Mar 5, 2026

Rendered

Summary

Adds distributed tracing, metrics, and structured logging to the wallet-service monorepo using OpenTelemetry.

Key points:

  • Auto-instrumentation for mysql2, ioredis, aws-sdk, http — zero code changes for 80% of value
  • Daemon: OTel SDK init + BatchSpanProcessor + K8s OTel Collector sidecar
  • Lambdas: ADOT Lambda layer (managed by AWS)
  • Backend: Grafana Tempo (traces, S3-backed) + Prometheus (span-derived metrics) — plugs into existing Grafana
  • 6-phase rollout, each independently reversible

Open questions

  • ADOT cold start impact on our specific Lambdas (200-800ms range per AWS docs)
  • Production sampling rates
  • Trace context propagation through SQS between daemon and Lambdas
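
For the SQS propagation question, one option is to carry the W3C `traceparent` value as an SQS message attribute. The block below is a minimal sketch of that idea, not the RFC's chosen design. The function and type names are hypothetical. The parser takes the raw string rather than an attribute object, on the assumption that field casing differs between the send call and the Lambda event record.

```typescript
// Sketch only: hand-rolled W3C `traceparent` propagation over SQS message attributes.
// All names below are hypothetical and do not come from the RFC.

interface SpanContext {
  traceId: string; // 32 lowercase hex characters
  spanId: string; // 16 lowercase hex characters
  sampled: boolean;
}

// Shape of a string-typed SQS message attribute on the sending side.
interface SqsStringAttribute {
  DataType: 'String';
  StringValue: string;
}

type SqsMessageAttributes = Record<string, SqsStringAttribute>;

// Only version "00" of the header is accepted in this sketch.
const TRACEPARENT_RE = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Producer side (daemon): add a `traceparent` attribute without mutating the input.
function injectTraceContext(
  ctx: SpanContext,
  attrs: SqsMessageAttributes = {},
): SqsMessageAttributes {
  const flags = ctx.sampled ? '01' : '00';
  return {
    ...attrs,
    traceparent: {
      DataType: 'String',
      StringValue: `00-${ctx.traceId}-${ctx.spanId}-${flags}`,
    },
  };
}

// Consumer side (Lambda): parse the raw header value, or return null if it is
// missing or malformed so the handler can start a fresh trace instead.
function parseTraceparent(raw: string | undefined): SpanContext | null {
  if (!raw) return null;
  const m = TRACEPARENT_RE.exec(raw);
  if (!m) return null;
  const [, traceId = '', spanId = '', flags = '00'] = m;
  // All-zero IDs are invalid in W3C Trace Context.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 0x01) === 0x01 };
}
```

Returning `null` on bad input keeps a malformed attribute from failing message processing. Whether the ADOT layer can handle this propagation without custom code is still the open question.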

Proposes adding distributed tracing, metrics, and structured logging
to the wallet-service monorepo using OpenTelemetry.

Co-Authored-By: Andre Cardoso <andre@hathor.network>
@andreabadesso andreabadesso self-assigned this Mar 5, 2026
@andreabadesso andreabadesso added the documentation Improvements or additions to documentation label Mar 5, 2026
@andreabadesso andreabadesso moved this to In Progress (WIP) in Hathor Network Mar 5, 2026
Resolve the backend choice: Grafana Tempo for traces (S3-backed),
Prometheus for span-derived metrics via OTel Collector's Span Metrics
Connector. Integrates with our existing Grafana instance.

- Add OTel Collector config with Span Metrics Connector
- Add Tempo + Prometheus architecture diagram
- Update rollout plan with infra and dashboard phases
- Remove backend choice from unresolved questions
@andreabadesso andreabadesso changed the title RFC: OpenTelemetry observability for wallet-service rfc: OpenTelemetry observability for wallet-service Mar 5, 2026
@andreabadesso andreabadesso moved this from In Progress (WIP) to In Progress (Done) in Hathor Network Mar 6, 2026
Comment thread projects/wallet-service/0001-opentelemetry-observability.md Outdated
@luislhl luislhl moved this from In Progress (Done) to In Review (WIP) in Hathor Network Mar 18, 2026
- Fix architecture diagram to show separate ADOT embedded collector for
  Lambdas instead of sharing the K8s sidecar collector
- Note Tempo is VPC-internal only (no internet exposure needed)
- Switch from Grafana Alerting to Alertmanager rules (existing on-call setup)
- Recalibrate alert severities per on-call guide (Major/Medium/Low)
- Remove PagerDuty reference (not used), routing handled by Alertmanager
- Add trace retention note (compactor.block_retention, 14 days default)
- Add S3 bucket creation and IAM permissions step (Pod Workload Identity)
- Note to evaluate pre-built OTel dashboards before building from scratch

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@pedroferreira1 pedroferreira1 left a comment

For me this is approved, I just have some questions:

> Storage costs. Traces generate significant data volume. Without sampling, costs can grow quickly. Tail-based sampling mitigates this but adds collector complexity.

Do we have an estimate of how much this will increase?
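
A rough way to frame that estimate, as a back-of-the-envelope sketch: retained volume is roughly request rate × spans per request × bytes per span × sampling ratio × retention. The helper below is hypothetical. Every input has to be measured on our own traffic, and the arithmetic ignores compression and index overhead in the trace backend.

```typescript
// Sketch only: rough upper-bound arithmetic for retained trace volume.
// Field names are hypothetical; none of the inputs are measured values.

interface TraceVolumeInputs {
  requestsPerSecond: number; // average traced operations per second
  spansPerRequest: number; // average spans emitted per operation
  bytesPerSpan: number; // average serialized span size in bytes
  samplingRatio: number; // fraction of traces kept, between 0 and 1
  retentionDays: number; // how long traces are kept before deletion
}

const SECONDS_PER_DAY = 86400;

// Returns the approximate number of bytes held at steady state.
function estimateRetainedBytes(input: TraceVolumeInputs): number {
  if (input.samplingRatio < 0 || input.samplingRatio > 1) {
    throw new RangeError('samplingRatio must be between 0 and 1');
  }
  const bytesPerSecond =
    input.requestsPerSecond *
    input.spansPerRequest *
    input.bytesPerSpan *
    input.samplingRatio;
  return bytesPerSecond * SECONDS_PER_DAY * input.retentionDays;
}

// Placeholder numbers purely to show the arithmetic; only the 14-day
// retention comes from the RFC's stated default.
const exampleBytes = estimateRetainedBytes({
  requestsPerSecond: 10,
  spansPerRequest: 10,
  bytesPerSpan: 500,
  samplingRatio: 0.1,
  retentionDays: 14,
});
```

With those placeholder inputs the result is about 6 GB retained. The real figure depends on numbers we would have to collect during the first rollout phase.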

> ADOT layer cold start impact in our specific Lambdas. The 200-800ms range is from AWS documentation. We need to benchmark with our actual deployment packages to get exact numbers.

We should put this behind a simple feature flag (even just an ENV var) so we can turn it off easily if we feel it's adding a lot of overhead, and not only to the cold start.
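
A minimal sketch of what such a flag could look like for the daemon. The variable name `WALLET_SERVICE_OTEL_ENABLED` is hypothetical. `OTEL_SDK_DISABLED` is the standard OpenTelemetry SDK environment variable and is honored here as an override. The `start` callback stands in for whatever actually boots the SDK.

```typescript
// Sketch only: env-var gate around telemetry start-up.
// `WALLET_SERVICE_OTEL_ENABLED` is a hypothetical name, not defined by the RFC.

type Env = Record<string, string | undefined>;

function isTelemetryEnabled(env: Env): boolean {
  // The standard OpenTelemetry kill switch always wins.
  if ((env['OTEL_SDK_DISABLED'] ?? '').trim().toLowerCase() === 'true') {
    return false;
  }
  // Opt-in: telemetry stays off unless the flag is explicitly set.
  const flag = (env['WALLET_SERVICE_OTEL_ENABLED'] ?? '').trim().toLowerCase();
  return flag === 'true' || flag === '1';
}

// Runs `start` only when the flag allows it and reports whether it ran.
function initTelemetry(env: Env, start: () => void): boolean {
  if (!isTelemetryEnabled(env)) return false;
  start();
  return true;
}
```

In the daemon entry point this could be called as `initTelemetry(process.env, () => sdk.start())`, where `sdk` is the OTel SDK instance. Defaulting to off means a missing variable can never enable tracing by accident, at the cost of one extra setting per environment.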


It would be great to have, for each alert created in Alertmanager, an analysis of how to improve the query, API, or error that triggered it.

@andreabadesso andreabadesso merged commit 4c192e3 into master Mar 24, 2026
@github-project-automation github-project-automation Bot moved this from In Review (WIP) to Waiting to be deployed in Hathor Network Mar 24, 2026
@andreabadesso andreabadesso deleted the docs/wallet-service-opentelemetry-observability branch March 24, 2026 20:34
@andreabadesso andreabadesso moved this from Waiting to be deployed to Done in Hathor Network Mar 31, 2026