Add initial Open Telemetry support #385

Open
wants to merge 12 commits into
base: main

rlopzc commented Jul 7, 2024

What kind of change does this PR introduce?

Feature

What is the current behavior?

No current behavior.

What is the new behavior?

This PR will start sending OpenTelemetry traces to the configured OTel vendor.

My goal with this PR is to set up OpenTelemetry in the repository + a development environment to send the traces. This will lower the bar for more contributors to explore this library and play with the traces (add more traces, trace other events, trace the different libraries this project uses, etc).

The OTel vendor can be configured with the following environment variables:

  • OTEL_EXPORTER_OTLP_ENDPOINT. The endpoint to send the traces to. For example: http://localhost:4318.
  • OTEL_EXPORTER_OTLP_HEADERS. The headers to include in the request. For example: authorization="Bearer your-api-key".
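As a rough illustration of how an OTLP/HTTP exporter typically interprets these two variables: the OpenTelemetry spec describes the headers value as comma-separated key=value pairs, and the traces URL is the base endpoint plus /v1/traces. The sketch below is in Python only to stay self-contained. The function names are hypothetical and this is not the exporter code used in this PR.

```python
from urllib.parse import unquote


def parse_otlp_headers(raw: str) -> dict[str, str]:
    """Parse "k1=v1,k2=v2" into a dict (hypothetical helper, not a library API)."""
    headers = {}
    for pair in raw.split(","):
        # Split on the first "=" only, so values such as base64 tokens survive.
        key, sep, value = pair.partition("=")
        if not sep or not key.strip():
            continue  # skip empty or malformed entries
        headers[key.strip()] = unquote(value.strip())
    return headers


def traces_url(endpoint: str) -> str:
    """Derive the per-signal URL an OTLP/HTTP exporter posts traces to."""
    return endpoint.rstrip("/") + "/v1/traces"


# The shell strips the quotes in authorization="Bearer your-api-key",
# so the process sees the value used here.
print(parse_otlp_headers("authorization=Bearer your-api-key"))
print(traces_url("http://localhost:4318"))
```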

What's currently traced?

  • Proxied query to the DB. It's a general trace to which we can attach more information about the flow of events. It carries attributes that identify the request: tenant, user, mode, type, db_name, pool_pid, db_pid (example in the image below).
    Ideally, this trace will have information on all the interactions with different systems: caches, pools, partisan, client_handler, db_handler, pg_parser, etc.
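A small sketch of how such an attribute set could be assembled, written in Python for illustration only. OpenTelemetry attribute values must be primitives (or arrays of primitives) and cannot be null, so unset fields are dropped and anything else, such as a process identifier, is turned into a string. The helper name and the sample values are assumptions, not taken from this PR's code.

```python
def span_attributes(**fields):
    """Build an OTel-friendly attribute dict (hypothetical helper)."""
    attrs = {}
    for key, value in fields.items():
        if value is None:
            continue  # null attribute values are not valid, so skip them
        if isinstance(value, (str, bool, int, float)):
            attrs[key] = value
        else:
            attrs[key] = str(value)  # e.g. a PID-like object becomes text
    return attrs


# Sample values are made up; the tuples stand in for Erlang PIDs.
attrs = span_attributes(
    tenant="dev_tenant",
    user="postgres",
    mode="transaction",
    type=None,
    db_name="postgres",
    pool_pid=(0, 512, 0),
    db_pid=(0, 513, 0),
)
print(attrs)
```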

As the project is complex, I didn't have enough time to dig into each query flow and its edge cases. That's why this PR only traces the query from when it is sent to the proxy until the ClientHandler responds to the caller.

Next steps

  • Understand the different flows + edge cases, and add spans to the query trace (the one traced in this PR), taking into account the distributed nature of the calls.
  • This project uses partisan to communicate with the DB. Partisan produces telemetry events (docs here). One idea is to listen to partisan telemetry events and trace the requests sent to the DB.
    I need more time to think about how to share the otel_span created in the ClientHandler with the DbHandler, all the way to the produced telemetry event.
  • Depending on the chosen OTel provider, it may support multi-tenancy. For example, here are the docs for Grafana Tempo.
  • When the traces are good enough, it should be documented how anyone can enable OpenTelemetry (/docs).
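On the open question of sharing the span between the ClientHandler and the DbHandler: one common approach is to serialize the span context, hand it to the other process alongside the message, and start a child span from it on the receiving side. The sketch below shows that idea with the W3C traceparent format, in plain Python so it stays self-contained. All names are hypothetical; in this project the OpenTelemetry Erlang/Elixir API's own context and propagation helpers would be the real tool.

```python
import re
import secrets
from dataclasses import dataclass

# W3C traceparent, version 00: 00-<32 hex trace id>-<16 hex span id>-<2 hex flags>
_TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


@dataclass(frozen=True)
class SpanContext:
    trace_id: str
    span_id: str
    sampled: bool = True


def inject(ctx: SpanContext) -> str:
    """Serialize the context so it can travel with a message."""
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{ctx.span_id}-{flags}"


def extract(header: str):
    """Rebuild the context on the receiving side, or None if invalid."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None  # all-zero IDs are invalid per the spec
    return SpanContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def child_of(parent: SpanContext) -> SpanContext:
    """A child span keeps the trace id and gets a fresh span id."""
    return SpanContext(parent.trace_id, secrets.token_hex(8), parent.sampled)


# "ClientHandler" side: root span for the proxied query.
root = SpanContext(secrets.token_hex(16), secrets.token_hex(8))
carrier = inject(root)

# "DbHandler" side: continue the same trace from the carried value.
db_span = child_of(extract(carrier))
print(carrier, db_span.trace_id == root.trace_id)
```

Because the child keeps the parent's trace id, a backend such as Tempo can show both spans under the same query trace.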

Local environment producing traces

Pre-requisites:

  • Dev setup
  • Tenant created in DB, as per the linked example.

To display the traces, I chose Grafana OTel because this project already uses the great library PromEx, so I figured Grafana was already part of the stack. Of course, this is very easy to change :).

  1. Start the Grafana OTel collector + web UI: docker compose up grafana-otel.
  2. Go to http://localhost:4300 and log in with admin/admin.
  3. Start the development environment: make dev.otel.
  4. Connect via the proxy with: psql postgresql://postgres.dev_tenant:postgres@localhost:6543/postgres
  5. Execute a query in psql: select * from _supavisor.tenants;
  6. Explore the traces in Explore -> Choose Tempo -> Query type: search.
    image

I added dev.otel to the Makefile, which adds two environment variables to send the traces to http://localhost:4318.

Visualizing the traces:

As you can see in the image, the trace shows the duration of the executed query with additional information that'll help filter traces when making queries.

[Image: swappy-20240707_142332]

Additional context

Related issue: #93

Let me know what can be improved in this PR. I'll address the reviews when I have free time 🙂.

rlopzc added 12 commits July 7, 2024 12:37
From the docs:
> The SDK opentelemetry should be added as early as possible in the
> Release boot process to ensure it is available before any telemetry is
> produced. Here it is also set to temporary under the assumption that we
> prefer to have a running Release not producing telemetry over crashing
> the entire Release.
Allow us to send and visualize produced traces in Grafana.
- Attach information to the root span: tenant, user, etc. Similar to the
  Telem module information.
- Attach Pool and DB PIDs for debugging purposes.
- Sets OTEL_TRACES_EXPORTER="otlp"
- OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4318"
@rlopzc rlopzc requested a review from a team as a code owner July 7, 2024 21:25