@arikgrahl
Contributor

As already described in the README.md:

A monitoring mixin for CloudNativePG, providing Grafana dashboards and Prometheus alerting rules for PostgreSQL clusters running on Kubernetes.

Dashboards

This mixin bundles the Grafana dashboard provided by CloudNativePG.

CloudNativePG Dashboard

Prometheus Alerts

This mixin bundles the sample Prometheus Alert rules provided by CloudNativePG.

  • LongRunningTransaction: A query is taking longer than 5 minutes.
  • BackendsWaiting: A backend has been waiting for longer than 5 minutes.
  • PGDatabase: Number of transactions from the frozen XID to the current one
  • PGReplication: The standby is lagging behind the primary
  • LastFailedArchiveTime: Checks the last time archiving failed. Will be < 0 when it has not failed.
  • DatabaseDeadlockConflicts: Checks the number of database conflicts
  • ReplicaFailingReplication: Checks if the replica is failing to replicate
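
To make the five-minute thresholds in the first two alerts concrete, here is a minimal Python sketch. It is not the mixin's rule definitions, which are Prometheus alert rules; the `BackendSample` type, its field names, and the `firing_alerts` helper are hypothetical and exist only for illustration. The only value taken from the list above is the 5-minute threshold.

```python
from dataclasses import dataclass

# Threshold quoted in the alert descriptions above (5 minutes).
FIVE_MINUTES_SECONDS = 5 * 60


@dataclass
class BackendSample:
    """Hypothetical snapshot of one PostgreSQL instance (illustrative field names)."""
    longest_transaction_seconds: float
    longest_backend_wait_seconds: float


def firing_alerts(sample: BackendSample) -> list[str]:
    """Return the names of the five-minute alerts this sample would trigger."""
    alerts = []
    if sample.longest_transaction_seconds > FIVE_MINUTES_SECONDS:
        alerts.append("LongRunningTransaction")
    if sample.longest_backend_wait_seconds > FIVE_MINUTES_SECONDS:
        alerts.append("BackendsWaiting")
    return alerts


if __name__ == "__main__":
    # A transaction open for 6 minutes, with no backend waiting that long.
    print(firing_alerts(BackendSample(360.0, 12.0)))
```

Both comparisons are strict, so a value of exactly 300 seconds does not fire in this sketch; whether the real rules behave the same way depends on their expressions.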

@arikgrahl arikgrahl requested a review from a team as a code owner August 1, 2025 12:37
@arikgrahl arikgrahl marked this pull request as draft August 1, 2025 12:51
@arikgrahl arikgrahl force-pushed the cloudnative-pg-mixin branch from 5052974 to 8bd7aa7 Compare August 5, 2025 13:13
@arikgrahl arikgrahl marked this pull request as ready for review August 5, 2025 13:15
@v-zhuravlev
Contributor

Hi! Thanks for contributing. Is this a copy of the CloudNativePG dashboards and alerts? Would it be more practical to repackage their alerts and dashboards as a mixin in the CloudNativePG repo instead, so contributors and users can discover it more easily?

@arikgrahl
Contributor Author

Generally speaking, this approach makes sense.
I could try submitting a repackaging of the resources as a mixin to the upstream repository where the dashboards and alerts originate.

However, I'm not very optimistic that such a contribution would be accepted, as the repository appears to be limited to Helm charts:

This repository contains the Grafana Dashboards distributed as Helm Charts so they can packaged as a dependency to other projects.

Additionally, I’ve noticed that several mixins within this project have counterparts in their respective upstream repositories.
For example: github.com/ceph/ceph/…/ceph-cluster.json vs. github.com/grafana/jsonnet-libs/…/ceph-cluster.json

For this reason, I thought it might make sense to include the mixins here.
However, there may be subtle differences that I am not currently aware of.

@Dasomeone
Member

Hi @arikgrahl
Generally speaking, we'd prefer to keep them together with the upstream code for consistency as changes are made, but as you rightly pointed out, that's not always the case, and we can (and will) absolutely accept the contribution if they turn it down!

Let's give it a shot and see what they say; otherwise, happy to add it here for everyone. Even if they turn down the contribution, perhaps a compromise can be reached by linking between the two?

@Dasomeone
Member

Hi @arikgrahl, any updates on this PR? Were you able to contribute this upstream, or would you still like to have it live here?

@aalhour aalhour left a comment

LGTM, just a couple of questions since I am not familiar with CNPG:

  • Does the operator expose more metrics out of PG, or is that left to other projects, e.g. Prometheus and the like?
  • If the answer is yes, would you want to add more health metrics, such as commit latencies, slow queries, etc.?

@arikgrahl
Contributor Author

@Dasomeone, thanks for checking in and for all the feedback!
Unfortunately, I haven’t had the bandwidth to push this upstream so far, but I’d definitely still like the contribution to be made available, whether here or upstream. Since I can’t commit to a timeline for pursuing the upstream route, I’d be happy to have it live here for now if that works.

@aalhour, thank you for the review and your questions!
CloudNativePG exposes a fairly extensive set of metrics for the PostgreSQL databases it manages.
These include:

  • CPU/memory usage
  • session state (active vs. idle)
  • transactions
    • committed vs. rolled back
    • longest transaction
  • deadlocks
  • blocked queries
  • storage (PGData/WAL)
    • volume space usage
    • volume inode usage
  • tuple I/O (deleted/inserted/fetched/returned/updated)
  • block I/O (hit vs. read)
  • database size
  • WAL
    • segment archive status (ready vs. done)
    • archiver status (archived vs. failed)
    • last archive age
    • WAL count
  • replication
    • replication lag
    • write lag
    • flush lag
    • replay lag

I hope this answers your question, but please let me know if you’d like more detail or have a specific definition of “health metrics” for PostgreSQL in mind.
Commit latencies should be covered by the existing metrics, but I’m not certain if there’s a direct metric for slow queries.
If there is a specific metric or dashboard you'd like added, let me know and I can look into including it!

@Dasomeone
Member

@arikgrahl Like I said originally, we'd prefer that an attempt be made to contribute this upstream first, so it lives closer to the codebase, but if they're unwilling to accept it we can absolutely store it here. I'm just hesitant to have it live here from the get-go.
Do let me know if you're able to upstream it or not :)
Maybe we can revisit in a few weeks?
