LXD cluster does not return metrics for a project if no instance from that project is running on the queried node. #12775

Closed
tregubovav-dev opened this issue Jan 26, 2024 · 2 comments

@tregubovav-dev

tregubovav-dev commented Jan 26, 2024

Required information

  • Distribution: Ubuntu
  • Distribution version: 23.10 (Mantic) (arm64)
  • The output of "lxc info" or if that fails:
    • Kernel version: 6.5.0-1009-raspi #12-Ubuntu SMP PREEMPT_DYNAMIC Wed Jan 17 11:45:08 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux
    • LXD version: 5.19
    • Storage backend in use: microceph

Issue description

The LXD metrics API does not return metrics for a project if no instance from that project is running on the queried node. The query returns metrics only for projects that have instances running on the queried node.

Steps to reproduce

  1. Create an LXD cluster with 3+ nodes.
  2. Create one or more additional project(s) in the cluster.
  3. Deploy and start several instances in all of the projects. Make sure that each node hosts instances from every project.
  4. Run the lxc query /1.0/metrics command on all nodes and confirm that the query returns metrics for all instances in all projects in the cluster.
  5. Stop the instances from project "default" hosted on one of the nodes (make sure that other instances from project "default" continue running on other nodes), then run the lxc query /1.0/metrics command on that node. The query returns metrics for all instances from all projects except project "default".
  6. Run the lxc query /1.0/metrics command on the other nodes and confirm that the query returns metrics for all instances in all projects in the cluster.

This behavior corrupts the metrics collected by external scrapers and shown in external dashboards such as Prometheus + Grafana.
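
The per-node check in the steps above can be done mechanically. Below is a minimal sketch, assuming the output of lxc query /1.0/metrics on one node has been captured as text. The function names `projects_in_metrics` and `missing_projects` are hypothetical helpers, not part of LXD, and the sample line is made up for illustration.

```python
import re

# Matches the project label in a Prometheus-style metric line,
# e.g. project="default".
PROJECT_RE = re.compile(r'project="([^"]*)"')


def projects_in_metrics(metrics_text):
    """Return the set of project names that appear in the metrics text."""
    projects = set()
    for line in metrics_text.splitlines():
        # Skip "# HELP" and "# TYPE" comment lines.
        if line.startswith("#"):
            continue
        match = PROJECT_RE.search(line)
        if match:
            projects.add(match.group(1))
    return projects


def missing_projects(metrics_text, expected):
    """Return the expected projects that have no metric line at all."""
    return set(expected) - projects_in_metrics(metrics_text)


# Hypothetical sample: one node reporting only a project named "web".
sample = (
    'lxd_procs_total{name="c1",project="web",state="RUNNING",type="container"} 3\n'
    "lxd_uptime_seconds 65.4\n"
)
print(sorted(missing_projects(sample, ["default", "web"])))  # ['default']
```

Running this against the output from each cluster node would show which node drops which project.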

@tomponline
Member

@tregubovav-dev is this still an issue with LXD 5.20?

@simondeziel would you mind seeing if you can validate if this remains an issue?

@simondeziel
Member

This bug has been fixed since the introduction of the metrics_instances_count extension. Here's how I did the initial reproduction with 5.19/stable:

$ lxc launch ubuntu-daily:22.04 c1 -c security.nesting=true -c security.devlxd.images=true
$ lxc shell c1
# snap refresh lxd --channel 5.19/stable
lxd (5.19/stable) 5.19-8635f82 from Canonical✓ refreshed
# lxd init --auto
# lxc init ubuntu-minimal-daily:22.04 c2
# lxc query /1.0/metrics | grep -v ^lxd_go | grep -v ^#
lxd_operations_total 1
lxd_warnings_total 3
lxd_uptime_seconds 65.457576337

This confirms that stopped instances are not reported. Now with 5.21/edge, which includes the metrics_instances_count extension, offline instances are reported:

# snap refresh lxd --channel 5.21/edge
# lxc query /1.0/metrics | grep -v ^lxd_go | grep -v ^# | grep -wF c2
lxd_cpu_seconds_total{cpu="0",mode="system",name="c2",project="default",state="STOPPED",type="container"} 0
lxd_cpu_seconds_total{cpu="0",mode="user",name="c2",project="default",state="STOPPED",type="container"} 0
lxd_cpu_effective_total{name="c2",project="default",state="STOPPED",type="container"} -1
lxd_filesystem_avail_bytes{device="",fstype="zfs",mountpoint="/",name="c2",project="default",state="STOPPED",type="container"} 1.5333982208e+11
lxd_filesystem_free_bytes{device="",fstype="zfs",mountpoint="/",name="c2",project="default",state="STOPPED",type="container"} 1.5333982208e+11
lxd_filesystem_size_bytes{device="",fstype="zfs",mountpoint="/",name="c2",project="default",state="STOPPED",type="container"} 1.54700218368e+11
lxd_memory_Active_bytes{name="c2",project="default",state="STOPPED",type="container"} 0
lxd_memory_Inactive_bytes{name="c2",project="default",state="STOPPED",type="container"} 0
lxd_memory_MemAvailable_bytes{name="c2",project="default",state="STOPPED",type="container"} 3.1642516001e+10
lxd_memory_MemFree_bytes{name="c2",project="default",state="STOPPED",type="container"} 3.1642516001e+10
lxd_memory_MemTotal_bytes{name="c2",project="default",state="STOPPED",type="container"} 3.1642516e+10
lxd_memory_Swap_bytes{name="c2",project="default",state="STOPPED",type="container"} -1
lxd_memory_OOM_kills_total{name="c2",project="default",state="STOPPED",type="container"} -1
lxd_procs_total{name="c2",project="default",state="STOPPED",type="container"} 0

So I believe your specific bug is fixed, but since I have not used the exact same reproduction steps (cluster setup), please do re-open the bug if it is not fixed in 5.21 or later.
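
To compare the two outputs above without reading them by eye, the labels on each metric line can be grouped by instance state. This is a rough sketch, assuming the label layout shown in the 5.21/edge output (name, project and state labels on per-instance metrics). The helpers `parse_labels` and `instances_by_state` are hypothetical names, not LXD APIs.

```python
import re
from collections import defaultdict

# One key="value" pair inside the {...} label block; allows empty values
# (device="") and backslash-escaped characters.
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_labels(line):
    """Return the labels of one metric line as a dict ({} if it has none)."""
    start = line.find("{")
    end = line.rfind("}")
    if start == -1 or end < start:
        return {}
    return dict(LABEL_RE.findall(line[start + 1:end]))


def instances_by_state(metrics_text):
    """Map each state label to the set of (project, name) pairs seen with it."""
    result = defaultdict(set)
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        labels = parse_labels(line)
        # Only per-instance metrics carry both labels; global ones are skipped.
        if "name" in labels and "state" in labels:
            result[labels["state"]].add((labels.get("project", ""), labels["name"]))
    return dict(result)


# Two lines taken from the 5.21/edge output above, plus one global metric.
sample = (
    'lxd_cpu_seconds_total{cpu="0",mode="system",name="c2",project="default",'
    'state="STOPPED",type="container"} 0\n'
    'lxd_procs_total{name="c2",project="default",state="STOPPED",type="container"} 0\n'
    "lxd_operations_total 1\n"
)
print(instances_by_state(sample))  # {'STOPPED': {('default', 'c2')}}
```

On the 5.19/stable output shown earlier, which has no per-instance lines, the same function returns an empty dict.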
