Expose Pressure Stall Information as metrics #3052

Open
dqminh opened this issue Jan 27, 2022 · 3 comments · May be fixed by #3083

Comments

@dqminh (Contributor) commented Jan 27, 2022

Pressure Stall Information is exposed per cgroup in cgroup v2. It's a good way to understand contention due to a lack of resources (CPU, memory, IO). For example:

```
# /sys/fs/cgroup/system.slice/cpu.pressure
some avg10=0.00 avg60=0.00 avg300=0.00 total=306212315
full avg10=0.00 avg60=0.00 avg300=0.00 total=246733962
```

It would be great to expose this data source in cAdvisor as metrics.
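
For illustration, here is a minimal, self-contained Go sketch of parsing such a `*.pressure` file. It is not cAdvisor's implementation; the `psiLine` and `parsePressure` names are made up for this example.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// psiLine holds the values of one "some" or "full" line in a *.pressure file.
type psiLine struct {
	Avg10, Avg60, Avg300 float64
	Total                uint64 // cumulative stall time in microseconds
}

// parsePressure reads a cgroup v2 pressure file and returns its "some" and
// "full" lines keyed by that first field.
func parsePressure(path string) (map[string]psiLine, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	out := make(map[string]psiLine)
	for _, line := range strings.Split(strings.TrimSpace(string(data)), "\n") {
		// e.g. ["some", "avg10=0.00", "avg60=0.00", "avg300=0.00", "total=306212315"]
		fields := strings.Fields(line)
		if len(fields) != 5 {
			continue
		}
		var p psiLine
		for _, kv := range fields[1:] {
			k, v, ok := strings.Cut(kv, "=")
			if !ok {
				continue
			}
			switch k {
			case "avg10":
				p.Avg10, _ = strconv.ParseFloat(v, 64)
			case "avg60":
				p.Avg60, _ = strconv.ParseFloat(v, 64)
			case "avg300":
				p.Avg300, _ = strconv.ParseFloat(v, 64)
			case "total":
				p.Total, _ = strconv.ParseUint(v, 10, 64)
			}
		}
		out[fields[0]] = p
	}
	return out, nil
}

func main() {
	stats, err := parsePressure("/sys/fs/cgroup/system.slice/cpu.pressure")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("some: %+v\nfull: %+v\n", stats["some"], stats["full"])
}
```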

@dqminh changed the title from "Exopse Pressure Stall Information as metrics" to "Expose Pressure Stall Information as metrics" on Jan 27, 2022
@mrunalp (Collaborator) commented Jan 28, 2022

This is something @bobbypage and I talked about adding recently. Also, adding @kolyshkin here.

@dqminh (Contributor, Author) commented Jan 28, 2022

I added support to runc in opencontainers/runc#3358; once that goes in, we can update libcontainer in cAdvisor and expose the metrics here.

@bobbypage (Collaborator) commented:

That would be awesome to have support in libcontainer and use it in cAdvisor. Thanks @dqminh !

@dqminh linked a pull request on Mar 23, 2022 that will close this issue
xinau added a commit to xinau/cadvisor that referenced this issue Jan 26, 2025
issues: google#3052, google#3083, kubernetes/enhancements#4205

This change adds metrics for pressure stall information, which indicate
how long some or all tasks of a cgroup v2 have stalled due to resource
congestion (CPU, memory, IO). The change exposes this information by
including the _PSIStats_ of each controller in its stats, i.e.
_CPUStats.PSI_, _MemoryStats.PSI_ and _DiskStats.PSI_.

The information is additionally exposed as Prometheus metrics. The
metrics follow the naming outlined by the prometheus/node-exporter,
where stalled corresponds to full and waiting corresponds to some.

```
container_pressure_cpu_stalled_seconds_total
container_pressure_cpu_waiting_seconds_total
container_pressure_memory_stalled_seconds_total
container_pressure_memory_waiting_seconds_total
container_pressure_io_stalled_seconds_total
container_pressure_io_waiting_seconds_total
```

Signed-off-by: Felix Ehrenpfort <[email protected]>
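
For illustration, a minimal sketch (not the actual cAdvisor collector) of how a PSI "total" value, which the kernel reports in microseconds, could be converted into one of the node-exporter-style seconds counters listed above; the `psiTotalToMetric` helper and the `id` label are assumptions for this example.

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

// Descriptor for one of the counters named in the commit message above.
var cpuStalledDesc = prometheus.NewDesc(
	"container_pressure_cpu_stalled_seconds_total",
	"Total time in seconds during which all tasks of the container were stalled on CPU.",
	[]string{"id"}, nil,
)

// psiTotalToMetric converts a PSI "total" value (microseconds of stall time)
// into a Prometheus counter expressed in seconds. Hypothetical helper for
// illustration only.
func psiTotalToMetric(containerID string, totalUsec uint64) prometheus.Metric {
	return prometheus.MustNewConstMetric(
		cpuStalledDesc,
		prometheus.CounterValue,
		float64(totalUsec)/1e6, // microseconds -> seconds
		containerID,
	)
}

func main() {
	// "full" total from the cpu.pressure example in the issue description.
	m := psiTotalToMetric("/system.slice", 246733962)
	fmt.Println(m.Desc())
}
```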