high cpu usage with only two services managed #272
Comments
I'll add a comment here. I have several VMs with Sablier. When I restart Sablier, its CPU consumption returns to normal.
Hello @kraoc, do you have more input on that? CPU usage graphs, or specific events after which you noticed it, would help. I will investigate this issue for sure when I have time, but more input to narrow it down will help!
I'll try to add more useful debug tips :p My first thought points to a potential leak on first start (e.g. on a VM boot), because after a restart of Sablier its CPU consumption is OK. I'm not talking about the Traefik plugin but acouvreur/sablier:latest (just to be precise).
I have the same issue. I am running in Docker Desktop on Ubuntu, with the CPU limit under "Resources" set to 4. After restarting the container the CPU usage is almost zero. But about 4 minutes later it ramps up to just over 100% (the maximum available is 400%). This happens whether or not I send any requests through Sablier. The usage is coming from the main Sablier container, not the proxy (I use Caddy). It seems to work fine even when the CPU usage is high. I can't see anything useful in the logs, but let me know if there's anything else I can provide.
I still have the bug too, even with the latest version. Same behavior: after a restart it's fine, but when I hit one of the endpoints managed by Sablier, the process consumes a lot of CPU.
I'd be curious to have more information to try to reproduce the issue.
I've tried to reproduce the issue, but even after multiple calls I cannot go over 0.3% CPU usage. I'm currently running a version of the program with CPU, memory, and trace profiling. If I can better reproduce your use cases, I'll be able to better understand why it behaves this way. Thanks!
Hi @acouvreur, for me it's still the same as #272 (comment), but with the latest Sablier version. Some self-hosted services on a Synology DS918+ with an Intel(R) Celeron(R) CPU J3455 @ 1.50GHz. I have around 60 containers running, if the docker events flow matters.
I have my config in dynamic-config.toml and some labels in my docker-compose.yaml. Here are a few extracts. I have 4 endpoints managed by Sablier, and I've tried playing with the timeout and session time; sometimes the problem doesn't trigger instantly, but often it happens right away. dynamic-config.toml
docker-compose.yml
I've done the following experiment:

```shell
docker run -d -l sablier.enable=true --name whoami containous/whoami
docker run -v '/var/run/docker.sock:/var/run/docker.sock' --cpu-shares=1024 -m 256m --name sablier -p 10000:10000 acouvreur/sablier:1.7.0 start --logging.level=trace
docker stats
for i in {1..500}; do
  curl "http://localhost:10000/api/strategies/dynamic?names=whoami&session_duration=30s" &
  curl "http://localhost:10000/api/strategies/blocking?names=whoami&session_duration=30s" &
  curl "http://localhost:10000/api/strategies/blocking?group=default&session_duration=30s" &
  curl "http://localhost:10000/api/strategies/dynamic?group=default&session_duration=30s" &
done
```

This makes 2k requests, matching by names and by group, with both strategies. (Recording.2024-06-27.145655.1.mp4 attached.) Still I cannot reproduce the CPU usage. Note that I've set the same resource restrictions that you have.
I've done some testing on my setup, and I can only reproduce it when I use a docker socket proxy. Looks like there's another issue open in the same area: |
Ok, I confess I cleaned up my config a little before pasting it in the previous post, and yes, I use a cetusguard proxy too! Good catch!
No doubt it works well in your lab, so I understand I'm facing a docker socket proxy issue too.
I can try with the docker socket proxy. It might be related to Sablier listening for incoming events. Can you share your setup with a docker socket proxy?
So, I've tried with the docker socket proxy, using the following setup:

```yaml
version: "3.7"
services:
  cetusguard-sablier:
    container_name: cetusguard-sablier
    image: docker.io/hectorm/cetusguard
    read_only: true
    security_opt:
      - no-new-privileges:true
    volumes:
      - "/var/run/docker.sock:/var/run/docker.sock:ro"
    environment:
      CETUSGUARD_BACKEND_ADDR: "unix:///var/run/docker.sock"
      CETUSGUARD_FRONTEND_ADDR: "tcp://:2375"
      CETUSGUARD_RULES: |
        ! Monitor events
        GET %API_PREFIX_EVENTS%
        ! Get system information
        GET %API_PREFIX_INFO%
        ! Get data usage information
        GET %API_PREFIX_SYSTEM%
        ! List containers
        GET %API_PREFIX_CONTAINERS%/json
        ! Inspect a container
        GET %API_PREFIX_CONTAINERS%/%CONTAINER_ID_OR_NAME%/json
        ! Start/stop a container
        POST %API_PREFIX_CONTAINERS%/%CONTAINER_ID_OR_NAME%/start
        POST %API_PREFIX_CONTAINERS%/%CONTAINER_ID_OR_NAME%/stop
        GET %API_PREFIX_CONTAINERS%/%CONTAINER_ID_OR_NAME%/stats
      CETUSGUARD_LOG_LEVEL: "2"
  sablier:
    image: acouvreur/sablier:1.7.0
    ports:
      - 10000:10000
    environment:
      - PROVIDER_NAME=docker
      - DOCKER_HOST=tcp://cetusguard-sablier:2375
  whoami:
    image: containous/whoami:v1.5.0
    labels:
      - sablier.enable=true
```

And still, after a lot of HTTP calls:
Docker sock config and proxy config are both commented out; uncomment one at a time. Also, docker.sock.raw is for Docker Desktop; use docker.sock otherwise.

```yaml
name: sablier2
services:
  proxy2:
    restart: always
    image: caddy:2.8.4-with-sablier # built as per sablier docs
    networks:
      - network2
    ports:
      - "8000:80" # whoami
    volumes:
      - ./Caddyfile:/etc/caddy/Caddyfile:ro
  whoami2:
    restart: "no"
    image: traefik/whoami
    networks:
      - network2
    labels:
      - sablier.enable=true
      - sablier.group=demo
  sablier2:
    restart: always
    image: acouvreur/sablier
    networks:
      - network2
      - sablier-socket-network2
    # volumes:
    #   - '/var/run/docker.sock.raw:/var/run/docker.sock'
    # environment:
    #   - DOCKER_HOST=tcp://socket-proxy-sablier2:2375
  socket-proxy-sablier2:
    image: lscr.io/linuxserver/socket-proxy:latest
    environment:
      - ALLOW_START=1
      - ALLOW_STOP=1
      - ALLOW_RESTARTS=1
      - CONTAINERS=1
      - EVENTS=1
      - INFO=1
      - NETWORKS=1
      - PING=1
      - POST=1
      - SERVICES=1
      - VERSION=1
    volumes:
      - /var/run/docker.sock.raw:/var/run/docker.sock:ro
    restart: always
    read_only: true
    tmpfs:
      - /run
    networks:
      - sablier-socket-network2
networks:
  network2:
    name: sablier-network2
  sablier-socket-network2:
    name: sablier-socket-network2
```
I can create a Discord server for the project so we can actually debug this together, what do you think? I need something with voice channels; do you have any suggestions?
Ok, thanks @luked34, I can actually reproduce the issue now!
I would have suggested Discord; it might be worth setting one up even if we don't use it for this issue.
I think I found the issue: it seems to happen when the event stream returns a nil channel. In such a case the event stream cannot be properly initialized. I can make a fix to properly break out of the faulty event stream; however, this means that there is something wrong with your setup, as we can't properly listen to the event stream. I'll publish a fix in the
So basically what happens is that the client is registered without making sure the docker connection is established. It works after a restart because, by then, the connection has been established. I've changed this behavior so Sablier will actually exit if the connection cannot be established. You've all probably hit some kind of race in which the connection to the docker host was unavailable at the time of registering the event stream. This ended up burning the CPU, for which I've also added a fail-safe.
The fix is available in v1.7.1-beta.1, let me know if it still happens.
Maybe an important point: in a VM restart situation, Sablier (which depends on the socket proxy) may be started before the socket proxy. Sorry for the late reply :p
Thanks for the quick fix, it seems to be working. The log now shows two extra messages:
I've tried adding a depends_on to the compose file, restarting Sablier while leaving the socket proxy running, and changing the socket proxy permissions to allow everything, but the messages still appear. CPU usage is consistently below 1% now, so I'm happy to ignore the messages :) |
Maybe using willfarrell/autoheal, with a setting that marks the Sablier container unhealthy when no socket proxy is available, would do the trick?
I will check again later, but at first glance it's far better! Thank you!
When Sablier cannot connect to the docker host, it should fail and exit. If you simply add the option
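Assuming the truncated option above refers to a compose restart policy, a minimal sketch of the fail-and-retry setup might look like this (service names and values are illustrative, not the thread's exact config):

```yaml
services:
  sablier:
    image: acouvreur/sablier
    restart: on-failure   # assumption: retry after a failed startup check
    depends_on:
      - socket-proxy      # start the proxy before Sablier
    environment:
      - DOCKER_HOST=tcp://socket-proxy:2375
  socket-proxy:
    image: lscr.io/linuxserver/socket-proxy:latest
    restart: always
```

Note that a bare `depends_on` only orders container startup; it does not wait for the proxy to be ready, which is why exiting on a failed connection plus a restart policy covers the VM-boot race discussed above.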
Already has always :p |
First, thanks for this cool toy 😃
Describe the bug
High CPU usage
Context
Expected behavior
Low CPU usage
Additional context
I manage 2 services in a big docker-compose file with Sablier, and when I run top I can see the Sablier process working my tiny CPU very hard.
When I restart it, the process calms down, until some unknown event makes it hammer the CPU again.
Is this a known issue? I can see there is a full rewrite PR in progress; could that help? (I haven't actually dug into the code.)