Frequent 504s and Poor Uptime on Docker Compose deployments #821

TheOnlyWayUp · 2024-12-06T04:42:19Z

Fix: Thanks @theboringhumane! Update your traefik.yml with #821 (comment)

To Reproduce

Create multiple docker compose services in the same project
Uptime checking

This isn't an issue with UptimeKuma, because there are long periods of inactivity on my statistics as well.

Uptime stats with large blocks of empty:

Before moving composes to Dokploy

During a "downtime",

Nothing on service logs
Traefik request logs show 504s and timeouts

Current vs. Expected behavior

Services are supposed to be online until turned off
Current: Services are online on Dokploy's console but unreachable by the network intermittently

Provide environment information

CPU: AMD Ryzen 7 3700X (16) @ 3.600
GPU: 2b:00.0 ASPEED Technology, Inc
Memory: 13894MiB / 64221MiB
OS: Ubuntu 24.04 LTS x86_64
Host: 1.0
Kernel: 6.8.0-49-generic
Dokploy Version: v0.12.0

Which area(s) are affected? (Select all that apply)

Docker Compose

Are you deploying the applications where Dokploy is installed or on a remote server?

Same server where Dokploy is installed

Additional context

This doesn't happen when deploying on the host system without Dokploy, circumventing traefik.

Will you send a PR to fix it?

Maybe, need help

TheOnlyWayUp · 2024-12-06T04:45:09Z

Services go down and come back up in a few minutes all throughout the day, it's tanked uptime to 30%.

Forgejo Docker compose:

version: "3"

services:
  server:
    image: codeberg.org/forgejo/forgejo:8
    container_name: forgejo
    environment:
      - USER_UID=1000
      - USER_GID=1000
    restart: always
    networks:
      - default
    volumes:
      - /root/Projects/Forge/forgejo:/data
      - /etc/timezone:/etc/timezone:ro
      - /etc/localtime:/etc/localtime:ro
    ports:
      - 5005:5005
      - 222:22
    expose:
      - 5005
networks:
  default:

Ghost Docker compose

version: '3.1'

services:
  ghost:
    image: ghost:5-alpine
    restart: always
    expose:
      - 2368
    networks:
      - default
    environment:
      # see https://ghost.org/docs/config/#configuration-options
      database__client: mysql
      database__connection__host: db
      database__connection__user: root
      database__connection__password: 
      database__connection__database: ghost
      # this url value is just an example, and is likely wrong for your environment!
      url: https://blog.rambhat.la
      # contrary to the default mentioned in the linked documentation, this image defaults to NODE_ENV=production (so development mode needs to be explicitly specified if desired)
      #NODE_ENV: development
    labels:
    - "traefik.enable=true"

    # Middleware for replacing content in the body
    - "traefik.http.middlewares.inject-script.plugin.rewrite.rewrites[0].regex=</head>"
    - "traefik.http.middlewares.inject-script.plugin.rewrite.rewrites[0].replacement=<script defer src='https://stats.towu.dev/script.js' data-website-id='4d72a7bf-3049-4c82-8ff4-05c0bc4f8edf'></script></head>"

    # Link the middleware to the router
    - "traefik.http.routers.blog.middlewares=inject-script"

    volumes:
      - ghost_ghost:/var/lib/ghost/content
    depends_on:
      - db

  db:
    image: mysql:8.0
    restart: always
    environment:
      MYSQL_ROOT_PASSWORD: 
    expose:
      - 3306
    volumes:
      - ghost_db:/var/lib/mysql
        #
volumes:
  ghost_ghost:
    external: true
  ghost_db:
    external: true

networks:
  default:

Umami:

version: '3'
services:
  umami:
    image: ghcr.io/umami-software/umami:postgresql-latest
    environment:
      DATABASE_URL: postgresql://umami:umami@db:5432/umami
      DATABASE_TYPE: postgresql
      APP_SECRET: 
    depends_on:
      db:
        condition: service_healthy
    restart: always
    healthcheck:
      test: ['CMD-SHELL', 'curl http://localhost:3000/api/heartbeat']
      interval: 5s
      timeout: 5s
      retries: 5
    expose:
      - 3000
    ports:
      - 4999:3000
    networks:
      - default
      
  db:
    image: postgres:15-alpine
    environment:
      POSTGRES_DB: umami
      POSTGRES_USER: umami
      POSTGRES_PASSWORD: umami
    expose:
      - 5432
    volumes:
      - /root/Projects/miami/var/lib/postgresql/data:/var/lib/postgresql/data
    restart: always
    healthcheck:
      test: ['CMD-SHELL', 'pg_isready -U ${POSTGRES_USER} -d ${POSTGRES_DB}']
      interval: 5s
      timeout: 5s
      retries: 5
    networks:
      - default

networks:
  default:
    driver: bridge

These are the docker composes for affected services

TheOnlyWayUp · 2024-12-06T04:52:27Z

I believe it's an issue with Traefik, I can access the port-forwarded services (for example, Umami is forwarded to 4999 on the host and stats.towu.dev via traefik).

When stats.towu.dev is down, I can still access host:4999 to see Umami, so I'm pretty confident it's a proxy issue.

Something peculiar, while all the affected compose services go down at the same time (Ghost, Umami, and Forgejo). Other compose projects, like Immich, don't go down at all. Immich is a photo-management app which has a website as a part of the dockercompose, like the other services.

Immich (no dowmtime) Dockercompose

version: "3"
name: immich

services:
  immich-server:
    container_name: immich_server
    image: ghcr.io/immich-app/immich-server:${IMMICH_VERSION:-release}
    networks:
      - default
    extends:
      file: ../../hwaccel.transcoding.yml
      service: cpu # set to one of [nvenc, quicksync, rkmpp, vaapi, vaapi-wsl] for accelerated transcoding
    volumes:
      # Do not edit the next line. If you want to change the media storage location on your system, edit the value of UPLOAD_LOCATION in the .env file
      - ${UPLOAD_LOCATION}:/usr/src/app/upload
      - /etc/localtime:/etc/localtime:ro
      - /root/Projects/Immich/external:/mnt/media:ro
    env_file:
      - .env
    ports:
      - xxxx:2283
    expose:
      - 2283
    depends_on:
      - redis
      - database
    restart: always
    healthcheck:
      disable: false

  immich-machine-learning:
    container_name: immich_machine_learning
    networks:
      - default
    # For hardware acceleration, add one of -[armnn, cuda, openvino] to the image tag.
    # Example tag: ${IMMICH_VERSION:-release}-cuda
    image: ghcr.io/immich-app/immich-machine-learning:${IMMICH_VERSION:-release}
    # extends: # uncomment this section for hardware acceleration - see https://immich.app/docs/features/ml-hardware-acceleration
    #   file: hwaccel.ml.yml
    #   service: cpu # set to one of [armnn, cuda, openvino, openvino-wsl] for accelerated inference - use the `-wsl` version for WSL2 where applicable
    volumes:
      - model-cache:/cache
    env_file:
      - .env
    restart: always
    healthcheck:
      disable: false

  redis:
    container_name: immich_redis
    networks:
      - default
    image: docker.io/redis:6.2-alpine@sha256:e3b17ba9479deec4b7d1eeec1548a253acc5374d68d3b27937fcfe4df8d18c7e
    healthcheck:
      test: redis-cli ping || exit 1
    restart: always

  database:
    container_name: immich_postgres
    networks:
      - default
    image: docker.io/tensorchord/pgvecto-rs:pg14-v0.2.0@sha256:90724186f0a3517cf6914295b5ab410db9ce23190a2d9d0b9dd6463e3fa298f0
    environment:
      POSTGRES_PASSWORD: ${DB_PASSWORD}
      POSTGRES_USER: ${DB_USERNAME}
      POSTGRES_DB: ${DB_DATABASE_NAME}
      POSTGRES_INITDB_ARGS: '--data-checksums'
    volumes:
      # Do not edit the next line. If you want to change the database storage location on your system, edit the value of DB_DATA_LOCATION in the .env file
      - ${DB_DATA_LOCATION}:/var/lib/postgresql/data
    command: ["postgres", "-c", "shared_preload_libraries=vectors.so", "-c", 'search_path="$$user", public, vectors', "-c", "logging_collector=on", "-c", "max_wal_size=2GB", "-c", "shared_buffers=512MB", "-c", "wal_compression=on"]
    restart: always

volumes:
  model-cache:

networks:
  default:

TheOnlyWayUp · 2024-12-08T03:58:22Z

Immich has downtime as well.

Related: #656 #734 #752

Related documentation, https://docs.dokploy.com/docs/core/troubleshooting#docker-compose-domain-not-working

version: '3'
services:
  umami:
    image: ghcr.io/umami-software/umami:postgresql-latest
    ...
    expose:
      - 3000
    ports:
-      - 4999:3000
+     - 3000
    networks:
      - default
      
  db:
    image: postgres:15-alpine
    ...
    networks:
      - default

networks:
  default:
    driver: bridge

I'm trying this just to check, I need the ports forwarded as I can't upload large files through the cloudflare-proxied domain for Immich, for example.

TheOnlyWayUp · 2024-12-08T04:06:39Z

The ghost service goes down often (not from Dokploy's template), and has no ports forwarded.

version: '3.1'

services:

  ghost:
    image: ghost:5-alpine
    expose:
      - 2368
    networks:
      - default
    ...
    labels:
    - "traefik.enable=true"

    # Middleware for replacing content in the body
    - "traefik.http.middlewares.inject-script.plugin.rewrite.rewrites[0].regex=</head>"
    - "traefik.http.middlewares.inject-script.plugin.rewrite.rewrites[0].replacement=<script defer src='https://stats.towu.dev/script.js' data-website-id='4d72a7bf-3049-4c82-8ff4-05c0bc4f8edf'></script></head>"

    # Link the middleware to the router
    - "traefik.http.routers.blog.middlewares=inject-script"

    depends_on:
      - db

  db:
    image: mysql:8.0
    ...
    expose:
      - 3306

volumes:
  ghost_ghost:
    external: true
  ghost_db:
    external: true

networks:
  default:

I'll keep an eye on the uptime

Siumauricio · 2024-12-09T04:13:57Z

I know what could be the error, currently there is a very rare bug related to docker compose, if you use the name of a duplicate service in several places it is possible that the information is mixed somehow, I have not yet found a solution to this problem, my suggestion would be, change the name of the service

services:
      db:
          .....

to something like this

services:
      ghost-db:
          .....

TheOnlyWayUp · 2024-12-10T14:04:47Z

I've updated my services to use prefixed names, I guess that's what the randomize compose is for.

Is there anything I can do to provide some more insight? Traefik logs, if you lmk how I can get em. (docker logs would be enough?)

Likely related, umami-software/umami#3080 (reply in thread) - I believe another service was attempting to access Umami's database, leading to that error.

TheOnlyWayUp · 2024-12-10T14:08:13Z

Oh, is it because all the containers are part of the dokploy-network network, and names are resolved over this network instead of default? Dokploy also removes the default network unless it's explicitly included in the compose.

TheOnlyWayUp · 2024-12-10T15:00:46Z

@Siumauricio I updated the services to have unique names and rebuilt the project

Still having uptime issues, this is my updated compose

version: '3'
services:
  umami:
    image: ghcr.io/umami-software/umami:postgresql-latest
    environment:
      DATABASE_URL: postgresql://umami:xxx@umami_db:5432/umami
    depends_on:
      db:
        condition: service_healthy
    restart: always
    healthcheck:
      test: ['CMD-SHELL', 'curl http://localhost:3000/api/heartbeat']
      interval: 5s
      timeout: 5s
      retries: 5
    expose:
      - 3000
    ports:
      - 3000
    networks:
      - default
      
  umami_db:
    image: postgres:15-alpine
    expose:
      - 5432
    volumes:
      - ...:/var/lib/postgresql/data
    restart: always
    healthcheck:
      test: ['CMD-SHELL', 'pg_isready -U ${POSTGRES_USER} -d ${POSTGRES_DB}']
      interval: 5s
      timeout: 5s
      retries: 5
    networks:
      - default

networks:
  default:
    driver: bridge

Siumauricio · 2024-12-14T18:24:23Z

the problem still persists?

TheOnlyWayUp · 2024-12-14T20:21:55Z

Yep,

kamellperry · 2024-12-18T19:28:09Z

Experiencing very similar issues. I'm also using the cloud hosted version of dokploy instead of self hosted because I thought that might have been why. After doing some digging It's definitely the reverse proxy stuff.

2shrestha22 · 2024-12-20T07:32:56Z

My server was down 5 min ago. I am not monitoring but I assume this is still a issue.

TheOnlyWayUp · 2024-12-22T12:20:31Z

@Siumauricio This issue is causing me a lot of trouble, is there anything I can do to help?

Siumauricio · 2024-12-23T06:43:36Z

Yes I definitely think it is a bug in docker at the network level, I think we must find a solution to this problem because currently we can not have 2 instances of the same template because sometimes it causes the information to be mixed which is a very strange behavior, I will investigate in more detail how to solve this, the idea would be to isolate the docker compose in a separate network.

TheOnlyWayUp · 2025-01-03T18:48:31Z

@Siumauricio I tried the fix in #1004 (randomize compose names) and the uptime hasn't improved at all.

This issue is urgent and affecting my users. Broken networking is a dealbreaker, is there anything else I can try?

Last ditch effort would be disabling Traefik and using a reverse proxy on host networking, or moving to another platform - which is a huge effort.

Are there any blockers for this issue? Any logs or information you need? Anything?

dreiekk · 2025-01-04T00:48:13Z

I'm having similiar problems - randomizing compose names also didn't fix it for me.

I also suspect it has something to do with the same internal port which is published from similiar services/containers on the same dokploy-network or the traefik config gets broken because of that same internal port despite they are on different services.

Feel free to ping me as well if I can provide any logs, information or test something helpful to this issue.

I have 2 different dokploy projects on one server - each containing 2 docker compose services.
For example one docker compose is `nextcloud + mariadb + redis`.

I get this problem despite the nextcloud webserver images having different docker image tags/versions in both projects.

Whenever I deploy the service from the second project, the container of the first project is not reachable anymore with traefik error page "404 page not found".
When I now deploy the second docker compose, it will start working, but the first one gets a 540 Gateway Timeout.
I have to stop the second project and deploy the first one again to make the first one start working again.

I also defined a custom-named network for each service (docker compose), so the database and webserver in a single docker compose can communicate:

services:
  aaa-nextcloud-app:
    ...
    networks:
      - aaa-nextcloud-network
  aaa-nextcloud-db:
    ...
    networks:
      - aaa-nextcloud-network

networks:
  aaa-nextcloud-network:
    driver: bridge

There are also many other services running on my single server which are working fine and seem not to be affected by these beforementioned problamatic deployments.

dreiekk · 2025-01-04T01:20:31Z

FYI: Our current workaround is to set the port of the application/webserver itself inside the container to something different for each service. So the similiar webservers which would normally all listen on port 80 now listen on 81, 82, 83, ...

TheOnlyWayUp · 2025-01-05T16:18:28Z

Something else I noticed, whenever the services are unreachable, I'm unable to view logs from the dokploy dashboard, it's just empty.

The logs load when the service is available via the domain, which is weird, because it's reachable through port mappings regardless

theboringhumane · 2025-01-10T13:21:31Z

Go to traefik file system in your dokploy dashboard and do this

traefik.yml

providers:
  swarm:
    exposedByDefault: false
    watch: true
  docker:
    exposedByDefault: false
    watch: true
    network: dokploy-network

@TheOnlyWayUp

The error comes from your networks, you created 2 networks and the authelia container is assigned to both of them. Traefik, while forwarding, doesn't know which network to use. So you have to specify it in your docker provider configuration:

theboringhumane · 2025-01-10T13:23:14Z

@Siumauricio you can close this.

dreiekk · 2025-01-10T15:59:25Z

Thanks! @theboringhumane .

Have to observe it a bit more to be sure, but I suppose it's working now on my end.

I switched all my services to type stack instead of compose, added @theboringhumane 's traefik options and configured my networks inside the docker-compose.yml's like this:

networks:
  my_compose_network_123:
    driver: overlay
    name: my_compose_network_123
    attachable: true

TheOnlyWayUp · 2025-01-10T16:42:32Z

Hey @theboringhumane, thanks for the solution! I'll try it and update the issue.

The error comes from your networks, you created 2 networks and the authelia container is assigned to both of them. Traefik, while forwarding, doesn't know which network to use. So you have to specify it in your docker provider configuration:

I have a few questions,

What Authelia container?
What networks could be causing the conflict? The services are connected to the compose's default network (ocassionally that's a bridge network, but the problem persists regardless) and to dokploy's network.
Are you saying the traefik container is being added to multiple networks?

➜  ~ docker inspect dokploy-traefik.1.s2o77zzkq0hsqi8x7p837a8w8
[
    {
        ...
        "NetworkSettings": {
            "Bridge": "",
            "SandboxID": "a6e4d65f981460a574c39da307579f5e181301a20f21505d392970a6429a6073",
            "SandboxKey": "/var/run/docker/netns/a6e4d65f9814",
            "Ports": {
                "443/tcp": [
                    {
                        "HostIp": "0.0.0.0",
                        "HostPort": "443"
                    },
                    {
                        "HostIp": "::",
                        "HostPort": "443"
                    }
                ],
                "80/tcp": [
                    {
                        "HostIp": "0.0.0.0",
                        "HostPort": "80"
                    },
                    {
                        "HostIp": "::",
                        "HostPort": "80"
                    }
                ]
            },
            "Networks": {
                "dokploy-network": {
                   ...
                }
            }
        }
    ]

That doesn't look like the case.

The traefik container knows the service container through a single network. Would be nice to know why the issue was intermittent connectivity instead of the container being unreachable from the get-go.

you can close this.

Let's hold that off until our uptime recovers. Keeping in mind that this issue is intermittent, a few hours of uptime is normal behaviour without this fix.

theboringhumane · 2025-01-10T16:45:36Z

No your service container is in 2 networks, now traefik is confused between which network to redirect traffic to. So if we specify the network it'll work smooth

TheOnlyWayUp · 2025-01-10T17:07:48Z

My services are currently up! Thanks a ton @theboringhumane

I'll close the issue in 24h if it doesn't go down again.

theboringhumane · 2025-01-10T17:08:56Z

My services are currently up! Thanks a ton @theboringhumane

I'll close the issue in 24h if it doesn't go down again.

Happy to see it worked for you!

TheOnlyWayUp · 2025-01-12T14:55:26Z

I followed @theboringhumane's solution as-is, #821 (comment), replacing the first few lines of my traefik.yml

The fix works flawlessly, I don't understand the mechanism behind it - but it does the trick!

This works without adding a random prefix to service names or converting to stack.

Thanks for the fix!

theboringhumane · 2025-01-12T14:58:06Z

I followed @theboringhumane's solution as-is, #821 (comment), replacing the first few lines of my traefik.yml

The fix works flawlessly, I don't understand the mechanism behind it - but it does the trick!

This works without adding a random prefix to service names or converting to stack.

Thanks for the fix!

Because now traefik knows to which network redirect the traffic. Because in compose if you a network defined other than dokploy network then you have to let the traefik know which is going to serve the http requests. Otherwise traefik will be waiting and you'll see a 504

TheOnlyWayUp added the bug Something isn't working label Dec 6, 2024

Siumauricio mentioned this issue Dec 27, 2024

network conflict #1004

Open

TheOnlyWayUp closed this as completed Jan 12, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Frequent 504s and Poor Uptime on Docker Compose deployments #821

Frequent 504s and Poor Uptime on Docker Compose deployments #821

TheOnlyWayUp commented Dec 6, 2024 •

edited

Loading

TheOnlyWayUp commented Dec 6, 2024

TheOnlyWayUp commented Dec 6, 2024

TheOnlyWayUp commented Dec 8, 2024

TheOnlyWayUp commented Dec 8, 2024

Siumauricio commented Dec 9, 2024

TheOnlyWayUp commented Dec 10, 2024

TheOnlyWayUp commented Dec 10, 2024

TheOnlyWayUp commented Dec 10, 2024

Siumauricio commented Dec 14, 2024

TheOnlyWayUp commented Dec 14, 2024

kamellperry commented Dec 18, 2024

2shrestha22 commented Dec 20, 2024

TheOnlyWayUp commented Dec 22, 2024

Siumauricio commented Dec 23, 2024

TheOnlyWayUp commented Jan 3, 2025

dreiekk commented Jan 4, 2025

dreiekk commented Jan 4, 2025

TheOnlyWayUp commented Jan 5, 2025

theboringhumane commented Jan 10, 2025 •

edited

Loading

theboringhumane commented Jan 10, 2025

dreiekk commented Jan 10, 2025

TheOnlyWayUp commented Jan 10, 2025 •

edited

Loading

theboringhumane commented Jan 10, 2025

TheOnlyWayUp commented Jan 10, 2025

theboringhumane commented Jan 10, 2025

TheOnlyWayUp commented Jan 12, 2025

theboringhumane commented Jan 12, 2025

Frequent 504s and Poor Uptime on Docker Compose deployments #821

Frequent 504s and Poor Uptime on Docker Compose deployments #821

Comments

TheOnlyWayUp commented Dec 6, 2024 • edited Loading

To Reproduce

Current vs. Expected behavior

Provide environment information

Which area(s) are affected? (Select all that apply)

Are you deploying the applications where Dokploy is installed or on a remote server?

Additional context

Will you send a PR to fix it?

TheOnlyWayUp commented Dec 6, 2024

TheOnlyWayUp commented Dec 6, 2024

TheOnlyWayUp commented Dec 8, 2024

TheOnlyWayUp commented Dec 8, 2024

Siumauricio commented Dec 9, 2024

TheOnlyWayUp commented Dec 10, 2024

TheOnlyWayUp commented Dec 10, 2024

TheOnlyWayUp commented Dec 10, 2024

Siumauricio commented Dec 14, 2024

TheOnlyWayUp commented Dec 14, 2024

kamellperry commented Dec 18, 2024

2shrestha22 commented Dec 20, 2024

TheOnlyWayUp commented Dec 22, 2024

Siumauricio commented Dec 23, 2024

TheOnlyWayUp commented Jan 3, 2025

dreiekk commented Jan 4, 2025

dreiekk commented Jan 4, 2025

TheOnlyWayUp commented Jan 5, 2025

theboringhumane commented Jan 10, 2025 • edited Loading

theboringhumane commented Jan 10, 2025

dreiekk commented Jan 10, 2025

TheOnlyWayUp commented Jan 10, 2025 • edited Loading

theboringhumane commented Jan 10, 2025

TheOnlyWayUp commented Jan 10, 2025

theboringhumane commented Jan 10, 2025

TheOnlyWayUp commented Jan 12, 2025

theboringhumane commented Jan 12, 2025

TheOnlyWayUp commented Dec 6, 2024 •

edited

Loading

theboringhumane commented Jan 10, 2025 •

edited

Loading

TheOnlyWayUp commented Jan 10, 2025 •

edited

Loading