Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: Container logging exits prematurely with error #2328

Closed
nmoroze opened this issue Mar 6, 2024 · 14 comments
Closed

[Bug]: Container logging exits prematurely with error #2328

nmoroze opened this issue Mar 6, 2024 · 14 comments
Labels
bug An issue with the library

Comments

@nmoroze
Copy link

nmoroze commented Mar 6, 2024

Testcontainers version

0.29.1

Using the latest Testcontainers version?

Yes

Host OS

Linux

Host arch

x86

Go version

1.22

Docker version

(NOTE: using podman, not docker)

Version:      3.4.4
API Version:  3.4.4
Go Version:   go1.18.1
Built:        Wed Dec 31 19:00:00 1969
OS/Arch:      linux/amd64

Docker info

host:
  arch: amd64
  buildahVersion: 1.23.1
  cgroupControllers:
  - memory
  - pids
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: 'conmon: /usr/bin/conmon'
    path: /usr/bin/conmon
    version: 'conmon version 2.0.25, commit: unknown'
  cpus: 32
  distribution:
    codename: jammy
    distribution: ubuntu
    version: "22.04"
  eventLogger: journald
  hostname: noah-workstation
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 2109978069
      size: 1
    - container_id: 1
      host_id: 10000
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 2109978069
      size: 1
    - container_id: 1
      host_id: 10000
      size: 65536
  kernel: 6.1.0-1033-oem
  linkmode: dynamic
  logDriver: journald
  memFree: 101372432384
  memTotal: 134160068608
  ociRuntime:
    name: crun
    package: 'crun: /usr/bin/crun'
    path: /usr/bin/crun
    version: |-
      crun version 0.17
      commit: 0e9229ae34caaebcb86f1fde18de3acaf18c6d9a
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +YAJL
  os: linux
  remoteSocket:
    exists: true
    path: /run/user/2109978069/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: false
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: 'slirp4netns: /usr/bin/slirp4netns'
    version: |-
      slirp4netns version 1.0.1
      commit: 6a7b16babc95b6a3056b33fb45b74a6f62262dd4
      libslirp: 4.6.1
  swapFree: 8589930496
  swapTotal: 8589930496
  uptime: 363h 0m 53.36s (Approximately 15.12 days)
plugins:
  log:
  - k8s-file
  - none
  - journald
  network:
  - bridge
  - macvlan
  volume:
  - local
registries: {}
store:
  configFile: /home/noah/.config/containers/storage.conf
  containerStore:
    number: 4
    paused: 0
    running: 0
    stopped: 4
  graphDriverName: overlay
  graphOptions: {}
  graphRoot: /home/noah/.local/share/containers/storage
  graphStatus:
    Backing Filesystem: extfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
    Using metacopy: "false"
  imageStore:
    number: 25
  runRoot: /run/user/2109978069/containers
  volumePath: /home/noah/.local/share/containers/storage/volumes
version:
  APIVersion: 3.4.4
  Built: 0
  BuiltTime: Wed Dec 31 19:00:00 1969
  GitCommit: ""
  GoVersion: go1.18.1
  OsArch: linux/amd64
  Version: 3.4.4

What happened?

I've found that LogConsumers sometimes stop receiving logs prematurely and dump an error.

I've built a minimal reproducible example of the specific issue. It consists of a very small container image based on distroless/base. The container executes the following program:

package main

import (
	"log"
	"time"
)

func main() {
	i := 0
	for {
		log.Printf("tick %d", i)
		i += 1
		time.Sleep(1 * time.Second)
	}
}

I've attached a prebuilt container to the issue as a tarball.

In order to reproduce the issue, first load the container tarball like so:

gunzip container.tar.gz
podman load -i container.tar

And then compile and run the following test:

package container_test

import (
	"context"
	"fmt"
	"testing"
	"time"

	tc "github.com/testcontainers/testcontainers-go"
)

type StdoutLogConsumer struct{}

func (lc *StdoutLogConsumer) Accept(l tc.Log) {
	fmt.Printf("%s", l.Content)
}

func TestLogIssue(t *testing.T) {
	req := tc.ContainerRequest{
		Image: "localhost/log_repro:latest",
		LogConsumerCfg: &tc.LogConsumerConfig{
			Consumers: []tc.LogConsumer{&StdoutLogConsumer{}},
		},
	}

	ctx := context.Background()
	container, err := tc.GenericContainer(ctx, tc.GenericContainerRequest{
		ContainerRequest: req,
		Started:          true,
	})
	if err != nil {
		t.Fatal(err)
	}
	t.Cleanup(func() {
		if err := container.Terminate(ctx); err != nil {
			t.Fatal(err)
		}
	})

	time.Sleep(10 * time.Second)

	fmt.Println("test over")
}

I'd expect this test to print container logs showing 9-10 ticks and exit cleanly. Instead, it consistently stops displaying output after 6 ticks, and always outputs an error message:

2024/03/06 19:21:03 github.com/testcontainers/testcontainers-go - Connected to docker:
  Server Version: 3.4.4
  API Version: 1.40
  Operating System: ubuntu
  Total Memory: 127945 MB
  Resolved Docker Host: unix:///run/user/2109978069/podman/podman.sock
  Resolved Docker Socket Path: /run/user/2109978069/podman/podman.sock
  Test SessionID: c7ce1db5c519f0188cc6943c86f8faccf62034e6b4d4c1ab5bf6f8ff896ecf3b
  Test ProcessID: 36d03253-4e13-441e-8149-88e46fb4fa15
2024/03/06 19:21:03 🐳 Creating container for image testcontainers/ryuk:0.6.0
2024/03/06 19:21:03 ✅ Container created: 15f55739196d
2024/03/06 19:21:03 🐳 Starting container: 15f55739196d
2024/03/06 19:21:03 ✅ Container started: 15f55739196d
2024/03/06 19:21:03 🚧 Waiting for container id 15f55739196d image: testcontainers/ryuk:0.6.0. Waiting for: &{Port:8080/tcp timeout:<nil> PollInterval:100ms}
2024/03/06 19:21:03 🔔 Container is ready: 15f55739196d
2024/03/06 19:21:03 🐳 Creating container for image localhost/log_repro:latest
2024/03/06 19:21:03 ✅ Container created: f8583fd38a61
2024/03/06 19:21:03 🐳 Starting container: f8583fd38a61
2024/03/06 19:21:03 ✅ Container started: f8583fd38a61
2024/03/06 19:21:03 🔔 Container is ready: f8583fd38a61
2024/03/06 19:21:03 tick 0
2024/03/06 19:21:04 tick 1
2024/03/06 19:21:05 tick 2
2024/03/06 19:21:06 tick 3
2024/03/06 19:21:07 tick 4
2024/03/06 19:21:08 tick 5
container log error: EOF. Stopping log consumer: Headers out of sync2024/03/06 19:21:13 🐳 Terminating container: f8583fd38a61
test over
2024/03/06 19:21:14 🚫 Container terminated: f8583fd38a61
PASS

I did notice some interesting things about this while trying to debug, but I'm not sure what to make of them. In the main loop of docker.startLogProduction(), it seems like the "use of closed network connection" error block consistently triggers before the EOF error is triggered. In addition, it seems like attempting to restart the connection when EOF is received allows the log consumer to get a few more logs, e.g.:

//_, _ = fmt.Fprintf(os.Stderr, "container log error: %+v. %s", err, logStoppedForOutOfSyncMessage)
//// if we would continue here, the next header-read will result into random data...
//return

now := time.Now()
since = fmt.Sprintf("%d.%09d", now.Unix(), int64(now.Nanosecond()))
goto BEGIN

Relevant log output

No response

Additional information

container.tar.gz

@nmoroze nmoroze added the bug An issue with the library label Mar 6, 2024
@nmoroze nmoroze changed the title [Bug]: Container logging errors out [Bug]: Container logging exits prematurely with error Mar 6, 2024
@nmoroze
Copy link
Author

nmoroze commented Mar 6, 2024

As an update, I've noticed that the amount of time the log consumer works for is the same as the "log production timeout" setting (e.g. by default, the log consumer stops working after five seconds).

Is this what the log production timeout setting is intended to control? It's not super clear to me from the docs.

@stevenh
Copy link
Collaborator

stevenh commented Mar 12, 2024

Your repo tar is no longer present, but #2346 might fix this issue as it refactors how logging start / stop is handled eliminating a race and subsequent deadlock. If you could test @nmoroze that would be great.

@nmoroze
Copy link
Author

nmoroze commented Mar 13, 2024

Hi @stevenh, thanks for pointing me to your PR. Unfortunately I don't think it solves my particular problem, the issue persists even using your branch.

I've re-attached the container tarball for repro to this comment in case you or anyone else is interested.

container.tar.gz

@stevenh
Copy link
Collaborator

stevenh commented Mar 13, 2024

I tried to repo locally but can't, here my archive:
log_repo.tgz

# build
(cd container; docker build -t localhost/log_repro:latest .)
# run test
go test -v -race

Output:

=== RUN   TestLogIssue
2024/03/13 18:28:51 github.com/testcontainers/testcontainers-go - Connected to docker: 
  Server Version: 25.0.3
  API Version: 1.44
  Operating System: Docker Desktop
  Total Memory: 15836 MB
  Resolved Docker Host: unix:///var/run/docker.sock
  Resolved Docker Socket Path: /var/run/docker.sock
  Test SessionID: ce94f6e5446cf4dac898cbe3506dbc726ab8896b36336393c1055fccee365497
  Test ProcessID: a04b6431-4468-45ed-8c50-ab58d412f0cf
2024/03/13 18:28:51 🐳 Creating container for image testcontainers/ryuk:0.6.0
2024/03/13 18:28:51 ✅ Container created: b7c0742de4e8
2024/03/13 18:28:51 🐳 Starting container: b7c0742de4e8
2024/03/13 18:28:52 ✅ Container started: b7c0742de4e8
2024/03/13 18:28:52 🚧 Waiting for container id b7c0742de4e8 image: testcontainers/ryuk:0.6.0. Waiting for: &{Port:8080/tcp timeout:<nil> PollInterval:100ms}
2024/03/13 18:28:52 🔔 Container is ready: b7c0742de4e8
2024/03/13 18:28:52 🐳 Creating container for image localhost/log_repro:latest
2024/03/13 18:28:52 ✅ Container created: 839d3fd30990
2024/03/13 18:28:52 🐳 Starting container: 839d3fd30990
2024/03/13 18:28:53 ✅ Container started: 839d3fd30990
2024/03/13 18:28:53 🔔 Container is ready: 839d3fd30990
2024/03/13 18:28:53 tick 0
2024/03/13 18:28:54 tick 1
2024/03/13 18:28:55 tick 2
2024/03/13 18:28:56 tick 3
2024/03/13 18:28:57 tick 4
2024/03/13 18:28:58 tick 5
2024/03/13 18:28:59 tick 6
2024/03/13 18:29:00 tick 7
2024/03/13 18:29:01 tick 8
2024/03/13 18:29:02 tick 9
2024/03/13 18:29:03 tick 10
test over
2024/03/13 18:29:03 🐳 Terminating container: 839d3fd30990
2024/03/13 18:29:03 🚫 Container terminated: 839d3fd30990
--- PASS: TestLogIssue (11.80s)
PASS
ok      log-test        12.844s

@stevenh
Copy link
Collaborator

stevenh commented Mar 13, 2024

I tried adding ProviderType: tc.ProviderPodman, to the request but still seems to want to use docker under WSL.

@stevenh
Copy link
Collaborator

stevenh commented Mar 13, 2024

Tried getting a podman working but its just way to flaky, vm totally hangs and then just wont work for me on WSL:

API forwarding for Docker API clients is not available due to the following startup failures.
        could not start api proxy since expected pipe is not available: podman-machine-default

@stevenh
Copy link
Collaborator

stevenh commented Mar 13, 2024

Finally got podman working after a kernel upgrade but still can't reproduce:

=== RUN   TestLogIssue
2024/03/13 20:22:37 github.com/testcontainers/testcontainers-go - Connected to docker:
  Server Version: 4.9.3
  API Version: 1.41
  Operating System: fedora
  Total Memory: 15836 MB
  Resolved Docker Host: unix:///run/user/1000/podman/podman.sock
  Resolved Docker Socket Path: /run/user/1000/podman/podman.sock
  Test SessionID: c63aa4e8af35d970a47b07194bf27bbbaaac5dfb929678b458c1e515fc310db4
  Test ProcessID: ab2fd2cd-f65f-4e99-9e3a-958ccd19d80b
2024/03/13 20:22:37 🐳 Creating container for image testcontainers/ryuk:0.6.0
2024/03/13 20:22:37 ✅ Container created: 1db0a21708f0
2024/03/13 20:22:37 🐳 Starting container: 1db0a21708f0
2024/03/13 20:22:38 ✅ Container started: 1db0a21708f0
2024/03/13 20:22:38 🚧 Waiting for container id 1db0a21708f0 image: testcontainers/ryuk:0.6.0. Waiting for: &{Port:8080/tcp timeout:<nil> PollInterval:100ms}
2024/03/13 20:22:38 🔔 Container is ready: 1db0a21708f0
2024/03/13 20:22:38 🐳 Creating container for image localhost/log_repro:latest
2024/03/13 20:22:38 ✅ Container created: 4e83a135fa90
2024/03/13 20:22:38 🐳 Starting container: 4e83a135fa90
2024/03/13 20:22:38 ✅ Container started: 4e83a135fa90
2024/03/13 20:22:38 🔔 Container is ready: 4e83a135fa90
2024/03/13 20:22:38 tick 0
2024/03/13 20:22:39 tick 1
2024/03/13 20:22:40 tick 2
2024/03/13 20:22:41 tick 3
2024/03/13 20:22:42 tick 4
2024/03/13 20:22:44 tick 6
2024/03/13 20:22:45 tick 7
2024/03/13 20:22:46 tick 8
2024/03/13 20:22:47 tick 9
test over
2024/03/13 20:22:48 🐳 Terminating container: 4e83a135fa90
2024/03/13 20:22:50 🚫 Container terminated: 4e83a135fa90
--- PASS: TestLogIssue (12.54s)
PASS
ok      log-test        12.546s

@nmoroze
Copy link
Author

nmoroze commented Mar 13, 2024

Thanks for the effort to reproduce this. I noticed you're using a newer version of Podman based on your latest logs, I'll try that out on my end and see if it helps (the reason for the old version I'm using is it's the default available on apt for Ubuntu 22.04).

@nmoroze
Copy link
Author

nmoroze commented Mar 14, 2024

@stevenh Looks like my reproducible example works fine with Podman v4.9.3 both with and without your change.

Unfortunately, after updating, the codebase where I originally found this bug exhibits different buggy behavior. Using testcontainers 0.29.1, it appears to hang indefinitely trying to terminate a container. Using your branch, the Terminate() calls to several containers seem to timeout with this error:

Get "http://%2Frun%2Fuser%2F2109978069%2Fpodman%2Fpodman.sock/v1.41/containers/cc50050f7d744f2d2ae77d0df598f087f1b74b06e2a66dfd537dd12b289b745d/logs?follow=1&since=1710436901.565258953&stderr=1&stdout=1&tail=": context deadline exceeded

I can see if I can come up with a reproducible example for this new issue.

@nmoroze
Copy link
Author

nmoroze commented Mar 15, 2024

Alright, I was able to reproduce this with a pretty simple example. I've attached a new image tar, this container runs a binary that does nothing and simply busy-loops forever.

The test code itself is also fairly minimal:

package container_test

import (
	"context"
	"fmt"
	"testing"

	tc "github.com/testcontainers/testcontainers-go"
)

type StdoutLogConsumer struct{}

func (lc *StdoutLogConsumer) Accept(l tc.Log) {
	fmt.Printf("%s", l.Content)
}

func TestLogIssue(t *testing.T) {
	req := tc.ContainerRequest{
		Image: "localhost/log_repro:latest",
		LogConsumerCfg: &tc.LogConsumerConfig{
			Consumers: []tc.LogConsumer{&StdoutLogConsumer{}},
		},
	}

	ctx := context.Background()

	container, err := tc.GenericContainer(ctx, tc.GenericContainerRequest{
		ContainerRequest: req,
		Started:          true,
	})
	if err != nil {
		t.Fatal(err)
	}
	t.Cleanup(func() {
		if err := container.Terminate(ctx); err != nil {
			t.Fatal(err)
		}
	})

	fmt.Println("test over")
}

The output I get:

2024/03/15 20:47:46 github.com/testcontainers/testcontainers-go - Connected to docker:
  Server Version: 4.9.3
  API Version: 1.41
  Operating System: ubuntu
  Total Memory: 127946 MB
  Resolved Docker Host: unix:///run/user/2109978069/podman/podman.sock
  Resolved Docker Socket Path: /run/user/2109978069/podman/podman.sock
  Test SessionID: 61f8be8f5fb4a6f37c09cd82a69ea05022e3a674641b2add09d5c5593b17f0c2
  Test ProcessID: 9ffc22d9-7015-4555-be43-c7ebd115a846
2024/03/15 20:47:46 🐳 Creating container for image testcontainers/ryuk:0.6.0
2024/03/15 20:47:46 ✅ Container created: 1c9f5ce3ad48
2024/03/15 20:47:46 🐳 Starting container: 1c9f5ce3ad48
2024/03/15 20:47:46 ✅ Container started: 1c9f5ce3ad48
2024/03/15 20:47:46 🚧 Waiting for container id 1c9f5ce3ad48 image: testcontainers/ryuk:0.6.0. Waiting for: &{Port:8080/tcp timeout:<nil> PollInterval:100ms}
2024/03/15 20:47:46 🔔 Container is ready: 1c9f5ce3ad48
2024/03/15 20:47:46 🐳 Creating container for image localhost/log_repro:latest
2024/03/15 20:47:46 ✅ Container created: b3b72d3709fe
2024/03/15 20:47:46 🐳 Starting container: b3b72d3709fe
2024/03/15 20:47:46 ✅ Container started: b3b72d3709fe
2024/03/15 20:47:46 🔔 Container is ready: b3b72d3709fe
test over
2024/03/15 20:47:46 🐳 Terminating container: b3b72d3709fe
--- FAIL: TestLogIssue (6.78s)
    container_test.go:36: Get "http://%2Frun%2Fuser%2F2109978069%2Fpodman%2Fpodman.sock/v1.41/containers/b3b72d3709fed603ed1ddc15159307171130ac77a9981ee86610cbde2e05970f/logs?follow=1&stderr=1&stdout=1&tail=": context deadline exceeded
FAIL

container2.tar.gz

@stevenh
Copy link
Collaborator

stevenh commented Mar 15, 2024

Can you post the build for the container so we don't have to rely on images?

@nmoroze
Copy link
Author

nmoroze commented Mar 18, 2024

@stevenh Sure - I just wrote up a build file and simple Makefile to run everything. I've attached the sources as a tarball, untar and run make to reproduce.

repro.tar.gz

@stevenh
Copy link
Collaborator

stevenh commented Mar 19, 2024

Thanks @nmoroze looks like a bug in podman to me, if the container doesn't generate any log entries the request for log entries just hangs with podman where as docker returns an empty log.

I would suggest raising an issue with the podman team.

@nmoroze
Copy link
Author

nmoroze commented Mar 25, 2024

Thanks for taking a look! That makes sense, I'll close this issue for now and follow up on that side.

@nmoroze nmoroze closed this as completed Mar 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug An issue with the library
Projects
None yet
Development

No branches or pull requests

2 participants