Upgraded ecs agent causes Error loading previously saved state from BoltDB #4119

Closed
pzcfoo opened this issue Mar 20, 2024 · 4 comments
Labels
researching actively looking into the issue



pzcfoo commented Mar 20, 2024

Summary

I upgraded the ECS agent on an external instance.
Since the upgrade, the ecs service keeps restarting.
The ECS agent exits after logging this error:

Error loading previously saved state: failed to load previous data from BoltDB: failed to load task engine state: did not find the task of container

Description

After the ECS agent upgrade, the service restarts continuously.
See the Supporting Log Snippets section.

Environment Details

Ubuntu 22.04.2 LTS

ecs agent version

Package: amazon-ecs-init
Version: 1.82.0-1
Status: install ok installed
Priority: optional
Section: misc
Maintainer: ecs-agent-dev <[email protected]>
Installed-Size: 103 MB
Depends: libc6 (>= 2.3.4), systemd, docker-ce (>= 17.12.0) | docker-engine (>= 1.6.0) | docker-ee | docker.io
Homepage: https://aws.amazon.com/ecs
Download-Size: unknown
APT-Manual-Installed: yes
APT-Sources: /var/lib/dpkg/status
Description: Starts the Amazon ECS Agent
 amazon-ecs-init may be run to register an EC2 instance as an Amazon ECS
 Container Instance.

docker info

 Client: Docker Engine - Community
 Version:    24.0.5
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.11.2
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.20.2
    Path:     /usr/libexec/docker/cli-plugins/docker-compose

Server:
 Containers: 14
  Running: 11
  Paused: 0
  Stopped: 3
 Images: 15
 Server Version: 24.0.5
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 8165feabfdfe38c65b599c4993d227328c231fca
 runc version: v1.1.8-0-g82f18fe
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: builtin
 Kernel Version: 5.15.0-86-generic
 Operating System: Ubuntu 22.04.2 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 6
 Total Memory: 15.61GiB
 Name: nmlcaap135
 ID: 48edb6c9-be8d-4bf2-b7ee-fb3b6be57bac
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

df -h

Filesystem                              Size  Used Avail Use% Mounted on
tmpfs                                   1.6G  168M  1.4G  11% /run
/dev/mapper/ubuntu--vg-ubuntu--lv        98G  4.8G   89G   6% /
tmpfs                                   7.9G     0  7.9G   0% /dev/shm
tmpfs                                   5.0M     0  5.0M   0% /run/lock
tmpfs                                   4.0M     0  4.0M   0% /sys/fs/cgroup
/dev/sda2                               2.0G  251M  1.6G  14% /boot
/dev/mapper/ubuntu--vg-ubuntu--lv--var   20G  8.1G   11G  44% /var
tmpfs                                   1.6G  4.0K  1.6G   1% /run/user/1892892083

Supporting Log Snippets

level=info time=2024-03-20T00:07:09Z msg="Agent version associated with task model in boltdb 1.75.0 is bigger or equal to threshold 1.0.0. Skipping transformation."

level=critical time=2024-03-20T00:07:09Z msg="Error loading previously saved state: failed to load previous data from BoltDB: failed to load task engine state: did not find the task of container XXXX: arn:aws:ecs:REGION:1111111111:task/XXXX/06548beea8f34300a560e8aa2e660cb" module=agent.go
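
For anyone triaging the same failure across several hosts, here is a minimal Python sketch that pulls the task ARN out of a critical line like the one above. The field pattern is inferred only from the two log lines quoted in this issue, not from any documented agent log grammar, and the helper names `parse_agent_log_line` and `orphaned_task_arn` are made up for illustration.

```python
import re

# Assumption: agent log lines are space-separated key=value pairs where a
# value is either a double-quoted string or a bare token. This is inferred
# from the lines quoted in this issue, not from an official format spec.
LOG_FIELD = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')
TASK_ARN = re.compile(r'arn:aws:ecs:[^:\s]+:[^:\s]+:task/\S+')


def parse_agent_log_line(line):
    """Split a key=value log line into a dict, stripping surrounding quotes."""
    fields = {}
    for key, value in LOG_FIELD.findall(line):
        if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
            value = value[1:-1]
        fields[key] = value
    return fields


def orphaned_task_arn(line):
    """Return the task ARN named in the BoltDB load error, or None."""
    msg = parse_agent_log_line(line).get("msg", "")
    if "did not find the task of container" not in msg:
        return None
    match = TASK_ARN.search(msg)
    return match.group(0) if match else None
```

Feeding it the agent log line by line and collecting the non-`None` results gives the task ARNs that the saved state could not be reconciled with.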

@hozkaya2000
Contributor

Hi @pzcfoo,

Is this issue still occurring, or were you able to get the agent to start? It is caused by a small edge case that corrupts task and container information while the agent is terminating. We are tracking this issue internally. As a temporary mitigation, you could try stopping the tasks on the instance before upgrading the agent. If the agent still does not start, I would suggest setting up the external instance with ECS from scratch, if that is feasible.

Thank you
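
The temporary mitigation (stopping tasks on the instance before upgrading the agent) could be scripted roughly as follows. This is only a sketch under stated assumptions: `ecs_client` is expected to behave like `boto3.client("ecs")`, whose `list_tasks` and `stop_task` calls are used here, while the function name `stop_tasks_on_instance` and the default `reason` text are invented for illustration. It is not taken from the agent's own tooling.

```python
def stop_tasks_on_instance(ecs_client, cluster, container_instance,
                           reason="Stopping before ECS agent upgrade"):
    """Stop every task the ECS API lists for one container instance.

    ecs_client: an object with boto3-style list_tasks / stop_task methods
    (assumption: normally boto3.client("ecs")).
    Returns the task ARNs that stop_task was called for.
    """
    # Collect all ARNs first so that stopping tasks does not disturb paging.
    task_arns = []
    kwargs = {"cluster": cluster, "containerInstance": container_instance}
    while True:
        page = ecs_client.list_tasks(**kwargs)
        task_arns.extend(page.get("taskArns", []))
        token = page.get("nextToken")
        if not token:
            break
        kwargs["nextToken"] = token

    for task_arn in task_arns:
        ecs_client.stop_task(cluster=cluster, task=task_arn, reason=reason)
    return task_arns
```

A call might look like `stop_tasks_on_instance(boto3.client("ecs"), "my-cluster", "<container instance ID or ARN>")`, where both arguments are placeholders. Tasks that belong to a service may be rescheduled by ECS, so it is worth confirming the instance is actually idle before running the upgrade.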

hozkaya2000 self-assigned this Apr 16, 2024
Author

pzcfoo commented Apr 16, 2024

Hi @hozkaya2000
Reinstalling the ECS agent (including deleting all related files) and registering with the cluster again fixed the issue.
On other hosts, stopping all tasks before upgrading prevented the problem.
Thanks

hozkaya2000 added the researching (actively looking into the issue) label Apr 16, 2024
Contributor

amogh09 commented May 6, 2024

We released a permanent fix for this issue in https://github.com/aws/amazon-ecs-agent/releases/tag/v1.82.3. Please reopen the issue if you see it again. :)

Thank you!
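
To check whether a host already carries the fix, the installed version string can be compared against the release named above. This is a minimal sketch: it assumes the `Version:` field of the `amazon-ecs-init` package (for example `1.82.0-1` in the report) tracks the agent version, and the names `FIXED_VERSION`, `parse_agent_version` and `has_boltdb_fix` are hypothetical.

```python
# Release named in the comment above as containing the permanent fix.
FIXED_VERSION = (1, 82, 3)


def parse_agent_version(version):
    """Turn '1.82.0-1' or 'v1.82.3' into a comparable tuple like (1, 82, 0).

    Assumption: the string is MAJOR.MINOR.PATCH with an optional leading 'v'
    and an optional '-<package revision>' suffix, which is ignored.
    """
    core = version.strip().lstrip("v").split("-", 1)[0]
    return tuple(int(part) for part in core.split("."))


def has_boltdb_fix(version):
    """True if the version is at or beyond the release with the fix."""
    return parse_agent_version(version) >= FIXED_VERSION
```

With the version from the original report, `has_boltdb_fix("1.82.0-1")` evaluates to `False`, so that host would still need the upgrade.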

amogh09 closed this as completed May 6, 2024
Contributor

amogh09 commented May 6, 2024

Fixed in #3987
