Task containers dependencies resolution stuck forever #2579

Closed
taraspos opened this issue Aug 18, 2020 · 13 comments
Labels
kind/bug, kind/tracking (This issue is being tracked internally)

Comments

@taraspos

taraspos commented Aug 18, 2020

Summary

The task gets stuck in the PENDING state because of a problem during container dependency resolution.
Setting a start timeout has no effect either.
This reopens my previous issue #2350, with additional details and a Task Definition that reproduces the error.

Description

I have 3 containers:

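| Container name | Essential | Depends on container | Depends on state | Start timeout |
|----------------|-----------|----------------------|------------------|---------------|
| exit1          | false     |                      |                  | 10            |
| mysql          | true      | exit1                | SUCCESS          | 30            |
| nginx          | true      | mysql                | START            | 60            |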

(nginx depends on mysql which depends on exit1)

When container exit1 fails with exit code 1, containers mysql and nginx remain in the PENDING state forever.
When container exit1 succeeds with exit code 0, everything works fine.

I also tried setting the dependencies of the nginx container like this:

      [
        {
          "containerName": "mysql",
          "condition": "START"
        },
        {
          "containerName": "exit1",
          "condition": "SUCCESS"
        }
      ]

The result is the same.

However, the behavior seems to be correct if the order of the dependencies for nginx is changed:

      [
        {
          "containerName": "exit1",
          "condition": "SUCCESS"
        },
        {
          "containerName": "mysql",
          "condition": "START"
        }
      ]

Expected Behavior

The task fails to start.

Observed Behavior

The task is stuck in the PENDING state forever.

[Screenshot: Screen Shot 2020-08-18 at 5 02 17 PM]
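
The expected behavior can be sketched as failure propagation over the dependency graph. This is a minimal illustration in Python, not the ECS Agent's actual implementation; the function name `unsatisfiable_containers` and the data layout are assumptions made for this example. A container is treated as unsatisfiable if a dependency with condition SUCCESS exited non-zero, or if any of its dependencies is itself unsatisfiable (so it will never start).

```python
from typing import Dict, List, Set, Tuple

# Hypothetical model: container name -> list of (dependency name, condition).
DependsOn = Dict[str, List[Tuple[str, str]]]


def unsatisfiable_containers(depends_on: DependsOn,
                             exit_codes: Dict[str, int]) -> Set[str]:
    """Return the containers whose dependencies can never be satisfied.

    exit_codes maps containers that have already exited to their exit code.
    """
    blocked: Set[str] = set()
    changed = True
    # Repeat until no new container gets blocked, so the result does not
    # depend on the order in which containers or dependencies are listed.
    while changed:
        changed = False
        for name, deps in depends_on.items():
            if name in blocked:
                continue
            for dep, condition in deps:
                failed_success = (condition == "SUCCESS"
                                  and dep in exit_codes
                                  and exit_codes[dep] != 0)
                if failed_success or dep in blocked:
                    blocked.add(name)
                    changed = True
                    break
    return blocked


# The transitive case from this issue: nginx -> mysql -> exit1.
task = {
    "exit1": [],
    "nginx": [("mysql", "START")],
    "mysql": [("exit1", "SUCCESS")],
}
print(sorted(unsatisfiable_containers(task, {"exit1": 1})))  # ['mysql', 'nginx']
```

Under this model, both essential containers are unsatisfiable once exit1 exits with code 1, so the task would be expected to stop rather than stay in PENDING.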

Environment Details

  • ECS Agent version: 1.43.0

Task Definitions to reproduce:

With transitive dependencies - stuck as PENDING:

{
    "containerDefinitions": [
        {
            "command": [
                "cat",
                "123"
            ],
            "image": "alpine",
            "startTimeout": 10,
            "name": "exit1",
            "essential": false
        },
        {
            "image": "nginx",
            "startTimeout": 60,
            "dependsOn": [
                {
                    "containerName": "mysql",
                    "condition": "START"
                }
            ],
            "name": "nginx"
        },
        {
            "image": "mysql:5.7",
            "startTimeout": 30,
            "dependsOn": [
                {
                    "containerName": "exit1",
                    "condition": "SUCCESS"
                }
            ],
            "name": "mysql"
        }
    ],
    "memory": "100",
    "family": "reproduce-dependency-problem",
    "requiresCompatibilities": [
        "EC2"
    ],
    "cpu": "128"
}

With unordered multiple dependencies - stuck as PENDING:

{
    "containerDefinitions": [
        {
            "command": [
                "cat",
                "123"
            ],
            "image": "alpine",
            "startTimeout": 10,
            "name": "exit1",
            "essential": false
        },
        {
            "image": "nginx",
            "startTimeout": 60,
            "dependsOn": [
                {
                    "containerName": "mysql",
                    "condition": "START"
                },
                {
                    "containerName": "exit1",
                    "condition": "SUCCESS"
                }
            ],
            "name": "nginx"
        },
        {
            "image": "mysql:5.7",
            "startTimeout": 30,
            "dependsOn": [
                {
                    "containerName": "exit1",
                    "condition": "SUCCESS"
                }
            ],
            "name": "mysql"
        }
    ],
    "memory": "100",
    "family": "reproduce-dependency-problem",
    "requiresCompatibilities": [
        "EC2"
    ],
    "cpu": "128"
}

With logically ordered multiple dependencies - task STOPPED as expected ✅:

{
    "containerDefinitions": [
        {
            "command": [
                "cat",
                "123"
            ],
            "image": "alpine",
            "startTimeout": 10,
            "name": "exit1",
            "essential": false
        },
        {
            "image": "nginx",
            "startTimeout": 60,
            "dependsOn": [
                {
                    "containerName": "exit1",
                    "condition": "SUCCESS"
                },
                {
                    "containerName": "mysql",
                    "condition": "START"
                }
            ],
            "name": "nginx"
        },
        {
            "image": "mysql:5.7",
            "startTimeout": 30,
            "dependsOn": [
                {
                    "containerName": "exit1",
                    "condition": "SUCCESS"
                }
            ],
            "name": "mysql"
        }
    ],
    "memory": "100",
    "family": "reproduce-dependency-problem",
    "requiresCompatibilities": [
        "EC2"
    ],
    "cpu": "128"
}
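
One way to confirm the hang when reproducing with the task definitions above is to poll the task status and see whether it ever leaves PENDING. A minimal sketch, assuming a boto3 ECS client is passed in (only its `describe_tasks` call is used); the function name, the cluster name `my-cluster`, and the timeout values are placeholders for this example.

```python
import time
from typing import Callable


def wait_until_not_pending(ecs_client,
                           cluster: str,
                           task_arn: str,
                           timeout_s: float = 120.0,
                           poll_s: float = 5.0,
                           sleep: Callable[[float], None] = time.sleep,
                           clock: Callable[[], float] = time.monotonic) -> str:
    """Poll DescribeTasks until the task leaves PENDING or the timeout expires.

    Returns the last observed status. A return value of "PENDING" means the
    task was still pending when the timeout was reached.
    """
    deadline = clock() + timeout_s
    while True:
        response = ecs_client.describe_tasks(cluster=cluster, tasks=[task_arn])
        status = response["tasks"][0]["lastStatus"]
        if status != "PENDING" or clock() >= deadline:
            return status
        sleep(poll_s)


# Example usage (requires boto3 and AWS credentials; names are placeholders):
#
#   import boto3
#   ecs = boto3.client("ecs")
#   run = ecs.run_task(cluster="my-cluster",
#                      taskDefinition="reproduce-dependency-problem")
#   arn = run["tasks"][0]["taskArn"]
#   print(wait_until_not_pending(ecs, "my-cluster", arn))
```

The `sleep` and `clock` parameters exist only so the helper can be exercised without real waiting.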
@shubham2892
Contributor

shubham2892 commented Aug 18, 2020

@trane9991 Sorry that you are facing this issue. I will try to reproduce it on my end.

Meanwhile, could you send the task definition (the one with which you are seeing the issue) and the Agent logs to [email protected]?

@taraspos
Author

Hey @shubham2892
Thanks for the quick reply. I published the Task Definitions that reproduce the issue in the collapsible sections under "Task Definitions to reproduce" in the issue description :)

@taraspos
Author

Let me know if you are able to reproduce it, because it reproduces in 100% of cases for me with the Task Definitions shared in the issue description.

@shubham2892
Contributor

I was able to reproduce the PENDING-state behavior with the "With transitive dependencies" and "With unordered multiple dependencies" task definitions. I will mark this as a bug and work on getting it fixed.

@ubhattacharjya
Contributor

Hi,

The PR with the fix for the ordered container dependency problem is #2615.

Regards,
Utsa

@ellenthsu

This fix has been released as part of ECS Agent 1.44.4: https://github.com/aws/amazon-ecs-agent/releases/tag/1.44.4

Please update the Agent: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-agent-update.html. Alternatively, you can find the latest ECS-optimized AMIs containing ECS Agent 1.44.4 here: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-optimized_AMI.html
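
To check whether an instance already runs an Agent version that includes the fix, the reported version can be compared against 1.44.4. A minimal sketch, assuming the Agent introspection endpoint at `http://localhost:51678/v1/metadata` is reachable from the instance and returns a `Version` field of the form `Amazon ECS Agent - v1.44.4 (...)`; the helper names are hypothetical.

```python
import json
import re
import urllib.request
from typing import Optional, Tuple


def parse_agent_version(version_field: str) -> Optional[Tuple[int, ...]]:
    """Extract (major, minor, patch) from a string containing 'vX.Y.Z'."""
    match = re.search(r"v(\d+)\.(\d+)\.(\d+)", version_field)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def has_fix(version_field: str,
            fixed_in: Tuple[int, int, int] = (1, 44, 4)) -> bool:
    """True if the parsed version is at least the version carrying the fix."""
    parsed = parse_agent_version(version_field)
    return parsed is not None and parsed >= fixed_in


def fetch_agent_version(
        url: str = "http://localhost:51678/v1/metadata") -> str:
    """Read the Version field from the Agent introspection endpoint (assumed)."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return json.load(response)["Version"]


# Example usage on a container instance:
#
#   print(has_fix(fetch_agent_version()))
```

Versions are compared as integer tuples rather than strings, so that, for example, 1.100.0 sorts after 1.44.4.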

@lucabelmonte

lucabelmonte commented May 13, 2022

Hi,

As you can see in the attached screenshots, I have the same error as described above.
I'm currently using the agent version 1.60.1.

[Two screenshots attached]

Here is my task definition:

{
  "ipcMode": null,
  "executionRoleArn": null,
  "containerDefinitions": [
    {
      "dnsSearchDomains": null,
      "environmentFiles": null,
      "logConfiguration": {
        "logDriver": "awslogs",
        "secretOptions": null,
        "options": {
          "awslogs-group": "/ecs/test-sequential-container",
          "awslogs-region": "eu-west-1",
          "awslogs-stream-prefix": "ecs"
        }
      },
      "entryPoint": null,
      "portMappings": [],
      "command": [
        "/bin/bash",
        "-c",
        "echo ciao && sleep 10 && exit 1"
      ],
      "linuxParameters": null,
      "cpu": 0,
      "environment": [],
      "resourceRequirements": null,
      "ulimits": null,
      "dnsServers": null,
      "mountPoints": [],
      "workingDirectory": null,
      "secrets": null,
      "dockerSecurityOptions": null,
      "memory": 128,
      "memoryReservation": null,
      "volumesFrom": [],
      "stopTimeout": null,
      "image": "ubuntu",
      "startTimeout": null,
      "firelensConfiguration": null,
      "dependsOn": null,
      "disableNetworking": null,
      "interactive": null,
      "healthCheck": null,
      "essential": false,
      "links": null,
      "hostname": null,
      "extraHosts": null,
      "pseudoTerminal": null,
      "user": null,
      "readonlyRootFilesystem": null,
      "dockerLabels": null,
      "systemControls": null,
      "privileged": null,
      "name": "container-1"
    },
    {
      "dnsSearchDomains": null,
      "environmentFiles": null,
      "logConfiguration": {
        "logDriver": "awslogs",
        "secretOptions": null,
        "options": {
          "awslogs-group": "/ecs/test-sequential-container",
          "awslogs-region": "eu-west-1",
          "awslogs-stream-prefix": "ecs"
        }
      },
      "entryPoint": null,
      "portMappings": [],
      "command": [
        "/bin/bash",
        "-c",
        "echo container2 && sleep 10"
      ],
      "linuxParameters": null,
      "cpu": 0,
      "environment": [],
      "resourceRequirements": null,
      "ulimits": null,
      "dnsServers": null,
      "mountPoints": [],
      "workingDirectory": null,
      "secrets": null,
      "dockerSecurityOptions": null,
      "memory": 128,
      "memoryReservation": null,
      "volumesFrom": [],
      "stopTimeout": null,
      "image": "ubuntu",
      "startTimeout": 10,
      "firelensConfiguration": null,
      "dependsOn": [
        {
          "containerName": "container-1",
          "condition": "SUCCESS"
        }
      ],
      "disableNetworking": null,
      "interactive": null,
      "healthCheck": null,
      "essential": false,
      "links": null,
      "hostname": null,
      "extraHosts": null,
      "pseudoTerminal": null,
      "user": null,
      "readonlyRootFilesystem": null,
      "dockerLabels": null,
      "systemControls": null,
      "privileged": null,
      "name": "container-2"
    },
    {
      "dnsSearchDomains": null,
      "environmentFiles": null,
      "logConfiguration": {
        "logDriver": "awslogs",
        "secretOptions": null,
        "options": {
          "awslogs-group": "/ecs/test-sequential-container",
          "awslogs-region": "eu-west-1",
          "awslogs-stream-prefix": "ecs"
        }
      },
      "entryPoint": null,
      "portMappings": [],
      "command": [
        "echo",
        "container3"
      ],
      "linuxParameters": null,
      "cpu": 0,
      "environment": [],
      "resourceRequirements": null,
      "ulimits": null,
      "dnsServers": null,
      "mountPoints": [],
      "workingDirectory": null,
      "secrets": null,
      "dockerSecurityOptions": null,
      "memory": 128,
      "memoryReservation": null,
      "volumesFrom": [],
      "stopTimeout": null,
      "image": "ubuntu",
      "startTimeout": null,
      "firelensConfiguration": null,
      "dependsOn": [
        {
          "containerName": "container-2",
          "condition": "SUCCESS"
        }
      ],
      "disableNetworking": null,
      "interactive": null,
      "healthCheck": null,
      "essential": true,
      "links": null,
      "hostname": null,
      "extraHosts": null,
      "pseudoTerminal": null,
      "user": null,
      "readonlyRootFilesystem": null,
      "dockerLabels": null,
      "systemControls": null,
      "privileged": null,
      "name": "container-3"
    }
  ],
  "placementConstraints": [],
  "memory": null,
  "taskRoleArn": null,
  "compatibilities": [
    "EXTERNAL",
    "EC2"
  ],
  "taskDefinitionArn": "arn:aws:ecs:eu-west-1:XXXX:task-definition/test-sequential-container:15",
  "family": "test-sequential-container",
  "requiresAttributes": [
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "com.amazonaws.ecs.capability.logging-driver.awslogs"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "com.amazonaws.ecs.capability.docker-remote-api.1.19"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "ecs.capability.container-ordering"
    }
  ],
  "pidMode": null,
  "requiresCompatibilities": [
    "EC2"
  ],
  "networkMode": null,
  "runtimePlatform": null,
  "cpu": null,
  "revision": 15,
  "status": "ACTIVE",
  "inferenceAccelerators": null,
  "proxyConfiguration": null,
  "volumes": []
}

What am I missing?

Mentions: @ellenthsu

Thanks,

@angelcar
Contributor

Hi! @lucabelmonte
Thanks for reaching out.

I was able to repro the issue using your task definition. I will re-open this issue and label it as a bug. We'll work to resolve it.

Correct me if I'm wrong, but I think in this particular case the expectation would be that container-2 would just stop since SUCCESS is never going to happen.

@lucabelmonte

lucabelmonte commented May 16, 2022

Hi @angelcar,
Thank you for your answer.

Yes, exactly. I was expecting the second container, container-2, to go into the STOPPED state, and the running task to stop as well (because of the broken dependency, in this case).
I also noticed that it works when there are only 2 containers in the chain.

@angelcar
Contributor

@lucabelmonte, a fix has been merged and will soon be released.

angelcar added the kind/tracking label on May 25, 2022

@gregmoy

gregmoy commented Jul 6, 2022

I wasn't having this problem before but now I am after upgrading from 1.61.1 to 1.61.3...

@mssrivas
Contributor

The fix was released in 1.61.2

@jonioni

jonioni commented Oct 14, 2022

I'm wondering whether this fix has been applied to Fargate. I encountered a similar issue on Fargate tasks.
