aws-ecs: Cluster.addAsgCapacityProvider blocks EC2 metadata in instances created by the ASG #28270
Comments
I tested this a bit further, looks like the task role still works with the SDK, it's just the region that is unable to be automatically determined. I've updated my task definitions to set

It only became a big headache for me because I had my containers set up in a cluster that I created using the console, then I decided to move it into my CDK stack, and all of a sudden my containers were failing with

So I think ideally either stop blocking EC2 instance metadata (or add a boolean option to turn this on/off), or make it very clear in the docs that SDK usage in ECS containers won't work the same as in normal EC2 instances if the cluster is set up with CDK.
I think your taskRole should work as expected if you provision your ecs cluster/service/task with CDK. I can't find any relevant document about the blocking for the metadata access but we need to be very careful if we remove it as it's been a stable module for a long time and this might cause additional issues or breaking changes. Your
@pahud Sure. It's an AspNetCore app, .NET 8. At the point where it was failing, it was trying to fetch SSM parameters for the app configuration using the

The task definition is pretty basic, just setting the network mode (I think bridge is the default anyway?), task role permissions for SSM and DynamoDB, and adding the container.

// Task Definition
_task = new(scope, $"{id}-TaskDefinition", new TaskDefinitionProps() {
    NetworkMode = NetworkMode.BRIDGE,
});
_task.TaskRole.AttachInlinePolicy(new(scope, $"{id}-Policy", new PolicyProps() {
    Statements = new PolicyStatement[] {
        new(new PolicyStatementProps() {
            Actions = new[] {
                "ssm:GetParameter",
                "ssm:GetParameters",
                "ssm:GetParametersByPath",
            },
            Effect = Effect.ALLOW,
            Resources = new[] {
                $"arn:aws:ssm:ap-southeast-2:{Account}:parameter/{InfrastructureConstants.SsmParameterPrefixGlobal}",
                $"arn:aws:ssm:ap-southeast-2:{Account}:parameter/{InfrastructureConstants.SsmParameterPrefixAuthService}",
            },
        }),
        new(new PolicyStatementProps() {
            Actions = new[] {
                "dynamodb:BatchWriteItem",
                "dynamodb:PutItem",
                "dynamodb:DescribeTable",
                "dynamodb:DeleteItem",
                "dynamodb:GetItem",
                "dynamodb:Query",
                "dynamodb:UpdateItem",
            },
            Effect = Effect.ALLOW,
            Resources = new[] {
                authSessionsTable.TableArn,
            },
        }),
    },
}));

// Container
var environment = new Dictionary<string, string>() {
    { "ASPNETCORE_ENVIRONMENT", "Production" },
    { "ASPNETCORE_URLS", "http://+:80" },
};
// -- This was only added after discovering the issue; you will need to remove this block to reproduce the problem.
if (!String.IsNullOrEmpty(props.Region)) {
    environment["AWS_REGION"] = props.Region;
    environment["AWS_DEFAULT_REGION"] = props.Region;
}
var container = _task.AddContainer($"{id}-Container", new ContainerDefinitionOptions() {
    Image = ContainerImage.FromEcrRepository(Repository),
    MemoryReservationMiB = 128,
    PortMappings = new IPortMapping[] {
        new PortMapping() { ContainerPort = 80, HostPort = 0 },
    },
    Environment = environment,
    Logging = new AwsLogDriver(new AwsLogDriverProps() {
        StreamPrefix = id,
    }),
});

// Service
var service = new Ec2Service(scope, $"{id}-Service", new Ec2ServiceProps() {
    Cluster = props.EcsCluster,
    TaskDefinition = _task,
    DesiredCount = 0, // I adjust this in the console once I push the code to the ECR repository.
});

The line that's failing in Program.cs:

var builder = WebApplication.CreateBuilder(args);
// Exception here. Note -- optional must be `false`. If true, it will fail silently and just not apply the configuration without throwing.
builder.Configuration
    .AddSystemsManager($"/{InfrastructureConstants.SsmParameterPrefixGlobal}", optional: false, reloadAfter: TimeSpan.FromMinutes(1))
    .AddSystemsManager($"/{InfrastructureConstants.SsmParameterPrefixAuthService}", optional: false, reloadAfter: TimeSpan.FromMinutes(1));

The error message in the logs on service startup:
I had a similar issue recently where access to the instance metadata endpoint was failing, resulting in aws-otel and aws-sdk throwing errors on container startup. I was also able to resolve the issue by simply adding

The reason behind the iptables block is documented at: https://docs.aws.amazon.com/AmazonECS/latest/bestpracticesguide/security-iam-roles.html

May I suggest that we add in the
Sounds good! Thank you!
I also came across that page when I was debugging my issue and thought "well I haven't done that, so that can't be the problem". I think it needs to be documented on the CDK side, that the
…ider for autoscaling (#28437)

### Why is this needed?

When adding an Auto Scaling group as a capacity provider using `Cluster.addAsgCapacityProvider`, and when the task definition being run uses the AWS_VPC network mode, the metadata service at `169.254.169.254` is blocked. This is a security best practice as detailed [here](https://docs.aws.amazon.com/AmazonECS/latest/bestpracticesguide/security-iam-roles.html). This practice is implemented [here](https://github.com/aws/aws-cdk/blame/2d9de189e583186f2b77386ae4fcfff42c864568/packages/aws-cdk-lib/aws-ecs/lib/cluster.ts#L502-L504). However, as a result, some applications, such as those raised in #28270 and the aws-otel package, cannot determine the AWS region, which causes the application to crash and exit.

### What does it implement?

This PR adds an override to the addContainer method of Ec2TaskDefinition that adds the AWS_REGION environment variable to the container if the network mode is set to AWS_VPC. The region is sourced at synth time from the stack that includes this construct. This environment variable is only required in the EC2 Capacity Provider mode and not in Fargate, because the failure to determine the region on startup only occurs when using the EC2 Capacity Provider with the AWS_VPC networking mode.

The initial issue addresses this during the `addAsgCapacityProvider` action, which targets the cluster. However, the task definition cannot be mutated at that point, so this change addresses it when the task definition is actually added to a service that meets all the requirements under which the failure to determine the region occurs.

Updated the relevant integration tests to reflect the new environment variable being created alongside user-defined environment variables.

Closes #28270

----

*By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license*
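To make the behaviour described in the PR text concrete, here is a minimal sketch of the idea as a pure function: inject `AWS_REGION` into a container's environment only when the network mode is awsvpc. It is not the code from the PR. `NetworkMode`, `ContainerOptions` and `withRegionEnvironment` are hypothetical stand-ins rather than aws-cdk-lib types, and letting a user-supplied `AWS_REGION` take precedence is an assumption of this sketch.

```typescript
// Hypothetical stand-ins for illustration; these are NOT the aws-cdk-lib types.
type NetworkMode = "awsvpc" | "bridge" | "host" | "none";

interface ContainerOptions {
  image: string;
  environment?: Record<string, string>;
}

// Returns the options unchanged unless the task uses awsvpc networking,
// in which case AWS_REGION is added alongside the user-defined variables.
// Assumption: a user-supplied AWS_REGION wins over the injected one.
function withRegionEnvironment(
  options: ContainerOptions,
  networkMode: NetworkMode,
  stackRegion: string,
): ContainerOptions {
  if (networkMode !== "awsvpc") {
    return options;
  }
  return {
    ...options,
    environment: { AWS_REGION: stackRegion, ...(options.environment ?? {}) },
  };
}

// Usage: the region would normally come from the enclosing stack at synth time.
const patched = withRegionEnvironment(
  { image: "auth-service", environment: { ASPNETCORE_ENVIRONMENT: "Production" } },
  "awsvpc",
  "ap-southeast-2",
);
console.log(patched.environment);
```

Returning a new object instead of mutating the input keeps the sketch easy to test in isolation.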
Describe the bug
Calling Cluster.addAsgCapacityProvider modifies the AutoScalingGroup to add a userdata script to the launch template which blocks access to 169.254.169.254.
In particular:
sudo iptables --insert FORWARD 1 --in-interface docker+ --destination 169.254.169.254/32 --jump DROP
Without the call to Cluster.addAsgCapacityProvider (and possibly other methods do this too), this isn't added.
This behaviour isn't documented anywhere that I can find, and causes the error No RegionEndpoint or ServiceURL configured from the SDK when using the default settings and expecting everything to be picked up from the environment (which is how it seems to work when creating the cluster from the console).
Expected Behavior
Don't block access to EC2 metadata, or at least clearly document it and offer alternatives if someone wants to make use of the TaskRole inside the container
Current Behavior
The EC2 metadata endpoint is blocked using
sudo iptables --insert FORWARD 1 --in-interface docker+ --destination 169.254.169.254/32 --jump DROP
Reproduction Steps
Possible Solution
Don't block access to EC2 metadata, or at least clearly document it and offer alternatives if someone wants to make use of the TaskRole inside the container
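One alternative that came up in the thread is to set the region explicitly through container environment variables rather than relying on instance metadata. As a rough, language-agnostic illustration written in TypeScript, an application could check those variables at startup and fail with a clear message. `resolveRegion` and `Env` are hypothetical names, and the lookup order (`AWS_REGION`, then `AWS_DEFAULT_REGION`) just mirrors the two variables set in the workaround above; it is not a statement about how any particular AWS SDK resolves the region.

```typescript
// Hypothetical helper for illustration; not part of any AWS SDK.
type Env = Record<string, string | undefined>;

// Returns the explicitly configured region, or throws a descriptive error
// instead of falling through to an instance metadata lookup.
function resolveRegion(env: Env): string {
  const region = env.AWS_REGION || env.AWS_DEFAULT_REGION;
  if (!region) {
    throw new Error(
      "No region configured: set AWS_REGION or AWS_DEFAULT_REGION in the container environment",
    );
  }
  return region;
}

// Usage with an explicit map; a real application would pass its process environment.
console.log(resolveRegion({ AWS_REGION: "ap-southeast-2" }));
```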
Additional Information/Context
No response
CDK CLI Version
2.114.0 (build 12879fa)
Framework Version
No response
Node.js Version
v18.16.1
OS
MacOS 13.6.2
Language
.NET
Language Version
.NET 8
Other information
No response