
aws-ecs: LogDriver automatically creates log-resource-policies, exhausting unmodifiable resource limit #22307

Closed
trks1970 opened this issue Sep 30, 2022 · 23 comments · Fixed by #28495
Labels
@aws-cdk/aws-cloudwatch Related to Amazon CloudWatch @aws-cdk/aws-ecs Related to Amazon Elastic Container feature-request A feature should be added or improved. p1

Comments

@trks1970

Describe the bug

We are running 12 Fargate containers, each of them logging to CloudWatch via LogDriver, created with CDK.

However, the last 2 fail to deploy, due to
"Resource limit exceeded. (Service: CloudWatchLogs, Status Code: 400,

which I understand is a fixed limit of 10 with no way to change it.

However, it looks like the log resource policy is created automatically by LogDriver, which uses up that quota.

Expected Behavior

It should be possible to a) not create a policy, or b) reuse an existing policy.

Current Behavior

A log-resource-policy is created with each LogDriver.awsLogs call.

Reproduction Steps

Deploy 10 Containers with LogDriver

Possible Solution

No response

Additional Information/Context

No response

CDK CLI Version

2.40.0 (build 56ba2ab)

Framework Version

aws-cdk-lib:2.27.0

Node.js Version

16.15.1

OS

Win10

Language

Java

Language Version

Kotlin

Other information

No response

@trks1970 trks1970 added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Sep 30, 2022
@github-actions github-actions bot added the @aws-cdk/aws-cloudwatch Related to Amazon CloudWatch label Sep 30, 2022
@trks1970 trks1970 changed the title (module name): (short issue description) aws-ecs: LogDriver automatically creates log-resource-policies, exhausting unmodifiable resource limit Oct 4, 2022
@github-actions github-actions bot added the @aws-cdk/aws-ecs Related to Amazon Elastic Container label Oct 4, 2022
@pahud
Contributor

pahud commented Nov 23, 2022

@trks1970 This makes sense to me. Before we dive deep into the fix, can you provide reproduction steps with CDK code that can help us quickly reproduce this issue from our end?

@pahud pahud added response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. investigating This issue is being investigated and/or work is in progress to resolve the issue. and removed needs-triage This issue or PR still needs to be triaged. labels Nov 23, 2022
@pahud pahud self-assigned this Nov 23, 2022
@github-actions

This issue has not received a response in a while. If you want to keep this issue open, please leave a comment below and auto-close will be canceled.

@github-actions github-actions bot added the closing-soon This issue will automatically close in 4 days unless further comments are made. label Nov 25, 2022
@trks1970
Author

Will take some time to provide an example.

@github-actions github-actions bot removed closing-soon This issue will automatically close in 4 days unless further comments are made. response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. labels Nov 27, 2022
@mvs5465

mvs5465 commented Feb 8, 2023

Hi @pahud we are running into this too with some CDK-created CloudWatch log group policies:

Example 1, sending cloudtrail events to eventbridge and logging the events:

const loginEvent = aws_cloudtrail.Trail.onEvent(this, 'login-event', {
  eventPattern: {
    source: ['aws.signin'],
    detailType: ['AWS Console Sign In via CloudTrail'],
  },
});

loginEvent.addTarget(new targets.CloudWatchLogGroup(new logs.LogGroup(this, 'login-event-logs')));

Example 2, creating an OpenSearch domain:

new aws_opensearchservice.Domain(this, 'domain', {
  // Specifically enabling logging here
  logging: {
    slowSearchLogEnabled: true,
    appLogEnabled: true,
    slowIndexLogEnabled: true,
  },
});

@pahud
Contributor

pahud commented Feb 9, 2023

@mvs5465

Are you having the Resource limit exceeded error just with this? Can you provide a full sample that reproduces this error?

const loginEvent = aws_cloudtrail.Trail.onEvent(this, 'login-event', {
  eventPattern: {
    source: ['aws.signin'],
    detailType: ['AWS Console Sign In via CloudTrail'],
  },
});

loginEvent.addTarget(new targets.CloudWatchLogGroup(new logs.LogGroup(this, 'login-event-logs')));

@mvs5465

mvs5465 commented Feb 9, 2023

@pahud We already have 10 cloudwatch log group policies in our account. Running the code you mentioned will create one more, and thus the error.

@pahud
Contributor

pahud commented Feb 9, 2023

OK I see.

Looking at this

this.logGroup.grantWrite(containerDefinition.taskDefinition.obtainExecutionRole());

When we create a new AwsLogDriver() for a container definition, grantWrite() is always executed, adding a new resource policy to the log group.

return iam.Grant.addToPrincipalOrResource({

I don't have an immediate workaround, so I am marking this p1 for more visibility. Meanwhile, I will try to reproduce this in my account by creating 10 containers with the same log group.

@pahud pahud added p1 and removed investigating This issue is being investigated and/or work is in progress to resolve the issue. labels Feb 9, 2023
@pahud
Contributor

pahud commented Feb 9, 2023

@trks1970

I tried to create 10 containers with the 2 approaches below and didn't hit the error you mentioned.

import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';

export class EcsTsStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    const logGroup = new cdk.aws_logs.LogGroup(this, 'LG')
    const task = new cdk.aws_ecs.FargateTaskDefinition(this, 'Task', {
      cpu: 512,
      memoryLimitMiB: 4096,
    });

    const task2 = new cdk.aws_ecs.FargateTaskDefinition(this, 'Task2', {
      cpu: 512,
      memoryLimitMiB: 4096,
    });

    // this works
    for (let x = 0; x < 10; x++) {
      task.addContainer(`nginx${x}`, {
        image: cdk.aws_ecs.ContainerImage.fromRegistry('nginx:latest'),
        memoryLimitMiB: 256,
        logging: new cdk.aws_ecs.AwsLogDriver({
          logGroup,
          streamPrefix: 'demo-',
        })
      });
    }

    // this works as well
    for (let x = 0; x < 10; x++) {
      task2.addContainer(`nginx${x}`, {
        image: cdk.aws_ecs.ContainerImage.fromRegistry('nginx:latest'),
        memoryLimitMiB: 256,
        logging: new cdk.aws_ecs.AwsLogDriver({
          logGroup: new cdk.aws_logs.LogGroup(this, `LG${x}`),
          streamPrefix: 'demo-',
        })
      });
    }

  }
}

Can you provide me with the full cdk code that reproduces your error?

@pahud pahud added investigating This issue is being investigated and/or work is in progress to resolve the issue. response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. and removed p1 labels Feb 9, 2023
@pahud
Contributor

pahud commented Feb 9, 2023

OK I can reproduce this as below:

    // create 10 events each having a loggroup as the event target
    for (let x = 0; x < 10; x++) {
      const loginEvent = cdk.aws_cloudtrail.Trail.onEvent(this, `login-event${x}`, {
        eventPattern: {
          source: ['aws.signin'],
          detailType: ['AWS Console Sign In via CloudTrail'],
        },
      });
      
      loginEvent.addTarget(new cdk.aws_events_targets.CloudWatchLogGroup(
        new cdk.aws_logs.LogGroup(this, `login-event-logs${x}`),
      ));
    }

This will create 10 CloudWatch Logs resource policies, and according to the docs the hard limit is 10, hence the error.

The policies can be listed with aws logs describe-resource-policies.

I think this is not a bug but a limitation of CloudWatch Logs resource policies. From CDK's perspective we could probably offer a workaround, such as opting out of resource policy creation, but I am not sure if this is a good idea.

I am marking this as a p1 feature request for more visibility from the team.

@pahud pahud added p1 and removed bug This issue is a bug. investigating This issue is being investigated and/or work is in progress to resolve the issue. labels Feb 9, 2023
@pahud pahud added feature-request A feature should be added or improved. and removed response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. labels Feb 9, 2023
@pahud pahud removed their assignment Mar 2, 2023
@vinayak-kukreja
Contributor

Similar Issue: #20313

@marinecode

I have come up with a possible workaround.
When I moved the LogGroup creation to a separate stack, the ResourcePolicy was not created.
In the TaskDefinition stack, I referenced the created LogGroup by ARN via LogGroup.from_log_group_arn().
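A minimal TypeScript sketch of this workaround, based on the description above (stack, construct, and container names are illustrative assumptions, and I have not verified this end to end):

```typescript
import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';

// Stack 1 owns the log group.
class LogStack extends cdk.Stack {
  public readonly logGroupArn: string;

  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);
    const logGroup = new cdk.aws_logs.LogGroup(this, 'SharedLogGroup');
    this.logGroupArn = logGroup.logGroupArn;
  }
}

// Stack 2 imports the log group by ARN. As reported above, granting against
// an imported log group did not create a CloudWatch Logs resource policy.
class TaskStack extends cdk.Stack {
  constructor(scope: Construct, id: string, logGroupArn: string, props?: cdk.StackProps) {
    super(scope, id, props);
    const logGroup = cdk.aws_logs.LogGroup.fromLogGroupArn(this, 'ImportedLogGroup', logGroupArn);
    const task = new cdk.aws_ecs.FargateTaskDefinition(this, 'Task');
    task.addContainer('web', {
      image: cdk.aws_ecs.ContainerImage.fromRegistry('nginx:latest'),
      memoryLimitMiB: 256,
      logging: new cdk.aws_ecs.AwsLogDriver({ logGroup, streamPrefix: 'web' }),
    });
  }
}

const app = new cdk.App();
const logStack = new LogStack(app, 'LogStack');
new TaskStack(app, 'TaskStack', logStack.logGroupArn);
```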

@mramirez-fdb

Any update on this?

@XiaowenMaoA

@sakurai-ryo just saw the pull request on this. When do we expect this change to be available? Is there any local workaround in the meantime? Thanks

@sakurai-ryo
Contributor

@XiaowenMaoA
The CDK core team is currently on vacation, so the review will be done after the vacation.
Please check the other issues for workarounds.
#20313 (comment)

If this doesn't work, please send me the code and I may be able to suggest a workaround.

@XiaowenMaoA

XiaowenMaoA commented Dec 27, 2023

@sakurai-ryo we are not inside an ECS task. The logic that creates our OpenSearch domains looks like the following.
We have more than 10 TenantNames and create an OpenSearch Domain for each TenantName.

import { Domain, EncryptionAtRestOptions, EngineVersion } from 'aws-cdk-lib/aws-opensearchservice';

export class OpenSearchClusterStack extends DeploymentStack {
  public kmsKey: KmsKey;
  public cloudwatchPolicy: PolicyStatement;

  constructor(parent: App, name: string, props: OpenSearchClusterStackProps) {
    super(parent, name, {
      env: props.env,
      softwareType: SoftwareType.LONG_RUNNING_SERVICE,
      stackName: props.stackName,
      tags: props.tags,
    });

    const vpc = Utils.getVpc(this, this.region);

    const securityGroup = this.createSecurityGroup(vpc, props);

    this.kmsKey = this.createKmsKey(vpc, props);

    this.cloudwatchPolicy = new PolicyStatement({
      actions: ['logs:CreateLogStream', 'logs:PutLogEvents'],
      resources: [
        '*',
      ],
      effect: Effect.ALLOW,
      principals: [new ServicePrincipal('es.amazonaws.com')],
    });

  // create OS domain for each TenantName 
    Object.values(TenantNames).forEach((tenantNameStr) => {
      const tenantName = tenantNameStr as TenantNames;
      const clusterConfig = props.stage.clusterConfigByTenant[tenantName];
      const globaltenantConfig = TENANT_CONFIG[tenantName];
      const domainName = Utils.generateName(globaltenantConfig.clusterName, props.isOnebox)
        .toLowerCase();

      const domainAppLogGroupName = `${Utils.generateName(`${Constants.Resource.PREFIX}-OSDomain-AppLogGroup`, props.isOnebox)}-${domainName}`;
      const domainAppLogGroup = new LogGroup(this, domainAppLogGroupName, {
        logGroupName: domainAppLogGroupName,
        retention: RetentionDays.TEN_YEARS,
        removalPolicy: RemovalPolicy.RETAIN_ON_UPDATE_OR_DELETE,
      });

      const domain = this.createDomain(
        domainName,
        { enabled: true, kmsKey: this.kmsKey.key },
        domainAppLogGroup,
        vpc,
        securityGroup,
        props,
        clusterConfig,
      );
      domain.addAccessPolicies(this.cloudwatchPolicy);
    });
  }

  private createDomain(
    domainName: string,
    encryptionProperties: EncryptionAtRestOptions,
    appLogGroup: LogGroup,
    vpc: IVpc,
    securityGroup:SecurityGroup,
    props: OpenSearchClusterStackProps,
    clusterConfig: ClusterConfig,
  ) {

    return new Domain(this, domainName, {
      domainName,
      version: EngineVersion.OPENSEARCH_1_1,
      vpc,
      zoneAwareness: {
        enabled: true,
        availabilityZoneCount: vpc.privateSubnets.length,
      },
      capacity: {
        masterNodes:  clusterConfig.masterNodeCount,
        masterNodeInstanceType: clusterConfig.masterNodeType,
        dataNodes: clusterConfig.instanceCount,
        dataNodeInstanceType: clusterConfig.instanceType,
      },
      ebs: {
        volumeSize: clusterConfig.instanceVolumeSize,
        volumeType: clusterConfig.ebsVolumeType,
        iops: 18000,
      },
      securityGroups: [securityGroup],
      nodeToNodeEncryption: true,
      encryptionAtRest: encryptionProperties,
      enforceHttps: true,
      logging: {
        appLogEnabled: true,
        appLogGroup,
      },
      removalPolicy: RemovalPolicy.RETAIN_ON_UPDATE_OR_DELETE,
    });
  }
}

@sakurai-ryo
Contributor

Thanks @XiaowenMaoA

Looking at this part of the code, it appears that we are currently using CustomResource to create the ResourcePolicy for the logs.

logGroupResourcePolicy = new LogGroupResourcePolicy(this, `ESLogGroupPolicy${this.node.addr}`, {

Since the CustomResource is executed by CloudFormation after cdk deploy, it is difficult to remove the ResourcePolicy in the CDK code, for example with the tryRemoveChild method.
It may be possible to work around this with a new custom resource that rewrites or deletes the ResourcePolicy, though.

The PR I created is an ECS fix, and even if it is merged, it will not solve the problem on the OpenSearch side.
Would you be so kind as to create a new issue so we can discuss this further?
Note also that a ResourcePolicy can now be created in CloudFormation without the custom resource, and that the ResourcePolicy should be reused, as described in this document:
https://docs.aws.amazon.com/opensearch-service/latest/developerguide/createdomain-configure-slow-logs.html#:~:text=name%22%0A%20%20%20%20%7D%0A%7D-,Important,-CloudWatch%20Logs%20supports
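As a sketch of the reuse approach: aws-cdk-lib exposes an L2 ResourcePolicy construct in aws-logs, so a single policy covering many log groups can be declared directly instead of one policy per domain. The statement below follows the slow-logs document linked above; the policy name and log group prefix are illustrative assumptions:

```typescript
import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';

class SharedOpenSearchLogPolicyStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // One shared resource policy for every OpenSearch log group under a
    // common prefix, so additional domains do not consume more policy slots.
    new cdk.aws_logs.ResourcePolicy(this, 'SharedEsLogPolicy', {
      resourcePolicyName: 'shared-opensearch-log-policy', // illustrative name
      policyStatements: [
        new cdk.aws_iam.PolicyStatement({
          effect: cdk.aws_iam.Effect.ALLOW,
          principals: [new cdk.aws_iam.ServicePrincipal('es.amazonaws.com')],
          actions: ['logs:CreateLogStream', 'logs:PutLogEvents'],
          resources: [`arn:aws:logs:${this.region}:${this.account}:log-group:/aws/opensearch/*`],
        }),
      ],
    });
  }
}
```

This does not remove the policy that the Domain construct's custom resource creates, so it only helps once logging can be configured without that custom resource.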

@XiaowenMaoA

@sakurai-ryo I found one related issue created earlier, #23637
Can we use that to track the log resource policy in the OpenSearch case?
This is blocking our prod readiness, because we will certainly scale up to more than 10 domains.

I cannot find a way to pass a shared ResourcePolicy while keeping logging enabled. Is there any possible workaround?

@mergify mergify bot closed this as completed in #28495 Jan 10, 2024
mergify bot pushed a commit that referenced this issue Jan 10, 2024
This PR modifies the ECS log driver to avoid creating unnecessary `ResourcePolicy` resources in CloudWatch Logs.

## issue summary
The related issue reports an error when using the awslogs driver on ECS.
The error occurs because creating a ResourcePolicy in CloudWatch Logs hits the maximum number of ResourcePolicies.
https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/cloudwatch_limits_cwl.html

## Current behavior
In some cases this ResourcePolicy is created, and in other cases it is not.
Currently, `Grant.addToPrincipalOrResource` is used to grant permissions to the ExecutionRole and the Log Group in the ECS task definition.
https://github.com/aws/aws-cdk/blob/607dccb0fd920d25f0fe2613b83c9830322c439e/packages/aws-cdk-lib/aws-ecs/lib/log-drivers/aws-log-driver.ts#L138
https://github.com/aws/aws-cdk/blob/607dccb0fd920d25f0fe2613b83c9830322c439e/packages/aws-cdk-lib/aws-logs/lib/log-group.ts#L194
https://github.com/aws/aws-cdk/blob/607dccb0fd920d25f0fe2613b83c9830322c439e/packages/aws-cdk-lib/aws-iam/lib/grant.ts#L122

`Grant.addToPrincipalOrResource` first grants permissions to the grantee (the ExecutionRole), and then creates a resource-based policy for cross-account access when certain conditions are not met.
The condition depends on the ExecutionRole's `principalAccount`, the account ID in `env.account`, and whether these are Tokens; in this scenario, cross-account access is not necessary.
https://github.com/aws/aws-cdk/blob/607dccb0fd920d25f0fe2613b83c9830322c439e/packages/aws-cdk-lib/aws-iam/lib/grant.ts#L141

Also, at the time the `LogGroup.grantWrite` call was added to `aws-log-driver.ts`, the ResourcePolicy for logs could not yet be created from CloudFormation, so permissions were only granted to the ExecutionRole.
#1291
![Screenshot 2023-12-27 1 08 20](https://github.com/aws/aws-cdk/assets/58683719/5a17a041-d560-45fa-bac6-cdc3894b18bc)
https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/ReleaseHistory.html

Therefore, the resource-based policy should not be necessary when using the awslogs driver.

## Major changes
This PR changes the code to grant permissions only to the ExecutionRole when using the awslogs driver.
With this fix, a ResourcePolicy is no longer created when using the awslogs driver.
I don't consider this a breaking change: it changes the content of the generated template but does not change the behavior of forwarding logs to CloudWatch Logs.
However, if it is considered a breaking change, a feature flag would be necessary.
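The direction of the change can be sketched like this (illustrative, not the actual diff from the PR): granting via `Grant.addToPrincipal` attaches the statement only to the role's identity policy and, unlike `Grant.addToPrincipalOrResource`, never falls back to creating a CloudWatch Logs resource policy:

```typescript
import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';

class SketchStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    const logGroup = new cdk.aws_logs.LogGroup(this, 'LogGroup');
    const executionRole = new cdk.aws_iam.Role(this, 'ExecutionRole', {
      assumedBy: new cdk.aws_iam.ServicePrincipal('ecs-tasks.amazonaws.com'),
    });

    // Attach the log-write permissions to the role's identity policy only;
    // no resource-based policy is ever created on the log group.
    cdk.aws_iam.Grant.addToPrincipal({
      grantee: executionRole,
      actions: ['logs:CreateLogStream', 'logs:PutLogEvents'],
      resourceArns: [logGroup.logGroupArn],
    });
  }
}
```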

fixes #22307, fixes #20313

----

*By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license*

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.

GavinZZ pushed a commit that referenced this issue Jan 11, 2024
@sakurai-ryo
Contributor

@XiaowenMaoA
Yes, of course.
As I mentioned in my previous comment, it may be difficult to avoid this problem.
I would consider a PR to solve this problem.

@stoicad

stoicad commented May 9, 2024

Any update on this? We don't want to add policies to our roles, but we are forced to when defining the logs.

@sakurai-ryo
Contributor

Hi @stoicad.
Do you still have this problem when using aws-log-driver?

@stoicad

stoicad commented May 10, 2024

Hi @sakurai-ryo, yes we do. We are not allowed to touch the existing roles via CDK, but when defining this one:

public createContainer(taskDefinition: FargateTaskDefinition): void {
    // TODO: fix logging on container
    // Tried with Log Group ARN but still not working: "arn:aws:logs:us-east-1::log-group:/ecs/web-taskdefinition:*"
    // logging: LogDrivers.awsLogs({ streamPrefix: `project-name-${this.ecsContextParams.environment}-web` })

    const container = taskDefinition.addContainer(`project-name-${this.ecsContextParams.environment}-web-container`, {
      image: ContainerImage.fromRegistry(this.envConfig.WEB_ECS_IMAGE),
      // logging: LogDrivers.awsLogs({ streamPrefix: `project-name-${this.ecsContextParams.environment}-web` })
    });

    container.addPortMappings({ containerPort: 80 });
    container.addPortMappings({ containerPort: 443 });
  }

it tries to add permissions to the role attached to the ECS task (as a separate policy), although logs:* is already available on the role.

@sakurai-ryo
Contributor

Thanks @stoicad.
Does "existing roles" refer to the ECS execution role?
Could you share the entire code and errors to help us pinpoint the problem?
