aws_glue_crawler: unable to remove schedule from existing crawler #33194

Closed
1 task
gergobig opened this issue Jan 27, 2025 · 3 comments
Assignees
Labels
@aws-cdk/aws-scheduler Related to the AWS Scheduler service bug This issue is a bug. needs-cfn This issue is waiting on changes to CloudFormation before it can be addressed. p2 response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days.

Comments

@gergobig

gergobig commented Jan 27, 2025

Describe the bug

Suppose I have a crawler that triggers every 10 minutes between 12:20 AM and 12:50 AM, using cron(20-50/10 0 * * ? *). When I try to remove this schedule and make the crawler "Run on demand" by passing None as schedule, the schedule is not removed.
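To make the schedule concrete, here is a minimal sketch that expands the minute and hour fields of that expression into explicit run times. The helper `expand_cron_field` is hypothetical (it is not part of CDK or AWS) and only handles plain numeric values, ranges, steps, and the `*`/`?` wildcards. As far as I know, Glue evaluates cron schedules in UTC.

```python
from typing import List


def expand_cron_field(field: str, lo: int, hi: int) -> List[int]:
    """Expand one numeric cron field such as '20-50/10' into explicit values.

    Hypothetical helper for illustration only; lo/hi are the field's bounds.
    """
    values = set()
    for part in field.split(","):
        rng, _, step_s = part.partition("/")
        step = int(step_s) if step_s else 1
        if rng in ("*", "?"):
            start, end = lo, hi
        elif "-" in rng:
            a, b = rng.split("-")
            start, end = int(a), int(b)
        else:
            start = int(rng)
            # 'N/step' means "from N to the upper bound"; a bare 'N' is just N.
            end = hi if step_s else start
        values.update(range(start, end + 1, step))
    return sorted(values)


# Fields of cron(20-50/10 0 * * ? *): minutes, then hours.
minutes = expand_cron_field("20-50/10", 0, 59)
hours = expand_cron_field("0", 0, 23)
run_times = [f"{h:02d}:{m:02d}" for h in hours for m in minutes]
print(run_times)  # ['00:20', '00:30', '00:40', '00:50']
```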

When changing the schedule to None, the following appears in the cdk diff output:

[~] AWS::Glue::Crawler xyz/xyz
 xyz
 └─ [-] Schedule
     └─ {"ScheduleExpression":"cron(20-50/10 0 * * ? *)"}

However, the schedule persists and has not been removed on the AWS side.
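One way to confirm what the crawler looks like on the AWS side after a deploy is to read it back through the Glue API. The sketch below assumes boto3 is installed and credentials are configured; the helper names are hypothetical. `get_crawler` is the real boto3 Glue client call, but how an on-demand crawler is represented in its response (missing `Schedule` key versus an empty expression) is an assumption here, so both cases are treated as "no schedule".

```python
from typing import Any, Dict, Optional


def crawler_schedule_expression(response: Dict[str, Any]) -> Optional[str]:
    """Return the schedule expression from a get_crawler response, or None.

    Assumption: an on-demand crawler has no 'Schedule' entry or an empty
    'ScheduleExpression'; both are reported as None.
    """
    schedule = response.get("Crawler", {}).get("Schedule") or {}
    return schedule.get("ScheduleExpression") or None


def fetch_schedule(crawler_name: str) -> Optional[str]:
    """Query AWS for the crawler and return its current schedule expression."""
    import boto3  # imported lazily so the parsing helper works without boto3

    glue = boto3.client("glue")
    return crawler_schedule_expression(glue.get_crawler(Name=crawler_name))


# Example with a hand-written response shaped like the get_crawler output.
sample = {
    "Crawler": {
        "Name": "example-crawler",
        "Schedule": {"ScheduleExpression": "cron(20-50/10 0 * * ? *)"},
    }
}
print(crawler_schedule_expression(sample))
```

Calling `fetch_schedule("example-crawler")` after redeploying with schedule=None would show whether the old expression is still attached.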

Regression Issue

  • Select this option if this issue appears to be a regression.

Last Known Working CDK Version

No response

Expected Behavior

Passing None as schedule removes the existing schedule and makes the crawler "Run on demand".

Current Behavior

Passing None as schedule leaves the previous configuration.

Reproduction Steps

Deploy this stack with a schedule and redeploy with schedule=None.

from aws_cdk import aws_iam, aws_s3, aws_glue, Stack, App, Environment
from constructs import Construct
from enum import Enum

class GlueCrawlerRecrawlBehavior(Enum):
    CRAWL_NEW_FOLDERS_ONLY = 'CRAWL_NEW_FOLDERS_ONLY'

    def __init__(self, values):
        self.values = values

class GlueCrawlerUpdateBehavior(Enum):
    LOG = 'LOG'

    def __init__(self, values):
        self.values = values

class ExampleCrawler(Stack):
    def __init__(self, scope: Construct, id: str) -> None:
        super().__init__(scope, id)

        bucket = aws_s3.Bucket(self, "example-bucket")

        crawler_role = aws_iam.Role(
            self,
            "example-crawler-role",
            role_name="example-crawler-role",
            assumed_by=aws_iam.ServicePrincipal('glue.amazonaws.com'),
            inline_policies={
                'crawler-policy': aws_iam.PolicyDocument(
                    statements=[
                        aws_iam.PolicyStatement(
                            actions=[
                                'logs:CreateLogGroup',
                                'logs:CreateLogStream',
                                'logs:PutLogEvents',
                                'cloudwatch:*',
                            ],
                            effect=aws_iam.Effect.ALLOW,
                            resources=['*'],
                        ),
                        aws_iam.PolicyStatement(
                            actions=[
                                'glue:*',
                            ],
                            effect=aws_iam.Effect.ALLOW,
                            resources=[
                                'arn:aws:glue:region:account:catalog',
                                'arn:aws:glue:region:account:database/example_database',
                                'arn:aws:glue:region:account:table/example_database/*',
                            ],
                        ),
                        aws_iam.PolicyStatement(
                            actions=['s3:GetObject', 's3:ListBucket', 's3:ListObjects'],
                            effect=aws_iam.Effect.ALLOW,
                            resources=[
                                bucket.bucket_arn,
                                f"{bucket.bucket_arn}/*"
                            ],
                        )
                    ]
                )
            }
        )

        aws_glue.CfnCrawler(
            self,
            "example-crawler",
            role=crawler_role.role_arn,
            targets=aws_glue.CfnCrawler.TargetsProperty(
                s3_targets=[
                    aws_glue.CfnCrawler.S3TargetProperty(
                        path=f"s3://{bucket.bucket_name}/example-data/"
                    )
                ]
            ),
            database_name="example_database",
            name="example-crawler",
            schedule=aws_glue.CfnCrawler.ScheduleProperty(
                schedule_expression='cron(20-50/10 0 * * ? *)'
            ),
            # schedule=None,
            recrawl_policy=aws_glue.CfnCrawler.RecrawlPolicyProperty(
                recrawl_behavior=GlueCrawlerRecrawlBehavior.CRAWL_NEW_FOLDERS_ONLY.value
            ),
            schema_change_policy=aws_glue.CfnCrawler.SchemaChangePolicyProperty(
                delete_behavior='LOG',
                update_behavior=GlueCrawlerUpdateBehavior.LOG.value,
            ),
        )


if __name__ == '__main__':
    app = App()
    ExampleCrawler(app, 'example-crawler', env=Environment(account='xyz', region='region'))
    app.synth()

Possible Solution

No response

Additional Information/Context

No response

CDK CLI Version

2.128.0

Framework Version

No response

Node.js Version

nodejs: 18

OS

alpine 3

Language

Python

Language Version

No response

Other information

No response

@gergobig gergobig added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Jan 27, 2025
@github-actions github-actions bot added the @aws-cdk/aws-scheduler Related to the AWS Scheduler service label Jan 27, 2025
@ashishdhingra ashishdhingra self-assigned this Jan 27, 2025
@ashishdhingra ashishdhingra added p2 needs-reproduction This issue needs reproduction. and removed needs-triage This issue or PR still needs to be triaged. labels Jan 27, 2025
@ashishdhingra
Contributor

I used similar TypeScript code, shown below:

import * as cdk from 'aws-cdk-lib';
import * as iam from 'aws-cdk-lib/aws-iam';
import * as s3 from 'aws-cdk-lib/aws-s3';
import * as glue from 'aws-cdk-lib/aws-glue';

enum GlueCrawlerRecrawlBehavior {
  CRAWL_NEW_FOLDERS_ONLY = 'CRAWL_NEW_FOLDERS_ONLY'
}

enum GlueCrawlerUpdateBehavior {
  LOG = 'LOG'
}

export class GlueClawlerStack extends cdk.Stack {
  constructor(scope: cdk.App, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    const bucket = new s3.Bucket(this, "example-bucket");

    const crawlerRole = new iam.Role(this, "example-crawler-role", {
      roleName: "example-crawler-role",
      assumedBy: new iam.ServicePrincipal('glue.amazonaws.com'),
      inlinePolicies: {
        'crawler-policy': new iam.PolicyDocument({
          statements: [
            new iam.PolicyStatement({
              actions: [
                'logs:CreateLogGroup',
                'logs:CreateLogStream',
                'logs:PutLogEvents',
                'cloudwatch:*',
              ],
              effect: iam.Effect.ALLOW,
              resources: ['*'],
            }),
            new iam.PolicyStatement({
              actions: [
                'glue:*',
              ],
              effect: iam.Effect.ALLOW,
              resources: [
                'arn:aws:glue:region:account:catalog',
                'arn:aws:glue:region:account:database/example_database',
                'arn:aws:glue:region:account:table/example_database/*',
              ],
            }),
            new iam.PolicyStatement({
              actions: ['s3:GetObject', 's3:ListBucket', 's3:ListObjects'],
              effect: iam.Effect.ALLOW,
              resources: [
                bucket.bucketArn,
                `${bucket.bucketArn}/*`
              ],
            })
          ]
        })
      }
    });

    new glue.CfnCrawler(this, "example-crawler", {
      role: crawlerRole.roleArn,
      targets: {
        s3Targets: [
          {
            path: `s3://${bucket.bucketName}/example-data/`
          }
        ]
      },
      databaseName: "example_database",
      name: "example-crawler",
      schedule: {
        scheduleExpression: 'cron(20-50/10 0 * * ? *)'
      },
      recrawlPolicy: {
        recrawlBehavior: GlueCrawlerRecrawlBehavior.CRAWL_NEW_FOLDERS_ONLY
      },
      schemaChangePolicy: {
        deleteBehavior: 'LOG',
        updateBehavior: GlueCrawlerUpdateBehavior.LOG
      },
    });
  }
}

This synthesizes the following CFN template:

Resources:
  examplebucketC9DFA43E:
    Type: AWS::S3::Bucket
    UpdateReplacePolicy: Retain
    DeletionPolicy: Retain
    Metadata:
      aws:cdk:path: GlueClawlerStack/example-bucket/Resource
  examplecrawlerrole1B62B8EE:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Statement:
          - Action: sts:AssumeRole
            Effect: Allow
            Principal:
              Service: glue.amazonaws.com
        Version: "2012-10-17"
      Policies:
        - PolicyDocument:
            Statement:
              - Action:
                  - cloudwatch:*
                  - logs:CreateLogGroup
                  - logs:CreateLogStream
                  - logs:PutLogEvents
                Effect: Allow
                Resource: "*"
              - Action: glue:*
                Effect: Allow
                Resource:
                  - arn:aws:glue:region:account:catalog
                  - arn:aws:glue:region:account:database/example_database
                  - arn:aws:glue:region:account:table/example_database/*
              - Action:
                  - s3:GetObject
                  - s3:ListBucket
                  - s3:ListObjects
                Effect: Allow
                Resource:
                  - Fn::GetAtt:
                      - examplebucketC9DFA43E
                      - Arn
                  - Fn::Join:
                      - ""
                      - - Fn::GetAtt:
                            - examplebucketC9DFA43E
                            - Arn
                        - /*
            Version: "2012-10-17"
          PolicyName: crawler-policy
      RoleName: example-crawler-role
    Metadata:
      aws:cdk:path: GlueClawlerStack/example-crawler-role/Resource
  examplecrawler:
    Type: AWS::Glue::Crawler
    Properties:
      DatabaseName: example_database
      Name: example-crawler
      RecrawlPolicy:
        RecrawlBehavior: CRAWL_NEW_FOLDERS_ONLY
      Role:
        Fn::GetAtt:
          - examplecrawlerrole1B62B8EE
          - Arn
      Schedule:
        ScheduleExpression: cron(20-50/10 0 * * ? *)
      SchemaChangePolicy:
        DeleteBehavior: LOG
        UpdateBehavior: LOG
      Targets:
        S3Targets:
          - Path:
              Fn::Join:
                - ""
                - - s3://
                  - Ref: examplebucketC9DFA43E
                  - /example-data/
    Metadata:
      aws:cdk:path: GlueClawlerStack/example-crawler
  CDKMetadata:
    Type: AWS::CDK::Metadata
    Properties:
      Analytics: v2:deflate64:H4sIAAAAAAAA/y2KQQ7CIBAA39I7rG01qWf5AT7AIN0aCoWEBTkQ/m4aPM1kMjNMywLjoApxvVruzBvqMyltmSr0qnSF+sjaYmJi890aM+qAKoPDM55s7OMygti8iKo4jI1JpJCj7svfG/NhRdjp8p3uMI9wG3YyhsfskzkQZOcPVnt/DZUAAAA=
    Metadata:
      aws:cdk:path: GlueClawlerStack/CDKMetadata/Default
Parameters:
  BootstrapVersion:
    Type: AWS::SSM::Parameter::Value<String>
    Default: /cdk-bootstrap/hnb659fds/version
    Description: Version of the CDK Bootstrap resources in this environment, automatically retrieved from SSM Parameter Store. [cdk:skip]

When deployed, this creates the Glue crawler. In the AWS console, we can select "On demand" from the dropdown when editing the schedule.

Commenting out this code:

      schedule: {
        scheduleExpression: 'cron(20-50/10 0 * * ? *)'
      },

gives the following output for cdk diff:

start: Building 2d2ad069231fbdd6d8115dc152428649d7e146e4497b431a55f8d9df8d8796aa:139480602983-us-east-2
success: Built 2d2ad069231fbdd6d8115dc152428649d7e146e4497b431a55f8d9df8d8796aa:139480602983-us-east-2
start: Publishing 2d2ad069231fbdd6d8115dc152428649d7e146e4497b431a55f8d9df8d8796aa:139480602983-us-east-2
success: Published 2d2ad069231fbdd6d8115dc152428649d7e146e4497b431a55f8d9df8d8796aa:139480602983-us-east-2
Hold on while we create a read-only change set to get a diff with accurate replacement information (use --no-change-set to use a less accurate but faster template-only diff)
Stack GlueClawlerStack
Resources
[~] AWS::Glue::Crawler example-crawler examplecrawler
 └─ [-] Schedule
     └─ {"ScheduleExpression":"cron(20-50/10 0 * * ? *)"}


✨  Number of stacks with differences: 1

It generates a CloudFormation template without the Schedule property on the AWS::Glue::Crawler resource:

...
  examplecrawler:
    Type: AWS::Glue::Crawler
    Properties:
      DatabaseName: example_database
      Name: example-crawler
      RecrawlPolicy:
        RecrawlBehavior: CRAWL_NEW_FOLDERS_ONLY
      Role:
        Fn::GetAtt:
          - examplecrawlerrole1B62B8EE
          - Arn
      SchemaChangePolicy:
        DeleteBehavior: LOG
        UpdateBehavior: LOG
      Targets:
        S3Targets:
          - Path:
              Fn::Join:
                - ""
                - - s3://
                  - Ref: examplebucketC9DFA43E
                  - /example-data/
    Metadata:
      aws:cdk:path: GlueClawlerStack/example-crawler
...

Deploying it appears to update the AWS::Glue::Crawler resource in CloudFormation, but it doesn't actually remove the schedule settings. It looks like once the schedule is set, we cannot remove it through CloudFormation simply by removing the schedule property.

Using an empty string, scheduleExpression: '', appears to work.

@gergobig This appears to be a CloudFormation issue/limitation. Please use an empty string for scheduleExpression, i.e. schedule_expression='' in your Python code. If you have any concerns about this behavior, please open an issue with the CloudFormation team at https://github.com/aws-cloudformation/cloudformation-coverage-roadmap.
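A minimal sketch of that workaround on the Python side: rather than passing schedule=None, always pass a schedule whose expression falls back to an empty string. The helper `schedule_kwargs` is hypothetical; `schedule_expression` is the keyword accepted by `aws_glue.CfnCrawler.ScheduleProperty` in the reproduction code above. The CDK call is shown in a comment so the block runs without aws_cdk installed.

```python
from typing import Dict, Optional


def schedule_kwargs(expression: Optional[str]) -> Dict[str, str]:
    """Build keyword arguments for CfnCrawler.ScheduleProperty.

    Hypothetical helper: None (or '') means "Run on demand" and is sent as
    an empty string, so the Schedule property stays in the template instead
    of being dropped.
    """
    return {"schedule_expression": expression or ""}


# Intended use inside the stack (requires aws_cdk, not imported here):
#
#   aws_glue.CfnCrawler(
#       self,
#       "example-crawler",
#       ...,
#       schedule=aws_glue.CfnCrawler.ScheduleProperty(**schedule_kwargs(None)),
#   )

print(schedule_kwargs("cron(20-50/10 0 * * ? *)"))
print(schedule_kwargs(None))
```

Keeping the property present with an empty expression is what distinguishes this from schedule=None, which omits Schedule from the template entirely.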

Thanks,
Ashish

@ashishdhingra ashishdhingra added response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. needs-cfn This issue is waiting on changes to CloudFormation before it can be addressed. and removed needs-reproduction This issue needs reproduction. labels Jan 27, 2025
@gergobig
Author

Thanks @ashishdhingra! I find this behavior pretty weird on the CloudFormation side; however, your suggestion does work. I will raise this issue on their side.


Comments on closed issues and PRs are hard for our team to see.
If you need help, please open a new issue that references this one.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 28, 2025