-
To add an AWS Glue job instead of an AWS Lambda function to either Stage A or Stage B of my SDLF pipeline, would the Glue code go in the datalakelibrary repository or somewhere else? And where would the best place for the Glue job's CloudFormation yaml go (pipeline, dataset, stage a/b, etc.)?
-
I would advise against storing the Glue job code in the datalakeLibrary repository. The content of that repository is built into a Lambda layer intended to be used by the Lambda functions that are part of Stage A and Stage B. The Glue job code is of no use to them and would only make the layer bigger, which can be an issue. What usually happens is storing the Glue job code, together with the Glue job's CloudFormation template, in a repository distinct from all the SDLF repositories. This repository can be called `sdlf-transforms`, for example. Stage B (using the datalakeLibrary) runs the Glue job by a specific name, so that's the only thing you need to be careful about. There are alternatives to that, although they're not perfect: …
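As a rough illustration, a minimal template in that separate repository might look like the sketch below. The parameter names, job name, and script path here are placeholders I made up for the example, not SDLF conventions; adapt them to your own setup.

```yaml
AWSTemplateFormatVersion: "2010-09-09"
Description: Example Glue job deployed from a separate transforms repository (illustrative sketch)

Parameters:
  pArtifactsBucket:
    Type: String
    Description: S3 bucket holding the Glue job script (hypothetical parameter)
  pGlueJobRoleArn:
    Type: String
    Description: IAM role ARN assumed by the Glue job (hypothetical parameter)

Resources:
  rExampleGlueJob:
    Type: AWS::Glue::Job
    Properties:
      # Stage B looks the job up by name, so this must match the name
      # the Stage B transform is configured to run.
      Name: sdlf-example-glue-job
      Role: !Ref pGlueJobRoleArn
      GlueVersion: "4.0"
      Command:
        Name: glueetl
        ScriptLocation: !Sub s3://${pArtifactsBucket}/transforms/example_job.py
      DefaultArguments:
        "--job-language": "python"
      MaxRetries: 0
      NumberOfWorkers: 2
      WorkerType: G.1X
```

Since Stage B only needs the job's name, whatever naming you choose in this template is what you point the Stage B transform at.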
-
Also check out this example under sdlf-utils, which focuses on Glue job deployment: https://github.com/awslabs/aws-serverless-data-lake-framework/tree/main/sdlf-utils/pipeline-examples/glue-jobs-deployer