-
Hi SDLF community, I'm really new to SDLF and have some questions regarding the framework.

Data Ingestion
I work with Airbyte and would like to use Airbyte OSS for ingestion. As Airbyte will populate the raw catalogs, is it possible to bypass the crawlers?

Git & multi-environments
We use GitLab as our git hosting provider and would prefer to use it instead of CodeCommit. In all our projects we're using gitflow with several branches, and we have 4 environments. Can I update the CI/CD configuration to match our gitflow, or do you think it may break the framework too much?

IaC & multi-accounts deployment
Almost all my stacks are already migrated. I've seen that SDLF helps deploy a data lake into dev, test and prod accounts, but in my case I'd really like to follow the "Designing a data lake for growth and scale on the AWS Cloud" reference architecture as a base (in the FAQ they say it's better to start with this in order to anticipate future needs). Is it possible to achieve something similar to that architecture with SDLF?

CI/CD
CI/CD in this framework looks really good, but isn't it much simpler with CDK Pipelines?

Thanks,
-
You can remove the crawler from sdlf-dataset and also from sdlf-stageB - so yes it is entirely possible!
In several places the mapping between branches and environments is defined as follows:
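As an illustration only, such a mapping is essentially a lookup from git branch names to environment names. Here is a minimal Python sketch; the branch and environment names are assumptions for the example, not SDLF's actual values:

```python
# Hypothetical branch-to-environment mapping, similar in spirit to what
# a CI/CD setup defines. Names below are assumptions, not SDLF's values.
BRANCH_TO_ENV = {
    "dev": "dev",
    "test": "test",
    "master": "prod",
}

def environment_for(branch: str) -> str:
    """Return the target environment for a git branch, or raise if unmapped."""
    try:
        return BRANCH_TO_ENV[branch]
    except KeyError:
        raise ValueError(f"No environment mapped for branch {branch!r}")

print(environment_for("master"))  # prod
```

Renaming a branch or an environment then comes down to editing the entries of this mapping wherever it is defined.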
It's easy to change a branch name or an environment name by modifying these mappings. What you're asking is a bit harder though, I must admit. We want to eventually be more flexible; this is a feature I started working on, but it's not ready yet.
This reference architecture follows a pattern called data mesh. It is possible to modify SDLF to fit such a pattern; there are several ways to achieve that. We are also working on a new major version of the framework that would support this pattern out of the box.
CDK Pipelines makes things quite a bit easier! SDLF is written in CloudFormation, not CDK, which is why we don't use it. However, you might be interested in SDLF DDK Lightweight: it's a version of SDLF written with CDK, and it is very close in terms of constructs. Our plan is to have it be part of the main SDLF repository eventually, hopefully this month or the next.
-
Hi cnfait, Thank you so much for your really detailed answer. I will look at the SDLF 2.0 version; is there an example project using it?

Also, you're right about data mesh. I read a lot about it last week and found that it's what I need: data mesh is about decentralizing the data lake into multiple ones, decoupling responsibilities, and having central governance for data exploration, publication, and access requests. aws-analytics-reference-architecture is an example architecture I found. As I understand it, it's totally possible to have something like aws-analytics-reference-architecture for the data mesh and an SDLF implementation for each data domain? Will it be easier to implement that kind of architecture with SDLF 2.0?

When reading about data mesh I also read about AWS DataZone, a managed service to simplify data mesh implementation with an interface for all teams to publish, explore, and request access to data. Could SDLF also work with this new service? I would love to have some examples or further reading about data mesh and/or DataZone with SDLF (maybe with SDLF 2.0).

Thanks,
-
Hi @cnfait, I don't know if you saw my previous comment because I didn't tag you. Do you have a quick example of a data mesh with SDLF as the data domain (1 producer and 1 consumer, for example), and of DataZone? Thanks,