I am working on an SDLF pipeline. The pipeline includes a dataset defined in the dataset repository, which creates a database from a folder in the staging bucket and then creates multiple tables (one table per file from that parent folder). The stage A transformation applies to the entire dataset, i.e. the whole database and all its tables. If I want to apply a stage B transform to only one of the tables within the "dataset", would it be better to create a new dataset from the existing one (if that is even possible), or should I write code in the stage B transformation file that targets the specific table I'm interested in?
Replies: 1 comment
The easiest solution is to write code in the stage B transformation targeting only the table you're interested in, as you suggest. You can even edit the postupdate lambda of stage A to avoid sending the other tables to stage B.
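For illustration, here is a minimal sketch of that kind of filtering inside the stage B transform. It assumes the transform receives the list of S3 keys being processed and that each table's files sit under a `<dataset>/<table>/` prefix in the stage bucket; names such as `TransformHandler`, `transform_object`, and `TARGET_TABLE` are made up for the example and are not SDLF's actual blueprint API:

```python
import os

# Hypothetical: the table of interest corresponds to a folder prefix in the
# stage bucket, e.g. s3://<stage-bucket>/<team>/<dataset>/<table>/...
TARGET_TABLE = os.environ.get("TARGET_TABLE", "my_table")


def keys_for_target_table(keys, dataset, table=TARGET_TABLE):
    """Keep only the S3 keys that belong to the target table's prefix."""
    prefix = f"{dataset}/{table}/"
    return [key for key in keys if prefix in key]


class TransformHandler:
    def transform_object(self, bucket, keys, team, dataset):
        # Drop every object that is not part of the target table, so the rest
        # of the stage B logic runs only on that table's files.
        selected = keys_for_target_table(keys, dataset)
        if not selected:
            # Nothing to do for this batch: another table triggered stage B.
            return []

        processed_keys = []
        for key in selected:
            # ... actual stage B transformation of s3://{bucket}/{key} here ...
            processed_keys.append(key)
        return processed_keys
```

The same predicate could instead run in the stage A postupdate lambda, so that keys for the other tables are never sent to stage B in the first place.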
The alternative you mention is definitely more work, as it would likely involve:
(as an aside, we are working on making it easier to chain stages using AWS EventBridge, but there is no ETA yet)