
[FEAT] adding transform functionality #2498

Merged
8 commits merged into Eventual-Inc:main on Jul 16, 2024

Conversation

@otacilio-psf (Contributor) commented Jul 10, 2024

As in PySpark, the transform functionality allows you to split your transformations into units of work, create a function for each, and then call them on your DataFrame, making it possible to chain transformations.

Having your transformations as functions also makes them easier to unit test.

def business_rule_1(df):
    df = (
        df
        .with_column(......)
        .with_column(......)
        .with_column(......)
    )
    return df

def business_rule_2(df):
    df = (
        df
        .with_column(......)
        .with_column(......)
    )
    return df

df = (
    df
    .transform(business_rule_1)
    .transform(business_rule_2)
)
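For illustration, here is a minimal runnable sketch of the same pattern with concrete (made-up) column names, using daft.from_pydict and daft.col:

import daft
from daft import col

def add_total(df):
    # Hypothetical unit of work: derive a total from two existing columns.
    return df.with_column("total", col("price") * col("quantity"))

def flag_large_orders(df, threshold):
    # Hypothetical unit of work that takes an extra positional argument.
    return df.with_column("large_order", col("total") > threshold)

df = daft.from_pydict({"price": [1.0, 2.5], "quantity": [3, 4]})
df = (
    df
    .transform(add_total)
    .transform(flag_large_orders, 5.0)
)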

PySpark reference

@github-actions bot added the documentation (Improvements or additions to documentation) and enhancement (New feature or request) labels on Jul 10, 2024
@otacilio-psf (Contributor, Author) commented Jul 10, 2024

Not sure why ruff-format is failing

@jaychia (Contributor) commented Jul 10, 2024

Thanks @otacilio-psf !

As I understand it, df.transform(f) is syntactic sugar for f(df)?

I'm curious why this would be preferable, since it feels like f(df) is more flexible with regard to arguments (e.g. we could imagine functions like f(df1, df2, df3), which would be trickier to express with method chaining).

@otacilio-psf (Contributor, Author) commented:

Hi @jaychia

TL;DR: Yes, it helps with code readability, and transform accepts a function plus args and kwargs.

df.transform(func, *args, **kwargs)
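In essence (a sketch of the behavior, not the exact implementation, with func, arg1, and value as placeholders):

# These two calls produce the same result; transform simply forwards
# the DataFrame plus any extra positional and keyword arguments to func.
out = df.transform(func, arg1, key=value)
out = func(df, arg1, key=value)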

The main reason behind it is to help with code readability when we are using functions to split the work into units that can later be unit tested.

Instead of having a lot of dataframe variables (like df1, df_final, df_output, df_final_final), you can chain your transformations (as is already possible)

df_1 = (
    daft.read_csv("/path/to/file.csv")
    .with_column(......)
    .with_column(......)
    .join(df_2)
    .with_column(......)
    .with_column(......)
    .with_column(......)
)

but it would be nice to split the transformations into units of work, business rules, or whatever makes sense for your project

def meaningful_name_1(df):
    return (
        df
        .with_column(......)
        .with_column(......)
    )

def meaningful_name_2(df, df_join):
    return (
        df
        .join(df_join)
        .with_column(......)
    )

def meaningful_name_3(df):
    return (
        df
        .with_column(......)
        .with_column(......)
    )

And instead of assigning each function call to a "step variable" or awkwardly nesting the functions, you could use transform to chain the calls

# The good
df = (
    daft.read_csv("/path/to/file.csv")
    .transform(meaningful_name_1)
    .transform(meaningful_name_2, df_table_2)
    .transform(meaningful_name_3)
)

# The normal
df = daft.read_csv("/path/to/file.csv")
df_1 = meaningful_name_1(df)
df_2 = meaningful_name_2(df_1, df_table_2)
df_3 = meaningful_name_3(df_2)

# The bad
df = meaningful_name_3(
    meaningful_name_2(
        meaningful_name_1(
            daft.read_csv("/path/to/file.csv")
        ),
        df_table_2,
    )
)

Let me know if you have any other questions, and also this is what I wanted to avoid in my daft test

@jaychia (Contributor) commented Jul 12, 2024

Thanks for the elaboration!

This seems fine to me. Would you be able to run pre-commit locally? My guess is that Ruff is trying to reformat some of your docstrings, causing the CI failure.

You should be able to do this by running this command from your Daft virtual environment:

pre-commit run --all-files

Lastly, let's add a unit test in tests/dataframe/test_transform.py?
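For reference, a minimal sketch of what such a test could look like (not necessarily the test that was added in this PR; the column names and data are made up):

import daft
from daft import col

def add_one(df):
    # Unit of work: derive a new column from an existing one.
    return df.with_column("b", col("a") + 1)

def multiply(df, factor):
    # Unit of work that takes an extra positional argument.
    return df.with_column("c", col("b") * factor)

def test_transform_chains_functions():
    df = daft.from_pydict({"a": [1, 2, 3]})
    result = df.transform(add_one).transform(multiply, 10)
    assert result.to_pydict() == {"a": [1, 2, 3], "b": [2, 3, 4], "c": [20, 30, 40]}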

@otacilio-psf (Contributor, Author) commented:

@jaychia done 😊

@samster25 (Member) commented:

great work @otacilio-psf! Merging now :)

@samster25 merged commit c02d611 into Eventual-Inc:main on Jul 16, 2024
41 checks passed