
[FEAT] adding transform functionality #2498

Merged
8 commits merged into Eventual-Inc:main on Jul 16, 2024

Conversation

@otacilio-psf (Contributor) commented Jul 10, 2024

As in PySpark, the transform functionality allows you to split your transformations into units of work, create a function for each, and then call them on your DataFrame, making it possible to chain transformations.

Having your transformations as functions also makes them easier to unit test.

def business_rule_1(df):
    df = (
        df
        .with_column(......)
        .with_column(......)
        .with_column(......)
    )
    return df

def business_rule_2(df):
    df = (
        df
        .with_column(......)
        .with_column(......)
    )
    return df

df = (
    df
    .transform(business_rule_1)
    .transform(business_rule_2)
)
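For illustration, here is a minimal runnable sketch of the same pattern with concrete (made-up) column names, using daft.from_pydict and daft.col:

import daft
from daft import col

def add_total(df):
    # Hypothetical unit of work: derive a total from two existing columns.
    return df.with_column("total", col("price") * col("quantity"))

def flag_large_orders(df, threshold):
    # Hypothetical unit of work that takes an extra positional argument.
    return df.with_column("large_order", col("total") > threshold)

df = daft.from_pydict({"price": [1.0, 2.5], "quantity": [3, 4]})
df = (
    df
    .transform(add_total)
    .transform(flag_large_orders, 5.0)
)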

PySpark reference

@github-actions bot added the documentation (Improvements or additions to documentation) and enhancement (New feature or request) labels on Jul 10, 2024
@otacilio-psf (Contributor, Author) commented Jul 10, 2024

Not sure why ruff-format is failing

@jaychia (Contributor) commented Jul 10, 2024

Thanks @otacilio-psf !

As I understand it, df.transform(f) is syntactic sugar for f(df)?

I'm curious why this would be preferable, since it feels like f(df) is more flexible with regard to arguments (e.g. we could imagine functions like f(df1, df2, df3), which would be trickier to express with method chaining).

@otacilio-psf (Contributor, Author) commented:

Hi @jaychia

TL;DR: Yes, it helps with code readability, and transform accepts a function plus args and kwargs.

df.transform(func, *args, **kwargs)
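In essence (a sketch of the behavior, not the exact implementation, with func, arg1, and value as placeholders):

# These two calls produce the same result; transform simply forwards
# the DataFrame plus any extra positional and keyword arguments to func.
out = df.transform(func, arg1, key=value)
out = func(df, arg1, key=value)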

The main reason behind it is to help with code readability when we are using functions to split the work into units that can later be unit tested.

Instead of having a lot of dataframe variables (like df1, df_final, df_output, df_final_final), you can chain your transformations (as is already possible)

df_1 = (
    daft.read_csv("/path/to/file.csv")
    .with_column(......)
    .with_column(......)
    .join(df_2)
    .with_column(......)
    .with_column(......)
    .with_column(......)
)

but it would be nice to split the transformations into units of work, business rules, or whatever makes sense for your project

def meaningful_name_1(df):
    return (
        df
        .with_column(......)
        .with_column(......)
    )

def meaningful_name_2(df, df_join):
    return (
        df
        .join(df_join)
        .with_column(......)
    )

def meaningful_name_3(df):
    return (
        df
        .with_column(......)
        .with_column(......)
    )

And instead of assigning each function call to a "step variable" or awkwardly nesting the functions, you could use transform to chain the calls

# The good
df = (
    daft.read_csv("/path/to/file.csv")
    .transform(meaningful_name_1)
    .transform(meaningful_name_2, df_table_2)
    .transform(meaningful_name_3)
)

# The normal
df = daft.read_csv("/path/to/file.csv")
df_1 = meaningful_name_1(df)
df_2 = meaningful_name_2(df_1, df_table_2)
df_3 = meaningful_name_3(df_2)

# The bad
df = meaningful_name_3(
    meaningful_name_2(
        meaningful_name_1(
            daft.read_csv("/path/to/file.csv")
        ),
        df_table_2,
    )
)

Let me know if you have any other questions, and also this is what I wanted to avoid in my daft test

@jaychia (Contributor) commented Jul 12, 2024

Thanks for the elaboration!

This seems fine to me. Would you be able to run pre-commit locally? My guess is that Ruff is trying to reformat some of your docstrings, causing the CI failure.

You should be able to do this by running this command from your Daft virtual environment:

pre-commit run --all-files

Lastly, let's add a unit test in tests/dataframe/test_transform.py?
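For reference, a minimal sketch of what such a test could look like (not necessarily the test that was added in this PR; the column names and data are made up):

import daft
from daft import col

def add_one(df):
    # Unit of work: derive a new column from an existing one.
    return df.with_column("b", col("a") + 1)

def multiply(df, factor):
    # Unit of work that takes an extra positional argument.
    return df.with_column("c", col("b") * factor)

def test_transform_chains_functions():
    df = daft.from_pydict({"a": [1, 2, 3]})
    result = df.transform(add_one).transform(multiply, 10)
    assert result.to_pydict() == {"a": [1, 2, 3], "b": [2, 3, 4], "c": [20, 30, 40]}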

@otacilio-psf (Contributor, Author) commented:

@jaychia done 😊

@samster25 (Member) commented:

great work @otacilio-psf! Merging now :)

@samster25 merged commit c02d611 into Eventual-Inc:main on Jul 16, 2024
41 checks passed