Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding SnowflakeCompare (Snowflake/Snowpark compare) #333

Merged
merged 21 commits into from
Oct 29, 2024

Conversation

rhaffar
Copy link
Contributor

@rhaffar rhaffar commented Sep 12, 2024

  • Adding SFTableCompare class for Snowflake table and Snowpark dataframe comparisons
  • Added tests and documentation
  • Adding snowpark session creation in test config

*commit history is pretty bare because I had to reset and update the author

A few considerations that may or may not be worth addressing, but I'd like your takes on these:

  1. Case-sensitive columns are "partially" supported. In Snowpark, case sensitive columns are actually enclosed in double quotes, so 'A' and 'a' are the same column, but '"a"' is distinct from these. This can be troublesome to handle flexibly for column inclusion checks, and more importantly for instances where you need to tag the dataframe name to the column (the double-quotes screw with this). So for now I just convert all columns and join_columns to be case insensitive. The only way this could be an issue is if users are using the exact same column name for different rows, differentiating only by case sensitivity. I think this is sufficient since the vast majority of Snowflake tables are case insensitive to begin with, and those that aren't likely don't repeat the same column with different casing, but let me know if you have other thoughts.
  2. Running a compare with a table against itself creates some weird issues, as Datacompy starts looking for temporary columns belonging to both tables at a later point in the comparison. There is obviously no use case to run datacompy on a single table, but just thought I’d mention this.
  3. Compares with a float in the join column leads to incorrect results. This isn’t really implementation specific, this is just the natural outcome of trying to join on floating point columns. I don’t know if you’ve seen similar things in Spark. I don’t consider this an issue since it’s generally bad practice to compare floats using a strict equality, at best we add some sort of relative error check but I think that would actually be a bit of a headache to do in Snowpark and would just enable bad practice.

Copy link
Member

@fdosani fdosani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just my first round of review here. Excellent work, and much appreciate 🥳
A few changes, comments, and nitpicks. 😅 Feel free to comment and discuss.

To address the 3 question:

  1. I'm ok with not allowing case-sensitive column names as you stated. We can change that if users have a need later on.
  2. You're right, doesn't make much sense. I'd be curious to dig into that later though to understand what is happening.
  3. agreed on the join columns for sure. I'll need to check into how that works for the other compares. It would probably be the same issue I'm guessing.

datacompy/sf_sql.py Outdated Show resolved Hide resolved
datacompy/sf_sql.py Outdated Show resolved Hide resolved
@df1.setter
def df1(self, df1: Union[str, sp.DataFrame]) -> None:
"""Check that df1 is either a Snowpark DF or the name of a valid Snowflake table."""
if isinstance(df1, str):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering if we can be consistent with something like we have for polars and push this into _validate_dataframe

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a bit different, in that this is actually used to determine how we construct the dataframe, versus type-checking that the built dataframe is in a valid form actually does occur in _validate_dataframe .

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gotcha, that is fair.

abs_tol : float, optional
Absolute tolerance between two values.
rel_tol : float, optional
Relative tolerance between two values.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should consider including:

 df1_name: str = "df1",
 df2_name: str = "df2",

to be consistent with the other cleaners.

datacompy/sf_sql.py Outdated Show resolved Hide resolved
tests/conftest.py Outdated Show resolved Hide resolved
tests/test_sf_sql.py Outdated Show resolved Hide resolved
@rhaffar rhaffar requested a review from fdosani September 18, 2024 18:58
@rhaffar
Copy link
Contributor Author

rhaffar commented Sep 18, 2024

Just for documentation purposes, Snowpark local testing is still a relatively new product and seems to have a few pieces of functionality that have yet to be implemented, so although all tests pass when using a Snowflake cluster, we get many failures when running locally (usually the same couple of failures affect a handful of tests). For now I've kept the local testing configuration as we can pretty much choose when to run it locally or not, and hopefully within a few Snowpark updates we'll be able to run these tests locally without issue.

@rhaffar rhaffar changed the title Adding SFTableCompare (Snowflake/Snowpark compare) Adding SnowflakeCompare (Snowflake/Snowpark compare) Sep 19, 2024
@rhaffar
Copy link
Contributor Author

rhaffar commented Oct 29, 2024

Conclusion regarding testing:

Due to current CICD limitation as well as a desire to keep testing resources public, we've attempted to implement testing using either the Snowpark local runner or a local Snowflake resource emulator. In both instances, we see potential but find support somewhat limited, with either option requiring either significant mocking or unnecessary modifications of compare logic to work around lack of support. There's also no guarantee that our tests would currently be able to properly represent user usage through the use of an emulator or the local runner. So for now we will stick to testing Snowflake/Snowpark comparisons internally, and revisit these options at a later date.

@fdosani
Copy link
Member

fdosani commented Oct 29, 2024

@rhaffar We will need to exclude the test_snowflake.py from the pytest calls since it will cause everything to fail. Need to add --ignore=path/to/file.py when calling pytest in github actions.

@fdosani
Copy link
Member

fdosani commented Oct 29, 2024

@rhaffar nice work. LGTM.

@fdosani fdosani merged commit 1ea649a into capitalone:develop Oct 29, 2024
48 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants