Feat #70: Add transform_columns for column adjustments during diff
#71
base: master
Title changed: "`transform_columns` for on-the-fly value adjustments during diff" → "`transform_columns` for column adjustments during diff"
|
Btw you can run the tests locally. It should be pretty easy to set up with docker. (it only tests the basic dbs, but that's usually enough) |
|
Sure, working on the same. |
|
Please ignore the commit. I am testing joindiffer locally |
|
Okay, ignoring |
|
I don't understand, why is
@erezsh This is because `is_distinct_from` is not available in `NormalizeAsString`. `AttributeError: 'NormalizeAsString' object has no attribute 'is_distinct_from'`
https://github.com/erezsh/reladiff/actions/runs/14164994160/job/39677702946 Note: If you think `is_distinct_from` can be added to the sqeleton library directly, we could do that. We can work on a PR for that as well. |
|
@erezsh Can this be merged? Do you have anything else in mind? |
|
Something in the implementation doesn't seem right. I'll try to find time soon to look deeper into it. |
…oth Hash Diff and JoinDiffer. In HashDiff, added logic to honor transform_rules for key columns as well (tested well in my HashDiff use case). JoinDiff already considers transform_rules for key columns.
|
@erezsh I accidentally reopened the PR. I had made some more changes. There is room to optimize, and the code may be confusing, but Hash Diff is tested well.
I have validated with the following case, where and the query was considering both cases |
`attr in ("database", "key_columns", "key_types", "relevant_columns", "_schema", "transform_columns")`
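For readers unfamiliar with the pattern behind the attribute list above, here is a minimal sketch of an allow-list `__getattr__` that forwards only selected attributes to a wrapped object. The class names `Table` and `EmptyWrapper` and the constant `ALLOWED` are hypothetical stand-ins, not the actual reladiff code; only the tuple of attribute names is taken from this PR.

```python
# Attribute names copied from the assert in this PR.
ALLOWED = ("database", "key_columns", "key_types",
           "relevant_columns", "_schema", "transform_columns")


class Table:
    """Hypothetical stand-in for the wrapped table segment."""

    def __init__(self):
        self.database = "db"
        self.key_columns = ("id",)
        self.transform_columns = {}   # empty by default
        self.other = "not forwarded"  # deliberately not in ALLOWED


class EmptyWrapper:
    """Hypothetical wrapper that forwards only allow-listed attributes."""

    def __init__(self, table):
        self._table = table

    def __getattr__(self, attr):
        # __getattr__ runs only when normal attribute lookup fails,
        # so `_table` itself never reaches this method once it is set.
        assert attr in ALLOWED, attr
        return getattr(self._table, attr)


wrapper = EmptyWrapper(Table())
print(wrapper.transform_columns)
```

Without `"transform_columns"` in the tuple, the last line would raise `AssertionError`, which matches the failure mode this PR describes for empty tables.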
Motivation:
Comparing data between different database systems (like Oracle and PostgreSQL) or even within the same system often requires minor, database-specific transformations to make column values truly comparable. Examples include adjusting timezones, applying string functions, or rounding numeric values.
Previously, achieving this required creating temporary views or pre-transforming data, which added complexity and was impossible in restricted environments.
Solution:
This PR introduces a new `transform_columns` parameter to the `TableSegment` class. This parameter accepts a dictionary where each key is a column name and each value is the SQL expression string to use in place of that column for the `TableSegment`.
This allows users to specify minor adjustments directly within the `reladiff` configuration, eliminating the need for external setup.
Examples of `transform_columns` Usage:
Here's how you might define `transform_columns` for different scenarios:
When creating `TableSegment` objects:
Implementation Details:
- `TableSegment`: Added a `transform_columns: Dict[str, str]` attribute to store the transformation rules provided by the user.
- `HashDiffer`: Transformations are applied in the column representation (`_relevant_columns_repr`) and in final value fetching (`get_values`). If a column is in `transform_columns`, the provided SQL string is embedded using `sqeleton.queries.Code()` instead of the original column reference (`this[col]`). `NormalizeAsString` is applied to the result (either the original column or the transformed `Code`) to ensure consistent string representation for hashing and comparison.
- `JoinDiffer`: Modified the `_create_outer_join` method. When building the `OUTER JOIN` query, for each compared column, it now checks if a transformation exists in `transform_columns`. If one exists, `sqeleton.queries.Code(transformation_string)` is used; otherwise, the original column reference (`a[c1]`/`b[c2]`) is used. The comparison (`is_distinct_from`) and the selected output columns (`a_cols`, `b_cols`) now operate on the result of this potentially transformed expression, wrapped in `NormalizeAsString` using the original column's schema type for correct formatting.
- `EmptyTableSegment`: Updated `EmptyTableSegment.__getattr__` by adding `"transform_columns"` to the allowed attributes in the assert statement. This allows `JoinDiffer` to correctly access the (empty) `transform_columns` dictionary from the underlying `TableSegment` when dealing with an empty table, preventing the assertion failure.
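
The substitution rule described for `HashDiffer` and `JoinDiffer` above can be sketched in plain Python. This is a simplified illustration that builds SQL text directly; it does not use sqeleton's `Code` or `NormalizeAsString`, and the helper names `column_expr` and `select_list`, as well as the sample rules, are assumptions made for this sketch rather than code from the PR.

```python
from typing import Dict, List


def column_expr(column: str, transform_columns: Dict[str, str]) -> str:
    """Return the SQL expression to select for `column`.

    Uses the user-supplied transformation when one exists,
    otherwise falls back to the plain column reference.
    """
    return transform_columns.get(column, column)


def select_list(columns: List[str], transform_columns: Dict[str, str]) -> str:
    """Build a SELECT list, aliasing each expression to its column name."""
    # Aliasing keeps the output column names stable whether or not
    # a transformation was applied.
    return ", ".join(
        f"{column_expr(c, transform_columns)} AS {c}" for c in columns
    )


# Hypothetical transformation rules: column name -> SQL expression.
transform_columns = {
    "price": "ROUND(price, 2)",
    "name": "UPPER(TRIM(name))",
}

print(select_list(["id", "price", "name"], transform_columns))
```

Columns without a rule (here `id`) pass through unchanged, so an empty dictionary reproduces the behavior from before this PR.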