
@sarath-mec sarath-mec commented Mar 31, 2025

Motivation:

Comparing data between different database systems (like Oracle and PostgreSQL) or even within the same system often requires minor, database-specific transformations to make column values truly comparable. Examples include adjusting timezones, applying string functions, or rounding numeric values.

Previously, achieving this required creating temporary views or pre-transforming the data, which adds complexity and is impossible in restricted environments.

Solution:

This PR introduces a new transform_columns parameter to the TableSegment class. This parameter accepts a dictionary where:

  • Keys are the original column names in the table segment.
  • Values are raw SQL string expressions representing the transformation to be applied to that column during the diff process. The SQL syntax must be valid for the database associated with the specific TableSegment.

This allows users to specify minor adjustments directly within the reladiff configuration, eliminating the need for external setup.

Note: As transformations are typically specific to either the source or target database, this parameter is not overridden directly in diff_tables.

Examples of transform_columns Usage:

Here's how you might define transform_columns for different scenarios:

# Example for an Oracle TableSegment
oracle_transforms = {
    "AMOUNT": "ROUND(AMOUNT, 2)",
    "LEGACY_ID": "TO_CHAR(LEGACY_ID)",
    "LOB_DATA": "LENGTH(LOB_DATA)",
    "DESCRIPTION": "SUBSTR(DESCRIPTION, 1, 10)",
    "EVENT_TIMESTAMP": "CAST(EVENT_TIMESTAMP AT TIME ZONE 'UTC' AS TIMESTAMP)",
    "USER_NAME": "TRIM(USER_NAME)",
    "ACTIVE_FLAG": "CASE WHEN ACTIVE_FLAG = 'Y' THEN 1 ELSE 0 END"
}

# Example for a PostgreSQL TableSegment
postgres_transforms = {
    "amount": "ROUND(amount, 2)",
    "legacy_id": "CAST(legacy_id AS TEXT)", # or "legacy_id::TEXT",
    "lob_data": "LENGTH(lob_data)",
    "description": "SUBSTRING(description FROM 1 FOR 10)",
    "event_timestamp": "event_timestamp AT TIME ZONE 'America/New_York' AT TIME ZONE 'UTC'",
    "user_name": "TRIM(user_name)",
    "active_flag": "CAST(active_flag AS INTEGER)"
}

When creating TableSegment objects:

src_segment = TableSegment(..., transform_columns=oracle_transforms)
tgt_segment = TableSegment(..., transform_columns=postgres_transforms)

Implementation Details:

  • TableSegment:

    • Added the transform_columns: Dict[str, str] attribute to store the transformation rules provided by the user.
  • HashDiffer:

    • Modified the generation of expressions used for checksum calculation (_relevant_columns_repr) and final value fetching (get_values).
    • If a column name exists in transform_columns, the provided SQL string is embedded using sqeleton.queries.Code() instead of the original column reference (this[col]).
    • NormalizeAsString is applied to the result (either the original column or the transformed Code) to ensure consistent string representation for hashing and comparison.
  • JoinDiffer:

    • Modified the _create_outer_join method.
    • Within the OUTER JOIN query, for each compared column, it now checks if a transformation exists in transform_columns.
    • If a transformation exists, sqeleton.queries.Code(transformation_string) is used; otherwise, the original column reference (a[c1]/b[c2]) is used.
    • The comparison logic (is_distinct_from) and the selected output columns (a_cols, b_cols) now operate on the result of this potentially transformed expression, wrapped in NormalizeAsString using the original column's schema type for correct formatting.
  • EmptyTableSegment:

    • Fixed AssertionError: Modified EmptyTableSegment.__getattr__ by adding "transform_columns" to the allowed attributes in the assert statement. This allows JoinDiffer to correctly access the (empty) transform_columns dictionary from the underlying TableSegment when dealing with an empty table, preventing the assertion failure.
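The allow-list fix described for EmptyTableSegment can be pictured with a small stand-alone sketch. The classes `Segment` and `EmptySegmentProxy` below are hypothetical stand-ins, not reladiff's real classes; only the tuple of allowed attribute names is taken from this PR.

```python
class Segment:
    """Hypothetical stand-in for TableSegment (not the real class)."""

    def __init__(self, key_columns, transform_columns=None):
        self.key_columns = key_columns
        # Default to an empty dict when no transformations are given.
        self.transform_columns = transform_columns or {}


class EmptySegmentProxy:
    """Hypothetical stand-in for EmptyTableSegment."""

    # Attribute names that may be forwarded to the wrapped segment.
    _ALLOWED = (
        "database", "key_columns", "key_types",
        "relevant_columns", "_schema", "transform_columns",
    )

    def __init__(self, segment):
        self._segment = segment

    def __getattr__(self, attr):
        # __getattr__ runs only when normal attribute lookup fails,
        # so _segment and _ALLOWED never reach this method.
        assert attr in self._ALLOWED, attr
        return getattr(self._segment, attr)


proxy = EmptySegmentProxy(Segment(("id",)))
print(proxy.transform_columns)  # forwarded to the wrapped segment
```

Without `"transform_columns"` in the allow-list, the same access would trip the assertion, which matches the failure described above.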

Unified Output: Both HashDiffer and JoinDiffer will now output the transformed values for differing rows, providing a consistent representation of the data as it was compared.
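The column-expression logic described for HashDiffer and JoinDiffer can be pictured roughly as follows. This is a simplified sketch in plain string-building, not the PR's code: the helper `build_compare_exprs` is hypothetical, a bare column name stands in for `this[col]`, the raw transform string stands in for `Code(...)`, and `CAST(... AS TEXT)` is only an assumed placeholder for what `NormalizeAsString` does.

```python
from typing import Dict, List


def build_compare_exprs(columns: List[str], transform_columns: Dict[str, str]) -> List[str]:
    """Return one SQL expression per column, cast to text for comparison."""
    exprs = []
    for col in columns:
        # Use the user-supplied raw SQL if one exists, else the plain column.
        base = transform_columns.get(col, col)
        # Placeholder for normalization: cast the result to text.
        exprs.append(f"CAST({base} AS TEXT)")
    return exprs


postgres_transforms = {
    "amount": "ROUND(amount, 2)",
    "user_name": "TRIM(user_name)",
}
exprs = build_compare_exprs(["id", "amount", "user_name"], postgres_transforms)
for e in exprs:
    print(e)
```

Because the same expressions feed both the comparison and the selected output, the values reported for differing rows are the transformed ones.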

@sarath-mec sarath-mec changed the title Feat #70: Add transform_columns for on-the-fly value adjustments during diff Feat #70: Add transform_columns for column adjustments during diff Mar 31, 2025
@erezsh
Owner

erezsh commented Mar 31, 2025

Btw you can run the tests locally. It should be pretty easy to set up with docker. (it only tests the basic dbs, but that's usually enough)

@sarath-mec
Author

Sure, working on it.

@sarath-mec
Author

Please ignore the commit. I am testing JoinDiffer locally.

@erezsh
Owner

erezsh commented Apr 2, 2025

Okay, ignoring

@sarath-mec
Author

sarath-mec commented Apr 2, 2025

@erezsh The test cases had issues with JoinDiffer as I was only using HashDiffer.

The last two commits fix the tests, and I was able to test it locally.

Please check the changes and let me know if anything else is needed

@erezsh
Owner

erezsh commented Apr 3, 2025

I don't understand, why is OverrideNormalizeAsString needed?

@sarath-mec
Author

sarath-mec commented Apr 4, 2025

I don't understand, why is OverrideNormalizeAsString needed?

@erezsh This is because is_distinct_from is not available in NormalizeAsString.

AttributeError: 'NormalizeAsString' object has no attribute 'is_distinct_from'

I introduced OverrideNormalizeAsString so that values can be compared with .is_distinct_from() after normalization and any transformation. The class inherits LazyOps from sqeleton, which enables comparison methods on the normalized string representation. The comparison logic and the selected output columns (a_cols, b_cols) now use this class.

https://github.com/erezsh/reladiff/actions/runs/14164994160/job/39677702946

Note: If you think is_distinct_from can be added to the sqeleton library directly, we could do that. We can work on a PR for that as well.
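The mixin idea behind that explanation can be sketched without sqeleton. All three classes below are hypothetical stand-ins: `ComparisonOps` plays the role attributed to LazyOps, `Normalize` the role of NormalizeAsString, and `NormalizeWithOps` the role of OverrideNormalizeAsString. The SQL they render is an assumption for illustration only.

```python
class ComparisonOps:
    """Hypothetical mixin that contributes comparison methods."""

    def is_distinct_from(self, other):
        return f"({self.to_sql()} IS DISTINCT FROM {other.to_sql()})"


class Normalize:
    """Stand-in for a normalizing node that has no comparison methods."""

    def __init__(self, expr):
        self.expr = expr

    def to_sql(self):
        # Placeholder normalization: cast the expression to text.
        return f"CAST({self.expr} AS TEXT)"


class NormalizeWithOps(Normalize, ComparisonOps):
    """Stand-in for the override class: same node plus the mixin."""


# The plain node lacks the method, which mirrors the AttributeError above.
has_plain = hasattr(Normalize("x"), "is_distinct_from")

a = NormalizeWithOps("TRIM(a.user_name)")  # transformed side
b = NormalizeWithOps("b.user_name")        # untransformed side
sql = a.is_distinct_from(b)
print(has_plain)
print(sql)
```

Adding the method through inheritance leaves the original node untouched, which is one reason a subclass might be preferred over patching the library class.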

@sarath-mec
Author

@erezsh Can this be merged? Do you have anything else in mind?

@erezsh
Owner

erezsh commented Apr 24, 2025

Something in the implementation doesn't seem right. I'll try to find time soon to look deeper into it.

…oth Hash Diff and JoinDiffer

In HashDiff, added logic to honor transform_rules for Key Columns as well (Tested Well in my Hashdiff usecase)
JoinDiff already considers transform_rules for key columns
@sarath-mec sarath-mec marked this pull request as draft June 11, 2025 21:30
@sarath-mec
Author

sarath-mec commented Jun 11, 2025

@erezsh I accidentally reopened the PR. I had made some more changes.

There is room to optimize, and the changes may be confusing, but the HashDiffer path is well tested.

  • Created a new TableSegment method _get_transform_columns to support both Hash Diff and JoinDiffer
  • In HashDiff, added logic to honor transform_rules for Key Columns as well (Tested Well in my Hashdiff usecase)
  • JoinDiff already considers transform_rules for key columns

I have validated this with the following case, where the generated query took both cases into account:

src_transform_columns = {}
tgt_transform_columns = {
    "employee_nbr": "CASE WHEN employee_nbr = 62230 THEN -1 ELSE employee_nbr END",
    "email_address": "CASE WHEN email_address = '[email protected]' THEN '[email protected]' ELSE email_address END"
}
src_tbl_segment = TableSegment(
    database=engine,
    table_path=src_table_path,
    key_columns=tuple(["employee_nbr"]),
    extra_columns=tuple(extra_cols),
    transform_columns=src_transform_columns,
    case_sensitive=True
    ).with_schema(refine=False, allow_empty_table=True)

tgt_tbl_segment = TableSegment(
    database=engine,
    table_path=tgt_table_path,
    key_columns=tuple(["employee_nbr"]),
    extra_columns=tuple(extra_cols),
    transform_columns=tgt_transform_columns,
    case_sensitive=True
    ).with_schema(refine=False, allow_empty_table=True)
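As a rough illustration of what a key-column transform does to a comparison, the sketch below uses the standard-library `sqlite3` module and a plain set difference rather than reladiff's differs. The tables, the sample rows, and the helper `select_rows` are made up for the example; only the idea of a per-side `CASE` expression on `employee_nbr` comes from the case above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE src (employee_nbr INTEGER, email_address TEXT);
CREATE TABLE tgt (employee_nbr INTEGER, email_address TEXT);
INSERT INTO src VALUES (1, 'alice'), (2, 'bob');
INSERT INTO tgt VALUES (1, 'alice'), (2, 'bob');
""")


def select_rows(table, columns, transforms):
    """Fetch rows, substituting a raw SQL expression where one is given."""
    exprs = [transforms.get(c, c) for c in columns]
    sql = f"SELECT {', '.join(exprs)} FROM {table}"
    return set(conn.execute(sql).fetchall())


columns = ["employee_nbr", "email_address"]
src_transforms = {}
tgt_transforms = {
    # Rewrites one key value on the target side only.
    "employee_nbr": "CASE WHEN employee_nbr = 2 THEN -1 ELSE employee_nbr END",
}

src_rows = select_rows("src", columns, src_transforms)
tgt_rows = select_rows("tgt", columns, tgt_transforms)
only_in_src = src_rows - tgt_rows
only_in_tgt = tgt_rows - src_rows
print(only_in_src, only_in_tgt)
```

The two tables hold identical data, so the only reported difference is the row whose key was rewritten on the target side, which is a quick way to check that a key-column transform is actually being applied.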

`attr in ("database", "key_columns", "key_types", "relevant_columns", "_schema", "transform_columns")`