Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UNION ALL Queries resolves column lineage incorrectly #475

Closed
skada-coder opened this issue Oct 23, 2023 · 4 comments
Closed

UNION ALL Queries resolves column lineage incorrectly #475

skada-coder opened this issue Oct 23, 2023 · 4 comments
Labels
bug Something isn't working

Comments

@skada-coder
Copy link

Describe the bug
When a create statement containing UNION ALL uses table aliases in the SELECT statements, column lineage fails to resolve back to the source table and resolves to the alias table instead.

The table source however does resolve correctly.

SQL
Variation 1 where income_stats is aliased as income

CREATE TABLE dwh.income_rates AS (
    SELECT
        income.id AS id,
        income.rate AS rate
    FROM income_stats income
    UNION ALL
    SELECT
        sup_income.id AS id,
        sup_income.rate AS rate
    FROM sup_income

Variation 2 no alias for income_stats

CREATE TABLE dwh.income_rates AS (
    SELECT
        income_stats.id AS id,
        income_stats.rate AS rate
    FROM income_stats
    UNION ALL
    SELECT
        sup_income.id AS id,
        sup_income.rate AS rate
    FROM sup_income

Variation 3 removing the UNION ALL

CREATE TABLE dwh.income_rates AS (
    SELECT
        income.id AS id,
        income.rate AS rate
    FROM income_stats income
)

To Reproduce
Note here we refer to SQL provided in prior step
Simply run the LineageRunner class with the SQL

from sqllineage.runner import LineageRunner
output = LineageRunner(sql, 'ansi')
print(output.source_tables)
print(output.get_column_lineage())

Looking at Variation 1 (The one with the issue)

CREATE TABLE dwh.income_rates AS (
    SELECT
        income.id AS id,
        income.rate AS rate
    FROM income_stats income
    UNION ALL
    SELECT
        sup_income.id AS id,
        sup_income.rate AS rate
    FROM sup_income

This is the output from sqllineage for source_tables and get_column_lineage() respectively
`[Table: .income_stats, Table: .sup_income]

[(Column: .income.id, Column: dwh.income_rates.id), (Column: .sup_income.id, Column: dwh.income_rates.id), (Column: .income.rate, Column: dwh.income_rates.rate), (Column: .sup_income.rate, Column: dwh.income_rates.rate)]`

If we remove the alias i.e. Variation 2, it resolves column lineage correctly

CREATE TABLE dwh.income_rates AS (
    SELECT
        income_stats.id AS id,
        income_stats.rate AS rate
    FROM income_stats
    UNION ALL
    SELECT
        sup_income.id AS id,
        sup_income.rate AS rate
    FROM sup_income

`[Table: .income_stats, Table: .sup_income]

[(Column: .income_stats.id, Column: dwh.income_rates.id), (Column: .sup_income.id, Column: dwh.income_rates.id), (Column: .income_stats.rate, Column: dwh.income_rates.rate), (Column: .sup_income.rate, Column: dwh.income_rates.rate)]`

If we remove the union all and leave the alias i.e. Variation 3, it resolves column lineage correctly

CREATE TABLE dwh.income_rates AS (
    SELECT
        income.id AS id,
        income.rate AS rate
    FROM income_stats income
)

`[Table: .income_stats]

[(Column: .income_stats.id, Column: dwh.income_rates.id), (Column: .income_stats.rate, Column: dwh.income_rates.rate)]`

Expected behavior
We expect for Variation 1 that income_rates columns resolve back to"income_stats" rather than "income" columns

i.e. this output
`[Table: .income_stats, Table: .sup_income]

[(Column: .income.id, Column: dwh.income_rates.id), (Column: .sup_income.id, Column: dwh.income_rates.id), (Column: .income.rate, Column: dwh.income_rates.rate), (Column: .sup_income.rate, Column: dwh.income_rates.rate)]`

should be this below with no mention of a table called "income"
`[Table: .income_stats, Table: .sup_income]

[(Column: .income_stats.id, Column: dwh.income_rates.id), (Column: .sup_income.id, Column: dwh.income_rates.id), (Column: .income_stats.rate, Column: dwh.income_rates.rate), (Column: .sup_income.rate, Column: dwh.income_rates.rate)]`

Python version (available via python --version)
Python 3.11.5

SQLLineage version (available via sqllineage --version):
1.4.7

Additional context
We are looking to correctly resolve Variation 1

@skada-coder skada-coder added the bug Something isn't working label Oct 23, 2023
@reata
Copy link
Owner

reata commented Oct 23, 2023

I suppose the SQL is not correct as both Variation 1 and Variation 2 miss the closing bracket at the end of query, and calling it with ansi dialect will throw a InvalidSyntaxException.

After append ) at the end of the query. I can reproduce the issue and it looks like this is with ansi, non-validating does not suffer the same issue:

$ sqllineage -f variation1.sql -l column --dialect=ansi
dwh.income_rates.id <- <default>.income.id
dwh.income_rates.id <- <default>.sup_income.id
dwh.income_rates.rate <- <default>.income.rate
dwh.income_rates.rate <- <default>.sup_income.rate
$ sqllineage -f variation1.sql -l column --dialect=non-validating
/home/hujunwei/repos/sqllineage/sqllineage/cli.py:94: DeprecationWarning: dialect `non-validating` is deprecated, use `ansi` or dialect of your SQL instead. `non-validating` will stop being the default dialect in v1.5.x release and be completely removed in v1.6.x
  runner = LineageRunner(
dwh.income_rates.id <- <default>.income_stats.id
dwh.income_rates.id <- <default>.sup_income.id
dwh.income_rates.rate <- <default>.income_stats.rate
dwh.income_rates.rate <- <default>.sup_income.rate

@skada-coder
Copy link
Author

skada-coder commented Oct 23, 2023

Ah yes, missed those closing brackets while formatting.

So in this scenario its more a trivial case to make it easier to replicate.

Our actual use case has much larger variations of this scenario and uses snowflake dialect as the issue also appears there

(server) bash-3.2$ sqllineage -f test.sql -l column --dialect=snowflake
dwh.income_rates.id <- <default>.income.id
dwh.income_rates.id <- <default>.sup_income.id
dwh.income_rates.rate <- <default>.income.rate
dwh.income_rates.rate <- <default>.sup_income.rate

Given that non-validating is on the deprecation path, what are your thoughts on fixing this particular scenario?

Understandable that there are millions of sql variations out there, so wont be able to cover everything.

@reata
Copy link
Owner

reata commented Oct 24, 2023

Anything that non-validating supports while breaks with ansi is treated as high priority since we're deprecating non-validating in the long run. So we'll get this fixed. No worries.

Plus snowflake or any other specific dialect shares majority of the logic with ansi. So I don't see snowflake as a risk either.

@reata
Copy link
Owner

reata commented Dec 10, 2023

This is fixed in master branch via #488 as we fixed a related issue

@reata reata closed this as completed Dec 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants