UNION ALL Queries resolves column lineage incorrectly #475
Comments
I suppose the SQL is not correct, as both Variation 1 and Variation 2 miss the closing bracket at the end of the query, and calling it with the ansi dialect will throw an InvalidSyntaxException. After append
Ah yes, I missed those closing brackets while formatting. This scenario is a trivial case, chosen to make the issue easier to replicate. Our actual use case has much larger variations of this scenario and uses the snowflake dialect, where the issue also appears.
Given that non-validating is on the deprecation path, what are your thoughts on fixing this particular scenario? It is understandable that there are millions of SQL variations out there, so you won't be able to cover everything.
Anything that non-validating supports but that breaks with ansi is treated as high priority, since we're deprecating non-validating in the long run. So we'll get this fixed, no worries. Plus, snowflake or any other specific dialect shares the majority of its logic with ansi, so I don't see snowflake as a risk either.
This is fixed in the master branch via #488, as we fixed a related issue.
Describe the bug
When a CREATE statement containing UNION ALL uses table aliases in its SELECT statements, column lineage fails to resolve back to the source table and resolves to the alias instead.
Table-level lineage (the source tables), however, does resolve correctly.
SQL
Variation 1 where income_stats is aliased as income
Variation 2 no alias for income_stats
Variation 3 removing the UNION ALL
To Reproduce
Note here we refer to SQL provided in prior step
Simply run the LineageRunner class with the SQL
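For readers who want to try this, a minimal sketch of the reproduction step follows. The SQL string is a hypothetical stand-in for Variation 1, reconstructed only from the table and column names that appear in the outputs in this issue; it is not the original statement. The helper name `run_lineage` is also an assumption made for illustration.

```python
# Hypothetical stand-in for Variation 1: the table and column names come from
# the outputs quoted in this issue, but the statement itself is an assumption.
HYPOTHETICAL_VARIATION_1 = """
CREATE TABLE dwh.income_rates AS (
    SELECT income.id, income.rate FROM income_stats AS income
    UNION ALL
    SELECT sup_income.id, sup_income.rate FROM sup_income
)
"""


def run_lineage(sql, dialect="ansi"):
    """Return (source tables, column lineage pairs) for one SQL statement."""
    # Imported lazily so this module still loads where sqllineage is not installed.
    from sqllineage.runner import LineageRunner

    runner = LineageRunner(sql, dialect=dialect)
    # source_tables is a property; get_column_lineage() is a method.
    return runner.source_tables, runner.get_column_lineage()
```

Calling `run_lineage(HYPOTHETICAL_VARIATION_1)` and printing both return values should give output in the same shape as the listings in this issue; the exact result depends on the installed sqllineage version and the dialect.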
Looking at Variation 1 (The one with the issue)
This is the output from sqllineage for source_tables and get_column_lineage() respectively
`[Table: .income_stats, Table: .sup_income]
[(Column: .income.id, Column: dwh.income_rates.id), (Column: .sup_income.id, Column: dwh.income_rates.id), (Column: .income.rate, Column: dwh.income_rates.rate), (Column: .sup_income.rate, Column: dwh.income_rates.rate)]`
If we remove the alias, i.e. Variation 2, column lineage resolves correctly:
`[Table: .income_stats, Table: .sup_income]
[(Column: .income_stats.id, Column: dwh.income_rates.id), (Column: .sup_income.id, Column: dwh.income_rates.id), (Column: .income_stats.rate, Column: dwh.income_rates.rate), (Column: .sup_income.rate, Column: dwh.income_rates.rate)]`
If we remove the UNION ALL and leave the alias, i.e. Variation 3, column lineage resolves correctly:
`[Table: .income_stats]
[(Column: .income_stats.id, Column: dwh.income_rates.id), (Column: .income_stats.rate, Column: dwh.income_rates.rate)]`
Expected behavior
We expect that, for Variation 1, the income_rates columns resolve back to "income_stats" columns rather than "income" columns
i.e. this output
`[Table: .income_stats, Table: .sup_income]
[(Column: .income.id, Column: dwh.income_rates.id), (Column: .sup_income.id, Column: dwh.income_rates.id), (Column: .income.rate, Column: dwh.income_rates.rate), (Column: .sup_income.rate, Column: dwh.income_rates.rate)]`
should be this below with no mention of a table called "income"
`[Table: .income_stats, Table: .sup_income]
[(Column: .income_stats.id, Column: dwh.income_rates.id), (Column: .sup_income.id, Column: dwh.income_rates.id), (Column: .income_stats.rate, Column: dwh.income_rates.rate), (Column: .sup_income.rate, Column: dwh.income_rates.rate)]`
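The expectation above could be turned into a small regression check. The sketch below uses plain strings as stand-ins for sqllineage's Table and Column objects, with the default-schema prefix dropped for brevity; the helper name `unresolved_alias_columns` is hypothetical. It flags every source column whose table part is not one of the reported source tables.

```python
def unresolved_alias_columns(source_tables, column_lineage):
    """Return source columns whose table is not among the known source tables."""
    known = set(source_tables)
    leaked = []
    for source_column, _target_column in column_lineage:
        # Split "table.column" on the last dot to get the table part.
        table, _, _column = source_column.rpartition(".")
        if table not in known:
            leaked.append(source_column)
    return leaked


source_tables = ["income_stats", "sup_income"]

# Column lineage as reported for Variation 1 (alias "income" leaks through).
variation_1 = [
    ("income.id", "dwh.income_rates.id"),
    ("sup_income.id", "dwh.income_rates.id"),
    ("income.rate", "dwh.income_rates.rate"),
    ("sup_income.rate", "dwh.income_rates.rate"),
]

# Column lineage as expected (alias resolved back to income_stats).
expected = [
    ("income_stats.id", "dwh.income_rates.id"),
    ("sup_income.id", "dwh.income_rates.id"),
    ("income_stats.rate", "dwh.income_rates.rate"),
    ("sup_income.rate", "dwh.income_rates.rate"),
]

print(unresolved_alias_columns(source_tables, variation_1))  # ['income.id', 'income.rate']
print(unresolved_alias_columns(source_tables, expected))     # []
```

Comparing against the source tables, rather than hard-coding the alias name, keeps the check usable for the larger variations mentioned in the comments.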
Python version (available via `python --version`): Python 3.11.5
SQLLineage version (available via `sqllineage --version`): 1.4.7
Additional context
We are looking to correctly resolve Variation 1