Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SubQuery with Same Alias Visualized as Same Node #481

Open
maoxingda opened this issue Nov 24, 2023 · 8 comments
Open

SubQuery with Same Alias Visualized as Same Node #481

maoxingda opened this issue Nov 24, 2023 · 8 comments
Labels
bug Something isn't working

Comments

@maoxingda
Copy link
Contributor

maoxingda commented Nov 24, 2023

The results of column lineage using non-validating and Redshift dialects are inconsistent.

SQL

insert into
    public.t3
(
    c1
)
with
    cte1 as (
        select t1.c1 from public.t1 t1
    ),
    cte2 as (
        select t2.c1 from public.t2 t2
    )
select
    sq1.c1
from
    (
        select cte1.c1 from cte1 union all select cte2.c1 from cte2
    ) sq1
;

To Reproduce

Note here we refer to SQL provided in prior step as stored in a file named test.sql

from sqllineage.runner import LineageRunner

with open('test.sql', 'r') as f:
    sql = f.read()

lr1 = LineageRunner(sql)
lr2 = LineageRunner(sql, dialect='redshift')

pprint(lr1.get_column_lineage())
pprint(lr2.get_column_lineage())
[(Column: public.t1.c1, Column: cte1.c1, Column: sq1.c1, Column: public.t3.c1),
 (Column: public.t2.c1, Column: cte2.c1, Column: sq1.c1, Column: public.t3.c1)]

[(Column: <default>.cte1.c1, Column: sq1.c1, Column: public.t3.c1),
 (Column: <default>.cte2.c1, Column: sq1.c1, Column: public.t3.c1)]

Expected behavior

[(Column: public.t1.c1, Column: cte1.c1, Column: sq1.c1, Column: public.t3.c1),
 (Column: public.t2.c1, Column: cte2.c1, Column: sq1.c1, Column: public.t3.c1)]

Python version (available via python --version)

  • 3.8.18

SQLLineage version (available via sqllineage --version):

  • 1.4.8

Additional context

Snipaste_2023-11-27_12-01-37

@maoxingda maoxingda added the bug Something isn't working label Nov 24, 2023
@maoxingda
Copy link
Contributor Author

Giving aliases t3 and t4 to cte1 and cte2 respectively does not yield the correct results.

insert into
    public.t3
(
    c1
)
with
    cte1 as (
        select t1.c1 from public.t1 t1
    ),
    cte2 as (
        select t2.c1 from public.t2 t2
    )
select
    sq1.c1
from
    (
        select t3.c1 from cte1 as t3 union all select t4.c1 from cte2 as t4
    ) sq1
;
Snipaste_2023-11-26_10-04-58

@maoxingda
Copy link
Contributor Author

When using the Redshift dialect and there is a 'join_clause' in the 'from_clause' of the 'set_expression,' the lineage is also incorrect.

insert into
    public.tgt_tbl1
(
    id
)
with
    cte1 as (
        select s1.id from public.src_tbl1 as s1
    ),
    cte2 as (
        select s2.id from public.src_tbl2 as s2
    )
select
    sq1.id
from
    (
        select c1.id || s3.id as id from cte1 as c1 join public.src_tbl3 as s3 on c1.id = s3.id
        union all
        select c2.id from cte2 as c2
    ) sq1
;
Snipaste_2023-11-27_10-29-49

@maoxingda
Copy link
Contributor Author

Does the code on line 154 in the screenshot not take into account scenarios involving nested subqueries?

Snipaste_2023-11-27_14-12-09

@maoxingda
Copy link
Contributor Author

When there are two subqueries with the same name in the set_expression, the column lineage is also incorrect.

insert into
    public.tgt_tbl1
(
    id
)
select
    sq.id
from
    (
        select
            id
        from
            public.src_tbl1
    ) sq

union all

select
    sq.id
from
    (
        select
            id
        from
            public.src_tbl2
    ) sq
;

Snipaste_2023-11-29_12-54-55

** sqllineage.core.models.SubQuery.__str__ **

def __str__(self):
    # return self.alias
    return re.sub(r'\s+', ' ', self.query_raw).strip()

Snipaste_2023-11-29_12-55-14

@reata
Copy link
Owner

reata commented Dec 3, 2023

The table level lineage is incorrect, too

$ sqllineage -f test.sql --dialect=non-validating
Statements(#): 1
Source Tables:
    public.t1
    public.t2
Target Tables:
    public.t3

$ sqllineage -f test.sql --dialect=redshift
Statements(#): 1
Source Tables:
    <default>.cte1
    <default>.cte2
    public.t1
    public.t2
Target Tables:
    public.t3

We have some bugs here when handling set expression together with CTE that we mis-identify CTE as normal table. This should be fixed first.

reata added a commit that referenced this issue Dec 6, 2023
* fix: similar alias across statements

* fix: handling subqueries in a set expression.

* refactor: re-use handle table and column logic for set

* refactor: make test case atomic

* style: black reformat test

---------

Co-authored-by: reata <[email protected]>
@reata
Copy link
Owner

reata commented Dec 6, 2023

With #488 merged, now for the SQL

insert into
    public.t3
(
    c1
)
with
    cte1 as (
        select t1.c1 from public.t1 t1
    ),
    cte2 as (
        select t2.c1 from public.t2 t2
    )
select
    sq1.c1
from
    (
        select cte1.c1 from cte1 union all select cte2.c1 from cte2
    ) sq1
;

non-validating and redshift generate same result for both table lineage and column lineage:

$ python -m sqllineage.cli -f test.sql --dialect=non-validating
Statements(#): 1
Source Tables:
    public.t1
    public.t2
Target Tables:
    public.t3

$ python -m sqllineage.cli -f test.sql --dialect=redshift
Statements(#): 1
Source Tables:
    public.t1
    public.t2
Target Tables:
    public.t3

$ python -m sqllineage.cli -f test.sql --dialect=non-validating -l column         
public.t3.c1 <- sq1.c1 <- cte1.c1 <- public.t1.c1
public.t3.c1 <- sq1.c1 <- cte2.c1 <- public.t2.c1
$ python -m sqllineage.cli -f test.sql --dialect=redshift -l column
public.t3.c1 <- sq1.c1 <- cte1.c1 <- public.t1.c1
public.t3.c1 <- sq1.c1 <- cte2.c1 <- public.t2.c1

@maoxingda
Copy link
Contributor Author

maoxingda commented Dec 6, 2023

Thank you very much, boss. ✅

@reata
Copy link
Owner

reata commented Dec 6, 2023

two subqueries with the same name in the set_expression is the only remaining buggy sql in this issue that we will take care of in #489 .

@reata reata changed the title inconsistent column lineage SubQuery with Same Alias Visualized as Same Node Dec 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants