Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inferring Column Lineage When SELECT * FROM CTE/SubQuery #303

Open
YvesMentens opened this issue Oct 12, 2022 · 3 comments
Open

Inferring Column Lineage When SELECT * FROM CTE/SubQuery #303

YvesMentens opened this issue Oct 12, 2022 · 3 comments
Labels
enhancement New feature or request

Comments

@YvesMentens
Copy link

YvesMentens commented Oct 12, 2022

I was wondering why the following is the case:

WITH A as
(
    SELECT colOld as col1,colOld2 as col2 FROM sourceTable
), B as (
    SELECT * FROM A
)
CREATE TABLE newTable as
SELECT * FROM B

returns :
.newtable.* <- B.* <- A.*

while

WITH A as
(
    SELECT colOld as col1,colOld2 as col2 FROM sourceTable
), B as (
    SELECT *,col2 FROM A
)
CREATE TABLE newTable as
SELECT * FROM B

only returns:
.newtable.* <- B.

While extra information is gathered in the query. The column lineage is actually reduced.
In fact in the current setup you should have all the information to create the following lineage:
.newtable.col1 <- B.col1 <- A.col1 <- sourceTable.colOld
.newtable.col2 <- B.col2 <- A.col2 <- sourceTable.colOld2

Is there any way to resolve this issue and get all the columns?
It might be possible to split the SQL up into parts and get the lineage for each CTE. Afterwards you can then deduct the actual lineage from the actual source tables to the target. If this is not clear enough I can provide an example in the near future

Kind regards

@reata
Copy link
Owner

reata commented Oct 15, 2022

The overall design for column lineage regarding SELECT * is that we create a virtual column called *. Because generally we don't know what's in *.

But clearly in your case, we should be able to derive the columns that * contains without the need to go to metadata. This is something sqllineage can improve on.

Right now there's no walk-around. It's better you avoid writing SELECT *.

@reata reata added the enhancement New feature or request label Oct 15, 2022
@ericzhou571
Copy link

ericzhou571 commented Feb 17, 2023

For your situation, you can split the SQL query into separate sections by separating the WITH clauses and the main SELECT statement. You can then parse each section individually using sqllineage. After parsing, you can manually add the relations you need using the WITH clause parsing results.

@reata
Copy link
Owner

reata commented May 31, 2023

+@mberk06 who's interested in solving this issue.

I'd like to first mention that the ambition of sqllineage v1.5.x is to introduce a catalog plugin mechanism, so we're capable of knowing what columns a table contains, even if the query is select * or select unqualified column from table join. A few catalog plugin implementation we'd provide out of the box are: a) hive metadata store; b) sqlalchemy; c) raw DB API 2.

Under this background, it's better we also build a catalog for subquery (note CTE is modeled as SubQuery class underneath sqllineage). But unlike table, the columns of subquery can be inferred during parsing. And from catalog perspective, they're unified.

That's the preferred way aligned with our road map. But you're also welcome to solve this one with point solution.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants