Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add an Argument to Exclude SubQuery Column Node in Column Lineage Path #526

Closed
delphisharp opened this issue Jan 4, 2024 · 4 comments · Fixed by #543
Closed

Add an Argument to Exclude SubQuery Column Node in Column Lineage Path #526

delphisharp opened this issue Jan 4, 2024 · 4 comments · Fixed by #543
Labels
enhancement New feature or request

Comments

@delphisharp
Copy link
Contributor

Describe the bug
now get_column_lineage's result include SubQuery

To Reproduce
For example:

from sqllineage.runner import LineageRunner
v_sql = "insert into ta select b from (select b from tb union all select c from tc ) sub"
parse = LineageRunner(sql=v_sql)
for col_tuple in parse.get_column_lineage():
    print(col_tuple)
(Column: <default>.tb.b, Column: sub.b, Column: <default>.ta.b)
(Column: <default>.tc.c, Column: sub.b, Column: <default>.ta.b)

Expected behavior

(Column: <default>.tb.b, Column: <default>.ta.b)
(Column: <default>.tc.c, Column: <default>.ta.b)

Python version (available via python --version)

  • 3.10.13

SQLLineage version (available via sqllineage --version):

  • 1.4.9

Additional context
I will submit a PR later

@delphisharp delphisharp added the bug Something isn't working label Jan 4, 2024
@reata
Copy link
Owner

reata commented Jan 6, 2024

It's not a bug. Column lineage is defined as a path , instead of source * target column tuple.

The variable name exclude_subquery is misleading though. It merely means by default we want to exclude the column lineage path ending with column in subquery. And get_column_lineage returns List[Tuple[Column, ...]] rather than List[Tuple[Column, Column]]. The annotation is wrong.

We will not change the logic. But let me know if you're interested in getting the document right.

@reata reata added question Further information is requested and removed bug Something isn't working labels Jan 6, 2024
@delphisharp
Copy link
Contributor Author

临时表包括CTE,是加工过程中的临时变量。对于血缘展现和分析是不需要。
所以我还是坚持我的意见,应该在结果中剔除 临时表和CTE的结果。
另外,表血缘不展现临时表,那字段血缘为什么要展现临时表呢? 如果有应该都有,没有应该都没有。

The temporary table includes CTE, which is a temporary variable during processing. It is not required for lineage display and analysis.
So I still insist on my opinion that the results of temporary tables and CTE should be eliminated from the results.

In addition, table blood relations do not display temporary tables, so why should field blood relations display temporary tables? If there is, it should be there, if not there should be none.

@delphisharp
Copy link
Contributor Author

那我如果补充个方法, 实现 column lineage 的 Tuple[List[Column, Column]] 是不是也可以解决问题?
Then if I add a method to implement Tuple[List[Column, Column]] of column lineage, will it also solve the problem?

Repository owner deleted a comment from maoxingda Jan 7, 2024
Repository owner deleted a comment from delphisharp Jan 7, 2024
@reata
Copy link
Owner

reata commented Jan 7, 2024

Actually we have request to expose subquery at table level lineage, though personally I'm not taking that as high priority due to the fact that SubQuery cannot be added to table lineage very easily that it probably requires changes in current graph modeling. You can check details in #192 .

Regarding table vs column lineage, we have two very different APIs:

  • For table, it's source_tables, target_tables, intermediate_tables, each is a set of Tables. From graph point of view, it's 3 set of nodes.
  • For column, it's get_column_lineage, returning a list of column tuple. From graph point of view, it's a list of paths.

They're not comparable, unless we build a get_table_lineage method to be a list of paths alongside tables. If we ever do that, I'll certainly include subquery in get_table_lineage to be consistent with get_column_lineage so that we can solve #192.

If we really want to do this, I think instead of adding another method, we should add a new argument to get_column_lineage:

  1. existing exclude_subquery argument should be renamed to exclude_path_ending_in_subquery.
  2. add a new exclude_subquery_columns argument to exclude column from SubQuery in the path.

And document them clearly.

@reata reata added enhancement New feature or request and removed question Further information is requested labels Jan 14, 2024
@reata reata changed the title get_column_lineage_include_subquery Add an Argument to Exclude SubQuery Column Node in Column Lineage Path Jan 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants