Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exposing SubQuery in Table Level Lineage #192

Open
wonb168 opened this issue Dec 3, 2021 · 17 comments
Open

Exposing SubQuery in Table Level Lineage #192

wonb168 opened this issue Dec 3, 2021 · 17 comments
Labels
enhancement New feature or request

Comments

@wonb168
Copy link

wonb168 commented Dec 3, 2021

with c1 as (select * from ttt),
c2 as(select * from c1)
select * from (select * from c2 as aa)b

I want the table lineage inclued subquery name, this example need output:
ttt -> c1
c1 -> c2
c2 -> b

@reata
Copy link
Owner

reata commented Dec 4, 2021

I see why you would want this feature, but unfortunately this would need a lot of changes to current code base. To make it happen, we need to refactor the code not limited to the following aspect:

  • Fix column lineage tracing for CTE referencing each other. Here CTE c2 is using data from CTE c1. But our lineage says only "tab1.* <- b.* <- .c2.", rather than "tab1. <- b.* <- .c2.* <- .c1.*". That's definitely a bug.
  • Using the same way to build table level lineage as we do for column. Right now, each SQL statement is exposing table lineage info like read, write, cte, etc. Instead, we need a graph. (Internally it is already stored in a graph, but we're not leveraging it).
  • Allow table lineage to be presented without a target table. Because current design only draws the DAG if there's a target table. I'm actually not convinced that we should change this.

In conclusion, thanks for reporting this, but this may take time.

@reata reata changed the title Not support subquery? Exposing SubQuery in Table Level Lineage Dec 4, 2021
@wonb168
Copy link
Author

wonb168 commented Dec 6, 2021

Because complex buniess login need many steps:temp table/subquery/with.
And then draw a diagraph to view the whole relations, the graph only have source table and target in unenoughh.
Another example:
create temporary table tmp_replenish_transfer_level as
with template_info as (
select id::text as biz_action_template_code
,parameter_table_id
from tenant_${tenant_id}biz.gto_biz_action_template
), parameter_table_info as (
select a.id,b.biz_action_template_code
from tenant
${tenant_id}biz.base_parameter_table as a
inner join template_info as b
on a.id = b.parameter_table_id
where a.is_deleted = 0
)
select biz_action_template_code
,min(id) as id,min(parameter_value_id) as parameter_value_id,min(parameter_code) as parameter_code,min(parameter_value::text) as parameter_value
,coalesce(min(replenish_transfer_level::text),'operating_unit') as replenish_transfer_level
from (
select a.biz_action_template_code
,a.id, b.id as parameter_value_id, b.parameter_code, b.parameter_value
, b.parameter_value ->> 'gto.replenish_transfer_level' as replenish_transfer_level
from parameter_table_info as a
inner join tenant
${tenant_id}biz.base_parameter_value as b
on a.id = b.parameter_table_id
where b.parameter_code = 'gto.replenish_transfer_level' and b.is_deleted = 0
union all
select distinct id::varchar as biz_action_template_code
,null::int as id,null::int as parameter_value_id,null as parameter_code,null::jsonb as parameter_value
,null as replenish_transfer_level
from tenant
${tenant_id}_biz.gto_biz_action_template
where is_deleted='0' and is_enable='1'
) aa
group by biz_action_template_code
order by biz_action_template_code::int

@reata
Copy link
Owner

reata commented Dec 11, 2021

Thanks for the example. I'll prioritize this issue.
Maybe I ask what's the tech stack / sql dialect you're using? Looks like PostgreSQL.

@reata reata added the enhancement New feature or request label Dec 11, 2021
@reata
Copy link
Owner

reata commented Dec 11, 2021

The first step is done after we close #196

@wonb168
Copy link
Author

wonb168 commented Dec 13, 2021

what's the tech stack / sql dialect you're using? Looks like PostgreSQL.
right, it's postgresql

@wonb168
Copy link
Author

wonb168 commented Dec 13, 2021

do you have a wechat or other social account?maybe communicate needs more convenient

@reata
Copy link
Owner

reata commented Dec 21, 2021

do you have a wechat or other social account?maybe communicate needs more convenient

I'd prefer to keep all the discussion here so that later people know how decision is made along the way. But feel free to email me if that's convenient for you.

@wonb168
Copy link
Author

wonb168 commented Feb 25, 2022

how about this issue going?

@reata
Copy link
Owner

reata commented Feb 26, 2022

how about this issue going?

Not so fast. Right now I'm still focusing on fixing various edge case bugs for column level lineage. Might need another 2 or 3 minor version release before things are stable at column level.

@wonb168
Copy link
Author

wonb168 commented Apr 18, 2022

column lineage can show subquery , then table lineage contain subquery perhaps easier, for example directly from column lineage is one way.

@huanglaoxie0503
Copy link

how about this issue going?

@reata
Copy link
Owner

reata commented Feb 5, 2023

We're prioritizing #302, changing the underlying parser to support more SQL dialects. Every other issue is put on hold for the moment.

@BrandenXia
Copy link

how about this issue going?

@reata
Copy link
Owner

reata commented Jul 20, 2023

how about this issue going?

Can you share the use case? I really didn't anticipate this feature should attract so many attentions?

@BrandenXia
Copy link

how about this issue going?

Can you share the use case? I really didn't anticipate this feature should attract so many attentions?

I'm now using MySQL. When doing data analysis, we use a lot of subquery. When a problem occur, I use this library to analyze the SQL so that we trace the origin of each field. However, when meeting subqueries, it's not so useful because we can only see that the data starts from a subquery.
Now we use sqlparse to do some preprocessing to the SQL to expand the subquery before we put it into LineageRunner.

@reata
Copy link
Owner

reata commented Jul 24, 2023

@BrandenXia it feels to me that your problem is mostly related with incomplete column lineage. Can you check #303 to see if that's the same with your issue? Or you can share some queries that you don't get what you expect with column lineage.

@BrandenXia
Copy link

@BrandenXia it feels to me that your problem is mostly related with incomplete column lineage. Can you check #303 to see if that's the same with your issue? Or you can share some queries that you don't get what you expect with column lineage.

Yeah, that's basically same with the problem I mentioned.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants