Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Column Level Lineage #374

Closed
michael-berk opened this issue May 22, 2023 · 3 comments
Closed

Column Level Lineage #374

michael-berk opened this issue May 22, 2023 · 3 comments
Labels
question Further information is requested

Comments

@michael-berk
Copy link

michael-berk commented May 22, 2023

I'm looking to build column level lineage from a raw SQL query string. My test query, which is pretty difficult from a column-lineage perspective, produces the following issues.

Issues

  1. It can't handle window functions (with and without an alias)
  2. It can't handle aggregations (with and without an alias)
  3. It doesn't provide lineage for columns with more than 1 CTE in of depth
  4. It doesn't infer the columns projected in SELECT * based on source CTEs where columns are explicitly defined

Contributions

I'm happy to PR fixes, but this seems like a lot of work. Furthermore, parsing a sqlglot AST, via repr(parse_one(query)), or sqloxide query, via parse_sql(sql=query, dialect='ansi')[0], might be more effective because sqllineage is limited to the SQLparse output. How much work would it be to resolve the above 4 issues?

Code

hard_query = """
create table destination as
  with cte as (
    select *
      -- comment
    from source_table
  )

  , agged_cte as (
    select
      a.plum
      , max(a.grape)
      --other comment
      , MIN(b.tomato) as tomato
    from cte as a
    inner join (
      select
        tomato
        , pk
      from subquery_table
    ) as b
      on a.pk = b.pk
    group by id
  )


  select
    x.*
    , y.blueberry as berry
    , sum(y.apple) over(partition by y.pear order by y.cherry) as apple_window_function 
  from agged_cte as x
  inner join other_table as y
    using(plum)
  limit 10
"""

import subprocess

cmd = ["sqllineage", "-e", hard_query, "-l", "column"]

# Run SQLLineage and capture its output
output = subprocess.check_output(cmd, universal_newlines=True)
print(output)

# output: 
# <default>.destination.* <- agged_cte.*
# <default>.destination.berry <- <default>.other_table.blueberry
# <default>.destination.sum <- sum
@reata
Copy link
Owner

reata commented May 23, 2023

Issue 4 is tracked via #303

For Issue 1, Issue 2 and Issue 3, I remember to certain point they were brought up by user and fixed. But I don't have the details on when we fixed them. Can you share the sqllineage version you're using? As of v1.4.x, we have introduced sqlfluff as the underlying parser to address some problems that can not be easily solved with sqlparse.

With sqllineage v.1.4.3, and calling it via sqllineage -e hardquery -l column --dialects=ansi, you get the following result

<default>.destination.apple_window_function <- <default>.other_table.apple
<default>.destination.apple_window_function <- <default>.other_table.cherry
<default>.destination.apple_window_function <- <default>.other_table.pear
<default>.destination.berry <- <default>.other_table.blueberry
<default>.destination.x.* <- x.*

Looks to me, SELECT * from CTE is the only problem, correct?

@reata reata added the question Further information is requested label May 23, 2023
@michael-berk
Copy link
Author

Including --dialects=ansi resolved the issue on version 1.4.3. Thanks for the tip. After some quick tests, it does seems that 1-3 are resolved.

Two notes

  • Issue 4 is still relevant. I'm happy to contribute a PR if you think that's in scope and doable as a first PR.
  • <default>.destination.apple_window_function <- <default>.other_table.cherry and <default>.destination.apple_window_function <- <default>.other_table.pear are not selected. Does sqllineage define column-lineage as all columns in the select, regardless of whether they are selected? Given they are not written to <default>.destination, this is a bit misleading.

@reata
Copy link
Owner

reata commented May 31, 2023

Sorry, it's been a busy week.

  • for Issue 4, I'll leave some comment in Inferring Column Lineage When SELECT * FROM CTE/SubQuery #303 , and you can follow up there. I guess it's both yes and no whether it's doable as a first PR. You'll see why. But I welcome any PR even if it's partial solution. We can work together on this.
  • window function is treated as regular functions for unified behavior that any function parameter will be in column-lineage. This has its pros and cons. Image you're writing something like select coalesce(apple, pear, cherry) as fruit, then all three column to appear in the column-lineage is the natural result. A counter example would be, for function like select max_by(apple, weight) as heaviest_apple, from semantic point of view, it seems apple is the only source column because weight is used for sorting purpose only, just like cherry & pear in our window function example. I hope these two examples justify our design choice.

@reata reata closed this as completed May 31, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

2 participants