Metadata Masked When Table was in a previous UPDATE statement #577

Closed
phalcon22 opened this issue Feb 5, 2024 · 1 comment · Fixed by #581
Labels
bug Something isn't working

Comments

@phalcon22

Describe the bug

  • Updating a table and then using it in a join prevents column-level lineage from resolving the source table of the selected column.

SQL

UPDATE `dataset.table_1` SET A = '';

CREATE OR REPLACE TEMP TABLE `table_x` AS
SELECT DISTINCT
    B
FROM `dataset.table_1`
CROSS JOIN `dataset.table_2`
;

To Reproduce
Note: the SQL provided above is assumed to be saved in a file named test.sql.

import json
from sqllineage.core.metadata.dummy import DummyMetaDataProvider
from sqllineage.runner import LineageRunner

with open("test.sql") as f:
    sp = f.read()

with open('metadata.json') as file:
    metadata = json.load(file)
provider = DummyMetaDataProvider(metadata)

lineage = LineageRunner(sp, dialect="bigquery", metadata_provider=provider, verbose=True)
lineage.print_column_lineage()

metadata.json:

{
  "dataset.table_1": [
    "A",
    "B"
  ],
  "dataset.table_2": [
    "C"
  ]
}

Result

<default>.table_x.b <- b

Expected behavior

<default>.table_x.b <- dataset.table_1.b

Python version (available via python --version)

  • 3.9.12

SQLLineage version (available via sqllineage --version):

  • 1.5.1

Notes

  • Removing the UPDATE statement or the JOIN makes it work properly.
  • Any kind of JOIN produces the bug
@phalcon22 phalcon22 added the bug Something isn't working label Feb 5, 2024
@reata
Owner

reata commented Feb 6, 2024

The issue comes from how we handle session metadata. Session metadata exists because, for a temporary table/view created during the session, the schema info won't be available from the metadata provider, as the table/view has not been created yet.

But clearly we should limit the scope of session metadata to CREATE (and maybe INSERT). By no means should we register session metadata for an UPDATE statement.

In this case, we register session metadata for dataset.table_1 through the UPDATE statement, and that entry doesn't include any schema information. In the end, this masks the schema info from the MetaDataProvider.
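
To illustrate the masking described above, here is a minimal toy model. It is not SQLLineage's actual implementation: `ToyResolver`, `observe_write`, `columns`, `resolve`, and the `register_on` parameter are all hypothetical names invented for this sketch. It only assumes that a session-level entry takes precedence over the metadata provider, which is the behavior the comment describes.

```python
from typing import Dict, List, Sequence


class ToyResolver:
    """Hypothetical stand-in for column resolution; not SQLLineage code."""

    def __init__(self, provider: Dict[str, List[str]],
                 register_on: Sequence[str] = ("CREATE",)) -> None:
        self.provider = provider                 # schema from the metadata provider
        self.session: Dict[str, List[str]] = {}  # schema learned during the session
        self.register_on = tuple(register_on)    # statement types that register

    def observe_write(self, statement_type: str, table: str,
                      columns: List[str]) -> None:
        # Record a session entry only for the allowed statement types.
        if statement_type in self.register_on:
            self.session[table] = list(columns)

    def columns(self, table: str) -> List[str]:
        # Assumption: a session entry, even an empty one, wins over the provider.
        if table in self.session:
            return self.session[table]
        return self.provider.get(table, [])

    def resolve(self, column: str, tables: List[str]) -> str:
        # Qualify the column only when exactly one table is known to own it.
        owners = [t for t in tables if column in self.columns(t)]
        return f"{owners[0]}.{column}" if len(owners) == 1 else column


provider = {"dataset.table_1": ["A", "B"], "dataset.table_2": ["C"]}
joined = ["dataset.table_1", "dataset.table_2"]

# Reported behavior: the UPDATE registers an empty session entry for table_1.
buggy = ToyResolver(provider, register_on=("CREATE", "UPDATE"))
buggy.observe_write("UPDATE", "dataset.table_1", [])
buggy_result = buggy.resolve("B", joined)

# Proposed scope: the UPDATE is ignored, so the provider schema stays visible.
fixed = ToyResolver(provider)
fixed.observe_write("UPDATE", "dataset.table_1", [])
fixed_result = fixed.resolve("B", joined)

print(buggy_result)
print(fixed_result)
```

In this sketch the first resolver returns the bare column name, because the empty session entry hides the provider's column list for dataset.table_1, while the second returns the qualified name. This mirrors the Result and Expected behavior sections of the report; the toy does not lowercase identifiers the way the real output does.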

@reata reata changed the title Column Level Lineage: No source when updating a table then use a join Metadata Masked When Table was in a previous UPDATE statement Feb 18, 2024