Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

qualified wildcard recognized as wrong column name #423

Closed
crossxwill opened this issue Aug 3, 2023 · 7 comments · Fixed by #430
Closed

qualified wildcard recognized as wrong column name #423

crossxwill opened this issue Aug 3, 2023 · 7 comments · Fixed by #430
Labels
bug Something isn't working

Comments

@crossxwill
Copy link

crossxwill commented Aug 3, 2023

The following SQL query uses "sales" as an alias for "T_SALES". How could I ask the LineageRunner() to return the table/view name rather than the alias?

-- sample_query_with_semicolons.sql

SELECT CUST_ID, CUST_NAME, CUST_CITY, CUST_COUNTRY
INTO #TEMP_CUSTOMERS
FROM T_CUSTOMERS
WHERE CUST_COUNTRY = 'USA';

SELECT CUST.*
    , SALES.CUM_SALES
INTO #TEMP_FINAL_RESULTS
FROM #TEMP_CUSTOMERS AS CUST
LEFT JOIN T_SALES AS SALES
ON CUST.CUST_ID = SALES.CUST_ID
WHERE SALES.CUST_ID IS NULL;

SELECT *
FROM #TEMP_FINAL_RESULTS;
from sqllineage.runner import LineageRunner
import csv
import pandas as pd

# read text from sample_query.sql
sql_script = open('sample_query_with_semicolons.sql', 'r').read()

# parse sql_script with LineageRunner
parsed_results = LineageRunner(sql_script, dialect="tsql")

# write parsed_results to data frame
df = pd.DataFrame(parsed_results.get_column_lineage())

df.columns = ['Source', 'Target']

print(df)

Expected output:

Source Target
<default>.t_customers.cust_city <default>.#temp_customers.cust_city
<default>.t_customers.cust_country <default>.#temp_customers.cust_country
<default>.t_customers.cust_id <default>.#temp_customers.cust_id
<default>.t_customers.cust_name <default>.#temp_customers.cust_name
<default>.t_sales.cum_sales <default>.#temp_final_results.cum_sales
<default>.#temp_customers.cust.* <default>.#temp_final_results.cust.*

Actual output:

Source Target
<default>.t_customers.cust_city <default>.#temp_customers.cust_city
<default>.t_customers.cust_country <default>.#temp_customers.cust_country
<default>.t_customers.cust_id <default>.#temp_customers.cust_id
<default>.t_customers.cust_name <default>.#temp_customers.cust_name
<default>.sales.cum_sales <default>.#temp_final_results.cum_sales
<default>.#temp_customers.cust.* <default>.#temp_final_results.cust.*
@reata
Copy link
Owner

reata commented Aug 6, 2023

Thanks for reporting this. I can confirm this is a bug we should fix.

@reata reata added the bug Something isn't working label Aug 6, 2023
@reata
Copy link
Owner

reata commented Aug 12, 2023

I found the problem is with the second statement:

FROM #TEMP_CUSTOMERS CUST AS CUST

@crossxwill Can you help confirm this is valid syntax instead of

FROM #TEMP_CUSTOMERS AS CUST

For this case, we actually should throw exception because there're parsing errors. Later if this proves to be valid syntax, we can fix the parser.

@reata reata added the question Further information is requested label Aug 12, 2023
@crossxwill
Copy link
Author

crossxwill commented Aug 12, 2023

You found a typo in my example. I fixed it. The bug still remains.

@reata
Copy link
Owner

reata commented Aug 13, 2023

Now with #429 merged, we will raise InvalidSyntaxException for previously buggy sql:

SELECT CUST.*
    , SALES.CUM_SALES
INTO #TEMP_FINAL_RESULTS
FROM #TEMP_CUSTOMERS CUST AS CUST
LEFT JOIN T_SALES AS SALES
ON CUST.CUST_ID = SALES.CUST_ID
WHERE SALES.CUST_ID IS NULL;
$ sqllineage -f test.sql --dialect=tsql -l column
...
sqllineage.exceptions.InvalidSyntaxException: This SQL statement is unparsable, please check potential syntax error for SQL:
SELECT CUST.*
    , SALES.CUM_SALES
INTO #TEMP_FINAL_RESULTS
FROM #TEMP_CUSTOMERS CUST AS CUST
LEFT JOIN T_SALES AS SALES
ON CUST.CUST_ID = SALES.CUST_ID
WHERE SALES.CUST_ID IS NULL;
Line 4, Position 27: Found unparsable section: 'AS CUST\nLEFT JOIN T_SALES AS SALES\nON CU...'

@reata reata removed the question Further information is requested label Aug 13, 2023
@reata
Copy link
Owner

reata commented Aug 13, 2023

Back to this story, the problem is still limited to second statement. It seems we have some issue handling alias with SELECT INTO statement. Using code in master branch, this is the output:

$ sqllineage -f test.sql --dialect=tsql -l column
<default>.#temp_final_results.cum_sales <- <default>.t_sales.cum_sales
<default>.#temp_final_results.cust.* <- cust.*

Whereas the expected output should be:

<default>.#temp_final_results.cum_sales <- <default>.t_sales.cum_sales
<default>.#temp_final_results.* <- <default>.#temp_customers.*

@reata
Copy link
Owner

reata commented Aug 13, 2023

The problem is not with SELECT INTO, rather all qualified wildcard suffered from the same issue:

INSERT INTO tab1
SELECT tab2.*
FROM tab2 a
         INNER JOIN tab3 b
                    ON a.id = b.id
<default>.tab1.tab2.* <- tab2.*

The correct output should be:

<default>.tab1.* <- <default>.tab2.*

@reata reata changed the title Show table or view alias qualified wildcard recognized as wrong column name Aug 13, 2023
@reata
Copy link
Owner

reata commented Aug 13, 2023

The fixed result would be

$ sqllineage -f test.sql -l column --dialect=tsql
<default>.#temp_customers.cust_city <- <default>.t_customers.cust_city
<default>.#temp_customers.cust_country <- <default>.t_customers.cust_country
<default>.#temp_customers.cust_id <- <default>.t_customers.cust_id
<default>.#temp_customers.cust_name <- <default>.t_customers.cust_name
<default>.#temp_final_results.* <- <default>.#temp_customers.*
<default>.#temp_final_results.cum_sales <- <default>.t_sales.cum_sales

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants