Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Column lineage does not traverse through CTE containing uppercase letters #531

Closed
martinswan opened this issue Jan 5, 2024 · 5 comments · Fixed by #550
Closed

Column lineage does not traverse through CTE containing uppercase letters #531

martinswan opened this issue Jan 5, 2024 · 5 comments · Fixed by #550
Labels
bug Something isn't working

Comments

@martinswan
Copy link

martinswan commented Jan 5, 2024

For SQL code with a CTE, column lineage is ending at the CTE and not traversing through to the source tables or expressions referenced in the CTE.

For example, I have the following code in test.sql:

WITH EmployeeDepartmentCTE AS (
    SELECT
        E.EmployeeName,
        D.DepartmentName
    FROM Employees AS E
    INNER JOIN Departments AS D
        ON E.DepartmentID = D.DepartmentID
)

INSERT INTO EmployeeDepartment

SELECT
    EmployeeName,
    DepartmentName
FROM EmployeeDepartmentCTE;

When I run the following command I get the subsequent output:

sqllineage -f test.sql -l column

output:

<default>.employeedepartment.departmentname <- <default>.employeedepartmentcte.departmentname
<default>.employeedepartment.employeename <- <default>.employeedepartmentcte.employeename

expected output:

<default>.employeedepartment.departmentname <- <default>.department.departmentname
<default>.employeedepartment.employeename <- <default>.employee.employeename

Python version:

  • 3.12.1

SQLLineage version:

  • 1.4.9

Additional context
If there are multiple CTE's with one referencing another, the lineage stops at the first CTE.

@martinswan martinswan added the bug Something isn't working label Jan 5, 2024
@maoxingda
Copy link
Contributor

Caused by capitalization, if the SQL is modified to all lowercase, the result will be accurate.

@martinswan
Copy link
Author

Thanks @maoxingda for your helpful response!

Would you consider it a bug that it needs to be lowercase, or is transforming to lowercase a best-practice that is (or should be) documented?

@maoxingda
Copy link
Contributor

It's because most databases are case-insensitive, so the underlying layer of sqllineage uniformly converts all identifiers to lowercase. Therefore, currently, lineage analysis is not sensitive to uppercase SQL.

@martinswan
Copy link
Author

Thanks. I'll close this issue since it sounds like it's a known bug that has already been considered.

@reata
Copy link
Owner

reata commented Jan 12, 2024

Thanks @maoxingda for the background. Yes we want to enforce case-insensitivity in lineage analysis in short term.

However, I'm afraid this is indeed a bug introduced via fbad73a and released in v1.4.9.

Converting all identifiers to lower case is just implementation details, the real intention is that for all the following SQL, they should generate the same lineage result:

WITH CTE AS (SELECT name FROM person) SELECT name FROM cte;
WITH cte AS (SELECT name FROM person) SELECT name FROM CTE;
WITH cte AS (SELECT name FROM person) SELECT name FROM cTe;
WITH ctE AS (SELECT name FROM person) SELECT name FROM cte;
WITH CtE AS (SELECT name FROM person) SELECT name FROM Cte;

because in a case-insensitive SQL database, they're indeed the same.

@reata reata reopened this Jan 12, 2024
@reata reata changed the title Column lineage does not traverse through CTE to find source columns Column lineage does not traverse through CTE containing uppercase letters Jan 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants