Skip to content

Commit d0ca3dc

Browse files
maoxingdareata
andauthored
fix: misidentify column name as lateral alias (#540)
* fix: misidentify-column-name-as-alias (#539) * add LATERAL_COLUMN_ALIAS_REFERENCE in SQLLineageConfig * adjust import order * add test_column_top_level_enable_lateral_ref_with_metadata_from_nested_subquery * unknown * refactor: rebase master and convert LATERAL_COLUMN_ALIAS_REFERENCE to bool type * refactor: use as few condition as possible: SQLLineageConfig.LATERAL_COLUMN_ALIAS_REFERENCE * refactor: rebase master and resolve conflict * refactor: move logic from to_source_columns to end_of_query_cleanup * refactor: rebase master and fix black format * docs: LATERAL_COLUMN_ALIAS_REFERENCE how-to guide * docs: starting version for each config --------- Co-authored-by: reata <[email protected]>
1 parent a93b894 commit d0ca3dc

File tree

9 files changed

+540
-62
lines changed

9 files changed

+540
-62
lines changed

docs/gear_up/configuration.rst

+47-7
Original file line numberDiff line numberDiff line change
@@ -15,25 +15,65 @@ default schema name.
1515

1616
Default: ``""``
1717

18+
Since: 1.5.0
19+
1820
DIRECTORY
1921
=========
2022
Frontend app SQL directory. By default the frontend app is showing the data directory packaged with sqllineage,
2123
which includes tpcds queries for demo purposes. User can customize with this key.
2224

2325
Default: ``sqllineage/data``
2426

27+
Since: 1.2.1
28+
29+
LATERAL_COLUMN_ALIAS_REFERENCE
30+
==============================
31+
Enable lateral column alias reference. This is a syntax feature supported by some SQL dialects. See:
32+
33+
- Amazon Redshift: `Amazon Redshift announces support for lateral column alias reference`_
34+
- Spark (since 3.4): `Support "lateral column alias references" to allow column aliases to be used within SELECT clauses`_
35+
- Databricks: `Introducing the Support of Lateral Column Alias`_
36+
37+
.. note::
38+
Lateral column alias reference is a feature that must be used together with metadata for each column to be
39+
correctly resolved. Take below example:
40+
41+
.. code-block:: sql
42+
43+
SELECT clicks / impressions as probability,
44+
round(100 * probability, 1) as percentage
45+
FROM raw_data
46+
47+
If table raw_data has a column named **probability**, **probability** in the second selected column is from table
48+
raw_data. Otherwise, it's referencing alias **clicks / impressions as probability**.
49+
50+
That means with SQLLineage, besides making LATERAL_COLUMN_ALIAS_REFERENCE=TRUE, MetaDataProvider must also be
51+
provided so we can query raw_data table to see if it has a column named **probability** and then check alias reference.
52+
If not provided, we will fallback to default behavior to simply assume column **probability** is from table raw_data
53+
even if LATERAL_COLUMN_ALIAS_REFERENCE is set to TRUE.
54+
55+
Default: ``False``
56+
57+
Since: 1.5.1
58+
2559
TSQL_NO_SEMICOLON
2660
=================
27-
Enable tsql no semicolon splitter mode.
61+
Enable tsql no semicolon splitter mode. Transact-SQL offers this feature that even when SQL statements are not delimited
62+
by semicolon, it can still be parsed and executed.
2863

2964
.. warning::
30-
Transact-SQL offers this feature that even when SQL statements are not delimited by semicolon, it can still be
31-
parsed and executed. But quoting `tsql_syntax_convention`_, "although the semicolon isn't required for most
32-
statements in this version (v16) of SQL Server, it will be required in a future version". So this config key is
33-
kept mostly for backward-compatible purposes. We may drop the support any time without warning. Bear this in mind
34-
when using this feature with sqllineage.
65+
Quoting `Transact-SQL syntax conventions (Transact-SQL)`_, "although the semicolon isn't required for most
66+
statements in this version (v16) of SQL Server, it will be required in a future version".
67+
68+
So with SQLLineage, this config key is kept mostly for backward-compatible purposes. We may drop the support any
69+
time without warning. Bear this in mind when using this feature.
3570

3671
Default: ``False``
3772

73+
Since: 1.4.8
74+
3875

39-
.. _tsql_syntax_convention: https://learn.microsoft.com/en-us/sql/t-sql/language-elements/transact-sql-syntax-conventions-transact-sql?view=sql-server-ver16
76+
.. _Amazon Redshift announces support for lateral column alias reference: https://aws.amazon.com/about-aws/whats-new/2018/08/amazon-redshift-announces-support-for-lateral-column-alias-reference/
77+
.. _Support "lateral column alias references" to allow column aliases to be used within SELECT clauses: https://issues.apache.org/jira/browse/SPARK-27561
78+
.. _Introducing the Support of Lateral Column Alias: https://www.databricks.com/blog/introducing-support-lateral-column-alias
79+
.. _Transact-SQL syntax conventions (Transact-SQL): https://learn.microsoft.com/en-us/sql/t-sql/language-elements/transact-sql-syntax-conventions-transact-sql?view=sql-server-ver16

sqllineage/config.py

+2
Original file line numberDiff line numberDiff line change
@@ -14,6 +14,8 @@ class _SQLLineageConfigLoader:
1414
"DEFAULT_SCHEMA": (str, ""),
1515
# to enable tsql no semicolon splitter mode
1616
"TSQL_NO_SEMICOLON": (bool, False),
17+
# lateral column alias reference supported by some dialect (redshift, spark 3.4+, etc)
18+
"LATERAL_COLUMN_ALIAS_REFERENCE": (bool, False),
1719
}
1820
BOOLEAN_TRUE_STRINGS = ("true", "on", "ok", "y", "yes", "1")
1921

sqllineage/core/holders.py

+4-2
Original file line numberDiff line numberDiff line change
@@ -420,7 +420,7 @@ def _build_digraph(
420420
for unresolved_col, tgt_col in unresolved_cols:
421421
# check if there's only one parent candidate contains the column with same name
422422
src_cols = []
423-
# check if source column exists in graph
423+
# check if source column exists in graph (either from subquery or from table created in prev statement)
424424
for parent in unresolved_col.parent_candidates:
425425
src_col = Column(unresolved_col.raw_name)
426426
src_col.parent = parent
@@ -442,7 +442,9 @@ def _build_digraph(
442442
# It incorrect for JOIN with ON, but sql without specifying an alias in this case will be invalid
443443
for src_col in src_cols:
444444
g.add_edge(src_col, tgt_col, type=EdgeType.LINEAGE)
445-
g.remove_edge(unresolved_col, tgt_col)
445+
if len(src_cols) > 0:
446+
# only delete unresolved column when it's resolved
447+
g.remove_edge(unresolved_col, tgt_col)
446448

447449
# when unresolved column got resolved, it will be orphan node, and we can remove it
448450
for node in [n for n, deg in g.degree if deg == 0]:

sqllineage/core/models.py

+1
Original file line numberDiff line numberDiff line change
@@ -158,6 +158,7 @@ def __init__(self, name: str, **kwargs):
158158
"source_columns", ((self.raw_name, None),)
159159
)
160160
]
161+
self.from_alias = kwargs.pop("from_alias", False)
161162

162163
def __str__(self):
163164
return (

sqllineage/core/parser/__init__.py

+48-30
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,6 @@
1-
from typing import List, Tuple, Union
1+
from typing import Dict, List, Tuple, Union
22

3+
from sqllineage.config import SQLLineageConfig
34
from sqllineage.core.holders import SubQueryLineageHolder
45
from sqllineage.core.models import Column, Path, SubQuery, Table
56
from sqllineage.exceptions import SQLLineageException
@@ -23,41 +24,58 @@ def end_of_query_cleanup(self, holder: SubQueryLineageHolder) -> None:
2324
if holder.write:
2425
if len(holder.write) > 1:
2526
raise SQLLineageException
26-
tgt_tbl = list(holder.write)[0]
27-
lateral_aliases = set()
28-
for idx, tgt_col in enumerate(col_grp):
29-
tgt_col.parent = tgt_tbl
30-
for lateral_alias_ref in col_grp[idx + 1 :]: # noqa: E203
31-
if any(
32-
src_col[0] == tgt_col.raw_name
33-
for src_col in lateral_alias_ref.source_columns
34-
):
35-
lateral_aliases.add(tgt_col.raw_name)
36-
break
37-
for src_col in tgt_col.to_source_columns(
38-
holder.get_alias_mapping_from_table_group(tbl_grp)
27+
tgt_tbl = next(iter(holder.write))
28+
lateral_column_aliases: Dict[str, List[Column]] = {}
29+
for idx, tgt_col_from_query in enumerate(col_grp):
30+
tgt_col_from_query.parent = tgt_tbl
31+
tgt_col_resolved = tgt_col_from_query
32+
src_cols_resolved = []
33+
for src_col in tgt_col_from_query.to_source_columns(
34+
holder.get_alias_mapping_from_table_group(tbl_grp),
3935
):
4036
if len(write_columns := holder.write_columns) == len(col_grp):
4137
# example query: create view test (col3) select col1 as col2 from tab
4238
# without write_columns = [col3] information, by default src_col = col1 and tgt_col = col2
4339
# when write_columns exist and length matches, we want tgt_col = col3 instead of col2
4440
# for invalid query: create view test (col3, col4) select col1 as col2 from tab,
4541
# when the length doesn't match, we fall back to default behavior
46-
tgt_col = write_columns[idx]
47-
is_lateral_alias_ref = False
48-
for wc in holder.write_columns:
49-
if wc.raw_name == "*":
50-
continue
51-
if (
52-
src_col.raw_name == wc.raw_name
53-
and src_col.raw_name in lateral_aliases
42+
tgt_col_resolved = write_columns[idx]
43+
# lateral column alias handling
44+
lca_flag = False
45+
if SQLLineageConfig.LATERAL_COLUMN_ALIAS_REFERENCE:
46+
if metadata_provider := getattr(
47+
self, "metadata_provider", None
5448
):
55-
is_lateral_alias_ref = True
56-
for lateral_alias_col in holder.get_source_columns(wc):
57-
holder.add_column_lineage(
58-
lateral_alias_col, tgt_col
49+
from_dataset = False
50+
for parent_candidate in src_col.parent_candidates:
51+
if isinstance(
52+
parent_candidate, Table
53+
) and src_col in metadata_provider.get_table_columns(
54+
parent_candidate
55+
):
56+
from_dataset = True
57+
elif isinstance(
58+
parent_candidate, SubQuery
59+
) and src_col in holder.get_table_columns(
60+
parent_candidate
61+
):
62+
from_dataset = True
63+
if not from_dataset and (
64+
lca_cols_resolved := lateral_column_aliases.get(
65+
src_col.raw_name, []
5966
)
60-
break
61-
if is_lateral_alias_ref:
62-
continue
63-
holder.add_column_lineage(src_col, tgt_col)
67+
):
68+
src_cols_resolved.extend(lca_cols_resolved)
69+
lca_flag = True
70+
if not lca_flag:
71+
src_cols_resolved.append(src_col)
72+
73+
for src_col_resolved in src_cols_resolved:
74+
holder.add_column_lineage(src_col_resolved, tgt_col_resolved)
75+
if (
76+
SQLLineageConfig.LATERAL_COLUMN_ALIAS_REFERENCE
77+
and tgt_col_from_query.from_alias
78+
):
79+
lateral_column_aliases[tgt_col_from_query.raw_name] = (
80+
src_cols_resolved
81+
)

sqllineage/core/parser/sqlfluff/extractors/base.py

+8-3
Original file line numberDiff line numberDiff line change
@@ -2,6 +2,7 @@
22
from operator import add
33
from typing import List, Optional, Type, Union
44

5+
import networkx as nx
56
from sqlfluff.core.parser import BaseSegment
67

78
from sqllineage.core.holders import SubQueryLineageHolder
@@ -16,6 +17,7 @@
1617
list_join_clause,
1718
list_subqueries,
1819
)
20+
from sqllineage.utils.constant import NodeTag
1921
from sqllineage.utils.entities import AnalyzerContext, SubQueryTuple
2022
from sqllineage.utils.helpers import escape_identifier_name
2123

@@ -220,9 +222,12 @@ def extract_subquery(
220222
if sq.query.get_child("with_compound_statement")
221223
else SelectExtractor
222224
)
223-
holder |= extractor_cls(self.dialect, self.metadata_provider).extract(
224-
sq.query, AnalyzerContext(cte=holder.cte, write={sq})
225-
)
225+
subquery_holder = extractor_cls(
226+
self.dialect, self.metadata_provider
227+
).extract(sq.query, AnalyzerContext(cte=holder.cte, write={sq}))
228+
# remove WRITE tag from subquery so that the combined holder won't have multiple WRITE dataset
229+
nx.set_node_attributes(subquery_holder.graph, {sq, False}, NodeTag.WRITE)
230+
holder |= subquery_holder
226231

227232
@staticmethod
228233
def _init_holder(context: AnalyzerContext) -> SubQueryLineageHolder:

sqllineage/core/parser/sqlfluff/extractors/select.py

+12-4
Original file line numberDiff line numberDiff line change
@@ -48,6 +48,17 @@ def extract(
4848
# so that each handler don't have to worry about what's inside subquery
4949
subqueries.append(sq)
5050

51+
if is_set_expression(segment):
52+
for _, sub_segment in enumerate(
53+
segment.get_children("select_statement", "bracketed")
54+
):
55+
for seg in list_child_segments(sub_segment):
56+
for sq in self.list_subquery(seg):
57+
subqueries.append(sq)
58+
59+
self.extract_subquery(subqueries, holder)
60+
61+
for segment in segments:
5162
self._handle_swap_partition(segment, holder)
5263
self._handle_select_into(segment, holder)
5364
self.tables.extend(
@@ -64,17 +75,14 @@ def extract(
6475
(len(self.columns), len(self.tables))
6576
)
6677
for seg in list_child_segments(sub_segment):
67-
for sq in self.list_subquery(seg):
68-
subqueries.append(sq)
6978
self.tables.extend(
7079
self._list_table_from_from_clause_or_join_clause(
7180
seg, holder
7281
)
7382
)
7483
self._handle_column(seg)
75-
self.end_of_query_cleanup(holder)
7684

77-
self.extract_subquery(subqueries, holder)
85+
self.end_of_query_cleanup(holder)
7886

7987
holder.expand_wildcard(self.metadata_provider)
8088

sqllineage/core/parser/sqlfluff/models.py

+2-8
Original file line numberDiff line numberDiff line change
@@ -105,10 +105,7 @@ def of(column: BaseSegment, **kwargs) -> Column:
105105
if column.type == "select_clause_element":
106106
source_columns, alias = SqlFluffColumn._get_column_and_alias(column)
107107
if alias:
108-
return Column(
109-
alias,
110-
source_columns=source_columns,
111-
)
108+
return Column(alias, source_columns=source_columns, from_alias=True)
112109
if source_columns:
113110
column_name = None
114111
for sub_segment in list_child_segments(column):
@@ -145,10 +142,7 @@ def of(column: BaseSegment, **kwargs) -> Column:
145142

146143
# Wildcard, Case, Function without alias (thus not recognized as an Identifier)
147144
source_columns = SqlFluffColumn._extract_source_columns(column)
148-
return Column(
149-
column.raw,
150-
source_columns=source_columns,
151-
)
145+
return Column(column.raw, source_columns=source_columns)
152146

153147
@staticmethod
154148
def _extract_source_columns(segment: BaseSegment) -> List[ColumnQualifierTuple]:

0 commit comments

Comments
 (0)