Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Hive/SparkSQL Multi Table Insert Syntax #596

Open
krahnikblis opened this issue Apr 17, 2024 · 1 comment
Open

Support Hive/SparkSQL Multi Table Insert Syntax #596

krahnikblis opened this issue Apr 17, 2024 · 1 comment
Labels
enhancement New feature or request

Comments

@krahnikblis
Copy link

krahnikblis commented Apr 17, 2024

Describe the bug
when processing SQL scripts where the grammar is backwards, LineageRunner().target_tables fails to parse/fix.

SQL
Paste the SQL text here. For example:

FROM `dbname`.`tblname_src`
INSERT OVERWRITE TABLE `dbname`.`tblname_dest`
PARTITION (parcol1,parcol2)
SELECT businessgroup	,
region	,
source_of_opportunity	,
pcg_rad	,
opportunity_id_d365	,
opportunity_id	,
opportunity_name	,
opportunity_owner	,
parcol1	,
parcol2

To Reproduce
Note here we refer to SQL provided in prior step as stored in a file named test.sql

from sqllineage.runner import LineageRunner
with open("test.sql") as f:
    sql = f.read()
result = LineageRunner(sql, dialect="sparksql")
print(result.target_tables)
---------------------------------------------------------------------------
InvalidSyntaxException                    Traceback (most recent call last)
Cell In[73], line 3
      1 # sql_fmt = sqlfluff.fix(sqlText, dialect="sparksql", config_path=cfg["sqlfluff_cfg"])
      2 sqllin = LineageRunner(sql=sql_fmt, dialect="sparksql")
----> 3 sqllin.target_tables

File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\sqllineage\runner.py:24, in lazy_method.<locals>.wrapper(*args, **kwargs)
     22 self = args[0]
     23 if not self._evaluated:
---> 24     self._eval()
     25 return func(*args, **kwargs)

File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\sqllineage\runner.py:187, in LineageRunner._eval(self)
    182         warnings.warn(
    183             f"Dialect={self._dialect}, TSQL_NO_SEMICOLON will be ignored unless dialect is tsql"
    184         )
    185     self._stmt = split(self._sql.strip())
--> 187 self._stmt_holders = [analyzer.analyze(stmt) for stmt in self._stmt]
    188 self._sql_holder = SQLLineageHolder.of(*self._stmt_holders)
    189 self._evaluated = True

File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\sqllineage\runner.py:187, in <listcomp>(.0)
    182         warnings.warn(
    183             f"Dialect={self._dialect}, TSQL_NO_SEMICOLON will be ignored unless dialect is tsql"
    184         )
    185     self._stmt = split(self._sql.strip())
--> 187 self._stmt_holders = [analyzer.analyze(stmt) for stmt in self._stmt]
    188 self._sql_holder = SQLLineageHolder.of(*self._stmt_holders)
    189 self._evaluated = True

File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\sqllineage\core\parser\sqlfluff\analyzer.py:42, in SqlFluffLineageAnalyzer.analyze(self, sql)
     40     statement_segments = [self.tsql_split_cache[sql]]
     41 else:
---> 42     statement_segments = self._list_specific_statement_segment(sql)
     43 if len(statement_segments) == 0:
     44     raise UnsupportedStatementException(
     45         f"SQLLineage cannot parse SQL:" f"{sql}"
     46     )  # pragma: no cover

File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\sqllineage\core\parser\sqlfluff\analyzer.py:73, in SqlFluffLineageAnalyzer._list_specific_statement_segment(self, sql)
     71 if violations:
     72     violation_msg = "\n".join(violations)
---> 73     raise InvalidSyntaxException(
     74         f"This SQL statement is unparsable, please check potential syntax error for SQL:\n"
     75         f"{sql}\n"
     76         f"{violation_msg}"
     77     )
     78 segments = []
     79 for top_segment in getattr(parsed.tree, "segments", []):

InvalidSyntaxException: This SQL statement is unparsable, please check potential syntax error for SQL:
<SQLtext printed>
Line 1, Position 1: Found unparsable section: 'FROM `dbname`.`tblname_src`...'

Expected behavior
processing the SQL as normal, or perhaps (better) put the FROM segment after SELECT [where it belongs; i know i know i didn't write the SQL just trying to organized and make sense of other peoples' stuff]

Python version (available via python --version)

  • 3.11.9

SQLLineage version (available via sqllineage --version):

  • 1.4.8

Additional context
it looks like this library sub-packages sqlfluff within its parser? i didn't know where to log the issue, but decided here since the errors above show \site-packages\sqllineage\ for all traces. separately, i do have sqlfluff 2.3.5 installed, i don't know if this means i have 2 versions or what...

@krahnikblis krahnikblis added the bug Something isn't working label Apr 17, 2024
@reata
Copy link
Owner

reata commented May 12, 2024

This feature is known as multiple insert clauses (also known as Multi Table Insert). It's not standard ANSI SQL syntax but a Hive extension. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Syntax.1

SparkSQL also supports this syntax according to its antlr g4 file, although it's not documented:
https://github.com/apache/spark/blob/v3.5.1/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4#L444

We need sqlfluff to support parsing this before we can add lineage analysis support. Upstream issue raise: sqlfluff/sqlfluff#5866

@reata reata changed the title "from ... insert ... select" syntax parse failure Support Hive/SparkSQL Multi Table Insert Syntax May 12, 2024
@reata reata added enhancement New feature or request and removed bug Something isn't working labels May 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants