Add extension plugin to parse SQL into logical plan / user defined SQL parser #533
Is preventing a query such as the following desirable, or is this by design?
@adsharma I can't think of a reason why we would want to prevent the union query you mentioned. I don't think what @jorgecarleitao wrote in the issue description implies this?
I also do not see an issue with the example above. In general, though, custom SQL parsers effectively modify the SQL dialect that is being used, so the responsibility to document variations, including any limitations they may introduce relative to the "default" postgres dialect, lies with the applications that use/install custom parsers.
I don't have much context about the proposal and am trying to understand things better, so please bear with me. The reason why SQL doesn't have this… Instead, they prefer a flow such as:

…

I can also imagine a variant such as:

…

which has the same effect. The benefit of these variants is that they make it harder to write SQL that violates isolation properties.
Are the following good references for what delta lake is proposing?
https://docs.databricks.com/delta/concurrency-control.html
https://docs.databricks.com/delta/optimizations/isolation-level.html
I get that delta lake uses transactions on the *metadata* to ensure that a consistent view of the table is presented to batch jobs that may be reading it.

The reasoning I was using is that SQL is currently a unified query language that works for both OLTP and OLAP. In an ideal world, one set of extensions would address both use cases.

So what I understood from the discussion is: if there is an "events" table with a "timestamp" column that is partitioned by the hour, it's perfectly legitimate for a query to aggregate over hourly partitions to compute some sort of view. No transaction isolation guarantee is violated.

However, if a table is being updated with new data and a query is able to see both the old version and the new version and compute stats using some mix of the two, delta lake's isolation guarantees are violated (assuming the tables were set up with the WriteSerializable isolation level)?
…On Sun, Jun 20, 2021 at 7:34 PM QP Hou ***@***.***> wrote:
`as of n` is a deltalake-specific SQL extension. It's better to think of `t version as of n` as a different table. Datafusion is not a transactional query engine, so querying the same table should always return the same result.
For the avoidance of doubt, the goal of this issue is not to support these extension languages in DataFusion itself, but to allow users to plug in their own custom extensions, so that they can support them themselves.
An update here: There are several examples of people who have implemented their own custom query language / domain specific language using DataFusion. For example, VegaFusion with the Vega language, and InfluxDB IOx with the language that underlies Flux and InfluxQL. Typically the approach is to use the …

One way we might be able to accomplish this extension point would be like:

```rust
/// Default SQL planner
pub trait SQLPlanner {
    fn plan_query(&mut self, query: sqlparser::ast::Query) -> LogicalPlan {
        // call free function
        plan_query(self, query)
    }

    fn plan_expr(&mut self, expr: sqlparser::ast::Expr) -> Expr {
        // call free function
        plan_expr(self, expr)
    }
}

// Free functions that do the work (as you can't call default trait impls)
// https://stackoverflow.com/questions/43849041/reuse-default-method-implementations-in-trait-impls
fn plan_query<P: SQLPlanner>(planner: &mut P, query: sqlparser::ast::Query) -> LogicalPlan {
    // existing logic
}

fn plan_expr<P: SQLPlanner>(planner: &mut P, expr: sqlparser::ast::Expr) -> Expr {
    // existing logic
}
```

I was able to make this type of pattern work for "SQL rewriting", where the user is allowed to rewrite a SQL query prior to planning with the existing DataFusion planner. You can see a sketch of such a solution in https://github.com/influxdata/influxdb_iox/pull/5438.
An extension planner then looks like:

```rust
struct MySqlPlanner {}

impl SQLPlanner for MySqlPlanner {
    // override expression planning
    fn plan_expr(&mut self, expr: sqlparser::ast::Expr) -> Expr {
        if need_to_rewrite(&expr) {
            // custom planning logic
        } else {
            // fall back to default impl
            plan_expr(self, expr)
        }
    }
}
```

The issue with this approach is that it complicates the SQL planner non-trivially, so I am not sure it is a great idea.
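To make the fallback mechanics concrete, here is a self-contained toy version of the pattern sketched above. Everything in it (`Plan`, `Planner`, the `SHOUT` prefix) is invented for illustration and is not DataFusion's API; it only demonstrates why the shared logic must live in free functions.

```rust
// Toy illustration of "default trait method delegating to a free function".
// All names are hypothetical stand-ins, not DataFusion types.

#[derive(Debug, PartialEq)]
struct Plan(String);

trait Planner {
    // Default impl forwards to the free function below.
    fn plan_query(&mut self, query: &str) -> Plan {
        plan_query(self, query)
    }
}

// Free function holding the shared "existing logic"; overriding impls can
// call it to fall back, since an override cannot call the default it replaced.
fn plan_query<P: Planner + ?Sized>(_planner: &mut P, query: &str) -> Plan {
    Plan(format!("scan:{query}"))
}

struct DefaultPlanner;
impl Planner for DefaultPlanner {}

// An "extension" planner: handles one special case, then falls back.
struct ShoutPlanner;
impl Planner for ShoutPlanner {
    fn plan_query(&mut self, query: &str) -> Plan {
        if let Some(rest) = query.strip_prefix("SHOUT ") {
            Plan(format!("shout:{}", rest.to_uppercase()))
        } else {
            plan_query(self, query) // fall back to the shared logic
        }
    }
}

fn main() {
    assert_eq!(DefaultPlanner.plan_query("select 1"), Plan("scan:select 1".into()));
    assert_eq!(ShoutPlanner.plan_query("select 1"), Plan("scan:select 1".into()));
    assert_eq!(ShoutPlanner.plan_query("SHOUT hi"), Plan("shout:HI".into()));
}
```

The key design point is the one the Stack Overflow link above makes: once `ShoutPlanner` overrides `plan_query`, the trait's default body is unreachable from inside the override, so the default must delegate to a free function that both can call.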
That PR no longer links correctly, perhaps due to a rename. Does it still exist? I'd be curious to look at the example.
Hi @philippemnoel -- I am not sure how useful that PR would be, but I can dig up the diff if you would like. What are you trying to do?

You can see an example of using DataFusion to implement your own language here: https://github.com/influxdata/influxdb/tree/main/iox_query_influxql/src (our implementation of InfluxQL). I believe GreptimeDB has their own frontend for PromQL here: https://github.com/GreptimeTeam/greptimedb/tree/9ff7670adfb56a80fe6ffeab8bdab9bcfe55543c/src/promql

There are several ways to extend DataFusion, such as with scalar / aggregate / window functions, documented here, as well as custom operators and optimizer passes.
@tshauck mentioned this in discord: https://discord.com/channels/885562378132000778/1166447479609376850/1219782560436191342

Hi, I'm wondering if someone could give some advice on how to use DFParser in my own parser. Its impl is:

```rust
pub struct DFParser<'a> {
    parser: Parser<'a>,
}
```

And basically I want to do:

```rust
pub struct MyParser<'a> {
    parser: DFParser<'a>,
}
```

so I can reuse as much as possible (e.g. parse_statement), but implement my own parsing logic for Keyword::COPY. The issue I'm having is that it doesn't seem possible to access the underlying sqlparser Parser, as it's private to DFParser. At least with this approach, I'd need access to it so I could call next_token() and other methods. Is there a better approach to adding something like that on top of DFParser, or would the maintainers be open to making parser available for modification on the DFParser struct?

Put another way, the end goal is to customize how the SQL becomes a logical plan, so I can use CopyTo and a user defined logical node to write to a custom file format. Thanks!

For reference, here's parse_statement: https://github.com/apache/arrow-datafusion/blob/8074ca1e758470319699a562074290906003b312/datafusion/sql/src/parser.rs#L352-L384
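The access problem described above, and the accessor that would solve it, can be modeled with a self-contained toy example. All names here (`Inner`, `DfParser`, `parser_mut`) are hypothetical stand-ins for sqlparser's `Parser` and DataFusion's `DFParser`, not the real types:

```rust
// Toy model of wrapping a parser whose inner parser field is private.
mod df {
    // Stand-in for sqlparser's Parser.
    pub struct Inner {
        tokens: Vec<String>,
        pos: usize,
    }

    impl Inner {
        pub fn new(sql: &str) -> Self {
            Self { tokens: sql.split_whitespace().map(String::from).collect(), pos: 0 }
        }
        pub fn next_token(&mut self) -> Option<&str> {
            let t = self.tokens.get(self.pos).map(String::as_str);
            self.pos += 1;
            t
        }
    }

    // Stand-in for DFParser: the field is private, so code outside this
    // module cannot reach Inner's methods...
    pub struct DfParser {
        parser: Inner,
    }

    impl DfParser {
        pub fn new(sql: &str) -> Self {
            Self { parser: Inner::new(sql) }
        }
        // ...unless an accessor like this is exposed, which is what the
        // comment above asks the maintainers to consider.
        pub fn parser_mut(&mut self) -> &mut Inner {
            &mut self.parser
        }
    }
}

// A wrapper parser that handles one keyword itself and delegates otherwise.
struct MyParser {
    parser: df::DfParser,
}

impl MyParser {
    fn parse_first_keyword(&mut self) -> String {
        match self.parser.parser_mut().next_token() {
            Some("COPY") => "custom COPY handling".to_string(),
            Some(other) => format!("delegate: {other}"),
            None => "empty".to_string(),
        }
    }
}

fn main() {
    let mut p = MyParser { parser: df::DfParser::new("COPY t TO 'f'") };
    assert_eq!(p.parse_first_keyword(), "custom COPY handling");
    let mut q = MyParser { parser: df::DfParser::new("SELECT 1") };
    assert_eq!(q.parse_first_keyword(), "delegate: SELECT");
}
```

Without `parser_mut` (or a `pub` field), the wrapper simply cannot drive the inner token stream, which is exactly the limitation the comment runs into.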
@tshauck I think this feature has been missing for a while and it would be great if you could help / propose improvements. I personally think exposing the underlying Parser would be fine.

What I think would be the most compelling would be a new example in https://github.com/apache/arrow-datafusion/tree/main/datafusion-examples/examples showing how to do what you are describing above (aka support a custom file format). As part of that PR, you would likely have to add / modify the … What do you think?
@alamb Cool, yeah, happy to give this a go. I'll start with the example + making Parser public as you suggest. FWIW, I did a spike yesterday to see if I could go the encapsulation route, but perhaps unsurprisingly, you need somewhat lower level control for a parser, so it ended up with a lot of methods on …
As a user of DataFusion, I would like to be able to install custom SQL parsing rules into DataFusion, so that I can plan custom nodes from SQL. This would allow me to extend DataFusion's core capabilities beyond its supported SQL.

Examples:

- `OPTIMIZE`
- `VACUUM`
- `select * from t version as of n` (delta lake)

I would like support for 3 main cases:

- parse a custom statement (e.g. `select * from t version as of n`) into a logical node
- parse a custom expression (e.g. `my_custom_expr` in `select my_custom_expr(t) from table1`) into a custom logical expression
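As a rough illustration of the first case, a user defined parser for the delta-lake style `version as of n` suffix would need to distinguish a versioned table reference from a plain one before planning. The sketch below is a toy over whitespace-split strings; a real extension would work on sqlparser tokens, and `TableRef` is an invented type, not a DataFusion one:

```rust
// Toy recognizer for `t version as of n` vs a plain table name.

#[derive(Debug, PartialEq)]
enum TableRef {
    Plain(String),
    VersionAsOf { table: String, version: u64 },
}

fn parse_table_ref(s: &str) -> Option<TableRef> {
    let toks: Vec<&str> = s.split_whitespace().collect();
    match toks.as_slice() {
        // Just a table name.
        [t] => Some(TableRef::Plain(t.to_string())),
        // `<table> version as of <n>`: keywords matched case-insensitively.
        [t, v, a, o, n]
            if v.eq_ignore_ascii_case("version")
                && a.eq_ignore_ascii_case("as")
                && o.eq_ignore_ascii_case("of") =>
        {
            n.parse().ok().map(|version| TableRef::VersionAsOf {
                table: t.to_string(),
                version,
            })
        }
        _ => None,
    }
}

fn main() {
    assert_eq!(parse_table_ref("t"), Some(TableRef::Plain("t".into())));
    assert_eq!(
        parse_table_ref("t version as of 7"),
        Some(TableRef::VersionAsOf { table: "t".into(), version: 7 })
    );
    // A non-numeric version is rejected.
    assert_eq!(parse_table_ref("t version as of x"), None);
}
```

The `VersionAsOf` variant is what would then be planned into a dedicated logical node (effectively "a different table", as noted earlier in the thread).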