Add extension plugin to parse SQL into logical plan / user defined SQL parser #533
Is preventing a query such as the following desirable, or is this by design?
@adsharma I can't think of a reason why we would want to prevent the union query you mentioned. I don't think what @jorgecarleitao wrote in the issue description implies this?
I also do not see an issue with the example above. In general, though, custom SQL parsers effectively modify the SQL dialect that is being used, so the responsibility to document variations, including any limitations they may introduce relative to the "default" postgres dialect, lies with the applications that use/install custom parsers.
I don't have much context about the proposal and am trying to understand things better, so please bear with me. The reason why SQL doesn't have this… Instead, they prefer a flow such as:

…

I can also imagine a variant such as:

…

which has the same effect. The benefit of these variants is that they make it harder to write SQL that violates isolation properties.
Are the following good references for what delta lake is proposing?
https://docs.databricks.com/delta/concurrency-control.html
https://docs.databricks.com/delta/optimizations/isolation-level.html
I get that delta lake uses transactions on the *metadata* to ensure that a consistent view of the table is presented to batch jobs that may be reading it.

The reasoning I was using is that SQL is currently a unified query language that works for both OLTP and OLAP. In an ideal world, one set of extensions would address both use cases.

So what I understood from the discussion is: if there is an "events" table with a "timestamp" column that is partitioned by the hour, it's perfectly legitimate for a query to aggregate over hourly partitions to compute some sort of view. No transaction isolation guarantee is violated.

However, if a table is being updated with new data and a query is able to see both the old version and the new version and compute stats using some mix of the two, delta lake's isolation guarantees are violated (assuming the tables were set up with the WriteSerializable isolation level)?
…On Sun, Jun 20, 2021 at 7:34 PM QP Hou ***@***.***> wrote:
`as of n` is a deltalake-specific SQL extension. It's better to think of `t version as of n` as a different table. Datafusion is not a transactional query engine, so querying the same table should always return the same result.
For the avoidance of doubt, the goal of this issue is not to support these extension languages in DataFusion itself, but to allow users to plug in their own custom extensions, so that they can support them themselves.
An update here: There are several examples of people who have implemented their own custom query language / domain specific language using DataFusion. For example, VegaFusion with the Vega language, and InfluxDB IOx with the language that underlies Flux and InfluxQL. Typically the approach is to use the …

One way we might be able to accomplish this extension point would be like:

```rust
/// Default SQL planner
pub trait SQLPlanner {
    fn plan_query(&mut self, query: sqlparser::ast::Query) -> LogicalPlan {
        // call free function
        plan_query(self, query)
    }

    fn plan_expr(&mut self, expr: sqlparser::ast::Expr) -> Expr {
        // call free function
        plan_expr(self, expr)
    }
}

// Free functions that do the work (as you can't call default trait impls)
// https://stackoverflow.com/questions/43849041/reuse-default-method-implementations-in-trait-impls
fn plan_query<P: SQLPlanner>(planner: &mut P, query: sqlparser::ast::Query) -> LogicalPlan {
    // existing logic
}

fn plan_expr<P: SQLPlanner>(planner: &mut P, expr: sqlparser::ast::Expr) -> Expr {
    // existing logic
}
```

I was able to make this type of pattern work for "SQL rewriting", where the user is allowed to rewrite a SQL query prior to planning with the existing DataFusion planner. You can see a sketch of such a solution in https://github.com/influxdata/influxdb_iox/pull/5438.
An extension planner then looks like:

```rust
struct MySqlPlanner {}

impl SQLPlanner for MySqlPlanner {
    // override expression planning
    fn plan_expr(&mut self, expr: sqlparser::ast::Expr) -> Expr {
        if need_to_rewrite(&expr) {
            // custom planning logic
        } else {
            // fall back to default impl
            plan_expr(self, expr)
        }
    }
}
```

The issue with this approach is that it complicates the SQL planner non-trivially, so I am not sure it is a great idea.
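To make the fallback mechanics concrete, here is a self-contained toy version of the pattern sketched above. Everything in it (`Plan`, `Planner`, the `SHOUT` prefix) is invented for illustration and is not DataFusion's API; it only demonstrates why the shared logic must live in free functions.

```rust
// Toy illustration of "default trait method delegating to a free function".
// All names are hypothetical stand-ins, not DataFusion types.

#[derive(Debug, PartialEq)]
struct Plan(String);

trait Planner {
    // Default impl forwards to the free function below.
    fn plan_query(&mut self, query: &str) -> Plan {
        plan_query(self, query)
    }
}

// Free function holding the shared "existing logic"; overriding impls can
// call it to fall back, since an override cannot call the default it replaced.
fn plan_query<P: Planner + ?Sized>(_planner: &mut P, query: &str) -> Plan {
    Plan(format!("scan:{query}"))
}

struct DefaultPlanner;
impl Planner for DefaultPlanner {}

// An "extension" planner: handles one special case, then falls back.
struct ShoutPlanner;
impl Planner for ShoutPlanner {
    fn plan_query(&mut self, query: &str) -> Plan {
        if let Some(rest) = query.strip_prefix("SHOUT ") {
            Plan(format!("shout:{}", rest.to_uppercase()))
        } else {
            plan_query(self, query) // fall back to the shared logic
        }
    }
}

fn main() {
    assert_eq!(DefaultPlanner.plan_query("select 1"), Plan("scan:select 1".into()));
    assert_eq!(ShoutPlanner.plan_query("select 1"), Plan("scan:select 1".into()));
    assert_eq!(ShoutPlanner.plan_query("SHOUT hi"), Plan("shout:HI".into()));
}
```

The key design point is the one the Stack Overflow link above makes: once `ShoutPlanner` overrides `plan_query`, the trait's default body is unreachable from inside the override, so the default must delegate to a free function that both can call.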
That PR no longer links correctly, perhaps due to a rename. Does it still exist? I'd be curious to look at the example.
Hi @philippemnoel -- I am not sure how useful that PR would be, but I can dig up the diff if you would like. What are you trying to do?

You can see an example of using DataFusion to implement your own language here: https://github.com/influxdata/influxdb/tree/main/iox_query_influxql/src (our implementation of InfluxQL). I believe GreptimeDB has their own frontend for PromQL here: https://github.com/GreptimeTeam/greptimedb/tree/9ff7670adfb56a80fe6ffeab8bdab9bcfe55543c/src/promql

There are several ways to extend DataFusion, such as with scalar / aggregate / window functions, documented here, as well as custom operators and optimizer passes.
@tshauck mentioned this in discord: https://discord.com/channels/885562378132000778/1166447479609376850/1219782560436191342

Hi, I'm wondering if someone could give some advice on how to use DFParser in my own parser. Its impl is:

```rust
pub struct DFParser<'a> {
    parser: Parser<'a>,
}
```

And basically I want to do:

```rust
pub struct MyParser<'a> {
    parser: DFParser<'a>,
}
```

so I can reuse as much as possible (e.g. parse_statement), but implement my own parsing logic for Keyword::COPY. The issue I'm having is that it doesn't seem possible to access the underlying sqlparser Parser, as it's private to DFParser. At least with this approach, I'd need access to it so I could call next_token() and other methods. Is there a better approach to adding something like that on top of DFParser, or would the maintainers be open to making parser available for modification on the DFParser struct?

Put another way, the end goal is to customize how the SQL becomes a logical plan, so I can use CopyTo and a user defined logical node to write to a custom file format. Thanks!

For reference, here's parse_statement: https://github.com/apache/arrow-datafusion/blob/8074ca1e758470319699a562074290906003b312/datafusion/sql/src/parser.rs#L352-L384
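The access problem described above, and the accessor that would solve it, can be modeled with a self-contained toy example. All names here (`Inner`, `DfParser`, `parser_mut`) are hypothetical stand-ins for sqlparser's `Parser` and DataFusion's `DFParser`, not the real types:

```rust
// Toy model of wrapping a parser whose inner parser field is private.
mod df {
    // Stand-in for sqlparser's Parser.
    pub struct Inner {
        tokens: Vec<String>,
        pos: usize,
    }

    impl Inner {
        pub fn new(sql: &str) -> Self {
            Self { tokens: sql.split_whitespace().map(String::from).collect(), pos: 0 }
        }
        pub fn next_token(&mut self) -> Option<&str> {
            let t = self.tokens.get(self.pos).map(String::as_str);
            self.pos += 1;
            t
        }
    }

    // Stand-in for DFParser: the field is private, so code outside this
    // module cannot reach Inner's methods...
    pub struct DfParser {
        parser: Inner,
    }

    impl DfParser {
        pub fn new(sql: &str) -> Self {
            Self { parser: Inner::new(sql) }
        }
        // ...unless an accessor like this is exposed, which is what the
        // comment above asks the maintainers to consider.
        pub fn parser_mut(&mut self) -> &mut Inner {
            &mut self.parser
        }
    }
}

// A wrapper parser that handles one keyword itself and delegates otherwise.
struct MyParser {
    parser: df::DfParser,
}

impl MyParser {
    fn parse_first_keyword(&mut self) -> String {
        match self.parser.parser_mut().next_token() {
            Some("COPY") => "custom COPY handling".to_string(),
            Some(other) => format!("delegate: {other}"),
            None => "empty".to_string(),
        }
    }
}

fn main() {
    let mut p = MyParser { parser: df::DfParser::new("COPY t TO 'f'") };
    assert_eq!(p.parse_first_keyword(), "custom COPY handling");
    let mut q = MyParser { parser: df::DfParser::new("SELECT 1") };
    assert_eq!(q.parse_first_keyword(), "delegate: SELECT");
}
```

Without `parser_mut` (or a `pub` field), the wrapper simply cannot drive the inner token stream, which is exactly the limitation the comment runs into.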
@tshauck I think this feature has been missing for a while and it would be great if you could help / propose improvements. I personally think exposing the underlying Parser would be fine.

What I think would be the most compelling would be a new example in https://github.com/apache/arrow-datafusion/tree/main/datafusion-examples/examples showing how to do what you are describing above (aka support a custom file format). As part of that PR, you would likely have to add / modify the … What do you think?
@alamb Cool, yeah, happy to give this a go. I'll start with the example + making Parser public as you suggest. FWIW, I did a spike yesterday to see if I could go the encapsulation route, but perhaps unsurprisingly, you need somewhat lower level control for a parser, so it ended up with a lot of methods on …
As a user of DataFusion, I would like to be able to install custom SQL parsing rules into DataFusion, so that I can plan custom nodes from SQL. This would allow me to extend DataFusion's core capabilities beyond its supported SQL.

Examples:

- `OPTIMIZE`
- `VACUUM`
- `select * from t version as of n` (delta lake)

I would like support for 3 main cases:

- parse a custom statement (e.g. `select * from t version as of n`) into a logical node
- parse a custom expression (e.g. `my_custom_expr` in `select my_custom_expr(t) from table1`) into a custom logical expression
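As a rough illustration of the first case, a user defined parser for the delta-lake style `version as of n` suffix would need to distinguish a versioned table reference from a plain one before planning. The sketch below is a toy over whitespace-split strings; a real extension would work on sqlparser tokens, and `TableRef` is an invented type, not a DataFusion one:

```rust
// Toy recognizer for `t version as of n` vs a plain table name.

#[derive(Debug, PartialEq)]
enum TableRef {
    Plain(String),
    VersionAsOf { table: String, version: u64 },
}

fn parse_table_ref(s: &str) -> Option<TableRef> {
    let toks: Vec<&str> = s.split_whitespace().collect();
    match toks.as_slice() {
        // Just a table name.
        [t] => Some(TableRef::Plain(t.to_string())),
        // `<table> version as of <n>`: keywords matched case-insensitively.
        [t, v, a, o, n]
            if v.eq_ignore_ascii_case("version")
                && a.eq_ignore_ascii_case("as")
                && o.eq_ignore_ascii_case("of") =>
        {
            n.parse().ok().map(|version| TableRef::VersionAsOf {
                table: t.to_string(),
                version,
            })
        }
        _ => None,
    }
}

fn main() {
    assert_eq!(parse_table_ref("t"), Some(TableRef::Plain("t".into())));
    assert_eq!(
        parse_table_ref("t version as of 7"),
        Some(TableRef::VersionAsOf { table: "t".into(), version: 7 })
    );
    // A non-numeric version is rejected.
    assert_eq!(parse_table_ref("t version as of x"), None);
}
```

The `VersionAsOf` variant is what would then be planned into a dedicated logical node (effectively "a different table", as noted earlier in the thread).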