Remove extension point for PlanOptimizersFactory#13333
Remove extension point for PlanOptimizersFactory#13333sopel39 merged 1 commit intotrinodb:masterfrom
Conversation
3832c7e to
bb070a2
Compare
|
@sopel39 @raunaqmorarka Hi. What is the motivation for this change? Whit this factory in place it was possible to inject a custom query optimization strategy without affecting Trino's core. |
In our experience, maintaining optimizer extensions in this way required frequent changes to the extensions and required significant extra efforts to achieve similar level of testing for the extensions as the optimization rules in the OSS repo. |
|
Hi @raunaqmorarka. Thank you for the clarification. My question appears because we consider adding significant extensions to the optimizer (cost-based planning with the Cascades, better Exchange planning, etc), and executor (native execution in C++). Let's take an optimizer as an example. Many systems provide official hooks to inject certain rules at certain optimization phases. Examples are Apache Spark, Apache Drill, Dremio, and connector API in Trino/Presto. These extension points are usually very limiting. For example, Trino's approach with individual operator push-down into connectors may easily lead to bad plans because it is performed after the join order planning: good join order before pushdown might be actually dramatically inefficient from the connector standpoint (and vice versa). A more generic approach could be to combine join order planning with push-down planning, but this would require completely different concepts (e.g., fully-fledged MEMO data structure, different cost-model, etc), and revision of the optimizer program sequence. None "hooks" would be sufficient for this. Therefore, a balanced approach might be to provide coarse-grained integration points, that are both (a) easy to maintain, and (b) flexible enough from the implementation's perspective. This way, the Trino community doesn't need to spend time on maintaining compatibility for non-public APIs, while still allowing external companies to experiment with the product. In this sense, we were very happy having this The idea of bringing as much value as possible to the OSS Trino makes sense. However, some features might be pretty unstable, or hard to maintain, so they bring more harm than good. For example, imagine the implementation of Cascades-like planning - this is a very fundamental change that may require prolonged stabilization. Another example is Velox-based execution - Velox results from their marketing posts look very appealing, but Velox doesn't even have a stable version yet. In both examples, we, as a vendor, may prefer to keep such features private unless they are stable enough and the value is proven by real customers and use cases. Therefore, the more integration points are there, the more likely that some responsible external party will contribute new features to OSS. What do you think about this? Generally, I am trying to get a basic sense of how the Trino community views the proper coordination between OSS and vendors. Thank you, |
|
This extension point should have never been added to Trino, and it not the vision of Trino to have the optimizer extended by hooking directly into the core like this. The project is also planning and in some cases working on similar features to what you mention (obviously expect for C++ because that just a bad idea), and if think kind of core hook existed it would make this kind of development difficult or impossible. If you really want a hook like this, you can easily fork the project and add it back in, but the better thing IMO would be to work with the existing community directly on these features. |
|
Hi @dain, Thank you for the feedback. I agree that providing too coarse-grained extension points might be a non-goal for some projects. However, the pros/cons should be considered in the context of the specific component in question. The optimizer is a very good example. Presto/Trino lacks the fully-fledged cost-based optimization, which contributes to the complexity of the optimization program. However, even if the CBO is implemented, it will not make the optimizer optimal in all cases. In the ideal world, you would like to solve at least the following problems with CBO: join order planning, distribution planning, sorted operator planning, plugin operator pushdown, dynamic filters, materialized views rewrite, subplan reuse. Prectically, this will never work due to unacceptable planning time. Instead, one will still use multiple sequential CBO/heuristic planning phases. E.g., we can first plan joins with distributions, then collations. Or we can plan dynamic filters first, then get subplans. But every such combination will work well in one case, but produce bad plans in another. Some core combinations might be encoded with properties, but practically this overwhelms users pretty fast. Since Trino is an integration product, there could be very different users. Some work with data lakes, others focuses on OLTP, others may focus on logs/streaming queries/time series, others may work on hardware accelerated queries where very complicated pushdown is required, etc. Trino could be potentially used for all this cases because at the end of the day this is just a compute layer on top of the sources. But a single optimizer with a predefined optimization program and small hooks (like custom rule hooks in Spark and Dremio, or An example of such approach in action is Apache Calcite, which is successfully used for a wide variety of projects - streaming (Apache Flink), batch (Apache Hive), hardware/GPU acceleration, realtime analytics, etc. The main advantage is that it allows you to compose arbitrary optimization programs from the existing building blocks. The top-level interface is very simple and almost the same as in So I wonder why not to follow Apache Calcite's approach with Trino's optimizer? It is already composable enough, and the only thing that is missing (after this PR) is a very high level abstraction of optimization program that doesn't enforce any additional maintenance efforts on Trino community, but allow for a flexible product extensions for advanced users. Regards, |
Description
Related issues, pull requests, and links
Documentation
( ) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.
Release notes
( ) No release notes entries required.
( ) Release notes entries required with the following suggested text: