[native] Add hook for Velox plan validation. #23423
amitkdutta merged 1 commit into prestodb:master
@amitkdutta: Can you give some examples of such validation? It's quite hard to backtrack from a failure at this point to the Presto PlanNode that got translated. It's simpler to fail during translation, imo. @tdcmeehan and @BryanCutler are also working on fail-fast plan validation for Presto/Prestissimo. Let's decide on something together.
xiaoxmeng
left a comment
@amitkdutta LGTM % nits. Thanks!
virtual void enableWorkerStatsReporting();

/// Invoked to get a planValidator instance. The validator will be used to
/// validate a velox planfragment in TaskResoruce.

std::shared_ptr<facebook::presto::PrestoToVeloxPlanValidator>
PrestoServer::getPlanValidator() {
  auto defaultPlanValidator =
static auto defaultPlanValidator =
registerStatsCounters();
}

std::shared_ptr<facebook::presto::PrestoToVeloxPlanValidator>
We can just return a raw pointer and let the Presto server hold the reference.
velox::memory::MemoryPool* const pool_;

TaskManager& taskManager_;
std::shared_ptr<facebook::presto::PrestoToVeloxPlanValidator> planValidator_;
facebook::presto::PrestoToVeloxPlanValidator* const planValidator_;
And move const member first.
class PrestoToVeloxPlanValidator {
 public:
  virtual bool validatePlanFragment(const velox::core::PlanFragment& fragment);
  virtual ~PrestoToVeloxPlanValidator() {}
virtual ~PrestoToVeloxPlanValidator() = default;
@aditi-pandit Thanks for the review. Indeed, it's hard to walk back to the plan fragment. The purpose of this hook is to allow such a walk (even an inefficient one) if necessary. A good example is disabling Nested Loop Join or timestamp with timezone. Key points are:
These use cases might sound custom (and indeed they are). However, they serve a purpose when a feature is not fully ready but part of it can be used (or not used) based on the generated plan fragment. Having a hook allows such manipulation, hence this PR. The validation callback here always returns true. If we find something that can be generally used as validation, we can add it here. This is similar to the spilling directory hook, which allows us to override and provide a custom implementation that dynamically (and periodically) gets the spill path. We will definitely work with @tdcmeehan on fail-fast validation, and I assume that will be more generic. If needed, we can also remove this default validator once the general one is ready. But I assume the general one will do such validation during plan conversion. Having a hook after plan conversion gives some flexibility (and deployment speed) for partial features.
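To make the hook described above concrete, here is a minimal sketch of a custom validator that rejects fragments containing a nested loop join. `PlanNode` and `PlanFragment` below are simplified stand-ins written for this example, not the real `velox::core` types, and `NoNestedLoopJoinValidator`, `makeNode`, and the `"NestedLoopJoin"` name string are hypothetical. Only the `validatePlanFragment` signature and the always-true default mirror what this PR describes.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Simplified stand-ins for the Velox plan types (assumption: the real types
// carry far more state; only a name and child links are kept here).
struct PlanNode {
  std::string name;
  std::vector<std::shared_ptr<const PlanNode>> sources;
};

struct PlanFragment {
  std::shared_ptr<const PlanNode> planNode;
};

// Default hook, mirroring the PR: accept every fragment.
class PrestoToVeloxPlanValidator {
 public:
  virtual ~PrestoToVeloxPlanValidator() = default;

  virtual bool validatePlanFragment(const PlanFragment& /*fragment*/) {
    return true;
  }
};

// Hypothetical override: reject any fragment containing a nested loop join.
class NoNestedLoopJoinValidator : public PrestoToVeloxPlanValidator {
 public:
  bool validatePlanFragment(const PlanFragment& fragment) override {
    assert(fragment.planNode != nullptr);
    return !containsNode(*fragment.planNode, "NestedLoopJoin");
  }

 private:
  // Depth-first walk over the plan tree looking for a node with this name.
  static bool containsNode(const PlanNode& node, const std::string& name) {
    if (node.name == name) {
      return true;
    }
    for (const auto& source : node.sources) {
      if (source && containsNode(*source, name)) {
        return true;
      }
    }
    return false;
  }
};

// Hypothetical helper for building small plan trees in examples and tests.
std::shared_ptr<const PlanNode> makeNode(
    std::string name,
    std::vector<std::shared_ptr<const PlanNode>> sources = {}) {
  return std::make_shared<const PlanNode>(
      PlanNode{std::move(name), std::move(sources)});
}
```

A deployment that needs a temporary block would return its own subclass from the server hook, while the default keeps accepting everything.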
Can't this be done as a plan checker? I think this PR has two problems. The first is that it's too late: resources have already been allocated by the time this check occurs. The second is, as @aditi-pandit pointed out, that it makes it hard to work backwards to the plan. If this can wait, we're working on a plan checker now. It uses a plugin system, so Meta can write whatever code they want and place it behind a plugin. We can work on getting this out in a reasonable time frame, or we can work together to get it out even quicker. If it can't wait, then I propose we let this PR pass, but once we introduce the plan checker we'll deprecate this and eventually delete it. I consider this two-phased solution, while not ideal, to be better than introducing Meta-specific requirements directly into open source code. WDYT?
@tdcmeehan This change is not necessarily Meta-specific though. We have built hooks in PrestoServer to let customers customize different aspects of the server. This one just adds a hook after plan conversion in the worker. We can also add a config in the worker to completely disable the hook, and Meta (or others) can use it if any tricky scenario arises. I understand it's not optimal, and I also see that at some point it will be deleted when the Velox eval becomes mature.
@amitkdutta that's all understood, but we also need to avoid doing things two different ways. We're working on a solution that doesn't have the deficiencies I listed, so I expect that when we have a plan checker, this can be deleted, not just disabled.
/// Invoked to get a planValidator instance. The validator will be used to
/// validate a velox planfragment by TaskResoruce.
virtual std::shared_ptr<facebook::presto::PrestoToVeloxPlanValidator>
virtual PrestoToVeloxPlanValidator* getPlanValidator()
Drop facebook::presto:: as we are already in this namespace?
// Executor for spilling.
std::shared_ptr<folly::CPUThreadPoolExecutor> spillerExecutor_;

std::shared_ptr<facebook::presto::PrestoToVeloxPlanValidator> planValidator_;
We probably need two APIs: one to init planValidator_ and the other to get the plan validator? The latter returns a raw pointer.
@tdcmeehan Of course, we should not keep duplicated items. We can remove it once we merge PlanChecker. It would be great if you could share some details about it too; we can go through it and also discuss it in our bi-weekly meeting.
velox::memory::MemoryPool* const pool_;

TaskManager& taskManager_;
facebook::presto::PrestoToVeloxPlanValidator* planValidator_;
PrestoToVeloxPlanValidator* const planValidator_;
Put const members first. Drop facebook::presto::
@tdcmeehan do you have a timeline for the plugin-based solution to be ready? We all agree doing this at query planning stage on the coordinator would be ideal, but depending on how long this would take to be available, it may make sense to have a quick way to toggle on/off traffic in the interim, even if suboptimal.
@tdcmeehan does the plan checker allow customizing or overriding some of the check logic that is specific to a customer like Meta without changing OSS code? Thanks!
xiaoxmeng
left a comment
@amitkdutta thanks for the update!
}

void PrestoServer::initPrestoToVeloxPlanValidator() {
  planValidator_ = std::make_shared<PrestoToVeloxPlanValidator>();
VELOX_CHECK_NULL(planValidator_);
planValidator_ = ...
#include "presto_cpp/main/types/PrestoToVeloxPlanValidator.h"

namespace facebook::presto {
nit: not sure about Presto, but in Velox there is always a blank line after namespace definition
#include "velox/core/PlanFragment.h"

namespace facebook::presto {
class PrestoToVeloxPlanValidator {
usually useful to add some documentation here on what this class is intended for, and how it should be used.
virtual void enableWorkerStatsReporting();

/// Invoked to initialize Presto to Velox plan validator.
virtual void initPrestoToVeloxPlanValidator();
better to turn this into a factory function so users don't need to learn that it actually has side-effects (it sets the class member, can't be called twice, etc).
virtual std::shared_ptr<VeloxPlanValidator> makePlanValidator();
and then callers can use this with more flexibility:
validator_ = makePlanValidator();
or
TaskResource(makePlanValidator(), ...);
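The factory shape suggested above could look roughly like the sketch below. `Server`, `TaskResource`, `StrictServer`, `RejectAllValidator`, and the empty `PlanFragment` are hypothetical stand-ins written for illustration, and `VeloxPlanValidator` uses the shorter name proposed in this review. The point is only that `makePlanValidator()` returns a new object and leaves member assignment to the caller.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Stand-in for velox::core::PlanFragment (assumption: contents are irrelevant
// to the ownership pattern shown here).
struct PlanFragment {};

class VeloxPlanValidator {
 public:
  virtual ~VeloxPlanValidator() = default;
  virtual bool validatePlanFragment(const PlanFragment&) { return true; }
};

using VeloxPlanValidatorPtr = std::shared_ptr<VeloxPlanValidator>;

// Hypothetical consumer that shares ownership of the validator it is given.
class TaskResource {
 public:
  explicit TaskResource(VeloxPlanValidatorPtr validator)
      : validator_(std::move(validator)) {
    assert(validator_ != nullptr);
  }

  bool accept(const PlanFragment& fragment) {
    return validator_->validatePlanFragment(fragment);
  }

 private:
  const VeloxPlanValidatorPtr validator_;
};

// Hypothetical server: the factory has no side effects, so it can be called
// any number of times and subclasses only override object creation.
class Server {
 public:
  virtual ~Server() = default;

  virtual VeloxPlanValidatorPtr makePlanValidator() {
    return std::make_shared<VeloxPlanValidator>();
  }

  TaskResource makeTaskResource() {
    return TaskResource(makePlanValidator());
  }
};

class RejectAllValidator : public VeloxPlanValidator {
 public:
  bool validatePlanFragment(const PlanFragment&) override { return false; }
};

class StrictServer : public Server {
 public:
  VeloxPlanValidatorPtr makePlanValidator() override {
    return std::make_shared<RejectAllValidator>();
  }
};
```

Compared with an `init...()` method, nothing here depends on call order, so a guard against double initialization is not needed in this shape.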
Could also drop the "PrestoTo" from the name since this, in effect, only validates a Velox query fragment.
  velox::memory::MemoryPool* pool,
- folly::Executor* httpSrvCpuExecutor)
+ folly::Executor* httpSrvCpuExecutor,
+ PrestoToVeloxPlanValidator* planValidator,
usually a good idea to pass the shared_ptr so that lifecycle and ownership are more explicitly defined.
(or unique_ptr, depending on how you organize things).
folly::Executor* const httpSrvCpuExecutor_;
velox::memory::MemoryPool* const pool_;
PrestoToVeloxPlanValidator* const planValidator_;
VeloxPlanValidatorPtr planValidator_;
(using the aliases we usually create for shared_ptr in Velox)
taskManager_ = std::make_unique<TaskManager>(
    driverExecutor_.get(), httpSrvCpuExecutor_.get(), nullptr);

auto validator =
to really validate it works, you could create a mock validator that always throws, and validate that the code actually throws that exception.
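A sketch of the kind of test suggested above, assuming a hypothetical `createTask` entry point that consults the validator before doing any work. `PlanFragment`, `createTask`, `createTaskError`, and `ThrowingValidator` are stand-ins invented for this example, not the actual TaskManager API, and a real test would presumably use the project's test framework rather than a hand-rolled helper.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Stand-in for velox::core::PlanFragment (assumption).
struct PlanFragment {};

class PrestoToVeloxPlanValidator {
 public:
  virtual ~PrestoToVeloxPlanValidator() = default;
  virtual bool validatePlanFragment(const PlanFragment&) { return true; }
};

// Mock that fails loudly so a test can prove the hook is really invoked.
class ThrowingValidator : public PrestoToVeloxPlanValidator {
 public:
  bool validatePlanFragment(const PlanFragment&) override {
    throw std::runtime_error("plan rejected by ThrowingValidator");
  }
};

// Hypothetical code under test: runs the validator, then "creates" the task.
bool createTask(
    PrestoToVeloxPlanValidator* validator,
    const PlanFragment& fragment) {
  assert(validator != nullptr);
  if (!validator->validatePlanFragment(fragment)) {
    throw std::invalid_argument("plan fragment failed validation");
  }
  return true;
}

// Returns the exception message seen by the caller, or an empty string if
// nothing was thrown.
std::string createTaskError(PrestoToVeloxPlanValidator* validator) {
  try {
    createTask(validator, PlanFragment{});
  } catch (const std::exception& e) {
    return e.what();
  }
  return "";
}
```

Checking the exact message distinguishes "the mock was called" from "something else on the path happened to throw".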
namespace facebook::presto {
class PrestoToVeloxPlanValidator {
 public:
  virtual bool validatePlanFragment(const velox::core::PlanFragment& fragment);
perhaps either leave as pure virtual, or leave the base implementation here in the header so you don't need to define a .cpp that doesn't really have anything?
// Executor for spilling.
std::shared_ptr<folly::CPUThreadPoolExecutor> spillerExecutor_;

std::shared_ptr<PrestoToVeloxPlanValidator> planValidator_;
Does this need to be a shared_ptr ? Seems like the PrestoServer owns this object and passes a pointer to the underlying object elsewhere. unique_ptr should be a better fit imo.
I guess you can only do that if you create a new validator per query. Should be pretty much the same, but this isn't how the code is structured today.
// Executor for spilling.
std::shared_ptr<folly::CPUThreadPoolExecutor> spillerExecutor_;

std::shared_ptr<PrestoToVeloxPlanValidator> planValidator_;
Nit : We can remove the "PrestoTo" in the name of this class. It can simply be VeloxPlanValidator
namespace facebook::presto {
bool PrestoToVeloxPlanValidator::validatePlanFragment(
    const velox::core::PlanFragment& fragment) {
  return true;
If there are multiple validations like 1. Don't allow TIMESTAMPTZ 2. Don't allow NLJ 3. Allow only particular combination of 1 and 2, then there would be a config for each validation and all checks in the same function. Did you consider allowing multiple validators instead ?
Also, given that I think this class would need to be aware of SystemConfig.
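One way to read the "multiple validators" suggestion is a composite that runs a list of small, single-purpose checks. This is only a sketch: `PlanFragment` here is a stand-in holding two boolean flags, and `PlanValidator`, `CompositePlanValidator`, `NoTimestampTzValidator`, and `NoNestedLoopJoinValidator` are hypothetical names, not part of this PR.

```cpp
#include <algorithm>
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// Stand-in fragment (assumption): two flags replace a walk over the real plan.
struct PlanFragment {
  bool usesTimestampWithTimeZone;
  bool usesNestedLoopJoin;
};

class PlanValidator {
 public:
  virtual ~PlanValidator() = default;
  virtual bool validatePlanFragment(const PlanFragment& fragment) = 0;
};

class NoTimestampTzValidator : public PlanValidator {
 public:
  bool validatePlanFragment(const PlanFragment& fragment) override {
    return !fragment.usesTimestampWithTimeZone;
  }
};

class NoNestedLoopJoinValidator : public PlanValidator {
 public:
  bool validatePlanFragment(const PlanFragment& fragment) override {
    return !fragment.usesNestedLoopJoin;
  }
};

// Accepts a fragment only if every registered validator accepts it. With no
// validators registered it accepts everything, like the default in this PR.
class CompositePlanValidator : public PlanValidator {
 public:
  void add(std::shared_ptr<PlanValidator> validator) {
    assert(validator != nullptr);
    validators_.push_back(std::move(validator));
  }

  bool validatePlanFragment(const PlanFragment& fragment) override {
    return std::all_of(
        validators_.begin(),
        validators_.end(),
        [&](const std::shared_ptr<PlanValidator>& validator) {
          return validator->validatePlanFragment(fragment);
        });
  }

 private:
  std::vector<std::shared_ptr<PlanValidator>> validators_;
};
```

With this shape, a per-check config flag would decide whether a validator is added to the composite at startup, instead of every check reading configuration inside one function.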
@amitkdutta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@xiaoxmeng we will add a plan checker SPI, and a default implementation which delegates to the Presto sidecar, to validate that the generated plan may successfully be executed with Prestissimo and Velox (or at least that the plan can be translated successfully). This SPI can also take additional implementations, which may have Meta-specific logic, the same as our other SPI interfaces (connectors, event listeners, authenticators, etc). @pedroerp @amitkdutta I agree that this can go out temporarily. Long term, I don't think these sorts of checks should go into the worker, and I just wanted to give everyone the heads up that when this solution is ready, this interface might get deprecated, so whatever is added here may need to get migrated. This design was briefly mentioned in RFC-0003, but we're working on a more detailed RFC specifically for this, and are targeting Q3 to get the code out. tl;dr I'm fine for this code to go out now (pending review feedback etc)
@amitkdutta: It would be good to add the NLJ check as a plan validation to demonstrate the API in a real use case. The PR would definitely be more solid after that. Right now it's mostly doing no-op work.
+1. This validation is meant to be temporary anyhow, and we will relax the restrictions as bugs are fixed and more features are added to the native stack. We just need to have a killswitch we can use to disable traffic as we roll more of these things out to production.
NLJ is fixed in Velox already, but potentially we could use this to disable timestamp with timezone comparisons which are still broken |
@tdcmeehan Totally agree with you. We can remove this callback once we have the PlanChecker.
I picked up NLJ from a previous remark. Agreed, it's fixed in Velox already. Do we really want to disable timestamp with timezone here now instead of in the PlanChecker in the coordinator? Are we planning to refine that logic in any way? Even if we are, the PlanChecker is still a better place for that logic. That validation has to look at all PlanNodes and function signatures, and have a deeper understanding of what individual PlanNodes do. That's a better fit in the coordinator. And since we already have that code, I feel we shouldn't refactor it for the sake of having an example. The NLJ situation was a better candidate for the Prestissimo PlanValidator. It would be okay to add it as a test and not in the main code path as a compromise. @pedroerp, @amitkdutta, @tdcmeehan: wdyt?
The problem with the timestamp with timezone blocker (as I understand it) is that it is too coarse; it blocks any usage of the custom type, whereas the problem really only arises when you need to rely on its comparisons. If this is something that could be done in the coordinator, even better. Although even in that case, this new API provides us more flexibility to quickly iterate on any other query shapes that may need to be blocked.
I think we should create an issue to improve this checker; there's nothing preventing us from refining it. And I don't think there are features or capabilities in C++ that are superior to the tools in Java that we would use to check for this. I'm curious why this approach is considered quicker? Is it because it allows us to write the check in C++?
I suppose it's not only about C++; it also allows us to do the check based on a Velox plan (after the conversion from the Presto plan). Presto Java, in theory, doesn't know the details of that conversion.
@pedroerp do you have an example of what information the Velox plan would contain that wouldn't be available in the Presto plan? My thinking was it can't contain any essential additional information that isn't known upfront, because we always start with a Presto plan.
Good question. I suppose we would have most of the same information in both places, minus details of the plan translation/mapping from Java to C++, which may not be available in Java. Overall, the idea here is less about re-implementing the blocks we already have (which are trivial and could be implemented somewhere else), and more about having an easy-to-use and convenient kill switch that lets us disable a very specific slice of traffic. These things are found in production as we roll out additional traffic, so it's very hard to predict what exactly will be needed, but having access to the exact Velox plan being executed gives you the most flexibility.
@pedroerp thanks for the clarification. But I'm still not sure what exactly makes this check easy to use and convenient? Is the ease of use due to it being developed as a hook, which means platforms can develop custom implementations internally? i.e. the idea is you can write a very quick code change to your internal repository, without having to write a patch in open source where there is a desire for understanding and consensus? Or is the ease of use due to it being a Velox plan, instead of a Presto plan?
Description
This PR adds a hook to further check the plan converted from Presto to Velox.
Motivation and Context
After a Presto plan is converted to a Velox plan, it is often necessary to disable certain features based on the generated plan. Even when the plan is converted properly, arbitrary limitations in Velox (in either correctness or performance) may make us want to fail the query. Adding a hook (similar to the one for the spilling directory) allows us to customize query execution and make it configurable based on the current capabilities of Velox.
Impact
None
Test Plan
Ran worker and coordinator, printed logs inside a dummy validator, and observed that they were printed.