Skip to content

[native] Add hook for Velox plan validation.#23423

Merged
amitkdutta merged 1 commit intoprestodb:masterfrom
amitkdutta:plan_validator
Aug 13, 2024
Merged

[native] Add hook for Velox plan validation.#23423
amitkdutta merged 1 commit intoprestodb:masterfrom
amitkdutta:plan_validator

Conversation

@amitkdutta
Copy link
Copy Markdown
Contributor

@amitkdutta amitkdutta commented Aug 12, 2024

Description

This PR adds a hook to further check presto to velox converted plan.

Motivation and Context

After Presto plan is converted to Velox plan, it is often required to disable certain features based on the generated plan. The reasoning is even though the plan is converted properly, due to arbitary limitation in Velox (either correctness or performacne), we may want to fail the query execution. Adding a hook (similar to spilling directory), allows us to customize query execution and make it configurable based on current capabilities of Velox.

Impact

None

Test Plan

Ran worker and cooridnator, printed logs inside dummy validator and observed it was printed.

== NO RELEASE NOTE ==

@amitkdutta amitkdutta requested a review from a team as a code owner August 12, 2024 06:14
@amitkdutta amitkdutta marked this pull request as draft August 12, 2024 06:14
@amitkdutta amitkdutta force-pushed the plan_validator branch 3 times, most recently from 1fa024d to dd14b1f Compare August 12, 2024 06:16
@amitkdutta amitkdutta changed the title [native] Add hook for Velox plan validation and provide a default val… [native] Add hook for Velox plan validation. Aug 12, 2024
@amitkdutta amitkdutta requested a review from xiaoxmeng August 12, 2024 06:22
@amitkdutta amitkdutta marked this pull request as ready for review August 12, 2024 06:23
@amitkdutta amitkdutta marked this pull request as draft August 12, 2024 07:02
@amitkdutta amitkdutta force-pushed the plan_validator branch 3 times, most recently from 94c2811 to 28bc23d Compare August 12, 2024 07:14
@amitkdutta amitkdutta marked this pull request as ready for review August 12, 2024 07:18
@aditi-pandit
Copy link
Copy Markdown
Contributor

@amitkdutta : Can you give some examples of such validation ? Its quite hard to back-track from a failure at this point to the Presto PlanNode that got translated. Its simpler to fail during translation imo.

@tdcmeehan and @BryanCutler are also working of fail fast plan validation for Presto/Prestissimo. Lets decide on something together.

Copy link
Copy Markdown
Contributor

@xiaoxmeng xiaoxmeng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@amitkdutta LGTM % nits. Thanks!

virtual void enableWorkerStatsReporting();

/// Invoked to get a planValidator instance. The validator will be used to
/// validate a velox planfragment in TaskResoruce.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: s/in/by/


std::shared_ptr<facebook::presto::PrestoToVeloxPlanValidator>
PrestoServer::getPlanValidator() {
auto defaultPlanValidator =
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

static auto defaultPlanValidator =

registerStatsCounters();
}

std::shared_ptr<facebook::presto::PrestoToVeloxPlanValidator>
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can just return a raw pointer and let presto server to hold the reference.

velox::memory::MemoryPool* const pool_;

TaskManager& taskManager_;
std::shared_ptr<facebook::presto::PrestoToVeloxPlanValidator> planValidator_;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

facebook::presto::PrestoToVeloxPlanValidator* const planValidator_;

And move const member first.

class PrestoToVeloxPlanValidator {
public:
virtual bool validatePlanFragment(const velox::core::PlanFragment& fragment);
virtual ~PrestoToVeloxPlanValidator() {}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

virtual ~PrestoToVeloxPlanValidator() = default;

@amitkdutta
Copy link
Copy Markdown
Contributor Author

@amitkdutta : Can you give some examples of such validation ? Its quite hard to back-track from a failure at this point to the Presto PlanNode that got translated. Its simpler to fail during translation imo.

@tdcmeehan and @BryanCutler are also working of fail fast plan validation for Presto/Prestissimo. Lets decide on something together.

@aditi-pandit Thanks for the review. Indeed its hard to walk back to the plan fragment. The purpose of this hook is to allow such walk (even ineffieicnet) if necessary. A good example of it is Nested loop Join disablement or disabling timestamp with timezone. Key points are:

  • Sometimes we observe correctness issues with certain features (e.g. NLJ) while the system is already in production. At that time we want to disable (or enable) something based on plan fragment on the fly with custom configurations. This kind of usage may not be generally used (or can be too much Meta specific), hence we want to handle the hook based on runtime requirements instead of introducing random plan conversion errors.
  • There are cases where time stamp with time zone are only used for casting and not for general equality/ordering etc. In such use cases, we want to have a validator which block queries if it finds time stamp with time zone for ordering purpose, but allow for other cases.

These use cases might sound custom (and indeed they are). However they serve a purpose where a feature is not fully ready, but part of it can be used (or not used) based on generated plan fragment. Having a hook allows to do such manipulation, hence this PR.

The validation call back here always returns true here. If we find somehting that can be genrally used as validation, we can add here.

This is similar to what has been done with spilling directory fetching hook, which allows us to override and provide a custom implementation that dynamically (and periodically) get spill path.

We will definitely work with @tdcmeehan for fail fast validation, and I assume that will be more generic. If needed, we can remove this default validator also once the general one is more valid. But I assume the general one will do such validation during plan conversion code. Having a hook after plan conversion gives some flexibility (and deployment speed) for partial features.

@tdcmeehan
Copy link
Copy Markdown
Contributor

There are cases where time stamp with time zone are only used for casting and not for general equality/ordering etc. In such use cases, we want to have a validator which block queries if it finds time stamp with time zone for ordering purpose, but allow for other cases.

Can't this be done as a plan checker?

I think this PR has two problems. The first is that it's too late--resources have already been allocated by the time this check occurs. The second is, as @aditi-pandit pointed out, it makes it hard to work backwards to the plan.

If this can wait, we're working on a plan checker now. It uses a plugin system, so Meta can write whatever code they want and place it behind a plugin. We can work on getting this out in a reasonable time frame, or we can work together to get it out even quicker. If it can't, then I propose we can let this PR pass, but once we introduce the plan checker we'll deprecate this and eventually delete. I consider this two phased solution, while not ideal, to be better than introducing Meta-specific requirements directly into open source code. WDYT?

@amitkdutta
Copy link
Copy Markdown
Contributor Author

There are cases where time stamp with time zone are only used for casting and not for general equality/ordering etc. In such use cases, we want to have a validator which block queries if it finds time stamp with time zone for ordering purpose, but allow for other cases.

Can't this be done as a plan checker?

I think this PR has two problems. The first is that it's too late--resources have already been allocated by the time this check occurs. The second is, as @aditi-pandit pointed out, it makes it hard to work backwards to the plan.

If this can wait, we're working on a plan checker now. It uses a plugin system, so Meta can write whatever code they want and place it behind a plugin. We can work on getting this out in a reasonable time frame, or we can work together to get it out even quicker. If it can't, then I propose we can let this PR pass, but once we introduce the plan checker we'll deprecate this and eventually delete. I consider this two phased solution, while not ideal, to be better than introducing Meta-specific requirements directly into open source code. WDYT?

@tdcmeehan This change is not necessarily Meta specific though. We have built hooks in PrestoServer to allow customers customize different aspects of the server. This one just adds a hook after plan conversion in the worker. We can also add a config in worker to completely disable the hook, and Meta (or others) can utilize it if any tricky scenario arises. I understand its not optimal, and I also see at some point it will be deleted when the Velox eval becomes mature.

@tdcmeehan
Copy link
Copy Markdown
Contributor

@amitkdutta that's all understood, but we also need to avoid doing things two different ways. We're working on a solution that doesn't have the deficiencies I listed, so I expect that when we have a plan checker, this can be deleted, not just disabled.


/// Invoked to get a planValidator instance. The validator will be used to
/// validate a velox planfragment by TaskResoruce.
virtual std::shared_ptr<facebook::presto::PrestoToVeloxPlanValidator>
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

virtual PrestoToVeloxPlanValidator*  getPlanValidator()

Drop facebook::presto:: as we already in this namespace?

// Executor for spilling.
std::shared_ptr<folly::CPUThreadPoolExecutor> spillerExecutor_;

std::shared_ptr<facebook::presto::PrestoToVeloxPlanValidator> planValidator_;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably need two APIs one is to init planValidator_ and the other is to get plan validator? The latter returns a raw pointer.

@amitkdutta
Copy link
Copy Markdown
Contributor Author

@amitkdutta that's all understood, but we also need to avoid doing things two different ways. We're working on a solution that doesn't have the deficiencies I listed, so I expect that when we have a plan checker, this can be deleted, not just disabled.

@tdcmeehan Of course, we should not keep duplicated items. We can remove it once we merge PlanChecker. Will be great if you can share some details about it also, we can go through it and also discuss in the bi-weekly meeting we have.

velox::memory::MemoryPool* const pool_;

TaskManager& taskManager_;
facebook::presto::PrestoToVeloxPlanValidator* planValidator_;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PrestoToVeloxPlanValidator* const planValidator_;

Put const members first. Drop facebook::presto::

@pedroerp
Copy link
Copy Markdown
Contributor

@tdcmeehan do you have a timeline for the plugin-based solution to be ready? We all agree doing this at query planning stage on the coordinator would be ideal, but depending on how long this would take to be available, it may make sense to have a quick way to toggle on/off traffic in the interim, even if suboptimal.

@xiaoxmeng
Copy link
Copy Markdown
Contributor

@amitkdutta that's all understood, but we also need to avoid doing things two different ways. We're working on a solution that doesn't have the deficiencies I listed, so I expect that when we have a plan checker, this can be deleted, not just disabled.

@tdcmeehan does the plan checker allow to customize or override some of the check logic which is specific to a customer like Meta without changing OSS code? Thanks!

xiaoxmeng
xiaoxmeng previously approved these changes Aug 13, 2024
Copy link
Copy Markdown
Contributor

@xiaoxmeng xiaoxmeng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@amitkdutta thanks for the update!

}

void PrestoServer::initPrestoToVeloxPlanValidator() {
planValidator_ = std::make_shared<PrestoToVeloxPlanValidator>();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

VELOX_CHECK_NULL(planValidator_);
planValidator_ = ...


#include "presto_cpp/main/types/PrestoToVeloxPlanValidator.h"

namespace facebook::presto {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: not sure about Presto, but in Velox there is always a blank line after namespace definition

#include "velox/core/PlanFragment.h"

namespace facebook::presto {
class PrestoToVeloxPlanValidator {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

usually useful to add some documentation here on what this class is intended for, and how it should be used.

virtual void enableWorkerStatsReporting();

/// Invoked to initialize Presto to Velox plan validator.
virtual void initPrestoToVeloxPlanValidator();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better to turn this into a factory function so users don't need to learn that it actually has side-effects (it sets the class member, can't be called twice, etc).

virtual std::shared_ptr<VeloxPlanValidator> makePlanValidator();

and then callers can use this with more flexibility:

validator_ = makePlanValidator();
or
TaskResource(makePlanValidator(), ...);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could also drop the "PrestoTo" from the name since this, in effect, only validates a Velox query fragment.

velox::memory::MemoryPool* pool,
folly::Executor* httpSrvCpuExecutor)
folly::Executor* httpSrvCpuExecutor,
PrestoToVeloxPlanValidator* planValidator,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

usually a good idea to pass the shared_ptr so that lifecycle and ownership are more explicitly defined.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(or unique_ptr, depending on how you organize things).


folly::Executor* const httpSrvCpuExecutor_;
velox::memory::MemoryPool* const pool_;
PrestoToVeloxPlanValidator* const planValidator_;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

VeloxPlanValidatorPtr planValidator_;

(using the aliases we usually create for shared_ptr in Velox)

taskManager_ = std::make_unique<TaskManager>(
driverExecutor_.get(), httpSrvCpuExecutor_.get(), nullptr);

auto validator =
Copy link
Copy Markdown
Contributor

@pedroerp pedroerp Aug 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to really validate it works, you could create a mock validator that always throws, and validate that the code actually throws that exception.

namespace facebook::presto {
class PrestoToVeloxPlanValidator {
public:
virtual bool validatePlanFragment(const velox::core::PlanFragment& fragment);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps either leave as pure virtual, or leave the base implementation here in the header so you don't need to define a .cpp that doesn't really have anything?

// Executor for spilling.
std::shared_ptr<folly::CPUThreadPoolExecutor> spillerExecutor_;

std::shared_ptr<PrestoToVeloxPlanValidator> planValidator_;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be a shared_ptr ? Seems like the PrestoServer owns this object and passes a pointer to the underlying object elsewhere. unique_ptr should be a better fit imo.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess you can only do that if you create a new validator per query. Should be pretty much the same, but this isn't how the code is structured today.

// Executor for spilling.
std::shared_ptr<folly::CPUThreadPoolExecutor> spillerExecutor_;

std::shared_ptr<PrestoToVeloxPlanValidator> planValidator_;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit : We can remove the "PrestoTo" in the name of this class. It can simply be VeloxPlanValidator

namespace facebook::presto {
bool PrestoToVeloxPlanValidator::validatePlanFragment(
const velox::core::PlanFragment& fragment) {
return true;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there are multiple validations like 1. Don't allow TIMESTAMPTZ 2. Don't allow NLJ 3. Allow only particular combination of 1 and 2, then there would be a config for each validation and all checks in the same function. Did you consider allowing multiple validators instead ?

Also, given that I think this class would need to be aware of SystemConfig.

@facebook-github-bot
Copy link
Copy Markdown
Collaborator

@amitkdutta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@tdcmeehan
Copy link
Copy Markdown
Contributor

tdcmeehan commented Aug 13, 2024

@xiaoxmeng we will add a plan checker SPI, and a default implementation which delegates to the Presto sidecar, to validate that the generated plan may successfully be executed with Prestissimo and Velox (or, that at least the plan can get translated successfully). This SPI can additionally take additional implementations, which may have Meta-specific logic, the same as our other SPI interfaces (connectors, event listeners, authenticators, etc).

@pedroerp @amitkdutta I agree that this can go out temporarily. Long term, I don't think these sorts of checks shouldn't go into the worker, and I just wanted to give everyone the heads up that when this solution is ready, this interface might get deprecated, so whatever is added here may need to get migrated. This design was briefly mentioned in RFC-0003, but we're working on a more detailed RFC specifically for this, and are targeting Q3 to get the code out.

tl;dr I'm fine for this code to go out now (pending review feedback etc)

@aditi-pandit
Copy link
Copy Markdown
Contributor

aditi-pandit commented Aug 13, 2024

@amitkdutta : Would be good to add the NLJ check as a plan validation to demonstrate the API in a real use-case.

The PR would definitely be more solid after that. Right now its doing more no-op work.

@pedroerp
Copy link
Copy Markdown
Contributor

@pedroerp @amitkdutta I agree that this can go out temporarily. Long term, I don't think these sorts of checks shouldn't go into the worker, and I just wanted to give everyone the heads up that when this solution is ready, this interface might get deprecated, so whatever is added here may need to get migrated.

+1. This validation is meant to be temporary anyhow, and we will relax the restrictions as bugs are fixed and more features are added to the native stack. We just need to have a killswitch we can use to disable traffic as we roll more of these things out to production.

@pedroerp
Copy link
Copy Markdown
Contributor

pedroerp commented Aug 13, 2024

@amitkdutta : Would be good to add the NLJ check as a plan validation to demonstrate the API in a real use-case.

NLJ is fixed in Velox already, but potentially we could use this to disable timestamp with timezone comparisons which are still broken

@amitkdutta
Copy link
Copy Markdown
Contributor Author

@tdcmeehan Totally agree with you. We can remove this callback once we have the Planchecker.
@aditi-pandit @pedroerp Let me address your comments in a follow up PR. I will add an example that fails queries with timestamp with timezone.

@amitkdutta amitkdutta merged commit 4b920ca into prestodb:master Aug 13, 2024
@aditi-pandit
Copy link
Copy Markdown
Contributor

aditi-pandit commented Aug 14, 2024

@amitkdutta : Would be good to add the NLJ check as a plan validation to demonstrate the API in a real use-case.

NLJ is fixed in Velox already, but potentially we could use this to disable timestamp with timezone comparisons which are still broken

I picked up NLJ from a previous remark. Agree its fixed in Velox already.

Do we really want to disable timestamp with timezone here now instead of the PlanChecker in the co-ordinator ? Are we planning to refine that logic in any way ? Even if we are, the PlanChecker is still a better place for that logic. That validation has to look at all PlanNodes, Function Signatures.. have deeper understanding of what individual planNodes do. Thats a better fit in the co-ordinator. And since we already have that code, I feel we shouldn't refactor it for the sake of having an example.

The NLJ situation was a better candidate for Prestissimo PlanValidator. It would be okay to add it as a test and not in the main code path as a compromise.

@pedroerp, @amitkdutta, @tdcmeehan : wdyt ?

@pedroerp
Copy link
Copy Markdown
Contributor

pedroerp commented Aug 16, 2024

The problem with the timestamp with timezone blocker (as I understand) is that it is too coarse; it blocks any usage of the custom type, where the problem is really only when you need rely on its comparisons. If this is something that could be done in the coordinator, even better. Although even in that case, this new API provides us more flexibility to quickly iterate on any other query shapes that may need to be blocked.

@tdcmeehan
Copy link
Copy Markdown
Contributor

I think we should create an issue to improve this checker, there's nothing preventing us from refining it. And I don't think there are features or capabilities in C++ that are superior to the tools in Java that we would use to check for this.

I'm curious why this approach is considered quicker? Is it because it allows us to write the check in C++?

@pedroerp
Copy link
Copy Markdown
Contributor

pedroerp commented Aug 16, 2024

I suppose not only in C++, but it allow us to do the check based on a Velox Plan (after the conversion from Presto plan). Presto Java, in theory, doesn't know the details of that conversion.

@tdcmeehan
Copy link
Copy Markdown
Contributor

@pedroerp do you have an example of what information the Velox plan would contain that wouldn't be available in the Presto plan? My thinking was it can't contain any essential additional information that isn't known upfront, because we always start with a Presto plan.

@pedroerp
Copy link
Copy Markdown
Contributor

Good question. I suppose we would have most of the same information in both places, minus details of the plan translation/mapping from java to cpp, which may not be available in Java.

Overall, the idea here is less about re-implementing the blocks we already have (which are trivial and could be implemented somewhere else), and more about having an easy-to-use and convenient kill switch device that let's us disable a very specific slice of traffic. These things are found in production as we roll out additional traffic, so it's very hard to predict what exactly will be needed, but having access to the exact Velox plan being executed gives you the most flexibility.

@tdcmeehan
Copy link
Copy Markdown
Contributor

@pedroerp thanks for the clarification. But I'm still not sure what exactly makes this check easy to use and convenient? Is the ease of use due to it being developed as a hook, which means platforms can develop custom implementations internally? i.e. the idea is you can write a very quick code change to your internal repository, without having to write a patch in open source where there is a desire for understanding and consensus? Or is the ease of use due to it being a Velox plan, instead of a Presto plan?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants