This repository was archived by the owner on May 9, 2024. It is now read-only.

Proper heterogen execution modes #652

Merged 3 commits into main from akroviak/heterogen_scheduling on Sep 26, 2023

Conversation

@akroviakov (Contributor) commented Aug 25, 2023:

This patch integrates the device multifragment policy (currently disabled) in its proper form for heterogeneous kernels, that is, depending on the query (a rough sketch follows the list):

  • CPU: kernel per fragment
  • GPU: kernel per N fragments
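
As a rough sketch of what these modes amount to (not the actual HDK code; the map name dispatch_modes is made up here, while the enum values appear in the diffs below):

    #include <map>

    enum class ExecutorDeviceType { CPU, GPU };
    enum class ExecutorDispatchMode { KernelPerFragment, MultifragmentKernel };

    // Hypothetical policy table: the CPU gets one kernel per fragment, while
    // the GPU batches N fragments into a single multi-fragment kernel.
    const std::map<ExecutorDeviceType, ExecutorDispatchMode> dispatch_modes{
        {ExecutorDeviceType::CPU, ExecutorDispatchMode::KernelPerFragment},
        {ExecutorDeviceType::GPU, ExecutorDispatchMode::MultifragmentKernel}};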

@akroviakov force-pushed the akroviak/heterogen_scheduling branch from 3e72fd7 to ad43824 on August 25, 2023 13:00
@@ -2790,66 +2790,109 @@ std::vector<std::unique_ptr<ExecutionKernel>> Executor::createHeterogeneousKerne

CHECK(!ra_exe_unit.input_descs.empty());

const bool use_multifrag_kernel = eo.allow_multifrag && is_agg;
Contributor:
This reads as if group-bys would not use a multi-fragment policy. Needs cleanup, I guess.

<< " for kernel per fragment execution path.";
throw CompilationRetryNewScanLimit(max_frag_size);
if (use_multifrag_kernel) {
LOG(INFO) << "use_multifrag_kernel=" << use_multifrag_kernel;
Contributor:
This is debug information. Same for others.
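
A minimal sketch of the suggested cleanup, assuming the glog-style VLOG macro that already appears elsewhere in this diff:

    // Before: always logged at INFO severity.
    //   LOG(INFO) << "use_multifrag_kernel=" << use_multifrag_kernel;
    // After: only emitted when verbose logging is enabled (e.g. --v=1).
    VLOG(1) << "use_multifrag_kernel=" << use_multifrag_kernel;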

@@ -100,6 +100,56 @@ class QueryFragmentDescriptor {
}
}

template <typename DISPATCH_FCN>
void assignFragsToMultiHeterogeneousDispatch(
Contributor:
Having a separate method is a bit ugly (that was the reason we didn't merge it a long time ago), but I think we can live with that for now.

@kurapov-peter (Contributor) left a comment:
Dispatching becomes hard to follow; I think it needs more abstraction/simplification.

const auto device_count = deviceCount(device_type);
const int device_count = config_->exec.heterogeneous.enable_heterogeneous_execution
? available_cpus + available_gpus.size()
: deviceCount(query_mem_descs.begin()->first);
Contributor:
This looks error-prone. Maybe at least add a check for the case when we expect only one query memory descriptor to be present in the map.

@akroviakov (Author):
The device_count differed before only by this flag as well, but I can add a check :). It was also passed around as a parameter with no real purpose. Do we even need it except for the >0 check?

Contributor:
You can remove it if you make sure a device is always available.
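
A minimal sketch of the check being discussed, reusing the identifiers from the diff above (the exact placement is an assumption):

    if (!config_->exec.heterogeneous.enable_heterogeneous_execution) {
      // Homogeneous path: expect exactly one query memory descriptor.
      CHECK_EQ(query_mem_descs.size(), size_t(1));
    }
    const int device_count =
        config_->exec.heterogeneous.enable_heterogeneous_execution
            ? available_cpus + available_gpus.size()
            : deviceCount(query_mem_descs.begin()->first);
    CHECK_GT(device_count, 0);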

// execution_kernels_per_device_.
const ExecutorDeviceType intended_dt = intended_dt_itr.first;
const ExecutorDeviceType actual_dt = intended_dt_itr.second->getDeviceType();
LOG(INFO) << "Query was inteded for " << intended_dt << ", will actually run on "
Contributor:
What are the situations in which we would want to reschedule some (not all) of the kernels to the CPU? In other words, do we need such logic at all? Maybe have a fallback policy instead?

<< actual_dt;

if (actual_dt == ExecutorDeviceType::GPU && eo.allow_multifrag &&
(!uses_lazy_fetch || is_agg)) {
Contributor:
Where does this condition come from? And what do these parameters have to do with multi-fragment dispatching?

@akroviakov (Author):
They were there previously, and when they are ignored we get validation issues on some tests, so I guess they are there for a reason.

if (actual_dt == ExecutorDeviceType::GPU && eo.allow_multifrag &&
(!uses_lazy_fetch || is_agg)) {
policy->devices_dispatch_modes.at(intended_dt) =
ExecutorDispatchMode::MultifragmentKernel;
Contributor:
I think we should not modify the policy during scheduling. The policy is decided by the cost model and is then used by the scheduling algorithm to distribute work.
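
A minimal sketch of the separation being suggested (the interface shape is an assumption; only getExecutionMode appears in the diff):

    // The cost model produces an immutable policy...
    class ExecutionPolicy {
     public:
      virtual ExecutorDispatchMode getExecutionMode(ExecutorDeviceType dt) const = 0;
      virtual ~ExecutionPolicy() = default;
    };

    // ...and the scheduler only reads it. Any fallback (e.g. lazy fetch
    // forcing KernelPerFragment) would be folded into the policy before
    // scheduling starts, rather than patched in mid-dispatch.
    void scheduleKernels(const ExecutionPolicy& policy);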

bool dispatch_finished = false;
while (!dispatch_finished) {
dispatch_finished = true;
for (const auto& device_type_itr : execution_kernels_per_device_)
Contributor:
What's the point of this inner loop when you have the very same outer one?

for (const auto& device_type_itr : execution_kernels_per_device_)
for (const auto& device_itr : device_type_itr.second) {
auto& kernel_idx =
execution_kernel_index[device_type_itr.first][device_itr.first];
Contributor:
Why populate the execution_kernel_index at all? Can't you just merge the two loops?

@akroviakov (Author):
Because we are inside the while (!dispatch_finished) {} loop, we may actually revisit for (const auto& device_itr : device_type_itr.second) {}. To avoid scheduling already-scheduled kernels, we keep track of the last scheduled kernel index and continue from that index on.
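
For reference, a condensed sketch of the loop structure being discussed (simplified from the diff above; dispatch() stands in for the actual dispatch callback):

    bool dispatch_finished = false;
    while (!dispatch_finished) {
      dispatch_finished = true;
      for (const auto& device_type_itr : execution_kernels_per_device_) {
        for (const auto& device_itr : device_type_itr.second) {
          // Per-(device type, device id) cursor: a revisit resumes where the
          // previous sweep stopped instead of re-dispatching kernels.
          auto& kernel_idx =
              execution_kernel_index[device_type_itr.first][device_itr.first];
          if (kernel_idx < device_itr.second.size()) {
            dispatch(device_itr.second[kernel_idx++]);
            dispatch_finished = false;  // something was dispatched; sweep again
          }
        }
      }
    }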

@akroviakov force-pushed the akroviak/heterogen_scheduling branch 4 times, most recently from b5e1067 to 4980599 on August 31, 2023 11:32
@akroviakov changed the title from "Proper multifrag" to "Proper heterogen execution modes" on Sep 1, 2023
query_mem_descs_owned.insert(std::make_pair(dt, std::move(query_mem_desc_owned)));
const ExecutorDeviceType compiled_for_dt{query_comp_desc_owned->getDeviceType()};
if (!query_comp_descs_owned.count(
compiled_for_dt)) { // Can a fallback to CPU generate different kernels?
Contributor:
Let's keep the latest compilation (after the fallback).

@akroviakov force-pushed the akroviak/heterogen_scheduling branch from 4980599 to 1784a5d on September 1, 2023 16:07
@kurapov-peter (Contributor) left a comment:
Seems OK, some minor comments.

@@ -19,7 +19,9 @@
namespace policy {

ProportionBasedExecutionPolicy::ProportionBasedExecutionPolicy(
std::map<ExecutorDeviceType, unsigned>&& propotion) {
std::map<ExecutorDeviceType, unsigned>&& propotion,
Contributor:
typo


auto device_types_for_query = getDeviceTypesForQuery(
ra_exe_unit, query_infos, co.device_type, max_groups_buffer_entry_guess, eo);
CHECK(device_types_for_query.size());
Contributor:
Suggested change:
-  CHECK(device_types_for_query.size());
+  CHECK_GT(device_types_for_query.size(), size_t(0));

const auto exe_policy =
getExecutionPolicy(is_agg, query_mem_descs_owned, ra_exe_unit, query_infos, eo);
const ExecutorDeviceType fallback_device{
exe_policy->hasDevice(co.device_type) ? co.device_type : ExecutorDeviceType::CPU};
Contributor:
Shouldn't this always be CPU?

@akroviakov (Author):
Right now main has it this way: it can return requested_device_type (i.e., co.device_type) as the fallback_device. Is current main wrong?

devices_count += get_available_gpus(data_mgr_).size();
}
}
CHECK(devices_count);
Contributor:
Suggested change:
-  CHECK(devices_count);
+  CHECK_GT(devices_count, size_t(0));

// modes.
const std::map<ExecutorDeviceType, ExecutorDispatchMode> devices_dispatch_modes_{
{ExecutorDeviceType::CPU, ExecutorDispatchMode::KernelPerFragment},
{ExecutorDeviceType::GPU, ExecutorDispatchMode::KernelPerFragment}};
Contributor:
Since these are set in the constructors, it seems there is no reason to have the default values here (unless we use default constructors somewhere, which is likely incorrect). The comment together with const is confusing.
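
A minimal sketch of the cleanup being suggested, assuming every constructor initializes the member explicitly:

    // Before: const member with in-class defaults that the constructors
    // immediately override via their member-initializer lists.
    // After: no in-class defaults, so a missing initialization fails to compile.
    const std::map<ExecutorDeviceType, ExecutorDispatchMode> devices_dispatch_modes_;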

for (const auto& dt_query_desc : query_mem_descs) {
if (policy->getExecutionMode(dt_query_desc.first) ==
ExecutorDispatchMode::KernelPerFragment) {
VLOG(1) << "Creating one execution kernel per fragment";
Contributor:
Looks like the log is out of place. Maybe move it to where the kernels are actually created.

@akroviakov force-pushed the akroviak/heterogen_scheduling branch from 1784a5d to 89478b7 on September 12, 2023 07:48
@kurapov-peter (Contributor) left a comment:
LGTM!

@kurapov-peter merged commit ff6b55e into main on Sep 26, 2023
@kurapov-peter deleted the akroviak/heterogen_scheduling branch on September 26, 2023 08:50