This repository was archived by the owner on May 9, 2024. It is now read-only.

Get rid of frag count overhead in prepareKernelParams #665

Merged
merged 1 commit into main from akroviak/prepare_kernel_params_gpu on Sep 28, 2023

Conversation

@akroviakov (Contributor) commented Sep 14, 2023

Imagine you read CSVs into a table (e.g., 80M rows) and you want to reduce materialization overhead, so you pick a small fragment size (e.g., 40k), or you somehow manage to operate on Arrow chunks directly (roughly 50k rows, zero-copy for CPU). You know that you are in a heterogeneous setting, so devices can actually share a workload since there are many fragments, and you can also do multi-fragment kernels on GPU. You do not expect any issues. That is, until you try to run 8 subsequent group-by-count queries on GPU, each on a different column. You will likely find that fetching a column and preparing kernel parameters gradually take more and more time, and by the 8th run each of them takes over 400ms (30% of the whole query). A new bottleneck emerges due to the higher fragment count, which leads to more metadata being passed (i.e., more allocations and data transfers). This might have been acceptable if preparing kernel parameters were not a per-kernel (i.e., per-step) overhead. It wasn't an issue before because no one would use a 40k fragment size on GPU :), but in a heterogeneous setup it might be the case.

This PR does one allocation for all of the kernel parameters and fills the memory with asynchronous (for CUDA) data transfer calls. Why async? Well, why not: we have potentially many independent data transfers that only need to be ready before kernel launch. This PR reduces the kernel-parameter preparation overhead to basically just the data transfers (which in practice almost always take ~0ms).

That is, in the example above (8 subsequent group-bys), the gradual increase in the kernel-parameter preparation overhead is eliminated.

  • The main benefit, of course, comes from doing one allocation instead of a number of allocations linear in the fragment count.
  • Maybe this small introduction of async data transfer capability will also be useful elsewhere.
  • Also, since the kernel parameters do not live past launchGpuCode() and are now a single buffer, we can easily free it instead of piling up all those allocations.
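To make the change concrete, here is a minimal sketch of the single-allocation, async-copy idea, assuming the CUDA runtime API; the function name, parameter packing, and types are illustrative, not the actual HDK code.

#include <cuda_runtime.h>
#include <cstdint>
#include <vector>

// Sketch only: pack all kernel parameters into one device buffer and fill it
// with asynchronous copies that only need to complete before the kernel launch.
int8_t* packKernelParams(const std::vector<std::vector<int8_t>>& host_params,
                         cudaStream_t stream) {
  size_t total_size = 0;
  for (const auto& p : host_params) {
    total_size += p.size();  // in the PR this size comes from getAllocSizeKernelParams()
  }
  int8_t* dev_buf = nullptr;
  cudaMalloc(&dev_buf, total_size);  // one allocation instead of one per fragment
  size_t offset = 0;
  for (const auto& p : host_params) {
    // Independent transfers; they only have to finish before the kernel launch.
    // Overlap is best with pinned host memory, but even without it we avoid
    // the per-fragment allocation cost.
    cudaMemcpyAsync(dev_buf + offset, p.data(), p.size(),
                    cudaMemcpyHostToDevice, stream);
    offset += p.size();
  }
  // The caller synchronizes the stream (or launches the kernel on the same
  // stream) before launchGpuCode() and can cudaFree(dev_buf) right after,
  // since the kernel parameters do not outlive the launch.
  return dev_buf;
}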

@akroviakov force-pushed the akroviak/prepare_kernel_params_gpu branch from 2adc732 to c99c683 on September 14, 2023 09:46
@ienkovich (Contributor)

Why do we have many kernels on GPU? If we use multi-frag kernels, then increasing the number of fragments x1000 shouldn't increase the number of kernels; it should increase the number of fragments per kernel instead. What am I missing?

@akroviakov (Contributor, Author) commented Sep 15, 2023

You are correct. One multi-frag kernel has to have info on all fragments it touches. In kernel-per-fragment mode it is only one fragment, and the overhead is not noticeable. Assume you have one kernel with 1000 fragments: if you take a look at the current loop (in main) in prepareKernelParams(), it would do 1000 GPU allocations for #cols pointers, which kills the performance. On top of that, we pass the #cols pointers for each of the 1000 fragments synchronously.

For example, currently in a group-by-count with one column and 1000 fragments, we get 1000 separate GPU allocations of 8 bytes each, and after each allocation we synchronously copy the pointer itself, so 1000 synchronous 8-byte data transfers to GPU. After a kernel is done, there is no cleanup of what prepareKernelParams() has allocated (I haven't seen any), so the allocations pile up, and the GPU does not seem to like it.
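For illustration only, a sketch of the per-fragment pattern described above (the names are hypothetical, and the actual loop in prepareKernelParams() differs):

#include <cuda_runtime.h>
#include <cstdint>
#include <vector>

// Sketch of the described pattern: one tiny device allocation plus one
// blocking 8-byte copy per fragment, with no cleanup afterwards.
void perFragmentParams(const std::vector<const int8_t*>& host_col_ptrs) {
  for (const int8_t* col_ptr : host_col_ptrs) {    // e.g., 1000 fragments
    const int8_t** dev_ptr = nullptr;
    cudaMalloc(&dev_ptr, sizeof(int8_t*));         // 8-byte device allocation per fragment
    cudaMemcpy(dev_ptr, &col_ptr, sizeof(int8_t*),
               cudaMemcpyHostToDevice);            // synchronous 8-byte transfer
    // dev_ptr is never freed after the kernel finishes, so these allocations pile up.
  }
}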

I wrote

per kernel (i.e., step) overhead.

to indicate that this overhead is not like fetching data (the data remains on GPU for subsequent queries) and is not even per query, because in a multi-step query you would experience this overhead #steps times.

In fact, there are quite a few other places with per-query overhead due to many fragments; the currently known ones are:

@ienkovich (Contributor)

Thanks for the clarifications!

Does GPU benefit from having many small fragments? For CPU it might allow zero-copy data fetch. But for GPU we have to copy the data anyway, so we might as well always make it a single linear fragment while copying.

@@ -125,7 +125,16 @@ class QueryExecutionContext : boost::noncopyable {
KERN_PARAM_COUNT,
};

std::vector<int8_t*> prepareKernelParams(
size_t getAllocSizeKernelParams(

getKernelParamsSize or getKernelParamsAllocSize sound better to me.

@akroviakov (Contributor, Author)

Does GPU benefit from having many small fragments?

GPU kernel execution by itself does not benefit from many fragments, but it also doesn't suffer much. Some queries can be accelerated by heterogeneous execution with a small enough fragment size. To increase the number of "heterogeneous cases", we have to make HDK (e.g., its GPU code path) more friendly to small fragments (i.e., more fragments).

make it a single linear fragment while copying

Some thoughts on and issues with this were expressed here: #648.

@akroviakov force-pushed the akroviak/prepare_kernel_params_gpu branch from c99c683 to c81fa63 on September 18, 2023 14:04
@kurapov-peter merged commit 35e77f4 into main on Sep 28, 2023
@kurapov-peter deleted the akroviak/prepare_kernel_params_gpu branch on September 28, 2023 13:10