Get rid of frag count overhead in prepareKernelParams #665
Conversation
force-pushed from 2adc732 to c99c683
Why do we have many kernels on GPU? If we use multi-frag kernels, then increasing the number of fragments x1000 shouldn't increase the number of kernels; it should increase the number of fragments per kernel instead. What am I missing?
You are correct. One multifrag kernel has to have info on all fragments it touches; in kernel-per-fragment mode there is only one fragment, so the overhead is not noticeable. Assume you have one kernel with 1000 fragments and take a look at the current loop (on main) in prepareKernelParams: every piece of per-fragment metadata gets its own allocation and its own synchronous copy. For example, currently in a group-by count with one column and 1000 fragments, we get 1000 separate GPU allocations of size 8 bytes, and after each allocation we synchronously copy the pointer itself, so 1000 synchronous 8-byte data transfers to GPU (see the sketch below). After a kernel is done, there is no cleanup of these allocations.

I wrote this to indicate that this overhead is not like fetching data (data remains on GPU for next queries) and is not even per query, because in a multi-step query you would experience this overhead on every step. In fact, there are quite a few other places with per-query overhead due to many fragments.
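To make the pattern concrete, here is a rough sketch of the per-parameter style described above. This is illustrative only, not HDK's actual code; the function and variable names are hypothetical.

```cpp
// Old pattern (hypothetical sketch): one tiny device allocation plus one
// synchronous host-to-device copy per kernel parameter. With 1000 fragments,
// the per-fragment metadata arrays mean ~1000 of these round trips.
#include <cuda.h>
#include <vector>

void prepare_params_old(const std::vector<const void*>& host_params,
                        const std::vector<size_t>& sizes,
                        std::vector<CUdeviceptr>& dev_params) {
  for (size_t i = 0; i < host_params.size(); ++i) {
    CUdeviceptr ptr;
    cuMemAlloc(&ptr, sizes[i]);                   // one small allocation per parameter
    cuMemcpyHtoD(ptr, host_params[i], sizes[i]);  // one blocking copy per parameter
    dev_params.push_back(ptr);                    // never freed after the kernel runs
  }
}
```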
Thanks for the clarifications! Does the GPU benefit from having many small fragments? For CPU it might allow zero-copy data fetching. But for GPU we have to copy the data anyway, so we might as well always make it a single linear fragment while copying.
```diff
@@ -125,7 +125,16 @@ class QueryExecutionContext : boost::noncopyable {
     KERN_PARAM_COUNT,
   };

-  std::vector<int8_t*> prepareKernelParams(
+  size_t getAllocSizeKernelParams(
```
`getKernelParamsSize` or `getKernelParamsAllocSize` sound better to me.
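For context, a helper like this presumably just sums the (aligned) sizes of all kernel parameters so that a single buffer can be allocated up front. A hypothetical sketch, not the PR's actual code:

```cpp
// Hypothetical size-computing helper: walk the same parameter list that
// prepareKernelParams fills and return the total bytes needed for one
// combined device allocation, with each parameter 8-byte aligned.
#include <cstddef>
#include <vector>

size_t getAllocSizeKernelParams(const std::vector<size_t>& param_sizes) {
  auto align8 = [](size_t n) { return (n + 7) & ~size_t{7}; };
  size_t total = 0;
  for (size_t sz : param_sizes) {
    total += align8(sz);  // reserve an aligned slot for each parameter
  }
  return total;
}
```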
GPU kernel execution by itself does not benefit from many fragments, but it also doesn't suffer much. Some queries can be accelerated by heterogeneous execution with a small enough fragment size. To increase the number of "heterogeneous cases", we have to make HDK (e.g., its GPU code path) friendlier to small fragments (i.e., more fragments).
Some thoughts and issues around this were expressed here: #648.
force-pushed from c99c683 to c81fa63
Imagine you read CSVs into a table (e.g., 80M rows) and you want to "try" to reduce materialization overhead, so you pick a small fragment size (e.g., 40k), or you somehow managed to operate on Arrow chunks directly (roughly 50k rows, zero-copy for CPU). You know that you are in a heterogeneous setting, so devices can actually share a workload, since there are many fragments and you can also do multifragment execution on GPU. You do not expect any issues. That is, until you try to run eight subsequent group-by-count queries on GPU, each on a different column. You will likely find that fetching a column and preparing kernel parameters gradually takes more and more time, and by the 8th run each of them takes over 400ms (30% of the whole query). A new bottleneck emerges due to the higher fragment count, which leads to more metadata being passed (i.e., more allocations and data transfers). This might have been OK if it weren't for preparing kernel parameters being a per-kernel (i.e., per-step) overhead. This wasn't an issue before because no one would use a 40k fragment size on GPU :), but in a heterogeneous setup it might be the case.
This PR does one allocation for all of the kernel parameters and fills the memory with asynchronous (for CUDA) data transfer calls. Why async? Well, why not: we have potentially many independent data transfers that only need to be ready before kernel launch. This reduces the kernel-parameter preparation overhead to basically just the data transfers themselves (which in practice almost always measure as 0ms).
That is, in the example above (eight subsequent group-bys), the gradual increase of the kernel-parameter preparation overhead is eliminated. A sketch of the approach follows.
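A minimal sketch of the single-buffer approach, assuming hypothetical names; this is not the PR's exact code. One allocation holds all kernel parameters; each parameter is staged with an async copy on the stream the kernel will launch on, so the copies only have to complete before the kernel starts, and cleanup is a single free.

```cpp
// New pattern (hypothetical sketch): one device allocation for all kernel
// parameters, filled with async copies ordered on the launch stream.
#include <cuda.h>
#include <vector>

std::vector<CUdeviceptr> prepare_params_new(
    const std::vector<const void*>& host_params,
    const std::vector<size_t>& sizes,
    CUstream stream,
    CUdeviceptr& buffer) {
  auto align8 = [](size_t n) { return (n + 7) & ~size_t{7}; };
  size_t total = 0;
  for (size_t sz : sizes) total += align8(sz);

  cuMemAlloc(&buffer, total);  // one allocation for all parameters

  std::vector<CUdeviceptr> dev_params;
  size_t offset = 0;
  for (size_t i = 0; i < sizes.size(); ++i) {
    CUdeviceptr dst = buffer + offset;
    // Async copy ordered on `stream`: guaranteed to complete before a
    // kernel launched on the same stream begins executing.
    cuMemcpyHtoDAsync(dst, host_params[i], sizes[i], stream);
    dev_params.push_back(dst);
    offset += align8(sizes[i]);
  }
  return dev_params;
}

// After the kernel completes, a single call releases everything:
//   cuMemFree(buffer);
```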
Previously, the per-parameter allocations were never cleaned up after
launchGpuCode()
returned. Now that they are a single buffer, we can easily free it instead of piling up all those allocations.