Get rid of frag count overhead in prepareKernelParams #665
Conversation
force-pushed from 2adc732 to c99c683
Why do we have many kernels on GPU? If we use multi-frag kernels, then increasing the number of fragments x1000 shouldn't increase the number of kernels; it should increase the number of fragments per kernel instead. What am I missing?
You are correct. One multifrag kernel has to have info on all fragments it touches; in kernel-per-fragment mode there is only one fragment, so the overhead is not noticeable. Assume you have one kernel with 1000 fragments and take a look at the current loop (on main) in prepareKernelParams: every piece of per-fragment metadata gets its own allocation and its own synchronous copy. For example, currently in a group-by count with one column and 1000 fragments, we get 1000 separate GPU allocations of size 8 bytes, and after each allocation we synchronously copy the pointer itself, so 1000 synchronous 8-byte data transfers to GPU (see the sketch below). After a kernel is done, there is no cleanup of these allocations.

I wrote this to indicate that this overhead is not like fetching data (data remains on GPU for next queries) and is not even per query, because in a multi-step query you would experience this overhead on every step. In fact, there are quite a few other places with per-query overhead due to many fragments.
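To make the pattern concrete, here is a rough sketch of the per-parameter style described above. This is illustrative only, not HDK's actual code; the function and variable names are hypothetical.

```cpp
// Old pattern (hypothetical sketch): one tiny device allocation plus one
// synchronous host-to-device copy per kernel parameter. With 1000 fragments,
// the per-fragment metadata arrays mean ~1000 of these round trips.
#include <cuda.h>
#include <vector>

void prepare_params_old(const std::vector<const void*>& host_params,
                        const std::vector<size_t>& sizes,
                        std::vector<CUdeviceptr>& dev_params) {
  for (size_t i = 0; i < host_params.size(); ++i) {
    CUdeviceptr ptr;
    cuMemAlloc(&ptr, sizes[i]);                   // one small allocation per parameter
    cuMemcpyHtoD(ptr, host_params[i], sizes[i]);  // one blocking copy per parameter
    dev_params.push_back(ptr);                    // never freed after the kernel runs
  }
}
```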
Thanks for the clarifications! Does the GPU benefit from having many small fragments? For CPU it might allow zero-copy data fetching. But for GPU we have to copy the data anyway, so we might as well always make it a single linear fragment while copying.
```diff
@@ -125,7 +125,16 @@ class QueryExecutionContext : boost::noncopyable {
     KERN_PARAM_COUNT,
   };

-  std::vector<int8_t*> prepareKernelParams(
+  size_t getAllocSizeKernelParams(
```
`getKernelParamsSize` or `getKernelParamsAllocSize` sound better to me.
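For context, a helper like this presumably just sums the (aligned) sizes of all kernel parameters so that a single buffer can be allocated up front. A hypothetical sketch, not the PR's actual code:

```cpp
// Hypothetical size-computing helper: walk the same parameter list that
// prepareKernelParams fills and return the total bytes needed for one
// combined device allocation, with each parameter 8-byte aligned.
#include <cstddef>
#include <vector>

size_t getAllocSizeKernelParams(const std::vector<size_t>& param_sizes) {
  auto align8 = [](size_t n) { return (n + 7) & ~size_t{7}; };
  size_t total = 0;
  for (size_t sz : param_sizes) {
    total += align8(sz);  // reserve an aligned slot for each parameter
  }
  return total;
}
```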
GPU kernel execution by itself does not benefit from many fragments, but it also doesn't suffer much. Some queries can be accelerated by heterogeneous execution with a small enough fragment size. To increase the number of "heterogeneous cases", we have to make HDK (e.g., its GPU code path) friendlier to small fragments (i.e., more fragments).
Some thoughts and issues around this were expressed here: #648.
force-pushed from c99c683 to c81fa63
Imagine you read CSVs into a table (e.g., 80M rows) and you want to "try" to reduce materialization overhead, so you pick a small fragment size (e.g., 40k), or you somehow managed to operate on Arrow chunks directly (roughly 50k rows, zero-copy for CPU). You know that you are in a heterogeneous setting, so devices can actually share a workload, since there are many fragments and you can also do multifragment execution on GPU. You do not expect any issues. That is, until you try to run eight subsequent group-by-count queries on GPU, each on a different column. You will likely find that fetching a column and preparing kernel parameters gradually takes more and more time, and by the 8th run each of them takes over 400ms (30% of the whole query). A new bottleneck emerges due to the higher fragment count, which leads to more metadata being passed (i.e., more allocations and data transfers). This might have been OK if it weren't for preparing kernel parameters being a per-kernel (i.e., per-step) overhead. This wasn't an issue before because no one would use a 40k fragment size on GPU :), but in a heterogeneous setup it might be the case.
This PR does one allocation for all of the kernel parameters and fills the memory with asynchronous (for CUDA) data transfer calls. Why async? Well, why not: we have potentially many independent data transfers that only need to be ready before kernel launch. This reduces the kernel-parameter preparation overhead to basically just the data transfers themselves (which in practice almost always measure as 0ms).
That is, in the example above (eight subsequent group-bys), the gradual increase of the kernel-parameter preparation overhead is eliminated. A sketch of the approach follows.
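A minimal sketch of the single-buffer approach, assuming hypothetical names; this is not the PR's exact code. One allocation holds all kernel parameters; each parameter is staged with an async copy on the stream the kernel will launch on, so the copies only have to complete before the kernel starts, and cleanup is a single free.

```cpp
// New pattern (hypothetical sketch): one device allocation for all kernel
// parameters, filled with async copies ordered on the launch stream.
#include <cuda.h>
#include <vector>

std::vector<CUdeviceptr> prepare_params_new(
    const std::vector<const void*>& host_params,
    const std::vector<size_t>& sizes,
    CUstream stream,
    CUdeviceptr& buffer) {
  auto align8 = [](size_t n) { return (n + 7) & ~size_t{7}; };
  size_t total = 0;
  for (size_t sz : sizes) total += align8(sz);

  cuMemAlloc(&buffer, total);  // one allocation for all parameters

  std::vector<CUdeviceptr> dev_params;
  size_t offset = 0;
  for (size_t i = 0; i < sizes.size(); ++i) {
    CUdeviceptr dst = buffer + offset;
    // Async copy ordered on `stream`: guaranteed to complete before a
    // kernel launched on the same stream begins executing.
    cuMemcpyHtoDAsync(dst, host_params[i], sizes[i], stream);
    dev_params.push_back(dst);
    offset += align8(sizes[i]);
  }
  return dev_params;
}

// After the kernel completes, a single call releases everything:
//   cuMemFree(buffer);
```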
Previously, the per-parameter allocations were never cleaned up after
launchGpuCode()
returned. Now that they are a single buffer, we can easily free it instead of piling up all those allocations.