Streams for BLAS. maskedCopy. generic reduce kernels. bugfix. #167
Conversation
Implemented maskedCopy and generic Reduce kernels.
All tests pass.
cc: @dominikgrewe more stuff on its way. Do review when you get a chance. We've been using this branch for most of the last month or more, so it's definitely stable and, as far as I know, fairly bug-free.
Could we couple the BLAS handles with streams? I.e., for each stream we'd have a separate BLAS handle, and when you switch streams you automatically switch BLAS handles.
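A minimal sketch of what that coupling could look like, with illustrative names (this is not the actual cutorch API): each stream index owns its own cuBLAS handle, and selecting a stream also binds that handle to it via cublasSetStream.

#include <cuda_runtime.h>
#include <cublas_v2.h>

/* Illustrative only: per-device resources pairing each stream with a handle. */
typedef struct {
  int numStreams;
  cudaStream_t*   streams;      /* one stream per index */
  cublasHandle_t* blasHandles;  /* one handle per stream index */
} PerDeviceResourcesSketch;

/* Select stream i and return the handle now bound to it, so subsequent
 * cuBLAS calls on that handle run on that stream. */
static cublasHandle_t selectStreamSketch(PerDeviceResourcesSketch* res, int i) {
  cublasSetStream(res->blasHandles[i], res->streams[i]);
  return res->blasHandles[i];
}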
Btw, it would be great if you could break these patches down a bit more. There is so much going on here that it's hard to make sense of it all. I won't get a chance to take a closer look until early next week. Was the THCDeviceTensor code meant to be in this PR? You don't mention it in your comments.
Regarding BLAS handles, response from Nicolas (FB): running mxm (513x513x513), 53 iterations (parallel over streams), 1 batch, GReductions (virtual fmas)/s = 634.47793, time = 11.28ms. These are simple perf results for the 1x1, 1x4, 4x1 and 4x4 handles x streams configurations.
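For context, a hypothetical sketch (not the actual benchmark code) of the kind of setup those "handles x streams" configurations describe: the same SGEMM issued on several streams, each paired with its own handle, so the launches can overlap instead of serializing on one handle.

#include <cuda_runtime.h>
#include <cublas_v2.h>

/* Issue n SGEMMs on size x size matrices, one per stream/handle pair. */
void sgemmOverStreamsSketch(cublasHandle_t* handles, cudaStream_t* streams,
                            int n, int size,
                            const float* A, const float* B, float* C) {
  const float alpha = 1.0f, beta = 0.0f;
  for (int i = 0; i < n; ++i) {
    cublasSetStream(handles[i], streams[i]);
    cublasSgemm(handles[i], CUBLAS_OP_N, CUBLAS_OP_N,
                size, size, size,
                &alpha, A, size, B, size,
                &beta, C + (size_t)i * size * size, size);
  }
  for (int i = 0; i < n; ++i) {
    cudaStreamSynchronize(streams[i]);
  }
}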
I'm going to break these down further in the future. It's just that we've drifted too far from HEAD, and I wanted to get the PR out there.
The THCDeviceTensor* code is an isolated subset of facebook::cuda (https://github.com/facebook/fbcuda). We isolated the subset needed for the reduction kernels and the Volumetric Max/Avg pooling kernels, and named it THCDevice* so that the core of cutorch doesn't depend on fbcuda.
typedef struct _THCCudaResourcesPerDevice {
  cudaStream_t* streams;
  cublasHandle_t* blasHandles;
This doesn't match the definition in THCGeneral.h
The scratch space needs to be added here.
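For illustration, a sketch of how the per-device resources might also carry reduction scratch space; the scratch-related fields below are assumptions, not the actual THCGeneral.h definition.

typedef struct _THCCudaResourcesPerDeviceSketch {
  cudaStream_t*   streams;
  cublasHandle_t* blasHandles;
  /* assumed fields: per-stream device scratch for reductions */
  size_t scratchSpacePerStream;     /* bytes of scratch per stream */
  void** devScratchSpacePerStream;  /* one device buffer per stream */
} THCCudaResourcesPerDeviceSketch;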
cc: @nicolasvasilache please take a look at @dominikgrewe's comments.
  state->currentStream));
}
void THCState_setBlasHandle(THCState *state, int device, int handle) |
You only really want to call this method if device is the current device, right? Otherwise currentBlasHandle holds a handle for a device other than the one we're currently using.
Correct, this is how it is used at the moment, but it should be enforced (by having only a THCState_setBlasHandleForCurrentDevice function).
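A rough sketch of that enforced variant, assuming the THCState_setBlasHandle(state, device, handle) signature shown above; the wrapper simply pins device to whatever device is current.

#include <cuda_runtime.h>
#include "THCGeneral.h"

void THCState_setBlasHandleForCurrentDeviceSketch(THCState* state, int handle) {
  int device;
  cudaGetDevice(&device);                         /* query the current device */
  THCState_setBlasHandle(state, device, handle);  /* only ever set for it */
}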
@soumith I'll send out a diff locally fixing the FFI and converting logicalany/logicalall that you can pull into this.
As far as I can tell, THCDeviceTensor is not used in this PR (the reduce kernels don't use it). Can you factor it out into a separate PR? It should be fairly straightforward, and we'd keep the commit history a bit saner.
Sure, I'll do that.
Hi, random comment, are you sure you need the
@hughperkins To take the two parts in turn:
The loop iterates in linear index order (i.e., [0, numElements - 1]), and each linear index is converted to a real byte offset into a multi-dimensional array (with arbitrary strides/holes) using the appropriate math. You could be iterating over 5-d slices of a 6-d array, reducing over the 3rd dimension. It can't just be += reductionStride, since the slice being reduced doesn't necessarily have a constant stride. It should compile down to effectively that in the case where the slice being reduced does have a constant stride, though.
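A sketch of the index math being described (the helper name and parameters are illustrative): convert a linear index in [0, numElements - 1] into an element offset for a tensor with arbitrary sizes/strides; multiply by the element size if a byte offset is needed.

__host__ __device__
long linearIndexToOffsetSketch(long linearIndex, int dims,
                               const long* sizes, const long* strides) {
  long offset = 0;
  /* peel off one coordinate per dimension, innermost first */
  for (int d = dims - 1; d >= 0; --d) {
    long curIdx = linearIndex % sizes[d];
    offset += curIdx * strides[d];
    linearIndex /= sizes[d];
  }
  return offset;
}

When the slice being reduced does have a constant stride, the per-step work collapses to adding that stride, which is the "compiles down to effectively that" case mentioned above.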
// We've already checked that this offset is <= 2^24, so this is ok.
int srcOffset = (int) (mask - baseMask);
*out = src[(int) maskPrefixSum[srcOffset]];
}
Can we not use THCudaTensor_pointwiseApply3 here and pass in maskPrefixSum as an argument? Then we wouldn't need to calculate the offset.
yes
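For illustration, a sketch of what that apply3-style functor could look like (names are assumptions, not necessarily the final cutorch code): operate elementwise on (dst, mask, maskPrefixSum), keeping the source as a raw pointer, so no pointer arithmetic on mask is needed.

struct TensorMaskedCopyOpSketch {
  explicit TensorMaskedCopyOpSketch(const float* s) : src(s) {}

  /* called once per element triple by a pointwise apply3-style helper */
  __device__ void operator()(float* out, float* mask, float* maskPrefixSum) const {
    if (*mask != 0.0f) {
      *out = src[(long) *maskPrefixSum];
    }
  }

  const float* src;
};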
@soumith I just added a few more comments. Once those and the previous ones have been addressed, please let me have another quick look; it should be good to go then. Thanks.
I get the following compile warning, using yesterday's goodies2:
Looking at the original code:
... looks like numBlocks is long, i.e. signed, but scratchSpace is size_t, i.e. unsigned?
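An illustration of the kind of mismatch being described (not the actual cutorch lines): comparing a signed long against an unsigned size_t is what typically triggers that warning, and checking the sign plus an explicit cast is one conventional fix.

#include <stddef.h>

static int scratchTooSmallSketch(long numBlocks, size_t scratchSpace) {
  /* "return numBlocks > scratchSpace;" would warn: comparison between
   * signed and unsigned integer expressions */
  return numBlocks > 0 && (size_t) numBlocks > scratchSpace;
}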
E.g., CeilDiv is needed by THC[l]ReduceAll.cl/cuh currently, and CeilDiv is currently part of THC[l]DeviceTensor.
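Assuming CeilDiv is the usual ceiling-division helper (a sketch; the exact THCDeviceTensor definition may differ), it is used for things like computing how many blocks are needed to cover a given number of elements.

template <typename T>
__host__ __device__ inline T ceilDivSketch(T a, T b) {
  return (a + b - 1) / b;
}

/* e.g., blocks needed to cover numElements items with blockSize threads:
 *   long numBlocks = ceilDivSketch<long>(numElements, (long) blockSize); */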
Personal observation: it would be easier to find functions if they were prefixed with the module name :-) E.g., in THCReduceAll, there is a method
Ah, I didn't see that it was part of the THCDeviceTensor code. Good catch, thanks!
I think all functions operating on tensors (and that's the vast majority) start with THCudaTensor, apart from "special" ones like the methods wrapping cuBLAS. So maybe files like THCReduceAll should be renamed to THCTensorReduceAll? That would be more consistent with THCTensorMath etc. and should hopefully make it fairly obvious where to look for such functions. Prefixing method names according to the files they're defined in means that users of THC need to know which function is defined where, which doesn't seem very sensible.
Question: why do we pass get_local_size(0) into reduceBlock in THClTensor_reduceContigDim, rather than 'reductionSize'? It seems that this works ok since all threads have r set to
In my opinion, the distinction for reduceAll etc. is that they are not part of the C Torch tensor API (e.g., the same functions implemented in torch7 for CPU tensors). They are implementation utilities for implementing the C Torch CudaTensor API, and aren't meant for end users of the C tensor API. I can change it if you want, or leave it up to @soumith.
They are also C++ templates, not C API calls.
@dominikgrewe please see #181; @colesbury is taking this over to completion...
Stream support for BLAS handles.
Implemented maskedCopy.
Generic Reduce kernels.
Fixed a small FFI inconsistency introduced with #158.