
Streams for BLAS. maskedCopy. generic reduce kernels. bugfix. #167

Closed
wants to merge 1 commit

Conversation

@soumith (Member) commented May 28, 2015

- Stream support for BLAS handles
- maskedCopy implemented
- Generic reduce kernels
- Fixed a small FFI inconsistency introduced with #158

@soumith (Member Author) commented May 28, 2015

All tests pass.

@soumith (Member Author) commented May 29, 2015

cc @dominikgrewe: more stuff is on its way. Do review when you get a chance.

We've been using this branch for the better part of the last month, so it's stable and, as far as I know, fairly bug-free.

@soumith mentioned this pull request on May 29, 2015
@dominikgrewe (Member)

Could we couple the BLAS handles with streams? I.e. for each stream, we'd have a separate BLAS handle, and when you switch streams you automatically switch BLAS handles.
What do you get from managing them separately?
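
For illustration, a minimal sketch of what such coupling could look like, using only the public cuBLAS API (cublasCreate / cublasSetStream); this is not the cutorch code, and the struct and function names below are made up for the example:

/* Hedged sketch: one cuBLAS handle per stream, created and switched together,
   so a BLAS call always runs on the currently selected stream.
   StreamBlasPairs and its functions are illustrative names, not cutorch APIs. */
#include <cuda_runtime.h>
#include <cublas_v2.h>

#define N_STREAMS 4

typedef struct {
  cudaStream_t streams[N_STREAMS];
  cublasHandle_t handles[N_STREAMS];
  int current;
} StreamBlasPairs;

void pairsInit(StreamBlasPairs* p) {
  for (int i = 0; i < N_STREAMS; ++i) {
    cudaStreamCreate(&p->streams[i]);
    cublasCreate(&p->handles[i]);
    cublasSetStream(p->handles[i], p->streams[i]); /* couple handle <-> stream */
  }
  p->current = 0;
}

/* Selecting a stream implicitly selects its BLAS handle. */
cublasHandle_t pairsSetCurrent(StreamBlasPairs* p, int i) {
  p->current = i;
  return p->handles[i];
}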

@dominikgrewe (Member)

By the way, it would be great if you could break these patches down a bit more; there is so much going on here that it's hard to make sense of it all. I won't get a chance to take a closer look until early next week.

Was the THCDeviceTensor code meant to be in this PR? You don't mention it in your comments.

@soumith (Member Author) commented May 29, 2015

Regarding BLAS handles, the response from Nicolas (FB):
"They are fundamentally different resources with different lifetimes.
At the moment, you can run multiple cuBLAS calls with multiple streams on a single handle, or use only a single stream across multiple handles.
I don't have a strong argument against merging them, but there may be performance / memory consumption implications of always forcing a single stream onto a single handle.
The problem is I can't think of a good test that would prove or disprove this.

Running mxm (513x513x513): 53 iterations (parallel over streams), 1 batches, GReductions(virtual fmas)/s = 634.47793 time = 11.28ms
Running mxm (513x513x513): 53 iterations (parallel over streams), 1 batches, GReductions(virtual fmas)/s = 785.07805 time = 9.11ms
Running mxm (513x513x513): 53 iterations (parallel over streams), 1 batches, GReductions(virtual fmas)/s = 931.75268 time = 7.68ms
Running mxm (513x513x513): 53 iterations (parallel over streams), 1 batches, GReductions(virtual fmas)/s = 934.15645 time = 7.66ms

These are simple perf results for 1x1, 1x4, 4x1 and 4x4 handles x streams, respectively.
If you have only 1 handle, it's better to have 4 streams.
But then it's even better to have 4 handles, and there is no significant gain from 4x4 over 4x1."
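
For reference, a rough sketch of the "1 handle x N streams" pattern described above (not the benchmark code; matrix sizes and arguments are placeholders). A single handle is retargeted with cublasSetStream before each call, whereas the 4x1 / 4x4 configurations give each stream its own handle:

/* Hedged sketch of the "1 handle, N streams" configuration: the same cuBLAS
   handle is pointed at a different stream before each sgemm. Sizes and
   pointers are placeholders, not the actual mxm benchmark. */
#include <cuda_runtime.h>
#include <cublas_v2.h>

void sgemmAcrossStreams(cublasHandle_t handle, cudaStream_t* streams,
                        int nStreams, int n,
                        const float* A, const float* B, float* C) {
  const float alpha = 1.0f, beta = 0.0f;
  for (int i = 0; i < nStreams; ++i) {
    cublasSetStream(handle, streams[i]);          /* retarget the one handle */
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, A, n, B, n, &beta,
                C + (size_t)i * n * n, n);        /* each stream writes its own C slab */
  }
}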

@soumith (Member Author) commented May 29, 2015

I'm going to break these down further in the future. It's just that we've diverged too far from HEAD, and I wanted to get the PR out there.

@soumith (Member Author) commented May 29, 2015

THCDeviceTensor* is an isolated subset of facebook::cuda (https://github.com/facebook/fbcuda). We isolated the subset needed for the reduction kernels and the volumetric max/average pooling kernels, and named it THCDevice* so that the core of cutorch doesn't depend on fbcuda.
This is approximately what we discussed with @akfidjeland and @koraykv in that email thread a couple of months ago.


typedef struct _THCCudaResourcesPerDevice {
  cudaStream_t* streams;
  cublasHandle_t* blasHandles;
Member: This doesn't match the definition in THCGeneral.h.

Contributor: The scratch space needs to be added here.

@soumith (Member Author) commented Jun 1, 2015

cc @nicolasvasilache: please take a look at @dominikgrewe's comments.

state->currentStream));
}

void THCState_setBlasHandle(THCState *state, int device, int handle)
Member: You only really want to call this method if device is the current device, right? Otherwise currentBlasHandle holds a handle for a device other than the one we're currently using.

Contributor: Correct, this is how it is used at the moment, but it should be enforced (by having only a THCState_setBlasHandleForCurrentDevice function).
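
For illustration, a minimal sketch of that enforcement, assuming a hypothetical currentBlasHandle field and a per-device handle getter; the names below are assumptions, not the real THCState layout:

/* Hedged sketch: expose only a "for the current device" setter, so the cached
   current handle can never refer to another device. The field and helper names
   (currentBlasHandle, THCState_getDeviceBlasHandle) are assumed for the example. */
void THCState_setBlasHandleForCurrentDevice(THCState* state, int handle)
{
  int device;
  THCudaCheck(cudaGetDevice(&device));  /* always the active device */
  state->currentBlasHandle = THCState_getDeviceBlasHandle(state, device, handle);
}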

@wickedfoo (Contributor)

@soumith I'll send out a diff locally fixing the FFI and converting logicalany/logicalall that you can pull into this.

@dominikgrewe (Member)

As far as I can tell, THCDeviceTensor is not used in this PR (the reduce kernels don't use it). Can you factor it out into a separate PR? It should be fairly straightforward and we'd keep the commit history a bit saner.

@soumith (Member Author) commented Jun 2, 2015

Sure, I'll do that.

@hughperkins (Contributor)

Hi, random comment: are you sure you need the init value? I guess you could do something like this:

  // Seed the accumulator with the first element, so no separate init value is needed.
  float r = modifyOp(in.data[inOffset]);
  inOffset += reductionStride;
  for (IndexType i = 1; i < reductionSize; ++i) {
    r = reduceOp(r, modifyOp(in.data[inOffset]));
    inOffset += reductionStride;
  }

@wickedfoo (Contributor)

@hughperkins To take the two parts in turn:

  1. Needing init. I think I can remove this. The loops are block-strided, so for that to work the block would always have to have # of threads >= totalElements. I could do that, but I'd have to check it in a couple of places, and there's a possibly fragile dependency between the minimum number of elements and the block size. I'll look into it.
  2. Simplifying the loop. There's this piece of code that occurs in various places:
    const IndexType inOffset = IndexToOffset<IndexType, ADims>::get(i, in);

The loop iterates in linear index order (i.e., [0, numElements - 1]), which is converted to a real offset into a multi-dimensional array (with arbitrary strides/holes) using the appropriate math. You could be iterating over 5-d slices of a 6-d array, reducing over the 3rd dimension. It can't just be += reductionStride, since the slice being reduced doesn't necessarily have a constant stride. It should compile down to effectively that, though, in the case where the slice being reduced does have a constant stride.
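
For readers new to the helper, a simplified sketch of the kind of index math IndexToOffset performs; the real cutorch template is specialized on ADims, so this loop is only the general idea:

// Simplified sketch: convert a linear element index into an offset for a
// tensor with arbitrary sizes/strides. The real IndexToOffset in cutorch is
// a template specialized per dimension count; this loop is just illustrative.
template <typename IndexType>
IndexType linearIndexToOffset(IndexType linearId,
                              const IndexType* sizes,
                              const IndexType* strides,
                              int dims) {
  IndexType offset = 0;
  for (int d = dims - 1; d >= 0; --d) {
    IndexType curIdx = linearId % sizes[d];  // coordinate along dimension d
    offset += curIdx * strides[d];           // strides may contain "holes"
    linearId /= sizes[d];
  }
  return offset;
}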

// We've already checked that this offset is <= 2^24, so this is ok.
int srcOffset = (int) (mask - baseMask);
*out = src[(int) maskPrefixSum[srcOffset]];
}
Member: Can't we use THCudaTensor_pointwiseApply3 here and pass maskPrefixSum in as an argument? Then we wouldn't need to calculate the offset.

Contributor: Yes.
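
For context, a hedged sketch of how that might look; the THCudaTensor_pointwiseApply3 signature and functor shape below are assumed from the style of other cutorch pointwise ops, not taken from this patch:

// Hedged sketch: maskedCopy expressed as a pointwise op over
// (dst, mask, maskPrefixSum), with the source buffer captured in the functor.
// The functor shape and the apply call are assumptions, not the real patch.
struct TensorMaskedCopyOp {
  explicit TensorMaskedCopyOp(float* s) : src(s) {}

  __device__ __forceinline__ void operator()(float* out, float* mask,
                                             float* maskPrefixSum) {
    if (*mask != 0.0f) {
      *out = src[(int) *maskPrefixSum];  // prefix sum gives the source index
    }
  }

  float* src;
};

// Assumed usage:
//   THCudaTensor_pointwiseApply3(state, dst, mask, maskPrefixSum,
//                                TensorMaskedCopyOp(srcData));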

@dominikgrewe (Member)

@soumith I just added a few more comments. Once those and the previous ones have been addressed, please let me have another quick look; it should be good to go then. Thanks.

@hughperkins (Contributor)

I get the following compile warning, using yesterday's goodies2:

/home/user/git/cltorch/lib/THCl/THClReduceAll.h:38:19: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
   if (numBlocks > scratchSpace) {

Looking at the original code:

inline long getTwoPassBlocks(THCState* state, long elements) {
  long numBlocks = THCCeilDiv(elements, THC_REDUCE_ALL_BLOCK_SIZE);

  // We can only have as many blocks as there is scratch space
  size_t scratchSpace =
    THCState_getCurrentDeviceScratchSpaceSize(state) / sizeof(float);
  THAssert(scratchSpace > 0);

  if (numBlocks > scratchSpace) {
    numBlocks = scratchSpace;
  }

  return numBlocks;
}

... it looks like numBlocks is long, i.e. signed, but scratchSpace is size_t, i.e. unsigned?
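
One way to silence it, sketched under the assumption that the scratch-space element count always fits in a long, is to do the comparison in the signed domain:

/* Hedged sketch of a -Wsign-compare fix: cast the scratch-space element count
   to long (assuming it fits) so both sides of the comparison are signed. */
long scratchElements =
  (long) (THCState_getCurrentDeviceScratchSpaceSize(state) / sizeof(float));
THAssert(scratchElements > 0);

if (numBlocks > scratchElements) {
  numBlocks = scratchElements;
}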

@hughperkins (Contributor)

> As far as I can tell, THCDeviceTensor is not used in this PR (the reduce kernels don't use it). Can you factor it out into a separate PR? It should be fairly straightforward and we'd keep the commit history a bit saner.

E.g. CeilDiv is currently needed by THC[l]ReduceAll.cl/cuh, and CeilDiv is currently part of THC[l]DeviceTensor.

@hughperkins (Contributor)

Personal observation: it would be easier to find functions if they were prefixed with the module name :-) E.g., in THCReduceAll there is a function THCudaTensor_reduceAll, which I would naturally hunt for in THCTensor. (At least, I spend a lot of time grepping for functions, since their names give no indication of where to find them.)

@dominikgrewe (Member)

> As far as I can tell, THCDeviceTensor is not used in this PR (the reduce kernels don't use it). Can you factor it out into a separate PR? It should be fairly straightforward and we'd keep the commit history a bit saner.

> E.g. CeilDiv is currently needed by THC[l]ReduceAll.cl/cuh, and CeilDiv is currently part of THC[l]DeviceTensor.

Ah, I didn't see that it was part of the THCDeviceTensor code. Good catch, thanks!
It's unrelated to the rest of the THCDeviceTensor code, though, so we should probably put it into a different file anyway.

@dominikgrewe (Member)

> Personal observation: it would be easier to find functions if they were prefixed with the module name :-) E.g., in THCReduceAll there is a function THCudaTensor_reduceAll, which I would naturally hunt for in THCTensor. (At least, I spend a lot of time grepping for functions, since their names give no indication of where to find them.)

I think all functions operating on tensors (and that's the vast majority) start with THCudaTensor, apart from "special" ones like methods wrapping cuBLAS. So maybe files like THCReduceAll should be renamed to THCTensorReduceAll? That would be more consistent with THCTensorMath etc and should hopefully make it fairly obvious where to look for functions like THCudaTensor_reduceAll.

Prefixing method names according to the files they're defined in means that users of THC need to know which function is defined where, which doesn't seem very sensible.

@hughperkins (Contributor)

Question: why do we pass get_local_size(0) into reduceBlock in THClTensor_reduceContigDim rather than reductionSize? It seems this works OK since all threads have r set to init, but if we changed it to reductionSize, we might not need the init var? (I'm still a newbie to the code, so I'm posing a question really, rather than making a suggestion.)

@wickedfoo (Contributor)

> I think all functions operating on tensors (and that's the vast majority) start with THCudaTensor, apart from "special" ones like methods wrapping cuBLAS. So maybe files like THCReduceAll should be renamed to THCTensorReduceAll? That would be more consistent with THCTensorMath etc. and should hopefully make it fairly obvious where to look for functions like THCudaTensor_reduceAll.

In my opinion, the distinction for reduceAll etc. is that they are not part of the C Torch tensor API (i.e., the same functions implemented in torch7 for CPU tensors). They are implementation utilities for the C Torch CudaTensor API, and aren't meant for end users of the C tensor API.

I can change if you want, or leave it up to @soumith.

@wickedfoo (Contributor)

Also, they are C++ templates, not C API calls.

@soumith (Member Author) commented Jun 18, 2015

@dominikgrewe please see #181; @colesbury is taking this over to completion.

@soumith closed this on Jun 18, 2015
@soumith deleted the goodies2 branch on February 12, 2016