
Streams for BLAS. maskedCopy. generic reduce kernels. bugfix. #167

Closed
wants to merge 1 commit

Conversation

@soumith (Member) commented May 28, 2015

- Stream support for BLAS handles
- maskedCopy implemented
- Generic reduce kernels
- Fixed a small FFI inconsistency introduced with #158

@soumith (Member Author) commented May 28, 2015

All tests pass.

@soumith (Member Author) commented May 29, 2015

cc @dominikgrewe: more stuff is on its way. Do review when you get a chance.

We've been using this branch for the better part of the last month, so it's stable and, as far as I know, fairly bug-free.

@soumith mentioned this pull request on May 29, 2015
@dominikgrewe (Member)

Could we couple the BLAS handles with streams? I.e. for each stream, we'd have a separate BLAS handle, and when you switch streams you automatically switch BLAS handles.
What do you get from managing them separately?
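
For illustration, a minimal sketch of what such coupling could look like, using only the public cuBLAS API (cublasCreate / cublasSetStream); this is not the cutorch code, and the struct and function names below are made up for the example:

/* Hedged sketch: one cuBLAS handle per stream, created and switched together,
   so a BLAS call always runs on the currently selected stream.
   StreamBlasPairs and its functions are illustrative names, not cutorch APIs. */
#include <cuda_runtime.h>
#include <cublas_v2.h>

#define N_STREAMS 4

typedef struct {
  cudaStream_t streams[N_STREAMS];
  cublasHandle_t handles[N_STREAMS];
  int current;
} StreamBlasPairs;

void pairsInit(StreamBlasPairs* p) {
  for (int i = 0; i < N_STREAMS; ++i) {
    cudaStreamCreate(&p->streams[i]);
    cublasCreate(&p->handles[i]);
    cublasSetStream(p->handles[i], p->streams[i]); /* couple handle <-> stream */
  }
  p->current = 0;
}

/* Selecting a stream implicitly selects its BLAS handle. */
cublasHandle_t pairsSetCurrent(StreamBlasPairs* p, int i) {
  p->current = i;
  return p->handles[i];
}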

@dominikgrewe (Member)

By the way, it would be great if you could break these patches down a bit more; there is so much going on here that it's hard to make sense of it all. I won't get a chance to take a closer look until early next week.

Was the THCDeviceTensor code meant to be in this PR? You don't mention it in your comments.

@soumith (Member Author) commented May 29, 2015

Regarding BLAS handles, the response from Nicolas (FB):
"They are fundamentally different resources with different lifetimes.
At the moment, you can run multiple cuBLAS calls with multiple streams on a single handle, or use only a single stream across multiple handles.
I don't have a strong argument against merging them, but there may be performance / memory consumption implications of always forcing a single stream onto a single handle.
The problem is I can't think of a good test that would prove or disprove this.

Running mxm (513x513x513): 53 iterations (parallel over streams), 1 batches, GReductions(virtual fmas)/s = 634.47793 time = 11.28ms
Running mxm (513x513x513): 53 iterations (parallel over streams), 1 batches, GReductions(virtual fmas)/s = 785.07805 time = 9.11ms
Running mxm (513x513x513): 53 iterations (parallel over streams), 1 batches, GReductions(virtual fmas)/s = 931.75268 time = 7.68ms
Running mxm (513x513x513): 53 iterations (parallel over streams), 1 batches, GReductions(virtual fmas)/s = 934.15645 time = 7.66ms

These are simple perf results for 1x1, 1x4, 4x1 and 4x4 handles x streams, respectively.
If you have only 1 handle, it's better to have 4 streams.
But then it's even better to have 4 handles, and there is no significant gain from 4x4 over 4x1."
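
For reference, a rough sketch of the "1 handle x N streams" pattern described above (not the benchmark code; matrix sizes and arguments are placeholders). A single handle is retargeted with cublasSetStream before each call, whereas the 4x1 / 4x4 configurations give each stream its own handle:

/* Hedged sketch of the "1 handle, N streams" configuration: the same cuBLAS
   handle is pointed at a different stream before each sgemm. Sizes and
   pointers are placeholders, not the actual mxm benchmark. */
#include <cuda_runtime.h>
#include <cublas_v2.h>

void sgemmAcrossStreams(cublasHandle_t handle, cudaStream_t* streams,
                        int nStreams, int n,
                        const float* A, const float* B, float* C) {
  const float alpha = 1.0f, beta = 0.0f;
  for (int i = 0; i < nStreams; ++i) {
    cublasSetStream(handle, streams[i]);          /* retarget the one handle */
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, A, n, B, n, &beta,
                C + (size_t)i * n * n, n);        /* each stream writes its own C slab */
  }
}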

@soumith (Member Author) commented May 29, 2015

I'm going to break these down further in the future. It's just that we've diverged too far from HEAD, and I wanted to get the PR out there.

@soumith (Member Author) commented May 29, 2015

THCDeviceTensor* is an isolated subset of facebook::cuda (https://github.com/facebook/fbcuda). We isolated the subset needed for the reduction kernels and the volumetric max/average pooling kernels, and named it THCDevice* so that the core of cutorch doesn't depend on fbcuda.
This is approximately what we discussed with @akfidjeland and @koraykv in that email thread a couple of months ago.


typedef struct _THCCudaResourcesPerDevice {
  cudaStream_t* streams;
  cublasHandle_t* blasHandles;
Member: This doesn't match the definition in THCGeneral.h.

Contributor: The scratch space needs to be added here.

@soumith (Member Author) commented Jun 1, 2015

cc @nicolasvasilache: please take a look at @dominikgrewe's comments.

state->currentStream));
}

void THCState_setBlasHandle(THCState *state, int device, int handle)
Member: You only really want to call this method if device is the current device, right? Otherwise currentBlasHandle holds a handle for a device other than the one we're currently using.

Contributor: Correct, this is how it is used at the moment, but it should be enforced (by having only a THCState_setBlasHandleForCurrentDevice function).
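
For illustration, a minimal sketch of that enforcement, assuming a hypothetical currentBlasHandle field and a per-device handle getter; the names below are assumptions, not the real THCState layout:

/* Hedged sketch: expose only a "for the current device" setter, so the cached
   current handle can never refer to another device. The field and helper names
   (currentBlasHandle, THCState_getDeviceBlasHandle) are assumed for the example. */
void THCState_setBlasHandleForCurrentDevice(THCState* state, int handle)
{
  int device;
  THCudaCheck(cudaGetDevice(&device));  /* always the active device */
  state->currentBlasHandle = THCState_getDeviceBlasHandle(state, device, handle);
}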

@wickedfoo (Contributor)

@soumith I'll send out a diff locally fixing the FFI and converting logicalany/logicalall that you can pull into this.

@dominikgrewe (Member)

As far as I can tell, THCDeviceTensor is not used in this PR (the reduce kernels don't use it). Can you factor it out into a separate PR? It should be fairly straightforward and we'd keep the commit history a bit saner.

@soumith (Member Author) commented Jun 2, 2015

Sure, I'll do that.

@hughperkins (Contributor)

Hi, random comment: are you sure you need the init value? I guess you could do something like this:

  // Seed the accumulator with the first element, so no separate init value is needed.
  float r = modifyOp(in.data[inOffset]);
  inOffset += reductionStride;
  for (IndexType i = 1; i < reductionSize; ++i) {
    r = reduceOp(r, modifyOp(in.data[inOffset]));
    inOffset += reductionStride;
  }

@wickedfoo (Contributor)

@hughperkins To take the two parts in turn:

  1. Needing init. I think I can remove this. The loops are block-strided, so for that to work the block would always have to have # of threads >= totalElements. I could do that, but I'd have to check it in a couple of places, and there's a possibly fragile dependency between the minimum number of elements and the block size. I'll look into it.
  2. Simplifying the loop. There's this piece of code that occurs in various places:
    const IndexType inOffset = IndexToOffset<IndexType, ADims>::get(i, in);

The loop iterates in linear index order (i.e., [0, numElements - 1]), which is converted to a real offset into a multi-dimensional array (with arbitrary strides/holes) using the appropriate math. You could be iterating over 5-d slices of a 6-d array, reducing over the 3rd dimension. It can't just be += reductionStride, since the slice being reduced doesn't necessarily have a constant stride. It should compile down to effectively that, though, in the case where the slice being reduced does have a constant stride.
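
For readers new to the helper, a simplified sketch of the kind of index math IndexToOffset performs; the real cutorch template is specialized on ADims, so this loop is only the general idea:

// Simplified sketch: convert a linear element index into an offset for a
// tensor with arbitrary sizes/strides. The real IndexToOffset in cutorch is
// a template specialized per dimension count; this loop is just illustrative.
template <typename IndexType>
IndexType linearIndexToOffset(IndexType linearId,
                              const IndexType* sizes,
                              const IndexType* strides,
                              int dims) {
  IndexType offset = 0;
  for (int d = dims - 1; d >= 0; --d) {
    IndexType curIdx = linearId % sizes[d];  // coordinate along dimension d
    offset += curIdx * strides[d];           // strides may contain "holes"
    linearId /= sizes[d];
  }
  return offset;
}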

// We've already checked that this offset is <= 2^24, so this is ok.
int srcOffset = (int) (mask - baseMask);
*out = src[(int) maskPrefixSum[srcOffset]];
}
Member: Can't we use THCudaTensor_pointwiseApply3 here and pass maskPrefixSum in as an argument? Then we wouldn't need to calculate the offset.

Contributor: Yes.
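
For context, a hedged sketch of how that might look; the THCudaTensor_pointwiseApply3 signature and functor shape below are assumed from the style of other cutorch pointwise ops, not taken from this patch:

// Hedged sketch: maskedCopy expressed as a pointwise op over
// (dst, mask, maskPrefixSum), with the source buffer captured in the functor.
// The functor shape and the apply call are assumptions, not the real patch.
struct TensorMaskedCopyOp {
  explicit TensorMaskedCopyOp(float* s) : src(s) {}

  __device__ __forceinline__ void operator()(float* out, float* mask,
                                             float* maskPrefixSum) {
    if (*mask != 0.0f) {
      *out = src[(int) *maskPrefixSum];  // prefix sum gives the source index
    }
  }

  float* src;
};

// Assumed usage:
//   THCudaTensor_pointwiseApply3(state, dst, mask, maskPrefixSum,
//                                TensorMaskedCopyOp(srcData));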

@dominikgrewe (Member)

@soumith I just added a few more comments. Once those and the previous ones have been addressed, please let me have another quick look; it should be good to go then. Thanks.

@hughperkins (Contributor)

I get the following compile warning, using yesterday's goodies2:

/home/user/git/cltorch/lib/THCl/THClReduceAll.h:38:19: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
   if (numBlocks > scratchSpace) {

Looking at the original code:

inline long getTwoPassBlocks(THCState* state, long elements) {
  long numBlocks = THCCeilDiv(elements, THC_REDUCE_ALL_BLOCK_SIZE);

  // We can only have as many blocks as there is scratch space
  size_t scratchSpace =
    THCState_getCurrentDeviceScratchSpaceSize(state) / sizeof(float);
  THAssert(scratchSpace > 0);

  if (numBlocks > scratchSpace) {
    numBlocks = scratchSpace;
  }

  return numBlocks;
}

... it looks like numBlocks is long, i.e. signed, but scratchSpace is size_t, i.e. unsigned?
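
One way to silence it, sketched under the assumption that the scratch-space element count always fits in a long, is to do the comparison in the signed domain:

/* Hedged sketch of a -Wsign-compare fix: cast the scratch-space element count
   to long (assuming it fits) so both sides of the comparison are signed. */
long scratchElements =
  (long) (THCState_getCurrentDeviceScratchSpaceSize(state) / sizeof(float));
THAssert(scratchElements > 0);

if (numBlocks > scratchElements) {
  numBlocks = scratchElements;
}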

@hughperkins (Contributor)

> As far as I can tell, THCDeviceTensor is not used in this PR (the reduce kernels don't use it). Can you factor it out into a separate PR? It should be fairly straightforward and we'd keep the commit history a bit saner.

E.g. CeilDiv is currently needed by THC[l]ReduceAll.cl/cuh, and CeilDiv is currently part of THC[l]DeviceTensor.

@hughperkins (Contributor)

Personal observation: it would be easier to find functions if they were prefixed with the module name :-) E.g., in THCReduceAll there is a function THCudaTensor_reduceAll, which I would naturally hunt for in THCTensor. (At least, I spend a lot of time grepping for functions, since their names give no indication of where to find them.)

@dominikgrewe (Member)

> As far as I can tell, THCDeviceTensor is not used in this PR (the reduce kernels don't use it). Can you factor it out into a separate PR? It should be fairly straightforward and we'd keep the commit history a bit saner.

> E.g. CeilDiv is currently needed by THC[l]ReduceAll.cl/cuh, and CeilDiv is currently part of THC[l]DeviceTensor.

Ah, I didn't see that it was part of the THCDeviceTensor code. Good catch, thanks!
It's unrelated to the rest of the THCDeviceTensor code, though, so we should probably put it into a different file anyway.

@dominikgrewe (Member)

> Personal observation: it would be easier to find functions if they were prefixed with the module name :-) E.g., in THCReduceAll there is a function THCudaTensor_reduceAll, which I would naturally hunt for in THCTensor. (At least, I spend a lot of time grepping for functions, since their names give no indication of where to find them.)

I think all functions operating on tensors (and that's the vast majority) start with THCudaTensor, apart from "special" ones like methods wrapping cuBLAS. So maybe files like THCReduceAll should be renamed to THCTensorReduceAll? That would be more consistent with THCTensorMath etc and should hopefully make it fairly obvious where to look for functions like THCudaTensor_reduceAll.

Prefixing method names according to the files they're defined in means that users of THC need to know which function is defined where, which doesn't seem very sensible.

@hughperkins (Contributor)

Question: why do we pass get_local_size(0) into reduceBlock in THClTensor_reduceContigDim rather than reductionSize? It seems this works OK since all threads have r set to init, but if we changed it to reductionSize, we might not need the init var? (I'm still a newbie to the code, so I'm posing a question really, rather than making a suggestion.)

@wickedfoo (Contributor)

> I think all functions operating on tensors (and that's the vast majority) start with THCudaTensor, apart from "special" ones like methods wrapping cuBLAS. So maybe files like THCReduceAll should be renamed to THCTensorReduceAll? That would be more consistent with THCTensorMath etc. and should hopefully make it fairly obvious where to look for functions like THCudaTensor_reduceAll.

In my opinion, the distinction for reduceAll etc. is that they are not part of the C Torch tensor API (i.e., the same functions implemented in torch7 for CPU tensors). They are implementation utilities for the C Torch CudaTensor API, and aren't meant for end users of the C tensor API.

I can change if you want, or leave it up to @soumith.

@wickedfoo (Contributor)

Also, they are C++ templates, not C API calls.

@soumith (Member Author) commented Jun 18, 2015

@dominikgrewe please see #181; @colesbury is taking this over to completion.

@soumith closed this on Jun 18, 2015
@soumith deleted the goodies2 branch on February 12, 2016