diff --git a/Project2-StreamCompaction.zip b/Project2-StreamCompaction.zip new file mode 100755 index 0000000..67322d6 Binary files /dev/null and b/Project2-StreamCompaction.zip differ diff --git a/README.md b/README.md old mode 100644 new mode 100755 index 6e02afa..337dccd --- a/README.md +++ b/README.md @@ -1,133 +1,167 @@ -Project-2 -========= - -A Study in Parallel Algorithms : Stream Compaction - -# INTRODUCTION -Many of the algorithms you have learned thus far in your career have typically -been developed from a serial standpoint. When it comes to GPUs, we are mainly -looking at massively parallel work. Thus, it is necessary to reorient our -thinking. In this project, we will be implementing a couple different versions -of prefix sum. We will start with a simple single thread serial CPU version, -and then move to a naive GPU version. Each part of this homework is meant to -follow the logic of the previous parts, so please do not do this homework out of -order. - -This project will serve as a stream compaction library that you may use (and -will want to use) in your -future projects. For that reason, we suggest you create proper header and CUDA -files so that you can reuse this code later. You may want to create a separate -cpp file that contains your main function so that you can test the code you -write. - -# OVERVIEW -Stream compaction is broken down into two parts: (1) scan, and (2) scatter. - -## SCAN -Scan or prefix sum is the summation of the elements in an array such that the -resulting array is the summation of the terms before it. Prefix sum can either -be inclusive, meaning the current term is a summation of all the elements before -it and itself, or exclusive, meaning the current term is a summation of all -elements before it excluding itself. - -Inclusive: - -In : [ 3 4 6 7 9 10 ] - -Out : [ 3 7 13 20 29 39 ] - -Exclusive - -In : [ 3 4 6 7 9 10 ] - -Out : [ 0 3 7 13 20 29 ] - -Note that the resulting prefix sum will always be n + 1 elements if the input -array is of length n. Similarly, the first element of the exclusive prefix sum -will always be 0. In the following sections, all references to prefix sum will -be to the exclusive version of prefix sum. - -## SCATTER -The scatter section of stream compaction takes the results of the previous scan -in order to reorder the elements to form a compact array. - -For example, let's say we have the following array: -[ 0 0 3 4 0 6 6 7 0 1 ] - -We would only like to consider the non-zero elements in this zero, so we would -like to compact it into the following array: -[ 3 4 6 6 7 1 ] - -We can perform a transform on input array to transform it into a boolean array: - -In : [ 0 0 3 4 0 6 6 7 0 1 ] - -Out : [ 0 0 1 1 0 1 1 1 0 1 ] - -Performing a scan on the output, we get the following array : - -In : [ 0 0 1 1 0 1 1 1 0 1 ] - -Out : [ 0 0 0 1 2 2 3 4 5 5 ] - -Notice that the output array produces a corresponding index array that we can -use to create the resulting array for stream compaction. - -# PART 1 : REVIEW OF PREFIX SUM -Given the definition of exclusive prefix sum, please write a serial CPU version -of prefix sum. You may write this in the cpp file to separate this from the -CUDA code you will be writing in your .cu file. - -# PART 2 : NAIVE PREFIX SUM -We will now parallelize this the previous section's code. Recall from lecture -that we can parallelize this using a series of kernel calls. In this portion, -you are NOT allowed to use shared memory. - -### Questions -* Compare this version to the serial version of exclusive prefix scan. Please - include a table of how the runtimes compare on different lengths of arrays. -* Plot a graph of the comparison and write a short explanation of the phenomenon you - see here. - -# PART 3 : OPTIMIZING PREFIX SUM -In the previous section we did not take into account shared memory. In the -previous section, we kept everything in global memory, which is much slower than -shared memory. - -## PART 3a : Write prefix sum for a single block -Shared memory is accessible to threads of a block. Please write a version of -prefix sum that works on a single block. - -## PART 3b : Generalizing to arrays of any length. -Taking the previous portion, please write a version that generalizes prefix sum -to arbitrary length arrays, this includes arrays that will not fit on one block. - -### Questions -* Compare this version to the parallel prefix sum using global memory. -* Plot a graph of the comparison and write a short explanation of the phenomenon - you see here. - -# PART 4 : ADDING SCATTER -First create a serial version of scatter by expanding the serial version of -prefix sum. Then create a GPU version of scatter. Combine the function call -such that, given an array, you can call stream compact and it will compact the -array for you. Finally, write a version using thrust. - -### Questions -* Compare your version of stream compact to your version using thrust. How do - they compare? How might you optimize yours more, or how might thrust's stream - compact be optimized. - -# EXTRA CREDIT (+10) -For extra credit, please optimize your prefix sum for work parallelism and to -deal with bank conflicts. Information on this can be found in the GPU Gems -chapter listed in the references. - -# SUBMISSION -Please answer all the questions in each of the subsections above and write your -answers in the README by overwriting the README file. In future projects, we -expect your analysis to be similar to the one we have led you through in this -project. Like other projects, please open a pull request and email Harmony. - -# REFERENCES -"Parallel Prefix Sum (Scan) with CUDA." GPU Gems 3. +Project-2 +========= + +A Study in Parallel Algorithms : Stream Compaction + +# INTRODUCTION +Many of the algorithms you have learned thus far in your career have typically +been developed from a serial standpoint. When it comes to GPUs, we are mainly +looking at massively parallel work. Thus, it is necessary to reorient our +thinking. In this project, we will be implementing a couple different versions +of prefix sum. We will start with a simple single thread serial CPU version, +and then move to a naive GPU version. Each part of this homework is meant to +follow the logic of the previous parts, so please do not do this homework out of +order. + +This project will serve as a stream compaction library that you may use (and +will want to use) in your +future projects. For that reason, we suggest you create proper header and CUDA +files so that you can reuse this code later. You may want to create a separate +cpp file that contains your main function so that you can test the code you +write. + +# OVERVIEW +Stream compaction is broken down into two parts: (1) scan, and (2) scatter. + +## SCAN +Scan or prefix sum is the summation of the elements in an array such that the +resulting array is the summation of the terms before it. Prefix sum can either +be inclusive, meaning the current term is a summation of all the elements before +it and itself, or exclusive, meaning the current term is a summation of all +elements before it excluding itself. + +Inclusive: + +In : [ 3 4 6 7 9 10 ] + +Out : [ 3 7 13 20 29 39 ] + +Exclusive + +In : [ 3 4 6 7 9 10 ] + +Out : [ 0 3 7 13 20 29 ] + +Note that the resulting prefix sum will always be n + 1 elements if the input +array is of length n. Similarly, the first element of the exclusive prefix sum +will always be 0. In the following sections, all references to prefix sum will +be to the exclusive version of prefix sum. + +## SCATTER +The scatter section of stream compaction takes the results of the previous scan +in order to reorder the elements to form a compact array. + +For example, let's say we have the following array: +[ 0 0 3 4 0 6 6 7 0 1 ] + +We would only like to consider the non-zero elements in this zero, so we would +like to compact it into the following array: +[ 3 4 6 6 7 1 ] + +We can perform a transform on input array to transform it into a boolean array: + +In : [ 0 0 3 4 0 6 6 7 0 1 ] + +Out : [ 0 0 1 1 0 1 1 1 0 1 ] + +Performing a scan on the output, we get the following array : + +In : [ 0 0 1 1 0 1 1 1 0 1 ] + +Out : [ 0 0 0 1 2 2 3 4 5 5 ] + +Notice that the output array produces a corresponding index array that we can +use to create the resulting array for stream compaction. + +# PART 1 : REVIEW OF PREFIX SUM +Given the definition of exclusive prefix sum, please write a serial CPU version +of prefix sum. You may write this in the cpp file to separate this from the +CUDA code you will be writing in your .cu file. + +# PART 2 : NAIVE PREFIX SUM +We will now parallelize this the previous section's code. Recall from lecture +that we can parallelize this using a series of kernel calls. In this portion, +you are NOT allowed to use shared memory. + +### Questions +* Compare this version to the serial version of exclusive prefix scan. Please + include a table of how the runtimes compare on different lengths of arrays. +* Plot a graph of the comparison and write a short explanation of the phenomenon you + see here. + +### Answers +* My gpu accelerated algorithm actually runs much slower than my sequential cpu algorithm. + I think this is due to a poor implementation of the naive algorithm on my part. The + way my naive algorithm works is to so a segmented scan and save the largest value in + each segment to an auxiliary array. I then call scan recursively on the auxiliary + array. My fatal mistake was that I allocated the memory for the auxiliary array within + the naive_scan call. The result of this is that for an array of length N i have log(n) + calls to malloc which is pretty slow. The sequential version already has everything in + memory so it runs much more quickly. I'm pretty ceratin if I were to initialize all + the memory outside the call I would see significant performance improvement and I plan + to do this before using the algorithm in the next assignment. + +* There are also some other places I could get some speedup, for example in my kernel + call I actually move data between my two temporary arrays inside the loop when I could + just swap their pointers. + +* My graph is in string compaction graph.pdf + +# PART 3 : OPTIMIZING PREFIX SUM +In the previous section we did not take into account shared memory. In the +previous section, we kept everything in global memory, which is much slower than +shared memory. + +## PART 3a : Write prefix sum for a single block +Shared memory is accessible to threads of a block. Please write a version of +prefix sum that works on a single block. + +## PART 3b : Generalizing to arrays of any length. +Taking the previous portion, please write a version that generalizes prefix sum +to arbitrary length arrays, this includes arrays that will not fit on one block. + +### Questions +* Compare this version to the parallel prefix sum using global memory. +* Plot a graph of the comparison and write a short explanation of the phenomenon + you see here. + +### Answers +* I do see some speedup, however my implementation is still far worse than CPU. This is + again due to having allocation inside my recursive call (which was a stupid idea). I + plan to fix this before using this algorithm. +* I did see performance improvement in this version over the last one. This is because + the loop within my kernel calls now only has to go to shared memory instead of out to + global memory. +* My graph is in string compaction graph.pdf + +# PART 4 : ADDING SCATTER +First create a serial version of scatter by expanding the serial version of +prefix sum. Then create a GPU version of scatter. Combine the function call +such that, given an array, you can call stream compact and it will compact the +array for you. Finally, write a version using thrust. + +### Questions +* Compare your version of stream compact to your version using thrust. How do + they compare? How might you optimize yours more, or how might thrust's stream + compact be optimized. + +### Answers +* Unfortunately I spent many hours debugging and was not able to implement a version in + thrust, however I'm quite certain their implementation is superior to mine in every way. + They probably have done all sorts of tricks to squeeze out extra performance such as + doing a work-efficient algorithm without bank conflicts. They also probably packed + more data into registers to do more work per thread. + +# EXTRA CREDIT (+10) +For extra credit, please optimize your prefix sum for work parallelism and to +deal with bank conflicts. Information on this can be found in the GPU Gems +chapter listed in the references. + +# SUBMISSION +Please answer all the questions in each of the subsections above and write your +answers in the README by overwriting the README file. In future projects, we +expect your analysis to be similar to the one we have led you through in this +project. Like other projects, please open a pull request and email Harmony. + +# REFERENCES +"Parallel Prefix Sum (Scan) with CUDA." GPU Gems 3. diff --git a/String Compaction Graph.pdf b/String Compaction Graph.pdf new file mode 100644 index 0000000..00a9232 Binary files /dev/null and b/String Compaction Graph.pdf differ diff --git a/cusamatrixmath.sln b/cusamatrixmath.sln new file mode 100755 index 0000000..8bd80b4 --- /dev/null +++ b/cusamatrixmath.sln @@ -0,0 +1,26 @@ + +Microsoft Visual Studio Solution File, Format Version 11.00 +# Visual Studio 2010 +Project("{8BC9CEB8-8B4A-11D0-8D11-00A0C91BC942}") = "cusamatrixmath", "cusamatrixmath\cusamatrixmath.vcxproj", "{2A8A854C-1E6A-44C5-B4B6-E66CCCC6E99D}" +EndProject +Global + GlobalSection(SolutionConfigurationPlatforms) = preSolution + Debug|Win32 = Debug|Win32 + Debug|x64 = Debug|x64 + Release|Win32 = Release|Win32 + Release|x64 = Release|x64 + EndGlobalSection + GlobalSection(ProjectConfigurationPlatforms) = postSolution + {2A8A854C-1E6A-44C5-B4B6-E66CCCC6E99D}.Debug|Win32.ActiveCfg = Debug|Win32 + {2A8A854C-1E6A-44C5-B4B6-E66CCCC6E99D}.Debug|Win32.Build.0 = Debug|Win32 + {2A8A854C-1E6A-44C5-B4B6-E66CCCC6E99D}.Debug|x64.ActiveCfg = Debug|x64 + {2A8A854C-1E6A-44C5-B4B6-E66CCCC6E99D}.Debug|x64.Build.0 = Debug|x64 + {2A8A854C-1E6A-44C5-B4B6-E66CCCC6E99D}.Release|Win32.ActiveCfg = Release|Win32 + {2A8A854C-1E6A-44C5-B4B6-E66CCCC6E99D}.Release|Win32.Build.0 = Release|Win32 + {2A8A854C-1E6A-44C5-B4B6-E66CCCC6E99D}.Release|x64.ActiveCfg = Release|x64 + {2A8A854C-1E6A-44C5-B4B6-E66CCCC6E99D}.Release|x64.Build.0 = Release|x64 + EndGlobalSection + GlobalSection(SolutionProperties) = preSolution + HideSolutionNode = FALSE + EndGlobalSection +EndGlobal diff --git a/cusamatrixmath/cudaMat4.h b/cusamatrixmath/cudaMat4.h new file mode 100755 index 0000000..e8465f3 --- /dev/null +++ b/cusamatrixmath/cudaMat4.h @@ -0,0 +1,25 @@ +// CIS565 CUDA Raytracer: A parallel raytracer for Patrick Cozzi's CIS565: GPU Computing at the University of Pennsylvania +// Written by Yining Karl Li, Copyright (c) 2012 University of Pennsylvania +// This file includes code from: +// Yining Karl Li's TAKUA Render, a massively parallel pathtracing renderer: http://www.yiningkarlli.com + +#ifndef CUDAMAT4_H +#define CUDAMAT4_H + +#include "glm/glm.hpp" +#include + +struct cudaMat3{ + glm::vec3 x; + glm::vec3 y; + glm::vec3 z; +}; + +struct cudaMat4{ + glm::vec4 x; + glm::vec4 y; + glm::vec4 z; + glm::vec4 w; +}; + +#endif \ No newline at end of file diff --git a/cusamatrixmath/glslUtility.h b/cusamatrixmath/glslUtility.h new file mode 100755 index 0000000..28bf8c5 --- /dev/null +++ b/cusamatrixmath/glslUtility.h @@ -0,0 +1,21 @@ +// GLSL Utility: A utility class for loading GLSL shaders, for Patrick Cozzi's CIS565: GPU Computing at the University of Pennsylvania +// Written by Varun Sampath and Patrick Cozzi, Copyright (c) 2012 University of Pennsylvania + +#ifndef GLSLUTILITY_H_ +#define GLSLUTILITY_H_ + +#ifdef __APPLE__ + #include +#else + #include +#endif + +namespace glslUtility +{ + +GLuint createProgram(const char *vertexShaderPath, const char *fragmentShaderPath, const char *attributeLocations[], GLuint numberOfLocations); +GLuint createProgram(const char *vertexShaderPath, const char *geometryShaderPath, const char *fragmentShaderPath, const char *attributeLocations[], GLuint numberOfLocations); + +} + +#endif diff --git a/cusamatrixmath/kernel.h b/cusamatrixmath/kernel.h new file mode 100755 index 0000000..25c11cc --- /dev/null +++ b/cusamatrixmath/kernel.h @@ -0,0 +1,18 @@ +#ifndef KERNEL_H +#define KERNEL_H + +#include +#include +#include +#include + +#define blockSize 128 +#define checkCUDAErrorWithLine(msg) checkCUDAError(msg, __LINE__) +#define SHARED 0 + +void checkCUDAError(const char *msg, int line); +void cudaNBodyUpdateWrapper(float dt); +void initCuda(int N); +void cudaUpdatePBO(float4 * pbodptr, int width, int height); +void cudaUpdateVBO(float * vbodptr, int width, int height); +#endif diff --git a/cusamatrixmath/matrix_math.cu b/cusamatrixmath/matrix_math.cu new file mode 100755 index 0000000..3082f9f --- /dev/null +++ b/cusamatrixmath/matrix_math.cu @@ -0,0 +1,313 @@ +#include "matrix_math.h" +#define len 50 //max = 100000000 +#define LOOPS 10 +int main (int argc, char** argv){ + float *a = new float[len]; + float *result = new float[len]; + for(long i= 0; i < len; i++){ + a[i] = 1.0; + } + ////////////////////////////////////////////////////// + ///////////////// initialize Cuda /////////////////// + //////////////////////////////////////////////////// + float *dev_a, *dev_tmpA, *dev_tmpB, *dev_result, *dev_seg; + cudaMalloc( (void**)&dev_a, len * sizeof(float) ); + cudaMalloc( (void**)&dev_tmpA, len * sizeof(float) ); + cudaMalloc( (void**)&dev_tmpB, len * sizeof(float) ); + cudaMalloc( (void**)&dev_result, len * sizeof(float) ); + cudaMemcpy( dev_a, a, len * sizeof(float), cudaMemcpyHostToDevice ); + + + + clock_t t; + /////////////////////////////////////////////////////// + /////////////// sequential scan ////////////////// + ///////////////////////////////////////////////////// + + t = clock(); //start timer + for(int i = 0; i < LOOPS; i++){ + seq_scan(a, result, len); + } + t = clock() - t; //end timer + printf ("Sequential Scan took %d clicks (%f seconds).\n",t,((float)t)/CLOCKS_PER_SEC); + cout << result[len - 1] << "\n"; + //print_array(result, len); + + //////////////////////////////////////////////////////////////// + ///////////////// NAIVE SCAN /////////////////////////////// + ////////////////////////////////////////////////////////////// + + t = clock(); + for(int i = 0; i < LOOPS; i++){ + naive_scan(dev_a,dev_tmpA,dev_tmpB,dev_result, len); + } + t = clock() - t; + printf ("Naive GPU took me %d clicks (%f seconds).\n",t,((float)t)/CLOCKS_PER_SEC); + cudaMemcpy( result, dev_result, len * sizeof(float), cudaMemcpyDeviceToHost ); + //print_array(result, len); + cout << result[len - 1] << "\n"; + + + //////////////////////////////////////////////////////////////// + ///////////////// SHARED SCAN /////////////////////////////// + ////////////////////////////////////////////////////////////// + + t = clock(); + for(int i = 0; i < LOOPS; i++){ + shared_scan(dev_a, dev_result, len); + } + t = clock() - t; + printf ("Shared GPU took me %d clicks (%f seconds).\n",t,((float)t)/CLOCKS_PER_SEC); + cudaMemcpy( result, dev_result, len * sizeof(float), cudaMemcpyDeviceToHost ); + //print_array(result, len); + cout << result[len - 1] << "\n"; + + //////////////////////////////////////////////////////////////// + ///////////////// String Compaction ///////////////////////// + ////////////////////////////////////////////////////////////// + float* sparse_array = new float[10]; + for(int i = 0; i<10; i += 2){ + sparse_array[i] = 0.0; + sparse_array[i + 1] = i + 1; + } + print_array(sparse_array, 10); + float* compact_array = new float[10]; + string_compaction(sparse_array, compact_array, 10); + print_array(compact_array, 10); + + + cudaFree(dev_a); + cudaFree(dev_tmpA); + cudaFree(dev_tmpB); + cudaFree(dev_result); + free(a); + free(result); + free(sparse_array); + free(compact_array); + std::cin.get(); +} + +void seq_scan(float* arr, float* result, int length){ + result[0] = arr[0]; + for (int i = 1; i < length; i++){ + result[i] = result[i-1] + arr[i]; + } +} + +void seq_scatter(float* in_arr, float* scan_arr, float* result, int length){ + if(scan_arr[0] == 1){ + result[0] = in_arr[0]; + } + for(int i = 1; i < length; i++){ + if (scan_arr[i] > scan_arr[i - 1]){ + result[(int)scan_arr[i] - 1] = in_arr[i]; + } + } +} + +void naive_scan( float *dev_a, float *dev_tmpA, float* dev_tmpB, float *dev_result, int length){ + int threads = min(32, length); + int block = ceil((float)length / (float)threads); + int segCount = ceil((float)length / (float)threads); + + if(length == threads){ + glob_excl_block_scan<<< 1,threads>>>( dev_a, dev_tmpA, dev_result, length); /////////// + return; + } + float *dev_seg; + cudaMalloc( (void**)&dev_seg, segCount * sizeof(float) ); + + glob_sec_scan<<< block,threads>>>( dev_a, dev_tmpA,dev_tmpB, dev_result, dev_seg, length); //Segmented scan + + //scan resulting segments + naive_scan(dev_seg, dev_tmpA, dev_tmpB, dev_seg, block); + /* + float* segs = new float[block]; + cudaMemcpy( segs, dev_seg, block * sizeof(float), cudaMemcpyDeviceToHost ); + print_array(segs, block); + */ + add_segments<<< block, threads >>>( dev_result, dev_seg, length); + cudaFree(dev_seg); +} + +void shared_scan( float *dev_a, float *dev_result, int length){ + int threads = min(32, length); + int block = ceil((float)length / (float)threads); + int segCount = ceil((float)length / (float)threads); + + if(length == threads){ + //glob_excl_block_scan<<< 1,threads>>>( dev_a, dev_tmpA, dev_result, length); //use shared mem + shared_block_scan<<< 1, threads, 2 * threads * sizeof(float) >>>( dev_a, dev_result, length); + return; + } + float *dev_seg; + cudaMalloc( (void**)&dev_seg, segCount * sizeof(float) ); + + shared_naive_scan<<< block,threads, threads * 2 * sizeof(float)>>>( dev_a, dev_result, dev_seg, length);//Segmented scan + + + //scan resulting segments + shared_scan(dev_seg, dev_seg, block); + /* + float* segs = new float[block]; + cudaMemcpy( segs, dev_seg, block * sizeof(float), cudaMemcpyDeviceToHost ); + print_array(segs, block); + */ + add_segments<<< block, threads >>>( dev_result, dev_seg, length); + cudaFree(dev_seg); +} + +void string_compaction( float* in_array, float* comp_array, int length){ + int threads = min(32, length); + int block = ceil((float)length / (float)threads); + + float* debug = new float[10]; + + float *dev_a, *dev_result, *dev_compact, *dev_bool; + cudaMalloc( (void**)&dev_a, length * sizeof(float) ); + cudaMalloc( (void**)&dev_bool, length * sizeof(float) ); + cudaMalloc( (void**)&dev_result, length * sizeof(float) ); + cudaMalloc( (void**)&dev_compact, length * sizeof(float) ); + cudaMemcpy( dev_a, in_array, length * sizeof(float), cudaMemcpyHostToDevice ); + + make_bool_arr<<< block, threads >>>(dev_a, dev_bool, length); + shared_scan(dev_bool, dev_result, length); + cudaMemcpy( debug, dev_result, length * sizeof(float), cudaMemcpyDeviceToHost ); + print_array(debug, 10); + + scatter<<< block, threads>>>(dev_a, dev_result, dev_compact, length); + + + cudaMemcpy( comp_array, dev_compact, length * sizeof(float), cudaMemcpyDeviceToHost ); +} + +__global__ void make_bool_arr( float* in_arr, float* result, int length){ + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + if( in_arr[index] > 0){ + result[index] = 1.0; + }else{ + result[index] = 0.0; + } +} + +__global__ void scatter( float* in_arr, float* scan_arr, float* result, int length){ + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + result[index] = 0; + __syncthreads(); + if( index > length || index == 0){ + if(scan_arr[0] == 1.0){ + result[0] = in_arr[0]; + } + return; + } + //not first element and not off end of array + if(scan_arr[index] > scan_arr[index - 1]){ + result[(int)scan_arr[index] - 1] = in_arr[index]; + } +} + +__global__ void shared_block_scan( float *a, float *result, int length){ + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + extern __shared__ float shared[]; // twice the size of block + if(index < length){ + shared[threadIdx.x] = a[index]; //result[index] = shared[idx.x] + + __syncthreads(); //to be sure result array is fully initialized + int imt = index % blockDim.x; + for(int d = 1; d < blockDim.x; d*=2){ + if(imt >= d){ + shared[blockDim.x + threadIdx.x] = shared[threadIdx.x] + shared[threadIdx.x - d]; + }else{ + shared[blockDim.x + threadIdx.x] = shared[threadIdx.x]; + } + //can just switch the pointers + __syncthreads(); //ensures tmp array is fully updated + shared[threadIdx.x] = shared[blockDim.x + threadIdx.x]; //update results for next iteration + __syncthreads; //ensures result array is fully updated + } + result[index] = shared[blockDim.x + threadIdx.x]; + } +} + +__global__ void glob_excl_block_scan( float *a, float *tmp, float *result, int length){ + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + if(index < length){ + result[index] = a[index]; + __syncthreads(); //to be sure result array is fully initialized + int imt = index % blockDim.x; + for(int d = 1; d < blockDim.x; d*=2){ + if(imt >= d){ + tmp[index] = result[index] + result[index - d]; + }else{ + tmp[index] = result[index]; + } + //can just switch the pointers + __syncthreads(); //ensures tmp array is fully updated + result[index] = tmp[index]; //update results for next iteration + __syncthreads; //ensures result array is fully updated + } + } +} + +__global__ void glob_sec_scan( float *a, float *tmpA, float* tmpB, float *result, float *segment, int length){ + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + if(index < length){ + tmpB[index] = a[index]; + __syncthreads(); //to be sure result array is fully initialized + int imt = index % blockDim.x; + for(int d = 1; d < blockDim.x; d*=2){ + if(imt >= d){ + tmpA[index] = tmpB[index] + tmpB[index - d]; + }else{ + tmpA[index] = tmpB[index]; + } + //can just switch the pointers + __syncthreads(); //ensures tmp array is fully updated + tmpB[index] = tmpA[index]; //update results for next iteration + __syncthreads; //ensures result array is fully updated + } + result[index] = tmpB[index]; + segment[blockIdx.x] = result[(blockIdx.x * blockDim.x) + blockDim.x -1]; + } +} + +__global__ void add_segments( float *result, float *segment, int length){ + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + if(index < length){ + int seg = index / blockDim.x - 1; + if(seg >= 0){ + result[index] += segment[seg]; + } + } +} + +__global__ void shared_naive_scan( float *a, float *result, float *segment, int length){ + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + extern __shared__ float shared[]; // twice the size of block + if(index < length){ + shared[threadIdx.x] = a[index]; //result[index] = shared[idx.x] + + __syncthreads(); //to be sure result array is fully initialized + int imt = index % blockDim.x; + for(int d = 1; d < blockDim.x; d*=2){ + if(imt >= d){ + shared[blockDim.x + threadIdx.x] = shared[threadIdx.x] + shared[threadIdx.x - d]; + }else{ + shared[blockDim.x + threadIdx.x] = shared[threadIdx.x]; + } + //can just switch the pointers + __syncthreads(); //ensures tmp array is fully updated + shared[threadIdx.x] = shared[blockDim.x + threadIdx.x]; //update results for next iteration + __syncthreads; //ensures result array is fully updated + } + result[index] = shared[blockDim.x + threadIdx.x]; + segment[blockIdx.x] = shared[blockDim.x - 1]; //result[(blockIdx.x * blockDim.x) + blockDim.x -1]; + } +} + +void print_array(float* arr, int length){ + for(int i = 0; i < length; i++){ + std::cout << arr[i] << ","; + } + std::cout << "\n"; +} \ No newline at end of file diff --git a/cusamatrixmath/matrix_math.h b/cusamatrixmath/matrix_math.h new file mode 100755 index 0000000..cca199b --- /dev/null +++ b/cusamatrixmath/matrix_math.h @@ -0,0 +1,47 @@ +#ifndef MAIN_H +#define MAIN_H + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include "glslUtility.h" + + +using namespace std; + + + +//------------------------------- +//-------------MAIN-------------- +//------------------------------- + +int main(int argc, char** argv); + +//------------------------------- +//----------function declarations---------- +//------------------------------- +void seq_scan(float*,float*, int); +void print_array(float*, int); +void naive_scan( float*, float*, float*, float*, int); +void shared_scan( float*, float*, int); +void seq_scatter(float*, float*, float*, int); +void string_compaction( float*, float*, int); +__global__ void glob_sec_scan( float*, float*, float*, float*,float*, int); +__global__ void shared_naive_scan( float*, float*, float*, int); +__global__ void add_segments( float*, float*,int); +__global__ void glob_excl_block_scan( float*, float*, float*, int); +__global__ void shared_block_scan( float*, float*, int); +__global__ void make_bool_arr( float*, float*, int); +__global__ void scatter( float*, float*, float*, int); + + +#endif diff --git a/cusamatrixmath/utilities.h b/cusamatrixmath/utilities.h new file mode 100755 index 0000000..77c5fee --- /dev/null +++ b/cusamatrixmath/utilities.h @@ -0,0 +1,49 @@ +// UTILITYCORE- A Utility Library by Yining Karl Li +// This file is part of UTILITYCORE, Coyright (c) 2012 Yining Karl Li +// +// File: utilities.h +// Header for utilities.cpp + +#ifndef UTILITIES_H +#define UTILITIES_H + +#include "glm/glm.hpp" +#include +#include +#include +#include +#include +#include +#include +#include "cudaMat4.h" + +#define PI 3.1415926535897932384626422832795028841971 +#define TWO_PI 6.2831853071795864769252867665590057683943 +#define SQRT_OF_ONE_THIRD 0.5773502691896257645091487805019574556476 +#define E 2.7182818284590452353602874713526624977572 +#define G 6.67384e-11 +#define EPSILON .000000001 +#define ZERO_ABSORPTION_EPSILON 0.00001 + +namespace utilityCore { + extern float clamp(float f, float min, float max); + extern bool replaceString(std::string& str, const std::string& from, const std::string& to); + extern glm::vec3 clampRGB(glm::vec3 color); + extern bool epsilonCheck(float a, float b); + extern std::vector tokenizeString(std::string str); + extern cudaMat4 glmMat4ToCudaMat4(glm::mat4 a); + extern glm::mat4 cudaMat4ToGlmMat4(cudaMat4 a); + extern glm::mat4 buildTransformationMatrix(glm::vec3 translation, glm::vec3 rotation, glm::vec3 scale); + extern void printCudaMat4(cudaMat4 m); + extern std::string convertIntToString(int number); + extern std::istream& safeGetline(std::istream& is, std::string& t); //Thanks to http://stackoverflow.com/a/6089413 + + //----------------------------- + //-------GLM Printers---------- + //----------------------------- + extern void printMat4(glm::mat4); + extern void printVec4(glm::vec4); + extern void printVec3(glm::vec3); +} + +#endif