diff --git a/Project2-StreamCompaction.zip b/Project2-StreamCompaction.zip
new file mode 100755
index 0000000..67322d6
Binary files /dev/null and b/Project2-StreamCompaction.zip differ
diff --git a/README.md b/README.md
old mode 100644
new mode 100755
index 6e02afa..337dccd
--- a/README.md
+++ b/README.md
@@ -1,133 +1,167 @@
-Project-2
-=========
-
-A Study in Parallel Algorithms : Stream Compaction
-
-# INTRODUCTION
-Many of the algorithms you have learned thus far in your career have typically
-been developed from a serial standpoint.  When it comes to GPUs, we are mainly
-looking at massively parallel work.  Thus, it is necessary to reorient our
-thinking.  In this project, we will be implementing a couple different versions
-of prefix sum.  We will start with a simple single thread serial CPU version,
-and then move to a naive GPU version.  Each part of this homework is meant to
-follow the logic of the previous parts, so please do not do this homework out of
-order.
-
-This project will serve as a stream compaction library that you may use (and
-will want to use) in your
-future projects.  For that reason, we suggest you create proper header and CUDA
-files so that you can reuse this code later.  You may want to create a separate
-cpp file that contains your main function so that you can test the code you
-write.
-
-# OVERVIEW
-Stream compaction is broken down into two parts: (1) scan, and (2) scatter.
-
-## SCAN
-Scan or prefix sum is the summation of the elements in an array such that the
-resulting array is the summation of the terms before it.  Prefix sum can either
-be inclusive, meaning the current term is a summation of all the elements before
-it and itself, or exclusive, meaning the current term is a summation of all
-elements before it excluding itself. 
-
-Inclusive:
-
-In : [ 3 4 6 7 9 10 ]
-
-Out : [ 3 7 13 20 29 39 ]
-
-Exclusive
-
-In : [ 3 4 6 7 9 10 ]
-
-Out : [ 0 3 7 13 20 29 ]
-
-Note that the resulting prefix sum will always be n + 1 elements if the input
-array is of length n.  Similarly, the first element of the exclusive prefix sum
-will always be 0.  In the following sections, all references to prefix sum will
-be to the exclusive version of prefix sum.
-
-## SCATTER
-The scatter section of stream compaction takes the results of the previous scan
-in order to reorder the elements to form a compact array.
-
-For example, let's say we have the following array:
-[ 0 0 3 4 0 6 6 7 0 1 ]
-
-We would only like to consider the non-zero elements in this zero, so we would
-like to compact it into the following array:
-[ 3 4 6 6 7 1 ]
-
-We can perform a transform on input array to transform it into a boolean array:
-
-In :  [ 0 0 3 4 0 6 6 7 0 1 ]
-
-Out : [ 0 0 1 1 0 1 1 1 0 1 ]
-
-Performing a scan on the output, we get the following array :
-
-In :  [ 0 0 1 1 0 1 1 1 0 1 ]
-
-Out : [ 0 0 0 1 2 2 3 4 5 5 ]
-
-Notice that the output array produces a corresponding index array that we can
-use to create the resulting array for stream compaction. 
-
-# PART 1 : REVIEW OF PREFIX SUM
-Given the definition of exclusive prefix sum, please write a serial CPU version
-of prefix sum.  You may write this in the cpp file to separate this from the
-CUDA code you will be writing in your .cu file. 
-
-# PART 2 : NAIVE PREFIX SUM
-We will now parallelize this the previous section's code.  Recall from lecture
-that we can parallelize this using a series of kernel calls.  In this portion,
-you are NOT allowed to use shared memory.
-
-### Questions 
-* Compare this version to the serial version of exclusive prefix scan. Please
-  include a table of how the runtimes compare on different lengths of arrays.
-* Plot a graph of the comparison and write a short explanation of the phenomenon you
-  see here.
-
-# PART 3 : OPTIMIZING PREFIX SUM
-In the previous section we did not take into account shared memory.  In the
-previous section, we kept everything in global memory, which is much slower than
-shared memory.
-
-## PART 3a : Write prefix sum for a single block
-Shared memory is accessible to threads of a block. Please write a version of
-prefix sum that works on a single block.  
-
-## PART 3b : Generalizing to arrays of any length.
-Taking the previous portion, please write a version that generalizes prefix sum
-to arbitrary length arrays, this includes arrays that will not fit on one block.
-
-### Questions
-* Compare this version to the parallel prefix sum using global memory.
-* Plot a graph of the comparison and write a short explanation of the phenomenon
-  you see here.
-
-# PART 4 : ADDING SCATTER
-First create a serial version of scatter by expanding the serial version of
-prefix sum.  Then create a GPU version of scatter.  Combine the function call
-such that, given an array, you can call stream compact and it will compact the
-array for you.  Finally, write a version using thrust. 
-
-### Questions
-* Compare your version of stream compact to your version using thrust.  How do
-  they compare?  How might you optimize yours more, or how might thrust's stream
-  compact be optimized.
-
-# EXTRA CREDIT (+10)
-For extra credit, please optimize your prefix sum for work parallelism and to
-deal with bank conflicts.  Information on this can be found in the GPU Gems
-chapter listed in the references.  
-
-# SUBMISSION
-Please answer all the questions in each of the subsections above and write your
-answers in the README by overwriting the README file.  In future projects, we
-expect your analysis to be similar to the one we have led you through in this
-project.  Like other projects, please open a pull request and email Harmony.
-
-# REFERENCES
-"Parallel Prefix Sum (Scan) with CUDA." GPU Gems 3.
+Project-2
+=========
+
+A Study in Parallel Algorithms : Stream Compaction
+
+# INTRODUCTION
+Many of the algorithms you have learned thus far in your career have typically
+been developed from a serial standpoint.  When it comes to GPUs, we are mainly
+looking at massively parallel work.  Thus, it is necessary to reorient our
+thinking.  In this project, we will be implementing a couple different versions
+of prefix sum.  We will start with a simple single thread serial CPU version,
+and then move to a naive GPU version.  Each part of this homework is meant to
+follow the logic of the previous parts, so please do not do this homework out of
+order.
+
+This project will serve as a stream compaction library that you may use (and
+will want to use) in your
+future projects.  For that reason, we suggest you create proper header and CUDA
+files so that you can reuse this code later.  You may want to create a separate
+cpp file that contains your main function so that you can test the code you
+write.
+
+# OVERVIEW
+Stream compaction is broken down into two parts: (1) scan, and (2) scatter.
+
+## SCAN
+Scan or prefix sum is the summation of the elements in an array such that the
+resulting array is the summation of the terms before it.  Prefix sum can either
+be inclusive, meaning the current term is a summation of all the elements before
+it and itself, or exclusive, meaning the current term is a summation of all
+elements before it excluding itself. 
+
+Inclusive:
+
+In : [ 3 4 6 7 9 10 ]
+
+Out : [ 3 7 13 20 29 39 ]
+
+Exclusive
+
+In : [ 3 4 6 7 9 10 ]
+
+Out : [ 0 3 7 13 20 29 ]
+
+Note that the resulting prefix sum will always be n + 1 elements if the input
+array is of length n.  Similarly, the first element of the exclusive prefix sum
+will always be 0.  In the following sections, all references to prefix sum will
+be to the exclusive version of prefix sum.
+
+## SCATTER
+The scatter section of stream compaction takes the results of the previous scan
+in order to reorder the elements to form a compact array.
+
+For example, let's say we have the following array:
+[ 0 0 3 4 0 6 6 7 0 1 ]
+
+We would only like to consider the non-zero elements in this zero, so we would
+like to compact it into the following array:
+[ 3 4 6 6 7 1 ]
+
+We can perform a transform on input array to transform it into a boolean array:
+
+In :  [ 0 0 3 4 0 6 6 7 0 1 ]
+
+Out : [ 0 0 1 1 0 1 1 1 0 1 ]
+
+Performing a scan on the output, we get the following array :
+
+In :  [ 0 0 1 1 0 1 1 1 0 1 ]
+
+Out : [ 0 0 0 1 2 2 3 4 5 5 ]
+
+Notice that the output array produces a corresponding index array that we can
+use to create the resulting array for stream compaction. 
+
+# PART 1 : REVIEW OF PREFIX SUM
+Given the definition of exclusive prefix sum, please write a serial CPU version
+of prefix sum.  You may write this in the cpp file to separate this from the
+CUDA code you will be writing in your .cu file. 
+
+# PART 2 : NAIVE PREFIX SUM
+We will now parallelize this the previous section's code.  Recall from lecture
+that we can parallelize this using a series of kernel calls.  In this portion,
+you are NOT allowed to use shared memory.
+
+### Questions 
+* Compare this version to the serial version of exclusive prefix scan. Please
+  include a table of how the runtimes compare on different lengths of arrays.
+* Plot a graph of the comparison and write a short explanation of the phenomenon you
+  see here.
+
+### Answers
+* My gpu accelerated algorithm actually runs much slower than my sequential cpu algorithm.
+  I think this is due to a poor implementation of the naive algorithm on my part.  The 
+  way my naive algorithm works is to so a segmented scan and save the largest value in 
+  each segment to an auxiliary array.  I then call scan recursively on the auxiliary 
+  array.  My fatal mistake was that I allocated the memory for the auxiliary array within 
+  the naive_scan call.  The result of this is that for an array of length N i have log(n) 
+  calls to malloc which is pretty slow.  The sequential version already has everything in 
+  memory so it runs much more quickly.  I'm pretty ceratin if I were to initialize all 
+  the memory outside the call I would see significant performance improvement and I plan 
+  to do this before using the algorithm in the next assignment.
+
+* There are also some other places I could get some speedup, for example in my kernel 
+  call I actually move data between my two temporary arrays inside the loop when I could 
+  just swap their pointers.
+  
+* My graph is in string compaction graph.pdf
+
+# PART 3 : OPTIMIZING PREFIX SUM
+In the previous section we did not take into account shared memory.  In the
+previous section, we kept everything in global memory, which is much slower than
+shared memory.
+
+## PART 3a : Write prefix sum for a single block
+Shared memory is accessible to threads of a block. Please write a version of
+prefix sum that works on a single block.  
+
+## PART 3b : Generalizing to arrays of any length.
+Taking the previous portion, please write a version that generalizes prefix sum
+to arbitrary length arrays, this includes arrays that will not fit on one block.
+
+### Questions
+* Compare this version to the parallel prefix sum using global memory.
+* Plot a graph of the comparison and write a short explanation of the phenomenon
+  you see here.
+
+### Answers
+* I do see some speedup, however my implementation is still far worse than CPU.  This is 
+  again due to having allocation inside my recursive call (which was a stupid idea).  I 
+  plan to fix this before using this algorithm.
+* I did see performance improvement in this version over the last one.  This is because 
+  the loop within my kernel calls now only has to go to shared memory instead of out to 
+  global memory.  
+* My graph is in string compaction graph.pdf
+
+# PART 4 : ADDING SCATTER
+First create a serial version of scatter by expanding the serial version of
+prefix sum.  Then create a GPU version of scatter.  Combine the function call
+such that, given an array, you can call stream compact and it will compact the
+array for you.  Finally, write a version using thrust. 
+
+### Questions
+* Compare your version of stream compact to your version using thrust.  How do
+  they compare?  How might you optimize yours more, or how might thrust's stream
+  compact be optimized.
+
+### Answers
+* Unfortunately I spent many hours debugging and was not able to implement a version in 
+  thrust, however I'm quite certain their implementation is superior to mine in every way.
+  They probably have done all sorts of tricks to squeeze out extra performance such as 
+  doing a work-efficient algorithm without bank conflicts.  They also probably packed 
+  more data into registers to do more work per thread.
+
+# EXTRA CREDIT (+10)
+For extra credit, please optimize your prefix sum for work parallelism and to
+deal with bank conflicts.  Information on this can be found in the GPU Gems
+chapter listed in the references.  
+
+# SUBMISSION
+Please answer all the questions in each of the subsections above and write your
+answers in the README by overwriting the README file.  In future projects, we
+expect your analysis to be similar to the one we have led you through in this
+project.  Like other projects, please open a pull request and email Harmony.
+
+# REFERENCES
+"Parallel Prefix Sum (Scan) with CUDA." GPU Gems 3.
diff --git a/String Compaction Graph.pdf b/String Compaction Graph.pdf
new file mode 100644
index 0000000..00a9232
Binary files /dev/null and b/String Compaction Graph.pdf differ
diff --git a/cusamatrixmath.sln b/cusamatrixmath.sln
new file mode 100755
index 0000000..8bd80b4
--- /dev/null
+++ b/cusamatrixmath.sln
@@ -0,0 +1,26 @@
+﻿
+Microsoft Visual Studio Solution File, Format Version 11.00
+# Visual Studio 2010
+Project("{8BC9CEB8-8B4A-11D0-8D11-00A0C91BC942}") = "cusamatrixmath", "cusamatrixmath\cusamatrixmath.vcxproj", "{2A8A854C-1E6A-44C5-B4B6-E66CCCC6E99D}"
+EndProject
+Global
+	GlobalSection(SolutionConfigurationPlatforms) = preSolution
+		Debug|Win32 = Debug|Win32
+		Debug|x64 = Debug|x64
+		Release|Win32 = Release|Win32
+		Release|x64 = Release|x64
+	EndGlobalSection
+	GlobalSection(ProjectConfigurationPlatforms) = postSolution
+		{2A8A854C-1E6A-44C5-B4B6-E66CCCC6E99D}.Debug|Win32.ActiveCfg = Debug|Win32
+		{2A8A854C-1E6A-44C5-B4B6-E66CCCC6E99D}.Debug|Win32.Build.0 = Debug|Win32
+		{2A8A854C-1E6A-44C5-B4B6-E66CCCC6E99D}.Debug|x64.ActiveCfg = Debug|x64
+		{2A8A854C-1E6A-44C5-B4B6-E66CCCC6E99D}.Debug|x64.Build.0 = Debug|x64
+		{2A8A854C-1E6A-44C5-B4B6-E66CCCC6E99D}.Release|Win32.ActiveCfg = Release|Win32
+		{2A8A854C-1E6A-44C5-B4B6-E66CCCC6E99D}.Release|Win32.Build.0 = Release|Win32
+		{2A8A854C-1E6A-44C5-B4B6-E66CCCC6E99D}.Release|x64.ActiveCfg = Release|x64
+		{2A8A854C-1E6A-44C5-B4B6-E66CCCC6E99D}.Release|x64.Build.0 = Release|x64
+	EndGlobalSection
+	GlobalSection(SolutionProperties) = preSolution
+		HideSolutionNode = FALSE
+	EndGlobalSection
+EndGlobal
diff --git a/cusamatrixmath/cudaMat4.h b/cusamatrixmath/cudaMat4.h
new file mode 100755
index 0000000..e8465f3
--- /dev/null
+++ b/cusamatrixmath/cudaMat4.h
@@ -0,0 +1,25 @@
+// CIS565 CUDA Raytracer: A parallel raytracer for Patrick Cozzi's CIS565: GPU Computing at the University of Pennsylvania
+// Written by Yining Karl Li, Copyright (c) 2012 University of Pennsylvania
+// This file includes code from:
+//       Yining Karl Li's TAKUA Render, a massively parallel pathtracing renderer: http://www.yiningkarlli.com
+
+#ifndef CUDAMAT4_H
+#define CUDAMAT4_H
+
+#include "glm/glm.hpp"
+#include <cuda_runtime.h>
+
+struct cudaMat3{
+	glm::vec3 x;
+	glm::vec3 y;
+	glm::vec3 z;
+};
+
+struct cudaMat4{
+	glm::vec4 x;
+	glm::vec4 y;
+	glm::vec4 z;
+	glm::vec4 w;
+};
+
+#endif
\ No newline at end of file
diff --git a/cusamatrixmath/glslUtility.h b/cusamatrixmath/glslUtility.h
new file mode 100755
index 0000000..28bf8c5
--- /dev/null
+++ b/cusamatrixmath/glslUtility.h
@@ -0,0 +1,21 @@
+// GLSL Utility: A utility class for loading GLSL shaders, for Patrick Cozzi's CIS565: GPU Computing at the University of Pennsylvania
+// Written by Varun Sampath and Patrick Cozzi, Copyright (c) 2012 University of Pennsylvania
+
+#ifndef GLSLUTILITY_H_
+#define GLSLUTILITY_H_
+
+#ifdef __APPLE__
+	#include <GL/glfw.h>
+#else
+	#include <GL/glew.h>
+#endif
+
+namespace glslUtility
+{
+
+GLuint createProgram(const char *vertexShaderPath, const char *fragmentShaderPath, const char *attributeLocations[], GLuint numberOfLocations);
+GLuint createProgram(const char *vertexShaderPath, const char *geometryShaderPath, const char *fragmentShaderPath, const char *attributeLocations[], GLuint numberOfLocations);
+
+}
+ 
+#endif
diff --git a/cusamatrixmath/kernel.h b/cusamatrixmath/kernel.h
new file mode 100755
index 0000000..25c11cc
--- /dev/null
+++ b/cusamatrixmath/kernel.h
@@ -0,0 +1,18 @@
+#ifndef KERNEL_H
+#define KERNEL_H
+
+#include <stdio.h>
+#include <thrust/random.h>
+#include <cuda.h>
+#include <cmath>
+
+#define blockSize 128
+#define checkCUDAErrorWithLine(msg) checkCUDAError(msg, __LINE__)
+#define SHARED 0
+
+void checkCUDAError(const char *msg, int line);
+void cudaNBodyUpdateWrapper(float dt);
+void initCuda(int N);
+void cudaUpdatePBO(float4 * pbodptr, int width, int height);
+void cudaUpdateVBO(float * vbodptr, int width, int height);
+#endif
diff --git a/cusamatrixmath/matrix_math.cu b/cusamatrixmath/matrix_math.cu
new file mode 100755
index 0000000..3082f9f
--- /dev/null
+++ b/cusamatrixmath/matrix_math.cu
@@ -0,0 +1,313 @@
+#include "matrix_math.h"
+#define len 50 //max = 100000000
+#define LOOPS 10
+int main (int argc, char** argv){
+	float *a = new float[len]; 
+	float *result = new float[len]; 
+	for(long i= 0; i < len; i++){
+		a[i] = 1.0;
+	}
+	//////////////////////////////////////////////////////
+	///////////////// initialize Cuda ///////////////////
+	////////////////////////////////////////////////////
+	float *dev_a, *dev_tmpA, *dev_tmpB, *dev_result, *dev_seg;
+	cudaMalloc( (void**)&dev_a,       len    * sizeof(float) );
+	cudaMalloc( (void**)&dev_tmpA,    len    * sizeof(float) );
+	cudaMalloc( (void**)&dev_tmpB,    len    * sizeof(float) );
+	cudaMalloc( (void**)&dev_result,  len    * sizeof(float) );
+	cudaMemcpy( dev_a, a, len * sizeof(float), cudaMemcpyHostToDevice );
+
+
+
+	clock_t t;
+	///////////////////////////////////////////////////////
+	///////////////   sequential scan   //////////////////
+	/////////////////////////////////////////////////////
+	
+	t = clock(); //start timer
+	for(int i = 0; i < LOOPS; i++){
+		seq_scan(a, result, len);
+	}
+	t = clock() - t; //end timer
+	printf ("Sequential Scan took %d clicks (%f seconds).\n",t,((float)t)/CLOCKS_PER_SEC);
+	cout << result[len - 1] << "\n";
+	//print_array(result, len);
+	
+	////////////////////////////////////////////////////////////////
+	/////////////////    NAIVE SCAN ///////////////////////////////
+	//////////////////////////////////////////////////////////////
+	
+	t = clock();
+	for(int i = 0; i < LOOPS; i++){
+		naive_scan(dev_a,dev_tmpA,dev_tmpB,dev_result, len);
+	}
+	t = clock() - t;
+	printf ("Naive GPU took me %d clicks (%f seconds).\n",t,((float)t)/CLOCKS_PER_SEC);
+	cudaMemcpy( result, dev_result, len * sizeof(float), cudaMemcpyDeviceToHost );
+	//print_array(result, len);
+	cout << result[len - 1] << "\n";
+	
+
+	////////////////////////////////////////////////////////////////
+	/////////////////   SHARED SCAN ///////////////////////////////
+	//////////////////////////////////////////////////////////////
+	
+	t = clock();
+	for(int i = 0; i < LOOPS; i++){
+		shared_scan(dev_a, dev_result, len);
+	}
+	t = clock() - t;
+	printf ("Shared GPU took me %d clicks (%f seconds).\n",t,((float)t)/CLOCKS_PER_SEC);
+	cudaMemcpy( result, dev_result, len * sizeof(float), cudaMemcpyDeviceToHost );
+	//print_array(result, len);
+	cout << result[len - 1] << "\n";
+	
+	////////////////////////////////////////////////////////////////
+	/////////////////   String Compaction /////////////////////////
+	//////////////////////////////////////////////////////////////
+	float* sparse_array = new float[10];
+	for(int i = 0; i<10; i += 2){
+		sparse_array[i] = 0.0;
+		sparse_array[i + 1] = i + 1;
+	}
+	print_array(sparse_array, 10);
+	float* compact_array = new float[10];
+	string_compaction(sparse_array, compact_array, 10);
+	print_array(compact_array, 10);
+
+
+	cudaFree(dev_a);
+	cudaFree(dev_tmpA);
+	cudaFree(dev_tmpB);
+	cudaFree(dev_result);
+	free(a);
+	free(result);
+	free(sparse_array);
+	free(compact_array);
+	std::cin.get();
+}
+
+void seq_scan(float* arr, float* result, int length){
+	result[0] = arr[0];
+	for (int i = 1; i < length; i++){
+		result[i] = result[i-1] + arr[i];
+	}
+}
+
+void seq_scatter(float* in_arr, float* scan_arr,  float* result, int length){
+	if(scan_arr[0] == 1){
+		result[0] = in_arr[0];
+	}
+	for(int i = 1; i < length; i++){
+		if (scan_arr[i] > scan_arr[i - 1]){
+			result[(int)scan_arr[i] - 1] = in_arr[i];
+		}
+	}
+}
+
+void naive_scan( float *dev_a, float *dev_tmpA, float* dev_tmpB, float *dev_result, int length){
+	int threads = min(32, length);
+	int block = ceil((float)length / (float)threads);
+	int segCount = ceil((float)length / (float)threads);
+	
+	if(length == threads){
+		glob_excl_block_scan<<< 1,threads>>>( dev_a, dev_tmpA, dev_result, length); ///////////
+		return;
+	}
+	float *dev_seg;
+	cudaMalloc( (void**)&dev_seg,    segCount  * sizeof(float) );
+	
+	glob_sec_scan<<< block,threads>>>( dev_a, dev_tmpA,dev_tmpB, dev_result, dev_seg, length); //Segmented scan
+
+	//scan resulting segments
+	naive_scan(dev_seg, dev_tmpA, dev_tmpB, dev_seg, block);
+	/*
+	float* segs = new float[block];
+	cudaMemcpy( segs, dev_seg, block * sizeof(float), cudaMemcpyDeviceToHost );
+	print_array(segs, block);
+	*/
+	add_segments<<< block, threads >>>( dev_result, dev_seg, length);
+	cudaFree(dev_seg);
+}
+
+void shared_scan( float *dev_a, float *dev_result, int length){
+	int threads = min(32, length);
+	int block = ceil((float)length / (float)threads);
+	int segCount = ceil((float)length / (float)threads);
+	
+	if(length == threads){
+		//glob_excl_block_scan<<< 1,threads>>>( dev_a, dev_tmpA, dev_result, length); //use shared mem
+		shared_block_scan<<< 1, threads, 2 * threads * sizeof(float) >>>( dev_a, dev_result, length);
+		return;
+	}
+	float *dev_seg;
+	cudaMalloc( (void**)&dev_seg,    segCount  * sizeof(float) );
+	
+	shared_naive_scan<<< block,threads, threads * 2 * sizeof(float)>>>( dev_a, dev_result, dev_seg, length);//Segmented scan
+	
+	
+	//scan resulting segments
+	shared_scan(dev_seg, dev_seg, block);
+	/*
+	float* segs = new float[block];
+	cudaMemcpy( segs, dev_seg, block * sizeof(float), cudaMemcpyDeviceToHost );
+	print_array(segs, block);
+	*/
+	add_segments<<< block, threads >>>( dev_result, dev_seg, length);
+	cudaFree(dev_seg);
+}
+
+void string_compaction( float* in_array, float* comp_array, int length){
+	int threads = min(32, length);
+	int block = ceil((float)length / (float)threads);
+	
+	float* debug = new float[10];
+
+	float *dev_a, *dev_result, *dev_compact, *dev_bool;
+	cudaMalloc( (void**)&dev_a,       length    * sizeof(float) );
+	cudaMalloc( (void**)&dev_bool,       length    * sizeof(float) );
+	cudaMalloc( (void**)&dev_result,  length    * sizeof(float) );
+	cudaMalloc( (void**)&dev_compact, length    * sizeof(float) );
+	cudaMemcpy( dev_a, in_array, length * sizeof(float), cudaMemcpyHostToDevice );
+
+	make_bool_arr<<< block, threads >>>(dev_a, dev_bool, length);
+	shared_scan(dev_bool, dev_result, length);
+	cudaMemcpy( debug, dev_result, length * sizeof(float), cudaMemcpyDeviceToHost );
+	print_array(debug, 10);
+	
+	scatter<<< block, threads>>>(dev_a, dev_result, dev_compact, length);
+	
+	
+	cudaMemcpy( comp_array, dev_compact, length * sizeof(float), cudaMemcpyDeviceToHost );
+}
+
+__global__ void make_bool_arr( float* in_arr, float* result, int length){
+	int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+	if( in_arr[index] > 0){
+		result[index] = 1.0;
+	}else{
+		result[index] = 0.0;
+	}
+}
+
+__global__ void scatter( float* in_arr, float* scan_arr, float* result, int length){
+	int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+	result[index] = 0;
+	__syncthreads();
+	if( index > length || index == 0){
+		if(scan_arr[0] == 1.0){
+			result[0] = in_arr[0];
+		}
+		return;
+	}
+	//not first element and not off end of array
+	if(scan_arr[index] > scan_arr[index - 1]){
+		result[(int)scan_arr[index] - 1] =  in_arr[index];
+	}
+}
+
+__global__ void shared_block_scan( float *a, float *result, int length){
+	int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+	extern __shared__ float shared[]; // twice the size of block
+	if(index < length){
+		shared[threadIdx.x] = a[index]; //result[index] = shared[idx.x]
+
+		__syncthreads(); //to be sure result array is fully initialized
+		int imt = index % blockDim.x;
+		for(int d = 1; d < blockDim.x; d*=2){
+			if(imt  >= d){
+				shared[blockDim.x + threadIdx.x] = shared[threadIdx.x] + shared[threadIdx.x - d];
+			}else{
+				shared[blockDim.x + threadIdx.x] = shared[threadIdx.x];
+			}
+			//can just switch the pointers
+			__syncthreads(); //ensures tmp array is fully updated
+			shared[threadIdx.x] = shared[blockDim.x + threadIdx.x]; //update results for next iteration
+			__syncthreads;  //ensures result array is fully updated
+		}
+		result[index] = shared[blockDim.x + threadIdx.x];
+	}
+}
+
+__global__ void glob_excl_block_scan( float *a, float *tmp, float *result, int length){
+	int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+	if(index < length){
+		result[index] = a[index];
+		__syncthreads(); //to be sure result array is fully initialized
+		int imt = index % blockDim.x;
+		for(int d = 1; d < blockDim.x; d*=2){
+			if(imt  >= d){
+				tmp[index] = result[index] + result[index - d];
+			}else{
+				tmp[index] = result[index];
+			}
+			//can just switch the pointers
+			__syncthreads(); //ensures tmp array is fully updated
+			result[index] = tmp[index]; //update results for next iteration
+			__syncthreads;  //ensures result array is fully updated
+		}
+	}
+}
+
+__global__ void glob_sec_scan( float *a, float *tmpA, float* tmpB, float *result, float *segment, int length){
+	int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+	if(index < length){
+		tmpB[index] = a[index];
+		__syncthreads(); //to be sure result array is fully initialized
+		int imt = index % blockDim.x;
+		for(int d = 1; d < blockDim.x; d*=2){
+			if(imt  >= d){
+				tmpA[index] = tmpB[index] + tmpB[index - d];
+			}else{
+				tmpA[index] = tmpB[index];
+			}
+			//can just switch the pointers
+			__syncthreads(); //ensures tmp array is fully updated
+			tmpB[index] = tmpA[index]; //update results for next iteration
+			__syncthreads;  //ensures result array is fully updated
+		}
+		result[index] = tmpB[index];
+		segment[blockIdx.x] = result[(blockIdx.x * blockDim.x) + blockDim.x -1];
+	}
+}
+
+__global__ void add_segments( float *result, float *segment, int length){
+	int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+	if(index < length){
+		int seg = index / blockDim.x - 1; 
+		if(seg >= 0){
+			result[index] += segment[seg];
+		}
+	}
+}
+
+__global__ void shared_naive_scan( float *a, float *result, float *segment, int length){
+	int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+	extern __shared__ float shared[]; // twice the size of block
+	if(index < length){
+		shared[threadIdx.x] = a[index]; //result[index] = shared[idx.x]
+
+		__syncthreads(); //to be sure result array is fully initialized
+		int imt = index % blockDim.x;
+		for(int d = 1; d < blockDim.x; d*=2){
+			if(imt  >= d){
+				shared[blockDim.x + threadIdx.x] = shared[threadIdx.x] + shared[threadIdx.x - d];
+			}else{
+				shared[blockDim.x + threadIdx.x] = shared[threadIdx.x];
+			}
+			//can just switch the pointers
+			__syncthreads(); //ensures tmp array is fully updated
+			shared[threadIdx.x] = shared[blockDim.x + threadIdx.x]; //update results for next iteration
+			__syncthreads;  //ensures result array is fully updated
+		}
+		result[index] = shared[blockDim.x + threadIdx.x];
+		segment[blockIdx.x] = shared[blockDim.x - 1]; //result[(blockIdx.x * blockDim.x) + blockDim.x -1];
+	}
+}
+
+void print_array(float* arr, int length){
+	for(int i = 0; i < length; i++){
+		std::cout << arr[i] << ",";
+	}
+	std::cout << "\n";
+}
\ No newline at end of file
diff --git a/cusamatrixmath/matrix_math.h b/cusamatrixmath/matrix_math.h
new file mode 100755
index 0000000..cca199b
--- /dev/null
+++ b/cusamatrixmath/matrix_math.h
@@ -0,0 +1,47 @@
+#ifndef MAIN_H
+#define MAIN_H
+
+#include <stdlib.h>
+#include <GL/glew.h>
+#include <GL/glut.h>
+#include <cuda_runtime.h>
+#include <cuda_gl_interop.h>
+#include <string>
+#include <iostream>
+#include <sstream>
+#include <fstream>
+#include <cmath>
+#include <ctime>
+#include <thrust\scatter.h>
+#include "glslUtility.h"
+
+
+using namespace std;
+
+
+
+//-------------------------------
+//-------------MAIN--------------
+//-------------------------------
+
+int main(int argc, char** argv);
+
+//-------------------------------
+//----------function declarations----------
+//-------------------------------
+void seq_scan(float*,float*, int);
+void print_array(float*, int);
+void naive_scan( float*, float*, float*, float*, int);
+void shared_scan( float*, float*, int);
+void seq_scatter(float*, float*,  float*, int);
+void string_compaction( float*, float*, int);
+__global__ void glob_sec_scan( float*, float*, float*, float*,float*, int);
+__global__ void shared_naive_scan( float*, float*, float*, int);
+__global__ void add_segments( float*, float*,int);
+__global__ void glob_excl_block_scan( float*, float*, float*, int);
+__global__ void shared_block_scan( float*, float*, int);
+__global__ void make_bool_arr( float*, float*, int);
+__global__ void scatter( float*, float*, float*, int);
+
+
+#endif
diff --git a/cusamatrixmath/utilities.h b/cusamatrixmath/utilities.h
new file mode 100755
index 0000000..77c5fee
--- /dev/null
+++ b/cusamatrixmath/utilities.h
@@ -0,0 +1,49 @@
+//  UTILITYCORE- A Utility Library by Yining Karl Li
+//  This file is part of UTILITYCORE, Coyright (c) 2012 Yining Karl Li
+//  
+//  File: utilities.h
+//  Header for utilities.cpp
+
+#ifndef UTILITIES_H
+#define UTILITIES_H
+
+#include "glm/glm.hpp"
+#include <algorithm>
+#include <istream>
+#include <ostream>
+#include <iterator>
+#include <sstream>
+#include <string>
+#include <vector>
+#include "cudaMat4.h"
+
+#define PI                          3.1415926535897932384626422832795028841971
+#define TWO_PI                      6.2831853071795864769252867665590057683943
+#define SQRT_OF_ONE_THIRD           0.5773502691896257645091487805019574556476
+#define E                           2.7182818284590452353602874713526624977572
+#define G                           6.67384e-11
+#define EPSILON                     .000000001
+#define ZERO_ABSORPTION_EPSILON     0.00001
+
+namespace utilityCore {
+    extern float clamp(float f, float min, float max);
+    extern bool replaceString(std::string& str, const std::string& from, const std::string& to);
+    extern glm::vec3 clampRGB(glm::vec3 color);
+    extern bool epsilonCheck(float a, float b);
+    extern std::vector<std::string> tokenizeString(std::string str); 
+    extern cudaMat4 glmMat4ToCudaMat4(glm::mat4 a);
+    extern glm::mat4 cudaMat4ToGlmMat4(cudaMat4 a);
+    extern glm::mat4 buildTransformationMatrix(glm::vec3 translation, glm::vec3 rotation, glm::vec3 scale);
+    extern void printCudaMat4(cudaMat4 m);
+    extern std::string convertIntToString(int number);
+    extern std::istream& safeGetline(std::istream& is, std::string& t); //Thanks to http://stackoverflow.com/a/6089413
+
+    //-----------------------------
+    //-------GLM Printers----------
+    //-----------------------------
+    extern void printMat4(glm::mat4);
+    extern void printVec4(glm::vec4);
+    extern void printVec3(glm::vec3);
+}
+
+#endif