diff --git a/README.md b/README.md
index d63a6a1..c33f09b 100644
--- a/README.md
+++ b/README.md
@@ -1,11 +1,83 @@
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture,
 Project 1 - Flocking**
-* (TODO) YOUR NAME HERE
-  * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Saket Karve
+  * [LinkedIn](https://www.linkedin.com/in/saket-karve-43930511b/), [twitter](), etc.
+* Tested on: Windows 10 Education, Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz, 16GB, NVIDIA Quadro P1000 4GB (Moore 100B Lab)
 
-### (TODO: Your README)
+### Description
 
-Include screenshots, analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+A detailed overview of the implementation in this repository can be found [here](https://github.com/karvesaket/Project1-CUDA-Flocking/blob/master/INSTRUCTION.md).
+
+### Sample Outputs
+1. Naive
+
+**Configuration**: Number of boids = 5,000 | Block Size = 128 | Cell width = 2 x max_distance
+![](images/naive_fixed_fast_5000.gif)
+
+2. Uniform Grid
+
+**Configuration**: Number of boids = 50,000 | Block Size = 128 | Cell width = 2 x max_distance
+![](images/scattered_fixed_fast_again2.gif)
+
+3. Uniform Grid with coherent arrays
+
+**Configuration**: Number of boids = 50,000 | Block Size = 128 | Cell width = 2 x max_distance
+![](images/cohernet_fixed_fast_again3.gif)
+
+### Performance Analysis
+
+Performance is measured in frames per second (fps), reported as the average fps observed while running each configuration.
+
+#### Effect of changing the number of boids
+
+Other fixed parameters: Block size = 128; Cell width = 2 x max_distance
+
+| With visualization | Without visualization |
+| -------------------|---------------------- |
+| ![](images/number_of_boids_visualize.png) | ![](images/number_of_boids_non_visualize.png) |
+
+Comparison of the runs with and without visualization for the uniform grid implementations:
+
+![](images/number_of_boids_visualize_non_visualize.png)
+
+#### Effect of changing block size (number of threads per block)
+
+Other fixed parameters: Number of boids = 5,000; Cell width = 2 x max_distance
+
+| With visualization | Without visualization |
+| -------------------|---------------------- |
+| ![](images/block_size_visualize.png) | ![](images/block_size_non_visualize.png) |
+
+#### Effect of changing cell width (number of cells)
+
+Other fixed parameters: Number of boids = 5,000; Block size = 128
+
+| With visualization | Without visualization |
+| -------------------|---------------------- |
+| ![](images/cell_width_visualize.png) | ![](images/cell_width_non_visualize.png) |
+
+#### Answers to questions
+
+* For each implementation, how does changing the number of boids affect performance? Why do you think this is?
+  * The performance (fps) decreases as the number of boids is increased. This is expected: with more boids in the space, more threads run in total, and each boid potentially has more neighbors, which increases the number of iterations performed by each thread (i.e., each boid).
+
+* For each implementation, how does changing the block count and block size affect performance? Why do you think this is?
+  * Changing the block size (the number of threads per block) shows no clear trend with respect to performance. This is likely because all memory accesses currently go through global memory, so the memory access time stays the same regardless of how many threads are in each block. A shared-memory implementation would be expected to improve with a larger block size (with a trade-off in how much data can be loaded into shared memory).
+
+* For the coherent uniform grid: did you experience any performance improvements with the more coherent uniform grid? Was this the outcome you expected? Why or why not?
+  * Yes. The coherent uniform grid performs much better in all configurations. This is primarily because eliminating the middleman ensures that the position and velocity data are read from contiguous memory locations. Contiguous access decreases the probability of cache misses and thus improves overall performance.
+
+* Did changing cell width and checking 27 vs 8 neighboring cells affect performance? Why or why not? Be careful: it is insufficient (and possibly incorrect) to say that 27-cell is slower simply because there are more cells to check!
+  * Using more cells (i.e., a smaller cell width) increases the granularity of the volume being searched. With finer granularity we potentially search a smaller total volume than with bigger cells, so fewer candidate boids need to be checked; this affects the number of neighboring boids examined (more for a larger cell width) and thus the performance (lower fps for a larger cell width). However, if the cell width is decreased too far, we loop over many small cells that each contain very few boids, and we lose the advantage of the contiguous memory access gained by sorting. Hence, performance drops again when the cell width is reduced to half the maximum neighborhood distance.
+
+#### General trend and conclusion
+
+- Increasing the number of boids decreases the performance.
+- Block size does not affect the performance significantly (because shared memory is not used).
+- The effect of changing the block size is slightly more significant when the number of boids is larger.
+- When the cell width equals the maximum neighborhood distance (27 neighboring cells), performance is better than when the cell width is twice the maximum distance (because of the more granular cells in the former case).
+
+#### Extra Credit
+
+**Grid Looping Optimization**: When checking the neighboring cells in the uniform grid approach, the loop iterates only over those cells that can contain any part of the search sphere around the boid's position, thereby avoiding unnecessary checks of cells that are guaranteed to lie outside the maximum neighborhood distance. This is done by considering only the cells that overlap the cube extending max_distance on either side of the boid, instead of always looping over a fixed number of cells; a sketch of this cell-bounding step is shown below.
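+
+The cell-bounding step looks roughly like the sketch below (adapted from the neighbor-search kernels in `src/kernel.cu`; `max_distance` here stands for the largest of the three rule distances, which the kernels compute as a local `radius`, and the per-cell rule accumulation is elided):
+
+```cpp
+// Only the grid cells overlapping the cube [pos - max_distance, pos + max_distance]
+// can contain neighbors within the search radius.
+glm::vec3 min_cell = glm::floor((pos[index] - gridMin - glm::vec3(max_distance)) * inverseCellWidth);
+glm::vec3 max_cell = glm::floor((pos[index] - gridMin + glm::vec3(max_distance)) * inverseCellWidth);
+
+// Clamp to the grid bounds so boids near the edge do not index outside the grid.
+int minX = imax(0, (int)min_cell.x), maxX = imin(gridResolution - 1, (int)max_cell.x);
+int minY = imax(0, (int)min_cell.y), maxY = imin(gridResolution - 1, (int)max_cell.y);
+int minZ = imax(0, (int)min_cell.z), maxZ = imin(gridResolution - 1, (int)max_cell.z);
+
+for (int i = minX; i <= maxX; i++) {
+  for (int j = minY; j <= maxY; j++) {
+    for (int k = minZ; k <= maxZ; k++) {
+      int cell = gridIndex3Dto1D(i, j, k, gridResolution);
+      // ...read gridCellStartIndices[cell] / gridCellEndIndices[cell] and
+      // accumulate the three flocking rules over the boids in this cell...
+    }
+  }
+}
+```
+
+With a cell width of 2 x max_distance this visits at most 8 cells, and with a cell width equal to max_distance at most 27; near the edges of the grid the clamped bounds cover even fewer cells, which is where the savings come from.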
diff --git a/images/block_size_non_visualize.png b/images/block_size_non_visualize.png
new file mode 100644
index 0000000..f2fed51
Binary files /dev/null and b/images/block_size_non_visualize.png differ
diff --git a/images/block_size_visualize.png b/images/block_size_visualize.png
new file mode 100644
index 0000000..9585989
Binary files /dev/null and b/images/block_size_visualize.png differ
diff --git a/images/cell_width_non_visualize.png b/images/cell_width_non_visualize.png
new file mode 100644
index 0000000..f53c968
Binary files /dev/null and b/images/cell_width_non_visualize.png differ
diff --git a/images/cell_width_visualize.png b/images/cell_width_visualize.png
new file mode 100644
index 0000000..6b570c0
Binary files /dev/null and b/images/cell_width_visualize.png differ
diff --git a/images/cohernet_fixed_fast_again3.gif b/images/cohernet_fixed_fast_again3.gif
new file mode 100644
index 0000000..894301b
Binary files /dev/null and b/images/cohernet_fixed_fast_again3.gif differ
diff --git a/images/naive_fixed_fast_5000.gif b/images/naive_fixed_fast_5000.gif
new file mode 100644
index 0000000..eaed969
Binary files /dev/null and b/images/naive_fixed_fast_5000.gif differ
diff --git a/images/number_of_boids_non_visualize.png b/images/number_of_boids_non_visualize.png
new file mode 100644
index 0000000..9fadff5
Binary files /dev/null and b/images/number_of_boids_non_visualize.png differ
diff --git a/images/number_of_boids_visualize.png b/images/number_of_boids_visualize.png
new file mode 100644
index 0000000..8a27a1a
Binary files /dev/null and b/images/number_of_boids_visualize.png differ
diff --git a/images/number_of_boids_visualize_non_visualize.png b/images/number_of_boids_visualize_non_visualize.png
new file mode 100644
index 0000000..298a103
Binary files /dev/null and b/images/number_of_boids_visualize_non_visualize.png differ
diff --git a/images/scattered_fixed_fast_again2.gif b/images/scattered_fixed_fast_again2.gif
new file mode 100644
index 0000000..dabf332
Binary files /dev/null and b/images/scattered_fixed_fast_again2.gif differ
diff --git a/src/kernel.cu b/src/kernel.cu
index 74dffcb..6ffa705 100644
--- a/src/kernel.cu
+++ b/src/kernel.cu
@@ -5,6 +5,8 @@
 #include <glm/glm.hpp>
 #include "utilityCore.hpp"
 #include "kernel.h"
+#include "device_launch_parameters.h"
+
 
 // LOOK-2.1 potentially useful for doing grid-based neighbor search
 #ifndef imax
@@ -85,6 +87,9 @@ int *dev_gridCellEndIndices; // to this cell?
 
 // TODO-2.3 - consider what additional buffers you might need to reshuffle
 // the position and velocity data to be coherent within cells.
+glm::vec3 *dev_pos_coherent;
+glm::vec3 *dev_vel1_coherent;
+glm::vec3 *dev_vel2_coherent;
 
 // LOOK-2.1 - Grid parameters based on simulation parameters.
 // These are automatically computed for you in Boids::initSimulation
@@ -169,6 +174,31 @@ void Boids::initSimulation(int N) {
   gridMinimum.z -= halfGridWidth;
 
   // TODO-2.1 TODO-2.3 - Allocate additional buffers here.
+  cudaMalloc((void**)&dev_particleArrayIndices, N * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_particleArrayIndices failed!");
+
+  cudaMalloc((void**)&dev_particleGridIndices, N * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_particleGridIndices failed!");
+
+  cudaMalloc((void**)&dev_gridCellStartIndices, gridCellCount * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_gridCellStartIndices failed!");
+
+  cudaMalloc((void**)&dev_gridCellEndIndices, gridCellCount * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_gridCellEndIndices failed!");
+
+  cudaMalloc((void**)&dev_pos_coherent, N * sizeof(glm::vec3));
+  checkCUDAErrorWithLine("cudaMalloc dev_pos_coherent failed!");
+
+  cudaMalloc((void**)&dev_vel1_coherent, N * sizeof(glm::vec3));
+  checkCUDAErrorWithLine("cudaMalloc dev_vel1_coherent failed!");
+
+  cudaMalloc((void**)&dev_vel2_coherent, N * sizeof(glm::vec3));
+  checkCUDAErrorWithLine("cudaMalloc dev_vel2_coherent failed!");
+
+  //Initialize thrust arrays
+  dev_thrust_particleArrayIndices = thrust::device_ptr<int>(dev_particleArrayIndices);
+  dev_thrust_particleGridIndices = thrust::device_ptr<int>(dev_particleGridIndices);
+
   cudaDeviceSynchronize();
 }
 
@@ -210,8 +240,8 @@ __global__ void kernCopyVelocitiesToVBO(int N, glm::vec3 *vel, float *vbo, float
 void Boids::copyBoidsToVBO(float *vbodptr_positions, float *vbodptr_velocities) {
   dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
 
-  kernCopyPositionsToVBO << <fullBlocksPerGrid, blockSize >> >(numObjects, dev_pos, vbodptr_positions, scene_scale);
-  kernCopyVelocitiesToVBO << <fullBlocksPerGrid, blockSize >> >(numObjects, dev_vel1, vbodptr_velocities, scene_scale);
+  kernCopyPositionsToVBO <<< fullBlocksPerGrid, blockSize >>>(numObjects, dev_pos, vbodptr_positions, scene_scale);
+  kernCopyVelocitiesToVBO <<< fullBlocksPerGrid, blockSize >>>(numObjects, dev_vel1, vbodptr_velocities, scene_scale);
 
   checkCUDAErrorWithLine("copyBoidsToVBO failed!");
 
@@ -223,6 +253,8 @@ void Boids::copyBoidsToVBO(float *vbodptr_positions, float *vbodptr_velocities)
 * stepSimulation *
 ******************/
 
+
+
 /**
 * LOOK-1.2 You can use this as a helper for kernUpdateVelocityBruteForce.
 * __device__ code can be called from a __global__ context
@@ -230,10 +262,45 @@ void Boids::copyBoidsToVBO(float *vbodptr_positions, float *vbodptr_velocities)
 * in the `pos` and `vel` arrays.
 */
 __device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3 *pos, const glm::vec3 *vel) {
-  // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves
-  // Rule 2: boids try to stay a distance d away from each other
-  // Rule 3: boids try to match the speed of surrounding boids
-  return glm::vec3(0.0f, 0.0f, 0.0f);
+
+  glm::vec3 percieved_center = glm::vec3(0.0f, 0.0f, 0.0f);
+  glm::vec3 c = glm::vec3(0.0f, 0.0f, 0.0f);
+  glm::vec3 percieved_velocity = glm::vec3(0.0f, 0.0f, 0.0f);
+  int rule1count = 0;
+  int rule3count = 0;
+  for (int i = 0; i < N; i++) {
+    // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves
+    if (i != iSelf && glm::distance(pos[iSelf], pos[i]) <= rule1Distance) {
+      percieved_center += pos[i];
+      rule1count++;
+    }
+
+    // Rule 2: boids try to stay a distance d away from each other
+    if (i != iSelf && glm::distance(pos[iSelf], pos[i]) <= rule2Distance) {
+      c -= (pos[i] - pos[iSelf]);
+    }
+
+    // Rule 3: boids try to match the speed of surrounding boids
+    if (i != iSelf && glm::distance(pos[iSelf], pos[i]) <= rule3Distance) {
+      percieved_velocity += vel[i];
+      rule3count++;
+    }
+  }
+  // Initialize to zero so a boid with no neighbors contributes no velocity change.
+  glm::vec3 rule1velocity = glm::vec3(0.0f, 0.0f, 0.0f);
+  if (rule1count > 0) {
+    percieved_center /= rule1count;
+    rule1velocity = (percieved_center - pos[iSelf]) * rule1Scale;
+  }
+
+  glm::vec3 rule2velocity = c * rule2Scale;
+  glm::vec3 rule3velocity = glm::vec3(0.0f, 0.0f, 0.0f);
+  if (rule3count > 0) {
+    percieved_velocity /= rule3count;
+    rule3velocity = percieved_velocity * rule3Scale;
+  }
+
+  glm::vec3 velocity_change = rule1velocity + rule2velocity + rule3velocity;
+  return velocity_change;
 }
 
 /**
@@ -242,9 +309,21 @@ __device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3 *po
 */
 __global__ void kernUpdateVelocityBruteForce(int N, glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) {
+  int index = threadIdx.x + (blockIdx.x * blockDim.x);
+  if (index >= N) {
+    return;
+  }
   // Compute a new velocity based on pos and vel1
+  glm::vec3 new_velocity = computeVelocityChange(N, index, pos, vel1) + vel1[index];
   // Clamp the speed
+  // Record the new velocity into vel2. Question: why NOT vel1?
+  float speed = glm::length(new_velocity);
+  if (speed >= maxSpeed) {
+    new_velocity = glm::normalize(new_velocity)*maxSpeed;
+  }
+  vel2[index] = new_velocity;
+  //vel2[index] = glm::clamp(glm::length(new_velocity), 0.0f, maxSpeed);
 }
 
 /**
@@ -285,6 +364,14 @@ __device__ int gridIndex3Dto1D(int x, int y, int z, int gridResolution) {
 __global__ void kernComputeIndices(int N, int gridResolution, glm::vec3 gridMin, float inverseCellWidth, glm::vec3 *pos, int *indices, int *gridIndices) {
+  int index = threadIdx.x + (blockIdx.x * blockDim.x);
+  if (index >= N) {
+    return;
+  }
+  glm::vec3 grid_position_3d = glm::floor((pos[index] - gridMin)*inverseCellWidth);
+  gridIndices[index] = gridIndex3Dto1D(grid_position_3d.x, grid_position_3d.y, grid_position_3d.z, gridResolution);
+
+  indices[index] = index;
   // TODO-2.1
   // - Label each boid with the index of its grid cell.
// - Set up a parallel array of integer indices as pointers to the actual @@ -302,6 +389,26 @@ __global__ void kernResetIntBuffer(int N, int *intBuffer, int value) { __global__ void kernIdentifyCellStartEnd(int N, int *particleGridIndices, int *gridCellStartIndices, int *gridCellEndIndices) { + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + if (index >= N) { + return; + } + int current_grid = particleGridIndices[index]; + if (index == 0) { + gridCellStartIndices[current_grid] = index; + } + else { + int previous_grid = particleGridIndices[index - 1]; + if (previous_grid != current_grid) { + gridCellStartIndices[current_grid] = index; + gridCellEndIndices[previous_grid] = index - 1; + } + + if (index == N - 1) { + gridCellEndIndices[current_grid] = index; + } + } + // TODO-2.1 // Identify the start point of each cell in the gridIndices array. // This is basically a parallel unrolling of a loop that goes @@ -314,21 +421,220 @@ __global__ void kernUpdateVelNeighborSearchScattered( int *gridCellStartIndices, int *gridCellEndIndices, int *particleArrayIndices, glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) { - // TODO-2.1 - Update a boid's velocity using the uniform grid to reduce - // the number of boids that need to be checked. - // - Identify the grid cell that this particle is in - // - Identify which cells may contain neighbors. This isn't always 8. - // - For each cell, read the start/end indices in the boid pointer array. - // - Access each boid in the cell and compute velocity change from - // the boids rules, if this boid is within the neighborhood distance. - // - Clamp the speed change before putting the new speed in vel2 + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + if (index >= N) { + return; + } + + // TODO-2.1 - Update a boid's velocity using the uniform grid to reduce + // the number of boids that need to be checked. + + // - Identify the grid cell that this particle is in + glm::vec3 current_cell_3d = glm::vec3(pos[index] - gridMin) * inverseCellWidth; + int current_cell = gridIndex3Dto1D(current_cell_3d.x, current_cell_3d.y, current_cell_3d.z, gridResolution); + + // - Identify which cells may contain neighbors. This isn't always 8. + const float radius = glm::max(glm::max(rule1Distance, rule2Distance), rule3Distance); + glm::vec3 min_cell = glm::floor((pos[index] - gridMin - glm::vec3(radius)) * inverseCellWidth); + glm::vec3 max_cell = glm::floor((pos[index] - gridMin + glm::vec3(radius)) * inverseCellWidth); + + int minX = imax(0, min_cell.x); + int minY = imax(0, min_cell.y); + int minZ = imax(0, min_cell.z); + + int maxX = imin(gridResolution - 1, max_cell.x); + int maxY = imin(gridResolution - 1, max_cell.y); + int maxZ = imin(gridResolution - 1, max_cell.z); + + glm::vec3 percieved_center = glm::vec3(0.0f, 0.0f, 0.0f); + glm::vec3 c = glm::vec3(0.0f, 0.0f, 0.0f); + glm::vec3 percieved_velocity = glm::vec3(0.0f, 0.0f, 0.0f); + int rule1count = 0; + int rule3count = 0; + + for (int i = minX; i <= maxX; i++) { + for (int j = minY; j <= maxY; j++) { + for (int k = minZ; k <= maxZ; k++) { + int curr_cell = gridIndex3Dto1D(i, j, k, gridResolution); + + // - For each cell, read the start/end indices in the boid pointer array. 
+ int start = gridCellStartIndices[curr_cell]; + int end = gridCellEndIndices[curr_cell]; + if (start == -1) { + continue; + } + if (end == -1) { + continue; + } + + for (int l = start; l <= end; l++) { + // - Access each boid in the cell and compute velocity change from + // the boids rules, if this boid is within the neighborhood distance. + int curr_neighbor = particleArrayIndices[l]; + // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves + if (index != curr_neighbor && glm::distance(pos[curr_neighbor], pos[index]) <= rule1Distance) { + percieved_center += pos[curr_neighbor]; + rule1count++; + } + + // Rule 2: boids try to stay a distance d away from each other + if (index != curr_neighbor && glm::distance(pos[curr_neighbor], pos[index]) <= rule2Distance) { + c -= (pos[curr_neighbor] - pos[index]); + } + + // Rule 3: boids try to match the speed of surrounding boids + if (index != curr_neighbor && glm::distance(pos[curr_neighbor], pos[index]) <= rule3Distance) { + percieved_velocity += vel1[curr_neighbor]; + rule3count++; + } + } + } + } + } + glm::vec3 rule1velocity = glm::vec3(0.0f, 0.0f, 0.0f); + if (rule1count > 0) { + percieved_center /= rule1count; + rule1velocity = (percieved_center - pos[index]) * rule1Scale; + } + + glm::vec3 rule2velocity = glm::vec3(0.0f, 0.0f, 0.0f); + rule2velocity = c * rule2Scale; + + glm::vec3 rule3velocity = glm::vec3(0.0f, 0.0f, 0.0f); + if (rule3count > 0) { + percieved_velocity /= rule3count; + rule3velocity = percieved_velocity * rule3Scale; + } + + glm::vec3 velocity_change = rule1velocity + rule2velocity + rule3velocity; + + glm::vec3 new_velocity = vel1[index] + velocity_change; + + //clamp velocity + float speed = glm::length(new_velocity); + if (speed >= maxSpeed) { + new_velocity = glm::normalize(new_velocity)*maxSpeed; + } + vel2[index] = new_velocity; + + //vel2[index] = glm::clamp(vel1[index] + velocity_change, 0.0f, maxSpeed); +} + +__global__ void kernReShuffle(int N, int *indices, glm::vec3 *dev_pos, glm::vec3 *dev_vel1, glm::vec3 *dev_vel2, glm::vec3 *dev_pos_coherent, glm::vec3 *dev_vel1_coherent, glm::vec3 *dev_vel2_coherent) { + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + if (index >= N) { + return; + } + + int boid_index = indices[index]; + + dev_pos_coherent[index] = dev_pos[boid_index]; + dev_vel1_coherent[index] = dev_vel1[boid_index]; + dev_vel2_coherent[index] = dev_vel2[boid_index]; } __global__ void kernUpdateVelNeighborSearchCoherent( int N, int gridResolution, glm::vec3 gridMin, float inverseCellWidth, float cellWidth, int *gridCellStartIndices, int *gridCellEndIndices, - glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) { + glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2, glm::vec3 *org_pos) { + + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + if (index >= N) { + return; + } + + // TODO-2.1 - Update a boid's velocity using the uniform grid to reduce + // the number of boids that need to be checked. + + // - Identify the grid cell that this particle is in + //glm::vec3 current_cell_3d = glm::vec3(org_pos[index] - gridMin) * inverseCellWidth; + //int current_cell = gridIndex3Dto1D(current_cell_3d.x, current_cell_3d.y, current_cell_3d.z, gridResolution); + + // - Identify which cells may contain neighbors. This isn't always 8. 
+ const float radius = glm::max(glm::max(rule1Distance, rule2Distance), rule3Distance); + glm::vec3 min_cell = glm::floor((org_pos[index] - gridMin - glm::vec3(radius)) * inverseCellWidth); + glm::vec3 max_cell = glm::floor((org_pos[index] - gridMin + glm::vec3(radius)) * inverseCellWidth); + + int minX = imax(0, min_cell.x); + int minY = imax(0, min_cell.y); + int minZ = imax(0, min_cell.z); + + int maxX = imin(gridResolution - 1, max_cell.x); + int maxY = imin(gridResolution - 1, max_cell.y); + int maxZ = imin(gridResolution - 1, max_cell.z); + + glm::vec3 percieved_center = glm::vec3(0.0f, 0.0f, 0.0f); + glm::vec3 c = glm::vec3(0.0f, 0.0f, 0.0f); + glm::vec3 percieved_velocity = glm::vec3(0.0f, 0.0f, 0.0f); + int rule1count = 0; + int rule3count = 0; + + for (int i = minX; i <= maxX; i++) { + for (int j = minY; j <= maxY; j++) { + for (int k = minZ; k <= maxZ; k++) { + int curr_cell = gridIndex3Dto1D(i, j, k, gridResolution); + + // - For each cell, read the start/end indices in the boid pointer array. + int start = gridCellStartIndices[curr_cell]; + int end = gridCellEndIndices[curr_cell]; + if (start == -1) { + continue; + } + if (end == -1) { + continue; + } + + for (int l = start; l <= end; l++) { + // - Access each boid in the cell and compute velocity change from + // the boids rules, if this boid is within the neighborhood distance. + // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves + if (glm::distance(pos[l], org_pos[index]) <= rule1Distance) { + percieved_center += pos[l]; + rule1count++; + } + + // Rule 2: boids try to stay a distance d away from each other + if (glm::distance(pos[l], org_pos[index]) <= rule2Distance) { + c -= (pos[l] - org_pos[index]); + } + + // Rule 3: boids try to match the speed of surrounding boids + if (glm::distance(pos[l], org_pos[index]) <= rule3Distance) { + percieved_velocity += vel1[l]; + rule3count++; + } + } + } + } + } + + glm::vec3 rule1velocity = glm::vec3(0.0f, 0.0f, 0.0f); + if (rule1count > 0) { + percieved_center /= rule1count; + rule1velocity = (percieved_center - org_pos[index]) * rule1Scale; + } + + glm::vec3 rule2velocity = glm::vec3(0.0f, 0.0f, 0.0f); + rule2velocity = c * rule2Scale; + + glm::vec3 rule3velocity = glm::vec3(0.0f, 0.0f, 0.0f); + if (rule3count > 0) { + percieved_velocity /= rule3count; + rule3velocity = percieved_velocity * rule3Scale; + } + + glm::vec3 velocity_change = rule1velocity + rule2velocity + rule3velocity; + + glm::vec3 new_velocity = vel2[index] + velocity_change; + + //clamp velocity + float speed = glm::length(new_velocity); + if (speed >= maxSpeed) { + new_velocity = glm::normalize(new_velocity)*maxSpeed; + } + vel2[index] = new_velocity; + // TODO-2.3 - This should be very similar to kernUpdateVelNeighborSearchScattered, // except with one less level of indirection. // This should expect gridCellStartIndices and gridCellEndIndices to refer @@ -348,27 +654,94 @@ __global__ void kernUpdateVelNeighborSearchCoherent( */ void Boids::stepSimulationNaive(float dt) { // TODO-1.2 - use the kernels you wrote to step the simulation forward in time. 
+  dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+
+  kernUpdateVelocityBruteForce << <fullBlocksPerGrid, blockSize >> > (numObjects, dev_pos, dev_vel1, dev_vel2);
+  kernUpdatePos << < fullBlocksPerGrid, blockSize >> > (numObjects, dt, dev_pos, dev_vel2);
+
   // TODO-1.2 ping-pong the velocity buffers
+  // Copy the new velocities back into vel1 (a plain pointer assignment would make
+  // dev_vel1 and dev_vel2 alias the same buffer on the next step).
+  cudaMemcpy(dev_vel1, dev_vel2, sizeof(glm::vec3) * numObjects, cudaMemcpyDeviceToDevice);
 }
 
 void Boids::stepSimulationScatteredGrid(float dt) {
-  // TODO-2.1
-  // Uniform Grid Neighbor search using Thrust sort.
-  // In Parallel:
-  // - label each particle with its array index as well as its grid index.
-  //   Use 2x width grids.
-  // - Unstable key sort using Thrust. A stable sort isn't necessary, but you
-  //   are welcome to do a performance comparison.
-  // - Naively unroll the loop for finding the start and end indices of each
-  //   cell's data pointers in the array of boid indices
-  // - Perform velocity updates using neighbor search
-  // - Update positions
-  // - Ping-pong buffers as needed
+  // TODO-2.1
+  // Uniform Grid Neighbor search using Thrust sort.
+  // In Parallel:
+
+  // - label each particle with its array index as well as its grid index.
+  //   Use 2x width grids.
+  dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+  kernComputeIndices << <fullBlocksPerGrid, blockSize >> > (numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, dev_pos, dev_particleArrayIndices, dev_particleGridIndices);
+
+  // - Unstable key sort using Thrust. A stable sort isn't necessary, but you
+  //   are welcome to do a performance comparison.
+
+  /*thrust::fill(dev_thrust_particleGridIndices, dev_thrust_particleGridIndices + numObjects, dev_particleGridIndices);
+  thrust::fill(dev_thrust_particleArrayIndices, dev_thrust_particleArrayIndices + numObjects, dev_particleArrayIndices);*/
+
+  // LOOK-2.1 Example for using thrust::sort_by_key
+  thrust::sort_by_key(dev_thrust_particleGridIndices, dev_thrust_particleGridIndices + numObjects, dev_thrust_particleArrayIndices);
+
+  dim3 fullBlocksPerGridForGrid((gridCellCount + blockSize - 1) / blockSize);
+  kernResetIntBuffer << <fullBlocksPerGridForGrid, blockSize >> > (gridCellCount, dev_gridCellStartIndices, -1);
+  kernResetIntBuffer << <fullBlocksPerGridForGrid, blockSize >> > (gridCellCount, dev_gridCellEndIndices, -1);
+
+  // - Naively unroll the loop for finding the start and end indices of each
+  //   cell's data pointers in the array of boid indices
+  kernIdentifyCellStartEnd << <fullBlocksPerGrid, blockSize >> > (numObjects, dev_particleGridIndices, dev_gridCellStartIndices, dev_gridCellEndIndices);
+
+  // - Perform velocity updates using neighbor search
+  kernUpdateVelNeighborSearchScattered << <fullBlocksPerGrid, blockSize >> > (numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, gridCellWidth, dev_gridCellStartIndices, dev_gridCellEndIndices, dev_particleArrayIndices, dev_pos, dev_vel1, dev_vel2);
+
+  // - Update positions
+  kernUpdatePos << < fullBlocksPerGrid, blockSize >> > (numObjects, dt, dev_pos, dev_vel2);
+
+  // - Ping-pong buffers as needed
+  cudaMemcpy(dev_vel1, dev_vel2, sizeof(glm::vec3) * numObjects, cudaMemcpyDeviceToDevice);
+  //dev_vel1 = dev_vel2;
 }
 
 void Boids::stepSimulationCoherentGrid(float dt) {
-  // TODO-2.3 - start by copying Boids::stepSimulationNaiveGrid
-  // Uniform Grid Neighbor search using Thrust sort on cell-coherent data.
+  // TODO-2.3 - start by copying Boids::stepSimulationNaiveGrid
+  // Uniform Grid Neighbor search using Thrust sort on cell-coherent data.
+
+  // - label each particle with its array index as well as its grid index.
+  //   Use 2x width grids.
+  dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+  kernComputeIndices << <fullBlocksPerGrid, blockSize >> > (numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, dev_pos, dev_particleArrayIndices, dev_particleGridIndices);
+
+  // - Unstable key sort using Thrust. A stable sort isn't necessary, but you
+  //   are welcome to do a performance comparison.
+
+  // LOOK-2.1 Example for using thrust::sort_by_key
+  thrust::sort_by_key(dev_thrust_particleGridIndices, dev_thrust_particleGridIndices + numObjects, dev_thrust_particleArrayIndices);
+
+  kernReShuffle << <fullBlocksPerGrid, blockSize >> > (numObjects, dev_particleArrayIndices, dev_pos, dev_vel1, dev_vel2, dev_pos_coherent, dev_vel1_coherent, dev_vel2_coherent);
+
+
+  dim3 fullBlocksPerGridForGrid((gridCellCount + blockSize - 1) / blockSize);
+  kernResetIntBuffer << <fullBlocksPerGridForGrid, blockSize >> > (gridCellCount, dev_gridCellStartIndices, -1);
+  kernResetIntBuffer << <fullBlocksPerGridForGrid, blockSize >> > (gridCellCount, dev_gridCellEndIndices, -1);
+
+  // - Naively unroll the loop for finding the start and end indices of each
+  //   cell's data pointers in the array of boid indices
+  kernIdentifyCellStartEnd << <fullBlocksPerGrid, blockSize >> > (numObjects, dev_particleGridIndices, dev_gridCellStartIndices, dev_gridCellEndIndices);
+
+  //cudaMemcpy(dev_pos, dev_pos_coherent, sizeof(glm::vec3) * numObjects, cudaMemcpyDeviceToDevice);
+  //cudaMemcpy(dev_vel1, dev_vel1_coherent, sizeof(glm::vec3) * numObjects, cudaMemcpyDeviceToDevice);
+  //cudaMemcpy(dev_vel2, dev_vel2_coherent, sizeof(glm::vec3) * numObjects, cudaMemcpyDeviceToDevice);
+
+  // - Perform velocity updates using neighbor search
+  kernUpdateVelNeighborSearchCoherent << <fullBlocksPerGrid, blockSize >> > (numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, gridCellWidth, dev_gridCellStartIndices, dev_gridCellEndIndices, dev_pos_coherent, dev_vel1_coherent, dev_vel2, dev_pos);
+
+  // - Update positions
+  kernUpdatePos << < fullBlocksPerGrid, blockSize >> > (numObjects, dt, dev_pos, dev_vel2);
+
+  cudaMemcpy(dev_vel1, dev_vel2, sizeof(glm::vec3) * numObjects, cudaMemcpyDeviceToDevice);
+
+
   // In Parallel:
   // - Label each particle with its array index as well as its grid index.
   //   Use 2x width grids
diff --git a/src/main.cpp b/src/main.cpp
index b82c8c6..ed5e748 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -14,11 +14,11 @@
 
 // LOOK-2.1 LOOK-2.3 - toggles for UNIFORM_GRID and COHERENT_GRID
 #define VISUALIZE 1
-#define UNIFORM_GRID 0
+#define UNIFORM_GRID 1
 #define COHERENT_GRID 0
 
 // LOOK-1.2 - change this to adjust particle count in the simulation
-const int N_FOR_VIS = 5000;
+const int N_FOR_VIS = 1000000;
 const float DT = 0.2f;
 
 /**