diff --git a/README.md b/README.md index d63a6a1..dae9065 100644 --- a/README.md +++ b/README.md @@ -1,11 +1,97 @@ +# Project 1: Flocking **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 1 - Flocking** -* (TODO) YOUR NAME HERE - * (TODO) [LinkedIn](), [personal website](), [twitter](), etc. -* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab) +Caroline Lachanski: [LinkedIn](https://www.linkedin.com/in/caroline-lachanski/), [personal website](http://carolinelachanski.com/) -### (TODO: Your README) +Tested on: Windows 10, i5-6500 @ 3.20GHz 16GB, GTX 1660 (personal computer) -Include screenshots, analysis, etc. (Remember, this is public, so don't put -anything here that you don't want to share with the world.) +![20000 Boids](/images/20000boidsBIG.gif) + +## Project Description + +The goal of this project was to implement a flocking simulation based on the Reynolds Boids algorithm using CUDA. The flocking simulation is heavily based on [Conrad Parker's notes](http://www.vergenet.net/~conrad/boids/pseudocode.html), with some small changes. + +10,000 Boids | 100,000 Boids +------------ | ------------- +![10000 Boids](/images/10000boids.gif) | ![100000 Boids](/images/100000boids.gif) + +This project includes three implementations: a naive simulation, an optimization using a uniform grid, and a further optimization using the uniform grid with semi-coherent memory access. + +### Boids Simulation +In a basic boids flocking simulation, the boid particles follow three rules: +1. Cohesion - boids move towards the perceived center of mass of their neighbors +2. Separation - boids avoid getting too close to their neighbors +3. Alignment - boids generally try to move in the same direction and at the same speed as their neighbors + +In this particular project, each rule has a neighborhood distance: for each rule, each boid is only affected by neighbors within a particular distance.
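As a rough illustration, the per-boid rule evaluation can be sketched in plain host-side C++ (this is a simplified stand-in for the project's CUDA kernel; the `Vec3` struct and the distance/scale constants below are hypothetical values, not the simulation's actual parameters):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Minimal 3D vector stand-in for glm::vec3.
struct Vec3 { float x = 0, y = 0, z = 0; };
static Vec3 operator+(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
static Vec3 operator-(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static Vec3 operator*(float s, Vec3 v) { return {s * v.x, s * v.y, s * v.z}; }
static float dist(Vec3 a, Vec3 b) {
  Vec3 d = a - b;
  return std::sqrt(d.x * d.x + d.y * d.y + d.z * d.z);
}

// Hypothetical neighborhood distances and rule weights.
const float rule1Distance = 5.0f, rule2Distance = 3.0f, rule3Distance = 5.0f;
const float rule1Scale = 0.01f, rule2Scale = 0.1f, rule3Scale = 0.1f;

// Velocity change for boid iSelf from the three flocking rules.
Vec3 computeVelocityChange(int iSelf, const std::vector<Vec3>& pos,
                           const std::vector<Vec3>& vel) {
  Vec3 center{}, separate{}, perceivedVel{};
  int n1 = 0, n3 = 0;
  for (size_t i = 0; i < pos.size(); i++) {
    if ((int)i == iSelf) continue;
    float d = dist(pos[i], pos[iSelf]);
    if (d < rule1Distance) { center = center + pos[i]; n1++; }             // cohesion
    if (d < rule2Distance) { separate = separate - (pos[i] - pos[iSelf]); } // separation
    if (d < rule3Distance) { perceivedVel = perceivedVel + vel[i]; n3++; }  // alignment
  }
  Vec3 dv{};
  if (n1 > 0) dv = dv + rule1Scale * ((1.0f / n1) * center - pos[iSelf]);
  if (n3 > 0) dv = dv + rule3Scale * ((1.0f / n3) * perceivedVel);
  dv = dv + rule2Scale * separate;
  return dv;
}
```

In the simulation this result is added to the boid's current velocity, clamped to a maximum speed, and then used to advance the boid's position.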
During each simulation timestep, each boid calculates the velocity change contribution from each of the three rules and adds it to its current velocity, which is then used to update the boid's position. + +### Naive Implementation +In the naive implementation, we check each boid against every other boid in the simulation to see if it is within each rule's neighborhood distance. + +### (Scattered) Uniform Grid Implementation +We can improve on the naive implementation with the addition of a uniform grid. Checking each boid against every other boid to see if it's within a certain neighborhood distance is inefficient, especially when that neighborhood distance is much smaller than the simulation space. Instead, we can bin the boids within a uniform grid, where each cell width is double the maximum of the three neighborhood distances. + +![](/images/Boids%20Ugrid%20base.png) + +This way, in a 3D implementation, at most 8 cells have to be checked to find boids within the neighborhood distance. + +![](/images/Boids%20Ugrid%20neighbor%20search%20shown.png) + +The uniform grid is constructed by determining which grid cell a boid belongs to using the boid's 3D position, then sorting the boids by their grid cell indices. From there, we can determine the start and end indices of each cell, making it easy to get the indices of the boids that belong to a given cell we wish to check. + +![](/images/Boids%20Ugrids%20buffers%20naive.png) + +### Coherent Uniform Grid Implementation + +We can optimize the previous implementation by removing a level of indirection. The scattered implementation relies on a buffer of boid indices (originally an identity buffer) sorted using each boid's grid cell index as the key. This buffer of boid indices is then used to index into the position and velocity buffers that are the input to the simulation.
In the coherent uniform grid implementation, we simply rearrange the position and velocity buffers according to this buffer. These buffers now follow the order of the grid cell indices, which is useful because we will access them based on the sorted order of grid indices. This is what is meant by semi-coherent memory access. + +![](images/Boids%20Ugrids%20buffers%20data%20coherent.png) + +## Performance Analysis + +Performance was analyzed by recording the FPS under various conditions. + +First, we can look at how FPS changes as the number of boids in the simulation increases. This is with visualization of the boids turned off, meaning no computing power is being used to actually draw the boids. + +![FPS vs Num Boids (Vis OFF)](/images/FPSvsNumBoidsVisOff.png) + +We can look at the same conditions with boid visualization turned on (the y-axis scale is kept the same as in the previous graph, to emphasize the difference in FPS). This change seems to decrease performance, likely because some amount of the GPU's computing power is needed to draw the boids. Nicely enough, the overall trend as the number of boids increases looks the same as when visualization is off. + +![FPS vs Num Boids (Vis ON)](/images/FPSvsNumBoidsVisOnAdjustedScale.png) + +We can also look at how block size, or number of threads per block (and correspondingly, number of blocks), affects performance. + +![FPS vs Block Size (25000 Boids)](/images/FPSvsBlockSize25000Boids.png) + +I found any change in performance difficult to discern, so I performed the same tests with an increased number of boids (now 100,000), wondering if differences would begin to appear. I also included block sizes that were not multiples of 32 (the warp size) to see the effect. Again, the differences seem minimal.
+ +![FPS vs Block Size (100000 Boids)](/images/FPSvsBlockSize100000Boids.png) + +Lastly, here is a comparison between grids of different cell widths: one where the cell width is 2 * the max neighborhood distance, meaning we generally have to check 8 cells, and one where the cell width is simply the max neighborhood distance, meaning we generally have to check 27 cells. The first graph is for 50,000 boids: + +![FPS vs Grid Cell Size (50000 Boids)](/images/FPSvsGridCellSize50000Boids.png) + +The next is for 100,000 boids: + +![FPS vs Grid Cell Size (100000 Boids)](/images/FPSvsGridCellSize100000Boids.png) + +Interestingly enough, the trends do not match when you increase the number of boids. + +## Question Responses + +**For each implementation, how does changing the number of boids affect performance? Why do you think this is?** + +Looking at each implementation separately, as you increase the number of boids, performance decreases. This is likely because there are more boids to simulate, which also means that during each simulation step, each boid must be checked against a larger number of boids. This is perhaps why the naive implementation has a more drastic downward trend: because it involves checking each boid against every other boid, increasing the total number of boids in the simulation has an outsized effect. Interestingly, for smaller numbers of boids (1,000 and 5,000), the "more efficient" implementations don't necessarily perform better (in fact, for 1,000 boids, the naive implementation performs the best). This is likely because, for such a small number of boids, the overhead of the preprocessing steps and of maintaining the uniform grid outweighs the advantage it provides. + +**For each implementation, how does changing the block count and block size affect performance? Why do you think this is?** + +For each implementation, changing the block count and block size doesn't seem to affect performance in any meaningful way.
Since the number of blocks is calculated based on the block size, a similar total number of threads is created for each run, so performance doesn't change much. I was interested in seeing whether block sizes that were not multiples of 32 would change anything, and surprisingly, the changes seemed generally minimal. I would have expected to see decreases in performance, since not using multiples of 32 (the warp size) would mean there were unused threads, but perhaps the differences were simply indiscernible. Human error in recording FPS counts could also be a factor. + +**For the coherent uniform grid: did you experience any performance improvements with the more coherent uniform grid? Was this the outcome you expected? Why or why not?** + +Yes, using the coherent uniform grid over the scattered uniform grid brought large performance improvements, especially at larger numbers of boids. I personally didn't expect such a drastic difference, thinking one level of indirection wasn't such a large issue, especially when the coherent uniform grid implementation also required two cudaMemcpy()'s. The difference is likely because in the scattered uniform grid implementation, we had to index into an additional array within a triple for-loop (iterating over the grid cells to check) for each boid, versus simply re-indexing two buffers for each boid along with two cudaMemcpy()'s. And because we rearranged the position and velocity buffers to match the order of the grid cell indices, we are less likely to cause a cache miss as we iterate over grid cells in the same order as they are in memory. + +**Did changing cell width and checking 27 vs 8 neighboring cells affect performance? Why or why not?
Be careful: it is insufficient (and possibly incorrect) to say that 27-cell is slower simply because there are more cells to check!** + +In my implementation, the number of cells checked is not hard-coded and is instead based on how many cells fit within the max neighborhood distance, though this should generally correspond to checking 27 versus 8 neighboring cells. Changing the cell width between 1 * and 2 * the max neighborhood distance had varying effects on performance depending on the number of boids and the implementation, as seen in the graphs above. While using 1 * the max neighborhood distance generally means checking more cells, there will also generally be fewer boids within each cell, meaning the inner for-loop over the boids in each cell runs fewer times. No matter the cell size, the number of boids within the neighborhood distance does not change. However, a smaller cell size gives a tighter search resolution, making us less likely to check cells that contain no boids within the neighborhood distance. But increasing the total number of boids makes it more likely to find boids within any nearby cell, so perhaps cell size becomes less important. It's hard to say exactly what caused the changes we see in the graphs above.
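The 8-versus-27 counts follow from how many cells the search radius can span along each axis. A small C++ sketch (a hypothetical helper, not from the project code) that counts the cells a neighborhood of radius `d` can overlap along one axis for cell width `w`, given the boid's coordinate `p` relative to the grid minimum:

```cpp
#include <cassert>
#include <cmath>

// Number of cells the interval [p - d, p + d] overlaps along one axis,
// for grid cell width w (p is the boid's coordinate relative to the grid min).
int cellsSpanned1D(float p, float d, float w) {
  int lo = (int)std::floor((p - d) / w);
  int hi = (int)std::floor((p + d) / w);
  return hi - lo + 1;
}
```

With w = 2d the span is 2 cells per axis (2^3 = 8 cells total); with w = d it is 3 per axis (3^3 = 27). Note the smaller cells cover a tighter bounding volume, (3d)^3 versus (4d)^3, which is the tighter search resolution mentioned above.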
diff --git a/images/10000.PNG b/images/10000.PNG new file mode 100644 index 0000000..f784d73 Binary files /dev/null and b/images/10000.PNG differ diff --git a/images/100000.PNG b/images/100000.PNG new file mode 100644 index 0000000..cb0490b Binary files /dev/null and b/images/100000.PNG differ diff --git a/images/100000boids.gif b/images/100000boids.gif new file mode 100644 index 0000000..91b24c9 Binary files /dev/null and b/images/100000boids.gif differ diff --git a/images/10000boids.gif b/images/10000boids.gif new file mode 100644 index 0000000..b52058f Binary files /dev/null and b/images/10000boids.gif differ diff --git a/images/20000boidsBIG.gif b/images/20000boidsBIG.gif new file mode 100644 index 0000000..98c9cc3 Binary files /dev/null and b/images/20000boidsBIG.gif differ diff --git a/images/FPSvsBlockSize100000Boids.png b/images/FPSvsBlockSize100000Boids.png new file mode 100644 index 0000000..11daed7 Binary files /dev/null and b/images/FPSvsBlockSize100000Boids.png differ diff --git a/images/FPSvsBlockSize25000Boids.png b/images/FPSvsBlockSize25000Boids.png new file mode 100644 index 0000000..a37ec0b Binary files /dev/null and b/images/FPSvsBlockSize25000Boids.png differ diff --git a/images/FPSvsGridCellSize100000Boids.png b/images/FPSvsGridCellSize100000Boids.png new file mode 100644 index 0000000..c4663c4 Binary files /dev/null and b/images/FPSvsGridCellSize100000Boids.png differ diff --git a/images/FPSvsGridCellSize50000Boids.png b/images/FPSvsGridCellSize50000Boids.png new file mode 100644 index 0000000..c35e3fc Binary files /dev/null and b/images/FPSvsGridCellSize50000Boids.png differ diff --git a/images/FPSvsNumBoidsVisOff.png b/images/FPSvsNumBoidsVisOff.png new file mode 100644 index 0000000..8240835 Binary files /dev/null and b/images/FPSvsNumBoidsVisOff.png differ diff --git a/images/FPSvsNumBoidsVisOn.png b/images/FPSvsNumBoidsVisOn.png new file mode 100644 index 0000000..1cc3f67 Binary files /dev/null and b/images/FPSvsNumBoidsVisOn.png differ 
diff --git a/images/FPSvsNumBoidsVisOnAdjustedScale.png b/images/FPSvsNumBoidsVisOnAdjustedScale.png new file mode 100644 index 0000000..c9a2cc0 Binary files /dev/null and b/images/FPSvsNumBoidsVisOnAdjustedScale.png differ diff --git a/src/kernel.cu b/src/kernel.cu index 74dffcb..6f59926 100644 --- a/src/kernel.cu +++ b/src/kernel.cu @@ -85,6 +85,8 @@ int *dev_gridCellEndIndices; // to this cell? // TODO-2.3 - consider what additional buffers you might need to reshuffle // the position and velocity data to be coherent within cells. +glm::vec3 *dev_rearrangedPos; +glm::vec3 *dev_rearrangedVel1; // LOOK-2.1 - Grid parameters based on simulation parameters. // These are automatically computed for you in Boids::initSimulation @@ -169,6 +171,24 @@ void Boids::initSimulation(int N) { gridMinimum.z -= halfGridWidth; // TODO-2.1 TODO-2.3 - Allocate additional buffers here. + cudaMalloc((void**)&dev_particleArrayIndices, N * sizeof(int)); + checkCUDAErrorWithLine("cudaMalloc dev_particleArrayIndices failed!"); + + cudaMalloc((void**)&dev_particleGridIndices, N * sizeof(int)); + checkCUDAErrorWithLine("cudaMalloc dev_particleGridIndices failed!"); + + cudaMalloc((void**)&dev_gridCellStartIndices, gridCellCount * sizeof(int)); + checkCUDAErrorWithLine("cudaMalloc dev_gridCellStartIndices failed!"); + + cudaMalloc((void**)&dev_gridCellEndIndices, gridCellCount * sizeof(int)); + checkCUDAErrorWithLine("cudaMalloc dev_gridCellEndIndices failed!"); + + cudaMalloc((void**)&dev_rearrangedPos, N * sizeof(glm::vec3)); + checkCUDAErrorWithLine("cudaMalloc dev_rearrangedPos failed!"); + + cudaMalloc((void**)&dev_rearrangedVel1, N * sizeof(glm::vec3)); + checkCUDAErrorWithLine("cudaMalloc dev_rearrangedVel1 failed!"); + cudaDeviceSynchronize(); } @@ -230,10 +250,48 @@ void Boids::copyBoidsToVBO(float *vbodptr_positions, float *vbodptr_velocities) * in the `pos` and `vel` arrays. 
*/ __device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3 *pos, const glm::vec3 *vel) { - // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves - // Rule 2: boids try to stay a distance d away from each other - // Rule 3: boids try to match the speed of surrounding boids - return glm::vec3(0.0f, 0.0f, 0.0f); + glm::vec3 newVel(0.0f); + + glm::vec3 perceivedCenter(0.0f); + glm::vec3 separate(0.0f); + glm::vec3 perceivedVel(0.0f); + + int rule1NeighborCount = 0; + int rule3NeighborCount = 0; + + for (int i = 0; i < N; i++) { + if (i != iSelf) { + float dist = glm::distance(pos[i], pos[iSelf]); + + // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves + if (dist < rule1Distance) { + perceivedCenter += pos[i]; + rule1NeighborCount++; + } + + // Rule 2: boids try to stay a distance d away from each other + if (dist < rule2Distance) { + separate -= (pos[i] - pos[iSelf]); + } + + // Rule 3: boids try to match the speed of surrounding boids + if (dist < rule3Distance) { + perceivedVel += vel[i]; + rule3NeighborCount++; + } + } + } + if (rule1NeighborCount > 0) { + perceivedCenter /= rule1NeighborCount; + newVel += rule1Scale * (perceivedCenter - pos[iSelf]); + } + if (rule3NeighborCount > 0) { + perceivedVel /= rule3NeighborCount; + newVel += rule3Scale * perceivedVel; + } + newVel += rule2Scale * separate; + + return newVel; } /** @@ -242,9 +300,21 @@ __device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3 *po */ __global__ void kernUpdateVelocityBruteForce(int N, glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) { - // Compute a new velocity based on pos and vel1 - // Clamp the speed - // Record the new velocity into vel2. Question: why NOT vel1? 
+ int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= N) { + return; + } + + // Compute a new velocity based on pos and vel1 + glm::vec3 newVel = vel1[index] + computeVelocityChange(N, index, pos, vel1); + + // Clamp the speed + if (glm::length(newVel) > maxSpeed) { + newVel = glm::normalize(newVel) * maxSpeed; + } + + // Record the new velocity into vel2. Question: why NOT vel1? + vel2[index] = newVel; } /** @@ -285,10 +355,20 @@ __device__ int gridIndex3Dto1D(int x, int y, int z, int gridResolution) { __global__ void kernComputeIndices(int N, int gridResolution, glm::vec3 gridMin, float inverseCellWidth, glm::vec3 *pos, int *indices, int *gridIndices) { + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + if (index >= N) { + return; + } // TODO-2.1 - // - Label each boid with the index of its grid cell. - // - Set up a parallel array of integer indices as pointers to the actual - // boid data in pos and vel1/vel2 + // - Label each boid with the index of its grid cell. + // for now, indices is identity, will be sorted using grid indices later + indices[index] = index; + + // - Set up a parallel array of integer indices as pointers to the actual + glm::vec3 boidPos = pos[index] - gridMin; // account for grid offset + glm::vec3 gridPos = floor(boidPos * inverseCellWidth); + int gridIndex = gridIndex3Dto1D(gridPos.x, gridPos.y, gridPos.z, gridResolution); + gridIndices[index] = gridIndex; } // LOOK-2.1 Consider how this could be useful for indicating that a cell @@ -302,10 +382,34 @@ __global__ void kernResetIntBuffer(int N, int *intBuffer, int value) { __global__ void kernIdentifyCellStartEnd(int N, int *particleGridIndices, int *gridCellStartIndices, int *gridCellEndIndices) { - // TODO-2.1 - // Identify the start point of each cell in the gridIndices array. - // This is basically a parallel unrolling of a loop that goes - // "this index doesn't match the one before it, must be a new cell!" 
+ int index = (blockIdx.x * blockDim.x) + threadIdx.x; + if (index >= N) { + return; + } + // TODO-2.1 + // Identify the start point of each cell in the gridIndices array. + // This is basically a parallel unrolling of a loop that goes + // "this index doesn't match the one before it, must be a new cell!" + + // can assume particleGridIndices is sorted + int gridIndex = particleGridIndices[index]; + if (index != 0) { + if (particleGridIndices[index - 1] != gridIndex) { + gridCellStartIndices[gridIndex] = index; + } + } + else { + gridCellStartIndices[gridIndex] = index; + } + + if (index != N - 1) { + if (particleGridIndices[index + 1] != gridIndex) { + gridCellEndIndices[gridIndex] = index; + } + } + else { + gridCellEndIndices[gridIndex] = index; + } } __global__ void kernUpdateVelNeighborSearchScattered( @@ -314,14 +418,105 @@ __global__ void kernUpdateVelNeighborSearchScattered( int *gridCellStartIndices, int *gridCellEndIndices, int *particleArrayIndices, glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) { - // TODO-2.1 - Update a boid's velocity using the uniform grid to reduce - // the number of boids that need to be checked. - // - Identify the grid cell that this particle is in - // - Identify which cells may contain neighbors. This isn't always 8. - // - For each cell, read the start/end indices in the boid pointer array. - // - Access each boid in the cell and compute velocity change from - // the boids rules, if this boid is within the neighborhood distance. - // - Clamp the speed change before putting the new speed in vel2 + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + if (index >= N) { + return; + } + // TODO-2.1 - Update a boid's velocity using the uniform grid to reduce + // the number of boids that need to be checked. 
+ glm::vec3 velChange(0.0f); + + // - Identify the grid cell that this particle is in + glm::vec3 boidPos = pos[index] - gridMin; + glm::vec3 gridPos = floor(boidPos * inverseCellWidth); + int gridIndex = gridIndex3Dto1D(gridPos.x, gridPos.y, gridPos.z, gridResolution); + + // - Identify which cells may contain neighbors. This isn't always 8. + // use a float max here; imax would truncate the float distances + float maxDistance = glm::max(rule1Distance, glm::max(rule2Distance, rule3Distance)); + glm::vec3 cellMinCheckIndex = (boidPos - maxDistance) * inverseCellWidth; + glm::vec3 cellMaxCheckIndex = (boidPos + maxDistance) * inverseCellWidth; + // clamp to valid cell indices [0, gridResolution - 1] + cellMinCheckIndex = glm::clamp(cellMinCheckIndex, glm::vec3(0), glm::vec3(gridResolution - 1)); + cellMaxCheckIndex = glm::clamp(cellMaxCheckIndex, glm::vec3(0), glm::vec3(gridResolution - 1)); + + // begin flocking rules + glm::vec3 perceivedCenter(0.0f); + glm::vec3 separate(0.0f); + glm::vec3 perceivedVel(0.0f); + + int rule1NeighborCount = 0; + int rule3NeighborCount = 0; + + // - For each cell, read the start/end indices in the boid pointer array. + for (int z = cellMinCheckIndex.z; z <= cellMaxCheckIndex.z; z++) { + for (int y = cellMinCheckIndex.y; y <= cellMaxCheckIndex.y; y++) { + for (int x = cellMinCheckIndex.x; x <= cellMaxCheckIndex.x; x++) { + int cellIndex = gridIndex3Dto1D(x, y, z, gridResolution); + + int startIndex = gridCellStartIndices[cellIndex]; + int endIndex = gridCellEndIndices[cellIndex]; + + // skips cells with no boids in them + if (startIndex == -1 || endIndex == -1) { + continue; + } + + // - Access each boid in the cell and compute velocity change from + // the boids rules, if this boid is within the neighborhood distance.
+ for (int i = startIndex; i <= endIndex; i++) { + int boidIndex = particleArrayIndices[i]; + + if (boidIndex != index) { + float dist = glm::distance(pos[boidIndex], pos[index]); + + // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves + if (dist < rule1Distance) { + perceivedCenter += pos[boidIndex]; + rule1NeighborCount++; + } + + // Rule 2: boids try to stay a distance d away from each other + if (dist < rule2Distance) { + separate -= (pos[boidIndex] - pos[index]); + } + + // Rule 3: boids try to match the speed of surrounding boids + if (dist < rule3Distance) { + perceivedVel += vel1[boidIndex]; + rule3NeighborCount++; + } + } + } + } + } + } + + if (rule1NeighborCount > 0) { + perceivedCenter /= rule1NeighborCount; + velChange += rule1Scale * (perceivedCenter - pos[index]); + } + if (rule3NeighborCount > 0) { + perceivedVel /= rule3NeighborCount; + velChange += rule3Scale * perceivedVel; + } + velChange += rule2Scale * separate; + + glm::vec3 newVel = vel1[index] + velChange; + + // - Clamp the speed change before putting the new speed in vel2 + if (glm::length(newVel) > maxSpeed) { + newVel = glm::normalize(newVel) * maxSpeed; + } + + vel2[index] = newVel; +} + +__global__ void kernRearrangeBoidData( + int N, int *particleArrayIndices, glm::vec3 *vec3Buffer, glm::vec3 *rearrangedVec3Buffer) { + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + if (index < N) { + int boidIndex = particleArrayIndices[index]; + rearrangedVec3Buffer[index] = vec3Buffer[boidIndex]; + } } __global__ void kernUpdateVelNeighborSearchCoherent( @@ -329,59 +524,222 @@ __global__ void kernUpdateVelNeighborSearchCoherent( float inverseCellWidth, float cellWidth, int *gridCellStartIndices, int *gridCellEndIndices, glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) { - // TODO-2.3 - This should be very similar to kernUpdateVelNeighborSearchScattered, - // except with one less level of indirection. 
- // This should expect gridCellStartIndices and gridCellEndIndices to refer - // directly to pos and vel1. - // - Identify the grid cell that this particle is in - // - Identify which cells may contain neighbors. This isn't always 8. - // - For each cell, read the start/end indices in the boid pointer array. - // DIFFERENCE: For best results, consider what order the cells should be - // checked in to maximize the memory benefits of reordering the boids data. - // - Access each boid in the cell and compute velocity change from - // the boids rules, if this boid is within the neighborhood distance. - // - Clamp the speed change before putting the new speed in vel2 + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + if (index >= N) { + return; + } + // TODO-2.3 - This should be very similar to kernUpdateVelNeighborSearchScattered, + // except with one less level of indirection. + // This should expect gridCellStartIndices and gridCellEndIndices to refer + // directly to pos and vel1. + + glm::vec3 velChange(0.0f); + + // - Identify the grid cell that this particle is in + glm::vec3 boidPos = pos[index] - gridMin; + glm::vec3 gridPos = floor(boidPos * inverseCellWidth); + int gridIndex = gridIndex3Dto1D(gridPos.x, gridPos.y, gridPos.z, gridResolution); + + // - Identify which cells may contain neighbors. This isn't always 8. 
+ // use a float max here; imax would truncate the float distances + float maxDistance = glm::max(rule1Distance, glm::max(rule2Distance, rule3Distance)); + glm::vec3 minGridCheckIndex = (boidPos - maxDistance) * inverseCellWidth; + glm::vec3 maxGridCheckIndex = (boidPos + maxDistance) * inverseCellWidth; + // clamp to valid cell indices [0, gridResolution - 1] + minGridCheckIndex = glm::clamp(minGridCheckIndex, glm::vec3(0), glm::vec3(gridResolution - 1)); + maxGridCheckIndex = glm::clamp(maxGridCheckIndex, glm::vec3(0), glm::vec3(gridResolution - 1)); + + // begin flocking rules + glm::vec3 perceivedCenter(0.0f); + glm::vec3 separate(0.0f); + glm::vec3 perceivedVel(0.0f); + + int rule1NeighborCount = 0; + int rule3NeighborCount = 0; + + // - For each cell, read the start/end indices in the boid pointer array. + // DIFFERENCE: For best results, consider what order the cells should be + // checked in to maximize the memory benefits of reordering the boids data. + for (int z = minGridCheckIndex.z; z <= maxGridCheckIndex.z; z++) { + for (int y = minGridCheckIndex.y; y <= maxGridCheckIndex.y; y++) { + for (int x = minGridCheckIndex.x; x <= maxGridCheckIndex.x; x++) { + int cellIndex = gridIndex3Dto1D(x, y, z, gridResolution); + + int startIndex = gridCellStartIndices[cellIndex]; + int endIndex = gridCellEndIndices[cellIndex]; + + // skips cells with no boids in them + if (startIndex == -1 || endIndex == -1) { + continue; + } + + // - Access each boid in the cell and compute velocity change from + // the boids rules, if this boid is within the neighborhood distance.
+ for (int i = startIndex; i <= endIndex; i++) { + + if (i != index) { + float dist = glm::distance(pos[i], pos[index]); + + // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves + if (dist < rule1Distance) { + perceivedCenter += pos[i]; + rule1NeighborCount++; + } + + // Rule 2: boids try to stay a distance d away from each other + if (dist < rule2Distance) { + separate -= (pos[i] - pos[index]); + } + + // Rule 3: boids try to match the speed of surrounding boids + if (dist < rule3Distance) { + perceivedVel += vel1[i]; + rule3NeighborCount++; + } + } + } + } + } + } + + if (rule1NeighborCount > 0) { + perceivedCenter /= rule1NeighborCount; + velChange += rule1Scale * (perceivedCenter - pos[index]); + } + if (rule3NeighborCount > 0) { + perceivedVel /= rule3NeighborCount; + velChange += rule3Scale * perceivedVel; + } + velChange += rule2Scale * separate; + + glm::vec3 newVel = vel1[index] + velChange; + + // - Clamp the speed change before putting the new speed in vel2 + if (glm::length(newVel) > maxSpeed) { + newVel = glm::normalize(newVel) * maxSpeed; + } + + vel2[index] = newVel; } /** * Step the entire N-body simulation by `dt` seconds. */ void Boids::stepSimulationNaive(float dt) { - // TODO-1.2 - use the kernels you wrote to step the simulation forward in time. - // TODO-1.2 ping-pong the velocity buffers + dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize); + + // TODO-1.2 - use the kernels you wrote to step the simulation forward in time. 
+ kernUpdateVelocityBruteForce<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_pos, dev_vel1, dev_vel2); + checkCUDAErrorWithLine("kernUpdateVelocityBruteForce failed!"); + + kernUpdatePos<<<fullBlocksPerGrid, blockSize>>>(numObjects, dt, dev_pos, dev_vel2); + checkCUDAErrorWithLine("kernUpdatePos failed!"); + + // TODO-1.2 ping-pong the velocity buffers + cudaMemcpy(dev_vel1, dev_vel2, numObjects * sizeof(glm::vec3), cudaMemcpyDeviceToDevice); + checkCUDAErrorWithLine("cudaMemcpy dev_vel1 dev_vel2 failed!"); } void Boids::stepSimulationScatteredGrid(float dt) { - // TODO-2.1 - // Uniform Grid Neighbor search using Thrust sort. - // In Parallel: - // - label each particle with its array index as well as its grid index. - // Use 2x width grids. - // - Unstable key sort using Thrust. A stable sort isn't necessary, but you - // are welcome to do a performance comparison. - // - Naively unroll the loop for finding the start and end indices of each - // cell's data pointers in the array of boid indices - // - Perform velocity updates using neighbor search - // - Update positions - // - Ping-pong buffers as needed + // TODO-2.1 + // Uniform Grid Neighbor search using Thrust sort. + dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize); + + // In Parallel: + // - label each particle with its array index as well as its grid index. + // Use 2x width grids. + kernComputeIndices<<<fullBlocksPerGrid, blockSize>>>(numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, dev_pos, + dev_particleArrayIndices, dev_particleGridIndices); + checkCUDAErrorWithLine("kernComputeIndices failed!"); + + // - Unstable key sort using Thrust. A stable sort isn't necessary, but you + // are welcome to do a performance comparison.
+ dev_thrust_particleArrayIndices = thrust::device_ptr<int>(dev_particleArrayIndices); + dev_thrust_particleGridIndices = thrust::device_ptr<int>(dev_particleGridIndices); + thrust::sort_by_key(dev_thrust_particleGridIndices, dev_thrust_particleGridIndices + numObjects, dev_thrust_particleArrayIndices); + + // initialize start and end indices to -1 + // uses different number of blocks because it is run per cell, not per boid + dim3 fullBlocksPerGridForCells((gridCellCount + blockSize - 1) / blockSize); + kernResetIntBuffer<<<fullBlocksPerGridForCells, blockSize>>>(gridCellCount, dev_gridCellStartIndices, -1); + checkCUDAErrorWithLine("kernResetIntBuffer dev_gridCellStartIndices failed!"); + kernResetIntBuffer<<<fullBlocksPerGridForCells, blockSize>>>(gridCellCount, dev_gridCellEndIndices, -1); + checkCUDAErrorWithLine("kernResetIntBuffer dev_gridCellEndIndices failed!"); + + // - Naively unroll the loop for finding the start and end indices of each + // cell's data pointers in the array of boid indices + kernIdentifyCellStartEnd<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_particleGridIndices, dev_gridCellStartIndices, dev_gridCellEndIndices); + checkCUDAErrorWithLine("kernIdentifyCellStartEnd failed!"); + + // - Perform velocity updates using neighbor search + kernUpdateVelNeighborSearchScattered<<<fullBlocksPerGrid, blockSize>>>(numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, + gridCellWidth, dev_gridCellStartIndices, dev_gridCellEndIndices, dev_particleArrayIndices, dev_pos, dev_vel1, dev_vel2); + checkCUDAErrorWithLine("kernUpdateVelNeighborSearchScattered failed!"); + + // - Update positions + kernUpdatePos<<<fullBlocksPerGrid, blockSize>>>(numObjects, dt, dev_pos, dev_vel2); + checkCUDAErrorWithLine("kernUpdatePos failed!"); + + // - Ping-pong buffers as needed + cudaMemcpy(dev_vel1, dev_vel2, numObjects * sizeof(glm::vec3), cudaMemcpyDeviceToDevice); + checkCUDAErrorWithLine("cudaMemcpy dev_vel1 dev_vel2 failed!"); } void Boids::stepSimulationCoherentGrid(float dt) { - // TODO-2.3 - start by copying Boids::stepSimulationNaiveGrid - // Uniform Grid Neighbor search using Thrust sort on cell-coherent data.
- // In Parallel: - // - Label each particle with its array index as well as its grid index. - // Use 2x width grids - // - Unstable key sort using Thrust. A stable sort isn't necessary, but you - // are welcome to do a performance comparison. - // - Naively unroll the loop for finding the start and end indices of each - // cell's data pointers in the array of boid indices - // - BIG DIFFERENCE: use the rearranged array index buffer to reshuffle all - // the particle data in the simulation array. - // CONSIDER WHAT ADDITIONAL BUFFERS YOU NEED - // - Perform velocity updates using neighbor search - // - Update positions - // - Ping-pong buffers as needed. THIS MAY BE DIFFERENT FROM BEFORE. + // TODO-2.3 - start by copying Boids::stepSimulationNaiveGrid + // Uniform Grid Neighbor search using Thrust sort on cell-coherent data. + + dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize); + + // In Parallel: + // - Label each particle with its array index as well as its grid index. + // Use 2x width grids + kernComputeIndices<<<fullBlocksPerGrid, blockSize>>>(numObjects, gridSideCount, gridMinimum, gridInverseCellWidth, dev_pos, + dev_particleArrayIndices, dev_particleGridIndices); + checkCUDAErrorWithLine("kernComputeIndices failed!"); + + // - Unstable key sort using Thrust. A stable sort isn't necessary, but you + // are welcome to do a performance comparison.
+ dev_thrust_particleArrayIndices = thrust::device_ptr<int>(dev_particleArrayIndices);
+ dev_thrust_particleGridIndices = thrust::device_ptr<int>(dev_particleGridIndices);
+ thrust::sort_by_key(dev_thrust_particleGridIndices, dev_thrust_particleGridIndices + numObjects, dev_thrust_particleArrayIndices);
+
+ // initialize start and end indices to -1
+ // uses a different number of blocks because it is run per cell, not per boid
+ dim3 fullBlocksPerGridForCells((gridCellCount + blockSize - 1) / blockSize);
+ kernResetIntBuffer<<<fullBlocksPerGridForCells, blockSize>>>(gridCellCount, dev_gridCellStartIndices, -1);
+ checkCUDAErrorWithLine("kernResetIntBuffer dev_gridCellStartIndices failed!");
+ kernResetIntBuffer<<<fullBlocksPerGridForCells, blockSize>>>(gridCellCount, dev_gridCellEndIndices, -1);
+ checkCUDAErrorWithLine("kernResetIntBuffer dev_gridCellEndIndices failed!");
+
+ // - Naively unroll the loop for finding the start and end indices of each
+ //   cell's data pointers in the array of boid indices
+ kernIdentifyCellStartEnd<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_particleGridIndices, dev_gridCellStartIndices, dev_gridCellEndIndices);
+ checkCUDAErrorWithLine("kernIdentifyCellStartEnd failed!");
+
+ // - BIG DIFFERENCE: use the rearranged array index buffer to reshuffle all
+ //   the particle data in the simulation array.
+ //   CONSIDER WHAT ADDITIONAL BUFFERS YOU NEED
+ kernRearrangeBoidData<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_particleArrayIndices, dev_pos, dev_rearrangedPos);
+ checkCUDAErrorWithLine("kernRearrangeBoidData dev_pos failed!");
+ kernRearrangeBoidData<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_particleArrayIndices, dev_vel1, dev_rearrangedVel1);
+ checkCUDAErrorWithLine("kernRearrangeBoidData dev_vel1 failed!");
+
+ cudaMemcpy(dev_pos, dev_rearrangedPos, numObjects * sizeof(glm::vec3), cudaMemcpyDeviceToDevice);
+ checkCUDAErrorWithLine("cudaMemcpy dev_pos dev_rearrangedPos failed!");
+ cudaMemcpy(dev_vel1, dev_rearrangedVel1, numObjects * sizeof(glm::vec3), cudaMemcpyDeviceToDevice);
+ checkCUDAErrorWithLine("cudaMemcpy dev_vel1 dev_rearrangedVel1 failed!");
+
+ // - Perform velocity updates using neighbor search
+ kernUpdateVelNeighborSearchCoherent<<<fullBlocksPerGrid, blockSize>>>(numObjects, gridSideCount, gridMinimum, gridInverseCellWidth,
+   gridCellWidth, dev_gridCellStartIndices, dev_gridCellEndIndices, dev_pos, dev_vel1, dev_vel2);
+ checkCUDAErrorWithLine("kernUpdateVelNeighborSearchCoherent failed!");
+
+ // - Update positions
+ kernUpdatePos<<<fullBlocksPerGrid, blockSize>>>(numObjects, dt, dev_pos, dev_vel2);
+ checkCUDAErrorWithLine("kernUpdatePos failed!");
+
+ // - Ping-pong buffers as needed. THIS MAY BE DIFFERENT FROM BEFORE.
+ cudaMemcpy(dev_vel1, dev_vel2, numObjects * sizeof(glm::vec3), cudaMemcpyDeviceToDevice);
+ checkCUDAErrorWithLine("cudaMemcpy dev_vel1 dev_vel2 failed!");
}

void Boids::endSimulation() {
@@ -390,6 +748,13 @@ void Boids::endSimulation() {
  cudaFree(dev_pos);

  // TODO-2.1 TODO-2.3 - Free any additional buffers here.
+ cudaFree(dev_particleArrayIndices); + cudaFree(dev_particleGridIndices); + cudaFree(dev_gridCellStartIndices); + cudaFree(dev_gridCellEndIndices); + + cudaFree(dev_rearrangedPos); + cudaFree(dev_rearrangedVel1); } void Boids::unitTest() { diff --git a/src/main.cpp b/src/main.cpp index b82c8c6..fafcc7b 100644 --- a/src/main.cpp +++ b/src/main.cpp @@ -14,11 +14,11 @@ // LOOK-2.1 LOOK-2.3 - toggles for UNIFORM_GRID and COHERENT_GRID #define VISUALIZE 1 -#define UNIFORM_GRID 0 -#define COHERENT_GRID 0 +#define UNIFORM_GRID 1 +#define COHERENT_GRID 1 // LOOK-1.2 - change this to adjust particle count in the simulation -const int N_FOR_VIS = 5000; +const int N_FOR_VIS = 50000; const float DT = 0.2f; /**
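The coherent-grid reshuffle in the diff above is just a gather: for each output slot `i`, copy the boid whose original index is `particleArrayIndices[i]`, so boids sorted into the same grid cell become contiguous in memory. The body of `kernRearrangeBoidData` is not shown in this diff; the following CPU sketch (function and variable names are illustrative, not from the project) shows the operation each GPU thread would perform for one index `i`:

```cpp
#include <cstddef>
#include <vector>

// CPU sketch of the gather presumably done by kernRearrangeBoidData:
// out[i] = in[arrayIndices[i]]. On the GPU, one thread handles each i,
// so after the sort neighboring threads touch contiguous boid data.
template <typename T>
std::vector<T> gatherByIndex(const std::vector<T>& in,
                             const std::vector<std::size_t>& arrayIndices) {
  std::vector<T> out(arrayIndices.size());
  for (std::size_t i = 0; i < arrayIndices.size(); ++i) {
    out[i] = in[arrayIndices[i]];  // read via the sorted index, write in order
  }
  return out;
}
```

After this gather (and the device-to-device copies back into `dev_pos` and `dev_vel1`), the velocity-update kernel can index position and velocity directly in cell order, which is exactly the level of indirection the coherent implementation removes.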