diff --git a/README.md b/README.md index d63a6a1..0178849 100644 --- a/README.md +++ b/README.md @@ -1,11 +1,82 @@ **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 1 - Flocking** -* (TODO) YOUR NAME HERE - * (TODO) [LinkedIn](), [personal website](), [twitter](), etc. -* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab) +* Yan Dong + - [LinkedIn](https://www.linkedin.com/in/yan-dong-572b1113b/) + - [personal website](coffeier.com) + - [github](https://github.com/coffeiersama) +* Tested on: Windows 10, i7-8750 @ 2.22GHz (12CPUs) 16GB, GTX 1060 14202MB (OMEN 15-dc0xxx) -### (TODO: Your README) +------ -Include screenshots, analysis, etc. (Remember, this is public, so don't put -anything here that you don't want to share with the world.) +[Result](#boid-result) - [Rules](#rules) - [Runtime Analysis](#runtime-analysis) - [Responses](#questions) + +------ + +## Boid Result + +10,000 boids + +![](images/1_10000.gif) + +150,000 boids + +![](images/3_150000.gif) + + + +## Rules + +#### Naive + +For each boid, search all the other boids, find the ones that are near enough to the current boid, and apply rules 1, 2, and 3. This is slow for a large number of boids, because every boid must be checked against all `boids_count` others; there is no selection or filtering. + +#### Uniform Grid + +Allocate buffers that store the particle index for each grid cell. Controlling the relationship between the cell width and max_distance (which is also our search distance) makes the approach more flexible (see my code where max_search is defined). + +This avoids hard-coding a search of a designated area (see where my_8 is defined). + +#### Coherent Grid + +Allocate new position and velocity buffers, and fill them according to the particle indices after sorting by grid index. This saves time when fetching position and velocity data.
+ + +## Runtime Analysis + +When we increase the boid count, the FPS of all three methods decreases. The naive method can hardly support more than 20k boids, but it works well when the boid count is small. By using the uniform grid and the coherent grid to search only valid neighbors, we successfully cut down the comparison work. The coherent grid can even support more than 300k boids at a good FPS. Notice that, for the grid methods, there is a rise before the drop as the boid count grows from a small number. I think this is because the two methods use more buffers, and buffer reads and writes take a larger share of the time when the boid count is small. + +![](images/bc_visual2.png) + +As for increasing the block size, here is the line chart. My boid count is 20k, and increasing the block size does not raise the FPS; on the contrary, it drops slightly in the coherent grid case. + +![](images/bs_visual.png) + +Rendering the boids in the window does impact performance. I compared the two situations (with and without visualization) and here is the result. The shapes of the lines in the two charts are similar, which probably reflects the performance of the algorithms themselves. Without visualization, naive is even the best method when the boid count is under 0.5k. + +![](images/upbc_compare.png) + +For increasing block size: + +![](images/upbs_compare.png) + +## Questions + +##### For each implementation, how does changing the number of boids affect performance? Why do you think this is? + +Increasing the number of boids usually causes the FPS to decrease. This is reasonable: with more boids in the space, much more time goes into searching for neighbors and comparing distances. For naive, the FPS drops rapidly between 1k and 10k; for the uniform grid and the coherent grid, the FPS drops mostly after 20k. This makes sense, since naive suits a small boid count while the grid methods do better at large boid counts.
+ +It is a little strange that in my test data, starting from a small number, increasing the boid count raises the FPS slightly. I think this is because, first, the FPS is unstable: it can fluctuate by around ±50 or ±80, and if you run the test twice with a long interval in between, you get different FPS. Also, different initial positions can produce different results. + +##### For each implementation, how does changing the block count and block size affect performance? Why do you think this is? + +Changing the block count and block size does not affect performance much in my testing (given the fluctuating FPS, such small changes amount to almost no effect...). Maybe when I launch more threads, global memory bandwidth is used up and some local memory is consumed, so other work is affected? + +##### For the coherent uniform grid: did you experience any performance improvements with the more coherent uniform grid? Was this the outcome you expected? Why or why not? + +Yes. The coherent uniform grid focuses on avoiding a read through the particle index array to fetch the position and velocity data every time during the neighbor search. This method genuinely improves the FPS by about 100. + +##### Did changing cell width and checking 27 vs 8 neighboring cells affect performance? Why or why not? Be careful: it is insufficient (and possibly incorrect) to say that 27-cell is slower simply because there are more cells to check! + +Yes. Of course, 27 cells mean more checking and computation than 8 cells, so the 27-cell version can be slower. But the search zone does not change in our algorithm: when the cells become smaller, there are simply more cells in the space. Besides, we do not search an exact spherical zone; every cell that intersects the search sphere must be checked. So using more, smaller cells bounds the search volume more tightly and can save time. In my tests, at 250k boids the 27-cell version is better; at 150k or below, the 8-cell version is better.
\ No newline at end of file diff --git a/images/1_10000.gif b/images/1_10000.gif new file mode 100644 index 0000000..6faf2bc Binary files /dev/null and b/images/1_10000.gif differ diff --git a/images/2_50000.gif b/images/2_50000.gif new file mode 100644 index 0000000..d84388a Binary files /dev/null and b/images/2_50000.gif differ diff --git a/images/3_150000.gif b/images/3_150000.gif new file mode 100644 index 0000000..82f4d19 Binary files /dev/null and b/images/3_150000.gif differ diff --git a/images/bc_novisual.png b/images/bc_novisual.png new file mode 100644 index 0000000..f116fc8 Binary files /dev/null and b/images/bc_novisual.png differ diff --git a/images/bc_visual.png b/images/bc_visual.png new file mode 100644 index 0000000..673a13c Binary files /dev/null and b/images/bc_visual.png differ diff --git a/images/bc_visual2.png b/images/bc_visual2.png new file mode 100644 index 0000000..9d2f026 Binary files /dev/null and b/images/bc_visual2.png differ diff --git a/images/bs_visual.png b/images/bs_visual.png new file mode 100644 index 0000000..3249579 Binary files /dev/null and b/images/bs_visual.png differ diff --git a/images/cor_vnov.png b/images/cor_vnov.png new file mode 100644 index 0000000..d49d56d Binary files /dev/null and b/images/cor_vnov.png differ diff --git a/images/upbc_compare.png b/images/upbc_compare.png new file mode 100644 index 0000000..042d4f5 Binary files /dev/null and b/images/upbc_compare.png differ diff --git a/images/upbs_compare.png b/images/upbs_compare.png new file mode 100644 index 0000000..8dd230d Binary files /dev/null and b/images/upbs_compare.png differ diff --git a/src/kernel.cu b/src/kernel.cu index 74dffcb..78a99af 100644 --- a/src/kernel.cu +++ b/src/kernel.cu @@ -1,4 +1,4 @@ -#define GLM_FORCE_CUDA +#define GLM_FORCE_CUDA #include #include #include @@ -6,29 +6,35 @@ #include "utilityCore.hpp" #include "kernel.h" +//# define my_8 +# define max_search +//# define debug + // LOOK-2.1 potentially useful for doing grid-based 
neighbor search +//return a max value #ifndef imax #define imax( a, b ) ( ((a) > (b)) ? (a) : (b) ) #endif - +//return a min value #ifndef imin #define imin( a, b ) ( ((a) < (b)) ? (a) : (b) ) #endif #define checkCUDAErrorWithLine(msg) checkCUDAError(msg, __LINE__) - +#define glmvec3(_name, _x,_y,_z) glm::vec3 _name(_x,_y,_z); \ + DebugVector _name##_ = debugVectorViewer(_name); /** * Check for CUDA errors; print and exit if there was a problem. */ void checkCUDAError(const char *msg, int line = -1) { - cudaError_t err = cudaGetLastError(); - if (cudaSuccess != err) { - if (line >= 0) { - fprintf(stderr, "Line %d: ", line); - } - fprintf(stderr, "Cuda error: %s: %s.\n", msg, cudaGetErrorString(err)); - exit(EXIT_FAILURE); - } + cudaError_t err = cudaGetLastError(); + if (cudaSuccess != err) { + if (line >= 0) { + fprintf(stderr, "Line %d: ", line); + } + fprintf(stderr, "Cuda error: %s: %s.\n", msg, cudaGetErrorString(err)); + exit(EXIT_FAILURE); + } } @@ -41,16 +47,19 @@ void checkCUDAError(const char *msg, int line = -1) { // LOOK-1.2 Parameters for the boids algorithm. // These worked well in our reference implementation. +// 3 distance parameters #define rule1Distance 5.0f #define rule2Distance 3.0f #define rule3Distance 5.0f - +// 3 scale parameters #define rule1Scale 0.01f #define rule2Scale 0.1f #define rule3Scale 0.1f - +// max speed limitation #define maxSpeed 1.0f +#define search_distance 1.0f + /*! Size of the starting area in simulation space. */ #define scene_scale 100.0f @@ -61,6 +70,7 @@ void checkCUDAError(const char *msg, int line = -1) { int numObjects; dim3 threadsPerBlock(blockSize); + // LOOK-1.2 - These buffers are here to hold all your boid information. // These get allocated for you in Boids::initSimulation. // Consider why you would need two velocity buffers in a simulation where each @@ -85,6 +95,9 @@ int *dev_gridCellEndIndices; // to this cell? 
// TODO-2.3 - consider what additional buffers you might need to reshuffle // the position and velocity data to be coherent within cells. +//coherent position +glm::vec3 *dev_pos_cohere; +glm::vec3 *dev_vel_cohere; // LOOK-2.1 - Grid parameters based on simulation parameters. // These are automatically computed for you in Boids::initSimulation @@ -99,77 +112,124 @@ glm::vec3 gridMinimum; ******************/ __host__ __device__ unsigned int hash(unsigned int a) { - a = (a + 0x7ed55d16) + (a << 12); - a = (a ^ 0xc761c23c) ^ (a >> 19); - a = (a + 0x165667b1) + (a << 5); - a = (a + 0xd3a2646c) ^ (a << 9); - a = (a + 0xfd7046c5) + (a << 3); - a = (a ^ 0xb55a4f09) ^ (a >> 16); - return a; + a = (a + 0x7ed55d16) + (a << 12); + a = (a ^ 0xc761c23c) ^ (a >> 19); + a = (a + 0x165667b1) + (a << 5); + a = (a + 0xd3a2646c) ^ (a << 9); + a = (a + 0xfd7046c5) + (a << 3); + a = (a ^ 0xb55a4f09) ^ (a >> 16); + return a; } /** * LOOK-1.2 - this is a typical helper function for a CUDA kernel. * Function for generating a random vec3. +* Create a random vec3 */ __host__ __device__ glm::vec3 generateRandomVec3(float time, int index) { - thrust::default_random_engine rng(hash((int)(index * time))); - thrust::uniform_real_distribution<float> unitDistrib(-1, 1); + thrust::default_random_engine rng(hash((int)(index * time))); + thrust::uniform_real_distribution<float> unitDistrib(-1, 1); - return glm::vec3((float)unitDistrib(rng), (float)unitDistrib(rng), (float)unitDistrib(rng)); + return glm::vec3((float)unitDistrib(rng), (float)unitDistrib(rng), (float)unitDistrib(rng)); } /** * LOOK-1.2 - This is a basic CUDA kernel. * CUDA kernel for generating boids with a specified mass randomly around the star.
+* This is a kernel function: create N boids with a specified mass, placed randomly around the star */ __global__ void kernGenerateRandomPosArray(int time, int N, glm::vec3 * arr, float scale) { - int index = (blockIdx.x * blockDim.x) + threadIdx.x; - if (index < N) { - glm::vec3 rand = generateRandomVec3(time, index); - arr[index].x = scale * rand.x; - arr[index].y = scale * rand.y; - arr[index].z = scale * rand.z; - } + // n -> total number + // arr -> array of vec3 + int index = (blockIdx.x * blockDim.x) + threadIdx.x;//threadindex + if (index < N) { + glm::vec3 rand = generateRandomVec3(time, index);//randomly generate a vector3 + arr[index].x = scale * rand.x; + arr[index].y = scale * rand.y; + arr[index].z = scale * rand.z; + } } +__global__ void kernResetIntBuffer(int N, int *intBuffer, int value); /** * Initialize memory, update some globals +* Initialize parameters, update some global parameters */ void Boids::initSimulation(int N) { - numObjects = N; - dim3 fullBlocksPerGrid((N + blockSize - 1) / blockSize); - - // LOOK-1.2 - This is basic CUDA memory management and error checking. - // Don't forget to cudaFree in Boids::endSimulation. - cudaMalloc((void**)&dev_pos, N * sizeof(glm::vec3)); - checkCUDAErrorWithLine("cudaMalloc dev_pos failed!"); - - cudaMalloc((void**)&dev_vel1, N * sizeof(glm::vec3)); - checkCUDAErrorWithLine("cudaMalloc dev_vel1 failed!"); - - cudaMalloc((void**)&dev_vel2, N * sizeof(glm::vec3)); - checkCUDAErrorWithLine("cudaMalloc dev_vel2 failed!"); - - // LOOK-1.2 - This is a typical CUDA kernel invocation.
- kernGenerateRandomPosArray<<<fullBlocksPerGrid, blockSize>>>(1, numObjects, - dev_pos, scene_scale); - checkCUDAErrorWithLine("kernGenerateRandomPosArray failed!"); - - // LOOK-2.1 computing grid params - gridCellWidth = 2.0f * std::max(std::max(rule1Distance, rule2Distance), rule3Distance); - int halfSideCount = (int)(scene_scale / gridCellWidth) + 1; - gridSideCount = 2 * halfSideCount; - - gridCellCount = gridSideCount * gridSideCount * gridSideCount; - gridInverseCellWidth = 1.0f / gridCellWidth; - float halfGridWidth = gridCellWidth * halfSideCount; - gridMinimum.x -= halfGridWidth; - gridMinimum.y -= halfGridWidth; - gridMinimum.z -= halfGridWidth; - - // TODO-2.1 TODO-2.3 - Allocate additional buffers here. - cudaDeviceSynchronize(); + numObjects = N; + dim3 fullBlocksPerGrid((N + blockSize - 1) / blockSize); + + // LOOK-1.2 - This is basic CUDA memory management and error checking. + // Don't forget to cudaFree in Boids::endSimulation. + cudaMalloc((void**)&dev_pos, N * sizeof(glm::vec3)); + checkCUDAErrorWithLine("cudaMalloc dev_pos failed!"); + + cudaMalloc((void**)&dev_vel1, N * sizeof(glm::vec3)); + checkCUDAErrorWithLine("cudaMalloc dev_vel1 failed!"); + + cudaMalloc((void**)&dev_vel2, N * sizeof(glm::vec3)); + checkCUDAErrorWithLine("cudaMalloc dev_vel2 failed!"); + + // LOOK-1.2 - This is a typical CUDA kernel invocation.
+ kernGenerateRandomPosArray <<<fullBlocksPerGrid, blockSize>>> (1, numObjects, + dev_pos, scene_scale); + checkCUDAErrorWithLine("kernGenerateRandomPosArray failed!"); + + // LOOK-2.1 computing grid params + gridCellWidth = 2.f * std::max(std::max(rule1Distance, rule2Distance), rule3Distance); + int halfSideCount = (int)(scene_scale / gridCellWidth) + 1; + gridSideCount = 2 * halfSideCount; + + gridCellCount = gridSideCount * gridSideCount * gridSideCount; + gridInverseCellWidth = 1.0f / gridCellWidth; + float halfGridWidth = gridCellWidth * halfSideCount; + gridMinimum.x -= halfGridWidth; + gridMinimum.y -= halfGridWidth; + gridMinimum.z -= halfGridWidth; + + cudaMalloc((void**)&dev_particleArrayIndices, N * sizeof(int)); + checkCUDAErrorWithLine("cudaMalloc dev_particleArrayIndices failed!"); + kernResetIntBuffer <<<fullBlocksPerGrid, blockSize>>> (N, dev_particleArrayIndices, -1); + + cudaMalloc((void**)&dev_particleGridIndices, N * sizeof(int)); + checkCUDAErrorWithLine("cudaMalloc dev_particleGridIndices failed!"); + kernResetIntBuffer <<<fullBlocksPerGrid, blockSize>>> (N, dev_particleGridIndices, -1); + + dev_thrust_particleArrayIndices = thrust::device_ptr<int>(dev_particleArrayIndices); + dev_thrust_particleGridIndices = thrust::device_ptr<int>(dev_particleGridIndices); + + dim3 fullBlocksPerGrid2((gridCellCount + blockSize - 1) / blockSize);//round up: a partial block still gets one block + + cudaMalloc((void**)&dev_gridCellStartIndices, gridCellCount * sizeof(int)); + checkCUDAErrorWithLine("cudaMalloc dev_gridCellStartIndices failed!"); + kernResetIntBuffer <<<fullBlocksPerGrid2, blockSize>>> (gridCellCount, dev_gridCellStartIndices, -1); + + cudaMalloc((void**)&dev_gridCellEndIndices, gridCellCount * sizeof(int)); + checkCUDAErrorWithLine("cudaMalloc dev_gridCellEndIndices failed!"); + kernResetIntBuffer <<<fullBlocksPerGrid2, blockSize>>> (gridCellCount, dev_gridCellEndIndices, -1); + + //Part-2.3 coherent pos and vel + cudaMalloc((void**)&dev_pos_cohere, N * sizeof(glm::vec3)); + checkCUDAErrorWithLine("cudaMalloc dev_pos_cohere failed!"); + + cudaMalloc((void**)&dev_vel_cohere, N * sizeof(glm::vec3)); +
checkCUDAErrorWithLine("cudaMalloc dev_vel_cohere failed!"); + +#ifdef debug + std::unique_ptr<int[]> start{ new int[gridCellCount / 2] };//int array + std::unique_ptr<int[]> end{ new int[gridCellCount / 2] };//int array + cudaMemcpy(start.get(), dev_gridCellStartIndices, sizeof(int) * (gridCellCount / 2), cudaMemcpyDeviceToHost); + cudaMemcpy(end.get(), dev_gridCellEndIndices, sizeof(int) * (gridCellCount / 2), cudaMemcpyDeviceToHost); + checkCUDAErrorWithLine("memcpy back failed!"); + std::cout << " fullblockpergrid " << fullBlocksPerGrid2.x << std::endl; + std::cout << "end and start: " << std::endl; + for (int i = 0; i < gridCellCount / 2; i++) { + std::cout << i << " -> start: " << start[i]; + std::cout << " end: " << end[i] << std::endl; + } +#endif // debug + + cudaDeviceSynchronize(); } @@ -179,50 +239,112 @@ void Boids::initSimulation(int N) { /** * Copy the boid positions into the VBO so that they can be drawn by OpenGL. +* kernel function */ __global__ void kernCopyPositionsToVBO(int N, glm::vec3 *pos, float *vbo, float s_scale) { - int index = threadIdx.x + (blockIdx.x * blockDim.x); - - float c_scale = -1.0f / s_scale; - - if (index < N) { - vbo[4 * index + 0] = pos[index].x * c_scale; - vbo[4 * index + 1] = pos[index].y * c_scale; - vbo[4 * index + 2] = pos[index].z * c_scale; - vbo[4 * index + 3] = 1.0f; - } + int index = threadIdx.x + (blockIdx.x * blockDim.x); + float c_scale = -1.0f / s_scale; + if (index < N) { + vbo[4 * index + 0] = pos[index].x * c_scale; + vbo[4 * index + 1] = pos[index].y * c_scale; + vbo[4 * index + 2] = pos[index].z * c_scale; + vbo[4 * index + 3] = 1.0f; + } } __global__ void kernCopyVelocitiesToVBO(int N, glm::vec3 *vel, float *vbo, float s_scale) { - int index = threadIdx.x + (blockIdx.x *
blockDim.x); + + if (index < N) { + vbo[4 * index + 0] = vel[index].x + 0.3f; + vbo[4 * index + 1] = vel[index].y + 0.3f; + vbo[4 * index + 2] = vel[index].z + 0.3f; + vbo[4 * index + 3] = 1.0f; + } } /** * Wrapper for call to the kernCopyboidsToVBO CUDA kernel. */ void Boids::copyBoidsToVBO(float *vbodptr_positions, float *vbodptr_velocities) { - dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize); + dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize); - kernCopyPositionsToVBO <<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_pos, vbodptr_positions, scene_scale); - kernCopyVelocitiesToVBO <<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_vel1, vbodptr_velocities, scene_scale); + kernCopyPositionsToVBO << < fullBlocksPerGrid, blockSize >> > (numObjects, dev_pos, vbodptr_positions, scene_scale); + kernCopyVelocitiesToVBO << < fullBlocksPerGrid, blockSize >> > (numObjects, dev_vel1, vbodptr_velocities, scene_scale); - checkCUDAErrorWithLine("copyBoidsToVBO failed!"); + checkCUDAErrorWithLine("copyBoidsToVBO failed!"); - cudaDeviceSynchronize(); + cudaDeviceSynchronize(); } +/****************** +* Debug Vertex * +******************/ +struct DebugVector { + float x; + float y; + float z; +}; + +__device__ DebugVector debugVectorViewer(glm::vec3 v) { + return { v.x, v.y, v.z }; +} + /****************** * stepSimulation * ******************/ +__device__ glm::vec3 rule1(int N, int iSelf, const glm::vec3 *pos) { + glm::vec3 perceived_center = glm::vec3(0.0); + glm::vec3 boid_pos = pos[iSelf]; + int count = 0; + for (int i = 0; i < N; i++) { + //glm::vec3 d = boid_pos - pos[i]; + //float dis = __sqrt((d[0] * d[0]) + (d[1] * d[1]) + (d[2] * d[2])); + float dis = glm::distance(boid_pos, pos[i]); + if (iSelf != i && dis < rule1Distance) { + perceived_center += pos[i]; + count++; + } + } + if (count != 0) { + perceived_center /= count; + return (perceived_center - boid_pos) * rule1Scale; + } + return glm::vec3(0.0); +} + +__device__ glm::vec3 rule2(int N, int iSelf, const glm::vec3 *pos) { + glm::vec3 c =
glm::vec3(0.0); + glm::vec3 boid_pos = pos[iSelf]; + for (int i = 0; i < N; i++) { + float dis = glm::distance(boid_pos, pos[i]); + if (iSelf != i && dis < rule2Distance) { + c -= (pos[i] - boid_pos); + } + } + return c * rule2Scale; +} + +__device__ glm::vec3 rule3(int N, int iSelf, const glm::vec3 *pos, const glm::vec3 *vel) { + glm::vec3 perceived_velocity = glm::vec3(0.0); + glm::vec3 boid_pos = pos[iSelf]; + int count = 0; + for (int i = 0; i < N; i++) { + float dis = glm::distance(boid_pos, pos[i]); + if (iSelf != i && dis < rule3Distance) { + perceived_velocity += vel[i]; + count++; + } + } + if (count != 0) { + perceived_velocity /= count; + return perceived_velocity * rule3Scale; + } + return glm::vec3(0.0); +} + /** * LOOK-1.2 You can use this as a helper for kernUpdateVelocityBruteForce. * __device__ code can be called from a __global__ context @@ -230,10 +352,13 @@ void Boids::copyBoidsToVBO(float *vbodptr_positions, float *vbodptr_velocities) * in the `pos` and `vel` arrays. */ __device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3 *pos, const glm::vec3 *vel) { - // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves - // Rule 2: boids try to stay a distance d away from each other - // Rule 3: boids try to match the speed of surrounding boids - return glm::vec3(0.0f, 0.0f, 0.0f); + // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves + // Rule 2: boids try to stay a distance d away from each other + // Rule 3: boids try to match the speed of surrounding boids + glm::vec3 v1 = rule1(N, iSelf, pos); + glm::vec3 v2 = rule2(N, iSelf, pos); + glm::vec3 v3 = rule3(N, iSelf, pos, vel); + return v1 + v2 + v3; } /** @@ -241,10 +366,24 @@ __device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3 *po * For each of the `N` bodies, update its position based on its current velocity. 
*/ __global__ void kernUpdateVelocityBruteForce(int N, glm::vec3 *pos, - glm::vec3 *vel1, glm::vec3 *vel2) { - // Compute a new velocity based on pos and vel1 - // Clamp the speed - // Record the new velocity into vel2. Question: why NOT vel1? + glm::vec3 *vel1, glm::vec3 *vel2) { + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= N) { + return; + } + // Compute a new velocity based on pos and vel1, and store it in vel2 + //(int N, int iSelf, const glm::vec3 *pos, const glm::vec3 *vel) + glm::vec3 newvel = computeVelocityChange(N, index, pos, vel1); + newvel += vel1[index]; + //checkCUDAErrorWithLine("compute new vel failed!"); + // Clamp the speed + newvel = glm::vec3( + glm::clamp(newvel.x, -maxSpeed, maxSpeed), + glm::clamp(newvel.y, -maxSpeed, maxSpeed), + glm::clamp(newvel.z, -maxSpeed, maxSpeed) + ); + // Record the new velocity into vel2. Question: why NOT vel1? + vel2[index] = newvel; } /** @@ -252,206 +391,547 @@ __global__ void kernUpdateVelocityBruteForce(int N, glm::vec3 *pos, * For each of the `N` bodies, update its position based on its current velocity. */ __global__ void kernUpdatePos(int N, float dt, glm::vec3 *pos, glm::vec3 *vel) { - // Update position by velocity - int index = threadIdx.x + (blockIdx.x * blockDim.x); - if (index >= N) { - return; - } - glm::vec3 thisPos = pos[index]; - thisPos += vel[index] * dt; - - // Wrap the boids around so we don't lose them - thisPos.x = thisPos.x < -scene_scale ? scene_scale : thisPos.x; - thisPos.y = thisPos.y < -scene_scale ? scene_scale : thisPos.y; - thisPos.z = thisPos.z < -scene_scale ? scene_scale : thisPos.z; - - thisPos.x = thisPos.x > scene_scale ? -scene_scale : thisPos.x; - thisPos.y = thisPos.y > scene_scale ? -scene_scale : thisPos.y; - thisPos.z = thisPos.z > scene_scale ?
-scene_scale : thisPos.z; - - pos[index] = thisPos; + // Update position by velocity + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= N) { + return; + } + glm::vec3 thisPos = pos[index]; + thisPos += vel[index] * dt; + + // Wrap the boids around so we don't lose them + thisPos.x = thisPos.x < -scene_scale ? scene_scale : thisPos.x; + thisPos.y = thisPos.y < -scene_scale ? scene_scale : thisPos.y; + thisPos.z = thisPos.z < -scene_scale ? scene_scale : thisPos.z; + + thisPos.x = thisPos.x > scene_scale ? -scene_scale : thisPos.x; + thisPos.y = thisPos.y > scene_scale ? -scene_scale : thisPos.y; + thisPos.z = thisPos.z > scene_scale ? -scene_scale : thisPos.z; + + pos[index] = thisPos; } // LOOK-2.1 Consider this method of computing a 1D index from a 3D grid index. // LOOK-2.3 Looking at this method, what would be the most memory efficient // order for iterating over neighboring grid cells? -// for(x) -// for(y) -// for(z)? Or some other order? +// for(x) -> for(z) +// for(y) -> for(y) +// for(z)? Or some other order? -> for(x) __device__ int gridIndex3Dto1D(int x, int y, int z, int gridResolution) { - return x + y * gridResolution + z * gridResolution * gridResolution; + return x + y * gridResolution + z * gridResolution * gridResolution; } __global__ void kernComputeIndices(int N, int gridResolution, - glm::vec3 gridMin, float inverseCellWidth, - glm::vec3 *pos, int *indices, int *gridIndices) { - // TODO-2.1 - // - Label each boid with the index of its grid cell. - // - Set up a parallel array of integer indices as pointers to the actual - // boid data in pos and vel1/vel2 + glm::vec3 gridMin, float inverseCellWidth, + glm::vec3 *pos, int *indices, int *gridIndices) { + // TODO-2.1 + // - Label each boid with the index of its grid cell (give each boid the index of the cell it is in) + // - Set up a parallel array of integer indices as pointers to the actual + // boid data in pos and vel1/vel2 ?
+ int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= N) { + return; + } + //DebugVector v = glmvec3(hellopos, p.x, p.y, p.z); + glm::vec3 p = (pos[index] - gridMin) * inverseCellWidth; + glm::ivec3 gridind = (glm::floor)(p); + gridIndices[index] = gridIndex3Dto1D(gridind.x, gridind.y, gridind.z, gridResolution); + indices[index] = index; } // LOOK-2.1 Consider how this could be useful for indicating that a cell // does not enclose any boids __global__ void kernResetIntBuffer(int N, int *intBuffer, int value) { - int index = (blockIdx.x * blockDim.x) + threadIdx.x; - if (index < N) { - intBuffer[index] = value; - } + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + if (index < N) { + intBuffer[index] = value; + } } +// particleGridIndices here should have already been sorted __global__ void kernIdentifyCellStartEnd(int N, int *particleGridIndices, - int *gridCellStartIndices, int *gridCellEndIndices) { - // TODO-2.1 - // Identify the start point of each cell in the gridIndices array. - // This is basically a parallel unrolling of a loop that goes - // "this index doesn't match the one before it, must be a new cell!" + int *gridCellStartIndices, int *gridCellEndIndices) { + // TODO-2.1 + // Identify the start point of each cell in the gridIndices array. + // This is basically a parallel unrolling of a loop that goes + // "this index doesn't match the one before it, must be a new cell!" 
+ int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= N) { + return; + } + if ((index - 1 < 0) || (index - 1 >= 0 && particleGridIndices[index] != particleGridIndices[index - 1])) + gridCellStartIndices[particleGridIndices[index]] = index; + + if ((index + 1 >= N) || (index + 1 < N && particleGridIndices[index] != particleGridIndices[index + 1])) + gridCellEndIndices[particleGridIndices[index]] = index; } __global__ void kernUpdateVelNeighborSearchScattered( - int N, int gridResolution, glm::vec3 gridMin, - float inverseCellWidth, float cellWidth, - int *gridCellStartIndices, int *gridCellEndIndices, - int *particleArrayIndices, - glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) { - // TODO-2.1 - Update a boid's velocity using the uniform grid to reduce - // the number of boids that need to be checked. - // - Identify the grid cell that this particle is in - // - Identify which cells may contain neighbors. This isn't always 8. - // - For each cell, read the start/end indices in the boid pointer array. - // - Access each boid in the cell and compute velocity change from - // the boids rules, if this boid is within the neighborhood distance. - // - Clamp the speed change before putting the new speed in vel2 + int N, int gridResolution, glm::vec3 gridMin, + float inverseCellWidth, float cellWidth, + int *gridCellStartIndices, int *gridCellEndIndices, + int *particleArrayIndices, + glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) { + // TODO-2.1 - Update a boid's velocity using the uniform grid to reduce + // the number of boids that need to be checked. 
+ // - Identify the grid cell that this particle is in + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= N) { + return; + } +#ifdef my_8 + glm::vec3 pp = (pos[index] - gridMin) * inverseCellWidth; + glm::ivec3 gridind = glm::floor(pp); + glm::ivec3 gridind_edge = glm::round(pp); + glm::ivec3 delta(0, 0, 0); + + for (int i = 0; i < 3; i++) { + if (gridind_edge[i] == gridind[i]) { + delta[i] = -1; + } + else if (gridind_edge[i] > gridind[i]) { + delta[i] = 0; + } + } + glm::ivec3 s_end = delta + glm::ivec3(1, 1, 1); +#endif // my_8 + +#ifdef max_search + glm::vec3 boid_p = pos[index]; + //not hardcode + float max_distance = imax(imax(rule1Distance, rule2Distance), rule3Distance); + int min_gridx = imax(glm::floor((boid_p[0] - gridMin[0] - max_distance) * inverseCellWidth), 0); + int min_gridy = imax(glm::floor((boid_p[1] - gridMin[1] - max_distance) * inverseCellWidth), 0); + int min_gridz = imax(glm::floor((boid_p[2] - gridMin[2] - max_distance) * inverseCellWidth), 0); + + int max_gridx = imin(glm::floor((boid_p[0] - gridMin[0] + max_distance) * inverseCellWidth), gridResolution - 1); + int max_gridy = imin(glm::floor((boid_p[1] - gridMin[1] + max_distance) * inverseCellWidth), gridResolution - 1); + int max_gridz = imin(glm::floor((boid_p[2] - gridMin[2] + max_distance) * inverseCellWidth), gridResolution - 1); +#endif + + //glm::ivec3 curr_grid = (glm::ivec3)((pos[index] - gridMin) * inverseCellWidth); + // - Identify which cells may contain neighbors. This isn't always 8.
+ //rule1 + glm::vec3 v1 = glm::vec3(0.0, 0.0, 0.0); + glm::vec3 perceived_center = glm::vec3(0.0, 0.0, 0.0); + int center_num = 0; + //rule2 + glm::vec3 v2 = glm::vec3(0.0, 0.0, 0.0); + //rule3 + glm::vec3 v3 = glm::vec3(0.0, 0.0, 0.0); + int match_num = 0; + + //max distance search +#ifdef max_search + for (int z = min_gridz; z <= max_gridz; z++) { + for (int y = min_gridy; y <= max_gridy; y++) { + for (int x = min_gridx; x <= max_gridx; x++) { + //this grid + glm::ivec3 g = glm::ivec3(x, y, z); +#endif + + // 8 search +#ifdef my_8 + for (int i = delta[0]; i <= s_end[0]; i++) { + for (int j = delta[1]; j <= s_end[1]; j++) { + for (int k = delta[2]; k <= s_end[2]; k++) { + glm::ivec3 g = gridind + glm::ivec3(i, j, k); + g = glm::clamp(g, glm::ivec3(0, 0, 0), glm::ivec3(gridResolution, gridResolution, gridResolution)); +#endif // my_8 + //get 1d index + int g_ind = gridIndex3Dto1D(g.x, g.y, g.z, gridResolution); + + //get start and end + int start = gridCellStartIndices[g_ind]; + int end = gridCellEndIndices[g_ind]; + + if (start != -1 && end != -1) { + for (int it = start; it <= end; it++) { + int ind = particleArrayIndices[it]; + //rule1 towards the centre of mass of neighbouring + if (ind != index && glm::distance(pos[index], pos[ind]) < rule1Distance) { + perceived_center += pos[ind]; + center_num += 1; + } + + //rule2 keep a small distance away from other objects + if (ind != index && glm::distance(pos[index], pos[ind]) < rule2Distance) { + v2 -= (pos[ind] - pos[index]); + } + + //rule3 : match velocity with near + if (ind != index && glm::distance(pos[index], pos[ind]) < rule3Distance) { + v3 += vel1[ind]; + match_num += 1; + } + } + } + } + } + } + if (center_num != 0) { + perceived_center /= center_num; + v1 = (perceived_center - pos[index]) * rule1Scale; + } + v2 = v2 * rule2Scale; + if (match_num != 0) { + v3 /= match_num; + v3 = v3 * rule3Scale; + } + // - For each cell, read the start/end indices in the boid pointer array. 
+  // - Access each boid in the cell and compute velocity change from
+  //   the boids rules, if this boid is within the neighborhood distance.
+  // - Clamp the speed change before putting the new speed in vel2
+  glm::vec3 newvel = vel1[index] + v1 + v2 + v3;
+  newvel = glm::vec3(
+    glm::clamp(newvel.x, -maxSpeed, maxSpeed),
+    glm::clamp(newvel.y, -maxSpeed, maxSpeed),
+    glm::clamp(newvel.z, -maxSpeed, maxSpeed)
+  );
+  vel2[index] = newvel;
 }
 
 __global__ void kernUpdateVelNeighborSearchCoherent(
-  int N, int gridResolution, glm::vec3 gridMin,
-  float inverseCellWidth, float cellWidth,
-  int *gridCellStartIndices, int *gridCellEndIndices,
-  glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) {
-  // TODO-2.3 - This should be very similar to kernUpdateVelNeighborSearchScattered,
-  // except with one less level of indirection.
-  // This should expect gridCellStartIndices and gridCellEndIndices to refer
-  // directly to pos and vel1.
-  // - Identify the grid cell that this particle is in
-  // - Identify which cells may contain neighbors. This isn't always 8.
-  // - For each cell, read the start/end indices in the boid pointer array.
-  //   DIFFERENCE: For best results, consider what order the cells should be
-  //   checked in to maximize the memory benefits of reordering the boids data.
-  // - Access each boid in the cell and compute velocity change from
-  //   the boids rules, if this boid is within the neighborhood distance.
-  // - Clamp the speed change before putting the new speed in vel2
+  int N, int gridResolution, glm::vec3 gridMin,
+  float inverseCellWidth, float cellWidth,
+  int *gridCellStartIndices, int *gridCellEndIndices,
+  glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) {
+  // TODO-2.3 - This should be very similar to kernUpdateVelNeighborSearchScattered,
+  // except with one less level of indirection.
+  // This should expect gridCellStartIndices and gridCellEndIndices to refer
+  // directly to pos and vel1.
+  // - Identify the grid cell that this particle is in
+  int index = threadIdx.x + (blockIdx.x * blockDim.x);
+  if (index >= N) {
+    return;
+  }
+#ifdef my_8
+  glm::vec3 pp = (pos[index] - gridMin) * inverseCellWidth;
+  glm::ivec3 gridind = glm::floor(pp);
+  glm::ivec3 gridind_edge = glm::round(pp);
+  glm::ivec3 delta(0, 0, 0);
+
+  for (int i = 0; i < 3; i++) {
+    if (gridind_edge[i] == gridind[i]) {
+      delta[i] = -1;
+    }
+    else if (gridind_edge[i] > gridind[i]) {
+      delta[i] = 0;
+    }
+  }
+  glm::ivec3 s_end = delta + glm::ivec3(1, 1, 1);
+#endif // my_8
+
+#ifdef max_search
+  glm::vec3 boid_p = pos[index];
+  // not hardcoded: derive the search radius from the rule distances (float max, not imax)
+  float max_distance = glm::max(glm::max(rule1Distance, rule2Distance), rule3Distance);
+  int min_gridx = imax(glm::floor((boid_p[0] - gridMin[0] - max_distance) * inverseCellWidth), 0);
+  int min_gridy = imax(glm::floor((boid_p[1] - gridMin[1] - max_distance) * inverseCellWidth), 0);
+  int min_gridz = imax(glm::floor((boid_p[2] - gridMin[2] - max_distance) * inverseCellWidth), 0);
+
+  int max_gridx = imin(glm::floor((boid_p[0] - gridMin[0] + max_distance) * inverseCellWidth), gridResolution - 1);
+  int max_gridy = imin(glm::floor((boid_p[1] - gridMin[1] + max_distance) * inverseCellWidth), gridResolution - 1);
+  int max_gridz = imin(glm::floor((boid_p[2] - gridMin[2] + max_distance) * inverseCellWidth), gridResolution - 1);
+#endif
+  //glm::ivec3 curr_grid = (glm::ivec3)((pos[index] - gridMin) * inverseCellWidth);
+  // - Identify which cells may contain neighbors. This isn't always 8.
+  //rule1
+  glm::vec3 v1 = glm::vec3(0.0, 0.0, 0.0);
+  glm::vec3 perceived_center = glm::vec3(0.0, 0.0, 0.0);
+  int center_num = 0;
+  //rule2
+  glm::vec3 v2 = glm::vec3(0.0, 0.0, 0.0);
+  //rule3
+  glm::vec3 v3 = glm::vec3(0.0, 0.0, 0.0);
+  int match_num = 0;
+
+#ifdef my_8
+  for (int i = delta[0]; i <= s_end[0]; i++) {
+    for (int j = delta[1]; j <= s_end[1]; j++) {
+      for (int k = delta[2]; k <= s_end[2]; k++) {
+        glm::ivec3 g = gridind + glm::ivec3(i, j, k);
+        // clamp to valid cell indices [0, gridResolution - 1]
+        g = glm::clamp(g, glm::ivec3(0, 0, 0), glm::ivec3(gridResolution - 1, gridResolution - 1, gridResolution - 1));
+#endif // my_8
+
+#ifdef max_search
+  for (int z = min_gridz; z <= max_gridz; z++) {
+    for (int y = min_gridy; y <= max_gridy; y++) {
+      for (int x = min_gridx; x <= max_gridx; x++) {
+
+        //this grid
+        glm::ivec3 g = glm::ivec3(x, y, z);
+#endif
+        //get 1d index from 3d index
+        int g_ind = gridIndex3Dto1D(g.x, g.y, g.z, gridResolution);
+
+        //get start and end
+        int start = gridCellStartIndices[g_ind];
+        int end = gridCellEndIndices[g_ind];
+
+        if (start != -1 && end != -1) {
+          for (int it = start; it <= end; it++) {
+            int ind = it; //directly use the coherent index
+            //rule1: steer towards the centre of mass of neighbouring boids
+            if (ind != index && glm::distance(pos[index], pos[ind]) < rule1Distance) {
+              perceived_center += pos[ind];
+              center_num += 1;
+            }
+
+            //rule2: keep a small distance away from other boids
+            if (ind != index && glm::distance(pos[index], pos[ind]) < rule2Distance) {
+              v2 -= (pos[ind] - pos[index]);
+            }
+
+            //rule3: match velocity with nearby boids
+            if (ind != index && glm::distance(pos[index], pos[ind]) < rule3Distance) {
+              v3 += vel1[ind];
+              match_num += 1;
+            }
+          }
+        }
+      }
+    }
+  }
+  if (center_num != 0) {
+    perceived_center /= center_num;
+    v1 = (perceived_center - pos[index]) * rule1Scale;
+  }
+  v2 = v2 * rule2Scale;
+  if (match_num != 0) {
+    v3 /= match_num;
+    v3 = v3 * rule3Scale;
+  }
+  // - For each cell, read the start/end indices in the boid pointer array.
+  // DIFFERENCE: For best results, consider what order the cells should be
+  // checked in to maximize the memory benefits of reordering the boids data.
+  // - Access each boid in the cell and compute velocity change from
+  //   the boids rules, if this boid is within the neighborhood distance.
+  // - Clamp the speed change before putting the new speed in vel2
+  glm::vec3 newvel = vel1[index] + v1 + v2 + v3;
+  newvel = glm::vec3(
+    glm::clamp(newvel.x, -maxSpeed, maxSpeed),
+    glm::clamp(newvel.y, -maxSpeed, maxSpeed),
+    glm::clamp(newvel.z, -maxSpeed, maxSpeed)
+  );
+  vel2[index] = newvel;
 }
 
 /**
 * Step the entire N-body simulation by `dt` seconds.
 */
 void Boids::stepSimulationNaive(float dt) {
-  // TODO-1.2 - use the kernels you wrote to step the simulation forward in time.
-  // TODO-1.2 ping-pong the velocity buffers
+  dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);//block number
+  // TODO-1.2 - use the kernels you wrote to step the simulation forward in time.
+  kernUpdateVelocityBruteForce<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_pos, dev_vel1, dev_vel2);
+  kernUpdatePos<<<fullBlocksPerGrid, blockSize>>>(numObjects, dt, dev_pos, dev_vel2);
+  // TODO-1.2 ping-pong the velocity buffers
+  // vel2 -> vel1
+  glm::vec3 *ex = dev_vel1;
+  dev_vel1 = dev_vel2;
+  dev_vel2 = ex;
 }
 
 void Boids::stepSimulationScatteredGrid(float dt) {
-  // TODO-2.1
-  // Uniform Grid Neighbor search using Thrust sort.
-  // In Parallel:
-  // - label each particle with its array index as well as its grid index.
-  //   Use 2x width grids.
-  // - Unstable key sort using Thrust. A stable sort isn't necessary, but you
-  //   are welcome to do a performance comparison.
-  // - Naively unroll the loop for finding the start and end indices of each
-  //   cell's data pointers in the array of boid indices
-  // - Perform velocity updates using neighbor search
-  // - Update positions
-  // - Ping-pong buffers as needed
+  // TODO-2.1
+  // Uniform Grid Neighbor search using Thrust sort.
+  // In Parallel:
+  // - label each particle with its array index as well as its grid index.
+  dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+  kernComputeIndices<<<fullBlocksPerGrid, blockSize>>>(numObjects,
+    gridSideCount, gridMinimum, gridInverseCellWidth,
+    dev_pos, dev_particleArrayIndices, dev_particleGridIndices);
+  //   Use 2x width grids.
+  // - Unstable key sort using Thrust. A stable sort isn't necessary, but you
+  //   are welcome to do a performance comparison.
+  thrust::sort_by_key(dev_thrust_particleGridIndices,
+    dev_thrust_particleGridIndices + numObjects,
+    dev_thrust_particleArrayIndices);
+
+  // - Naively unroll the loop for finding the start and end indices of each
+  //   cell's data pointers in the array of boid indices
+  kernIdentifyCellStartEnd<<<fullBlocksPerGrid, blockSize>>>(numObjects,
+    dev_particleGridIndices,
+    dev_gridCellStartIndices,
+    dev_gridCellEndIndices);
+
+#ifdef debug
+  std::unique_ptr<int[]> start{ new int[gridCellCount / 2] };//int array
+  std::unique_ptr<int[]> end{ new int[gridCellCount / 2] };//int array
+  cudaMemcpy(start.get(), dev_gridCellStartIndices, sizeof(int) * (gridCellCount / 2), cudaMemcpyDeviceToHost);
+  cudaMemcpy(end.get(), dev_gridCellEndIndices, sizeof(int) * (gridCellCount / 2), cudaMemcpyDeviceToHost);
+  checkCUDAErrorWithLine("memcpy back failed!");
+  std::cout << "end and start: " << std::endl;
+  for (int i = 0; i < gridCellCount / 2; i++) {
+    std::cout << i << " -> start: " << start[i];
+    std::cout << " end: " << end[i] << std::endl;
+  }
+#endif // debug
+
+  // - Perform velocity updates using neighbor search
+  kernUpdateVelNeighborSearchScattered<<<fullBlocksPerGrid, blockSize>>>(numObjects,
+    gridSideCount, gridMinimum, gridInverseCellWidth,
+    gridCellWidth, dev_gridCellStartIndices,
+    dev_gridCellEndIndices, dev_particleArrayIndices,
+    dev_pos, dev_vel1, dev_vel2);
+
+  // - Update positions
+  kernUpdatePos<<<fullBlocksPerGrid, blockSize>>>(numObjects, dt, dev_pos, dev_vel2);
+
+  // - Ping-pong buffers as needed
+  glm::vec3* newvel = dev_vel1;
+  dev_vel1 = dev_vel2;
+  dev_vel2 = newvel;
+}
+
+//special kernel for the coherent grid: reshuffle pos/vel into cell-sorted order
+__global__ void rearrangePosandVel(int N,
+  int *particleArrayIndices,
+  glm::vec3 *pos, glm::vec3 *vel,
+  glm::vec3 *new_pos, glm::vec3 *new_vel) {
+  int index = threadIdx.x + (blockIdx.x * blockDim.x);
+  if (index >= N) {
+    return;
+  }
+  new_pos[index] = pos[particleArrayIndices[index]];
+  new_vel[index] = vel[particleArrayIndices[index]];
 }
 
 void Boids::stepSimulationCoherentGrid(float dt) {
-  // TODO-2.3 - start by copying Boids::stepSimulationNaiveGrid
-  // Uniform Grid Neighbor search using Thrust sort on cell-coherent data.
-  // In Parallel:
-  // - Label each particle with its array index as well as its grid index.
-  //   Use 2x width grids
-  // - Unstable key sort using Thrust. A stable sort isn't necessary, but you
-  //   are welcome to do a performance comparison.
-  // - Naively unroll the loop for finding the start and end indices of each
-  //   cell's data pointers in the array of boid indices
-  // - BIG DIFFERENCE: use the rearranged array index buffer to reshuffle all
-  //   the particle data in the simulation array.
-  //   CONSIDER WHAT ADDITIONAL BUFFERS YOU NEED
-  // - Perform velocity updates using neighbor search
-  // - Update positions
-  // - Ping-pong buffers as needed. THIS MAY BE DIFFERENT FROM BEFORE.
+  // TODO-2.3 - start by copying Boids::stepSimulationNaiveGrid
+  // Uniform Grid Neighbor search using Thrust sort on cell-coherent data.
+  // In Parallel:
+  // - Label each particle with its array index as well as its grid index.
+  dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+  kernComputeIndices<<<fullBlocksPerGrid, blockSize>>>(numObjects,
+    gridSideCount, gridMinimum, gridInverseCellWidth,
+    dev_pos, dev_particleArrayIndices, dev_particleGridIndices);
+  //   Use 2x width grids
+  // - Unstable key sort using Thrust. A stable sort isn't necessary, but you
+  //   are welcome to do a performance comparison.
+  thrust::sort_by_key(dev_thrust_particleGridIndices,
+    dev_thrust_particleGridIndices + numObjects,
+    dev_thrust_particleArrayIndices);
+  // - Naively unroll the loop for finding the start and end indices of each
+  //   cell's data pointers in the array of boid indices
+  kernIdentifyCellStartEnd<<<fullBlocksPerGrid, blockSize>>>(numObjects,
+    dev_particleGridIndices,
+    dev_gridCellStartIndices,
+    dev_gridCellEndIndices);
+  // - BIG DIFFERENCE: use the rearranged array index buffer to reshuffle all
+  //   the particle data in the simulation array.
+  //   CONSIDER WHAT ADDITIONAL BUFFERS YOU NEED
+  rearrangePosandVel<<<fullBlocksPerGrid, blockSize>>>(numObjects,
+    dev_particleArrayIndices,
+    dev_pos, dev_vel1,
+    dev_pos_cohere, dev_vel_cohere);
+
+  // - Perform velocity updates using neighbor search
+  kernUpdateVelNeighborSearchCoherent<<<fullBlocksPerGrid, blockSize>>>(numObjects,
+    gridSideCount, gridMinimum, gridInverseCellWidth, gridCellWidth,
+    dev_gridCellStartIndices, dev_gridCellEndIndices,
+    dev_pos_cohere, dev_vel_cohere, dev_vel2);
+  // - Update positions
+  kernUpdatePos<<<fullBlocksPerGrid, blockSize>>>(numObjects, dt, dev_pos_cohere, dev_vel2);
+  // - Ping-pong buffers as needed. THIS MAY BE DIFFERENT FROM BEFORE.
+  glm::vec3 *temp = dev_pos_cohere;
+  dev_pos_cohere = dev_pos;
+  dev_pos = temp;
+
+  temp = dev_vel1;
+  dev_vel1 = dev_vel2;
+  dev_vel2 = temp;
 }
 
 void Boids::endSimulation() {
-  cudaFree(dev_vel1);
-  cudaFree(dev_vel2);
-  cudaFree(dev_pos);
-
-  // TODO-2.1 TODO-2.3 - Free any additional buffers here.
+  cudaFree(dev_vel1);
+  cudaFree(dev_vel2);
+  cudaFree(dev_pos);
+
+  // TODO-2.1 TODO-2.3 - Free any additional buffers here.
+  cudaFree(dev_particleArrayIndices);
+  cudaFree(dev_particleGridIndices);
+  cudaFree(dev_gridCellStartIndices);
+  cudaFree(dev_gridCellEndIndices);
+
+  cudaFree(dev_vel_cohere);
+  cudaFree(dev_pos_cohere);
 }
 
+/* Thrust is a C++ template library for CUDA, modeled on the STL.
+Thrust provides interfaces fully compatible with C++, CUDA, OpenMP, and TBB,
+so we can write high-performance parallel programs with minimal programming effort.
+Thrust offers a rich set of data-parallel algorithms such as scan, sort, and reduce,
+which can be combined into concise, readable code.
+We only describe the computation with these high-level algorithms and delegate the
+implementation entirely to Thrust, which automatically chooses the most efficient
+implementation. Programmers can therefore build CUDA programs quickly, with high
+stability and performance. */
 
 void Boids::unitTest() {
-  // LOOK-1.2 Feel free to write additional tests here.
-
-  // test unstable sort
-  int *dev_intKeys;
-  int *dev_intValues;
-  int N = 10;
-
-  std::unique_ptr<int[]> intKeys{ new int[N] };
-  std::unique_ptr<int[]> intValues{ new int[N] };
-
-  intKeys[0] = 0; intValues[0] = 0;
-  intKeys[1] = 1; intValues[1] = 1;
-  intKeys[2] = 0; intValues[2] = 2;
-  intKeys[3] = 3; intValues[3] = 3;
-  intKeys[4] = 0; intValues[4] = 4;
-  intKeys[5] = 2; intValues[5] = 5;
-  intKeys[6] = 2; intValues[6] = 6;
-  intKeys[7] = 0; intValues[7] = 7;
-  intKeys[8] = 5; intValues[8] = 8;
-  intKeys[9] = 6; intValues[9] = 9;
-
-  cudaMalloc((void**)&dev_intKeys, N * sizeof(int));
-  checkCUDAErrorWithLine("cudaMalloc dev_intKeys failed!");
-
-  cudaMalloc((void**)&dev_intValues, N * sizeof(int));
-  checkCUDAErrorWithLine("cudaMalloc dev_intValues failed!");
-
-  dim3 fullBlocksPerGrid((N + blockSize - 1) / blockSize);
-
-  std::cout << "before unstable sort: " << std::endl;
-  for (int i = 0; i < N; i++) {
-    std::cout << " key: " << intKeys[i];
-    std::cout << " value: " << intValues[i] << std::endl;
-  }
-
-  // How to copy data to the GPU
-  cudaMemcpy(dev_intKeys, intKeys.get(), sizeof(int) * N, cudaMemcpyHostToDevice);
-  cudaMemcpy(dev_intValues, intValues.get(), sizeof(int) * N, cudaMemcpyHostToDevice);
-
-  // Wrap device vectors in thrust iterators for use with thrust.
-  thrust::device_ptr<int> dev_thrust_keys(dev_intKeys);
-  thrust::device_ptr<int> dev_thrust_values(dev_intValues);
-  // LOOK-2.1 Example for using thrust::sort_by_key
-  thrust::sort_by_key(dev_thrust_keys, dev_thrust_keys + N, dev_thrust_values);
-
-  // How to copy data back to the CPU side from the GPU
-  cudaMemcpy(intKeys.get(), dev_intKeys, sizeof(int) * N, cudaMemcpyDeviceToHost);
-  cudaMemcpy(intValues.get(), dev_intValues, sizeof(int) * N, cudaMemcpyDeviceToHost);
-  checkCUDAErrorWithLine("memcpy back failed!");
-
-  std::cout << "after unstable sort: " << std::endl;
-  for (int i = 0; i < N; i++) {
-    std::cout << " key: " << intKeys[i];
-    std::cout << " value: " << intValues[i] << std::endl;
-  }
-
-  // cleanup
-  cudaFree(dev_intKeys);
-  cudaFree(dev_intValues);
-  checkCUDAErrorWithLine("cudaFree failed!");
-  return;
+  // LOOK-1.2 Feel free to write additional tests here.
+
+  // test unstable sort
+  int *dev_intKeys;
+  int *dev_intValues;
+  int N = 10;
+
+  std::unique_ptr<int[]> intKeys{ new int[N] };//int array
+  std::unique_ptr<int[]> intValues{ new int[N] };//int array
+
+  intKeys[0] = 0; intValues[0] = 0;
+  intKeys[1] = 1; intValues[1] = 1;
+  intKeys[2] = 0; intValues[2] = 2;
+  intKeys[3] = 3; intValues[3] = 3;
+  intKeys[4] = 0; intValues[4] = 4;
+  intKeys[5] = 2; intValues[5] = 5;
+  intKeys[6] = 2; intValues[6] = 6;
+  intKeys[7] = 0; intValues[7] = 7;
+  intKeys[8] = 5; intValues[8] = 8;
+  intKeys[9] = 6; intValues[9] = 9;
+
+  //malloc memory on the device, size: N * sizeof(int)
+  //dev_intKeys -> pointer to the device memory
+  cudaMalloc((void**)&dev_intKeys, N * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_intKeys failed!");
+  //malloc memory on the device, size: N * sizeof(int)
+  cudaMalloc((void**)&dev_intValues, N * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_intValues failed!");
+
+  dim3 fullBlocksPerGrid((N + blockSize - 1) / blockSize);//block number
+
+  std::cout << "before unstable sort: " << std::endl;
+  for (int i = 0; i < N; i++) {
+    std::cout << " key: " << intKeys[i];
+    std::cout << " value: " << intValues[i] << std::endl;
+  }
+
+  // How to copy data to the GPU
+  //destination, source
+  cudaMemcpy(dev_intKeys, intKeys.get(), sizeof(int) * N, cudaMemcpyHostToDevice);//host to device
+  cudaMemcpy(dev_intValues, intValues.get(), sizeof(int) * N, cudaMemcpyHostToDevice);//host to device
+
+  // Wrap device vectors in thrust iterators for use with thrust.
+  thrust::device_ptr<int> dev_thrust_keys(dev_intKeys);
+  thrust::device_ptr<int> dev_thrust_values(dev_intValues);
+  // LOOK-2.1 Example for using thrust::sort_by_key
+  // thrust::sort_by_key(d_keys.begin(), d_keys.end(), d_values.begin())
+  // sorts the keys in ascending order
+  thrust::sort_by_key(dev_thrust_keys, dev_thrust_keys + N, dev_thrust_values);
+
+  // How to copy data back to the CPU side from the GPU
+  //device to host
+  cudaMemcpy(intKeys.get(), dev_intKeys, sizeof(int) * N, cudaMemcpyDeviceToHost);
+  cudaMemcpy(intValues.get(), dev_intValues, sizeof(int) * N, cudaMemcpyDeviceToHost);
+  checkCUDAErrorWithLine("memcpy back failed!");
+
+  std::cout << "after unstable sort: " << std::endl;
+  for (int i = 0; i < N; i++) {
+    std::cout << " key: " << intKeys[i];
+    std::cout << " value: " << intValues[i] << std::endl;
+  }
+
+  // cleanup
+  cudaFree(dev_intKeys);
+  cudaFree(dev_intValues);
+  checkCUDAErrorWithLine("cudaFree failed!");
+  return;
 }