diff --git a/README.md b/README.md
index 7c4bb72..70bbd88 100644
--- a/README.md
+++ b/README.md
@@ -2,7 +2,47 @@
 This is an sudoku-solver implementation coded in three days for GPU-class homework.
 
 ## Parallelizing Sudoku
-is hard. The most popular cpu-solution is backtracking, which is built on backtracking and recursion.
+is hard. The most popular CPU solution is backtracking, which tries a digit, recurses, and undoes the guess when it leads to a dead end. [This webpage](https://www.sudoku-solutions.com/) is really nice for improving your sudoku skills, and it has a `View Steps` feature that shows the logic behind the solution.
+
+However, since this project is all about parallelizing, I've focused on that and only implemented the basic logic, which is:
+1. For each of the 81 cells, if it is empty, find the set of digits that are not used in its row, column, or cell-group. This set is the possibility set of the cell.
+2. If the set is empty, this board is invalid (which may happen as a result of an incorrect guess).
+3. If the set consists of a single element, fill the cell with that value.
+4. If there is more than one digit in the set, we don't do anything.
+
+We keep repeating this process until we solve the sudoku or stop making progress. When there is no progress, we schedule a fork:
+1. Find the number of digits in the smallest possibility set.
+2. Among the cells with the smallest set, pick the one with the smallest id (the id here is the row-major 1D index).
+3. Fork the cell by generating a new board for each possible value of the cell. Each board gets a different digit, and they all apply the basic logic again.
+
+The problem here is how to fork. GPUs have some recursion and dynamic allocation capability, but it is usually better to allocate everything at the beginning, and that's what I do.
+
+## Optimizations and Results
+Current:
+- Static allocation with 50000 blocks (observed to be enough for all hard examples).
+- Bit masks to reduce storage.
+
+Future:
+- Share the bit mask generation task better within a block.
+
+Kernel time is under 2 s for 95 hard sudokus.
 
 ## What cuda-sudoku-solver does.
-Generates 50000 blocks, each is capable of solving a logic based sudoku. First one block starts and forks whenever necessary. 
+The `controller` kernel is the main kernel, which calls `fillSudokuSafeAndFork` repeatedly until a solution is found.
+The program has some default values, such as `#blocks` and `#threads`:
+- `#threads`: 96=32*3, which is the smallest multiple of 32 that is bigger than 81. #todo We could use 81 here and remove the if statements.
+- `#blocks`: available solvers. Each block works on its own board.
+- `arr_dev`: holds `#blocks` boards plus one extra for the solution.
+- `block_stat`: holds the status of each block. If it is 0, the block is idle/available. If it is 1, the block is active, working on a solution. If the last element, `block_stat[nBlocks]`, is equal to 2, a solution is ready in the last 81 elements of `arr_dev`.
+
+**fillSudokuSafeAndFork** is a pretty long kernel; the following steps are done in order:
+1. Block 0 checks for errors; it stops the process if there is no active block. This shouldn't happen.
+2. Each active block calculates the row, column, and cell-group binary masks (9 bits each). Using binary masks reduces the shared memory requirement.
+3. Each thread is matched with a sudoku cell and calculates its possibilities by OR'ing its corresponding masks. The result is the set of possible values (the 0s in the binary mask).
+4. Steps 2 and 3 are repeated until no progress is made.
+5. After the loop:
+  - if **done_flag**, then we copy the result to the result spot and set `stats[nBlocks]` to 2.
+  - if **error_flag**, the current block is wrong, so we stop and set its stat to 0 so that the block can be rescheduled.
+  - if no **progress_flag**, then we need to fork.
+    1. Using atomic instructions, choose a cell with multiple possibilities. After this point, only the corresponding thread does any work.
+    2. The first possibility stays with the block. For each remaining digit, search for an available block and copy the current board to the new block's global storage. In the next iteration, the new blocks work on the different possibilities.