Add Review: MapReduce, Spark
h1994st committed Apr 1, 2018
1 parent eb608bc commit 1aa6047
Showing 3 changed files with 68 additions and 0 deletions.
6 changes: 6 additions & 0 deletions README.md
@@ -23,3 +23,9 @@ UMich EECS 591: Distributed Systems
|2018.03.26|[Bigtable](https://web.eecs.umich.edu/~manosk/assets/papers/bigtable.pdf)|[2018.03.25-Bigtable](https://github.com/h1994st/EECS-591/blob/master/Reviews/2018.03.25-Bigtable.md)|
|2018.03.28|[Megastore](https://web.eecs.umich.edu/~manosk/assets/papers/megastore.pdf)|[2018.03.27-Megastore](https://github.com/h1994st/EECS-591/blob/master/Reviews/2018.03.27-Megastore.md)|
|2018.03.28|[Spanner](https://web.eecs.umich.edu/~manosk/assets/papers/spanner.pdf)|[2018.03.27-Spanner](https://github.com/h1994st/EECS-591/blob/master/Reviews/2018.03.27-Spanner.md)|
|2018.04.02|[MapReduce](https://web.eecs.umich.edu/~manosk/assets/papers/mapreduce.pdf)|[2018.03.30-MapReduce](https://github.com/h1994st/EECS-591/blob/master/Reviews/2018.03.30-MapReduce.md)|
|2018.04.02|[Spark](https://web.eecs.umich.edu/~manosk/assets/papers/spark.pdf)|[2018.03.30-Spark](https://github.com/h1994st/EECS-591/blob/master/Reviews/2018.03.30-Spark.md)|
|2018.04.04|[Bitcoin](https://web.eecs.umich.edu/~manosk/assets/papers/bitcoin.pdf)|[2018.04.??-Bitcoin](https://github.com/h1994st/EECS-591/blob/master/Reviews/2018.04.??-Bitcoin.md)|
|2018.04.04|[Algorand](https://web.eecs.umich.edu/~manosk/assets/papers/algorand.pdf)|[2018.04.??-Algorand](https://github.com/h1994st/EECS-591/blob/master/Reviews/2018.04.??-Algorand.md)|
|2018.04.09|[COPS](https://web.eecs.umich.edu/~manosk/assets/papers/cops.pdf)|[2018.04.??-COPS](https://github.com/h1994st/EECS-591/blob/master/Reviews/2018.04.??-COPS.md)|
|2018.04.09|[RAMCloud](https://web.eecs.umich.edu/~manosk/assets/papers/ramcloud.pdf)|[2018.04.??-RAMCloud](https://github.com/h1994st/EECS-591/blob/master/Reviews/2018.04.??-RAMCloud.md)|
31 changes: 31 additions & 0 deletions Reviews/2018.03.30-MapReduce.md
@@ -0,0 +1,31 @@
MapReduce: Simplified Data Processing on Large Clusters
===

###### Jeffrey Dean and Sanjay Ghemawat

---

### Summary

Companies like Google need to process extremely large amounts of data, and the computations have to be distributed across hundreds or thousands of machines to finish in a reasonable amount of time. While handling such distributed tasks, they face the issues of how to parallelize the computation, distribute the data, and handle failures; the large amount of complex engineering work needed to deal with these issues obscures the original, simple computation.

This paper presents a new programming model, MapReduce, which is inspired by the map and reduce primitives in many functional languages. The model provides users with an abstraction that allows them to express the simple computations they are trying to perform while hiding the messy details of parallelization, fault tolerance, data distribution, and load balancing in a library. The authors also describe an implementation of the MapReduce interface targeted at Google's cluster-based computing environment and discuss several useful refinements of the programming model.
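
To make the model concrete, below is a minimal Python sketch (not the paper's C++ interface) of the paper's canonical word-count example: map emits a (word, 1) pair for every word, the library groups intermediate pairs by key, and reduce sums the counts per word.

```python
from collections import defaultdict

def map_fn(document):
    """Map phase: emit an intermediate (word, 1) pair for each word."""
    for word in document.split():
        yield (word, 1)

def reduce_fn(word, counts):
    """Reduce phase: sum all counts emitted for the same word."""
    return (word, sum(counts))

def map_reduce(documents):
    # Shuffle: group intermediate values by key (done by the library in practice).
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            groups[key].append(value)
    return [reduce_fn(key, values) for key, values in groups.items()]

print(map_reduce(["the quick brown fox", "the lazy dog"]))
# [('the', 2), ('quick', 1), ('brown', 1), ('fox', 1), ('lazy', 1), ('dog', 1)]
```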

### Contributions

- The authors present MapReduce, a new and simple abstraction for processing data of any size that fits into this programming model, borrowing the idea from the map and reduce primitives of functional languages.
- MapReduce runs over large clusters of commodity machines and can handle very large amounts of data.

<!-- Refs:
- https://web.eecs.umich.edu/~mozafari/fall2015/eecs584/reviews/summaries/summary25.html -->

### Strengths

- MapReduce provides a simple and powerful interface that enables automatic parallelization and distribution of large-scale computations. It hides complex technical details beneath this simple interface, handling data partitioning, task scheduling, failure handling, and inter-machine communication for the user. Moreover, this design achieves high performance on large clusters of commodity PCs.
- The distributed design and the use of a master node enable the system to tolerate worker failures by simply rescheduling the failed work on other available nodes. Furthermore, the master periodically writes checkpoints of its data structures, so a new copy of the master can be started and resume from the last checkpointed state.
- MapReduce does not limit input and output types, so users can implement their own readers to retrieve data from different sources and can produce custom output types.

### Weaknesses

- Such a programming model is not suitable for real-time data processing scenarios: the data must first be stored in a distributed file system, and only then does the MapReduce framework start to process it. This two-step process may degrade performance.
- MapReduce supports only a restricted programming model, which requires that tasks be written as acyclic data-flow programs. Moreover, the map and reduce functions must be stateless, a limitation that is felt in fields such as machine learning, where iterative algorithms create data dependencies among iterations (see the toy sketch below).
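
As a toy illustration of this second weakness (plain Python, not MapReduce code), an iterative algorithm must be expressed as a chain of independent jobs, and the state of each iteration has to round-trip through the file system because no state survives between jobs:

```python
import json, os, tempfile

def run_job(input_path, output_path):
    # One stateless "job": read input from storage, transform it, write it back.
    with open(input_path) as f:
        ranks = json.load(f)
    new_ranks = {k: 0.85 * v + 0.15 for k, v in ranks.items()}  # toy update rule
    with open(output_path, "w") as f:
        json.dump(new_ranks, f)

workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "iter0.json")
with open(path, "w") as f:
    json.dump({"a": 1.0, "b": 2.0}, f)

# Each iteration is an independent job whose state must round-trip through
# the (distributed) file system, since map/reduce keep no state across jobs.
for i in range(3):
    next_path = os.path.join(workdir, f"iter{i + 1}.json")
    run_job(path, next_path)
    path = next_path

with open(path) as f:
    print(json.load(f))
```
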
31 changes: 31 additions & 0 deletions Reviews/2018.03.30-Spark.md
@@ -0,0 +1,31 @@
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
===

###### Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica

---

### Summary

Nowadays, many cluster computing frameworks provide abstractions for accessing a cluster's computational resources, but they do not provide a distributed memory abstraction. The authors argue that this lack of a shared memory abstraction harms an important class of applications: those that reuse intermediate results across multiple computations, including iterative algorithms and interactive data mining tools. Although some solutions exist, such as Pregel and HaLoop, they only support specific computation patterns and perform data sharing implicitly for those patterns.

In this paper, the authors propose a new abstraction called resilient distributed datasets (RDDs), which lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. RDDs provide an interface based on coarse-grained transformations and achieve fault tolerance by logging the transformations used to build a dataset rather than the data itself. The authors implement RDDs in a system called Spark, and they evaluate RDDs and Spark through both microbenchmarks and measurements of user applications.
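
For a concrete feel of the interface, here is a rough PySpark rendering of the paper's Scala log-mining example (the file path is illustrative): transformations lazily build an RDD's lineage, `persist()` requests in-memory reuse, and a lost partition is recomputed by replaying the logged transformations rather than restored from a replica.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "log-mining-sketch")

# Transformations build the lineage lazily; nothing is computed yet.
lines = sc.textFile("hdfs://namenode/path/to/log")   # illustrative path
errors = lines.filter(lambda line: line.startswith("ERROR"))

# Ask Spark to keep the filtered RDD in memory once it is first computed.
errors.persist()

# Actions trigger the computation; later queries reuse the in-memory RDD.
print(errors.count())
print(errors.filter(lambda line: "HDFS" in line).count())
```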


### Contributions

- The authors design and implement RDDs in a system called Spark, which provides a convenient language-integrated programming interface.
- The authors conduct extensive experiments to evaluate the effectiveness and performance of RDDs and Spark.
- The authors also demonstrate the generality of RDDs by implementing the Pregel and HaLoop programming models on top of Spark.

### Strengths

- RDDs provide abstractions for more general reuse rather than optimizations for specific computation patterns.
- RDDs can explicitly persist intermediate results in memory.
- RDDs can control the partitioning to optimize data placement (see the sketch after this list).
- RDDs provide a rich set of operators to manipulate intermediate results.
- According to the experimental results, RDDs atop Spark are much faster than other frameworks like Hadoop for iterative applications.
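
As a rough PySpark illustration of the partitioning point above (the data and partition count are made up), hash-partitioning a pair RDD and persisting it lets repeated joins reuse that data placement instead of re-shuffling the partitioned side each time:

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Made-up pair RDDs keyed by URL.
ranks = sc.parallelize([("url1", 1.0), ("url2", 1.0)])
links = sc.parallelize([("url1", ["url2"]), ("url2", ["url1"])])

# Hash-partition one side and keep that layout in memory; joins against it
# can then exploit the known partitioning rather than shuffling it again.
links = links.partitionBy(8).persist()
print(links.join(ranks).collect())
```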

### Weaknesses

- RDDs have a limited application scenario: they are best suited for batch applications and would be less suitable for applications that make asynchronous fine-grained updates to shared state, such as a storage system for a web application or an incremental web crawler.
