Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica
Many cluster computing frameworks provide abstractions for accessing a cluster's computational resources, but they lack an abstraction for distributed memory. The authors argue that this gap hurts an important class of applications: those that reuse intermediate results across multiple computations, including iterative algorithms and interactive data mining tools. Existing systems such as Pregel and HaLoop address part of the problem, but they support only specific computation patterns and perform data sharing implicitly for those patterns.
In this paper, the authors propose a new abstraction called resilient distributed datasets (RDDs), which lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. RDDs provide an interface based on coarse-grained transformations and achieve fault tolerance by logging the transformations used to build a dataset (its lineage) rather than the data itself. The authors implement RDDs in a system called Spark and evaluate it through both microbenchmarks and measurements of user applications.
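The lineage idea described above can be illustrated with a minimal single-process sketch. This is not Spark's API: the `ToyRDD` class and its `lose_cache` method are hypothetical names invented for this example, and the sketch ignores partitions and distribution entirely. It only shows that a dataset which remembers its parent and the transformation applied to it can be rebuilt after its in-memory copy is lost.

```python
class ToyRDD:
    """Hypothetical single-process stand-in for an RDD (not Spark's API)."""

    def __init__(self, source=None, parent=None, transform=None):
        self.source = source        # base data; set only for a root dataset
        self.parent = parent        # lineage: the dataset this one derives from
        self.transform = transform  # lineage: coarse-grained step applied to parent
        self.persisted = False
        self.cache = None

    def map(self, f):
        # Record the transformation; nothing is computed yet.
        return ToyRDD(parent=self, transform=lambda rows: [f(x) for x in rows])

    def filter(self, pred):
        return ToyRDD(parent=self, transform=lambda rows: [x for x in rows if pred(x)])

    def persist(self):
        # Ask for the computed result to be kept in memory after first use.
        self.persisted = True
        return self

    def collect(self):
        # Serve from memory if available; otherwise replay the lineage.
        if self.cache is not None:
            return self.cache
        if self.parent is None:
            rows = list(self.source)
        else:
            rows = self.transform(self.parent.collect())
        if self.persisted:
            self.cache = rows
        return rows

    def lose_cache(self):
        # Simulate a failure that wipes the in-memory copy.
        self.cache = None


base = ToyRDD(source=[1, 2, 3, 4, 5, 6])
squares_of_evens = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x).persist()

first = squares_of_evens.collect()      # computed, then cached
squares_of_evens.lose_cache()           # simulated failure
recovered = squares_of_evens.collect()  # rebuilt from the logged transformations
```

In this sketch, recovery costs only a recomputation of the lost dataset from its ancestors; no copy of the data had to be replicated in advance.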
- The authors design and implement RDDs in a system called Spark, which provides a convenient language-integrated programming interface.
- The authors conduct extensive experiments to evaluate the effectiveness and performance of RDDs and Spark.
- The authors also demonstrate the generality of RDDs by implementing the Pregel and HaLoop programming models on top of Spark.
- RDDs provide abstractions for more general reuse rather than optimizations for specific computation patterns.
- RDDs can explicitly persist intermediate results in memory.
- RDDs can control the partitioning to optimize data placement.
- RDDs provide a rich set of operators to manipulate intermediate results.
- According to the experimental results, RDDs in Spark are much faster than other frameworks such as Hadoop for iterative applications.
- RDDs suit a limited range of application scenarios. They work best for batch applications and are less suitable for applications that make asynchronous fine-grained updates to shared state, such as a storage system for a web application or an incremental web crawler.
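The point in the list above about controlling partitioning can be sketched as follows. This is a toy illustration in plain Python, not Spark code: `hash_partition` and `local_join` are hypothetical helpers, keys are assumed to be integers, and a simple modulo stands in for a real hash partitioner. When two keyed datasets are partitioned the same way, matching keys land in the same partition index, so a join can proceed partition by partition without moving data between partitions.

```python
from collections import defaultdict


def hash_partition(pairs, num_partitions):
    """Place each (key, value) pair in partition key % num_partitions.

    Assumes integer keys; the modulo is a stand-in for a hash partitioner.
    """
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[key % num_partitions].append((key, value))
    return partitions


def local_join(left_part, right_part):
    """Join two partitions that are known to hold the same set of keys."""
    index = defaultdict(list)
    for key, value in left_part:
        index[key].append(value)
    return [
        (key, (left_value, right_value))
        for key, right_value in right_part
        for left_value in index.get(key, [])
    ]


# Hypothetical sample data: two datasets keyed by the same integer ids.
links = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
ranks = [(1, 1.0), (2, 0.5), (3, 0.25), (4, 0.125)]

link_parts = hash_partition(links, 2)
rank_parts = hash_partition(ranks, 2)

# Each partition pair is joined independently; no cross-partition lookups.
joined = [
    row
    for link_part, rank_part in zip(link_parts, rank_parts)
    for row in local_join(link_part, rank_part)
]
```

In an iterative job that joins the same two datasets on every pass, partitioning them consistently once and keeping them in memory is what avoids repeated data movement.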