Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica

Summary

Many cluster computing frameworks provide abstractions for accessing a cluster's computational resources, but they lack an abstraction for distributed memory. The authors argue that this gap hurts an important class of applications: those that reuse intermediate results across multiple computations, including iterative algorithms and interactive data mining tools. Existing solutions such as Pregel and HaLoop support only specific computation patterns and perform data sharing implicitly for those patterns.

In this paper, the authors propose a new abstraction called resilient distributed datasets (RDDs), which lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. RDDs expose an interface based on coarse-grained transformations and achieve fault tolerance by logging the transformations used to build a dataset (its lineage), so that lost data can be recomputed rather than replicated. The authors implement RDDs in a system called Spark, then evaluate RDDs and Spark through both microbenchmarks and measurements of user applications.
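The idea of recovering data from logged transformations can be illustrated with a minimal single-process sketch. `ToyRDD` and all of its methods are hypothetical names invented for this note; they are not Spark's API, and `persist` here is eager for simplicity, whereas Spark's is lazy.

```python
class ToyRDD:
    """Toy stand-in for an RDD: remembers how it was built (its lineage)."""

    def __init__(self, source=None, parent=None, transform=None):
        self.source = source        # base data as a list of partitions (root only)
        self.parent = parent        # the dataset this one was derived from
        self.transform = transform  # function applied to one whole parent partition
        self.cache = {}             # partition index -> materialized partition

    @property
    def num_partitions(self):
        if self.parent is None:
            return len(self.source)
        return self.parent.num_partitions

    def map(self, f):
        # Coarse-grained: the same function is applied to every element.
        return ToyRDD(parent=self, transform=lambda part: [f(x) for x in part])

    def filter(self, pred):
        return ToyRDD(parent=self, transform=lambda part: [x for x in part if pred(x)])

    def persist(self):
        # Assumption: materialize eagerly to keep the sketch short.
        for i in range(self.num_partitions):
            self.cache[i] = self.compute(i)
        return self

    def compute(self, i):
        # Serve from memory if present; otherwise replay the lineage.
        if i in self.cache:
            return self.cache[i]
        if self.parent is None:
            return list(self.source[i])
        return self.transform(self.parent.compute(i))

    def collect(self):
        return [x for i in range(self.num_partitions) for x in self.compute(i)]


base = ToyRDD(source=[[1, 2, 3], [4, 5, 6]])
even_squares = base.map(lambda x: x * x).filter(lambda x: x % 2 == 0).persist()

before = even_squares.collect()
del even_squares.cache[1]       # simulate losing one in-memory partition
after = even_squares.collect()  # partition 1 is rebuilt from its lineage
```

Only the lost partition is recomputed, and only the short chain of transformations is stored, not a copy of the data. That is the trade-off the coarse-grained interface makes possible.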

Contributions

The authors design and implement RDDs in a system called Spark, which provides a convenient language-integrated programming interface.
The authors conduct extensive experiments to evaluate the effectiveness and performance of RDDs and Spark.
The authors also demonstrate the generality of RDDs by implementing the Pregel and HaLoop programming models on top of Spark.

Strengths

RDDs provide abstractions for more general reuse rather than optimizations for specific computation patterns.
RDDs can explicitly persist intermediate results in memory.
RDDs can control the partitioning to optimize data placement.
RDDs provide a rich set of operators to manipulate intermediate results.
According to the experimental results, RDDs atop Spark are much faster than other frameworks such as Hadoop for iterative applications.
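To make the partitioning point concrete, here is a small sketch of why controlling data placement helps: if two keyed datasets are partitioned with the same function, matching keys land in the same partition and a join can proceed partition by partition without moving data. The helper names (`partition_of`, `partition_by`, `copartitioned_join`) and the sample data are assumptions made for illustration, not Spark functions.

```python
import zlib


def partition_of(key, num_partitions):
    # Stable hash so that the same key always maps to the same partition.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_by(pairs, num_partitions):
    """Split (key, value) pairs into partitions by hashing the key."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        parts[partition_of(key, num_partitions)].append((key, value))
    return parts


def copartitioned_join(left_parts, right_parts):
    """Join two datasets that were partitioned with the same function.

    Each partition pair is joined locally; no cross-partition lookup is needed.
    """
    assert len(left_parts) == len(right_parts)
    joined = []
    for left, right in zip(left_parts, right_parts):
        index = {}
        for key, value in left:
            index.setdefault(key, []).append(value)
        for key, other in right:
            for value in index.get(key, ()):
                joined.append((key, (value, other)))
    return joined


# Hypothetical data: outgoing links and current ranks, keyed by page.
links = [("a", "b"), ("b", "c"), ("c", "a")]
ranks = [("a", 1.0), ("b", 1.0), ("c", 1.0)]

link_parts = partition_by(links, 2)
rank_parts = partition_by(ranks, 2)
joined = copartitioned_join(link_parts, rank_parts)
```

In an iterative job that joins the same large dataset on every iteration, partitioning it once and keeping it in memory avoids repeating that data movement each time.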

Weaknesses

RDDs suit a limited range of application scenarios. They work best for batch applications that apply the same operation to all elements of a dataset, and they are less suitable for applications that make asynchronous fine-grained updates to shared state, such as a storage system for a web application or an incremental web crawler.
