Spark RDD – Resilient Distributed Dataset
Spark RDD stands for 'Resilient Distributed Dataset', a key concept in Apache Spark for working with big data. An RDD is a fault-tolerant collection of elements that can be operated on in parallel. The key idea of Resilient Distributed Datasets (RDDs) is that they let you work with distributed data collections much as you would work with local data collections. This makes Spark easy to use: users write less code thanks to rich APIs offered in several standard programming languages. The same data abstraction is used across the set of libraries that ship with Apache Spark, such as SQL, streaming, graph processing, and machine learning.
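As a minimal sketch of this idea (the application name, master URL, and sample data are assumptions for illustration, not part of the original text), the following Scala snippet creates an RDD from a local collection and operates on it much like an ordinary Scala collection:

import org.apache.spark.{SparkConf, SparkContext}

object RddIntro {
  def main(args: Array[String]): Unit = {
    // Local Spark context for illustration; app name and master URL are assumed values.
    val sc = new SparkContext(new SparkConf().setAppName("rdd-intro").setMaster("local[*]"))

    // Distribute a local collection across the cluster as an RDD.
    val numbers = sc.parallelize(1 to 10)

    // Work with the distributed data as if it were a local collection.
    val squares = numbers.map(n => n * n)

    // collect() brings the results back to the driver program.
    println(squares.collect().mkString(", "))   // 1, 4, 9, ..., 100

    sc.stop()
  }
}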
RDDs are the key approach to working on large quantities of data in Apache Spark: immutable collections of objects partitioned across a cluster. Partitions of the data are typically created through parallel transformations such as map, filter, and groupBy. In other words, Spark transformations build RDDs from other RDDs through operations like map, filter, groupBy, join, or union. In contrast, Spark actions return an actual result or write it to storage, with operations like count, collect, or save. RDDs have controllable persistence, and a key benefit is that cached partitions make Spark analytics fast; that means caching data in RAM if it fits in memory. In terms of fault tolerance, RDDs are automatically rebuilt on failure. This is possible because an RDD tracks the transformations that created it (its lineage) and can re-compute lost data. Please see another article on using the Apache Spark tool for more information.
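The sketch below illustrates the distinction between transformations (which only record lineage) and actions (which trigger computation), along with caching; the input and output paths and the choice of word-count logic are hypothetical examples, not taken from the original text:

import org.apache.spark.{SparkConf, SparkContext}

object RddOps {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-ops").setMaster("local[*]"))

    // Transformations are lazy: they only describe new RDDs and record lineage.
    val lines  = sc.textFile("/tmp/app.log")                  // hypothetical input path
    val errors = lines.filter(_.contains("ERROR"))            // transformation
    val words  = errors.flatMap(_.split("\\s+"))              // transformation
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _)    // transformation

    // Keep the computed partitions in memory so they can be reused by several actions.
    counts.cache()

    // Actions trigger the computation described by the lineage above.
    println(counts.count())                                   // action: number of distinct words in error lines
    counts.saveAsTextFile("/tmp/error-word-counts")           // action: write results to storage (hypothetical path)

    sc.stop()
  }
}

If a partition of counts is lost, Spark can re-run just the recorded transformations on the affected input data rather than recomputing everything, which is the lineage-based fault tolerance described above.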
Spark RDD details
The following video provides more information about the topic: