Spark RDDs are very simple at the same time very important concept in Apache Spark. Most of you might be knowing the full form of RDD, it is Resilient Distributed Datasets. Resilient because RDDs are immutable(can’t be modified once created) and fault tolerant, Distributed because it is distributed across cluster and Dataset because it holds data.
So why RDD? Apache Spark lets you treat your input files almost like any other variable, which you cannot do in Hadoop MapReduce. RDDs are automatically distributed across the network by means of Partitions.
RDDs are divided into smaller chunks called Partitions, and when you execute some action, a task is launched per partition. So it means, the more the number of partitions, the more the parallelism. Spark automatically decides the number of partitions that an RDD has to be divided into but you can also specify the number of partitions when creating an RDD. These partitions of an RDD is distributed across all the nodes in the network.
Creating an RDD
Creating an RDD is easy, it can be created either from an external file or by parallelizing collections in your driver. For example,
The first line creates an RDD from an external file, and the second line creates an RDD from a list of Strings. Note that the argument ‘3’ in the method call sc.textFile() specifies the number of partitions that has to be created. If you don’t want to specify the number of partitions, then you can simply call sc.textFile(“some_file”).
There are two types of operations that you can perform on an RDD- Transformations and Actions. Transformation applies some function on a RDD and creates a new RDD, it does not modify the RDD that you apply the function on.(Remember that RDDs are resilient/immutable). Also, the new RDD keeps a pointer to it’s parent RDD.
When you call a transformation, Spark does not execute it immediately, instead it creates a lineage. A lineage keeps track of what all transformations has to be applied on that RDD, including from where it has to read the data. For example, consider the below example
sc.textFile() and rdd.filter() do not get executed immediately, it will only get executed once you call an Action on the RDD - here filtered.count(). An Action is used to either save result to some location or to display it. You can also print the RDD lineage information by using the command
filtered.toDebugString(filtered is the RDD here).
RDDs can also be thought of as a set of instructions that has to be executed, first instruction being the load instruction.
You can cache an RDD in memory by calling
rdd.cache(). When you cache an RDD, it’s Partitions are loaded into memory of the nodes that hold it.
Caching can improve the performance of your application to a great extent. In the previous section you saw that when an action is performed on a RDD, it executes it’s entire lineage. Now imagine you are going to perform an action multiple times on the same RDD which has a long lineage, this will cause an increase in execution time. Caching stores the computed result of the RDD in the memory thereby eliminating the need to recompute it every time. You can think of caching as if it is breaking the lineage, but it does remember the lineage so that it can be recomputed in case of a node failure.