What is Apache Spark?
Apache Spark is an in-memory distributed data processing engine used for processing and analyzing large datasets. Spark presents a simple interface for the user to perform distributed computing across an entire cluster.
Spark does not have its own file system, so it depends on external storage systems for data processing. It can run on HDFS or on cloud-based file systems like Amazon S3 and Azure Blob Storage.
Besides cloud-based file systems, it can also work with NoSQL databases like Cassandra and MongoDB.
Spark jobs can be written in Java, Scala, Python, R, and SQL. It provides out-of-the-box libraries for machine learning, graph processing, streaming, and SQL-like data processing. We will go into detail about each of these libraries later in the article.
The engine was developed at the University of California, Berkeley's AMPLab and was donated to the Apache Software Foundation in 2013.
Need for Spark
The traditional way of processing data on Hadoop is with its MapReduce framework. MapReduce involves a lot of disk I/O, which makes processing slower. As data analytics became more mainstream, Spark's creators saw a need to speed up processing by reducing disk utilization during job runs.
Apache Spark addresses this issue by performing the computation in the main memory (RAM) of the worker nodes and by not storing intermediate results of the computation on disk.
Secondly, it doesn't actually load the data until it is required for computation. It converts the given set of commands into a Directed Acyclic Graph (DAG) and then executes it lazily. This removes the need to read data from disk and write the output of each step back, as is the case with Hadoop MapReduce. As a result, Spark claims to run in-memory jobs up to 100x faster than the corresponding MapReduce jobs.
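To make the lazy evaluation concrete, here is a minimal sketch (the numbers.txt file is just an illustrative input) showing that transformations only build up the DAG, while an action triggers the actual computation:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class LazyEvaluationExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("LazyEvaluationExample");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Transformations are lazy: these two calls only add steps to the DAG,
        // nothing is read from disk yet
        JavaRDD<String> lines = sc.textFile("numbers.txt");
        JavaRDD<Integer> lengths = lines.map(String::length);

        // Only an action such as count() makes the DAG actually execute,
        // reading the file and computing the results in memory
        System.out.println("Lines read: " + lengths.count());

        sc.close();
    }
}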
Spark Architecture
[Spark architecture diagram. Credit: https://spark.apache.org/]
Spark Core uses a master-slave architecture. The Driver program runs on the master node and distributes tasks to Executors running on the various slave nodes. Each Executor runs in its own separate JVM and performs the tasks assigned to it in multiple threads.
Each Executor also has a cache associated with it. Cached data can be held in memory as well as written to disk on the worker node. The Executors execute the tasks and send the results back to the Driver.
The Driver communicates with the nodes in the cluster using a cluster manager, such as Spark's built-in standalone cluster manager, Mesos, or YARN. The batch programs we write get executed on the Driver node.
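As a rough sketch of how the cluster manager is chosen (the host name and port below are hypothetical), it is usually selected through the master URL of the SparkConf, or passed to spark-submit:

import org.apache.spark.SparkConf;

public class MasterUrlExamples {
    public static void main(String[] args) {
        // Run locally, using all available cores of the Driver machine
        SparkConf localConf = new SparkConf().setAppName("MyApp").setMaster("local[*]");

        // Connect to Spark's built-in (standalone) cluster manager -- host and port are hypothetical
        SparkConf standaloneConf = new SparkConf().setAppName("MyApp").setMaster("spark://master-host:7077");

        // For YARN or Mesos, the master is usually supplied to spark-submit instead, e.g.:
        //   spark-submit --master yarn --deploy-mode cluster my-app.jar
    }
}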
Simple Spark Job Using Java
We have discussed a lot about Spark and its architecture, so now let's take a look at a simple Spark job which computes the sum of space-separated numbers from a given text file:
32 23 45 67 2 5 7 9
12 45 68 73 83 24 1
12 27 51 34 22 14 31
...
We will start off by importing the dependency for Spark Core, which contains the Spark processing engine. It has no further requirements, as it can use the local file system to read the data file and write the results:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>2.2.3</version>
</dependency>
With the core setup, let's proceed to write our Spark batch!
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class CalculateFileSum {
    public static final String SPACE_DELIMITER = " ";

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("SparkFileSumApp");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Read the file and split every line into individual, non-empty number strings
        JavaRDD<String> input = sc.textFile("numbers.txt");
        JavaRDD<String> numberStrings = input.flatMap(s -> Arrays.asList(s.split(SPACE_DELIMITER)).iterator());
        JavaRDD<String> validNumberString = numberStrings.filter(string -> !string.isEmpty());

        // Convert the strings to integers and reduce them to a single sum
        JavaRDD<Integer> numbers = validNumberString.map(numberString -> Integer.valueOf(numberString));
        int finalSum = numbers.reduce((x, y) -> x + y);
        System.out.println("Final sum is: " + finalSum);

        sc.close();
    }
}
Running this piece of code should yield:
Final sum is: 687
The JavaSparkContext object we have created acts as a connection to the cluster. The Spark context we have created here has been allocated all of the available local processors, hence the *.
The most basic abstraction in Spark is the RDD, which stands for Resilient Distributed Dataset. It is resilient and distributed since the data is replicated across the cluster and can be recovered if any of the nodes crash.
Another benefit of distributing data is that it can be processed in parallel, thus promoting horizontal scaling. Another important feature of RDDs is that they are immutable. If we apply any action or transformation to a given RDD, the result is a new RDD.
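To illustrate immutability, here is a small sketch (using a hypothetical in-memory list rather than a file): applying a transformation leaves the original RDD untouched and returns a new one:

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ImmutableRddExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("ImmutableRddExample");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

        // filter() does not change 'numbers'; it returns a brand new RDD
        JavaRDD<Integer> evens = numbers.filter(n -> n % 2 == 0);

        System.out.println("Original RDD count: " + numbers.count()); // 5
        System.out.println("Filtered RDD count: " + evens.count());   // 2

        sc.close();
    }
}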
In the file-sum job above, we read the contents of the input file as an RDD of strings and converted them into numbers. We then applied the reduce function to sum up the values before displaying the result on the console.
Introduction to Spark Libraries
Spark provides us with a number of built-in libraries which run on top of Spark Core.
Spark SQL
Spark SQL provides a SQL-like interface to perform processing of structured data. When the user executes a SQL query, a batch job is internally kicked off by Spark SQL which manipulates the RDDs as per the query.
The benefit of this API is that those familiar with RDBMS-style querying find it easy to transition to Spark and write jobs for it.
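As a minimal sketch of how this looks in Java (assuming the spark-sql dependency is on the classpath and a hypothetical people.json file with name and age fields), a SQL query runs against a temporary view registered over a Dataset:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkSqlExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local[*]")
                .appName("SparkSqlExample")
                .getOrCreate();

        // Load structured data -- people.json is a hypothetical file with one JSON object per line
        Dataset<Row> people = spark.read().json("people.json");

        // Register the data as a temporary view so that it can be queried with SQL
        people.createOrReplaceTempView("people");

        // Internally, Spark SQL turns this query into a batch job over the underlying data
        Dataset<Row> adults = spark.sql("SELECT name, age FROM people WHERE age >= 18");
        adults.show();

        spark.stop();
    }
}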
Spark Streaming
Spark Streaming is suited for applications that deal with data flowing in real time, such as processing Twitter feeds.
Spark can integrate with Apache Kafka and other streaming tools to provide fault-tolerant and high-throughput processing capabilities for the streaming data.
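Here is a minimal word-count sketch using the spark-streaming module (the socket source on localhost:9999 is a hypothetical stand-in; a Kafka source would use the separate Kafka connector):

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

import scala.Tuple2;

public class StreamingWordCount {
    public static void main(String[] args) throws InterruptedException {
        // Streaming needs at least two local threads: one to receive data, one to process it
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("StreamingWordCount");

        // Process the incoming stream in 1-second micro-batches
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

        // Hypothetical source: a plain TCP socket on localhost:9999 (e.g. fed with `nc -lk 9999`)
        JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);

        // Split each line into words and count the occurrences within every batch
        JavaDStream<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
        JavaPairDStream<String, Integer> wordCounts = words
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey((a, b) -> a + b);
        wordCounts.print();

        jssc.start();
        jssc.awaitTermination();
    }
}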
Spark MLlib
MLlib is short for the Machine Learning Library that Spark provides. It includes common learning algorithms like classification, regression, clustering, and recommendation.
These algorithms can be used to train a model on the underlying data. Due to the extremely fast data processing supported by Spark, machine learning models can be trained in a relatively short period of time.
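As a rough sketch of that workflow (assuming the spark-mllib dependency and a tiny hand-made training set, both purely for illustration), a classifier such as logistic regression is trained with fit() and applied with transform():

import java.util.Arrays;
import java.util.List;

import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.linalg.VectorUDT;
import org.apache.spark.ml.linalg.Vectors;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class SimpleClassificationExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local[*]")
                .appName("SimpleClassificationExample")
                .getOrCreate();

        // A tiny hand-made training set: a label and a feature vector per row (illustration only)
        List<Row> data = Arrays.asList(
                RowFactory.create(1.0, Vectors.dense(0.0, 1.1, 0.1)),
                RowFactory.create(0.0, Vectors.dense(2.0, 1.0, -1.0)),
                RowFactory.create(0.0, Vectors.dense(2.0, 1.3, 1.0)),
                RowFactory.create(1.0, Vectors.dense(0.0, 1.2, -0.5)));
        StructType schema = new StructType(new StructField[]{
                new StructField("label", DataTypes.DoubleType, false, Metadata.empty()),
                new StructField("features", new VectorUDT(), false, Metadata.empty())});
        Dataset<Row> training = spark.createDataFrame(data, schema);

        // Train a logistic regression model and apply it back to the training data
        LogisticRegression lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01);
        lr.fit(training).transform(training).show();

        spark.stop();
    }
}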
GraphX
As the name indicates, GraphX is the Spark API for processing graphs and performing graph-parallel computation.
The user can create graphs and perform operations like joining and transforming them. As with MLlib, GraphX comes with built-in graph algorithms for PageRank, triangle counting, and more.
Conclusion
Apache Spark is a platform of choice for large-scale data processing due to its blazing processing speed, ease of use, and fault-tolerant features.
In this article, we took a look at the architecture of Spark, used an example to see what lies behind its lightning-fast processing speed, and explored the popular Spark libraries and their features.