Apache Spark is a cluster computing technology designed for fast computation. It is based on the Hadoop MapReduce model and extends that model to efficiently support more types of computation, such as interactive queries and stream processing. This tutorial explains the basics of Spark Core programming.
The main feature of Spark is in-memory cluster computing, which increases the processing speed of an application. Spark is designed to cover a wide range of workloads, such as batch applications, iterative algorithms, interactive queries, and streaming. Besides supporting all of these workloads within one system, it reduces the burden of maintaining separate tools.
Speed: Spark can run an application in a Hadoop cluster up to 100 times faster than Hadoop MapReduce when the data is in memory, and up to 10 times faster when running on disk. This is possible because Spark reduces the number of read/write operations to disk and keeps intermediate processing data in memory.
Supports multiple languages: Spark provides built-in APIs in Java, Scala, and Python, so applications can be written in any of these languages. It also comes with 80 high-level operators for interactive querying.
Apache Spark Core: Spark Core is the underlying general execution engine for the Spark platform, on which all other functionality is built. It provides in-memory computing and can reference datasets in external storage systems.
Spark SQL: Spark SQL is a component on top of Spark Core that introduces a data abstraction called SchemaRDD, which provides support for structured and semi-structured data.
Spark Streaming: Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming analytics. It ingests data in mini-batches and performs RDD (Resilient Distributed Dataset) transformations on those mini-batches.
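The mini-batch idea can be illustrated without Spark itself. The following is a minimal plain-Python sketch, not the Spark Streaming API: the function names (micro_batches, transform, process_stream) are hypothetical, and batches are cut by record count here only to keep the example deterministic, whereas a real streaming engine would typically cut them by time interval.

```python
from collections import Counter
from itertools import islice
from typing import Iterable, Iterator, List


def micro_batches(stream: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Group an incoming stream of records into fixed-size mini-batches.

    Assumption: batches are cut by record count to keep the sketch simple.
    """
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def transform(batch: List[str]) -> Counter:
    """A batch-style transformation applied to one mini-batch: a word count."""
    words = (word.lower() for line in batch for word in line.split())
    return Counter(words)


def process_stream(stream: Iterable[str], batch_size: int = 2) -> List[Counter]:
    """Apply the same transformation to every mini-batch, in arrival order."""
    return [transform(batch) for batch in micro_batches(stream, batch_size)]


if __name__ == "__main__":
    lines = ["spark streaming", "spark core", "mini batches"]
    for index, counts in enumerate(process_stream(lines)):
        print(index, dict(counts))
```

The point of the design is that the per-batch logic in transform is ordinary batch code; streaming is obtained by running it repeatedly on small slices of the input.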
MLlib: MLlib (Machine Learning Library) is a distributed machine learning framework that sits on top of Spark and benefits from Spark's distributed, memory-based architecture. According to benchmarks run by the MLlib developers against Alternating Least Squares (ALS) implementations, Spark MLlib is nine times as fast as the Hadoop disk-based version of Apache Mahout.
GraphX: GraphX is a distributed graph-processing framework on top of Spark. It provides an API for expressing graph computations that can model user-defined graphs by using the Pregel abstraction API. It also provides an optimized runtime for this abstraction.
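To give a feel for the Pregel abstraction, here is a minimal plain-Python sketch of a vertex-centric computation organized into supersteps. It is not the GraphX API. The function name pregel_shortest_paths and the Graph type are hypothetical, and the sketch assumes that every vertex appears as a key in the graph and that edge weights are non-negative.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: vertex -> list of (neighbor, edge weight).
Graph = Dict[str, List[Tuple[str, float]]]


def pregel_shortest_paths(graph: Graph, source: str) -> Dict[str, float]:
    """Single-source shortest paths written as Pregel-style supersteps.

    Assumptions: every vertex is a key of `graph`; weights are non-negative.
    """
    distance = {vertex: float("inf") for vertex in graph}
    # Initial message: the source learns that it is at distance 0.
    inbox: Dict[str, List[float]] = {source: [0.0]}

    while inbox:  # the computation halts when no vertex received a message
        outbox: Dict[str, List[float]] = {}
        for vertex, messages in inbox.items():
            best = min(messages)  # merge all incoming messages
            if best < distance[vertex]:
                distance[vertex] = best  # update the vertex state
                for neighbor, weight in graph[vertex]:
                    # Tell each neighbor about the improved path through us.
                    outbox.setdefault(neighbor, []).append(best + weight)
        inbox = outbox  # messages are delivered in the next superstep
    return distance


if __name__ == "__main__":
    example = {"a": [("b", 1.0), ("c", 4.0)], "b": [("c", 2.0)], "c": [], "d": []}
    print(pregel_shortest_paths(example, "a"))
```

Each vertex only reads its own messages and writes to its own neighbors, which is what makes this style of computation straightforward to distribute across a cluster.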
Data Sharing using the Spark RDD
Data sharing is slow in MapReduce because of replication, serialization, and disk I/O.
Most Hadoop applications spend more than 90% of their time performing HDFS read/write operations.
Recognizing this problem in MapReduce, researchers developed a specialized framework called Apache Spark. The key idea of Spark is the Resilient Distributed Dataset (RDD), which supports in-memory processing. Spark keeps the state of a computation in memory as an object across jobs, and that object can be shared between those jobs. Sharing data in memory is 10 to 100 times faster than sharing it over the network or through disk.
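The difference between the two sharing styles can be sketched in plain Python. This is an illustration only, not Spark or Hadoop code: the function names (expensive_stage, share_via_disk, share_in_memory) are hypothetical, a local temporary file stands in for HDFS, and pickle stands in for whatever serialization a real system would use.

```python
import os
import pickle
import tempfile
from typing import Callable, List

Job = Callable[[List[int]], int]


def expensive_stage(numbers: List[int]) -> List[int]:
    """Intermediate result that several later jobs want to reuse."""
    return [n * n for n in numbers]


def share_via_disk(data: List[int], jobs: List[Job]) -> List[int]:
    """MapReduce-style sharing: write to stable storage, reload for each job."""
    fd, path = tempfile.mkstemp(suffix=".pkl")
    try:
        with os.fdopen(fd, "wb") as handle:
            pickle.dump(expensive_stage(data), handle)  # serialize + disk write
        results = []
        for job in jobs:
            with open(path, "rb") as handle:
                intermediate = pickle.load(handle)  # disk read + deserialize
            results.append(job(intermediate))
        return results
    finally:
        os.remove(path)


def share_in_memory(data: List[int], jobs: List[Job]) -> List[int]:
    """RDD-style sharing: keep the intermediate result as a live object."""
    intermediate = expensive_stage(data)  # computed once, stays in memory
    return [job(intermediate) for job in jobs]


if __name__ == "__main__":
    numbers = list(range(5))
    jobs = [sum, max, len]
    print(share_via_disk(numbers, jobs))
    print(share_in_memory(numbers, jobs))
```

Both functions return the same answers; the disk-based version simply pays for one serialization and write plus one read and deserialization per job, which is the overhead the in-memory approach avoids.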