As previously mentioned, Spark is a fast, highly scalable, general-purpose engine for large-scale data processing.
It provides functions for importing data from a distributed data source such as an HDFS file system or Amazon S3, along with a simple and efficient mechanism for processing that data: you assign transformations that reshape the data, and actions that compute results from it.
All of this can run on a cluster using the same code you developed on your own desktop, so the full power of horizontal scaling is available to you: if you have a massive data-processing job, all you need to do is throw more hardware at it until the cluster can handle the whole thing.
Spark runs programs up to 100x faster than Hadoop MapReduce when running in memory, or 10x faster on disk.
The reason it is so much faster is that Spark uses a directed acyclic graph (DAG) engine to optimize its workflows: transformations are evaluated lazily, and nothing actually executes until you issue a command (an action) to collect the results and do something with them.
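The same lazy style can be sketched in plain Python using generators (an analogy only, not Spark itself): chained "transformations" merely describe a pipeline, and no work happens until a terminal operation consumes it.

```python
# Plain-Python analogy for Spark's lazy DAG evaluation.
numbers = range(1_000_000)

# "Transformations": nothing is computed yet; we only describe a pipeline.
squared = (n * n for n in numbers)
evens = (n for n in squared if n % 2 == 0)

# "Action": only now does the whole pipeline actually execute.
total = sum(evens)
print(total)
```

Spark takes this idea much further: because it sees the full graph of transformations before running anything, it can optimize and distribute the work across the cluster.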
You can code in Python, Java or Scala.
Scala is recommended because Spark itself is written in Scala, and it offers the most efficient, reliable, and concise code for working with Spark. Scala's functional programming model is ideal for distributed processing and yields significantly more concise code, with far less boilerplate, than the equivalent Java. Python is significantly slower than Scala, but Scala code in Spark ends up looking a lot like Python code anyway.
Spark is built around the central concept of the Resilient Distributed Dataset (RDD) – a convenient abstraction over a large data set on which you perform transformations and actions. An RDD abstracts away all the complexity of fault tolerance and of distributing the processing of a large data set.
The Spark shell instantiates a SparkContext object, which is responsible for creating RDDs.
Spark consists of many components. Spark Core covers the basics of working with RDDs: transforming them, collecting their results, tallying them, and reducing things together.
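To make the transformation/action split concrete, here is a tiny single-machine mock of the RDD interface in plain Python. This is an illustration, not real Spark: the class and method names deliberately mirror the Spark API (`parallelize`, `map`, `filter`, `collect`, `count`, `reduce`), but everything runs locally.

```python
from functools import reduce as _reduce


class MockRDD:
    """Toy stand-in for a Spark RDD: transformations return a new
    MockRDD lazily; actions actually run the accumulated pipeline."""

    def __init__(self, compute):
        self._compute = compute  # zero-argument function producing the data

    # --- transformations (lazy: just wrap another compute step) ---
    def map(self, f):
        return MockRDD(lambda: [f(x) for x in self._compute()])

    def filter(self, pred):
        return MockRDD(lambda: [x for x in self._compute() if pred(x)])

    # --- actions (eager: force the pipeline to execute) ---
    def collect(self):
        return self._compute()

    def count(self):
        return len(self._compute())

    def reduce(self, f):
        return _reduce(f, self._compute())


class MockSparkContext:
    """Stand-in for SparkContext: its role here is just creating RDDs."""

    def parallelize(self, data):
        return MockRDD(lambda: list(data))


sc = MockSparkContext()
rdd = sc.parallelize([1, 2, 3, 4, 5])
result = rdd.map(lambda x: x * 10).filter(lambda x: x > 20).collect()
print(result)  # [30, 40, 50]
```

In real Spark the same chain of calls would be partitioned across the cluster, with the DAG engine deciding how to schedule the work and recover from failures.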
Several libraries are built on top of Spark Core to simplify more complex operations:
- Spark Streaming – a technology built on top of Spark that handles micro-batches of data as they come in in real time, from sources such as a fleet of web servers or the sensors of an IoT application, for example;
- Spark SQL – a simple SQL-like interface to Spark, complete with a database-style connection so you can run SQL-like queries on distributed data processed by Spark;
- MLlib – lets you run machine-learning operations on huge data sets. It is somewhat limited in its current form, with feature improvements ongoing, but it is already capable of things like linear regression or making recommendations based on user behavior, using built-in routines distributed across the cluster;
- GraphX – provides a framework for extracting information about the attributes and properties of a graph network, along with a generalized mechanism called Pregel for processing graph networks in an efficient and distributed manner.