Apache Spark is an open-source framework for general-purpose data analytics on distributed computing clusters, with libraries for machine learning, interactive queries, and graph analytics.
Built around a multi-stage, in-memory compute engine, Spark can process data considerably faster than Hadoop’s MapReduce, particularly for iterative and interactive workloads.
Spark consists of Spark Core and a set of libraries with user-friendly APIs for Scala (its native language), Java, and Python, along with Spark SQL for SQL queries, making it easier to use than Hadoop.
Additional libraries, built atop the core, support diverse workloads such as streaming, SQL, and machine learning.
According to the Apache Spark website:
‘Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs to allow data workers to efficiently execute streaming, machine learning or SQL workloads that require fast iterative access to datasets.’
It can run on a Hadoop cluster with YARN, on Mesos, or in standalone mode, and it works with a variety of storage systems such as HDFS, MapR File System, Cassandra, OpenStack Swift, Amazon S3, and Kudu.
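To make the "fast iterative access to datasets" point concrete, here is a minimal sketch in plain Python. It is a toy model, not Spark code: `CountingSource`, `mean_step`, `iterate_from_storage`, and `iterate_from_memory` are hypothetical names invented for this illustration. It assumes a simplified picture in which a chain of MapReduce jobs re-reads its input on every pass, while an in-memory engine reads once and iterates over resident data.

```python
# Toy model only: plain Python, not Spark. All names here are hypothetical.


class CountingSource:
    """Stands in for a dataset on slow storage and counts full scans."""

    def __init__(self, records):
        self._records = list(records)
        self.scans = 0

    def read(self):
        self.scans += 1
        return list(self._records)


def mean_step(values, estimate, rate=0.5):
    """One iteration: nudge the estimate toward the mean of the values."""
    gradient = sum(estimate - v for v in values) / len(values)
    return estimate - rate * gradient


def iterate_from_storage(source, iterations):
    """Disk-bound pattern: every pass re-reads the input."""
    estimate = 0.0
    for _ in range(iterations):
        estimate = mean_step(source.read(), estimate)
    return estimate


def iterate_from_memory(source, iterations):
    """In-memory pattern: read once, keep the data resident, iterate."""
    cached = source.read()
    estimate = 0.0
    for _ in range(iterations):
        estimate = mean_step(cached, estimate)
    return estimate


disk_source = CountingSource([1.0, 2.0, 3.0, 4.0])
memory_source = CountingSource([1.0, 2.0, 3.0, 4.0])

disk_result = iterate_from_storage(disk_source, iterations=10)
memory_result = iterate_from_memory(memory_source, iterations=10)

# Same answer either way; only the number of storage scans differs.
print(disk_source.scans, memory_source.scans)  # prints: 10 1
```

Both variants compute the same estimate; the difference is ten scans of the source versus one. In this simplified picture, that saved I/O is where an in-memory engine gains on iterative workloads.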
A Data Scientist’s Perspective
A number of Apache Spark features may be quite attractive to the data scientist:
- APIs supporting related tasks such as data access, ETL, and integration,
- an interactive environment with access to a rich set of libraries from a REPL (Read-Evaluate-Print Loop),
- transparent distribution of data operations across the cluster, even as you type,
- REPL code that can be used in an operational context with little or no modification,
- a scalable, distributed environment,
- memory-resident data for speedy iterative machine learning workloads,
- a built-in machine learning library, MLlib,
- the ability to directly leverage existing Java libraries,
- the versatile RDD (Resilient Distributed Dataset) abstraction,
- a functional-style collection API that resembles Scala’s collections API and, for Python users, Python’s own functional idioms,
- Scala’s compelling features for statistical computing.
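The RDD abstraction and its collection-style API from the list above can be sketched with a toy model in plain Python. This is an illustration only, not Spark's implementation: `ToyRDD` is a hypothetical class, and only its method names (`parallelize`, `map`, `filter`, `collect`, `reduce`) are borrowed from Spark's RDD API. The sketch assumes three simplified properties: the data is split into partitions, transformations are recorded lazily as a lineage, and an action replays that lineage on each partition.

```python
# Toy model only: plain Python, not Spark's implementation.
from functools import reduce


class ToyRDD:
    """A partitioned, immutable, lazily evaluated collection (hypothetical)."""

    def __init__(self, partitions, lineage=()):
        self._partitions = [list(p) for p in partitions]
        self._lineage = tuple(lineage)  # recorded transformations, in order

    @classmethod
    def parallelize(cls, data, num_partitions=2):
        data = list(data)
        # Deal elements round-robin into partitions.
        return cls([data[i::num_partitions] for i in range(num_partitions)])

    def map(self, fn):
        # Transformation: records the step and computes nothing yet.
        return ToyRDD(self._partitions, self._lineage + (("map", fn),))

    def filter(self, predicate):
        # Transformation: also lazy.
        return ToyRDD(self._partitions, self._lineage + (("filter", predicate),))

    def _compute(self, partition):
        # Replay the lineage on one partition. In this model, a lost
        # partition could be rebuilt from its source data the same way.
        for kind, fn in self._lineage:
            if kind == "map":
                partition = [fn(x) for x in partition]
            else:
                partition = [x for x in partition if fn(x)]
        return partition

    def collect(self):
        # Action: triggers the computation, partition by partition.
        return [x for part in self._partitions for x in self._compute(part)]

    def reduce(self, fn):
        # Action: reduce within each non-empty partition, then combine.
        computed = [self._compute(part) for part in self._partitions]
        partials = [reduce(fn, part) for part in computed if part]
        return reduce(fn, partials)


numbers = ToyRDD.parallelize(range(1, 11), num_partitions=3)
evens_squared = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

total = evens_squared.reduce(lambda a, b: a + b)
print(sorted(evens_squared.collect()), total)  # prints: [4, 16, 36, 64, 100] 220
```

The chained `filter(...).map(...)` calls read much like operations on an ordinary Scala or Python collection, which is the appeal of the functional style. In the toy, as in the general RDD idea, nothing runs until `collect` or `reduce` is called.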