
High Level Architecture of Spark

As previously mentioned, Spark is a highly scalable, fast, general-purpose engine for large-scale data processing.
It lets you import data from a distributed data source such as an HDFS file system, S3 or similar, and provides a simple, efficient mechanism for processing that data by applying functions that transform it or perform various actions upon it.

All of this can be performed on a cluster using the same code you ran on your own desktop.
The full power of horizontal scaling is therefore available to you: if you have a massive data-processing job, you simply throw more hardware at it until the cluster can handle the whole thing.
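As a rough sketch of what that looks like in code (the app name and master URL below are placeholders, not from the original post), the only thing that usually changes between a desktop run and a cluster run is the master the SparkContext is pointed at:

import org.apache.spark.{SparkConf, SparkContext}

// Desktop: use every local core
val localConf   = new SparkConf().setAppName("MyJob").setMaster("local[*]")

// Cluster: same job code, different master URL (placeholder host)
val clusterConf = new SparkConf().setAppName("MyJob").setMaster("spark://master-host:7077")

val sc = new SparkContext(localConf)   // swap in clusterConf; the processing code is unchanged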

DAG Engine

Spark runs programs up to 100x faster than Hadoop MapReduce when running in memory, or 10x faster on disk.

The reason that it’s so much faster is that Spark uses a directed acyclic graph (DAG) engine to optimize its workflows.

Spark is lazily evaluated: nothing actually happens until you issue an action that collects the results and does something with them, at which point Spark works out the optimal DAG for the whole workflow and executes it.
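As a minimal sketch of that lazy behaviour (the input path is a placeholder), typed into spark-shell:

// Transformations only describe the workflow; nothing runs yet
val lines    = sc.textFile("data/input.txt")
val lengths  = lines.map(line => line.length)
val longOnes = lengths.filter(_ > 80)

// Only this action makes Spark optimize the DAG and execute the job
val count = longOnes.count()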

You can code in Python, Java or Scala.

Scala is recommended because Spark is written in Scala. It also offers the most efficient, reliable and concise code for dealing with Spark.

The functional programming model available in Scala is ideal for distributed processing and yields significantly more concise code, with far less boilerplate, than the equivalent Java.

Python is significantly slower than Scala; however, Spark code written in Scala looks a lot like the equivalent Python code.

Spark is based on the central concept of the Resilient Distributed Dataset (RDD) – a convenient abstraction over large data sets for performing transformations and actions. An RDD abstracts away all the complexity of fault-tolerance management and distributed processing of large dataset objects.

The Spark shell instantiates a SparkContext object (exposed as sc), which is responsible for creating RDDs.
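For example (a sketch only; the file paths are placeholders), the sc object exposed by the shell can build RDDs from an in-memory collection or from external storage:

// From a local Scala collection
val numbers = sc.parallelize(1 to 1000)

// From text files on local disk or HDFS (placeholder paths)
val localLines = sc.textFile("file:///home/ubuntu/data.txt")
val hdfsLines  = sc.textFile("hdfs:///user/ubuntu/data.txt")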

Spark consists of many components. Spark Core covers the basics: creating RDDs, transforming them, and collecting, tallying and reducing their results.

Several libraries built on top of Spark Core simplify more complex operations:

  • Spark Streaming – a technology built on top of Spark for handling small bursts of data as they arrive in real time, for example from several web servers or from the sensors of an IoT application;
  • Spark SQL – a simple SQL-like interface to Spark, complete with a database-like connection for running SQL queries on distributed data processed by Spark (see the short example after this list);
  • MLlib – lets you run machine-learning operations on huge data sets. It is somewhat limited in its current form, with features still being added, but it already handles things like linear regression and making recommendations based on user behavior, using built-in routines distributed across the cluster;
  • GraphX – provides a framework for extracting information about the attributes and properties of a graph network, plus a generalized mechanism called Pregel for processing graph networks in an efficient and distributed manner.
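As a small illustration of the Spark SQL interface mentioned above (a sketch only; the JSON file, table and column names are invented), a DataFrame can be registered as a temporary view and queried with ordinary SQL:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("SqlSketch").getOrCreate()

// Hypothetical JSON file of user records
val users = spark.read.json("users.json")
users.createOrReplaceTempView("users")

// Run a SQL query against the distributed data
val adults = spark.sql("SELECT name, age FROM users WHERE age >= 18")
adults.show()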

Setting up Apache Spark

Set up an Apache Spark development sandbox running in an LXC container.
I have Ubuntu Desktop 16.04 LTS (Xenial release) installed on my machine.

Installation and Setup Steps – Sandbox

Install LXC
sudo apt install -y lxc

The system now has all the LXC commands and templates available, as well as the Python 3 bindings for scripting LXC.

See: https://help.ubuntu.com/lts/serverguide/lxc.html#lxc-installation

Create a Container

This creates a privileged container called sparksandbox from the Ubuntu distribution, Xenial release, for amd64 architecture:

sudo lxc-create -n sparksandbox -t ubuntu -- -r xenial

##
# The default user is 'ubuntu' with password 'ubuntu'!
# Use the 'sudo' command to run tasks as root in the container.
##

List containers:

sudo lxc-ls -f

NAME          STATE    AUTOSTART  GROUPS  IPV4  IPV6
sparksandbox  STOPPED  0          -       -     -

Start Container
sudo lxc-start -d -n sparksandbox

Get detailed container information. Take note of the container’s IP address shown below:

sudo lxc-info -n sparksandbox

Name:           sparksandbox
State:          RUNNING
PID:            28605
IP:             10.0.3.129
CPU use:        1.01 seconds
BlkIO use:      60.66 MiB
Memory use:     80.02 MiB
KMem use:       5.84 MiB
Link:           vethQRIHXO
 TX bytes:      1.59 KiB
 RX bytes:      12.54 KiB
 Total bytes:   14.13 KiB

Configure Sandbox

Log in to the new container over SSH. The password for the default user ubuntu is ubuntu.
At this point ssh will fail to forward X because xauth is not yet installed in the container.

ssh -X ubuntu@10.0.3.129

To connect using ssh with X11 forwarding, install the xauth package.
First update the package lists so that the container knows about the latest package versions and any new packages in the repositories:

sudo apt-get update

Install xauth package:

sudo apt-get install xauth

Exit the container:

exit

..and re-enter:

ssh -X ubuntu@10.0.3.129

Install Firefox:

sudo apt-get install -y firefox

Install handy tools/utils:

sudo apt-get install -y tree
sudo apt-get install -y git
sudo apt-get install -y unzip
sudo apt-get install -y software-properties-common
sudo add-apt-repository ppa:atareao/sunflower
sudo apt-get update
sudo apt-get install -y sunflower

Installation and Setup Steps – Oracle JDK8, Apache Spark and IntelliJ IDEA

Install Oracle JDK8

Add Oracle’s PPA, then update your package repository:

sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update

Install JDK8

sudo apt-get install -y oracle-java8-installer

Add export JAVA_HOME=/usr/lib/jvm/java-8-oracle (the install path used by the PPA installer) to ~/.bashrc in your home directory.

Install Spark

Install Spark 2.2.1 (pre-built for Apache Hadoop 2.7 and later).
Download archive from: https://spark.apache.org/downloads.html
Unpack archive

tar -xvf spark-2.2.1-bin-hadoop2.7.tgz

Move the resulting folder to /usr/local and create a symbolic link so that multiple versions of Spark can be installed side by side.

sudo mv spark-2.2.1-bin-hadoop2.7 /usr/local/
sudo ln -s /usr/local/spark-2.2.1-bin-hadoop2.7/ /usr/local/spark

Add /usr/local/spark/bin to the PATH variable in /etc/environment.
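For example, assuming the stock Xenial PATH, the line in /etc/environment might end up looking like this (only the final :/usr/local/spark/bin is new):

PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/local/spark/bin"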
Also add SPARK_HOME to your environment in ~/.bashrc

export SPARK_HOME=/usr/local/spark

Adjust the default log level for Spark: rename log4j.properties.template in /usr/local/spark/conf/ to log4j.properties, then edit it and change the root logger line to: log4j.rootCategory=ERROR, console

mv /usr/local/spark/conf/log4j.properties.template /usr/local/spark/conf/log4j.properties

Test!

spark-shell
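If everything is wired up correctly, a quick smoke test inside the shell might look like this (purely illustrative):

// Confirm the Spark version the shell is bound to
sc.version

// Trivial distributed job: sum 1..1000 across the local executors
sc.parallelize(1 to 1000).sum()
// res1: Double = 500500.0

// Leave the shell
:quit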


Install IntelliJ IDEA

Download archive from: https://www.jetbrains.com/idea/download/#section=linux
Unpack archive

tar xvf ideaIU-2017.3.5.tar.gz

Apache Spark as a General Purpose Data Science Tool

Apache Spark is an open-source framework for performing general data analytics on a distributed computing cluster, with libraries for machine learning, interactive queries and graph analytics.

Sporting a multi-stage RAM-capable compute framework, Spark’s real-time data processing capability is considerably faster than Hadoop’s MapReduce.

Spark consists of Spark Core and a set of libraries with user-friendly APIs for Scala (its native language), Java, Python, and Spark SQL, making it easier to use than Hadoop.

Additional libraries, built atop the core, allow diverse workloads for streaming, SQL, and machine learning.

According to the Apache Spark website:

‘Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs to allow data workers to efficiently execute streaming, machine learning or SQL workloads that require fast iterative access to datasets.

It can run on a Hadoop cluster with YARN, but also on Mesos or in standalone mode, and works with a variety of filesystems such as HDFS, MapR File System, Cassandra, OpenStack Swift, Amazon S3, Kudu, etc.’

A Data Scientist’s Perspective

A number of Apache Spark features may be quite attractive to the data scientist:

  • provides APIs supporting related tasks, like data access, ETL, and integration,
  • an interactive environment complete with access to a rich set of libraries from a REPL (Read-Evaluate-Print Loop) environment,
  • data operations are transparently distributed across the cluster, even as you type,
  • code implemented in the REPL environment can be used in an operational context with little or no modifications,
  • scalable and distributed environment,
  • memory-resident data for speedy iterative machine-learning workloads,
  • comes with a machine-learning library, MLlib,
  • ability to directly leverage existing Java libraries,
  • versatile RDD (Resilient Distributed Dataset) abstraction,
  • a collections API similar in style and functional flavor to Scala’s own collections, and to Python’s (see the short comparison after this list), and
  • offers Scala’s compelling features for statistical computing.
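To illustrate the collection-like API noted in the list above (a sketch; the data is invented), the same functional pipeline reads almost identically on a plain Scala collection and on a distributed RDD:

// Local Scala collection, evaluated in-process
val localTotal = List(1, 2, 3, 4, 5).map(_ * 2).filter(_ > 4).sum

// The same pipeline on an RDD, evaluated across the cluster
val rddTotal = sc.parallelize(List(1, 2, 3, 4, 5)).map(_ * 2).filter(_ > 4).sum()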