Data Science

In their article ‘A Taxonomy of Data Science’, Hilary Mason and Chris Wiggins offer a useful definition of this fledgling field. They identify five areas of expertise for a data scientist:

  1. Obtaining data,
  2. Scrubbing data,
  3. Exploring data,
  4. Modeling data, and
  5. Interpreting data.

Together, these steps form the OSEMN (pronounced ‘awesome’) model.
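
As a rough sketch of how the five steps fit together, here is a toy pipeline in plain Python. The dataset, the column names (height_cm, weight_kg) and the function names are all made up for illustration; they do not come from the Mason and Wiggins article, and a real project would use far richer tooling at each step.

```python
import csv
import io
import statistics

# Hypothetical raw data, invented purely for illustration.
RAW_CSV = """height_cm,weight_kg
150,50
160,58
,61
170,66
180,74
"""


def obtain(text):
    """Obtain: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def scrub(rows):
    """Scrub: drop rows with missing values and convert the rest to floats."""
    clean = []
    for row in rows:
        height, weight = row["height_cm"], row["weight_kg"]
        if height and weight:
            clean.append((float(height), float(weight)))
    return clean


def explore(pairs):
    """Explore: compute a few summary statistics."""
    heights = [h for h, _ in pairs]
    weights = [w for _, w in pairs]
    return {
        "n": len(pairs),
        "mean_height": statistics.mean(heights),
        "mean_weight": statistics.mean(weights),
    }


def model(pairs):
    """Model: fit weight = slope * height + intercept by least squares."""
    heights = [h for h, _ in pairs]
    weights = [w for _, w in pairs]
    mean_h = statistics.mean(heights)
    mean_w = statistics.mean(weights)
    sxy = sum((h - mean_h) * (w - mean_w) for h, w in pairs)
    sxx = sum((h - mean_h) ** 2 for h in heights)
    slope = sxy / sxx
    intercept = mean_w - slope * mean_h
    return slope, intercept


def interpret(slope):
    """Interpret: turn the fitted slope into a plain-language statement."""
    return f"Each extra cm of height goes with about {slope:.2f} kg of extra weight."


if __name__ == "__main__":
    pairs = scrub(obtain(RAW_CSV))
    print(explore(pairs))
    slope, intercept = model(pairs)
    print(interpret(slope))
```

Each function maps to one letter of OSEMN, so the output of one step is the input to the next. In practice the steps are rarely this linear: exploring the data often sends you back to scrubbing it.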

Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge or insights from data in its various forms, whether structured or unstructured.

I will explore statistics, data analysis, machine learning and their related methods, and apply techniques and theories from a diverse range of disciplines, including mathematics, statistics, information science and computer science.

Tools

To kick off my investigations, I shall be working with Apache Spark.

Apache Spark as a General Purpose Data Science Tool
Setting up Apache Spark
High Level Architecture of Spark
Working with Resilient Distributed Datasets