In their article ‘A Taxonomy of Data Science’, Hilary Mason and Chris Wiggins offer a useful definition of this fledgling field. They identify five areas of expertise for a data scientist:
- Obtaining data,
- Scrubbing data,
- Exploring data,
- Modeling data, and
- Interpreting data.
Together, these steps form the OSEMN (pronounced ‘awesome’) model.
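To make the five steps concrete, here is a minimal sketch in plain Python that maps each OSEMN stage to one small function. Everything in it is an assumption made for illustration: the inline `RAW_CSV` data, the function names, and the deliberately trivial "model" (a mean) are hypothetical and do not come from Mason and Wiggins' article.

```python
import csv
import io
import statistics

# Hypothetical raw data standing in for a real source (file, API, database).
RAW_CSV = """city,temp_c
Oslo,4.0
Cairo,30.5
Lima,
Pune,28.0
"""

def obtain(text):
    """Obtain: read raw records from the source."""
    return list(csv.DictReader(io.StringIO(text)))

def scrub(rows):
    """Scrub: drop rows with a missing temperature and convert types."""
    return [{"city": r["city"], "temp_c": float(r["temp_c"])}
            for r in rows if r["temp_c"]]

def explore(rows):
    """Explore: compute simple summary statistics."""
    temps = [r["temp_c"] for r in rows]
    return {"count": len(temps), "min": min(temps), "max": max(temps)}

def model(rows):
    """Model: fit a deliberately trivial model, the mean temperature."""
    return statistics.mean(r["temp_c"] for r in rows)

def interpret(rows, mean_temp):
    """Interpret: turn the model output into a readable finding."""
    above = [r["city"] for r in rows if r["temp_c"] > mean_temp]
    return f"Cities warmer than the mean of {mean_temp:.1f} C: {', '.join(above)}"

rows = scrub(obtain(RAW_CSV))
summary = explore(rows)
mean_temp = model(rows)
report = interpret(rows, mean_temp)
print(summary)
print(report)
```

In a real project each stage would be far larger, and the steps are usually revisited in a loop rather than run once in order, but the division of responsibilities stays the same.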
Data science is an interdisciplinary field that brings together scientific methods, processes, algorithms and systems to extract knowledge or insights from data, whether structured or unstructured.
I will explore statistics, data analysis, machine learning and their related methods, applying techniques and theories from a diverse range of disciplines including mathematics, statistics, information science and computer science.
To kick off my investigations, I shall be working with Apache Spark.
|Apache Spark as a General Purpose Data Science Tool|
|Setting up Apache Spark|
|High Level Architecture of Spark|
|Working with Resilient Distributed Dataset|