1. Introduction
Data science is an interdisciplinary field encompassing scientific methods, processes, and systems to extract knowledge or insights from data in various forms, either structured or unstructured. It draws principles from mathematics, statistics, information science, computer science, machine learning, visualization, data mining, and predictive analytics. However, it is fundamentally grounded in mathematics.
This book explains and applies the fundamentals of data science crucial for technical professionals such as DBAs and developers who are making career moves toward practicing data science. It is an example-driven book providing complete Python coding examples to complement and clarify data science concepts, and enrich the learning experience. Coding examples include visualizations whenever appropriate. The book is a necessary precursor to applying and implementing machine learning algorithms, because it introduces the reader to foundational principles of the science of data.
The book is self-contained. All the math, statistics, stochastic, and programming skills required to master the content are covered in the book. In-depth knowledge of object-oriented programming isnt required, because working and complete examples are provided and explained. The examples are in-depth and complex when necessary to ensure the acquisition of appropriate data science acumen. The book helps you to build the foundational skills necessary to work with and understand complex data science algorithms.
Data Science Fundamentals by Example is an excellent starting point for those interested in pursuing a career in data science. Like any science, the fundamentals of data science are prerequisite to competency. Without proficiency in mathematics, statistics, data manipulation, and coding, the path to success is rocky at best. The coding examples in this book are concise, accurate, and complete, and perfectly complement the data science concepts introduced.
The book is organized into six chapters. Chapter focusing on exploring data by dimensionality reduction, web scraping, and working with large data sets efficiently.
Python programming code for all coding examples and data files are available for viewing and download through Apress at www.apress.com/9781484235966 . Specific linking instructions are included on the copyright pages of the book.
To install a Python module, pip is the preferred installer program. So, to install the matplotlib module from an Anaconda prompt: pip install matplotlib. Anaconda is a widely popular open source distribution of Python (and R) for large-scale data processing, predictive analytics, and scientific computing that simplifies package management and deployment. I have worked with other distributions with unsatisfactory results, so I highly recommend Anaconda.
Python Fundamentals
Python has several features that make it well suited for learning and doing data science. Its free, relatively simple to code, easy to understand, and has many useful libraries to facilitate data science problem solving. It also allows quick prototyping of virtually any data science scenario and demonstration of data science concepts in a clear, easy to understand manner.
The goal of this chapter is not to teach Python as a whole, but present, explain, and clarify fundamental features of the language (such as logic, data structures, and libraries) that help prototype, apply, and/or solve data science problems.
Python fundamentals are covered with a wide spectrum of activities with associated coding examples as follows:
functions and strings
lists, tuples, and dictionaries
reading and writing data
list comprehension
generators
data randomization
MongoDB and JSON
visualization
Functions and Strings
Python functions are first-class functions, which means they can be used as parameters, a return value, assigned to variable, and stored in data structures. Simply, functions work like a typical variable. Functions can be either custom or built-in. Custom are created by the programmer, while built-in are part of the language. Strings are very popular types enclosed in either single or double quotes.
The following code example defines custom functions and uses built-in ones:
def num_to_str(n):
return str(n)
def str_to_int(s):
return int(s)
def str_to_float(f):
return float(f)
if __name__ == "__main__":
# hash symbol allows single-line comments
'''
triple quotes allow multi-line comments
'''
float_num = 999.01
int_num = 87
float_str = '23.09'
int_str = '19'
string = 'how now brown cow'
s_float = num_to_str(float_num)
s_int = num_to_str(int_num)
i_str = str_to_int(int_str)
f_str = str_to_float(float_str)
print (s_float, 'is', type(s_float))
print (s_int, 'is', type(s_int))
print (f_str, 'is', type(f_str))
print (i_str, 'is', type(i_str))
print ('\nstring', '"' + string + '" has', len(string), 'characters')
str_ls = string.split()
print ('split string:', str_ls)
print ('joined list:', ' '.join(str_ls))
A popular coding style is to present library importation and functions first, followed by the main block of code. The code example begins with three custom functions that convert numbers to strings, strings to numbers, and strings to float respectively. Each custom function returns a built-in function to let Python do the conversion. The main block begins with comments. Single-line comments are denoted with the # (hash) symbol. Multiline comments are denoted with three consecutive single quotes. The next five lines assign values to variables. The following four lines convert each variable type to another type. For instance, function num_to_str() converts variable float_num to string type. The next five lines print variables with their associated Python data type. Built-in function type() returns type of given object. The remaining four lines print and manipulate a string variable.
Lists, Tuples, and Dictionaries
Lists are ordered collections with comma-separated values between square brackets. Indices start at 0 (zero). List items need not be of the same type and can be sliced, concatenated, and manipulated in many ways.
The following code example creates a list, manipulates and slices it, creates a new list and adds elements to it from another list, and creates a matrix from two lists :
import numpy as np
if __name__ == "__main__":
ls = ['orange', 'banana', 10, 'leaf', 77.009, 'tree', 'cat']
print ('list length:', len(ls), 'items')
print ('cat count:', ls.count('cat'), ',', 'cat index:', ls.index('cat'))
print ('\nmanipulate list:')
cat = ls.pop(6)
print ('cat:', cat, ', list:', ls)
ls.insert(0, 'cat')
ls.append(99)
print (ls)
ls[7] = '11'
print (ls)
ls.pop(1)
print (ls)