Usman Qamar and Muhammad Summair Raza
Data Science Concepts and Techniques with Applications
Usman Qamar
Knowledge and Data Science Research Centre, National University of Sciences and Technology (NUST), Islamabad, Pakistan
Muhammad Summair Raza
Department of Computer Science, Virtual University of Pakistan, Lahore, Pakistan
ISBN 978-981-15-6132-0 e-ISBN 978-981-15-6133-7
https://doi.org/10.1007/978-981-15-6133-7
The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2020
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore
This book is dedicated to our students.
Preface
As this book is about data science, the first question it immediately begs is: What is data science? It is a surprisingly hard definition to nail down. However, for us, data science is perhaps the best label for the cross-disciplinary set of skills that are becoming increasingly important in many applications across industry and academia. It comprises three distinct and overlapping areas: the skills of a statistician, who knows how to model and summarize the data; the skills of a data scientist, who can design and use algorithms to efficiently process and visualize the data; and the skills of a domain expert, who will formulate the right questions and put the answers in context. With this in mind, we would encourage all to think of data science not as a new domain of knowledge to learn, but as a new set of skills that you can apply within your current area of expertise.
The book is divided into three parts. The first part consists of the first three chapters. In Chap. introduces widely used techniques for data analytics. Before discussing common data analytics techniques, we first explain the three types of learning under which the majority of data analytics algorithms fall.
The second part is composed of Chaps. introduces text mining as well as opinion mining.
Finally, the third part of the book is composed of Chap. which focuses on two programming languages commonly used for data science projects, i.e., Python and R.
Data science is an umbrella term that encompasses data analytics, data mining, machine learning, and several other related disciplines, so the contents have been devised with this perspective in mind. An attempt has been made to keep the book as self-contained as possible. The book is suitable for both undergraduate and postgraduate students as well as those carrying out research in data science. It can be used as a textbook for undergraduate students in computer science, engineering, and mathematics. It is also accessible to undergraduate students from other areas who have an adequate background. The more advanced chapters can be used by postgraduate researchers who want to gain a deeper theoretical understanding.
Dr. Usman Qamar
Islamabad, Pakistan
Dr. Muhammad Summair Raza
Lahore, Pakistan
About the Authors
Dr. Usman Qamar has over 15 years of experience in data engineering and decision sciences, both in academia and industry. He has a Masters in Computer Systems Design from the University of Manchester Institute of Science and Technology (UMIST), UK. His MPhil in Computer Systems was a joint degree between UMIST and the University of Manchester and focused on feature selection in big data. In 2008 he was awarded a PhD from the University of Manchester, UK. His post-PhD work at the University of Manchester involved various research projects, including hybrid mechanisms for statistical disclosure (feature selection merged with outlier analysis) for the Office for National Statistics (ONS), London, UK, churn prediction for Vodafone UK, and customer profile analysis for shopping with the University of Ghent, Belgium. He is currently Associate Professor of Data Engineering at the National University of Sciences and Technology (NUST), Pakistan. He has authored over 200 peer-reviewed publications, which include 3 books published by Springer & Co. He is on the Editorial Board of many journals, including Applied Soft Computing, Neural Computing and Applications, Computers in Biology and Medicine, and Array. He has successfully supervised 5 PhD students and over 100 master's students.
Dr. Muhammad Summair Raza has been affiliated with the Virtual University of Pakistan for more than 8 years and has taught a number of subjects to graduate-level students. He has authored several articles in quality journals and is currently working in the fields of data analysis and big data, with a focus on rough sets.
U. Qamar, M. S. Raza Data Science Concepts and Techniques with Applications https://doi.org/10.1007/978-981-15-6133-7_1
1. Introduction
Usman Qamar (1) and Muhammad Summair Raza (2)
(1) Knowledge and Data Science Research Centre, National University of Sciences and Technology (NUST), Islamabad, Pakistan
(2) Department of Computer Science, Virtual University of Pakistan, Lahore, Pakistan
Usman Qamar (Corresponding author)
In this chapter we will discuss the data analytics process. Starting from the basic concepts, we will highlight the types of data, its use, its importance, and the issues that are normally faced in data analytics. Efforts have been made to present the concepts in the simplest possible way, as conceptual clarity is necessary before studying the advanced concepts of data science and related techniques.
1.1 Data
Data is an essential need in all domains of life. From the research community to business markets, data is always required for analysis and decision-making. However, emerging developments in data storage, processing, and transmission technology have changed the entire scenario. Large volumes of data are now produced on a daily basis. Whenever you type a message, upload a picture, browse the web, or post on social media, you are producing data that is stored somewhere and is available online for processing. Couple this with the development of advanced software applications and inexpensive hardware. With the emergence of concepts like the Internet of Things (IoT), where the focus is on connected data, the scenario has intensified further. From writing something on paper to online distributed storage, data is everywhere.