Data Science
Theory, Analysis, and Applications
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
© 2020 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business
No claim to original U.S. Government works
International Standard Book Number-13: 978-0-367-20861-5 (Hardback)
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.
Library of Congress Cataloging-in-Publication Data
Names: Memon, Qurban A. (Qurban Ali), editor. | Khoja, Shakeel Ahmed, editor.
Title: Data science : theory, analysis, and applications / edited by Qurban A. Memon, Shakeel Ahmed Khoja.
Description: Boca Raton : CRC Press, [2020] | Includes bibliographical references and index.
Identifiers: LCCN 2019029260 (print) | LCCN 2019029261 (ebook) | ISBN 9780367208615 (hardback) | ISBN 9780429263798 (ebook)
Subjects: LCSH: Data mining--Statistical methods | Big data--Statistical methods | Quantitative research.
Classification: LCC QA76.9.D343 D3944 2020 (print) | LCC QA76.9.D343 (ebook) | DDC 006.3/12--dc23
LC record available at https://lccn.loc.gov/2019029260
LC ebook record available at https://lccn.loc.gov/2019029261
Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com
and the CRC Press Web site at http://www.crcpress.com
Dedicated to our mothers, to whom we owe everything.
Qurban A. Memon
Shakeel Ahmed Khoja
Data Science is an interdisciplinary scientific and technical field that combines techniques and approaches for efficient and effective data management, integration, analysis, visualization, and interaction with vast amounts of data, all as a critical prerequisite for a successful digitized economy. Currently, this science stands at the forefront of new scientific discoveries and plays a pivotal role in our everyday lives.
Multidisciplinary data has become available through hyperconnectivity; it is heterogeneous, online, cheap, and ubiquitous. Old data is being digitized, and new data collected from web logs is being added to generate business intelligence. Essentially, the increasing availability of vast libraries of digitized information influences the way in which we comprehend and analyze our environment and run businesses for societal benefit. In this situation, data science acts as a transformational force with an impact on current innovation potential in industry and academia. People are now aware that this data can make a huge difference in business and engineering fields. It promises to revolutionize industries, from business to academia.
Currently, every sector of the economy has access to huge volumes of data and accumulates this data at a rate that exceeds its capacity to extract meaningful intelligence from it. The data exists in the form of text, audio, video, images, sensor readings, blog posts, etc., but is unstructured, incomplete, and messy. New technologies, for example big data, have emerged to organize and make sense of this data to create commercial and social value. Every sector, it seems, can draw on this avalanche of data to extract relevant information. The question that still remains is how to use it effectively.
THE OBJECTIVE OF THIS BOOK
The aim of this book is to provide an internationally respected collection of scientific research methods, technologies, and applications in the area of data science. This book enhances the understanding of concepts and technologies in data science and discusses intelligent methods, solutions, and applications in data science. This book can prove useful to researchers, professors, research students, and practitioners as it reports novel research work on challenging topics in the area surrounding data science. In this book, some of the chapters are written in tutorial style concerning machine learning algorithms, data analysis, etc.
THE STRUCTURE
The book is a collection of fourteen chapters written by scholars and experts in this field and organized into three parts. Each part consists of a set of chapters addressing the respective subject outline to give readers an in-depth and focused understanding of the concepts and technology related to that part of the book. Some of the chapters in each part are written in tutorial style, particularly those concerning the development process of data science and its emerging applications.
The book is structured as follows:
Part I: Data Science: Theory, Concepts, and Algorithms
Part II: Data Design and Analysis
Part III: Applications and New Trends in Data Science
Part I comprises five chapters on data science theory, concepts, techniques, and algorithms.
The first chapter extends the earlier work on Cassandra integrated with Hadoop to a system called GeoMongoSpark and investigates the storage and retrieval of geospatial data using various sharding techniques. Hashed indexing is used to improve processing performance with less memory.
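As a rough illustration of the general idea behind hash-based sharding, and not of the GeoMongoSpark implementation itself, the following Python sketch assigns records to shards by hashing a key. The function names `shard_for` and `distribute`, the record identifiers, the coordinates, and the choice of MD5 as the hash are all assumptions made for this example; any stable hash would serve.

```python
import hashlib
from collections import defaultdict


def shard_for(key: str, num_shards: int) -> int:
    """Map a shard key to a shard index using a stable hash (hypothetical helper)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_shards


def distribute(records, num_shards):
    """Group records by the shard index derived from their identifier."""
    shards = defaultdict(list)
    for rec in records:
        shards[shard_for(rec[0], num_shards)].append(rec)
    return shards


# Hypothetical geospatial records: (id, longitude, latitude).
records = [
    ("poi-001", 54.37, 24.47),
    ("poi-002", 55.27, 25.20),
    ("poi-003", 67.00, 24.86),
    ("poi-004", 73.05, 33.68),
    ("poi-005", 74.35, 31.52),
]

if __name__ == "__main__":
    for shard_id, items in sorted(distribute(records, 3).items()):
        print(shard_id, [rec[0] for rec in items])
```

Because the shard is derived from a hash of the key rather than from the key's value, nearby keys tend to spread across shards, which evens out load at the cost of making range queries touch several shards.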
The purpose of Chapter 2 is to study different evolutionary algorithms for optimizing neural networks in different ways for image segmentation purposes.
Chapter 3 introduces a new adaptive algorithm called the Feature Selection Penguin Search optimization algorithm, a metaheuristic feature subset selection method. It is adapted from the Penguin Search Optimization Algorithm, which models the natural hunting strategy of penguins: a group of penguins take jumps at random depths, come back, and share the status of food availability with the other penguins, and in this way the global optimum solution is found. The algorithm is combined with different classifiers to find an optimal feature subset.
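The hunting strategy described above can be sketched, in a heavily simplified form, as a population-based search over feature subsets. The Python code below is only an illustrative sketch under stated assumptions and is not the algorithm presented in the chapter: the function `penguin_feature_search`, its parameters, the 0.3 sharing probability, and the toy fitness function are all hypothetical. In practice the fitness would more plausibly be the accuracy of a classifier trained on the selected features.

```python
import random


def penguin_feature_search(n_features, fitness, n_penguins=8, n_iters=50, seed=0):
    """Simplified penguin-inspired search over boolean feature masks (a sketch)."""
    rng = random.Random(seed)
    # Each penguin holds a candidate subset encoded as a list of booleans.
    penguins = [[rng.random() < 0.5 for _ in range(n_features)]
                for _ in range(n_penguins)]
    best = max(penguins, key=fitness)[:]
    for _ in range(n_iters):
        for i, current in enumerate(penguins):
            # "Dive" to a random depth: flip a random number of feature bits.
            depth = rng.randint(1, max(1, n_features // 2))
            candidate = current[:]
            for j in rng.sample(range(n_features), depth):
                candidate[j] = not candidate[j]
            # Keep the result of the dive only if it found more "food".
            if fitness(candidate) > fitness(current):
                penguins[i] = candidate
        # Share food status: remember the best subset seen so far.
        leader = max(penguins, key=fitness)
        if fitness(leader) > fitness(best):
            best = leader[:]
        # Every penguin then moves partway toward the best-known subset.
        for p in penguins:
            for j in range(n_features):
                if rng.random() < 0.3:
                    p[j] = best[j]
    return best


# Toy fitness (an assumption): reward informative features, penalize extras.
informative = {0, 3, 5}


def toy_fitness(mask):
    chosen = {i for i, keep in enumerate(mask) if keep}
    return 2 * len(chosen & informative) - len(chosen - informative)


if __name__ == "__main__":
    result = penguin_feature_search(10, toy_fitness)
    print([i for i, keep in enumerate(result) if keep], toy_fitness(result))
```

The sketch keeps the best subset separately from the population, so sharing information never loses the best solution found so far.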
Currently, graph technology is becoming increasingly important, and graphs are used to model dynamic and complex relationships in data in order to generate knowledge. In particular, Neo4j is a database management system that currently leads NoSQL systems for graph databases. In Chapter 4, the main objective is to propose physical design guidelines that improve query execution time on graph databases for a specific workload in Neo4j. In this work, indexes, path materialization, and query rewriting are considered as guidelines for the physical design of Neo4j databases.