The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available on the Internet at http://dnb.dnb.de.
whose unwavering love and support make these accomplishments possible and worth pursuing.
This book was only made possible as a result of my collaboration with many world-renowned data scientists, researchers, CIOs, and leading technology innovators who have taught me a tremendous amount about scientific research, innovation, and more importantly, about the value of collaboration. To all of them I owe a huge debt of gratitude.
About the Author
Peter K. Ghavami received his PhD in Systems and Industrial Engineering from the University of Washington in Seattle, specializing in big data analytics. He has served as head of data analytics at several financial institutions including CapitalOne Financial. He received his BA from Oregon University in Mathematics with an emphasis in Computer Science. He received his MS in Engineering Management from Portland State University. His career started as a software engineer, with progressive responsibilities at IBM as systems engineer. Later he became director of engineering, chief scientist, VP of engineering and product management at various high technology firms. Before coming to CapitalOne Financial, he was director of informatics at UW Medicine leading numerous clinical system implementations and new product development projects. He has been a strategic advisor and VP of informatics at various analytics companies.
He has authored several papers, books, and book chapters on software process improvement, vector processing, distributed network architectures, and software quality. His first book, titled Lean, Agile and Six Sigma Information Technology Management, was published in 2008. He has also published a book on data analytics titled Big Data Analytics Methods: Analytics Techniques in Data Mining, Deep Learning and Natural Language Processing, which has become a popular textbook on data science.
Peter is on the advisory board of several clinical analytics companies and is often invited as a lecturer and speaker on this topic. He is certified in ITIL and TOGAF9. He is a member of IEEE Reliability Society, IEEE Life Sciences Initiative, and HIMSS. He has been an active member of the HIMSS Data Analytics Task Force and advises Fortune 500 executives on data strategy.
Introduction
The future of business is big data. While the wealth of an organization may be displayed in balance sheets and electronic ledgers, the real wealth of the organization is in its information assets in data and how well the organization harnesses value from it.
Open source storage systems for big data (such as Hadoop) promise the ultimate flexibility and power in storing and analyzing data. But because Hadoop was not designed with security and governance in mind, we face new and additional challenges in managing data to meet corporate and IT governance standards. I have sampled the best and most successful policies and processes from around the world and distilled them into a simplified, low-cost, but highly effective handbook for big data governance. That's why this book is indispensable to implementing big data analytics.
Knowledge is information and information is derived from data. Without data governance and data quality, without adequate data integration and information lifecycle management, the chance of harnessing this value and leveraging data will be very limited.
According to expert reports, data volumes in 2020 are about 50 zettabytes, compared to just around 1.2 zettabytes in 2010. The majority of this data is unstructured, in the form of PDFs, spreadsheets, images, multimedia (audio, video), geolocation data (GPS), emails, social content, web pages, machine data, and sensor data.
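To put the two volume figures quoted above in perspective, the short sketch below works out the annual growth rate they imply. It assumes smooth compound growth between 2010 and 2020, which is a simplification, and the function names are illustrative only.

```python
import math


def implied_annual_growth(start_volume: float, end_volume: float, years: int) -> float:
    """Compound annual growth rate implied by a start and an end volume."""
    return (end_volume / start_volume) ** (1 / years) - 1


def doubling_time_years(annual_growth: float) -> float:
    """Years needed for the volume to double at a constant annual growth rate."""
    return math.log(2) / math.log(1 + annual_growth)


# Figures quoted in the text: about 1.2 ZB in 2010 and about 50 ZB in 2020.
rate = implied_annual_growth(1.2, 50.0, 2020 - 2010)
doubling = doubling_time_years(rate)

print(f"Implied annual growth: {rate:.1%}")
print(f"Implied doubling time: {doubling:.2f} years")
```

Under that assumption, the quoted figures correspond to growth of roughly 45 percent a year, or a doubling of data volume a little under every two years.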
The purpose of this book is to present a practical, effective, no-frills, and low-cost data governance framework for big data. You'll find this book to be concise and to the point, highlighting the important and salient topics in big data that you can implement to achieve an effective data governance structure at a low implementation cost. The policies and recommendations in this book are based on best practices from around the world in big data governance. I've included best practices from some of the most respected and leading-edge companies that have successfully implemented big data and governance.
To learn more about big data analytics, you can read two companion books. The first book is titled Clinical Intelligence - The Big Data Analytics Revolution in Healthcare: A Framework for Clinical and Business Intelligence. It can be found at: https://www.createspace.com/4772104. The second companion book is titled Big Data Analytics Methods: Analytics Techniques in Data Mining, Deep Learning and Natural Language Processing, 2nd Edition (ISBN 9781547417957). It can be found on Amazon and at fine booksellers.
This book consists of four major parts. Part 1 offers an overview of big data and open source big data storage options like Hadoop. Part 2 is an overview of big data governance concepts, structure, architecture, policies, principles, and best practices. Part 3 presents the best practices in big data governance policies. Finally, Part 4 includes a ready-to-use template for governance structure written in a flexible format that you can easily adapt to your organization.
The contents of this book are presented in a lecture-like manner, in the style of a presentation slide deck that is available from the publisher for academic courses or corporate training programs. The companion book, mentioned above, covers the data science aspects of big data for those who are interested in big data analytics.
Now, let's start our journey through the book.
Part 1: Big Data Overview
Chapter 1 Introduction to Big Data
Data is the new gold. And analytics is the machinery that mines, molds, and mints it. Big data analytics is a set of computer-enabled analytics methods, processes, and disciplines for extracting and transforming raw data into meaningful insight, new discovery, and knowledge that supports more effective decision making. Another definition describes big data analytics as the discipline of extracting and analyzing data to deliver new insight about past performance, current operations, and the prediction of future events.
Before there was big data analytics, the study of large data sets was called data mining. But big data analytics has come a long way in a decade and is now gaining popularity thanks to the eruption of five new technologies: big data analytics, cloud computing, mobility, social networking, and smaller sensors. Each of these technologies is significant in its own way for how business decisions and performance can be improved and how vast amounts of data can be generated.
Big data is characterized by three key attributes known as the three Vs: volume, velocity, and variety. The world's storage volume is increasing at a rapid pace, estimated to double every year. The velocity at which this data is generated is rising, fueled by the advent of mobile devices and social networking. In medicine and healthcare, the cost and size of sensors have shrunk, making continuous patient monitoring and data acquisition from a multitude of human physiological systems an accepted practice.