Jules J. Berman, Ph.D., M.D.
Copyright
Acquiring Editor: Andrea Dierna
Editorial Project Manager: Heather Scherer
Project Manager: Punithavathy Govindaradjane
Designer: Russell Purdy
Morgan Kaufmann is an imprint of Elsevier
225 Wyman Street, Waltham, MA 02451, USA
Copyright © 2013 Elsevier Inc. All rights reserved.
No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher's permissions policies, and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency can be found at our website: www.elsevier.com/permissions.
This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).
Notices
Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods or professional practices may become necessary.
Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information or methods described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.
To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.
Library of Congress Cataloging-in-Publication Data
Berman, Jules J.
Principles of big data : preparing, sharing, and analyzing complex information / Jules J Berman.
pages cm
ISBN 978-0-12-404576-7
1. Big data. 2. Database management. I. Title.
QA76.9.D32B47 2013
005.74--dc23
2013006421
British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library
Printed and bound in the United States of America
13 14 15 16 17 10 9 8 7 6 5 4 3 2 1
For information on all MK publications visit our website at www.mkp.com
Dedication
To my father, Benjamin
Acknowledgments
I thank Roger Day and Paul Lewis, who resolutely pored over the entire manuscript, placing insightful and useful comments in every chapter. I thank Stuart Kramer, whose valuable suggestions for the content and organization of the text came when the project was in its formative stage. Special thanks go to Denise Penrose, who worked on her very last day at Elsevier to find this title a suitable home at Elsevier's Morgan Kaufmann imprint. I thank Andrea Dierna, Heather Scherer, and all the staff at Morgan Kaufmann who shepherded this book through the publication and marketing processes.
Author Biography
Jules Berman holds two Bachelor of Science degrees from MIT (Mathematics, and Earth and Planetary Sciences), a Ph.D. from Temple University, and an M.D. from the University of Miami. He was a graduate researcher in the Fels Cancer Research Institute at Temple University and at the American Health Foundation in Valhalla, New York. His postdoctoral studies were completed at the U.S. National Institutes of Health, and his residency was completed at the George Washington University Medical Center in Washington, DC. Dr. Berman served as Chief of Anatomic Pathology, Surgical Pathology and Cytopathology at the Veterans Administration Medical Center in Baltimore, Maryland, where he held joint appointments at the University of Maryland Medical Center and at the Johns Hopkins Medical Institutions. In 1998, he became the Program Director for Pathology Informatics in the Cancer Diagnosis Program at the U.S. National Cancer Institute, where he worked and consulted on Big Data projects. In 2006, Dr. Berman was President of the Association for Pathology Informatics. In 2011, he received the Lifetime Achievement Award from the Association for Pathology Informatics. He is a coauthor on hundreds of scientific publications. Today, Dr. Berman is a freelance author, writing extensively in his three areas of expertise: informatics, computer programming, and pathology.
Preface
We can't solve problems by using the same kind of thinking we used when we created them.
Albert Einstein
Data pours into millions of computers every moment of every day. It is estimated that the total accumulated data stored on computers worldwide is about 300 exabytes (that's 300 billion gigabytes). Data storage increases at about 28% per year. The data stored is peanuts compared to data that is transmitted without storage. The annual transmission of data is estimated at about 1.9 zettabytes (1900 billion gigabytes) [1]. From this growing tangle of digital information, the next generation of data resources will emerge.
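The unit conversions above are easy to check. The following is a minimal sketch, assuming decimal (SI) units, in which a gigabyte is 10^9 bytes, an exabyte is 10^18 bytes, and a zettabyte is 10^21 bytes. The function names are hypothetical, chosen for illustration. The doubling-time calculation further assumes that the 28% annual growth rate stays constant and compounds yearly, which is a simplification and not a claim made in the text.

```python
import math

# Assumption: decimal (SI) byte units, expressed as powers of ten.
GIGABYTE = 10**9
EXABYTE = 10**18
ZETTABYTE = 10**21


def to_billion_gigabytes(n_bytes):
    """Express a byte count in billions of gigabytes."""
    return n_bytes / (GIGABYTE * 10**9)


def doubling_time_years(annual_growth):
    """Years needed to double at a constant, yearly compounded growth rate."""
    return math.log(2) / math.log(1 + annual_growth)


# Figures quoted in the text.
stored = 300 * EXABYTE                # about 300 exabytes stored worldwide
transmitted = 19 * ZETTABYTE // 10    # 1.9 zettabytes, kept as an exact integer

print(to_billion_gigabytes(stored))       # 300.0
print(to_billion_gigabytes(transmitted))  # 1900.0
print(doubling_time_years(0.28))          # a little under 3 years
```

Under these assumptions, stored data would double roughly every three years, while the data transmitted in a single year is already more than six times the total stored.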
As the scope of our data (i.e., the different kinds of data objects included in the resource) and our data timeline (i.e., data accrued from the future and the deep past) are broadened, we need to find ways to fully describe each piece of data so that we do not confuse one data item with another and so that we can search and retrieve data items when needed. Astute informaticians understand that if we fully described everything in our universe, we would need an ancillary universe to hold all the information, and the ancillary universe would need to be much, much larger than our physical universe.
In the rush to acquire and analyze data, it is easy to overlook the topic of data preparation. If data in our Big Data resources are not well organized, comprehensive, and fully described, then the resources will have no value. The primary purpose of this book is to explain the principles upon which serious Big Data resources are built. All of the data held in Big Data resources must have a form that supports search, retrieval, and analysis. The analytic methods must be available for review, and the analytic results must be available for validation.
Perhaps the greatest potential benefit of Big Data is the ability to link seemingly disparate disciplines for the purpose of developing and testing hypotheses that cannot be approached within a single knowledge domain. This book reviews methods by which analysts can navigate through different Big Data resources to create new, merged data sets.
What exactly is Big Data? Big Data can be characterized by the three V's: volume (large amounts of data), variety (includes different types of data), and velocity (constantly accumulating new data) [2]. Those of us who have worked on Big Data projects might suggest throwing a few more V's into the mix: vision (having a purpose and a plan), verification (ensuring that the data conforms to a set of specifications), and validation (checking that its purpose is fulfilled).