PRAISE FOR
Big Data Governance: An Emerging Imperative
Big Data deals with complex data from disparate sources. Without proper data governance, the integration of this data is very difficult to do correctly. Big Data Governance: An Emerging Imperative gives you the information and insights necessary to develop the data governance program needed to support your Big Data integration projects. Well done, Sunil.
Jay Yusko, Ph.D.
VP, Technology Research SymphonyIRI Group
This is a BIG book by an information governance professional who is committed to educating his audience on complex matters, in a very practical and easily digestible manner.
As a seasoned IT professional who is part of a large corporation, I feel overwhelmed when confronted with the data dilemma. There are more questions than answers facing us data practitioners. My organization is a leading telco in South Africa, and we have data streaming into our organization in the form of call detail records, location data, and data generated by social media, all of which needs to be managed for intelligent use.
This well-crafted book demystifies the landscape relating to BIG DATA and empowers us with the necessary intellectual property to tackle the challenges facing us in the big data space.
There is a wealth of information crammed between the covers of this book. Now that I have had the opportunity to synthesize the ideas and learnings described in the book, I feel more confident about tackling the Big Data challenges in my organization, with gusto and determination.
We will all be successful, thanks to the guidance offered by Sunil, in this book!
Komalin Chetty
Head, Data Governance Office Telkom South Africa
Big Data Governance: An Emerging Imperative
Sunil Soares
First Edition
First Printing, October 2012
© 2012 Sunil Soares. All rights reserved.
Every attempt has been made to provide correct information. However, the publisher and the author do not guarantee the accuracy of the book and do not assume responsibility for information included in or omitted from it.
IBM, Cognos, Coremetrics, Domino, Guardium, InfoSphere, Lotus, OpenPages, Optim, SPSS, and Tivoli are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. Netezza is a registered trademark of IBM International Group B.V., an IBM Company. Unica is a registered trademark of Unica Corporation, an IBM Company. Vivisimo is a registered trademark of Vivisimo, an IBM Company.
Endeca, Exadata, Exalytics, Java, OpenText, and Oracle are registered trademarks of Oracle and/or its affiliates. SAP, BusinessObjects, HANA, NetWeaver, and SyBase are the trademarks or registered trademarks of SAP AG in Germany and in several other countries. Microsoft, Azure, Exchange, Excel, SharePoint, SQL Server, SSAS, SSIS, SSRS, Word, Windows, and the Windows logo are trademarks or registered trademarks of Microsoft Corporation in the United States, other countries, or both.
HP is a registered trademark of Hewlett-Packard Company. Autonomy is a registered trademark of Autonomy, an HP Company. Vertica is a registered trademark of Vertica in the United States and other countries. Informatica, Informatica Cloud, PowerCenter, and PowerExchange are registered trademarks of Informatica Corporation. SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. DataFlux and all other DataFlux Corporation product or service names are registered trademarks or trademarks of, or licensed to, DataFlux Corporation in the USA and other countries.
Teradata and Aster are registered trademarks of Teradata Corporation and/or its affiliates in the United States and other countries. EMC, Archer, Documentum, RSA, and SourceOne are registered trademarks of EMC Corporation. Amazon, Amazon Web Services, DynamoDB, Elastic Compute Cloud, Elastic MapReduce, RDS, S3, and Simple Storage Service are trademarks of Amazon.com, Inc. or its affiliates in the United States and/or other countries. Google and YouTube are trademarks or registered trademarks of Google, Inc. Pentaho is a registered trademark of Pentaho, Inc. Talend is a trademark of Talend.
Adobe is a registered trademark of Adobe Systems Incorporated in the United States, and/or other countries. Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both. UNIX is a registered trademark of The Open Group in the United States and other countries.
Other company, product, or service names may be trademarks or service marks of others.
Printed in Canada. All rights reserved. This publication is protected by copyright, and permission must be obtained from the publisher prior to any prohibited reproduction, storage in a retrieval system, or transmission in any form or by any means, electronic, mechanical, photocopying, recording, or likewise.
MC Press offers excellent discounts on this book when ordered in quantity for bulk purchases or special sales, which may include custom covers and content particular to your business, training goals, marketing focus, and branding interest.
Corporate Offices: MC Press Online, LLC, 3695 W Quail Heights Court, Boise, ID 83703-3861 USA
Sales and Customer Service: Toll Free: (877) 226-5394;
Permissions and Special Orders:
ISBN: 978-1-58347-377-1
Dedicated to my beautiful daughters Lizzie and Maya
Many thanks to my wife Helena for her support during the development of this book.
A big thanks to my parents Cecilia and Hubert for their prayers and guidance.
ABOUT THE AUTHOR
Sunil Soares is the founder and managing partner of Information Asset, LLC, a consulting firm that specializes in helping organizations build out their information governance programs. Prior to this role, Sunil was the director of information governance at IBM and worked with clients across six continents and multiple industries.
Sunil's first book, The IBM Data Governance Unified Process (MC Press, 2010), details the 14 steps and almost 100 sub-steps to implement an information governance program. The book has been used by several organizations as the blueprint for their information governance programs and has been translated into Chinese. Sunil's second book, Selling Information Governance to the Business: Best Practices by Industry and Job Function (MC Press, 2011), reviews the best practices to approach information governance by industry and function.
Prior to joining IBM, Sunil consulted with major financial institutions at the Financial Services Strategy Consulting Practice of Booz Allen & Hamilton in New York. Sunil lives in New Jersey and holds an MBA in Finance and Marketing from the University of Chicago Booth School of Business.
FOREWORD
by Inderpal Bhandari
We now live in an age of seemingly unlimited data. It has slowly, but surely, pervaded our lives. We rely on it to accomplish all manner of tasks, ranging from governing economies and advancing science, to maintaining an electronic record of our health. We have come to realize that its value must be truly understood and unlocked by deriving insights that are revealed through analysis and then translating those insights into information, knowledge, and ultimately action. Recently, with the advent of social media, sensor networks, streaming technology and the like, the sheer scale of work required to unlock this value has increased beyond the scope of traditional database and data warehousing technologies. In other words, we have entered the age of Big Data.