Praise for The Enterprise Big Data Lake
Alex is a visionary in the data industry. He has encapsulated his practical insights into a thorough treatise examining the technical considerations, firm-wide implications, and leveraged business impact of transitioning to a data-driven enterprise. This is a book for any business or technical professional who wishes to succeed with data.
Keyur Desai, Chief Data Officer, TD Ameritrade
Data lakes are essential in achieving many of the benefits of decision- and analytics-driven solutions. This book does a great job clarifying the architecture of data lakes, what value they provide, what challenges they pose, and how to address those challenges.
Jari Koister, VP of Product and Technology, FICO, and professor in the data science program at UC Berkeley, California
Big Data is one of the most confusing terms in the industry today. This book breaks down the components into easy, understandable terms and explains the best ways to approach such projects. I found the sections that articulate the interconnectedness of data streams, data ponds, and data lakes especially helpful. The book is a must-read for any executive looking to understand and educate themselves on contemporary methods of analytics.
Opinder Bawa, Vice President and Chief Information Officer, University of San Francisco
I can't wait to share this book with managers I know who have joined data lake teams and need an introduction to the tools and terms they will need to converse and understand their new teams. They will also get a great idea for the direction they should try and steer their teams. This book is a great place to start, whether you are building a data lake or have inherited one.
Nicole Schwartz, Agile and Technical Product Management consultant
The Enterprise Big Data Lake
by Alex Gorelik
Copyright © 2019 Alex Gorelik. All rights reserved.
Printed in the United States of America.
Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.
- Editor: Andy Oram
- Production Editor: Kristen Brown
- Copyeditor: Rachel Head
- Proofreader: Rachel Monaghan
- Indexer: Ellen Troutman Zaig
- Interior Designer: David Futato
- Cover Designer: Karen Montgomery
- Illustrator: Rebecca Demarest
- March 2019: First Edition
Revision History for the First Edition
- 2019-02-19: First Release
See http://oreilly.com/catalog/errata.csp?isbn=9781491931554 for release details.
The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. The Enterprise Big Data Lake, the cover image, and related trade dress are trademarks of O'Reilly Media, Inc.
The views expressed in this work are those of the author, and do not represent the publisher's views. While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
978-1-491-93155-4
Preface
In recent years many enterprises have begun experimenting with big data and cloud technologies to build data lakes and support a data-driven culture and decision making, but the projects often stall or fail because the approaches that worked at internet companies have to be adapted for the enterprise, and there is no comprehensive practical guide on how to do that successfully. I wrote this book with the hope of providing such a guide.
In my roles as an executive at IBM and Informatica (major data technology vendors), Entrepreneur in Residence at Menlo Ventures (a leading VC firm), and founder and CTO of Waterline (a big data startup), I've been fortunate to have had the opportunity to speak with hundreds of experts, visionaries, industry analysts, and hands-on practitioners about the challenges of building successful data lakes and creating a data-driven culture. This book is a synthesis of the themes and best practices that I've encountered across industries (from social media to banking and government agencies) and roles (from chief data officers and other IT executives to data architects, data scientists, and business analysts).
Big data, data science, and analytics supporting data-driven decision making promise to bring unprecedented levels of insight and efficiency to everything from how we work with data to how we work with customers to the search for a cure for cancer, but data science and analytics depend on having access to historical data. In recognition of this, companies are deploying big data lakes to bring all their data together in one place and start saving history, so data scientists and analysts have access to the information they need to enable data-driven decision making. Enterprise big data lakes bridge the gap between the freewheeling culture of modern internet companies, where data is core to all practices, everyone is an analyst, and most people can code and roll their own data sets, and enterprise data warehouses, where data is a precious commodity, carefully tended to by professional IT personnel and provisioned in the form of carefully prepared reports and analytic data sets.
To be successful, enterprise data lakes must provide three new capabilities:
- Cost-effective, scalable storage and computing, so large amounts of data can be stored and analyzed without incurring prohibitive computational costs
- Cost-effective data access and governance, so everyone can find and use the right data without incurring expensive human costs associated with programming and manual ad hoc data acquisition
- Tiered, governed access, so different levels of data can be available to different users based on their needs and skill levels and applicable data governance policies
Hadoop, Spark, NoSQL databases, and elastic cloud-based systems are exciting new technologies that deliver on the first promise of cost-effective, scalable storage and computing. While they are still maturing and face some of the challenges inherent to any new technology, they are rapidly stabilizing and becoming mainstream. However, these powerful enabling technologies do not deliver on the other two promises of cost-effective and tiered data access. So, as enterprises create large clusters and ingest vast amounts of data, they find that instead of a data lake, they end up with a data swamp: a large repository of unusable data sets that are impossible to navigate or make sense of, and too dangerous to rely on for any decisions.
This book guides readers through the considerations and best practices of delivering on all the promises of the big data lake. It discusses various approaches to starting and growing a data lake, including data puddles (analytical sandboxes) and data ponds (big data warehouses), as well as building data lakes from scratch. It explores the pros and cons of different data lake architectures (on premises, cloud-based, and virtual) and covers setting up different zones to house everything from raw, untreated data to carefully managed and summarized data, and governing access to those zones. It explains how to enable self-service so that users can find, understand, and provision data themselves; how to provide different interfaces to users with different skill levels; and how to do all of that in compliance with enterprise data governance policies.