Mastering Azure Analytics
by Zoiner Tejada
Copyright 2017 Zoiner Tejada. All rights reserved.
Printed in the United States of America.
Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.
- Editor: Shannon Cutt
- Production Editor: Kristen Brown
- Copyeditor: Rachel Monaghan
- Proofreader: Charles Roumeliotis
- Indexer: Ellen Troutman
- Interior Designer: David Futato
- Cover Designer: Karen Montgomery
- Illustrator: Rebecca Demarest
- April 2017: First Edition
Revision History for the First Edition
- 2017-04-04: First Release
See http://oreilly.com/catalog/errata.csp?isbn=9781491956656 for release details.
The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Mastering Azure Analytics, the cover image, and related trade dress are trademarks of O'Reilly Media, Inc.
While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
978-1-491-95665-6
[LSI]
Foreword
"Every 25 milliseconds, a turbine emits 10 distinct data points" began almost every customer conversation about big data and advanced analytics that I've been a part of over the last six years. A simple story about the data needs of a wind farm highlighted the evolving size, speed, and shape of data that is representative of customers across industries. Over time, the technology names, the integration scenarios, and the guidance would evolve, but a few things remained consistent despite the ever-increasing pace of change:
- Customers are faced with a rapidly expanding amount of data, in a variety of shapes and sizes, generated and stored throughout their environment.
- Deep understanding of customers, of purchase patterns, of machine performance, of transaction streams, and more, is fast becoming table stakes as competitors are doing the same.
- The pace of innovation from vendors, and more importantly the ecosystem, is operating at what feels like a record high.
The value that customers get from advanced analytics, big data, and machine learning can transform businesses, but there are still a lot of pieces that need to come together. I've been fortunate to have had such an immensely exciting, rewarding, and simply fun time building products customers can use to solve these challenges. These technologies have, in many cases, enabled people to build solutions that simply weren't possible 5 or 10 years ago.
The addition of the Azure cloud in these scenarios has given customers an entirely new level of flexibility. Cloud services such as HDInsight make it faster, easier, and cheaper to experiment with a wide range of software and hardware combinations, and make it possible to finely tune the consumption of cloud resources to the specifics of a given project and to scale up and down as required. Additionally, the economic model of the cloud is fundamentally different from acquiring and operating these tools on premises, which enables scenarios that are simply not possible on premises. We've seen Azure customers scale out to a large number of GPU-enabled machines to conduct training using the latest deep learning libraries, and then take that output and deploy it to their web services (as well as to devices running anywhere), paying only for the few dollars' worth of compute they used when they did so. Now, with this flexibility comes the need to manage and orchestrate across these systems, which can quickly become a key challenge.
This book takes the reader through the same workflow you'll see for implementing an analytics project in the real world: building a data pipeline. By first walking through ingesting and storing data, you'll set the stage in Azure for a rich set of insights to derive from that data. Once you've ingested the data, processing can occur in real time, in offline batch scenarios, and while using tools and languages that you're familiar with. The next stage is in acting on the insights gained, whether through dashboards or further integration into other applications and services. Oftentimes, the analysis that we want to be able to do may also involve machine learning to bring structure or predictions to the data. It is said that most machine learning projects are 80% acquiring and processing the data prior to performing any machine learning, and the tools shown throughout this book can be used for this. Finally, we must deal with a set of very real operational aspects of any production data pipeline, such as security and data governance, which need to be considered throughout any project.
Zoiner's perspective on this space is one crafted through years of hard work, walking hand in hand with customers who are looking to transform their businesses with the power of data. Zoiner and I met nearly 10 years ago while we were both working in the distributed systems space, where we shared a passion for orchestration engines and messaging layers. Since then, I have always appreciated his ability to work with fantastically complicated technologies and distill down the key choices and aspects of a solution into simple guidance that anyone can understand. I'm excited to see him applying that same approach to a topic that's so near to me, and I'm excited to see what all the readers can do with the knowledge they will gain.
Matt Winkler
Group Program Manager,
Big Data and Machine Learning
Microsoft
Woodinville, WA
Preface
If you are building software solutions today, odds are that you have a data problem. You might even have an advanced analytics problem or one that requires machine learning. The trouble is that the world of software development and those of big data and advanced analytics seem like they are light years apart: they use different software stacks, different terminology, and often different engineering approaches, and there are lots of choices. The aim of this book is to provide you with a map of the galaxy that helps you chart your course to wrangling insights and guidance out of your data, irrespective of whether that data is arriving at warp speed from IoT sensors or at the glacial pace of decades of historical data.
The structure of this book is designed along the path of a data pipeline that aims to ingest, process, store, and deliver data along both real-time (hot data) and batch (cold data) paths. The waypoints in the map to your data pipeline are groups of Azure services, and each is covered in one or more chapters. We describe each service and tool that you should consider for a particular step in your pipeline. Another way to think about it is to look at each phase of the analytics pipeline as a toolbox unto itself: which Azure service would you use for long-term storage? We show you how to use Azure Storage and Azure Data Lake Store. What about storage of streaming data? We give you the options, including Azure Stream Analytics, Azure HDInsight with Storm or Spark, and the Event Processor Host, and show you how to program them.