The Azure DataLakehouse ToolkitBuilding and Scaling DataLakehouses on Azure with DeltaLake, Apache Spark, Databricks,Synapse Analytics, and SnowflakeRon LEsteveThe Azure Data Lakehouse Toolkit: Building and Scaling Data Lakehouses on Azurewith Delta Lake, Apache Spark, Databricks, Synapse Analytics, and Snowflake Ron LEsteve Chicago, IL, USA ISBN-13 (pbk): 978-1-4842-8232-8 ISBN-13 (electronic): 978-1-4842-8233-5 https://doi.org/10.1007/978-1-4842-8233-5 Copyright 2022 by Ron LEsteve This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Trademarked names, logos, and images may appear in this book. Rather than use a trademark symbol with every occurrence of a trademarked name, logo, or image we use the names, logos, and images only in an editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights. While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made.
The publisher makes no warranty, express or implied, with respect to the material contained herein. Managing Director, Apress Media LLC: Welmoed Spahr Acquisitions Editor: Jonathan Gennick Development Editor: Laura Berendson Coordinating Editor: Jill Balzano Cover designed by eStudioCalamar Cover image designed by Freepik (www.freepik.com) Distributed to the book trade worldwide by Springer Science+Business Media New York, 1 New York Plaza, Suite 4600, New York, NY 10004-1562, USA. Phone 1-800-SPRINGER, fax (201) 348-4505, e-mail orders-ny@ springer-sbm.com, or visit www.springeronline.com. Apress Media, LLC is a California LLC and the sole member (owner) is Springer Science + Business Media Finance Inc (SSBM Finance Inc). SSBM Finance Inc is a Delaware corporation. For information on translations, please e-mail booktranslations@springernature.com; for reprint, paperback, or audio rights, please e-mail bookpermissions@springernature.com.
Apress titles may be purchased in bulk for academic, corporate, or promotional use. eBook versions and licenses are also available for most titles. For more information, reference our Print and eBook Bulk Sales web page at http://www.apress.com/bulk-sales. Any source code or other supplementary material referenced by the author in this book is available to readers on GitHub via the books product page, located at www.apress.com/. For more detailed information, please visit http://www.apress.com/source- code. Printed on acid-free paper For Cayden and ChristinaTable of Contents About the Author xv About the Technical Reviewer xvii Acknowledgments xix Introduction xxi Part I: Getting Started 1 Chapter 1: The Data Lakehouse Paradigm 3 Background 3 Architecture 4 Ingestion and Processing 6 Data Factory 6 Databricks 10 Functions and Logic Apps 11 Synapse Analytics Serverless Pools 12 Stream Analytics 13 Messaging Hubs 15 Storing and Serving 16 Delta Lake 17 Synapse Analytics Dedicated SQL Pools 19 Relational Database 21 Non-relational Databases 23 Snowflake 26 Consumption 28 Analysis Services 28 Power BI 30 Power Apps 33 v Table of ConTenTs Advanced Analytics 33 Cognitive Services 34 Machine Learning 35 Continuous Integration, Deployment, and Governance 36 DevOps 36 Purview 39 Summary 41 Part II: Data Platforms 43 Chapter 2: Snowflake 45 Architecture 46 Cost 48 Security 49 Azure Key Vault 49 Azure Private Link 50 Applications 50 Replication and Failover 50 Data Integration with Azure 51 Data Lake Storage Gen2 51 Real-Time Data Loading with ADLS gen2 53 Data Factory 54 Databricks 55 Data Transformation 56 Governance 63 Column-Level Security 63 Row-Level Security 65 Access History 66 Object Tagging 66 Sharing 67 Direct Share 69 Data Marketplace 70 Data Exchange 70 vi Table of ConTenTs Continuous Integration and Deployment 71 Jenkins 71 Azure DevOps 72 Reporting 73 Power BI 73 Delta Lake, Machine Learning, and Constraints 78 Delta Lake 78 Machine Learning 79 Constraints 80 Summary 80 Chapter 3: Databricks 83 Workspaces 84 Data Science and Engineering 84 Machine Learning 91 SQL 93 Compute 96 Storage 100 Mount Data Lake Storage Gen2 Account 101 Delta Lake 114 Reporting 115 Real-Time Analytics 117 Advanced Analytics 120 Security and Governance 121 Continuous Integration and Deployment 125 Integration with Synapse Analytics 126 Dynamic Data Encryption 127 Data Profile 129 Query Profile 129 Constraints 133 Identity 133 vii Table of ConTenTs Delta Live Tables Merge 135 Summary 138 Chapter 4: Synapse Analytics 141 Workspaces 142 Storage 144 Development 149 Integration 150 Monitoring 154 Management 154 Reporting 156 Continuous Integration and Deployment 158 Real-Time Analytics 160 Structured Streaming 160 Synapse Link 160 Advanced Analytics 162 Security 163 Governance 166 Additional Features 170 Delta Tables 170 Machine Learning 172 SQL Server Integration Services Integration Runtime (SSIS IR) 172 Map Data Tool 173 Data Sharing 175 SQL Incremental 175 Constraints 176 Summary 178 Part III: Apache Spark ELT 183 Chapter 5: Pipelines and Jobs 185 Databricks 185 Data Factory 191 viii Table of ConTenTs Mapping Data Flows 191 HDInsight Spark Activity 196 Scheduling and Monitoring 200 Synapse Analytics Workspace 202 Summary 206 Chapter 6: Notebook Code 209 PySpark 210 Excel 211 XML 217 JSON 221 ZIP 225 Scala 227 SQL 228 Optimizing Performance 229 Summary 232 Part IV: Delta Lake 233 Chapter 7: Schema Evolution 235 Schema Evolution Using Parquet Format 236 Schema Evolution Using Delta Format 239 Append 240 Overwrite 241 Summary 243 Chapter 8: Change Data Feed 245 Create Database and Tables 245 Insert Data into Tables 248 Change Data Capture 249 Streaming Changes 254 Summary 255 ix Table of ConTenTs Chapter 9: Clones 257 Shallow Clones 257 Deep Clones 263 Summary 267 Chapter 10: Live Tables 269 Advantages of Delta Live Tables 270 Create a Notebook 270 Create and Run a Pipeline 274 Schedule a Pipeline 278 Explore Event Logs 280 Summary 283 Chapter 11: Sharing 285 Architecture 286 Share Data 287 Access Data 288 Sharing Data with Snowflake 291 Summary 292 Part V: Optimizing Performance 295 Chapter 12: Dynamic Partition Pruning 297 Partitions 297 Prerequisites 299 DPP Commands 299 Create Cluster 300 Create Notebook and Mount Data Lake 300 Create Fact Table 301 Verify Fact Table Partitions 304 Create Dimension Table 305 x Table of ConTenTs Join Results Without DPP Filter 306 Join Results with DPP Filter 308 Summary 309 Chapter 13: Z-Ordering and Data Skipping 311 Prepare Data in Delta Lake 312 Verify Data in Delta Lake 314 Create Hive Table 317 Run Optimize and Z-Order Commands 318 Verify Data Skipping 320 Summary 325 Chapter 14: Adaptive Query Execution 327 How It Works 327 Prerequisites 328 Comparing AQE Performance on Query with Joins 329 Create Datasets 329 Disable AQE 332 Enable AQE 334 Summary 338 Chapter 15: Bloom Filter Index 339 How a Bloom Filter Index Works 339 Create a Cluster 340 Create a Notebook and Insert Data 341 Enable Bloom Filter Index 343 Create Tables 344 Create a Bloom Filter Index 346 Optimize Table with Z-Order 348 Verify Performance Improvements 349 Summary 352 xi Table of ConTenTs Chapter 16: Hypers pace 353 Prerequisites 354 Create Parquet Files 358 Run a Query Without an Index 360 Import Hyperspace 362 Read the Parquet Files to a Data Frame 362 Create a Hyperspace Index 362 Rerun the Query with Hyperspace Index 364 Other Hyperspace Management APIs 365 Summary 366 Part VI: Advanced Capabilities 369 Chapter 17: Auto Loader 371 Advanced Schema Evolution 372 Prerequisites 372 Generate Data from SQL Database 372 Load Data to Azure Data Lake Storage Gen2 376 Configure Resources in Azure Portal 377 Configure Databricks 383 Run Auto Loader in Databricks 385 Configuration Properties 385 Rescue Data 387 Schema Hints 391 Infer Column Types 392 Add New Columns 396 Managing Auto Loader Resources 402 Read a Stream 403 Write a Stream 404 Explore Results 408 Summary 416 xii Table of ConTenTs Chapter 18: Python Wheels 417 Install Application Software 417 Install Visual Studio Code and Python Extension 418 Install Python 418 Configure Python Interpreter Path for Visual Studio Code 419 Verify Python Version in Visual Studio Code Terminal 420 Set Up Wheel Directory Folders and Files 420 Create Setup File 421 Create Readme File 422 Create License File 422 Create Init File 423 Create Package Function File 424 Install Python Wheel Packages 424 Install Wheel Package 424 Install Check Wheel Package 425 Create and Verify Wheel File 425 Create Wheel File 426 Check Wheel Contents 426 Verify Wheel File 427 Configure Databricks Environment 428 Install Wheel to Databricks Library 428 Create Databricks Notebook 429 Mount Data Lake Folder 429 Create Spark Database 430 Verify Wheel Package 431 Import Wheel Package 432 Create Function Parameters 432 Run Wheel Package Function 432 Show Spark Tables 432 xiii Table of ConTenTs Files in Databricks Repos 433 Continuous Integration and Deployment 435 Summary 436 Chapter 19: Security and Controls 437 Implement Cluster, Pool, and Jobs Access Control 437 Implement Workspace Access Control 440 Implement Other Access and Visibility Controls 442 Table Access Control 443 Personal Access Tokens 443 Visibility Controls 444 Example Row-Level Security Implementation 445 Create New User Groups 445 Load Sample Data 447 Run Queries Using Row-Level Security 450 Create Row-Level Secured Views and Grant Selective User Access 454 Interaction with Azure Active Directory 457 Summary 458 Index 459 xiv
Next page