1. Next-Generation Big Data
Despite all the excitement around big data, the large majority of mission-critical data is still stored in relational database management systems. Recent industry studies support this observation, as does my own professional experience working on numerous big data and business intelligence projects. Despite widespread interest in unstructured and semi-structured data, structured data still represents a significant percentage of the data under management for most organizations, from the largest corporations and government agencies to small businesses and technology start-ups. Use cases that deal with unstructured and semi-structured data, while valuable and interesting, are few and far between. Unless you work for a company that does a lot of unstructured data processing, such as Google, Facebook, or Apple, you are most likely working with structured data.
Big data has matured since the introduction of Hadoop more than 10 years ago. Take away all the hype, and it is evident that structured data processing and analysis has become the next-generation killer use case for big data. Most big data, business intelligence, and advanced analytic use cases deal with structured data. In fact, some of the most popular advances in big data, such as Apache Impala, Apache Phoenix, and Apache Kudu, as well as Apache Spark's recent emphasis on Spark SQL and the DataFrames API, are all about providing capabilities for structured data processing and analysis. This is largely due to big data finally being accepted as part of the enterprise. As big data platforms have improved and gained new capabilities, they have become suitable alternatives to expensive data warehouse platforms and relational database management systems for storing, processing, and analyzing mission-critical structured data.
About This Book
This book is for business intelligence and data warehouse professionals who are interested in gaining practical and real-world insight into next-generation big data processing and analytics using Apache Kudu, Apache Impala, and Apache Spark. Experienced big data professionals who would like to learn more about Kudu and other advanced enterprise topics such as real-time data ingestion and complex event processing, Internet of Things (IoT), distributed in-memory computing, big data in the cloud, big data governance and management, real-time data visualization, data wrangling, data warehouse optimization, and big data warehousing will also benefit from this book.
I assume readers will have basic knowledge of the various components of Hadoop. Some knowledge of relational database management systems, business intelligence, and data warehousing is also helpful. Some programming experience is required if you want to run the sample code provided. I focus on three main Hadoop components: Apache Spark, Apache Impala, and Apache Kudu.
Apache Spark
Apache Spark is a next-generation data processing framework with advanced in-memory capabilities and a directed acyclic graph (DAG) engine. It can handle interactive, real-time, and batch workloads with built-in machine learning, graph processing, streaming, and SQL support. Spark was developed to address the limitations of MapReduce and can be 10 to 100 times faster than MapReduce for most data processing tasks. It has APIs for Scala, Java, Python, and R. Spark is one of the most popular Apache projects and is currently used by some of the largest and most innovative companies in the world. I discuss Apache Spark in Chapter 5.
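To give a quick taste of the DataFrame and Spark SQL capabilities mentioned above, here is a minimal Scala sketch. It assumes a local Spark installation; the file path, table name, and column names are purely illustrative, not from this book's examples.

```scala
import org.apache.spark.sql.SparkSession

object SparkSketch {
  def main(args: Array[String]): Unit = {
    // Start a local Spark session; on a cluster you would typically
    // use spark-shell or spark-submit instead of master("local[*]").
    val spark = SparkSession.builder()
      .appName("spark-sketch")
      .master("local[*]")
      .getOrCreate()

    // Read a CSV file into a DataFrame (path and schema are illustrative).
    val sales = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/tmp/sales.csv")

    // The same aggregation expressed two ways:
    // the DataFrame API and Spark SQL.
    sales.groupBy("region").sum("amount").show()

    sales.createOrReplaceTempView("sales")
    spark.sql("SELECT region, SUM(amount) FROM sales GROUP BY region").show()

    spark.stop()
  }
}
```

Both forms compile to the same execution plan; which one you use is largely a matter of taste and tooling.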
Apache Impala
Apache Impala is a massively parallel processing (MPP) SQL engine designed to run on Hadoop platforms. The project was started by Cloudera and eventually donated to the Apache Software Foundation. Impala rivals traditional data warehouse platforms in terms of performance and scalability and was designed for business intelligence and OLAP workloads. Impala is compatible with some of the most popular BI and data visualization tools, such as Tableau, Qlik, Zoomdata, Power BI, and MicroStrategy. I cover Apache Impala in Chapter 3.
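To illustrate the kind of BI-style workload Impala targets, here is a hypothetical OLAP-style query as you might run it from impala-shell. The table and column names are made up for illustration.

```sql
-- A typical BI aggregation: top revenue-generating regions.
-- Table and column names are illustrative.
SELECT   c.region,
         SUM(o.total_amount) AS revenue
FROM     orders o
JOIN     customers c ON o.customer_id = c.customer_id
WHERE    o.order_date >= '2017-01-01'
GROUP BY c.region
ORDER BY revenue DESC
LIMIT    10;
```

Because Impala executes queries like this in parallel across the cluster rather than translating them to MapReduce jobs, response times are interactive rather than batch-oriented.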
Apache Kudu
Apache Kudu is a new mutable columnar storage engine designed to handle fast inserts and updates as well as efficient table scans, enabling real-time data processing and analytic workloads. When used together with Impala, Kudu is ideal for big data warehousing, EDW modernization, Internet of Things (IoT), real-time visualization, complex event processing, and machine learning feature stores. As a storage engine, Kudu's performance and scalability rival those of other columnar storage formats such as Parquet and ORC. It also performs significantly faster than Apache Phoenix with HBase. I discuss Kudu in Chapter 2.
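As a sketch of what Kudu's mutability means in practice when paired with Impala, the following SQL creates a Kudu-backed table and then updates a single row in place. The table and column names are illustrative; exact syntax details can vary by Impala version.

```sql
-- Create a Kudu-backed table through Impala (names are illustrative).
CREATE TABLE sensor_readings (
  sensor_id   BIGINT,
  read_time   TIMESTAMP,
  temperature DOUBLE,
  PRIMARY KEY (sensor_id, read_time)
)
PARTITION BY HASH (sensor_id) PARTITIONS 4
STORED AS KUDU;

-- Unlike HDFS-backed formats such as Parquet, Kudu supports
-- single-row UPDATE and DELETE statements.
UPDATE sensor_readings
SET    temperature = 21.5
WHERE  sensor_id = 100 AND read_time = '2017-06-01 00:00:00';
```

This ability to update and delete individual rows, while still scanning columns efficiently, is what makes Kudu a good fit for the real-time analytic workloads described above.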
Navigating This Book
This book is structured in easy-to-digest chapters that focus on one or two key concepts at a time. Chapters can be read in any order depending on your interest. The chapters are filled with practical examples and step-by-step instructions. Along the way, you'll find plenty of practical information on best practices and advice that will steer you in the right direction in your big data journey.
Chapter 1, Next-Generation Big Data, provides a brief introduction to the contents of this book.
Chapter 2, Introduction to Kudu, provides an introduction to Apache Kudu, starting with a discussion of Kudu's architecture. I talk about various topics such as how to access Kudu from Impala and Spark, and from Python, C++, and Java using the client APIs. I provide details on how to administer, configure, and monitor Kudu, including backup and recovery and high availability options. I also discuss Kudu's strengths and limitations, including practical workarounds and advice.
Chapter 3, Introduction to Impala, provides an introduction to Apache Impala. I discuss Impala's technical architecture and capabilities with easy-to-follow examples. I cover details on how to perform system administration, monitoring, and performance tuning.
Chapter 4, High Performance Data Analysis with Impala and Kudu, covers Impala and Kudu integration, with practical examples and real-world advice on how to leverage both components to deliver a high-performance environment for data analysis. I discuss the strengths and limitations of Impala and Kudu, including practical workarounds and advice.
Chapter 5, Introduction to Spark, provides an introduction to Apache Spark. I cover Spark's architecture and capabilities, with practical explanations and easy-to-follow examples to help you get started with Spark development right away.
Chapter 6, High Performance Data Processing with Spark and Kudu, covers Spark and Kudu integration, with practical examples and real-world advice on how to use both components for large-scale data processing and analysis.
Chapter 7, Batch and Real-Time Data Ingestion and Processing, covers batch and real-time data ingestion and processing using native and third-party commercial tools such as Flume, Kafka, Spark Streaming, StreamSets, Talend, Pentaho, and Cask. I provide step-by-step examples on how to implement complex event processing and the Internet of Things (IoT).
Chapter 8, Big Data Warehousing, covers designing and implementing star and snowflake dimensional models with Impala and Kudu. I talk about how to utilize Impala and Kudu for data warehousing, including their strengths and limitations. I also discuss EDW modernization use cases such as data consolidation, data archiving, and analytics and ETL offloading.
Chapter 9, Big Data Visualization and Data Wrangling, discusses real-time data visualization and data wrangling tools designed for extremely large data sets, with easy-to-follow examples and advice.