
Butch Quinto - Next-Generation Big Data: A Practical Guide to Apache Kudu, Impala, and Spark


  • Book:
    Next-Generation Big Data: A Practical Guide to Apache Kudu, Impala, and Spark
  • Author:
    Butch Quinto
  • Publisher:
    Apress
  • Year:
    2018

Next-Generation Big Data: A Practical Guide to Apache Kudu, Impala, and Spark: summary, description and annotation


Utilize this practical and easy-to-follow guide to modernize traditional enterprise data warehouse and business intelligence environments with next-generation big data technologies.

Next-Generation Big Data takes a holistic approach, covering the most important aspects of modern enterprise big data. The book covers not only the main technology stack but also the next-generation tools and applications used for big data warehousing, data warehouse optimization, real-time and batch data ingestion and processing, real-time data visualization, big data governance, data wrangling, big data cloud deployments, and distributed in-memory big data computing. Finally, the book offers extensive and detailed coverage of big data case studies from Navistar, Cerner, British Telecom, Shopzilla, Thomson Reuters, and Mastercard.

What You'll Learn

  • Install Apache Kudu, Impala, and Spark to modernize enterprise data warehouse and business intelligence environments, complete with real-world, easy-to-follow examples, and practical advice

  • Integrate HBase, Solr, Oracle, SQL Server, MySQL, Flume, Kafka, HDFS, and Amazon S3 with Apache Kudu, Impala, and Spark

  • Use StreamSets, Talend, Pentaho, and CDAP for real-time and batch data ingestion and processing

  • Utilize Trifacta, Alteryx, and Datameer for data wrangling and interactive data processing

  • Turbocharge Spark with Alluxio, a distributed in-memory storage platform

  • Deploy big data in the cloud using Cloudera Director

  • Perform real-time data visualization and time series analysis using Zoomdata, Apache Kudu, Impala, and Spark

  • Understand enterprise big data topics such as big data governance, metadata management, data lineage, impact analysis, and policy enforcement, and how to use Cloudera Navigator to perform common data governance tasks

  • Implement big data use cases such as big data warehousing, data warehouse optimization, Internet of Things, real-time data ingestion and analytics, complex event processing, and scalable predictive modeling

  • Study real-world big data case studies from innovative companies, including Navistar, Cerner, British Telecom, Shopzilla, Thomson Reuters, and Mastercard

Who This Book Is For

BI and big data warehouse professionals interested in gaining practical and real-world insight into next-generation big data processing and analytics using Apache Kudu, Impala, and Spark; and those who want to learn more about other advanced enterprise topics

© Butch Quinto 2018
1. Next-Generation Big Data
Butch Quinto
Plumpton, Victoria, Australia
Despite all the excitement around big data, the large majority of mission-critical data is still stored in relational database management systems. This fact is supported by recent studies online and confirmed by my own professional experience working on numerous big data and business intelligence projects. Despite widespread interest in unstructured and semi-structured data, structured data still represents a significant percentage of data under management for most organizations, from the largest corporations and government agencies to small businesses and technology start-ups. Use cases that deal with unstructured and semi-structured data, while valuable and interesting, are few and far between. Unless you work for a company that does a lot of unstructured data processing such as Google, Facebook, or Apple, you are most likely working with structured data.
Big data has matured since the introduction of Hadoop more than 10 years ago. Take away all the hype, and it is evident that structured data processing and analysis has become the next-generation killer use case for big data. Most big data, business intelligence, and advanced analytic use cases deal with structured data. In fact, some of the most popular advances in big data, such as Apache Impala, Apache Phoenix, and Apache Kudu, as well as Apache Spark's recent emphasis on Spark SQL and the DataFrames API, are all about providing capabilities for structured data processing and analysis. This is largely due to big data finally being accepted as part of the enterprise. As big data platforms have improved and gained new capabilities, they have become suitable alternatives to expensive data warehouse platforms and relational database management systems for storing, processing, and analyzing mission-critical structured data.
About This Book
This book is for business intelligence and data warehouse professionals who are interested in gaining practical and real-world insight into next-generation big data processing and analytics using Apache Kudu, Apache Impala, and Apache Spark. Experienced big data professionals who would like to learn more about Kudu and other advanced enterprise topics such as real-time data ingestion and complex event processing, Internet of Things (IoT), distributed in-memory computing, big data in the cloud, big data governance and management, real-time data visualization, data wrangling, data warehouse optimization, and big data warehousing will also benefit from this book.
I assume readers will have basic knowledge of the various components of Hadoop. Some knowledge of relational database management systems, business intelligence, and data warehousing is also helpful. Some programming experience is required if you want to run the sample code provided. I focus on three main Hadoop components: Apache Spark, Apache Impala, and Apache Kudu.
Apache Spark
Apache Spark is the next-generation data processing framework with advanced in-memory capabilities and a directed acyclic graph (DAG) engine. It can handle interactive, real-time, and batch workloads with built-in machine learning, graph processing, streaming, and SQL support. Spark was developed to address the limitations of MapReduce. Spark can be 10-100x faster than MapReduce in most data processing tasks. It has APIs for Scala, Java, Python, and R. Spark is one of the most popular Apache projects and is currently used by some of the largest and most innovative companies in the world. I discuss Apache Spark in Chapter 5.
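To make the idea of a plan-then-execute engine a little more concrete, here is a minimal sketch in plain Python. It is not Spark's API: the class `ToyDataset` and its methods are hypothetical names invented for illustration, and the plan is only a linear chain of steps, which is the simplest possible case of a DAG. The sketch assumes that transformations are merely recorded and that all work happens in a single in-memory pass when a result is requested.

```python
from typing import Any, Callable, Iterable, List, Tuple

Step = Tuple[str, Callable[[Any], Any]]


class ToyDataset:
    """Hypothetical toy: records transformations lazily, runs them on collect()."""

    def __init__(self, source: Iterable[Any], plan: Tuple[Step, ...] = ()) -> None:
        self.source = source
        self.plan = plan  # ordered list of recorded steps (the "lineage")

    def map(self, fn: Callable[[Any], Any]) -> "ToyDataset":
        # Only record the step; nothing is computed yet.
        return ToyDataset(self.source, self.plan + (("map", fn),))

    def filter(self, pred: Callable[[Any], bool]) -> "ToyDataset":
        # Only record the step; nothing is computed yet.
        return ToyDataset(self.source, self.plan + (("filter", pred),))

    def collect(self) -> List[Any]:
        # Run every recorded step per item in one pass, with no
        # intermediate data set materialized between steps.
        out: List[Any] = []
        for item in self.source:
            keep = True
            for kind, fn in self.plan:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                out.append(item)
        return out


numbers = ToyDataset(range(1, 11))
evens_squared = numbers.filter(lambda n: n % 2 == 0).map(lambda n: n * n)
result = evens_squared.collect()
print(result)
```

Calling `filter` and `map` returns immediately because each call only extends the recorded plan; the loop in `collect` is where the data is actually read. A real engine would additionally optimize the plan and distribute it across a cluster, which this sketch does not attempt.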
Apache Impala
Apache Impala is a massively parallel processing (MPP) SQL engine designed to run on Hadoop platforms. The project was started by Cloudera and eventually donated to the Apache Software Foundation. Impala rivals traditional data warehouse platforms in terms of performance and scalability and was designed for business intelligence and OLAP workloads. Impala is compatible with some of the most popular BI and data visualization tools such as Tableau, Qlik, Zoomdata, Power BI, and MicroStrategy, to name a few. I cover Apache Impala in Chapter 3.
Apache Kudu
Apache Kudu is a new mutable columnar storage engine designed to handle fast data inserts and updates and efficient table scans, enabling real-time data processing and analytic workloads. When used together with Impala, Kudu is ideal for big data warehousing, EDW modernization, Internet of Things (IoT), real-time visualization, complex event processing, and as a feature store for machine learning. As a storage engine, Kudu's performance and scalability rival those of other columnar storage formats such as Parquet and ORC. It also performs significantly faster than Apache Phoenix with HBase. I discuss Kudu in Chapter 2.
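The phrase "mutable columnar storage" can be illustrated with a small sketch in plain Python. This is a toy model only, not Kudu's client API or its on-disk format: `ToyColumnarTable`, `upsert`, and `scan` are hypothetical names chosen for this example, and the sensor data is made up. The assumption is simply that each column is stored separately and that a primary-key index lets a row be updated in place.

```python
from typing import Any, Dict, List


class ToyColumnarTable:
    """Hypothetical toy: one list per column plus a primary-key index."""

    def __init__(self, key: str, columns: List[str]) -> None:
        self.key = key
        # Each column lives in its own list, so a scan reads only what it needs.
        self.columns: Dict[str, List[Any]] = {name: [] for name in [key] + columns}
        self.row_of: Dict[Any, int] = {}  # primary key -> row position

    def upsert(self, row: Dict[str, Any]) -> None:
        """Insert a new row, or update the row with the same key in place."""
        pk = row[self.key]
        pos = self.row_of.get(pk)
        if pos is None:
            # New key: append one value (or None) to every column.
            self.row_of[pk] = len(self.columns[self.key])
            for name, values in self.columns.items():
                values.append(row.get(name))
        else:
            # Existing key: overwrite only the columns that were supplied.
            for name, value in row.items():
                self.columns[name][pos] = value

    def scan(self, column: str) -> List[Any]:
        """Return a copy of a single column without touching the others."""
        return list(self.columns[column])


table = ToyColumnarTable(key="sensor_id", columns=["temperature", "status"])
table.upsert({"sensor_id": 1, "temperature": 21.5, "status": "ok"})
table.upsert({"sensor_id": 2, "temperature": 19.0, "status": "ok"})
table.upsert({"sensor_id": 1, "temperature": 23.0})  # update in place
temps = table.scan("temperature")
print(temps)
```

The third `upsert` changes the temperature for sensor 1 without rewriting the table, and `scan("temperature")` reads one column only. Those two properties together are what the paragraph above means by fast updates combined with efficient scans.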
Navigating This Book
This book is structured in easy-to-digest chapters that focus on one or two key concepts at a time. Chapters can be read in any order depending on your interest. The chapters are filled with practical examples and step-by-step instructions. Along the way, you'll find plenty of practical information on best practices and advice that will steer you in the right direction in your big data journey.
Chapter 1, Next-Generation Big Data, provides a brief introduction to the contents of this book.
Chapter 2, Introduction to Kudu, provides an introduction to Apache Kudu, starting with a discussion of Kudu's architecture. I talk about various topics such as how to access Kudu from Impala and Spark, and from Python, C++, and Java using the client API. I provide details on how to administer, configure, and monitor Kudu, including backup and recovery and high availability options for Kudu. I also discuss Kudu's strengths and limitations, including practical workarounds and advice.
Chapter 3, Introduction to Impala, provides an introduction to Apache Impala. I discuss Impala's technical architecture and capabilities with easy-to-follow examples. I cover details on how to perform system administration, monitoring, and performance tuning.
Chapter 4, High Performance Data Analysis with Impala and Kudu, covers Impala and Kudu integration with practical examples and real-world advice on how to leverage both components to deliver a high performance environment for data analysis. I discuss the strengths and limitations of Impala and Kudu, including practical workarounds and advice.
Chapter 5, Introduction to Spark, provides an introduction to Apache Spark. I cover Spark's architecture and capabilities, with practical explanations and easy-to-follow examples to help you get started with Spark development right away.
Chapter 6, High Performance Data Processing with Spark and Kudu, covers Spark and Kudu integration with practical examples and real-world advice on how to use both components for large-scale data processing and analysis.
Chapter 7, Batch and Real-Time Data Ingestion and Processing, covers batch and real-time data ingestion and processing using native and third-party commercial tools such as Flume, Kafka, Spark Streaming, StreamSets, Talend, Pentaho, and Cask. I provide step-by-step examples on how to implement complex event processing and the Internet of Things (IoT).
Chapter 8, Big Data Warehousing, covers designing and implementing star and snowflake dimensional models with Impala and Kudu. I talk about how to utilize Impala and Kudu for data warehousing, including their strengths and limitations. I also discuss EDW modernization use cases such as data consolidation, data archiving, and analytics and ETL offloading.
Chapter 9, Big Data Visualization and Data Wrangling, discusses real-time data visualization and wrangling tools designed for extremely large data sets with easy-to-follow examples and advice.