
Dhabaleswar K. Panda - High-Performance Big Data Computing (Scientific and Engineering Computation)


An in-depth overview of an emerging field that brings together high-performance computing, big data processing, and deep learning.
Over the last decade, the exponential explosion of data known as big data has changed the way we understand and harness the power of data. The emerging field of high-performance big data computing, which brings together high-performance computing (HPC), big data processing, and deep learning, aims to meet the challenges posed by large-scale data processing. This book offers an in-depth overview of high-performance big data computing and the associated technical issues, approaches, and solutions.
The book covers basic concepts and necessary background knowledge, including data processing frameworks, storage systems, and hardware capabilities; offers a detailed discussion of technical issues in accelerating big data computing in terms of computation, communication, memory and storage, codesign, workload characterization and benchmarking, and system deployment and management; and surveys benchmarks and workloads for evaluating big data middleware systems. It presents a detailed discussion of big data computing systems and applications with high-performance networking, computing, and storage technologies, including state-of-the-art designs for data processing and storage systems. Finally, the book considers some advanced research topics in high-performance big data computing, including designing high-performance deep learning over big data (DLoBD) stacks and HPC cloud technologies.


High-Performance Big Data Computing

Dhabaleswar K. Panda, Xiaoyi Lu, and Dipti Shankar

The MIT Press
Cambridge, Massachusetts
London, England

© 2022 Massachusetts Institute of Technology

All rights reserved. No part of this book may be reproduced in any form by any electronic or mechanical means (including photocopying, recording, or information storage and retrieval) without permission in writing from the publisher.

The MIT Press would like to thank the anonymous peer reviewers who provided comments on drafts of this book. The generous work of academic experts is essential for establishing the authority and quality of our publications. We acknowledge with gratitude the contributions of these otherwise uncredited readers.

Library of Congress Cataloging-in-Publication Data

Names: Panda, Dhabaleswar K., author. | Lu, Xiaoyi (Professor of computer science), author. | Shankar, Dipti, author.

Title: High-performance big data computing / Dhabaleswar K. Panda, Xiaoyi Lu, and Dipti Shankar.

Description: Cambridge, Massachusetts: The MIT Press, [2022] | Series: Scientific and engineering computation | Includes bibliographical references and index.

Identifiers: LCCN 2021038754 | ISBN 9780262046855 (hardcover)

Subjects: LCSH: High performance computing. | Big data.

Classification: LCC QA76.88.P36 2022 | DDC 005.7—dc23/eng/20211020

LC record available at https://lccn.loc.gov/2021038754


Contents
List of Figures


Four V characteristics of big data.


Data Never Sleeps 8.0. Source: Courtesy of Domo.


Convergence of HPC, big data, and deep learning.


Challenges in bringing HPC, big data processing, and deep learning into a convergent trajectory.


Can we efficiently run big data and deep learning jobs on existing HPC infrastructure?


Challenges of high-performance big data computing. HDD, hard disk drive; NICs, network interface cards; QoS, quality of service; SR-IOV, single root input/output virtualization.


Outline of the book.


Programming models for distributed data processing.


Overview of data processing with MapReduce.


Overview of data processing with Hadoop MapReduce.


Overview of Spark architecture.


Spark dependencies.


Overview of streaming processing with Apache Storm.


Overview of streaming processing with Apache Flink.


Overview of TensorFlow stack.


Overview of distributed TensorFlow environment.


Various types of parallel and distributed storage systems. PCIe, Peripheral Component Interconnect Express; RADOS, Reliable, Autonomic Distributed Object Store; SATA, Serial AT Attachment.


File system architecture. MR, MapReduce; OSS, Object Storage Server; OST, Object Storage Target.


OpenStack Swift: architecture overview.


Apache Cassandra: architecture overview.


Memcached (distributed caching over DRAM).


Redis (distributed in-memory data store). HA, High Availability.


Overview of a typical HPC system architecture.


Storage device hierarchy.


NVMe command processing.


Overview of high-performance network interconnects and protocols. OFI (OpenFabrics Interfaces).


One-way latency: MPI over RDMA networks with MVAPICH2. (a) Small message latency. (b) Large message latency.


Bandwidth: MPI over RDMA networks with MVAPICH2. (a) Unidirectional bandwidth. (b) Bidirectional bandwidth.


RDMA over NVM: contrasting NVMeoF and remote PMoF.


Envisioned architecture for next-generation HEC systems. Courtesy of Panda et al. (2018).


Challenges of achieving high-performance big data computing.


Big data benchmarks.


Design overview of RDMA-Memcached (Shankar, Lu, Islam, et al., 2016; Jose et al., 2011). KNL, kernel; LRU, least recently used.


RDMA-based Hadoop architecture and its different modes. PBS, Portable Batch System.


Performance improvement of RDMA-based designs for Apache Spark and Hadoop on SDSC Comet cluster. (a) PageRank with RDMA-Spark. (b) Sort with RDMA-Hadoop 2.x.


Performance benefits with RDMA-Memcached based workloads. (a) Memcached Set/Get over simulated MySQL. (b) Hadoop TestDFSIO throughput with Boldio.


Architecture overview of GPU-aware hash table in Memcached.


Stand-alone throughput with CPU and GPU-centric hash table (based on Mega-KV (K. Zhang, Wang, et al., 2015)). (a) Insert. (b) Search. MOPS, millions of operations per second; thrs, threads.


Stand-alone hash table probing performance on the twenty-eight-core Intel Skylake CPU, over a three-way cuckoo hash table versus non-SIMD CPU-optimized MemC3 hash table with 32-bit key/payload (Shankar et al., 2019a).


Performance benefits of heterogeneous storage-aware designs for Hadoop on SDSC Comet. (a) NVM-assisted MapReduce design. (b) Spark TeraSort over heterogeneity-aware HDFS. MR-IPoIB, Default MapReduce running with the IPoIB protocol; RMR, RDMA-based MapReduce; RMR-NVM, RDMA-based MapReduce running with NVM in a naive manner; NVMD, Non-Volatile Memory-assisted design for MapReduce and DAG execution frameworks (Rahman et al., 2017).


Performance benefits with RDMA-Memcached-based workloads.


Deep learning and big data analytics pipeline. Source: Courtesy of Flickr (Garrigues, 2015).


Overview of a unified DLoBD stack. IB, InfiniBand.


Convergence of deep learning, big data, and HPC.


Overview of CaffeOnSpark. DB, database.


Overview of TensorFlowOnSpark.


Overview of MMLSpark (CNTKOnSpark).


Overview of BigDL.


Comparison of DNNs. Source: Courtesy of Canziani et al. (2016). BN, Batch Normalization; ENet, efficient neural network; G-Ops, one billion (10⁹) operations per second; ResNet, residual neural network; M, million; NIN, Network in Network; GoogLeNet, a 22-layer deep convolutional neural network that is a variant of the Inception neural network developed by researchers at Google.


Performance and accuracy comparison of training AlexNet on ImageNet with CaffeOnSpark running over IPoIB and RDMA. Source: Courtesy of X. Lu et al. (2018).


Performance analysis of TensorFlowOnSpark and stand-alone TensorFlow (lower is better). The numbers were taken by training the SoftMax Regression model over the MNIST dataset on a four-node cluster, which includes one PS and three workers. Source: Courtesy of X. Lu et al. (2018).


Overview of virtualization techniques. (a) VM architecture. (b) Container architecture. libs, libraries; OS, operating system.


SR-IOV architecture. IOMMU, input-output memory management unit.


Topology-aware resource allocation in Hadoop-Virt.


NVMe hardware arbitration overview.


Performance benefits of Hadoop-Virt on HPC clouds. Execution times for (a) WordCount, (b) PageRank, (c) Sort, and (d) Self-Join (30 GB).


Evaluation with synthetic application scenarios. (a) Bandwidth over time with scenario 1. (b) Job bandwidth ratio for scenarios 2–5.

Acknowledgments

We are grateful to our students and collaborators, Adithya Bhat, Rajarshi Biswas, Shashank Gugnani, Yujie Hui, Nusrat Islam, Haseeb Javed, Arjun Kashyap, Kunal Kulkarni, Tianxi Li, Yuke Li, Hao Qi, Md. Wasi-ur-Rahman, Haiyang Shi, and Jie Zhang, for their joint scientific work over the past ten years. We sincerely thank Shashank Gugnani, Haseeb Javed, Arjun Kashyap, Yuke Li, Hao Qi, and Haiyang Shi for their contributions to this collection or for proofreading several versions of this manuscript. Special thanks to Marie Lee, Kate Elwell, and Elizabeth Swayze from The MIT Press for their significant help in publishing this book. In addition, we are indebted to the National Science Foundation (NSF) for multiple grants (e.g., IIS-1447804, OAC-1636846, CCF-1822987, OAC-2007991, OAC-2112606, and CCF-2132049). This book would not have been possible without this support.
