High-Performance Big Data Computing
Dhabaleswar K. Panda, Xiaoyi Lu, and Dipti Shankar
The MIT Press
Cambridge, Massachusetts
London, England
© 2022 Massachusetts Institute of Technology
All rights reserved. No part of this book may be reproduced in any form by any electronic or mechanical means (including photocopying, recording, or information storage and retrieval) without permission in writing from the publisher.
The MIT Press would like to thank the anonymous peer reviewers who provided comments on drafts of this book. The generous work of academic experts is essential for establishing the authority and quality of our publications. We acknowledge with gratitude the contributions of these otherwise uncredited readers.
Library of Congress Cataloging-in-Publication Data
Names: Panda, Dhabaleswar K., author. | Lu, Xiaoyi (Professor of computer science), author. | Shankar, Dipti, author.
Title: High-performance big data computing / Dhabaleswar K. Panda, Xiaoyi Lu, and Dipti Shankar.
Description: Cambridge, Massachusetts: The MIT Press, [2022] | Series: Scientific and engineering computation | Includes bibliographical references and index.
Identifiers: LCCN 2021038754 | ISBN 9780262046855 (hardcover)
Subjects: LCSH: High performance computing. | Big data.
Classification: LCC QA76.88.P36 2022 | DDC 005.7–dc23/eng/20211020
LC record available at https://lccn.loc.gov/2021038754
Contents
List of Figures
Four V characteristics of big data.
Data Never Sleeps 8.0. Source: Courtesy of Domo.
Convergence of HPC, big data, and deep learning.
Challenges in bringing HPC, big data processing, and deep learning into a convergent trajectory.
Can we efficiently run big data and deep learning jobs on existing HPC infrastructure?
Challenges of high-performance big data computing. HDD, hard disk drive; NICs, network interface cards; QoS, quality of service; SR-IOV, single root input/output virtualization.
Outline of the book.
Programming models for distributed data processing.
Overview of data processing with MapReduce.
Overview of data processing with Hadoop MapReduce.
Overview of Spark architecture.
Spark dependencies.
Overview of streaming processing with Apache Storm.
Overview of streaming processing with Apache Flink.
Overview of TensorFlow stack.
Overview of distributed TensorFlow environment.
Various types of parallel and distributed storage systems. PCIe, Peripheral Component Interconnect Express; RADOS, Reliable, Autonomic Distributed Object Store; SATA, Serial AT Attachment.
File system architecture. MR, MapReduce; OSS, Object Storage Server; OST, Object Storage Target.
OpenStack Swift: architecture overview.
Apache Cassandra: architecture overview.
Memcached (distributed caching over DRAM).
Redis (distributed in-memory data store). HA, High Availability.
Overview of a typical HPC system architecture.
Storage device hierarchy.
NVMe command processing.
Overview of high-performance network interconnects and protocols. OFI, OpenFabrics Interfaces.
One-way latency: MPI over RDMA networks with MVAPICH2. (a) Small message latency. (b) Large message latency.
Bandwidth: MPI over RDMA networks with MVAPICH2. (a) Unidirectional bandwidth. (b) Bidirectional bandwidth.
RDMA over NVM: contrasting NVMe-oF and remote PMoF.
Envisioned architecture for next-generation HEC systems. HEC, high-end computing. Courtesy of Panda et al. (2018).
Challenges of achieving high-performance big data computing.
Big data benchmarks.
Design overview of RDMA-Memcached (Shankar, Lu, Islam, et al., 2016; Jose et al., 2011). KNL, kernel; LRU, least recently used.
RDMA-based Hadoop architecture and its different modes. PBS, Portable Batch System.
Performance improvement of RDMA-based designs for Apache Spark and Hadoop on the SDSC Comet cluster. (a) PageRank with RDMA-Spark. (b) Sort with RDMA-Hadoop 2.x.
Performance benefits with RDMA-Memcached-based workloads. (a) Memcached Set/Get over simulated MySQL. (b) Hadoop TestDFSIO throughput with Boldio.
Architecture overview of GPU-aware hash table in Memcached.
Stand-alone throughput with CPU and GPU-centric hash table (based on Mega-KV (K. Zhang, Wang, et al., 2015)). (a) Insert. (b) Search. MOPS, millions of operations per second; thrs, threads.
Stand-alone hash table probing performance on the twenty-eight-core Intel Skylake CPU, over a three-way cuckoo hash table versus the non-SIMD CPU-optimized MemC3 hash table with 32-bit key/payload (Shankar et al., 2019a).
Performance benefits of heterogeneous storage-aware designs for Hadoop on SDSC Comet. (a) NVM-assisted MapReduce design. (b) Spark TeraSort over heterogeneity-aware HDFS. MR-IPoIB, Default MapReduce running with the IPoIB protocol; RMR, RDMA-based MapReduce; RMR-NVM, RDMA-based MapReduce running with NVM in a naive manner; NVMD, Non-Volatile Memory-assisted design for MapReduce and DAG execution frameworks (Rahman et al., 2017).
Performance benefits with RDMA-Memcached-based workloads.
Deep learning and big data analytics pipeline. Source: Courtesy of Flickr (Garrigues, 2015).
Overview of a unified DLoBD stack. IB, InfiniBand.
Convergence of deep learning, big data, and HPC.
Overview of CaffeOnSpark. DB, database.
Overview of TensorFlowOnSpark.
Overview of MMLSpark (CNTKOnSpark).
Overview of BigDL.
Comparison of DNNs. Source: Courtesy of Canziani et al. (2016). BN, Batch Normalization; ENet, efficient neural network; G-Ops, billions (10⁹) of operations; ResNet, residual neural network; M, million; NIN, Network in Network; GoogLeNet, a 22-layer deep convolutional neural network that is a variant of the Inception neural network developed by researchers at Google.
Performance and accuracy comparison of training AlexNet on ImageNet with CaffeOnSpark running over IPoIB and RDMA. Source: Courtesy of X. Lu et al. (2018).
Performance analysis of TensorFlowOnSpark and stand-alone TensorFlow (lower is better). The numbers were taken by training the SoftMax Regression model over the MNIST dataset on a four-node cluster, which includes one PS (parameter server) and three workers. Source: Courtesy of X. Lu et al. (2018).
Overview of virtualization techniques. (a) VM architecture. (b) Container architecture. libs, libraries; OS, operating system.
SR-IOV architecture. IOMMU, input-output memory management unit.
Topology-aware resource allocation in Hadoop-Virt.
NVMe hardware arbitration overview.
Performance benefits of Hadoop-Virt on HPC clouds. Execution times for (a) WordCount, (b) PageRank, (c) Sort, and (d) Self-Join (30 GB).
Evaluation with synthetic application scenarios. (a) Bandwidth over time with scenario 1. (b) Job bandwidth ratio for scenarios 2-5.
Acknowledgments
We are grateful to our students and collaborators, Adithya Bhat, Rajarshi Biswas, Shashank Gugnani, Yujie Hui, Nusrat Islam, Haseeb Javed, Arjun Kashyap, Kunal Kulkarni, Tianxi Li, Yuke Li, Hao Qi, Md. Wasi-ur-Rahman, Haiyang Shi, and Jie Zhang, for their joint scientific work over the past ten years. We sincerely thank Shashank Gugnani, Haseeb Javed, Arjun Kashyap, Yuke Li, Hao Qi, and Haiyang Shi for their contributions to this collection or for proofreading several versions of this manuscript. Special thanks to Marie Lee, Kate Elwell, and Elizabeth Swayze from The MIT Press for their significant help in publishing this book. In addition, we are indebted to the National Science Foundation (NSF) for multiple grants (e.g., IIS-1447804, OAC-1636846, CCF-1822987, OAC-2007991, OAC-2112606, and CCF-2132049). This book would not have been possible without this support.