Heterogeneous Computing with OpenCL 2.0
Third Edition
David Kaeli
Perhaad Mistry
Dana Schaa
Dong Ping Zhang
Copyright
Acquiring Editor: Todd Green
Editorial Project Manager: Charlie Kent
Project Manager: Priya Kumaraguruparan
Cover Designer: Matthew Limbert
Morgan Kaufmann is an imprint of Elsevier
225 Wyman Street, Waltham, MA 02451, USA
Copyright 2015, 2013, 2012 Advanced Micro Devices, Inc. Published by Elsevier Inc. All rights reserved.
No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher's permissions policies, and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency can be found at our website: www.elsevier.com/permissions.
This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).
Notices
Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary.
Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.
To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.
ISBN: 978-0-12-801414-1
British Library Cataloguing in Publication Data
A catalogue record for this book is available from the British Library
Library of Congress Cataloging-in-Publication Data
A catalog record for this book is available from the Library of Congress
For information on all MK publications visit our website at www.mkp.com
List of Figures
Fig. 1.2 Multiplying elements in arrays A and B, and storing the result in an array C.
Fig. 1.3 Task parallelism present in fast Fourier transform (FFT) application. Different input images are processed independently in the three independent tasks.
Fig. 1.4 Task-level parallelism, where multiple words can be compared concurrently. Also shown is finer-grained character-by-character parallelism present when characters within the words are compared with the search string.
Fig. 1.5 After all string comparisons in
Fig. 1.6 The relationship between parallel and concurrent programs. Parallel and concurrent programs are subsets of all programs.
Fig. 2.1 Out-of-order execution of an instruction stream of simple assembly-like instructions. Note that in this syntax, the destination register is listed first. For example, add a,b,c is a = b+c .
Fig. 2.2 VLIW execution based on the out-of-order diagram in
Fig. 2.3 SIMD execution where a single instruction is scheduled in order, but executes over multiple ALUs at the same time.
Fig. 2.4 The out-of-order schedule seen in
Fig. 2.5 Two threads scheduled in a time-slice fashion.
Fig. 2.6 Taking temporal multithreading to an extreme as is done in throughput computing: a large number of threads interleave execution to keep the device busy, whereas each individual thread takes longer to execute than the theoretical minimum.
Fig. 2.7 The AMD Puma (left) and Steamroller (right) high-level designs (not shown to any shared scale). Puma is a low-power design that follows a traditional approach to mapping functional units to cores. Steamroller combines two cores within a module, sharing its floating-point (FP) units.
Fig. 2.8 The AMD Radeon HD 6970 GPU architecture. The device is divided into two halves, where instruction control (scheduling and dispatch) is performed by the wave scheduler for each half. The 24 16-lane SIMD cores execute four-way VLIW instructions on each SIMD lane and contain private level 1 (L1) caches and local data shares (scratchpad memory).
Fig. 2.9 The Niagara 2 CPU from Sun/Oracle. The design intends to make a high level of threading efficient. Note its relative similarity to the GPU design seen in
Fig. 2.10 The AMD Radeon R9 290X architecture. The device has 44 cores in 11 clusters. Each core consists of a scalar execution unit that handles branches and basic integer operations, and four 16-lane SIMD ALUs. The clusters share instruction and scalar caches.
Fig. 2.11 The NVIDIA GeForce GTX 780 architecture. The device has 12 large cores that NVIDIA refers to as streaming multiprocessors (SMX). Each SMX has 12 SIMD units (with specialized double-precision and special function units), a single L1 cache, and a read-only data cache.
Fig. 2.12 The A10-7850K APU consists of two Steamroller-based CPU cores and eight Radeon R9 GPU cores (32 16-lane SIMD units in total). The APU includes a fast bus from the GPU to DDR3 memory, and a shared path that is optionally coherent with CPU caches.
Fig. 2.13 An Intel i7 processor with HD Graphics 4000 graphics. Although not termed APU by Intel, the concept is the same as for the devices in that category from AMD. Intel combines four Haswell x86 cores with its graphics processors, connected to a shared last-level cache (LLC) via a ring bus.
Fig. 3.1 An OpenCL platform with multiple compute devices. Each compute device contains one or more compute units. A compute unit is composed of one or more processing elements (PEs). A system could have multiple platforms present at the same time; for example, it could have both an AMD platform and an Intel platform.
Fig. 3.2 Some of the output from the CLInfo program showing the characteristics of an OpenCL platform and its devices. We see that the AMD platform has two devices (a CPU and a GPU). The output shown here can be queried using functions from the platform API.
Fig. 3.3 Vector addition algorithm showing how each element can be added independently.
Fig. 3.4 The hierarchical model used for creating an NDRange of work-items, grouped into work-groups.
Fig. 3.5 The OpenCL runtime shown denotes an OpenCL context with two compute devices (a CPU device and a GPU device). Each compute device has its own command-queues. Host-side and device-side command-queues are shown. The device-side queues are visible only from kernels executing on the compute device. The memory objects have been defined within the memory model.
Fig. 3.6 Memory regions and their scope in the OpenCL memory model.
Fig. 3.7 Mapping the OpenCL memory model to an AMD Radeon HD 7970 GPU.
Fig. 4.1 A histogram generated from an 8-bit image. Each of the 256 bins corresponds to the frequency of the corresponding pixel value.