To my wife, Peggy, who has supported not only my journey in high performance computing, but also that of our son Jon and daughter Rachel. Scientific programming is far from her medical expertise, but she has accompanied me and made it our journey. To my son, Jon, and daughter, Rachel, who have rekindled the flame and whose futures are so promising.
To my husband Rick, who supported me the entire way, thank you for taking the early shifts and letting me work into the night. You never let me give up on myself. To my parents and in-laws, thank you for all your help and support. And to my son, Derek, for being one of my biggest inspirations; you are the reason I leap instead of jump.
front matter
preface
From the authors
Bob Robey, Los Alamos, New Mexico
It's a dangerous business, Frodo, going out of your door. You step onto the Road, and if you don't keep your feet, there is no knowing where you might be swept off to.
Bilbo Baggins
I could not have foreseen where this journey into parallel computing would take us. I say us because the journey has been shared by numerous colleagues over the years. My journey into parallel computing began in the early 1990s, while I was at the University of New Mexico. I had written some compressible fluid dynamics codes to model shock tube experiments and was running them on every system I could get my hands on. As a result, I, along with Brian Smith, John Sobolewski, and Frank Gilfeather, was asked to submit a proposal for a high performance computing center. We won the grant and established the Maui High Performance Computing Center in 1993. My part in the project was to offer courses and lead 20 graduate students in developing parallel computing at the University of New Mexico in Albuquerque.
The 1990s were a formative time for parallel computing. I remember a talk by Al Geist, one of the original developers of Parallel Virtual Machine (PVM) and a member of the MPI standards committee. He talked about the soon-to-be-released MPI standard (June 1994). He said it would never go anywhere because it was too complex. Al was right about the complexity, but despite that, MPI took off, and within months it was used by nearly every parallel application. One of the reasons for its success was that implementations were ready to go. Argonne had been developing Chameleon, a portability tool that translated among the message-passing libraries of that time, including P4, PVM, MPL, and many others. The project was quickly reworked into MPICH, which became the first high-quality MPI implementation. For over a decade, MPI was synonymous with parallel computing; nearly every parallel application was built on top of MPI libraries.
Now let's fast-forward to 2010 and the emergence of GPUs. I came across a Dr. Dobb's Journal article on using a Kahan sum to compensate for the single-precision-only arithmetic available on GPUs at the time. I thought that the approach might help resolve a long-standing issue in parallel computing, where the global sum of an array changes depending on the number of processors. To test this out, I thought of a fluid dynamics code that my son Jon wrote in high school. It checked the mass and energy conservation in the problem over time and exited if either changed by more than a specified amount. While he was home over spring break during his freshman year at the University of Washington, we tried out the method and were pleasantly surprised by how much the mass conservation improved. For production codes, this simple technique would prove to be important. We cover the enhanced precision sum algorithm for parallel global sums in section 5.7 of this book.
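As a taste of what is covered there, the following is a minimal sketch of the classic Kahan compensated sum in C. The array size and test values are illustrative assumptions for this sketch, not the code from section 5.7.

/* Minimal sketch of Kahan (compensated) summation: a small correction
   term carries the low-order bits that each addition would otherwise lose.
   Compile without aggressive floating-point optimizations (e.g., -ffast-math),
   which can remove the compensation. */
#include <stdio.h>

float kahan_sum(const float *x, int n)
{
   float sum = 0.0f;        /* running sum */
   float c   = 0.0f;        /* compensation for lost low-order bits */
   for (int i = 0; i < n; i++) {
      float y = x[i] - c;   /* apply the correction from the previous step */
      float t = sum + y;    /* low-order bits of y can be lost here */
      c = (t - sum) - y;    /* recover what was lost */
      sum = t;
   }
   return sum;
}

int main(void)
{
   /* Illustrative test: a million small values whose naive single-precision
      sum accumulates noticeable rounding error. */
   enum { N = 1000000 };
   static float x[N];
   for (int i = 0; i < N; i++) x[i] = 1.0e-4f;

   float naive = 0.0f;
   for (int i = 0; i < N; i++) naive += x[i];

   printf("naive sum = %f   kahan sum = %f   expected = 100.0\n",
          naive, kahan_sum(x, N));
   return 0;
}

Because floating-point addition is not associative, each processor in a parallel run adds up its portion of the array in a different order; carrying a compensation term like this is what makes the global total far less sensitive to the processor count.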
In 2011, I organized a summer project with three students, Neal Davis, David Nicholaeff, and Dennis Trujillo, to see whether we could get more complex codes, such as adaptive mesh refinement (AMR) and unstructured arbitrary Lagrangian-Eulerian (ALE) applications, to run on a GPU. The result was CLAMR, an AMR mini-app that ran entirely on a GPU. Much of the application was easy to port. The most difficult part was determining the neighbors of each cell. The original CPU code used a k-d tree algorithm, but tree-based algorithms are difficult to port to GPUs. Two weeks into the summer project, the Las Conchas Fire erupted in the hills above Los Alamos, and the town was evacuated. We left for Santa Fe, and the students scattered. During the evacuation, I met with David Nicholaeff in downtown Santa Fe to discuss the GPU port. He suggested that we try a hash algorithm in place of the tree-based code for finding neighbors. At the time, I was watching the fire burning above the town and wondering whether it had reached my house. In spite of that, I agreed to try it, and the hash algorithm got the entire code running on the GPU. The hashing technique was later generalized by David, my daughter Rachel (while she was in high school), and me. These hash algorithms form the basis for many of the algorithms presented in chapter 5.
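For readers curious what replacing a tree with a hash looks like, here is a minimal, illustrative sketch in C of neighbor finding on a uniform grid. The real CLAMR version in chapter 5 also handles refinement levels; the mesh layout and variable names below are assumptions made for this example.

/* Sketch of neighbor finding with a spatial hash on a uniform grid.
   Each cell writes its own index into a table addressed by its (i, j)
   location; a neighbor lookup is then a single read instead of a tree search. */
#include <stdlib.h>
#include <stdio.h>

typedef struct { int i, j; } Cell;

int main(void)
{
   /* Hypothetical 4x4 mesh stored as an unordered list of cells. */
   const int imax = 4, jmax = 4, ncells = imax * jmax;
   Cell *cells = malloc(ncells * sizeof(Cell));
   for (int n = 0; n < ncells; n++) {
      cells[n].i = n % imax;
      cells[n].j = n / imax;
   }

   /* Build the hash: hash[j*imax + i] holds the cell number at (i, j). */
   int *hash = malloc(imax * jmax * sizeof(int));
   for (int n = 0; n < ncells; n++)
      hash[cells[n].j * imax + cells[n].i] = n;

   /* Look up the left neighbor of every cell with one read per cell. */
   for (int n = 0; n < ncells; n++) {
      int i = cells[n].i, j = cells[n].j;
      int nleft = (i > 0) ? hash[j * imax + (i - 1)] : n;  /* clamp at boundary */
      printf("cell %2d at (%d,%d): left neighbor %2d\n", n, i, j, nleft);
   }

   free(hash);
   free(cells);
   return 0;
}

Because both the hash setup and the lookups are independent per cell, the whole pass maps naturally onto one GPU thread per cell, which is what made the idea so attractive for the GPU port.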