Praise for Genomics in the Cloud
This book captures the essence of what's been learned about bringing genomics to the cloud. And it lays out an accessible path for newcomers to join this exciting and important ecosystem.
Eric S. Lander, Founding Director, The Broad Institute of MIT and Harvard
This book is a fantastic introduction to modern genome analysis using state-of-the-art tools and practices. It covers everything a reader needs to get their own analyses running in an open, repeatable way. This is the quintessential primer on the GATK and cloud-based analysis with Terra.
Jonathan Smith, Principal Software Engineer, The Broad Institute of MIT and Harvard
This is a great primer about reproducible bioinformatics in the cloud. Geraldine and Brian are at the forefront of this field, so we are learning from the best. And for those who have yet to work with Terra, look no further for an excellent introduction to it!
Jessica Maia, Data Scientist, BD
Transferring from physics to cancer research as I did, I learned genomics, sequencing, statistics piecemeal. I could have used a book like this back then, because no matter how much time you've spent in the field or if it's your first contact, there's something new to learn and an appreciation for the bigger picture to be gained.
Aaron Chevalier, PhD Candidate, Boston University
Genomics in the Cloud covers everything from the science of genomic analysis to the computing technologies used to process this data at massive scale; presented in a way that lets you jump right in and run the same tools in the cloud that are used by biologists, researchers, and clinicians worldwide.
Andrew Moschetti, Senior Solutions Architect, Google Cloud Life Sciences
As the volume of genomic data increases, implementing analysis using best practice cloud patterns becomes more and more important. In this book, you'll learn these patterns via practical examples that you can try out using your own data and research questions.
Lynn Langit, Cloud Architect, Google Developer Expert and AWS Community Hero
Genomics in the Cloud is an excellent introduction both to genomics and cloud-based research, perfect for those who wish to capitalize on the cloud environment to move their research forward and for those who wish to better understand this space.
David E. Mohs, Software Engineer, The Broad Institute of MIT and Harvard
Genomics in the Cloud
by Geraldine A. Van der Auwera and Brian D. O'Connor
Copyright 2020 The Broad Institute, Inc. and Brian O'Connor. All rights reserved.
Printed in the United States of America.
Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.
- Acquisitions Editor: Rachel Novak
- Development Editor: Michele Cronin
- Production Editor: Katherine Tozer
- Copyeditor: Octal Publishing, LLC
- Proofreader: Sharon Wilkey
- Indexer: Ellen Troutman-Zaig
- Interior Designer: David Futato
- Cover Designer: Karen Montgomery
- Illustrator: Rebecca Demarest
- April 2020: First Edition
Revision History for the First Edition
- 2020-04-02: First Release
See http://oreilly.com/catalog/errata.csp?isbn=9781491975190 for release details.
The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Genomics in the Cloud, the cover image, and related trade dress are trademarks of O'Reilly Media, Inc.
The views expressed in this work are those of the authors, and do not represent the publisher's views. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
978-1-491-97519-0
[LSI]
Foreword
I migrated from mathematics into the field of genomics in 1985, roughly a year before the field officially came into existence. The word "genomics" was coined in 1986, which also saw the first public debate, at the Cold Spring Harbor Laboratory, about the notion of mounting a Human Genome Project.
It's hard to imagine how much has changed since then. Computers hardly figured in biomedicine: the initial design for the Whitehead Institute for Biomedical Research, founded in the early 1980s, included no provision for a computer. Large amounts of data were seen as a nuisance, not an asset: in a Nature article reporting on the Human Genome Project debate, the journal's biology editor wrote, "If the skill and ingenuity of modern biology are already stretched to interpret sequences of known importance, such as those of the DMD and CGD genes, what possible use could be made of more sequences?"
Despite such doubts, biologists eventually decided to press on, launching the Human Genome Project, their first major data-gathering effort, in 1990. One of the important motivations was the prospect of deploying systematic methods, rather than guesswork, to discover the genes responsible for human diseases. In 1980, a brilliant biologist, David Botstein, had conceived how to find the location of genes for rare monogenic diseases by tracing their inheritance in families relative to a genetic map of DNA variants across the human genome. Realizing the full power of the idea, though, would require mapping, and eventually sequencing, the entire human genome.
The Human Genome Project was an extraordinary collaboration that spanned six countries and twenty institutions, took thirteen years, and cost $3 billion. When the dust settled, the world had the three billion nucleotide-long DNA sequence of a single human genome.
With this project completed, many biologists thought that business would return to usual. But what happened next was even more remarkable. Over the next 15 years, biology became an information science, in which the generation of massive amounts of data reshaped the field. For example:
- Genetic mapping in families revealed the genes responsible for more than 5,000 serious rare monogenic disorders.
- New kinds of genetic mapping in populations led to the discovery of ~100,000 robust associations of specific genetic regions with common diseases and traits.
- Genetic analysis of thousands of tumors uncovered hundreds of new genes in which mutations propelled cancer.
Remarkably, the cost of sequencing a human genome fell by a factor of five million, from $3 billion to $600, and the cost is likely to reach $100 in the coming years. More than one million genomes have been sequenced so far. Overall, genomic data of all kinds is doubling roughly every eight months.
None of this would have been possible without the development of powerful new computational methods and tools to work with the many new types of data that were being generated. A good example is the Genome Analysis Toolkit, developed by colleagues at the Broad Institute, which you'll read a lot more about in this book.