OTHER RELATED TITLES FROM COLD SPRING HARBOR LABORATORY PRESS
A Short Guide to the Human Genome
Guide to the Human Genome
Molecular Cloning: A Laboratory Manual, Fourth Edition
HANDBOOKS
An A to Z of DNA Science: What Scientists Mean When They Talk about Genes and Genomes
At the Bench: A Laboratory Navigator, Updated Edition
At the Helm: Leading Your Laboratory, Second Edition
An Illustrated Chinese-English Guide for Biomedical Scientists
Binding and Kinetics for Molecular Biologists
Career Opportunities in Biotechnology and Drug Development
C. elegans Atlas
Experimental Design for Biologists
Fly Pushing: The Theory and Practice of Drosophila Genetics, Second Edition
Is It in Your Genes? The Influence of Genes on Common Disorders and Diseases That Affect You and Your Family
Lab Dynamics: Management and Leadership Skills for Scientists, Second Edition
Lab Math: A Handbook of Measurements, Calculations, and Other Quantitative Skills for Use at the Bench
Lab Ref, Volume 1: A Handbook of Recipes, Reagents, and Other Reference Tools for Use at the Bench
Lab Ref, Volume 2: A Handbook of Recipes, Reagents, and Other Reference Tools for Use at the Bench
Statistics at the Bench: A Step-by-Step Handbook for Biologists
NEXT-GENERATION DNA SEQUENCING INFORMATICS
EDITED BY
STUART M. BROWN
All rights reserved
© 2013 by Cold Spring Harbor Laboratory Press, Cold Spring Harbor, New York
Printed in the United States of America
Publisher and Acquisition Editor: John Inglis
Director of Editorial Development: Jan Argentine
Project Manager: Inez Sialiano
Permissions Coordinator: Carol Brown
Production Editor: Kathleen Bubbeo
Production Manager: Denise Weiss
Sales Account Manager: Elizabeth Powers
Cover Designer: Michael Albano
Front cover artwork: A heterozygous single-nucleotide G>A variant is verified by visualization in GenomeView (genomeview.org) of short reads from a next-generation sequencing machine aligned to the reference genome. Forward reads are shown in blue, reverse reads are shown in green, and sequence variants are highlighted in yellow. Other visible sequence variants are probable sequencing errors.
Library of Congress Cataloging-in-Publication Data
Next-generation DNA sequencing informatics / edited by Stuart M. Brown.
pages cm
Includes bibliographical references and index.
ISBN 978-1-936113-87-3 (hardcover : alk. paper)
1. Nucleotide sequence. 2. Bioinformatics. I. Brown, Stuart M., 1962-
QP625.N89N485 2013
572.8'633--dc23
2012034431
All World Wide Web addresses are accurate to the best of our knowledge at the time of printing.
Authorization to photocopy items for internal or personal use, or the internal or personal use of specific clients, is granted by Cold Spring Harbor Laboratory Press, provided that the appropriate fee is paid directly to the Copyright Clearance Center (CCC). Write or call CCC at 222 Rosewood Drive, Danvers, MA 01923 (978-750-8400) for information about fees and regulations. Prior to photocopying items for educational classroom use, contact CCC at the above address. Additional information on CCC can be obtained at CCC Online at www.copyright.com.
For a complete catalog of all Cold Spring Harbor Laboratory Press publications, visit our website at www.cshlpress.org.
Contents
Stuart M. Brown
Stuart M. Brown
Phillip Ross Smith, Kranti Konganti, and Stuart M. Brown
Efstratios Efstathiadis
D. Frank Hsu
Silvia Argimón and Stuart M. Brown
Steven Shen
Jinhua Wang, Zuojian Tang, and Stuart M. Brown
Zuojian Tang, Christina Schweikert, D. Frank Hsu, and Stuart M. Brown
Stuart M. Brown, Jeremy Goecks, and James Taylor
Alexander Alekseyenko and Stuart M. Brown
Efstratios Efstathiadis and Eric R. Peskin
Next-generation DNA sequencing (NGS) technology has been a huge stimulus for new and exciting ways to create and test hypotheses in biology, as well as to revisit old ones with a novel and vastly enhanced perspective. It would be no exaggeration to state that many of the current advances in basic and translational biomedical science are being driven by this technology.
NGS is enabled by sophisticated bioinformatics tools created or adapted specifically for it. Not only has new software been developed for a wide range of novel applications and types of data analysis, but new algorithms have also been developed for old problems, such as sequence alignment and de novo assembly, to cope with the huge volume of data generated by the new sequencing machines.
The cycle of software development has accelerated as vendors upgrade their machines and different groups compete to publish new methods and to meet investigator demands. As a result of this frenetic pace of development, new software tools for NGS data analysis are often released with bare-bones command-line user interfaces and minimal documentation. To complicate matters further, many different software packages exist for each of the major NGS applications, with few benchmarking studies available to guide users in choosing the best solutions. In short, there is an urgent need for a scientifically rigorous, cutting-edge, and practical treatise to guide researchers through all major aspects of the informatics needed to operate NGS successfully and take full advantage of it.
The authors of the present work have been fortunate that their home institution, NYU Langone Medical Center, invested early and heavily in building both assay and informatics capacity and manpower in NGS. Specifically, in 2008 NYU Langone Medical Center built its Genome Technology Center to give research and translational scientists access to the latest DNA sequencing technology, expanding upon previous technologies such as microarrays and real-time quantitative polymerase chain reaction (qPCR). In parallel, the Informatics Center at NYU Langone Medical Center has developed the Sequencing Informatics Group to provide research design, upstream data processing, data management, and data analysis consulting for all users of the sequencers within NYU Langone Medical Center and beyond.
As our group has grown in experience, we have evaluated many different software packages and built best-practice workflows for many different types of NGS projects, including de novo sequencing (and genome annotation), amplicon sequencing for rare variant detection and for metagenomics, ChIP-seq, RNA-seq, and detection of somatic variants in cancer (including single-base substitutions, insertions, deletions, and translocations).