To my (rather large) family for their continued support: Mom, Dad, Anne, Lisa, Lauren, Violet, and Dalilah; the Buffalos, the Kihns, and the Lambs.
And my earliest mentors for inspiring me to be who I am today: Randy Siverson and Duncan Temple Lang.
Preface
This book is the answer to a question I asked myself two years ago: What book would I want to read first when getting started in bioinformatics? When I began working in this field, I had programming experience in Python and R but little else. I had hunted around for a terrific introductory text on bioinformatics, and while I found some good books, most were not targeted to the daily work I did as a bioinformatician. A few of the texts I found approached bioinformatics from a theoretical and algorithmic perspective, covering topics like Smith-Waterman alignment, phylogeny reconstruction, motif finding, and the like. Although they were fascinating to read (and I do recommend that you explore this material), I had no need to implement bioinformatics algorithms from scratch in my daily bioinformatics work: numerous terrific, highly optimized, well-tested implementations of these algorithms already existed. Other bioinformatics texts took a more practical approach, guiding readers unfamiliar with computing through each step of tasks like running an aligner or downloading sequences from a database. While these were more applicable to my work, much of those books' material was outdated.
As you might guess, I couldn't find that "best first" bioinformatics book. Bioinformatics Data Skills is my version of the book I was seeking. This book is targeted toward readers who are unsure how to bridge the giant gap between knowing a scripting language and practicing bioinformatics to answer scientific questions in a robust and reproducible way. To bridge this gap, one must learn data skills: an approach that uses a core set of tools to manipulate and explore any data you'll encounter during a bioinformatics project.
Data skills are the best way to learn bioinformatics because these skills utilize time-tested, open source tools that continue to be the best way to manipulate and explore changing data. This approach has stood the test of time: the advent of high-throughput sequencing rapidly changed the field of bioinformatics, yet skilled bioinformaticians adapted to this new data using these same tools and skills. Next-generation data was, after all, just data (different data, and more of it), and master bioinformaticians had the essential skills to solve problems by applying their tools to this new data. Bioinformatics Data Skills is written to provide you with training in these core tools and help you develop these same skills.
The Approach of This Book
Many biologists starting out in bioinformatics tend to equate learning bioinformatics with learning how to run bioinformatics software. This is an unfortunate and misinformed idea of what bioinformaticians actually do. It is analogous to thinking that learning molecular biology is just learning pipetting. Other than a few simple examples used to generate data, this book doesn't cover running bioinformatics software like aligners, assemblers, or variant callers. Running bioinformatics software isn't all that difficult, doesn't take much skill, and doesn't embody any of the significant challenges of bioinformatics. I don't teach how to run these types of bioinformatics applications in Bioinformatics Data Skills for the following reasons:
It's easy enough to figure out on your own
The material would go rapidly out of date as new versions of software or entirely new programs are used in bioinformatics
The original manuals for this software will always be the best, most up-to-date resource on how to run a program
Instead, the approach of this book is to focus on the skills bioinformaticians use to explore and extract meaning from complex, large bioinformatics datasets. Exploring and extracting information from these datasets is the fun part of bioinformatics research. The goal of Bioinformatics Data Skills is to teach you the computational tools and data skills you need to explore these large datasets as you please. These data skills give you freedom; you'll be able to look at any bioinformatics data (in any format, and files of any size) and begin exploring data to extract biological meaning.
Throughout Bioinformatics Data Skills, I emphasize working in a robust and reproducible manner. I believe these two qualities, reproducibility and robustness, are too often overlooked in modern computational work. By robust, I mean that your work is resilient against silent errors, confounders, software bugs, and messy or noisy data. In contrast, a fragile approach is one that does not decrease the odds of some type of error adversely affecting your results. By reproducible, I mean that your work can be repeated by other researchers and that they can arrive at the same results. For this to be the case, your work must be well documented, and your methods, code, and data all need to be available so that other researchers have the materials to reproduce everything. Reproducibility also relies on your work being robust: if a workflow run on a different machine yields a different outcome, it is neither robust nor fully reproducible. I introduce these concepts in more depth later, and they are themes that reappear throughout the book.
Why This Book Focuses on Sequencing Data
Bioinformatics is a broad discipline that spans subfields like proteomics, metabolomics, structural bioinformatics, comparative genomics, machine learning, and image processing. Bioinformatics Data Skills focuses primarily on handling sequencing data for a few reasons.
First, sequencing data is abundant. Currently, no other "omics" data is as abundant as high-throughput sequencing data. Sequencing data has broad applications across biology: variant detection and genotyping, transcriptome sequencing for gene expression studies, protein-DNA interaction assays like ChIP-seq, and bisulfite sequencing for methylation studies, to name just a few examples. The ways in which sequencing data can be used to answer biological questions will only continue to increase.