Practical Statistics for Data Scientists
by Peter Bruce , Andrew Bruce , and Peter Gedeck
Copyright 2020 Peter Bruce, Andrew Bruce, and Peter Gedeck. All rights reserved.
Printed in the United States of America.
Published by OReilly Media, Inc. , 1005 Gravenstein Highway North, Sebastopol, CA 95472.
OReilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com .
- Editor: Nicole Tache
- Production Editor: Kristen Brown
- Copyeditor: Piper Editorial
- Proofreader: Arthur Johnson
- Indexer: Ellen Troutman-Zaig
- Interior Designer: David Futato
- Cover Designer: Karen Montgomery
- Illustrator: Rebecca Demarest
- May 2017: First Edition
- May 2020: Second Edition
Revision History for the Second Edition
- 2020-04-10: First Release
See http://oreilly.com/catalog/errata.csp?isbn=9781492072942 for release details.
The OReilly logo is a registered trademark of OReilly Media, Inc. Practical Statistics for Data Scientists, the cover image, and related trade dress are trademarks of OReilly Media, Inc.
The views expressed in this work are those of the authors, and do not represent the publishers views. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
978-1-492-07294-2
[LSI]
Dedication
Peter Bruce and Andrew Bruce would like to dedicate this book to the memories of our parents, Victor G. Bruce and Nancy C. Bruce, who cultivated a passion for math and science ; and to our early mentors John W. Tukey and Julian Simon and our lifelong friend Geoff Watson, who helped inspire us to pursue a career in statistics.
Peter Gedeck would like to dedicate this book to Tim Clark and Christian Kramer, with deep thanks for their scientific collaboration and friendship.
Preface
This book is aimed at the data scientist with some familiarity with the R and/or Python programming languages, and with some prior (perhaps spotty or ephemeral) exposure to statistics. Two of the authors came to the world of data science from the world of statistics, and have some appreciation of the contribution that statistics can make to the art of data science. At the same time, we are well aware of the limitations of traditional statistics instruction: statistics as a discipline is a century and a half old, and most statistics textbooks and courses are laden with the momentum and inertia of an ocean liner. All the methods in this book have some connectionhistorical or methodologicalto the discipline of statistics. Methods that evolved mainly out of computer science, such as neural nets, are not included.
Two goals underlie this book:
To lay out, in digestible, navigable, and easily referenced form, key concepts from statistics that are relevant to data science.
To explain which concepts are important and useful from a data science perspective, which are less so, and why.
Conventions Used in This Book
The following typographical conventions are used in this book:
ItalicIndicates new terms, URLs, email addresses, filenames, and file extensions.
Constant width
Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.
Constant width bold
Shows commands or other text that should be typed literally by the user.
Key Terms
Data science is a fusion of multiple disciplines, including statistics, computer science, information technology, and domain-specific fields.As a result, several different terms could be used to reference a given concept.Key terms and their synonyms will be highlighted throughout the book in a sidebar such as this.
Tip
This element signifies a tip or suggestion.
Note
This element signifies a general note.
Warning
This element indicates a warning or caution.
Using Code Examples
In all cases, this book gives code examples first in R and then in Python. In order to avoid unnecessary repetition, we generally show only output and plots created by the R code. We also skip the code required to load the required packages and data sets. You can find the complete code as well as the data sets for download at https://github.com/gedeck/practical-statistics-for-data-scientists.
This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless youre reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing examples from OReilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your products documentation does require permission.
We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: Practical Statistics for Data Scientists by Peter Bruce, Andrew Bruce, and Peter Gedeck (OReilly). Copyright 2020 Peter Bruce, Andrew Bruce, and Peter Gedeck, 978-1-492-07294-2.
If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at .
OReilly Online Learning
Note
For more than 40 years, OReilly Media has provided technology and business training, knowledge, and insight to help companies succeed.
Our unique network of experts and innovators share their knowledge and expertise through books, articles, and our online learning platform. OReillys online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from OReilly and 200+ other publishers. For more information, visit http://oreilly.com.
How to Contact Us
Please address comments and questions concerning this book to the publisher:
- OReilly Media, Inc.
- 1005 Gravenstein Highway North
- Sebastopol, CA 95472
- 800-998-9938 (in the United States or Canada)
- 707-829-0515 (international or local)
- 707-829-0104 (fax)
We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at