The data science community is such an interesting, dynamic, and fast-paced place to work. While my journey as a data scientist has so far been only about five years long, it feels as though I've already seen a lifetime of tools, technologies, and trends come and go. One consistent effort has been the focus on making data science easier: lowering barriers to entry and developing better libraries have made data science more accessible than ever. The fact that such a bright, diverse, and dedicated community of software architects and developers works tirelessly to improve data science for everyone has made writing Data Science with Python and Dask an incredibly humbling, and at times intimidating, experience. Nonetheless, it is a great honor to contribute to this vibrant community by showcasing the truly excellent work that the entire team of Dask maintainers and contributors has produced.
I stumbled across Dask in early 2016 when I encountered my first uncomfortably large dataset at work. After fumbling around for days with Hadoop, Spark, Ambari, ZooKeeper, and the menagerie of Apache big data technologies, I, in my exasperation, simply Googled "big data library python." After tabbing through pages of results, I was left with two options: continue banging my head against PySpark or figure out how to use chunking in Pandas. Just about ready to call my search efforts futile, I spotted a StackOverflow question that mentioned a library called Dask. Once I found my way over to where Dask was hosted on GitHub, I started working my way through the documentation. DataFrames for big datasets? An API that mimics Pandas? It can be installed using pip? It seemed too good to be true. But it wasn't. I was incensed: why hadn't I heard of this library before? Why was something this powerful and easy to use flying under the radar at a time when the big data craze was reaching fever pitch?
After having great success using Dask for my work project, I was determined to become an evangelist. I was teaching a Python for Data Science class at the University of Denver at the time, and I immediately began looking for ways to incorporate Dask into the curriculum. I also presented several talks and workshops at my local PyData chapter's meetups in Denver. Finally, when I was approached by the folks at Manning to write a book on Dask, I agreed without hesitation. As you read this book, I hope you also come to see how awesome and useful Dask is to have in your arsenal of data science tools!
acknowledgments
As a new author, one thing I learned very quickly is that there are many, many people involved in producing a book. I absolutely would not have survived without all the wonderful support, feedback, and encouragement I've received over the course of writing the book.
First, I'd like to thank Stephen Soehnlen at Manning for approaching me with the idea to write this book, and Marjan Bace for green-lighting it. They took a chance on me, a first-time author, and for that I am truly appreciative. Next, a huge thanks to my development editor, Dustin Archibald, for patiently guiding me through Manning's writing and revising processes while also pushing me to become a better writer and teacher. Similarly, a big thanks to Mike Shepard, my technical editor, for sanity checking all my code and offering yet another channel of feedback. I'd also like to thank Tammy Coron and Toni Arritola for helping to point me in the right direction early on in the writing process.
Next, thank you to all the reviewers who provided excellent feedback throughout the course of writing this book: Al Krinker, Dan Russell, Francisco Sauceda, George Thomas, Gregory Matuszek, Guilherme Pereira de Freitas, Gustavo Patino, Jeremy Loscheider, Julien Pohie, Kanak Kshetri, Ken W. Alger, Lukasz Tracewski, Martin Czygan, Pauli Sutelainen, Philip Patterson, Raghavan Srinivasan, Rob Koch, Romain Jouin, Ruairi O'Reilly, Steve Atchue, and Suresh Rangarajulu. Special thanks as well to Ivan Martinovic for coordinating the peer review process and organizing all the feedback, and to Karsten Strøbæk for giving my code another pass before handing off to production.
I'd also like to thank Bert Bates, Becky Rinehart, Nichole Beard, Matko Hrvatin, and the entire graphics team at Manning; Chris Kaufmann, Ana Romac, Owen Roberts, and the folks at Manning's marketing department; Nicole Butterfield, Rejhana Markanovic, and Lori Kehrwald. A big thank-you also goes out to Francesco Bianchi, Mike Stephens, Deirdre Hiam, Michelle Melani, Melody Dolab, Tiffany Taylor, and the countless other individuals who worked behind the scenes to make Data Science with Python and Dask a great success!
Finally, I'd like to give a special thanks to my wife, Clementine, for her patient understanding on the many nights and weekends that I holed up in my office to work on the book. I couldn't have done this without your infinite love and support. I also wouldn't have had this opportunity without the inspiration of my dad to pursue a career in technology and the not-so-gentle nudging of my mom to do my English homework. I love you both!
about this book
Who should read this book
Data Science with Python and Dask takes you on a hands-on journey through a typical data science workflow, from data cleaning through deployment, using Dask. The book begins by presenting some foundational knowledge of scalable computing and explains how Dask takes advantage of those concepts to operate on datasets big and small. Building on that foundation, it then turns its focus to preparing, analyzing, visualizing, and modeling various real-world datasets to give you tangible examples of how to use Dask to perform common data science tasks. Finally, the book ends with a step-by-step walkthrough of deploying your very own Dask cluster on AWS to scale out your analysis code.