Mrunal M. Chavan
About the Author
Trent Hauck is a data scientist living and working in the Seattle area. He grew up in Wichita, Kansas, and received his undergraduate and graduate degrees from the University of Kansas.
He is the author of the book Instant Data Intensive Apps with pandas How-to, Packt Publishing, a book that can get you up to speed quickly with pandas and other associated technologies.
First, a big thanks to the Python software community, the people behind scikit-learn in particular; the skill with which the code is developed is responsible for a lot of good work that gets done.
Personally, I'd like to thank my family, friends, and coworkers.
About the Reviewers
Anoop Thomas Mathew is a software architect with years of experience in working with Python and software development in general. With the title of Chief Technology Officer at Profoundis Inc., he leads the engineering efforts at Profoundis and is now focusing on https://vibeapp.co. He has spoken at conferences such as The Fifth Elephant 2012, PyCon 2012, FOSSMeet 2013, PyCon 2013, and FOSSMeet 2014, to name a few. He blogs at http://infiniteloop.in.
He is the author of the book Code Explorer's Guide to the Open Source Jungle, available online at https://leanpub.com/opensourcebook.
To my beloved.
Xingzhong is a PhD candidate in Electrical Engineering at Stevens Institute of Technology, Hoboken, New Jersey, where he works as a research assistant, designing and implementing machine-learning models in computer vision and signal processing applications.
Although Python is his primary programming language, occasionally, for fun and out of curiosity, his work might be written in golang, Scala, JavaScript, and so on. As a self-confessed technology geek, he is passionate about exploring new software and hardware.
Preface
This book is organized in the same way that many data science and analytics projects play out. First, we need to acquire data; the data is often messy, incomplete, or incorrect in some way. Therefore, we spend the first chapter talking about strategies for dealing with bad data and with other problems that arise from data. For example, what happens if we have too many features? How do we handle that? The first chapter is your guide. The meat of the book walks you through various algorithms and how to incorporate them into your workflow. Finally, we end with the postmodel workflow. This final chapter is largely independent of the others, and its techniques can be applied to any of the algorithms you'll learn in the chapters before it.
What this book covers
Chapter 1, Premodel Workflow, walks you through the steps of preparing a dataset for modeling and shows how scikit-learn can help to ease the burden of preprocessing.
Chapter 2, Working with Linear Models, discusses how many problems can be viewed as linear models once an appropriate transformation is applied, and therefore walks you through what may be the most used class of models.
Chapter 3, Building Models with Distance Metrics, covers a range of techniques that work largely by measuring the similarity between data points. Because similarity and distance are often synonymous, clustering can often be used as long as a distance function can be defined.
Chapter 4, Classifying Data with scikit-learn, focuses on the various methods within scikit-learn that are used to assign a data point to one of N classes.
Chapter 5, Postmodel Workflow, teaches us how we can take a basic model produced from one of the recipes and tune it so that we can achieve better results than we could with the basic model.
What you need for this book
Here are the contents of the requirements.txt file that will get the environment set up. This will allow you to follow along with the code in the book.
I've also included a conda requirements file; this method may be easier for less-experienced Python developers:
dateutil==2.1
ipython==2.2.0
ipython-notebook==2.1.0
jinja2==2.7.3
markupsafe==0.18
matplotlib==1.3.1
numpy==1.8.1
patsy==0.3.0
pandas==0.14.1
pip==1.5.6
pydot==1.0.28
pyparsing==1.5.6
pytz==2014.4
pyzmq==14.3.1
scikit-learn==0.15.0
scipy==0.14.0
setuptools==3.6
six==1.7.3
ssl_match_hostname==3.4.0.2
tornado==3.2.2
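Once the environment is set up, it can be useful to confirm that the installed packages match the pinned versions listed above. The following is only a minimal sketch, not something taken from the book: the helper names parse_pins and check_environment are hypothetical, the PINNED string holds just a few of the pins for brevity, and the sketch assumes Python 3.8 or later (for importlib.metadata), which is newer than the interpreter these pins were written for.

```python
from importlib import metadata

# A few of the pins from the list above; paste in the full list as needed.
PINNED = """\
numpy==1.8.1
pandas==0.14.1
scikit-learn==0.15.0
scipy==0.14.0
"""


def parse_pins(text):
    """Return {package: version} from 'name==version' lines.

    Blank lines and lines starting with '#' are skipped.
    """
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, version = line.partition("==")
        pins[name.strip()] = version.strip()
    return pins


def check_environment(pins):
    """Map each package to (wanted, installed, matches).

    'installed' is None when the package is not found in this environment.
    """
    report = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        report[name] = (wanted, installed, installed == wanted)
    return report


if __name__ == "__main__":
    for name, (wanted, installed, ok) in check_environment(parse_pins(PINNED)).items():
        print(f"{name}: wanted {wanted}, found {installed}, match={ok}")
```

A mismatch does not necessarily mean that the code in the book will fail, but it is a reasonable first thing to check if a recipe behaves differently from what is described.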