David Mertz - Cleaning Data for Effective Data Science: Doing the other 80% of the work with Python, R, and command-line tools

Book: Cleaning Data for Effective Data Science: Doing the other 80% of the work with Python, R, and command-line tools
Author: David Mertz
Publisher: Packt Publishing
Genre: Books / Computer
Year: 2021

A comprehensive guide for data scientists to master effective data cleaning tools and techniques

Key Features

Master data cleaning techniques in a language-agnostic manner
Learn from intriguing hands-on examples from numerous domains, such as biology, weather data, demographics, physics, time series, and image processing
Work with detailed, commented, well-tested code samples in Python and R

Book Description

It is something of a truism in data science, data analysis, or machine learning that most of the effort needed to achieve your actual purpose lies in cleaning your data. Written in David's signature friendly and humorous style, this book discusses in detail the essential steps performed in every production data science or data analysis pipeline and prepares you for data visualization and modeling results.

The book dives into the practical application of tools and techniques needed for data ingestion, anomaly detection, value imputation, and feature engineering. It also offers long-form exercises at the end of each chapter to practice the skills acquired.

You will begin by looking at the ingestion of data in formats such as JSON, CSV, SQL RDBMSes, HDF5, NoSQL databases, image files, and binary serialized data structures. Further, the book provides numerous example datasets and data files, which are available for download and independent exploration.

Moving on from formats, you will impute missing values, detect unreliable data and statistical anomalies, and generate synthetic features that are necessary for successful data analysis and visualization goals.

By the end of this book, you will have acquired a firm understanding of the data cleaning process necessary to perform real-world data science and machine learning tasks.

What you will learn

Identify problem data pertaining to individual data points
Detect problem data in the systematic shape of the data
Remediate data integrity and hygiene problems
Prepare data for analytic and machine learning tasks
Impute values into missing or unreliable data
Generate synthetic features that are more amenable to data science, data analysis, or visualization goals.

Who This Book Is For

This book is designed to benefit software developers, data scientists, aspiring data scientists, and students who are interested in data analysis or scientific computing.

Basic familiarity with statistics, general concepts in machine learning, knowledge of a programming language (Python or R), and some exposure to data science are helpful. A glossary, references, and friendly asides should help bring all readers up to speed.

The text will also be helpful to intermediate and advanced data scientists who want to improve their rigor in data hygiene and wish for a refresher on data preparation issues.


Cleaning Data for Effective Data Science

Doing the other 80% of the work with Python, R, and command-line tools

David Mertz

BIRMINGHAM - MUMBAI

Cleaning Data for Effective Data Science

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Producer: Shailesh Jain

Acquisition Editor - Peer Reviews: Saby Dsilva

Project Editor: Rianna Rodrigues

Content Development Editor: Lucy Wan

Copy Editor: Safis Editing

Technical Editor: Aditya Sawant

Proofreader: Safis Editing

Indexer: Priyanka Dhadke

Presentation Designer: Pranit Padwal

First published: March 2021

Production reference: 1260321

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham

B3 2PB, UK.

ISBN 978-1-80107-129-1

www.packt.com

Contributors

About the author

David Mertz, Ph.D. is the founder of KDM Training, a partnership dedicated to educating developers and data scientists in machine learning and scientific computing. He created a data science training program for Anaconda Inc. and was a senior trainer for them. With the advent of deep neural networks, he has turned to training our robot overlords as well.

He previously worked for 8 years with D. E. Shaw Research and was also a Director of the Python Software Foundation for 6 years. David remains co-chair of its Trademarks Committee and Scientific Python Working Group. His columns, Charming Python and XML Matters, were once the most widely read articles in the Python world.

I give great thanks to those people who have helped make this book better.

First and foremost, I am thankful for the careful attention and insightful suggestions of my development editor Lucy Wan, and technical reviewer Miki Tebeka. Other colleagues and friends who have read and provided helpful comments on parts of this book, while it was in progress, include Micah Dubinko, Vladimir Shulyak, Laura Richter, Alessandra Smith, Mary Ann Sushinsky, Tim Churches, and Paris Finley.

The text in front of you is better for their kindnesses and intelligence; all errors and deficits remain mine entirely.

I also thank the thousands of contributors who have created the Free Software I used in the creation of this book, and in so much other work I do. No proprietary software was used by the author at any point in the production of this book. The operating system, text editors, plot creation tools, fonts, programming languages, shells, command-line tools, and all other software used belongs to our human community rather than to any exclusive private entity.

About the reviewer

Miki Tebeka is the CEO of 353solutions, and he has a passion for teaching and mentoring. He teaches many workshops on various technical subjects all over the world and has also mentored many young developers on their way to success. Miki is involved in open source, has several projects of his own, and has contributed to several more, including the Python project and the Go project. He has been writing software for 25 years.

Miki wrote Forging Python, Python Brain Teasers, Go Brain Teasers, and Pandas Brain Teasers, and is a LinkedIn Learning author. He's an organizer of the Go Israel Meetup, GopherCon Israel, and PyData Israel Conference.

Preface

In order for something to become clean, something else must become dirty.

Imbesi's Law of the Conservation of Filth

Doing the Other 80% of the Work

It is something of a truism in data science, data analysis, or machine learning that most of the effort needed to achieve your actual purpose lies in cleaning your data. The subtitle of this work alludes to a commonly assigned percentage. A keynote speaker I listened to at a data science conference a few years ago made a joke (perhaps one already widely repeated by the time he told it) about talking with a colleague of his. The colleague complained of data cleaning taking up half of her time, in response to which the speaker expressed astonishment that it could be so little as 50%.

Without worrying too much about assigning a precise percentage, in my experience working as a technologist and data scientist, I have found that the bulk of what I do is preparing my data for the statistical analyses, machine learning models, or nuanced visualizations that I would like to utilize it for. Although hopeful executives, or technical managers a bit removed from the daily work, tend to have an eternal optimism that the next set of data the organization acquires will be clean and easy to work with, I have yet to find that to be true in my concrete experience.

Certainly, some data is better and some is worse. But all data is dirty, at least within a very small margin of error in the tally. Even datasets that have been published, carefully studied, and that are widely distributed as canonical examples for statistics textbooks or software libraries, generally have a moderate number of data integrity problems. Even after our best pre-processing, a more attainable goal should be to make our data less dirty; making it clean remains unduly utopian in aspiration.

By all means we should distinguish data quality from data utility. These descriptions are roughly orthogonal to each other. Data can be dirty (up to a point) but still be enormously useful. Data can be (relatively) clean but have little purpose, or at least not be fit for purpose. Concerns about the choice of measurements to collect, or about possible selection bias, or other methodological or scientific questions are mostly outside the scope of this book. A fair number of the techniques I present can aid in evaluating the utility of data, but there is often no mechanical method of remedying systemic issues. For example, statistics and other analyses may reveal, or at least strongly suggest, the unreliability of a certain data field. But the techniques in this book cannot generally automatically fix that unreliable data or collect better data.
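To make that last point concrete, here is a minimal sketch in Python. It is not code from the book: the `field_report` helper, the height readings, and the plausibility bounds are all invented for illustration. It shows how a simple summary can suggest that a field is unreliable without being able to repair it.

```python
from statistics import median

# Hypothetical height readings in centimeters; the values are invented
# for illustration and do not come from the book.
heights_cm = [172.0, 168.5, -999.0, 181.2, 1750.0, 165.0, None, 177.3]


def field_report(values, low, high):
    """Summarize how much of a numeric field is missing or implausible.

    `low` and `high` are assumed plausibility bounds chosen by the analyst.
    """
    present = [v for v in values if v is not None]
    out_of_range = [v for v in present if not (low <= v <= high)]
    plausible = [v for v in present if low <= v <= high]
    return {
        "n": len(values),
        "missing": len(values) - len(present),
        "out_of_range": out_of_range,
        "median_plausible": median(plausible) if plausible else None,
    }


report = field_report(heights_cm, low=50.0, high=250.0)
print(report)
```

The report flags one missing value and two implausible ones, but it cannot say whether -999.0 is a sentinel for "not measured" or whether 1750.0 was recorded in millimeters. Deciding that, or collecting better data, remains a judgment outside what the code can do.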

The code shown throughout this book is freely available. However, the purpose of this book is not learning to use the particular tools used for illustration, but to understand the underlying purpose of data quality. The concepts presented should be applicable in any programming language used for data processing and machine learning. I hope it will be easy to adapt the techniques I show to your own favorite collection of tools and programming languages.
