DATA SCIENCE TOOLS
LICENSE, DISCLAIMER OF LIABILITY, AND LIMITED WARRANTY
By purchasing or using this book (the Work), you agree that this license grants permission to use the contents contained herein, but does not give you the right of ownership to any of the textual content in the book or ownership to any of the information or products contained in it. This license does not permit uploading of the Work onto the Internet or on a network (of any kind) without the written consent of the Publisher. Duplication or dissemination of any text, code, simulations, images, etc. contained herein is limited to and subject to licensing terms for the respective products, and permission must be obtained from the Publisher or the owner of the content, etc., in order to reproduce or network any portion of the textual material (in any media) that is contained in the Work.
MERCURY LEARNING AND INFORMATION (MLI or the Publisher) and anyone involved in the creation, writing, or production of the companion disc, accompanying algorithms, code, or computer programs (the software), and any accompanying Web site or software of the Work, cannot and do not warrant the performance or results that might be obtained by using the contents of the Work. The author, developers, and the Publisher have used their best efforts to ensure the accuracy and functionality of the textual material and/or programs contained in this package; we, however, make no warranty of any kind, express or implied, regarding the performance of these contents or programs. The Work is sold as is without warranty (except for defective materials used in manufacturing the book or due to faulty workmanship).
The author, developers, and the publisher of any accompanying content, and anyone involved in the composition, production, and manufacturing of this work will not be liable for damages of any kind arising out of the use of (or the inability to use) the algorithms, source code, computer programs, or textual material contained in this publication. This includes, but is not limited to, loss of revenue or profit, or other incidental, physical, or consequential damages arising out of the use of this Work.
The sole remedy in the event of a claim of any kind is expressly limited to replacement of the book, and only at the discretion of the Publisher. The use of implied warranty and certain exclusions vary from state to state, and might not apply to the purchaser of this product.
DATA SCIENCE TOOLS: R, Excel, KNIME, & OpenOffice
CHRISTOPHER GRECO
MERCURY LEARNING AND INFORMATION
Dulles, Virginia
Boston, Massachusetts
New Delhi
Copyright 2020 by MERCURY LEARNING AND INFORMATION LLC. All rights reserved.
This publication, portions of it, or any accompanying software may not be reproduced in any way, stored in a retrieval system of any type, or transmitted by any means, media, electronic display or mechanical display, including, but not limited to, photocopy, recording, Internet postings, or scanning, without prior permission in writing from the publisher.
Publisher: David Pallai
MERCURY LEARNING AND INFORMATION
22841 Quicksilver Drive
Dulles, VA 20166
www.merclearning.com
(800) 232-0223
C. Greco. Data Science Tools: R, Excel, KNIME, & OpenOffice.
ISBN: 978-1-68392-583-5
The publisher recognizes and respects all marks used by companies, manufacturers, and developers as a means to distinguish their products. All brand names and product names mentioned in this book are trademarks or service marks of their respective companies. Any omission or misuse (of any kind) of service marks or trademarks, etc. is not an attempt to infringe on the property of others.
Library of Congress Control Number: 2020937123
20 21 22 3 2 1 Printed on acid-free paper in the United States of America
Our titles are available for adoption, license, or bulk purchase by institutions, corporations, etc. For additional information, please contact the Customer Service Dept. at (800) 232-0223 (toll free). Digital versions of our titles are available at: www.academiccourseware.com and other electronic vendors.
The sole obligation of MERCURY LEARNING AND INFORMATION to the purchaser is to replace the book and/or disc, based on defective materials or faulty workmanship, but not based on the operation or functionality of the product.
PREFACE
Data science is all the rage. There is a great probability that every book you read, every Web site that you visit, and every advertisement that you receive is a result of data science and, with it, data analytics. What used to be called statistics is now referred to as data analytics or data science. The concepts behind data science are myriad and complex, but the underlying idea is that very basic statistical concepts are vital to understanding data. This book has a two-fold purpose. The first is to review briefly some of the concepts that the reader may have encountered while taking a course (or courses) in statistics; the second is to demonstrate how to use tools to visualize those statistical concepts.
There are several caveats that must accompany this book. The first is that each tool is of a certain version, which will be described below. This means that there will undoubtedly be future versions of these tools that might perform differently on your computer. I want to be very clear that performing differently does not mean that these tools will perform better. Three of these are free and open source tools and, as such, perform as well as their group of developers dictates in the most current versions. In most instances, the tool will be enhanced in the newer version, but there might be a different buttonology associated with newer functions. You will see the word buttonology throughout this book; it refers to the mechanics of the tool itself. I am not here to teach the reader statistics or the different concepts that compose the topics of this book. I am here to show you how the free and open source tools are applied to these concepts.
Now it is time to get to the very heart of the text: the tools of data science. Four tools make up the content of this book. Three are open source tools (FOSS, or Free and Open Source Software), and one is COTS (Commercial Off-the-Shelf) software, but all four will require some instruction in their use. These tools are not always intuitive or self-explanatory, so there will be many screen captures for each mechanical function. I feel that visual familiarization trumps narrative, so you will not see a lot of writing, mostly descriptions and step-by-step mechanics. A few of you may be wondering how to practice these skills; for those readers, a final chapter presents several scenarios that allow you to apply what you have learned about these tools.
This book is organized by statistical concept, not by tool, which means that each chapter explains a statistical concept and then shows how to apply each tool to that concept. With this presentation method, readers can go to the concept they need and use the tool they are most comfortable with. Each section will be labeled accordingly, so it will appear in both the table of contents and the index. This makes it simpler for readers to see their choice of tools and the concepts to which those tools apply.
C. Greco
April 2020
ACKNOWLEDGMENTS