Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry-leading tools to help you plan your personal development and advance your career. For more information, please visit our website.
Packt.com
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.packt.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at customercare@packtpub.com for more details.
At www.packt.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.
Contributors
About the authors
Vitor Bianchi Lanzetta (@vitorlanzetta) has a master's degree in Applied Economics from the University of São Paulo (USP) and works as a data scientist at a tech start-up named RedFox Digital Solutions. He has also authored a book called R Data Visualization Recipes. The things he enjoys the most are statistics, economics, and sports of all kinds (electronic ones included). His blog, made in partnership with Ricardo Anjoleto Farias (@R_A_Farias), can be found at ArcadeData dot org; they kindly call it R-Cade Data.
I'd like to thank God and my family, especially my caring parents, Naide and Carmo, and my wonderful sister, Gabriela. I love you all beyond measure.
Nataraj Dasgupta is the vice president of advanced analytics at RxDataScience Inc. Nataraj has been in the IT industry for more than 19 years, and has worked in the technical and analytics divisions of Philip Morris, IBM, UBS Investment Bank, and Purdue Pharma. At Purdue Pharma, Nataraj led the data science division, where he developed the company's award-winning big data and machine learning platform. Prior to Purdue, at UBS, he held the role of Associate Director, working with high-frequency and algorithmic trading technologies in the foreign exchange trading division of the bank.
I'd like to thank my wife, Sara, for her caring support and understanding as I worked on the book on weekends and evenings, and my parents, parents-in-law, sister, and grandmother for all their support, guidance, tutelage, and encouragement over the years. I'd also like to thank Packt, especially the editors, Tushar Gupta and Karan Thakkar, and everyone else on the team, whose persistence and attention to detail have been exemplary.
Ricardo Anjoleto Farias is an economist who graduated from the Universidade Estadual de Maringá in 2014. In addition to being a sports enthusiast (electronic or otherwise) and enjoying a good barbecue, he also likes math, statistics, and correlated studies. His first contact with R was when he embarked on his master's degree, and since then, he has tried to improve his skills with this powerful tool.
I am grateful to my family, mainly my parents, for their support during the difficult moments. I would also like to thank my friend and the book's co-author, Vitor Bianchi Lanzetta, who has taught me a lot, both academically and personally.
About the reviewer
Doug Ortiz is the founder of Illustris, LLC and is an experienced enterprise cloud, big data, data analytics, and solutions architect who has architected, designed, developed, reengineered, and integrated enterprise solutions. His other areas of expertise include Amazon Web Services, Azure, Google Cloud, Business Intelligence, Hadoop, Spark, NoSQL databases, and SharePoint, to name but a few.
Huge thanks to my wonderful wife, Milla, to Maria, and Nikolay, and to my children, for all their support.
What this book covers
Getting Started with Data Science and R provides an introduction to the field of data science, its applicability in different industry domains, an overview of the machine learning process, and instructions for installing RStudio in order to get started with R development. It also introduces the reader to programming in R, starting off at an intermediate level to facilitate an analysis of the Human Development Index (HDI), published by the United Nations Development Programme. The HDI summarizes a country's level of development, including general public health, education, and various other societal factors.
Descriptive and Inferential Statistics introduces fundamental statistical analysis using R, including techniques to perform random sampling, hypothesis testing, and non-parametric tests. This chapter contains extensive examples of R commands for performing common analyses, such as t-tests and z-tests, and makes use of some well-known statistical packages, such as Hmisc.
Data Wrangling with R provides an introduction to the packages available in R to slice and manipulate data. It introduces packages from the tidyverse, such as dplyr, and, more generally, the apply family of functions in R. The chapter is example-heavy: several examples guide the reader through applying the functions in the respective packages.
KDD, Data Mining, and Text Mining includes extensive discussions on the art of extracting information from unstructured data sources, such as websites and Twitter. KDD (knowledge discovery in databases) is a popular term in the data science community, and this chapter covers the topic through step-by-step examples that give a holistic overview of the subject matter. Sections on web scraping, data transformation, and data visualization are included. Examples of how to use packages such as rvest and httr to perform these operations are also discussed at length.