~
FOREWORD
By Julian McAuley
Many recent breakthroughs in Machine Learning, including Natural Language Processing, Computer Vision, etc., owe as much to having better data as they owe to having better models.
Naturally, modern ML datasets should be large in order for models to capture their complex underlying semantics. However, having enough data is only a small part of the problem: data must also be processed, appropriately represented, properly sampled, freed from issues of balance and bias, etc., not to mention the challenge of extracting meaningful predictive information.
A common experience among ML practitioners is that this type of data munging occupies more time and effort than modeling; it is also incredibly rewarding, as the collection and curation of new datasets often facilitates the most novel and exciting research, and can represent a significant contribution to the research community.
It is wonderful to see a book that covers the underexplored but important skill of collecting and curating data. I expect this will be useful to practitioners who are beginning to collect their own datasets, or wondering how popular datasets are typically collected. Such topics are typically missing from academic treatment of machine learning, where the massive task of data collection and preparation is so often glossed over.
I was thrilled to hear Jigyasa and Rishabh were working on this book: both have experience collecting, curating, and modeling large datasets in academic and industrial settings. I expect readers will find the sections on data extraction and data preparation especially valuable, as these are the skills I have found most useful in my own career.
Julian McAuley
Associate Professor, University of California San Diego
~
FOREWORD
By Laurence Moroney
This is an important book!
Data is the lifeblood of any Machine Learning or AI solution, and there is only so far you can go with publicly available datasets. What excites me about this book is that Jigyasa and Rishabh go beyond these, and teach you how to create, curate, and manage data effectively.
They will take you through a number of scenarios where they got real-world data from varied sources like online retail and news aggregator websites. Rather than a rough copy-and-paste, they demonstrate the pipeline involved in making the dataset eminently usable.
Chapter 3 of this book is especially powerful: there you'll see how, from first principles, to go through the processes of data trimming, anonymization, standardization, transformation, and balancing. Chapter 4 will take you through the important task of feature engineering, where, instead of just throwing raw data at the problem, you can refine and improve it with clipping, scaling, bucketization, and a lot more.
All of this will prepare you for Machine Learning with your own custom data that you have sourced, cleaned, and managed for optimal model creation.
I am really excited by this field, and delighted that a book like this one exists. Pick it up, read, learn, enjoy!
Laurence Moroney
Lead Artificial Intelligence Advocate, Google
~
FOREWORD
By Mengting Wan
Throughout the rapid growth of Machine Learning and Data Science in recent years, data has always been the key foundation for almost any downstream research, analysis, or intelligent product feature development. One may easily notice that numerous books and courses exist nowadays to help people master the skills of consuming data; however, there are very few resources that discuss how to carefully collect, process, and curate high-quality datasets. I worked with Rishabh Misra on several research projects at UC San Diego and learned many practical data collection and processing skills from him. I am therefore excited that Jigyasa and Rishabh are willing to share their knowledge in this domain, and I really appreciate their efforts on this book.
The book introduces critical data collection, extraction, preparation, and processing skills. It also provides several Machine Learning application examples and approaches data problems from an application-oriented perspective. I personally find that this book can be very helpful for researchers and practitioners: it can remove their data availability obstacles, help them proactively but responsibly gather the data they need, and help them understand the strengths as well as the limitations of their datasets. In this regard, I think the book will be ideal as a starting point for data enthusiasts who want to learn the dataset collection process from scratch.
Mengting Wan
Senior Applied Scientist, Microsoft
~
PREFACE
In the contemporary world of Artificial Intelligence and Machine Learning, data is the new oil. Rightly so: giant leaps in this domain can be attributed to access to large-scale data. Despite this fact, most of the focus often lies on the methodological aspect of Machine Learning, which is excellent for a start but can limit our advancement. Once we reach a certain comfort level with modish methodologies, tackling only those problems for which a well-prepared dataset is already available curbs our potential. Hence, for Machine Learning algorithms to work their magic, it is imperative to lay a firm foundation by learning how to curate good-quality datasets.
With the modern bloom of social networks, online retailers, streaming platforms, and knowledge and experience sharing platforms, there is no shortage of any form of data, be it textual, audio, or visual. Therefore, an extensive amount of crude data is available at our fingertips. All we need are the skills to identify valuable information and extract meaningful datasets to fashion more precise models.
Sculpting Data for ML functions as the first act of the play of Machine Learning. It aims at enlightening Machine Learning and Artificial Intelligence enthusiasts, practitioners, and data scientists about one of the fundamental aspects of this realm, Dataset Curation. This stage often does not get its due limelight yet has high relevance in both Academia and Industry. This book's distinctive feature is that it puts forward a step-by-step guide on constructing a good-quality dataset from scratch. The hands-on tutorial ingrained in the book uses Python with tools like BeautifulSoup and Selenium to coach how to ethically gather data from various web sources. The whole flow is pinned on the fact that predictive models necessitate access to relevant, structured, and distinctive data to maneuver effectively.
Overall, the book covers different techniques for dataset building, preprocessing, and engineering impactful features, thus highlighting the significance of data representation for Machine Learning models. Apart from molding data into a worthy format, this book also discusses ways to deal with noisy and unreliable data. Towards the end, it lays out various Machine Learning paradigms and their data needs, to showcase how to identify suitable learning algorithms to solve challenging problems effectively.