Chapter 1. Training Data Introduction
A Note for Early Release Readers
With Early Release ebooks, you get books in their earliest form (the author's raw and unedited content as they write) so you can take advantage of these technologies long before the official release of these titles.
This will be the first chapter of the final book. Please note that the GitHub repo will be made active later on.
If you have comments about how we might improve the content and/or examples in this book, or if you notice missing material within this chapter, please reach out to the editor at jleonard@oreilly.com.
Data is all around us: videos, images, text, 3D, geospatial, documents, and more. Yet in its raw form, this data is of little use to machine learning (ML). How do we make use of this data? How do we record our intelligence so it can be reproduced through ML? The answer is the Art of Training Data: the discipline of making raw data useful.
In this book you will learn:
All-new Training Data specific concepts
The Day-to-Day Practice of Training Data
How to improve Training Data efficiency
Real world case studies
How to transform your team to be more AI/ML centric
Before we can cover some of these concepts, we first have to understand the foundations, which this chapter will unpack.
Training Data is about molding, reforming, shaping, and digesting raw data into new forms, creating new meaning out of raw data to solve problems. These acts of creation and destruction sit at the intersection of subject matter expertise, business needs, and technical requirements. It's a diverse set of activities that cut across multiple domains.
At the heart of these activities is annotation. Annotation produces structured data that is ready to be consumed by a machine learning model. Without annotation, raw data is considered to be unstructured and not usable. That's why training data is required for modern machine learning use cases, including computer vision, natural language processing, and speech recognition.
To cement this idea with an example, let's consider annotation in detail. When we annotate data, we are capturing human knowledge. Typically, the process looks like this: a piece of media, such as an image, text, video, 3D, or audio, is presented along with a set of predefined options (labels). A human reviews the media and determines the most appropriate answers, for example, declaring a region of an image to be "good" or "bad." This label provides the context needed to apply machine learning concepts (Figure 1-1).
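To make the annotation step concrete, here is a minimal Python sketch of one captured annotation. It is an illustration under stated assumptions, not a format this book or any particular tool prescribes: the `Annotation` class, the `annotate` function, the field names, the file name, and the pixel-coordinate region are all hypothetical. Only the "good"/"bad" label set comes from the example above.

```python
from dataclasses import dataclass

# The predefined options (labels) shown to the annotator.
# "good" and "bad" come from the example in the text.
LABELS = ("good", "bad")


@dataclass(frozen=True)
class Annotation:
    """One unit of captured human knowledge about a region of an image.

    All field names here are hypothetical, chosen for illustration.
    """
    media_id: str    # which piece of raw media was shown
    region: tuple    # (x_min, y_min, x_max, y_max) in pixels
    label: str       # the option the human selected
    annotator: str   # who made the judgment


def annotate(media_id, region, label, annotator):
    """Check a human's choice against the predefined options and record it."""
    if label not in LABELS:
        raise ValueError(f"{label!r} is not one of the predefined labels {LABELS}")
    x_min, y_min, x_max, y_max = region
    if x_min >= x_max or y_min >= y_max:
        raise ValueError("region must have positive width and height")
    return Annotation(media_id, tuple(region), label, annotator)


# A reviewer declares one region of an image to be "bad".
record = annotate("image_001.jpg", (40, 60, 200, 180), "bad", "reviewer_a")
print(record)
```

The point of the sketch is the shape of the result: the raw image is untouched, and the human judgment lives in a small structured record that sits alongside it. Rejecting labels outside the predefined set is one simple way to keep that record consistent enough for a model to consume.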
But how did we get there? How did we get to the point where the right media element, with the right predefined set of options, is shown to the right person at the right time? Many concepts lead up to and follow the moment when that annotation, or knowledge capture, actually happens. Collectively, all of these concepts are the art of training data.
Figure 1-1. The Training Data Process
In this chapter, we'll introduce what training data is, explain why it matters, and dive into many key concepts that will form the base for the rest of the book.
Training Data Intents
What can you do with Training Data? What is it most concerned with? What are people aiming to achieve with Training Data? The purpose of Training Data varies across different use cases, problems, and scenarios. Let's explore some of the most common questions.
What Can You Do With Training Data?
Training Data is the foundation of AI/ML systems, the underpinning that makes these systems work. With Training Data, you can build and maintain modern ML systems, such as ones that create next-generation automations, improve existing products, and even create all-new products.
In order to be useful, the data needs to be presented to ML programs in a structured way. That's where Training Data comes in: adding and maintaining structure to make the raw data useful. If you have great Training Data, you are on the path towards a great overall solution.
In practice, common use cases center around:
Improving an existing product (e.g., performance), even if ML is not currently a part of it
Production of a new product, including systems that run in a limited or one-off fashion
Research and Development
Training Data transcends all parts of ML programs. It comes up before you can run an ML program, during running in terms of output and results, and even later in analysis and maintenance. Further, Training Data concerns tend to be long-lived. For example, after getting a model up and running, maintaining the Training Data is an important part of maintaining the model.