Finding the Code for This Book
This book is about algorithms. But its not a book about the theory of algorithms. Its a short tutorial introduction to algorithms thats targetted at people who like to learn about new ideas by experimenting with them in practice.
Because we want you to experiment, this book is meant to be read while youre near an interpreter for your favorite programming language. In the text, we illustrate every algorithm we describe using Python. As part of the accompanying online materials, there is similar code available that implements all of the same algorithms in Julia, a new programming language that is ideally suited for implementing bandit algorithms. Alongside the Python and Julia code, there are also links to similar implementations in other languages like JavaScript.
Weve chosen to use Python for this book because it seems like a reasonable lingua franca for programmers. If Python isnt your style, you should be able to translate our Python code into your favorite programming language fairly easily.
Assuming you are happy using Python or Julia, you can find the code for the book on GitHub at https://github.com/johnmyleswhite/BanditsBook. If you find mistakes or would like to submit an implementation in another language, please make a pull request.
Dealing with Jargon: A Glossary
While this book isnt meant to introduce you to the theoretical study of the Multiarmed Bandit Problem or to prepare you to develop novel algorithms for solving the problem, we want you to leave this book with enough understanding of existing work to be able to follow the literature on the Multiarmed Bandit Problem. In order to do that, we have to introduce quite a large number of jargon words. These jargon words can be a little odd at a first, but theyre universally used in the academic literature on Multiarmed Bandit Problems. As you read this book, you will want to return to the list of jargon words below to remind yourself what they mean. For now, you can glance through them, but we dont expect you to understand these words yet. Just know that this material is here for you to refer back to if youre ever confused about a term we use.
Reward A quantitative measure of success. In business, the ultimate rewards are profits, but we can often treat simpler metrics like click-through rates for ads or sign-up rates for new users as rewards. What matters is that (A) there is a clear quantitative scale and (B) its better to have more reward than less reward. Arm What options are available to us? What actions can we take? In this book, well refer to the options available to us as arms. The reasons for this naming convention will be easier to understand after weve discuss a little bit of the history of the Multiarmed Bandit Problem. Bandit A bandit is a collection of arms. When you have many options available to you, we call that collection of options a Multiarmed Bandit. A Multiarmed Bandit is a mathematical model you can use to reason about how to make decisions when you have many actions you can take and imperfect information about the rewards you would receive after taking those actions. The algorithms presented in this book are ways of trying to solve the problem of deciding which arms to pull when. We refer to the problem of choosing arms to pull as the Multiarmed Bandit Problem. Play/Trial When you deal with a bandit, its assumed that you get to pull on each arm multiple times. Each chance you have to pull on an arm will be called a play or, more often, a trial. The term "play" helps to invoke the notion of gambling that inspires the term "arm", but the term trial is quite commonly used. Horizon How many trials will you have before the game is finished? The number of trials left is called the horizon. If the horizon is short, you will often use a different strategy than you would use if the horizon were long, because having many chances to play each arm means that you can take greater risks while still having time to recover if anything goes wrong. Exploitation An algorithm for solving the Multiarmed Bandit Problem exploits if it plays the arm with the highest estimated value based on previous plays. Exploration An algorithm for solving the Multiarmed Bandit Problem explores if it plays any arm that does not have the highest estimated value based on previous plays. In other words, exploration occurs whenever exploitation does not. Explore/Exploit Dilemma The observation that any learning system must strike a compromise between its impulse to explore and its impulse to exploit. The dilemma has no exact solution, but the algorithms described in this book provide useful strategies for resolving the conflicting goals of exploration and exploitation. Annealing An algorithm for solving the Multiarmed Bandit Problem anneals if it explores less over time. Temperature A parameter that can be adjusted to increase the amount of exploration in the Softmax algorithm for solving the Multiarmed Bandit Problem. If you decrease the temperature parameter over time, this causes the algorithm to anneal. Streaming Algorithms An algorithm is a streaming algorithm if it can process data one piece at a time. This is the opposite of batch processing algorithms that need access to all of the data in order to do anything with it. Online Learning An algorithm is an online learning algorithm if it can not only process data one piece at a time, but can also provide provisional results of its analysis after each piece of data is seen. Active Learning An algorithm is an active learning algorithm if it can decide which pieces of data it wants to see next in order to learn most effectively. Most traditional machine learning algorithm are not active: they passively accept the data we feed them and do not tell us what data we should collect next. Bernoulli A Bernoulli system outputs a 1
with probability p
and a 0
with probability 1 p
.
Conventions Used in This Book
The following typographical conventions are used in this book:
Italic Indicates new terms, URLs, email addresses, filenames, and file extensions. Constant width
Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords. Constant width bold
Shows commands or other text that should be typed literally by the user. Constant width italic
Shows text that should be replaced with user-supplied values or by values determined by context.
Tip
This icon signifies a tip, suggestion, or general note.
Warning
This icon indicates a warning or caution.
Using Code Examples
This book is here to help you get your job done. In general, if this book includes code examples, you may use the code in this book in your programs and documentation. You do not need to contact us for permission unless youre reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O'Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product's documentation does require permission.
We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: "