MEAP VERSION 4
Welcome
Thanks for purchasing the MEAP edition of "Causal Inference for Data Science". This book is for data scientists, but also for machine learning practitioners, engineers, and researchers who may feel the need to include causality in their models. It is also for statisticians and econometricians who want to develop their knowledge of causal inference through machine learning and modeling causality using graphs. Readers will need a basic knowledge of probability (basic distributions, conditional probabilities, ...), statistics (confidence intervals, linear models), machine learning (cross validation and some nonlinear models), and some programming experience.
I remember discovering causal inference in 2016 through the works of Judea Pearl, and the feelings I had at that moment: a combination of intense curiosity and, at the same time, not understanding anything at all. As I kept reading, I realized that it solves very fundamental questions around decision making and predictive modeling. After some time, I started to think differently about many problems I had been working on. I ended up thoroughly enjoying everything related to causal inference and decided to try to make a living out of it. Moreover, I felt very comfortable with its intrinsic objective: finding the why.
Learning causal inference has given me the confidence to face many problems for which I wasn't previously prepared. Now I can interpret data and draw conclusions from it with a principled approach, being aware of the weaknesses and strengths of the analysis. I have a language and a way of thinking that lets me enter new domains quickly. As an experienced machine learning practitioner, I find that causal inference helps me know when to use machine learning, what to expect of it, and when it will struggle to perform well.
There are many books about causal inference, but mainly from a statistics and econometrics perspective. As a data scientist, I wanted to write a book that used the language and tools that I use in my everyday work. I think that the adoption of causal inference in data science can have a huge impact, changing the way decisions are made in businesses and institutions. Moreover, I think that the approach developed by Pearl and many others, based on describing reality through graphs and exploiting their structure, is very flexible and fits very well with typical problems in data science.
In this book you will get an introduction to causal inference. You will learn when you need it and when you don't. You will also learn the main techniques to estimate causal effects. There are two aspects that I have paid special attention to. The first is finding intuitive ways to explain the key concepts and formulas. The second is showing examples and applications where causal inference can be used. I hope this book helps you enter the causal inference world, and helps you use it and enjoy it at least as much as I do!
If you have any questions, comments, or suggestions, please share them in Manning's liveBook Discussion forum for my book.
Aleix Ruiz de Villa Robert
1 Introduction to causality
This chapter covers
- Why and when we need causal inference
- How causal inference works
- Understanding the difference between observational data and experimental data
- Reviewing relevant statistical concepts
In most of the machine learning applications you find in commercial enterprises (and outside research), your objective is to make predictions. So, you create a predictive model that, with some accuracy, will make a guess about the future. For instance, a hospital may be interested in predicting which patients are going to be severely ill, so that it can prioritize their treatment. In most predictive models, the mere prediction will do; you don't need to know why it is the way it is.
Causal inference works the other way around. You want to understand why, and moreover you wonder what you could do to get a different outcome. A hospital, for instance, may be interested in the factors that affect some illness. Knowing these factors will help it create public healthcare policies or drugs to prevent people from getting ill. The hospital wants to change how things currently are, in order to reduce the number of people ending up in the hospital.
Why should anyone who analyzes data be interested in causality? Most of the analysis we, as data scientists or data analysts, are interested in relates in one way or another to questions of a causal nature. Intuitively, we say that X causes Y when, if you change X, Y changes. So, for instance, if you want to understand your customer retention, you may be interested in knowing what you could do so that your customers use your services longer. What could be done differently in order to improve your customers' experience? This is in essence a causal question: you want to understand what is causing your current customer retention stats, so that you can then find ways to improve them. In the same way, we can think of causal questions in creating marketing campaigns, setting prices, developing novel app features, making organizational changes, implementing new policies, developing new drugs, and on and on. Causality is about knowing the impact of your decisions and which factors affect your outcome of interest.
Ask Yourself
Which types of questions are you interested in when you analyze data? Which of those are related in some way to causality? Hint: remember that many causal questions can be framed as measuring the impact of some decision or finding which factors (especially actionable ones) affect your variables of interest.
The problem is that knowing the cause of something is not as easy as it may seem. Let me explain.
Imagine you want to understand the causes of some illness, and when you analyze the data, you realize that people in the country tend to be sicker than people living in cities. Does this mean that living in the country is a cause of sickness? If that were the case, it would mean that if you moved from the country to a city, you would have less of a chance of falling ill. Is that really true? Living in the city, per se, may not be healthier than living in the country, since you are exposed to higher levels of pollution, food is not as fresh or healthy, and life is more stressful. But it's possible that, in general, people in cities have higher socio-economic status and can pay for better healthcare, or can afford gym memberships and do more exercise to prevent sickness. So, the fact that cities appear to be healthier could be due to socio-economic reasons and not to the location itself. If this second hypothesis were the case, then moving from the country to a city would not, on average, improve your health. It would instead increase your chances of being ill: you still wouldn't be able to afford good healthcare, and you'd be facing new health threats from the urban environment.
The city-country example shows a problem we will face often in causal inference. Living in the city and having a lower chance of falling ill frequently happen together. However, we have also seen that where you live may not be the only cause of your health. That's why the phrase "correlation is not causation" is so popular: the fact that two things happen at the same time does not mean that one causes the other. There may be other factors, such as socio-economic status in our example, that are more relevant for explaining why.
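To make the city-country reasoning concrete, here is a minimal simulation sketch in Python. Everything in it is an assumption made up for illustration, not real data: the function names (`simulate`, `illness_rate`), the binary `wealthy` variable standing in for socio-economic status, and all the probabilities. The simulated world is built so that living in a city slightly raises the chance of illness, while wealth lowers it and also makes people more likely to live in a city.

```python
import random

def simulate(n=100_000, seed=0):
    """Generate (wealthy, city, ill) records under made-up assumptions."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        # Assumption: half of the population has high socio-economic status.
        wealthy = rng.random() < 0.5
        # Assumption: wealthier people are more likely to live in a city.
        city = rng.random() < (0.8 if wealthy else 0.3)
        # Assumption: wealth lowers the chance of illness by 0.20,
        # while living in a city raises it by 0.05.
        p_ill = 0.40 - 0.20 * wealthy + 0.05 * city
        ill = rng.random() < p_ill
        rows.append((wealthy, city, ill))
    return rows

def illness_rate(rows, city, wealthy=None):
    """Share of ill people among those matching the given filters."""
    selected = [ill for w, c, ill in rows
                if c == city and (wealthy is None or w == wealthy)]
    return sum(selected) / len(selected)

rows = simulate()

# Naive comparison: ignores socio-economic status.
print("city, overall:      ", round(illness_rate(rows, city=True), 3))
print("country, overall:   ", round(illness_rate(rows, city=False), 3))

# Comparison among people with the same socio-economic status.
for wealthy in (True, False):
    label = "wealthy" if wealthy else "not wealthy"
    print(f"city, {label}:", round(illness_rate(rows, True, wealthy), 3))
    print(f"country, {label}:", round(illness_rate(rows, False, wealthy), 3))
```

Under these assumptions, the overall illness rate comes out lower in the city, yet within each socio-economic group the city rate is higher. The naive comparison points in the opposite direction from the effect that was built into the simulation, which is the trap described above.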