This is a Leanpub book. Leanpub empowers authors and publishers with the Lean Publishing process. Lean Publishing is the act of publishing an in-progress ebook using lightweight tools and many iterations to get reader feedback, pivot until you have the right book and build traction once you do.
Introduction To Kubernetes Chaos Engineering
There are very few things as satisfying as destruction, especially when we're frustrated.
How often have you had an issue that you could not solve and just wanted to scream or destroy things? Have you ever had a problem in production that was negatively affecting a lot of users? Were you under a lot of pressure to solve it, but could not crack it as fast as you should have? It must have happened, at least once, that you wanted to take a hammer and destroy servers in your datacenter. If something like that never happened to you, then you were probably never in a position of being under a lot of pressure. In my case, there were countless times when I wanted to destroy things. But I didn't, for quite a few reasons. Destruction rarely solves problems, and it usually leads to negative consequences. I cannot just go and destroy a server and expect that I will not be punished. I cannot hope to be rewarded for such behavior.
What would you say if I told you that we can be rewarded for destruction and that we can do a lot of good things by destroying stuff? If you don't believe me, you will soon. That's what chaos engineering is about. It is about destroying, obstructing, and delaying things in our servers and in our clusters. And we're doing all that, and many other things, for a very positive outcome.
Chaos engineering tries to find the limits of our system. It helps us deduce what the consequences are when bad things happen. We are trying to simulate adverse effects in a controlled way. We do that to improve our systems, making them more resilient and capable of resisting and recovering from harmful and unpredictable events.
That's our mission. We will try to find ways to improve our systems based on the knowledge that we obtain through chaos.
Who Are We?
Before we dive further, let me introduce you to the team, comprised of me, Viktor, and another guy, whom I will introduce later.
Who Is Viktor?
Let's start with me. My name is Viktor Farcic. I currently work at CloudBees. However, things are changing and, by the time you are reading this, I might be working somewhere else. Change is constant, and one can never know what the future brings. At the time of this writing, I am a principal software delivery strategist and developer advocate. It's a very long title and, to be honest, I don't like it. I need to read it every time because I cannot memorize it myself. But that's what I am officially.
What else can I say about myself? I am a member of the Google Developer Experts (GDE) group, and I'm one of the Docker Captains. You can probably guess from those that I am focused on containers, Cloud, Kubernetes, and quite a few other things.
I'm a published author. I wrote quite a few books under the umbrella of The DevOps Toolkit Series. I also wrote DevOps Paradox and Test-Driven Java Development. Besides those, there are a few Udemy courses.
I am very passionate about DevOps, Kubernetes, microservices, continuous integration, continuous delivery, and quite a few other topics. I like coding in Go.
I speak at a lot of conferences and community gatherings, and I do a lot of workshops.
I have a blog, TechnologyConversations.com, where I keep my random thoughts, and I co-host a podcast, DevOps Paradox.
What really matters is that I'm curious about technology, and I often change what I do. A significant portion of my time is spent helping others (individuals, communities, or companies).
Now, let me introduce the second person who was involved in the creation of this book. His name is Darin, and I will let him introduce himself.
Who Is Darin?
My name is Darin Pope. I'm currently working at CloudBees as a professional services consultant. Along with Viktor, I'm the co-host of DevOps Paradox.
Whether it's figuring out the latest changes with Kubernetes or creating more content to share with our listeners and viewers, I'm always learning. Always have, always will.
Principles Of Chaos Engineering
What is chaos engineering? What are the principles behind it?
We can describe chaos engineering as the discipline of experimenting on a system to build confidence in the system's capability to withstand turbulent conditions in production. Now, if that was too confusing and you prefer a more straightforward definition of what chaos engineering is, I can describe it by saying that you should be prepared because bad things will happen. It is inevitable. Bad things will happen, often when we don't expect them. Instead of being reactive and waiting for unexpected outages and delays, we are going to employ chaos engineering and try to simulate adverse effects that might occur in our system. Through those simulations, we're going to learn what the outcomes of those experiments are and how we can improve our system. We will build confidence by making it resilient and by designing it so that it is capable of withstanding unfavorable conditions in production.
Are You Ready For Chaos?
Before we proceed, I must give you a warning. You might not be ready for chaos engineering. You might not benefit from this book. You might not want to do it. Hopefully, you're reading this chapter from the sample (free) version of the book, and you have not bought it yet, since I am about to try to discourage you from reading.
When can you consider yourself ready for chaos engineering? Chaos engineering requires teams to be very mature and advanced. Also, if you're going to practice chaos engineering, be prepared to do it in production. We don't want to see how, for example, a staging cluster behaves when unexpected things happen. Even if we do, that would be only practice. Real chaos experiments are executed in production because we want to see how the real system used by real users reacts when bad things happen.
Furthermore, you, as a company, must be prepared to have a sufficient budget to invest in real reliability work. It will not come for free. There will be a cost for doing chaos engineering. You need to invest time in learning tools. You need to learn processes and practices. And you will need to pay for the damage that you will do to your system.
Now, you might say that you can get the budget and that you can do it in production, but there's more. There is an even bigger obstacle that you might face.
You must have enough observability in your system. You need relatively advanced monitoring and alerting processes and tools so that you can detect the harmful effects of chaos experiments. If your monitoring setup is non-existent or unreliable, then you will be doing damage to production without being able to identify the consequences. Without knowing what went wrong, you won't be able to (easily) restore the system to the desired state.