Cloud-Native Observability with OpenTelemetry
Learn to gain visibility into systems by combining tracing, metrics, and logging with OpenTelemetry
Alex Boten
BIRMINGHAM - MUMBAI
Cloud-Native Observability with OpenTelemetry
Copyright © 2022 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Group Product Manager: Rahul Nair
Publishing Product Manager: Shrilekha Malpani
Senior Editor: Arun Nadar
Content Development Editor: Sujata Tripathi
Technical Editor: Rajat Sharma
Copy Editor: Safis Editing
Project Coordinator: Shagun Saini
Proofreader: Safis Editing
Indexer: Pratik Shirodkar
Production Designer: Ponraj Dhandapani
Marketing Coordinator: Nimisha Dua
First published: April 2022
Production reference: 1140422
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-80107-770-5
www.packt.com
To my mother, sister, and father. Thank you for teaching me to persevere in the face of adversity, always be curious, and work hard.
Foreword
It has never been a better time to be a software engineer.
As engineers, we are motivated by impact and efficiency, and who can argue that both are not skyrocketing, particularly in comparison with the time spent and energy invested?
These days, you can build out a scalable, elastic, distributed system to serve your code to millions of users per day with a few clicks, without ever having to personally understand much about operations or architecture. You can write lambda functions or serverless code, hit save, and begin serving them to users immediately.
It feels like having superpowers, especially for those of us who remember the laborious times before. Every year brings more powerful APIs and higher-level abstractions: many, many infinitely complex systems that "just work" at the click of a button or the press of a key.
But when it doesn't "just work," it has gotten harder than ever to untangle the reasons and understand why.
Superpowers don't come for free, it turns out. The winds of change may be sweeping us all briskly out toward a sea of ever-expanding options, infinite flexibility, automated resiliency, and even cost-effectiveness, but these glories have come at the price of complexity: skyrocketing, relentlessly compounding complexity and the cognitive overload that comes with it.
Systems no longer fail in predictable ways. Static dashboards are no longer a viable tool for understanding your systems. And though better tools will help, digging ourselves out of this hole is not merely an issue of switching from one tool to another. We need to rethink the way software gets built, shipped, and maintained, to be production-focused from day 1.
For far too long now, we have been building and shipping software in the dark. Software engineers act like all they need to do is write tests and make sure their code passes. While tests are important, all they can really do is validate the logic of your code and increase your confidence that you have not introduced any serious regressions. Operations engineers, meanwhile, rely on monitoring checks, but those are a blunt tool at best. Most bugs will never rise to the criticality of a paging alert, which means that as a system gets more mature and sophisticated, most issues will have to be found and reported by your users.
And this isn't just a problem of bugs, firefighting, or outages. This is about understanding your software in the wild, as your users run your code on your infrastructure, at a given time. Production remains far too much of a black box for too many people, who are then forced to try and reason about it by reading lines of code and using elaborate mental models.
Because we've all been shipping code blindly, all this time, we ship changes we don't fully understand to a production system that is a hairball of changes we've never truly understood. We've been shipping blindly for years and years now, leaving SRE teams and ops teams to poke at the black boxes and try to clean up the mess, all the while still blindfolded. The fact that anything has ever worked is a testament to the creativity and dedication of these teams.
A funny thing starts happening when people begin instrumenting their code for observability and inspecting it in production: regularly, after every deployment, as a habit. You find bugs everywhere, bugs you never knew existed. It's like picking up a rock and watching all the little nasties lurking underneath scuttle away from the light.
With monitoring tools and aggregates, we were always able to see that errors existed, but we had no way of correlating them to an event or figuring out what was different about the erroring requests. Now, all of a sudden, we are able to look at an error spike and say, "Ah! All of these errors are for requests coming from clients running app version 1.63, calling the /export endpoint, querying the primaries for mysql-shard3, shard5, and shard7, with a payload of over 10 KB, and timing out after 15 seconds." Or we can pull up a trace and see that one of the erroring requests was issuing thousands of serial database queries in a row. So many gnarly bugs and opaque behaviors become shallow once you can visualize them. It's the most satisfying experience in the world.
But yes, you do have to instrument your code. (Auto-instrumentation is about as effective as automated code commenting.) So let's talk about that.
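To make that concrete before the book does it properly, here is a minimal sketch of what manual instrumentation can look like with the OpenTelemetry Python SDK. The export_report function and its attribute names are hypothetical, chosen only to echo the error-spike example above; the book covers the real APIs in depth.

    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

    # Configure a tracer provider that prints finished spans to the console.
    provider = TracerProvider()
    provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
    trace.set_tracer_provider(provider)

    tracer = trace.get_tracer("foreword.example")

    # A hypothetical request handler wrapped in a span, with attributes that
    # make questions like "which client version? how big a payload?" answerable.
    def export_report(client_version: str, payload_bytes: int) -> None:
        with tracer.start_as_current_span("export_report") as span:
            span.set_attribute("app.client_version", client_version)
            span.set_attribute("app.payload_bytes", payload_bytes)
            # ... application logic would go here ...

    export_report("1.63", 10_240)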
I can hear you now: "Ugh, instrumentation!" Most people would rather get bitten by a rattlesnake than refactor their logging and instrumentation code. I know this, and so does every vendor under the sun. This is why even legacy logging companies are practically printing money. Once they get your data flowing in, it takes an act of God to move it or turn it off.
This is a big part of the reason we, as an industry, are so behind when it comes to public, reusable standards and tooling for instrumentation and observability, which is why I am so delighted to participate in the push for OpenTelemetry. Yes, it's in the clumsy toddler years of technological advancement. But it will get better. It has gotten better. I was cynical about OTel in the early days, but the community excitement and uptake have exceeded my expectations at every step. As well it should. Because the promise of OpenTelemetry is that you may need to instrument your code once, but only once. And then you can move from vendor to vendor