Drop #129 (2022-10-31): o11y, o11y oxen free¹
o11y Primer/OpenTelemetry; o11y MELTdown; USE o11y in 60-Seconds
On page 121 of "The Lore and Language of Schoolchildren" (free to borrow), the chapter on the "Code of Oral Legislation" begins with this declaration:
The schoolchild, in his primitive community, conducts his business with his fellows by ritual declaration. His affidavits, promissory notes, claims, deeds of conveyance, receipts, and notices of resignation, are verbal, and are sealed by the utterance of ancient words which are recognized and considered binding by the whole community.
After being thrust, recently, into the fray of k8s ("kubernetes"), and, before that, being around for the successive pre-pending of "Dev", "Sec", and "Data" onto "Ops" to yield DevSecDataOps, it's odd how we who toil or dabble in the "engineering" space (apologies to my Canadian readers, who take that title more seriously than we in The Colonies do) have similar ritual declarations. Whether those declarations are fad-induced buzzwords tossed about at conferences or directives codified in (ugh) YAML files, we exchange them in (sometimes/often futile) attempts to bring order to chaos.
o11y Primer/OpenTelemetry

Part of this order-bringing involves observability (o11y for short, b/c ofc it is), which is the focus of today's edition. In general, observability is the extent to which you can understand the internal state or condition of a complex system based only on knowledge of its external outputs. The more observable a system, the more quickly and accurately you can navigate from an identified performance problem to its root cause, without additional testing or coding.
The goal of this section is to shunt you to a very accessible and quintessential resource on o11y. To that end, below is the introductory excerpt from OpenTelemetry's pitch. If you're new to o11y, make sure to hit up the primer first.
In order to make a system observable, it must be instrumented. That is, the code must emit traces, metrics, and logs. The instrumented data must then be sent to an Observability back-end. There are a number of Observability back-ends out there, ranging from self-hosted open-source tools (e.g. Jaeger and Zipkin), to commercial SaaS offerings.
In the past, the way in which code was instrumented would vary, as each Observability back-end would have its own instrumentation libraries and agents for emitting data to the tools.
This meant that there was no standardized data format for sending data to an Observability back-end. Furthermore, if a company chose to switch Observability back-ends, it meant that they would have to re-instrument their code and configure new agents just to be able to emit telemetry data to the new tool of choice.
With a lack of standardization, the net result is the lack of data portability and the burden on the user to maintain instrumentation libraries.
Recognizing the need for standardization, the cloud community came together, and two open-source projects were born: OpenTracing (a Cloud Native Computing Foundation (CNCF) project) and OpenCensus (a Google Open Source community project).
OpenTracing provided a vendor-neutral API for sending telemetry data over to an Observability back-end; however, it relied on developers to implement their own libraries to meet the specification.
OpenCensus provided a set of language-specific libraries that developers could use to instrument their code and send to any one of their supported back-ends.
Once more, if you're new to o11y, def read ^^ before hitting up the next two sections.
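To make the excerpt above a bit more concrete, here's a toy sketch of what "instrumenting" code to emit traces actually means: time a unit of work, tag it with trace/span IDs, and emit a record a back-end could ingest. This is deliberately not the real OpenTelemetry API; the `span` helper and its fields are invented for illustration (real SDKs batch and export rather than print).

```python
import json
import time
import uuid
from contextlib import contextmanager

@contextmanager
def span(name, trace_id=None):
    # Toy stand-in for an instrumentation library: open a span,
    # time the work inside it, then emit the finished record.
    record = {
        "name": name,
        "trace_id": trace_id or uuid.uuid4().hex,  # shared across one request
        "span_id": uuid.uuid4().hex[:16],          # unique per unit of work
        "start": time.time(),
    }
    try:
        yield record
    finally:
        record["end"] = time.time()
        print(json.dumps(record))  # a real SDK batches/exports instead

# Nested spans share a trace_id; that shared ID is what lets a back-end
# reassemble one request's path across many components.
with span("handle_request") as req:
    with span("db_query", trace_id=req["trace_id"]):
        time.sleep(0.01)
```

The inner span finishes (and is emitted) first, which is exactly the shape real trace exporters deal with: spans arrive out of request order and get stitched back together by trace ID.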
o11y MELTdown

As with most "thought leader" contrivances, a common oversimplification of o11y is to reduce the concept to "make sure you generate Traces, Logs, and Metrics" from the components of your application. These have been known as the three "pillars" of o11y, and many drop-in-and-grifters (read: consultants) bilk enterprises out of much coin to get them bootstrapped into a false sense of o11y by just focusing on those three areas.
Siddharth Sharma does an excellent job of crushing those pillars, and provides context on some additional o11y areas engineers should include in their endeavors to gain more operational visibility on their sprawling infrastructure. It's a bit more than just a think-piece, which you can tell since it drops some (ugh) YAML along the way.
USE o11y in 60-Seconds
Back in my day, there were no numeric-laden terms such as "k8s" or "o11y". We had ALL CAPS acronyms, and WE LIKED THEM (WLT). You may not be in a kubernetes shop, or have to gain visibility into a plethora of systems. Even in tiny contexts, OpenTelemetry (see the first section) can work well, but it may be overkill for many readers. What can you do if you want/need to triage an oddly behaving system or process?
For starters, you can use the "Utilization Saturation and Errors" (USE) Method, a methodology for analyzing the performance of any system and a foundational o11y concept. It directs the construction of a checklist which, for server analysis, can be used to quickly identify resource bottlenecks or errors. It begins by posing questions and then seeks answers, instead of beginning with given metrics (partial answers) and trying to work backwards.
USE can be summed up in one phrase (no numbers required!):
"For every resource, check utilization, saturation, and errors."
Since terms are important, here are the working definitions for those four:
resource: all physical server functional components (CPUs, disks, busses, ...) — note these may be “virtual” vs physical in a modern context;
utilization: the summary statistics that communicate the time that the resource was busy servicing work;
saturation: the degree to which the resource has extra work which it can't service, often queued;
errors: the counts and types of error events.
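As a toy illustration (hypothetical code, not from the USE page itself), here's one checklist entry sketched in stdlib-only Python: a crude CPU snapshot where the 1-minute load average stands in for utilization, and anything queued beyond the CPU count counts as saturation. One caveat worth knowing: on Linux the load average also folds in uninterruptible I/O waits, so treat these numbers strictly as rough proxies.

```python
import os

def cpu_use_snapshot():
    # One USE checklist entry for one resource (the CPU):
    # utilization as a fraction of capacity, saturation as runnable
    # work beyond capacity. Errors would come from logs/counters
    # (e.g. dmesg, MCE counts) and aren't modeled here.
    ncpu = os.cpu_count() or 1
    load1, _load5, _load15 = os.getloadavg()  # POSIX-only
    return {
        "resource": "cpu",
        "utilization": min(load1 / ncpu, 1.0),  # proxy: fraction busy
        "saturation": max(load1 - ncpu, 0.0),   # work queued, no CPU free
    }

print(cpu_use_snapshot())
```

The point isn't this particular snippet; it's the shape of the method: per resource, one utilization number, one saturation number, one error count, and you walk the checklist instead of staring at whatever dashboard happens to be open.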
The USE page has many checklists that you can, er, use, and much 'splainin on the topic.
If you'd prefer more of a walkthrough example, this Netflix blog post from 2015 does a fine job showing how to gain some USE/o11y on a linux command line in less than sixty seconds.
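From memory (so: hedged; exact flags vary by distro, and mpstat/pidstat/iostat/sar live in the sysstat package, which may need installing), that sixty-second pass boils down to a handful of commands. Each one below is guarded so the script degrades gracefully where a tool isn't present:

```shell
#!/bin/sh
# Rough 60-second USE triage pass; tool list recalled from memory,
# flags may differ across distros.
run() {
  command -v "$1" >/dev/null 2>&1 && "$@" || echo "skipped: $1 not installed"
}

run uptime                 # load averages: is demand rising or falling?
run vmstat 1 3             # run queue (CPU saturation), utilization, swapping
run free -m                # memory utilization and remaining headroom
run mpstat -P ALL 1 2      # per-CPU balance: one hot core vs. all busy
run pidstat 1 2            # which processes are consuming CPU
run iostat -xz 1 3         # per-device disk utilization and waits (saturation)
run sar -n DEV 1 2         # network interface throughput vs. line rate
dmesg 2>/dev/null | tail -n 10   # recent kernel messages (the E in USE)
```

Run top-to-bottom, that's a USE sweep over CPU, memory, disk, and network in well under a minute, no agents or back-ends required.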