SRE and Observability: inside a Formula One lap for improvement
Let us start by recalling the conclusion of the last conversation [1]. IT phenomena such as cloud, “…-as-a-service”, and microservices, all of which rely on distributed computing, require reliability engineering because the complexity of systems continues to grow. By the same token, “failures are a given, and everything will eventually fail over time”, as Werner Vogels (VP & CTO, Amazon) put it.
Therefore, Site Reliability Engineering (SRE), a term coined by Google, emphasizes reliability concerns and requirements, in particular fault tolerance: the capacity of a given system to tolerate failures and continue to function. However, a sine qua non condition for fault tolerance is fault detection. For fault detection, the term observability was borrowed from another engineering discipline, namely control engineering [2], where it denotes the “inference” of the internal state of a system, possibly including failures, from the knowledge of its external outputs. As such, observability is a cornerstone of SRE, since it enables, at the very least, the practical implementation of SLIs (Service Level Indicators, that is, the way that “field data” is collected).
Equipped with the two main concepts under analysis, namely SRE and observability, we can now look for improvement inside a Formula One (F1) lap.
Inside an F1 lap
Back in 1978, F1’s Tyrrell team led the way in data acquisition by introducing computer-driven analysis of its cars’ performance data. An American physicist and mathematician rigged up a computer to the team’s cars and extracted sensor-generated data on their speed, suspension movements, and directional forces, plus percentages of throttle opening and braking [3].
In those embryonic days of data acquisition, the results were downloaded onto a tape cassette. This provided Tyrrell’s engineers and drivers with data that gave them a competitive advantage over their rivals as they tuned their cars and their driving, respectively [3].
F1’s McLaren took the concept a stage further in 1991 when it was the first F1 team to harness telemetry [3] — the collection of measurements or other data and their automatic transmission to the receiving equipment. The word telemetry is derived from the Greek roots tele, “remote”, and metron, “measure”.
Moreover, the counterpart of telemetry, telecommand, a command sent to control a remote system, was successfully used by McLaren in 2002 to stop a car engine from smoking due to a sticking valve in the oil system. A telecommand beamed out from the pit garage allowed that car to keep running to victory [3], a major competitive advantage.
Modern F1 cars, for example the Sauber F1 Team cars of 2014, are fitted with 140 sensors and constantly send 10 Mb of data per lap to the pit garage (see Fig. 1 for the gist of telemetry in the Sauber F1 Team cars [4]).
The introduction of telemetry transformed the way that engineers and drivers learn about the performance of their own and others’ cars. For example, drivers and engineers could see where a driver could go faster, how consistent he was, and how much wheelspin he was experiencing [3].
Engineers and drivers from the early years of F1 can only guess at how the usage of telemetry might have improved the lap times. Back in those days, the drivers had only the stopwatch and pit board to inform them of their progress, and the feedback that they received through their hands and the seat to gauge their car’s behavior [3].
Back to IT, telemetry data
The term telemetry is not often used in the IT domain. However, a recent project in the Cloud Native Computing Foundation (CNCF), focused on an observability framework for cloud-native software, was named OpenTelemetry [5]. The project is intended to support the instrumentation, generation, collection, and export of telemetry data (metrics, traces, and logs) in order to understand the behavior of a given system [5].
Given such a framework for telemetry data, the next fundamental question is: what are the counterparts of “throttle pedal, brake pressure, steering wheel position, …” (the metrics of an F1 car) in the IT domain? In IT, such metrics are called “golden signals” by Google, namely latency (or response time), errors, traffic, and saturation.
Indeed, according to a recent report on SRE practices [6], the metrics most tracked by SRE teams are error rate and end-user response time (see Fig. 2).
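To make the four golden signals concrete, here is a minimal sketch of how they could be derived from one observation window of request records. The `Request` record, the `golden_signals` function, the choice of the 95th percentile for latency, and the use of CPU time as the saturation measure are all assumptions made for illustration; real systems pick their own definitions.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical record of one served request."""
    duration_ms: float  # how long the request took
    ok: bool            # False if the request failed


def golden_signals(requests, window_s, cpu_busy_s, cpu_capacity_s):
    """Summarize one observation window into the four golden signals."""
    if not requests:
        raise ValueError("need at least one request in the window")
    durations = sorted(r.duration_ms for r in requests)
    n = len(durations)
    # Nearest-rank 95th percentile, computed with integer arithmetic.
    rank = (95 * n + 99) // 100
    failed = sum(1 for r in requests if not r.ok)
    return {
        "latency_p95_ms": durations[rank - 1],      # latency
        "error_rate": failed / n,                   # errors
        "traffic_rps": n / window_s,                # traffic
        "saturation": cpu_busy_s / cpu_capacity_s,  # saturation (assumed: CPU)
    }


# Example window: 19 fast successful requests and 1 slow failure in 10 seconds.
window = [Request(100.0, True) for _ in range(19)] + [Request(900.0, False)]
signals = golden_signals(window, window_s=10.0, cpu_busy_s=6.0, cpu_capacity_s=10.0)
print(signals)
```

In this example the single slow failure shows up in the error rate but not in the 95th-percentile latency, which is one reason why latency and errors are usually tracked side by side.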
Furthermore, the most critical thing to figure out about a system’s behavior is “what is broken” (the symptoms), in other words, to infer its internal state, possibly including failures, from the knowledge of its external outputs. In that sense, “Symptoms are a better way to capture more problems more comprehensively and robustly with less effort”, in the words of Rob Ewaschuk (SR Engineer, Google). Once a symptom is identified, the “why” points to a (possibly intermediate) cause.
Clearly, among the golden signals, latency (or response time) and errors are symptoms, while traffic and saturation are more closely related to causes. The former group supports fault detection, and the latter supports fault resolution and continuous improvement, all of which are activities of OPS and SRE teams.
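The split between symptoms and causes can be sketched as a small triage rule: detection fires only on latency and errors, while traffic and saturation are consulted afterwards as hints about the cause. The function name `triage`, the dictionary keys, and every threshold below are hypothetical values chosen for illustration, not recommendations.

```python
def triage(signals, latency_slo_ms=300.0, error_slo=0.01,
           saturation_hint=0.8, traffic_hint_rps=100.0):
    """Detect on symptoms first, then attach cause hints (illustrative thresholds)."""
    # Symptoms: what the user experiences. These drive fault detection.
    symptoms = []
    if signals["latency_p95_ms"] > latency_slo_ms:
        symptoms.append("latency")
    if signals["error_rate"] > error_slo:
        symptoms.append("errors")

    # Causes: consulted only once a symptom has fired, to support resolution.
    cause_hints = []
    if symptoms:
        if signals["saturation"] > saturation_hint:
            cause_hints.append("saturation")
        if signals["traffic_rps"] > traffic_hint_rps:
            cause_hints.append("traffic")

    return {"page": bool(symptoms), "symptoms": symptoms, "cause_hints": cause_hints}


# A busy but healthy system: high saturation alone does not page anyone.
busy = triage({"latency_p95_ms": 120.0, "error_rate": 0.0,
               "saturation": 0.95, "traffic_rps": 250.0})

# A slow system: the latency symptom pages, saturation is offered as a hint.
slow = triage({"latency_p95_ms": 850.0, "error_rate": 0.0,
               "saturation": 0.95, "traffic_rps": 40.0})
print(busy)
print(slow)
```

The design choice here is that a cause-like signal never pages on its own, which keeps alerts tied to what users actually experience.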
IT usage of telemetry data
As introduced above, telemetry data in IT is composed of metrics, traces, and logs. Each of these has a different purpose, volume, and usage, roughly: (1) metrics: fault detection, low volume, “do I have a problem”; (2) traces: fault identification, medium volume, “where is the problem”; and (3) logs: fault qualification, high volume, “what is causing the problem” (see Fig. 3 for a graphical view of these differences).
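The three questions above can be read as a drill-down, sketched below with plain Python dictionaries standing in for telemetry records. The record fields (`trace_id`, `service`, `error`, `level`, `message`), the service names, and the `drill_down` function are hypothetical and do not follow any particular telemetry schema.

```python
def drill_down(error_rate, error_threshold, spans, logs):
    """Walk from metrics to traces to logs (illustrative data model)."""
    # 1) Metrics: "do I have a problem?"
    if error_rate <= error_threshold:
        return None

    # 2) Traces: "where is the problem?" Pick the service with most failed spans.
    failed = [s for s in spans if s["error"]]
    if not failed:
        return {"service": None, "messages": []}
    counts = {}
    for span in failed:
        counts[span["service"]] = counts.get(span["service"], 0) + 1
    suspect = max(counts, key=counts.get)
    trace_ids = {s["trace_id"] for s in failed if s["service"] == suspect}

    # 3) Logs: "what is causing the problem?" Read only the relevant log lines.
    messages = [
        entry["message"] for entry in logs
        if entry["service"] == suspect
        and entry["trace_id"] in trace_ids
        and entry["level"] == "ERROR"
    ]
    return {"service": suspect, "messages": messages}


spans = [
    {"trace_id": "t1", "service": "checkout", "error": False},
    {"trace_id": "t1", "service": "payments", "error": True},
    {"trace_id": "t2", "service": "payments", "error": True},
    {"trace_id": "t3", "service": "checkout", "error": False},
]
logs = [
    {"trace_id": "t1", "service": "payments", "level": "ERROR", "message": "card gateway timeout"},
    {"trace_id": "t2", "service": "payments", "level": "ERROR", "message": "card gateway timeout"},
    {"trace_id": "t3", "service": "checkout", "level": "INFO", "message": "order placed"},
]

finding = drill_down(error_rate=0.05, error_threshold=0.01, spans=spans, logs=logs)
print(finding)
```

Note how the volume handled grows at each step: one number for the metric, a handful of spans for the traces, and individual lines for the logs, which are read only for the traces already under suspicion.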
Finally, in control engineering, a discipline in which the concept of observability has a precise mathematical meaning [2], there is a related concept called detectability. Detectability is a property of a given system in which all the unobservable states are stable, which means that it is not required to “observe” all the states in that system. This is particularly important in IT since it is nearly impossible to “observe” all the states of a given system.
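As a toy illustration of the difference between observability and detectability, consider a discrete-time linear system whose state matrix is diagonal with distinct eigenvalues and which has a single output. Under those simplifying assumptions (made here for illustration, they are not part of the general definitions in [2]), a state is unobservable exactly when its output weight is zero, and the system is detectable when every such hidden state has an eigenvalue of magnitude below one. The function name `classify` is hypothetical.

```python
def classify(eigenvalues, output_weights):
    """Return (observable, detectable) for a diagonal discrete-time system.

    Assumed model: x[k+1] = A x[k] with A = diag(eigenvalues), a single
    output y[k] = sum(c_i * x_i[k]), and distinct eigenvalues.
    """
    if len(eigenvalues) != len(output_weights):
        raise ValueError("one output weight per state is required")
    if len(set(eigenvalues)) != len(eigenvalues):
        raise ValueError("this sketch assumes distinct eigenvalues")

    # A state that never reaches the output cannot be inferred from it.
    hidden = [lam for lam, c in zip(eigenvalues, output_weights) if c == 0]

    observable = len(hidden) == 0
    # Detectable: every hidden state decays on its own (|eigenvalue| < 1).
    detectable = all(abs(lam) < 1 for lam in hidden)
    return observable, detectable


print(classify([0.5, 0.9], [1.0, 1.0]))  # every state reaches the output
print(classify([0.5, 1.2], [0.0, 1.0]))  # the hidden state is stable
print(classify([0.5, 1.2], [1.0, 0.0]))  # the hidden state is unstable
```

The second case is the one that matters for IT: a state is hidden from the output, but because it is stable it cannot grow without bound unnoticed, so not observing it is acceptable.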
In the F1’s pit garage — Takeaways
In organizations that plan to adopt the SRE discipline and, consequently, some form of observability (or telemetry data, as advocated by [5]) as its medium, the OPS team is roughly like the drivers from the early years of F1: they have only the stopwatch and pit board (dashboards in IT) to inform them of their progress, and the feedback received through their hands and the seat (alerts in IT) to gauge their car’s behavior (the system’s behavior in IT).
Roughly, engineers (the SRE team in IT) and drivers (the OPS team in IT) from the early years of F1 can only guess at how the usage of telemetry (the SRE discipline and, consequently, some form of observability as its medium) might have improved their lap times, something that has proven to be a major competitive advantage (and there is no reason for IT to be different).
References
[1] Site Reliability Engineering (SRE) in the old-school kindergarten — https://romgerale.medium.com/site-reliability-engineering-sre-in-the-old-school-kindergarten-a693bed2f9fb
[2] Kalman R. E., “On the General Theory of Control Systems”, Proc. 1st Int. Cong. of IFAC, Moscow 1960 1481, Butterworth, London 1961.
[3] Data acquisition and telemetry in Formula 1 — Jason Sultana — https://formulaoneinsights.com/data-acquisition-and-telemetry-in-formula-1/
[4] F1 Telemetry for Rookies — Sauber F1 Team — https://www.youtube.com/watch?v=0sR5oCIfXDI
[5] OpenTelemetry — https://opentelemetry.io/
[6] 2020 SRE Report THE DISTRIBUTED SRE — Catchpoint
[7] Logs and Metrics and Traces, Oh My! — Splunk — https://www.youtube.com/watch?v=O0XNSU-I-sg