Site Reliability Engineering (SRE) in the old-school kindergarten
Of course, you have heard about Site Reliability Engineering (SRE), a term introduced by Google. The aim of this text is to dissect it using a seesaw (an old-school kindergarten toy) as a conceptual model. Nonetheless, instead of going from kindergarten to middle school, as usual, this text goes the other way around.
Let us start with a brief history of reliability engineering. One can argue that statistics and mass production were the enablers in the rise of reliability engineering; concomitantly, many reliability issues occurred in World War II due to the unreliability of electronic equipment and fatigue issues. Consequently, the modern use of the word reliability was defined by the DoD (U.S. Department of Defense) in the 1940s, characterizing a product that would operate as required, under stated conditions, for a specified time. Later, in the 1950s, a group called the “Advisory Group on Reliability of Electronic Equipment” (AGREE), within the DoD, investigated reliability methods for military equipment. This group recommended three main ways of working: (1) improve component reliability, (2) establish quality and reliability requirements for suppliers, and (3) collect field data and find the root causes of failures.
Therefore, looking at it in perspective, it is clear that reliability engineering (RE) is a discipline established since the 1940s. Furthermore, roughly, there are two approaches to the reliability challenge: (1) Failure Avoidance, the design of a system for long life, with rigorous quality control accompanied by maintenance policies; (2) Failure Tolerance, the design of a system which can tolerate component failures and continue to function.
Finally, as a system grows in complexity (e.g., the number of highly interconnected components), the demand for reliability engineering grows as well. At this point, it should be clear that IT phenomena such as cloud, “…-as-a-service”, and microservices, which build on distributed computing, require reliability engineering since the complexity of systems continues to grow. Such an observation, in the IT domain, leads to the quote of Werner Vogels (VP & CTO, Amazon): “failures are a given, and everything will eventually fail over time”, a lesson already experienced in the field during World War II in the 1940s.
Still, regarding the IT domain, the apparent dilemma between DEV and OPS teams is well-known. Roughly, such a dilemma can be analyzed using the elementary physics of a seesaw (see Fig. 1), in which an organization, a process, and a product are given. On one hand, the DEV team strives to optimize lead time for changes and deployment frequency (two of the five metrics of software delivery and operational performance, SDO); on the other hand, the OPS team struggles to optimize change failure rate and time to restore (two more SDO metrics). Meanwhile, somewhere in between, the DEVOPS team battles to find an equilibrium, knowing that, in general, the change failure rate is between 0% and 60%.
One condition for equilibrium on such a seesaw is that the sum of all forces (a one-dimensional problem, disregarding torques; arrows down meaning negative forces, arrows up meaning positive forces) equals zero, which leads to:

F_DEVOPS - F_DEV - F_OPS = 0

With some elementary algebra:

F_DEVOPS = F_DEV + F_OPS
Hence, the condition for equilibrium is that the DEVOPS team should have a force (meaning the capacity to influence the direction of the given organization, process, and product) that equals the sum of the forces of the DEV and OPS teams. We leave it as a mental exercise whether such a condition has been satisfied in your previous and current experiences.
Besides, recall that IT phenomena such as cloud, “…-as-a-service”, and microservices, which build on distributed computing, require reliability engineering since the complexity of systems continues to grow. Such phenomena are directly related to the last SDO metric, availability, and are not historically handled by any of the three teams discussed above, namely DEV, OPS, and DEVOPS.
Here is the major contribution of Google: the emergence of an independent IT team to handle reliability engineering (see Fig. 2), which can be translated into the IT domain as: (1) improve reliability, (2) establish reliability requirements, and (3) collect field data (a.k.a. observability) and find the root causes of failures. Compare these with the three main ways of working defined in the 1950s by AGREE.
It turns out that the condition for equilibrium in the updated model of the seesaw (with the SRE team) is, after some elementary algebra:

F_DEVOPS + F_SRE = F_DEV + F_OPS
Hence, the condition for equilibrium is that the forces of the DEVOPS and SRE teams, summed up, equal the sum of the forces of the DEV and OPS teams. We leave it as a challenge whether such a condition will be satisfied in your next experiences, in the sense that user satisfaction will be achieved using reliability engineering.
Furthermore, in reliability engineering, the most basic definition is the reliability function (R), which is the probability that a system will perform a required function, under stated conditions, for a specified time. Complementarily, the unreliability function (Q) is the probability that a system will fail (Q = 1 - R). Both definitions are stated in statistical terms, as probabilities, which reflects the fact that failures occur at unpredictable times; beyond that, this establishes at the outset that much of the analysis in reliability engineering has to be statistical.
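As a minimal sketch of these two definitions, assuming the classic constant-failure-rate (exponential) model, an illustrative assumption of ours, not something stated in this text, R and Q can be computed as:

```python
import math

def reliability(failure_rate, t):
    """R(t): probability the system still performs its required function
    at time t, under the exponential model R(t) = exp(-lambda * t).
    The constant failure rate is an illustrative assumption."""
    return math.exp(-failure_rate * t)

def unreliability(failure_rate, t):
    """Q(t) = 1 - R(t): probability the system has failed by time t."""
    return 1.0 - reliability(failure_rate, t)

# Illustrative component: on average one failure per 1000 hours
rate = 1 / 1000
print(round(reliability(rate, 1000), 3))    # R(1000 h) = e^-1, about 0.368
print(round(unreliability(rate, 1000), 3))  # Q(1000 h), about 0.632
```

Note how Q is defined purely in terms of R, mirroring the complementary relation Q = 1 - R stated above.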
Here is another contribution of Google: instead of approaching reliability engineering using statistics, it rebranded R, the reliability function, as the SLO (Service Level Objective: a percentage, and also a weaker definition in the sense that it can be violated and is not defined by “field data”), and it rebranded Q, the unreliability function, as the error budget, meaning the tolerated number of errors in a given system during a specified time without violating the SLO.
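To make the rebranding concrete, here is a small sketch turning an SLO into an error budget of allowed downtime; the 30-day window and the 99.9% target are illustrative assumptions, not values taken from this text:

```python
def error_budget_minutes(slo, window_days=30):
    """Error budget: the downtime (in minutes) tolerated over the window
    without violating the SLO. `slo` is a fraction, e.g. 0.999 for 99.9%.
    The 30-day default window is an illustrative assumption."""
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes

# A 99.9% availability SLO over 30 days leaves roughly 43.2 minutes
print(round(error_budget_minutes(0.999), 1))  # about 43.2
```

The error budget here plays exactly the role of Q: it is the complement of the SLO over the chosen window.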
The last contribution of Google is the usage of SLIs (Service Level Indicators: indeed, the way that “field data” is collected) to enact the capacity to influence the direction of the given organization, process, and product by the four teams, namely DEV, OPS, DEVOPS, and SRE. Roughly, the general policy states that while the SLI is constrained by the error budget, new features go to production; otherwise, the reliability of the system is the feature to be enhanced in production.
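The general policy above can be sketched as a tiny decision rule; the function name and the minute-based accounting are our illustrative assumptions, not a prescribed implementation:

```python
def release_decision(budget_spent_min, budget_total_min):
    """Error-budget policy sketch: while the measured SLI keeps the
    spent budget below the total, new features go to production;
    otherwise, reliability itself becomes the next feature to ship."""
    if budget_spent_min < budget_total_min:
        return "ship new features"
    return "improve reliability"

print(release_decision(12.0, 43.2))  # budget remaining
print(release_decision(50.0, 43.2))  # budget exhausted
```

In practice this rule is what lets all four teams, DEV, OPS, DEVOPS, and SRE, negotiate direction through a shared, measurable quantity.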
Kindergarten — Takeaways
IT phenomena such as cloud, “…-as-a-service”, and microservices, which build on distributed computing, require reliability engineering since the complexity of systems continues to grow. By the same token, “failures are a given, and everything will eventually fail over time”.
In our understanding, Google proposes a three-fold path regarding reliability engineering for cloud-native software in striving for the equilibrium of the analyzed seesaw:
- Allocate reliability requirements and concerns in a dedicated team, the SRE team;
- Use simple apparatus to handle reliability, namely SLO, error budget, and SLI;
- Use error budget to enact the capacity to influence the direction of the given organization, process, and product by the four teams, namely DEV, OPS, DEVOPS, and SRE. Recall the direction of the forces exerted by each team.
Saleh, J.H. and Marais, K., “Highlights from the Early (and Pre-) History of Reliability Engineering”, Reliability Engineering and System Safety, 2006.
Souza, M. and de Carvalho, T., “The Fault Avoidance and The Fault Tolerance Approaches for Increasing the Reliability of Aerospace and Automotive Systems”, SAE Technical Paper 2005-01-4157, 2005.
Accelerate State of DevOps 2019, https://services.google.com/fh/files/misc/state-of-devops-2019.pdf