Value Proposition of Site-Reliability Engineering in a Tale of Two Cities

A Tale of Two Cities is a novel by Charles Dickens published in the mid-19th century. Its central theme is duality, which is depicted in the first paragraph of the novel:

"It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair…" - Chapter 1 - The Period

source: https://www.amazon.com.br/Tale-Two-Cities-Special-Illustrated-ebook/dp/B07HKWVG3G

Henceforth we explore the duality of the value proposition of Site-Reliability Engineering (SRE) through real-world stories and data. Consider a hypothetical company with two "cities", which are:

  • CEO's city - a city in which the mayor is the Chief-Executive Officer (CEO), who is concerned, at least, about sales and profit;
  • CTO's city - a friendship city is governed by the Chief-Technology Officer (CTO), who is concerned about CAPEX (Capital Expenditure) and OPEX (Operational Expenditure) among others.

Let us move to the Tale of Two Cities regarding the value proposition of SRE.

CEO's city

Regarding the CEO’s concerns, an emblematic real-world story is the partial outage of Amazon’s website during their annual prime day sale in 2018. According to analysts [1][2], this hour-long outage could have cost Amazon up to $100 million in sales. Additionally, reputation costs related to the users' frustration were not estimated.

Availability matters for sales, and, consequently, impacts revenue and profit. Some experts [3] argue that availability is assumed as foundational and hence is not perceived as a major differentiator. Nonetheless, the most important feature of every system is availability (an SRE tenet [5]). Usability, performance, scalability, maintainability, etc. all require the systems to actually be available.

The dual concept of availability is unavailability, which is rooted in failures. Werner Vogels, again from Amazon, states “Failures are a given, and everything will eventually fail over time”. A quote that challenges the previous assumption, availability is foundational, using common sense as well as the body of knowledge of reliability engineering [4].

Let us explore the friendship city.

CTO’s city

It is well-known the duality between computation and communication, which in terms of the CTO's city can be interpreted as computing and networking.

Facebook disclosed a real-world story about its networking team in mid-2010 [6]. In the story, Najam Ahmad, Facebook’s director of network engineering, stated “No software engineer ever came to network” before the emblematic project called FBAR (Facebook Auto-Remediation). Moreover, Najam added that the networking team shifted the system administration job to “… an actual engineering job where you’re creating things. The network is a large-scale distributed system, and it’s a really fun problem to solve … pretty much everyone on the team is writing some level of code”. The outcome was outstanding: FBAR sifts through 3.37 billion notifications from network devices each month, filtering out noise down to roughly 750,000 alarms that need action, of those, FBAR resolves 99.6 percent of the alarms without human intervention[6].

Facebook's real-world story [6] converges into the definition of SRE by Google “SRE is what happens when you ask a software engineer to design an operations team”[5]. Furthermore, it highlights that human labor in the operations can be optimized, which results in increased morale of teams — in the novel, the hope of renewal is what prevents people from losing what make them human and resort to reprovable behavior, indeed, another tenet of SRE, a blamelessness culture [5] — , lower OPEX costs, and higher-degrees of availability.

Supporting the emergence of higher-degrees of availability instead of the previously mentioned foundational assumption, Gartner published data about the cost of network (communication) downtime, which is on average $300,000 per hour [3]. Complementarily, Statista published in 2020, 25 percent of respondents worldwide reported the cost of their servers downtime (computation) as being between $301,000 and $400,000 [7].

In summary, the lack of the required availability regarding computation and communication in the CTO's city compromises OPEX since it decreases the user productivity due to slow response times and unavailability. Moreover, fire-fighting issues is a costly activity including from the human viewpoint. Finally, CAPEX is also compromised because the focus on issues diverts resources from growth and innovation opportunities.

Wrapping up

In the Tale of Two Cities, the value proposition of SRE is dual:

  • for the CEO's city, the delivery of the required availability means more sales, consequently, more revenue, and potentially more profit, if the OPEX is under control. More profit perhaps results in a higher CAPEX for the CTO's city;
  • in the CTO's city, the guarantee of the required availability reduces the OPEX — and so potentially increasing the profit of the CEO's city. Complementarily, it enables the focus on investments (CAPEX) in growth and innovation opportunities — once again, positively reflecting in the profit of the CEO's city.

References

[1] https://www.theverge.com/2018/7/16/17577654/amazon-prime-day-website-down-deals-service-disruption

[2] https://www.businessinsider.com/amazon-prime-day-website-issues-cost-it-millions-in-lost-sales-2018-7

[3] https://blogs.gartner.com/andrew-lerner/2014/07/11/network-downtime/

[4] https://romgerale.medium.com/site-reliability-engineering-sre-in-the-old-school-kindergarten-a693bed2f9fb

[5] https://sre.google/sre-book/introduction/

[6] https://www.sdxcentral.com/articles/news/how-facebook-transformed-its-networking-team/2015/05/

[7] https://www.statista.com/statistics/753938/worldwide-enterprise-server-hourly-downtime-cost/