A Sword Battle against Unavailability

Alessandro Gerlinger Romero
4 min read · Jun 26, 2023

Recall that the quest against unavailability at an insurance company, concerning an application that supports insurance quotes, applications, and policies, began the moment the CTO's awareness reached beyond the boundaries of the application code [1].

In the ongoing journey against unavailability, the insurance company made remarkable progress by significantly augmenting its computational resources. As a result, the system became capable of accommodating higher loads and delivering enhanced performance [2]. Nevertheless, during peak periods of high throughput demand, there was still a noticeable increase in the 90th percentile of response time experienced by end-users.
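To make the metric concrete, here is a minimal sketch of how a 90th percentile of response time can be computed with the nearest-rank method. The function name and the sample values are hypothetical, chosen only for illustration; they are not measurements from the system described here, and monitoring tools may use a different percentile definition.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Return the pct-th percentile (0 < pct <= 100) using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    # 1-based rank of the smallest value that covers pct% of the samples.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical response times in milliseconds (illustrative only).
response_times_ms = [120, 135, 150, 160, 180, 210, 250, 400, 900, 2500]

p50 = percentile_nearest_rank(response_times_ms, 50)
p90 = percentile_nearest_rank(response_times_ms, 90)
print(f"median = {p50} ms, p90 = {p90} ms")
```

With these sample values the median stays low while the 90th percentile is dominated by the slow tail, which is why a percentile exposes peak-period degradation that an average can hide.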

Indeed, these peak periods of high demand exceeded the previous ones. For instance, after four weeks, the number of quotes per day had increased eightfold, following the shared rollout business plan.

Notably, CPU utilization was no longer an issue, as its usage within the nodes did not exceed 70% even during these peak periods. The bottleneck had shifted to another part of the system, presenting a new constraint that needed to be defeated in a further battle.

Unavailability as a Function of Scalability

The previous progress was achieved by 'cloning' the initial computational resources. This improved the system's scalability, a non-functional property that models the ability to handle increasing workloads appropriately. With greater scalability, unavailability decreased and availability increased.

Availability (A) refers to the ability of a system to remain operational and accessible to users over time. It is typically measured in terms of uptime, which is the percentage of time a system is available for use. Higher availability means that the system experiences minimal downtime or disruptions, ensuring that users can access it whenever they need to. Complementarily, unavailability (U) is typically measured in terms of downtime, the percentage of time a system is unavailable, so that U = 100% - A.
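The two definitions can be sketched in a few lines. The function names, the 30-day observation window, and the downtime figure below are assumptions made for illustration, not values from the insurance company.

```python
def unavailability_pct(downtime_minutes, period_minutes):
    """Unavailability U as a percentage of the observation period."""
    if period_minutes <= 0:
        raise ValueError("period_minutes must be positive")
    if not 0 <= downtime_minutes <= period_minutes:
        raise ValueError("downtime_minutes must lie within the period")
    return 100.0 * downtime_minutes / period_minutes


def availability_pct(downtime_minutes, period_minutes):
    """Availability A, the complement of U: A = 100% - U."""
    return 100.0 - unavailability_pct(downtime_minutes, period_minutes)


# Hypothetical example: 43.2 minutes of downtime in a 30-day window.
MINUTES_PER_30_DAYS = 30 * 24 * 60
u = unavailability_pct(43.2, MINUTES_PER_30_DAYS)
a = availability_pct(43.2, MINUTES_PER_30_DAYS)
print(f"U = {u:.3f}%, A = {a:.3f}%")
```

In this hypothetical case U is roughly 0.1% and A roughly 99.9%, the level often described as 'three nines'.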

Using these definitions, the CTO envisioned that the relation between unavailability (expressed here as a fraction of time, 0 < U < 1) and scalability (represented as a score between 0 and 1) could be graphed for further analysis. Furthermore, leveraging the scale cube [3], which defines a three-dimensional space for scalability, unavailability could be conceptualized as a function of scalability in a four-dimensional space (see the graph below, where the color scale represents the fourth dimension, unavailability).

The 'cloning' of computational resources. Source: author; computed using MATLAB online (https://matlab.mathworks.com/)

The graph revealed to the team that the progress in reducing unavailability (towards 0% or complete absence of downtime) had been achieved by moving the scalability score along the X-axis through 'cloning' processors.

This visualization posed a question to the CTO and the team: should we explore the other dimensions of scalability?

The Sword Battle against Unavailability

Meanwhile, the team found out that unavailability during the peak periods was rooted in concurrent access to data blocks inside the database. As the number of concurrent insurance quotes, applications, and policies continued to grow, the current configuration for handling concurrency on data blocks began to reach its limits, imposing a constraint.

Immediately, the CTO and the team comprehended that the Z-axis of scalability should be utilized, involving the 'splitting of similar things'.

The CTO drove the team to identify the top five tables whose data blocks suffered the most concurrent access.

With a fast movement of the sword, the CTO quickly chopped the first table into 20 hash partitions. Subsequently, the evaluation of the results pointed to chopping the same table into date interval partitions, each with 20 hash subpartitions. Applying rapid movements, the CTO chopped the next three tables.
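The effect of this chopping can be simulated outside the database. The sketch below routes rows first to one of 20 hash partitions and then to a composite (date interval, hash subpartition) pair. It rests on several assumptions: CRC32 stands in for the database's internal hash function, the interval is taken to be monthly (the granularity is not stated here), and the quote IDs are hypothetical.

```python
import zlib
from collections import Counter
from datetime import date

HASH_PARTITIONS = 20  # number of hash (sub)partitions used in the article


def hash_bucket(key, buckets=HASH_PARTITIONS):
    """Map a key to a stable bucket in [0, buckets).

    CRC32 is an assumption standing in for the database's own hash.
    """
    return zlib.crc32(str(key).encode("utf-8")) % buckets


def composite_partition(created_on, key, buckets=HASH_PARTITIONS):
    """Return (interval, subpartition): a monthly interval plus a hash bucket.

    The monthly granularity is an assumption made for this sketch.
    """
    return (created_on.strftime("%Y-%m"), hash_bucket(key, buckets))


# Hypothetical quote IDs, all created on the same peak day.
peak_day = date(2023, 6, 26)
quote_ids = range(1, 10001)

spread = Counter(composite_partition(peak_day, qid) for qid in quote_ids)
busiest = max(spread.values())
print(f"{len(spread)} subpartitions in use, busiest holds {busiest} rows")
```

Rows created on the same day fall into the same date interval but are spread over the hash subpartitions, so concurrent sessions are less likely to compete for the same data blocks. The date interval additionally keeps rows from different periods apart.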

The final table among the top five required a more elaborate sword technique since it was frequently accessed based on different criteria. Nevertheless, the chopping was successfully executed.

This sword battle took approximately one month and concluded with the top five tables chopped. The resulting unavailability as a function of scalability is depicted below. Departing from the initial condition, the journey first went through the 'cloning' of the initial computational resources, moving scalability along the X-axis. Then the sword battle moved scalability along the Z-axis through the partitioning of tables, or 'splitting of similar things'.

The sword battle. Source: author; computed using MATLAB online (https://matlab.mathworks.com/)
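The function behind the graphs is not given in this article. Purely as an illustration of the idea, the sketch below uses a toy model in which unavailability falls linearly as the mean of the three scalability scores rises. The model, the baseline value, and the scores assigned to each stage are all assumptions, not the author's data.

```python
def toy_unavailability(x, y, z, baseline=0.05):
    """Toy model (an assumption): U falls linearly with the mean scalability score.

    x, y, z are scale-cube scores in [0, 1]; baseline is U at (0, 0, 0).
    """
    for score in (x, y, z):
        if not 0.0 <= score <= 1.0:
            raise ValueError("scores must be within [0, 1]")
    return baseline * (1.0 - (x + y + z) / 3.0)


# Hypothetical scores for the three stages of the journey.
journey = [
    ("initial condition", (0.0, 0.0, 0.0)),
    ("after cloning (X-axis)", (0.6, 0.0, 0.0)),
    ("after partitioning (Z-axis)", (0.6, 0.0, 0.5)),
]

for label, (x, y, z) in journey:
    print(f"{label}: U = {toy_unavailability(x, y, z):.4f}")
```

Under this toy model each stage lowers U, first through movement along the X-axis and then along the Z-axis, while the Y-axis remains unexplored.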

Wrapping up

As the insurance company continued to evolve, the CTO and the team remained vigilant, constantly monitoring the system’s unavailability, and seeking opportunities for further optimization. They recognized that the journey against unavailability was an ongoing process, requiring continuous improvement and adaptation.

References

[1] A Journey against Unavailability began: https://romgerale.medium.com/a-journey-against-unavailability-began-c25e69142583

[2] Overcoming a Resource Constraint in the Journey against Unavailability: https://medium.com/@romgerale/overcoming-a-resource-constraint-in-the-journey-against-unavailability-2e41ba1a8e4f

[3] The Art of Scalability: Scalable Web Architecture, Processes, and Organizations for the Modern Enterprise, Martin L. Abbott and Michael T. Fisher, Addison-Wesley Professional, 2nd edition
