Overcoming a Resource Constraint in the Journey against Unavailability


Recall that the journey against unavailability, for an application supporting insurance quotes, applications, and policies at an insurance company, began when the CTO developed a heightened awareness that extended beyond the application code [4].

But that was just the beginning. Along the way, the CTO ran into a resource constraint that posed a significant obstacle. The constraint was overcome by making strategic use of the elasticity of the existing infrastructure, which is particularly noteworthy because no additional investment was required.

Constraints, in general

In the theory of constraints, the fundamental premise is that the rate of goal achievement by a goal-oriented system is limited by at least one constraint [1]. In line with this, the CTO envisioned the rate of quotes per day as the desired goal, aligning with the business perspective.

In that regard, the CTO applied reductio ad absurdum as follows: if nothing prevented the system from achieving higher rates of quotes per day, the rate would be infinite, which is clearly impossible in a real-world system. In practice, insurance brokers had experienced periods of unavailability and poor response times, which further confirmed the existence of constraints.

Resource Constraint, in practice

A cornerstone database was running on a virtualization infrastructure. The database cluster ran on two nodes (to provide high availability), and each node was capped at 15 virtual CPUs. The nodes were also configured to use SMT4. SMT, or simultaneous multithreading, lets multiple independent threads of execution make better use of the available computational resources; with SMT4, each virtual CPU exposes four such threads. Consequently, the total number of logical CPUs was calculated as follows: 15 virtual CPUs * 4 SMT * 2 nodes = 120.

During business hours, the infrastructure teams noticed a correlation between the growing number of users and the rising demand for SQL execution in the database. Investigating the database through instrumentation, the teams identified three key findings: (1) CPU utilization within the nodes constantly exceeded 95%; (2) the virtualization infrastructure was already delivering the maximum number of available virtual CPUs; and (3) the number of active sessions in the database far surpassed the number of logical CPUs.
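The three findings can be read as three independent checks against a single snapshot of the database nodes. The sketch below is only an illustration of that reasoning, not the tooling the teams used: the Snapshot type, its field names, and the saturation_findings function are hypothetical, and the sample values for CPU utilization (0.97) and active sessions (400) are invented for the example. Only the 95% threshold, the 15 virtual CPU ceiling, and the 120 logical CPUs come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical point-in-time metrics for the database cluster."""
    cpu_utilization: float  # fraction between 0.0 and 1.0
    allocated_vcpus: int    # virtual CPUs currently delivered per node
    max_vcpus: int          # per-node ceiling set in the virtualization layer
    active_sessions: int    # database sessions actively executing SQL
    logical_cpus: int       # virtual CPUs * SMT threads * nodes


def saturation_findings(s: Snapshot, cpu_threshold: float = 0.95) -> list[str]:
    """Return the names of the findings that hold for this snapshot."""
    findings = []
    if s.cpu_utilization > cpu_threshold:
        findings.append("cpu_saturated")
    if s.allocated_vcpus >= s.max_vcpus:
        findings.append("vcpu_ceiling_reached")
    if s.active_sessions > s.logical_cpus:
        findings.append("sessions_exceed_logical_cpus")
    return findings


# Illustrative values only: 0.97 and 400 are made up for the example.
busy = Snapshot(cpu_utilization=0.97, allocated_vcpus=15, max_vcpus=15,
                active_sessions=400, logical_cpus=120)
print(saturation_findings(busy))
```

When all three checks fire at once, as in the situation described above, adding load can only lengthen the queue of sessions waiting for a CPU, which is what points to the resource allocation itself as the constraint.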

Thanks to his heightened awareness extending beyond the application code, the CTO guided the teams towards tuning the virtualization infrastructure. This was necessary because it had become evident that the current allocation of computational resources was the practical constraint.

The tuning of the virtualization infrastructure was performed meticulously, and internal benchmarks were derived from it. One noteworthy benchmark indicated that, for every additional virtual CPU allocated to the database during a given load test, the 90th percentile of the response time experienced by the end user decreased by approximately 200 milliseconds.

Such a benchmark suggests that the journey against unavailability contributed directly to the insurance company’s income. This is consistent with findings in the literature, such as a notable example from 2006 in which professionals at Amazon publicly stated that every additional 100 milliseconds of load time resulted in a 1% decrease in sales [2, 3].

The constraint on the database’s computational resources was overcome within approximately six months. The effort significantly increased the total number of logical CPUs available, which now stands at 896 (56 virtual CPUs * 8 SMT * 2 nodes). This is more than seven times the initial 120 logical CPUs. Equally important, this expansion was accomplished through strategic use of the current infrastructure’s elasticity, without requiring additional investments.
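The arithmetic behind both configurations can be checked in a few lines. This is a minimal sketch: the logical_cpus helper is a name introduced here for illustration, while the inputs (15 and 56 virtual CPUs, SMT4 and SMT8, two nodes) are taken from the text.

```python
def logical_cpus(vcpus_per_node: int, smt_threads: int, nodes: int) -> int:
    """Logical CPUs = virtual CPUs per node * SMT threads per virtual CPU * nodes."""
    return vcpus_per_node * smt_threads * nodes


before = logical_cpus(vcpus_per_node=15, smt_threads=4, nodes=2)
after = logical_cpus(vcpus_per_node=56, smt_threads=8, nodes=2)
growth = after / before

print(before, after, round(growth, 2))  # 120 896 7.47
```

The growth factor of roughly 7.47 comes from two multipliers acting together: the virtual CPUs per node rose from 15 to 56, and the SMT mode doubled from 4 to 8 threads.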

What awaits in the ongoing journey against unavailability

The remarkable expansion of computational resources demonstrates the effectiveness of the CTO’s involvement in overcoming the constraint. It emphasizes the critical importance of fine-tuning and optimizing the infrastructure in the journey against unavailability, ultimately leading to positive impacts on the company’s bottom line.

However, the story doesn’t end here. There is more to the tale, as further challenges and achievements await in the ongoing journey.