A Journey Against Unavailability Began

Alessandro Gerlinger Romero
Apr 25, 2023

Once upon a time in an insurance company, a CTO (Chief Technology Officer) faced recurrent unavailability of the application supporting insurance quotes, applications, and policies. Such unavailability was perceived differently by stakeholders:

- Insurance brokers were frequently unable to process their insurance quotes and applications;

- Business teams experienced impacts on their income;

- Infrastructure teams were often in war rooms;

- Technology teams oscillated between delivering new features and searching for the root cause of the unavailability.

The Wound

In an attempt to find the root cause, infrastructure teams noticed that during business hours, as the number of users increased, the demand for SQL execution in the database also rose. Moreover, that demand reached levels at which the database could no longer deliver the required response time, resulting in unavailability.

Therefore, during the war rooms conducted by infrastructure teams, the main recurring theme was which SQL statements were executed most frequently and whether they were properly tuned. Once the statements were verified and their execution plans were good enough to run in, at most, a dozen milliseconds, the next question was: why are they executed so frequently?

At this point, the CTO directed the teams to focus on the application, since the claim was that the database was properly tuned and had enough elasticity for the required capacity. Accordingly, the CTO actively participated in frequent war rooms, investing his time in pursuing that claim.

After some time, the technology teams had properly tuned the SQL statements that were found; however, the last question remained: why are they executed so frequently? So, the next step for the technology teams was to apply caching to reduce the number of times the query statements were executed.
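To illustrate the idea of caching frequently executed queries, here is a minimal sketch in Python. It is not the implementation used by the teams in this story; the class name TTLCache, the loader function, and the 60-second expiry are all assumptions made for illustration. The loader stands in for whatever code runs the read-only SQL statement.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """Read-through cache: repeated reads are served from memory until they expire.

    Hypothetical helper for illustration; `loader` stands in for the function
    that executes the read-only SQL statement against the database.
    """

    def __init__(
        self,
        loader: Callable[[Hashable], Any],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        # key -> (expiry timestamp, cached value)
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get(self, key: Hashable) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            # Cache hit: no query reaches the database.
            return entry[1]
        # Cache miss or expired entry: run the query once and remember the result.
        value = self._loader(key)
        self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key: Hashable) -> None:
        # Call this after the underlying data is modified, so stale values are not served.
        self._entries.pop(key, None)
```

With such a wrapper, a lookup that many users repeat within the expiry window reaches the database once instead of once per request. Note that this only helps statements that read data; statements that modify data still have to reach the database.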

Despite the changes made by the technology teams, the diagnosis persisted: during business hours, as the number of users increased, the demand for SQL execution in the database also rose, and it reached levels at which the database could not deliver the required response time, resulting in unavailability. In addition, the remaining frequently executed SQL statements were now those modifying data, mainly UPSERTs and INSERTs.

Facing such a changed scenario, the CTO claimed that a refactoring of the application was required.

Although everyone agreed that refactoring was applicable, everyone also agreed that the time such a refactoring would take to reach the market was not feasible for the business. Meanwhile, the business teams shared that the rollout plan would significantly increase the total number of users during business hours in the coming weeks.

The Journey

Analyzing this slightly different scenario, the CTO stated that a broader range of options was available, ranging from a capacity review, through tuning of the operating system and the database, to manageable application changes.

Armed with this increased awareness, the CTO led the teams to a whole new set of alternatives. And so began the journey against unavailability (in other words, the journey to increase the reliability) of the application supporting insurance quotes, applications, and policies in that insurance company.

Image Source and Credits: https://guardian.ng/opinion/the-journey-of-man/
