Microservice: Everyone Makes Mistakes — Part II

Feb 11, 2020

“Everyone makes mistakes. The wise are not people who never make mistakes, but those who forgive themselves and learn from their mistakes.” — Ajahn Brahm (theoretical physicist, author, and Buddhist monk)

This second part returns to the endeavor of offering wise people, perhaps, a shortcut to “learn from their mistakes” when applying the microservice pattern (if shortcuts exist at all). The first part focused on application design mistakes [1], whereas this second and final part focuses on architectural mistakes.

A system that applies the microservice pattern is, by definition, a distributed system. Consequently, its architectural mistakes are deeply related to the fallacies of distributed computing [2] and their impact on the “-ilities” (reliability, scalability, responsiveness, and so on).

First Mistake — The avoidance of “Right Speech” or the legitimation of idle chatter

A practitioner committed to the “Right Speech” abstains from idle chatter. In the realm of microservices, a common mistake is to design an intrinsically distributed system with a monolithic “Procedure-Call (PC) mindset”: roughly, “whenever it is needed, do a remote PC using REST over HTTP/S”. This certainly leads to idle chatter.

This mistake is closely related to the mistakes explored in the first part [1], in particular to its Second Mistake (the avoidance of the “Middle Way”, or the wrong scope size of the microservices) and its Third Mistake (the avoidance of “Right Effort”, or the lack of proper design). In fact, it works as a “technical endorsement” that allows excessive communication between services and sidesteps the questions that naturally arise about the complexity of that communication, e.g., “Why do we need a REST call over HTTPS to retrieve the genders? How often do genders change? Should I cache this data locally? What is the SLA of the genders’ service?”.

The consequences of idle chatter, including the chatter caused by the “PC mindset”, are DISASTROUS in a distributed system. The number of services that collaborate for a given business function may be high and network calls can fail (Fallacy 1. The network is reliable), which compromises the overall reliability of the system: a set of components coupled in series has a reliability equal to the product of the reliabilities of its components. Moreover, when the number of services is high, network latency (Fallacy 2. Latency is zero) and the serialization and deserialization of exchanged messages (Fallacy 7. Transport cost is zero) severely compromise responsiveness, particularly at peak utilization. Furthermore, at peak utilization, bandwidth can become a serious constraint in a non-local network (Fallacy 3. Bandwidth is infinite).
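The multiplication rule for components in series can be made concrete with a few lines of Python. This is a minimal sketch; the 99.9% per-call success rate and the hop counts are illustrative assumptions, not measurements from any real system.

```python
from math import prod


def series_reliability(reliabilities):
    """Reliability of components coupled in series: the product of all of them."""
    return prod(reliabilities)


# Assumption for illustration: every remote call succeeds 99.9% of the time.
PER_CALL = 0.999

for hops in (1, 10, 50, 100):
    chain = series_reliability([PER_CALL] * hops)
    print(f"{hops:>3} calls in series -> reliability {chain:.4f}")
```

Under these assumed figures, a business function that chains 100 remote calls succeeds only about 90% of the time, even though each individual call looks very reliable.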

Once the DISASTROUS SCENARIO has been established by the avoidance of “Right Speech”, it is common to see the mistake spread further into the analysis (always a warning sign for the wise):

(1) to handle Fallacy 1. The network is reliable, one emphasizes the usage of timeouts, automatic retries and, perhaps, circuit breakers.

(2) to deal with Fallacy 2. Latency is zero, one recommends local caching. Furthermore, when the volume of data to be cached is “too big”, the recommendation leads to the concept of a “warmed cache”.

(3) to manage Fallacy 7. Transport cost is zero, one stresses the transport protocols more than the design/analysis of inter-service communication.

(4) to take care of Fallacy 3. Bandwidth is infinite, one suggests throttling policies in addition to local caching.

Although these are important concerns of a well-architected distributed system, they do not, on their own, increase the reliability and responsiveness of the system. Commitment to the “Right Speech”, on the other hand, can mean: avoiding services that are too small (the Second Mistake of the first part [1], about the wrong scope size of the microservices), performing a proper design (the Third Mistake of the first part [1], about the lack of proper design), striving for reasonable payloads, using bulk operations, considering the network topology (Fallacies 5. Topology doesn’t change and 8. The network is homogeneous), selecting the right approach for inter-service communication in each case, etc.
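As an illustration of one of these options, bulk operations, the sketch below counts simulated round trips for a chatty design and for a bulk design. The `CountingGenderService` class, its `get` and `get_many` methods, and the sample data are hypothetical stand-ins for a remote service, not a real API.

```python
class CountingGenderService:
    """Hypothetical stand-in for a remote service; counts simulated round trips."""

    def __init__(self):
        self._data = {1: "F", 2: "M", 3: "X"}
        self.calls = 0

    def get(self, gender_id):
        self.calls += 1  # one round trip per id
        return self._data[gender_id]

    def get_many(self, gender_ids):
        self.calls += 1  # one round trip for the whole batch
        return {gid: self._data[gid] for gid in set(gender_ids)}


customers = [("ana", 1), ("bob", 2), ("cy", 3), ("dee", 1)]

# Chatty design: one remote call per customer.
chatty = CountingGenderService()
chatty_view = [(name, chatty.get(gid)) for name, gid in customers]

# Bulk design: one remote call for all customers.
bulk = CountingGenderService()
lookup = bulk.get_many(gid for _, gid in customers)
bulk_view = [(name, lookup[gid]) for name, gid in customers]

print(chatty.calls, bulk.calls)
```

Both designs produce the same view, but the chatty one pays the network cost once per customer, whereas the bulk one pays it once per business function.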

There are claims that events and the related tools, e.g., a highly available broker, can handle all the needs of inter-service communication. Recall that events wear two hats: (1) a notification hat that triggers services into action, and (2) a replication hat that copies data from one service to another [7]. Although inter-service communication can be handled well by the latter hat (the replication one), idle chatter still represents a major challenge for the former hat, since it is deeply related to the concept of a business transaction (see the next mistake).
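The replication hat can be sketched as a local copy that is updated by events rather than queried remotely. The event names (`GenderUpserted`, `GenderRemoved`), the event fields, and the in-memory list that stands in for a broker are all assumptions made for this illustration.

```python
class LocalGenderReplica:
    """Local copy of another service's data, kept up to date by events."""

    def __init__(self):
        self._genders = {}
        self._last_offset = -1

    def apply(self, event):
        # Ignore events already applied, so a redelivery is harmless.
        if event["offset"] <= self._last_offset:
            return
        if event["type"] == "GenderUpserted":
            self._genders[event["id"]] = event["label"]
        elif event["type"] == "GenderRemoved":
            self._genders.pop(event["id"], None)
        self._last_offset = event["offset"]

    def label(self, gender_id):
        # A purely local read: no network call is involved.
        return self._genders.get(gender_id)


# Hypothetical event stream, including one duplicate delivery (offset 1).
stream = [
    {"offset": 0, "type": "GenderUpserted", "id": 1, "label": "F"},
    {"offset": 1, "type": "GenderUpserted", "id": 2, "label": "M"},
    {"offset": 1, "type": "GenderUpserted", "id": 2, "label": "M"},
    {"offset": 2, "type": "GenderRemoved", "id": 1},
]

replica = LocalGenderReplica()
for event in stream:
    replica.apply(event)

print(replica.label(2), replica.label(1))
```

Reads against the replica never cross the network, at the price of the data being only eventually consistent with its source.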

Second Mistake — The avoidance of “Right Resolve” or the addiction to ACID

A practitioner committed to the “Right Resolve” resolves to leave home, here in the sense of leaving the comfort zone. In the realm of microservices, a common mistake is the addiction to ACID (atomicity, consistency, isolation, durability), which is comfortable for business and technical professionals.

ACID is a set of properties of “state management” transactions intended to guarantee validity even in the event of errors, power failures, etc. Within the boundary of a given microservice, ACID is easily achieved by relying on “state management tools” such as classical SQL databases or NewSQL tools.

In a distributed system, preserving ACID across microservices would require distributed transactions, which is exactly what the addiction to ACID demands. In fact, “Much time has been wasted by organizations of all shapes and sizes attempting to preserve the notion of database-like transactions across independent services. It never works in the end, wasting time and money” [3]. Consequently, eventual consistency must be applied.

Eventually consistent microservices are said to provide BASE (Basically Available, Soft state, Eventual consistency), in contrast to ACID.

Furthermore, “in a system which cannot count on distributed transactions, the management of uncertainty must be implemented in the business logic” [4]. Therefore, the business logic must be designed/analyzed considering the complexities of state, which includes, at least, understanding the shape of the data and how it evolves during the business lifecycle as well as during the cooperation between services. Such management of uncertainty must be properly reflected in the business functionalities. In other words, business professionals are responsible for the consistency of the state.
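One possible way to make uncertainty explicit in the business logic is to model a tentative state that is later confirmed or cancelled, and to tolerate duplicate messages. The sketch below is only an illustration under those assumptions; the `Reservation` class, the message identifiers, and the `payment_ok` / `payment_failed` outcomes are hypothetical names.

```python
from enum import Enum


class Status(Enum):
    PENDING = "pending"      # outcome not yet known: the uncertainty is explicit
    CONFIRMED = "confirmed"
    CANCELLED = "cancelled"


class Reservation:
    """Business entity whose state admits that another service has not answered yet."""

    def __init__(self, reservation_id):
        self.id = reservation_id
        self.status = Status.PENDING
        self._seen = set()

    def handle(self, message_id, outcome):
        # Idempotence: a redelivered message must not change the state again.
        if message_id in self._seen:
            return self.status
        self._seen.add(message_id)
        # Only a pending reservation can be resolved, and only by a known outcome.
        if self.status is Status.PENDING:
            if outcome == "payment_ok":
                self.status = Status.CONFIRMED
            elif outcome == "payment_failed":
                self.status = Status.CANCELLED
        return self.status


reservation = Reservation("r-1")
print(reservation.status)                            # still pending
print(reservation.handle("m-1", "payment_ok"))       # resolved
print(reservation.handle("m-1", "payment_ok"))       # duplicate, no effect
print(reservation.handle("m-2", "payment_failed"))   # late message, ignored
```

What a pending reservation means for the user, and how long it may stay pending, are business decisions rather than technical ones.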

Third Mistake — The avoidance of “Right Action” or the simplification of security

A practitioner committed to the “Right Action” abstains from stealing. In the realm of microservices, a closely related mistake is to over-simplify the security concern (Fallacy 4. The network is secure).

In practice, there are “extreme viewpoints”. On one hand, some claim that security is provided by the infrastructure and, consequently, need not be a concern of the distributed system itself. On the other hand, some professionals argue that each communication requires an independent authorization, which is a consequence of the first mistake (the legitimation of idle chatter). Once again, the “Middle Way” should be the guide, based on one or more OAuth2 grant types.

Fourth Mistake — The avoidance of “Right Livelihood” or the lack of focus on business value

A practitioner committed to the “Right Livelihood” avoids causing suffering to sentient beings in any way. In the realm of microservices, a common mistake is the lack of focus on business value, which clearly can lead to suffering for the business users (usually, the most sentient being in the value chain).

In the first mistake of this second part, the compromise of reliability was explored with regard to Fallacy 1. The network is reliable, among others. The second mistake can also compromise reliability, since there can be design or implementation flaws in the “management of uncertainty of state”, inherently present in a distributed system. Moreover, in order to handle Fallacy 5. Topology doesn’t change, one must apply some sort of highly available service discovery (e.g., Kubernetes Service Discovery, Eureka, etc.), which in turn is subject to the same concern, namely the “management of uncertainty of state”.

In addition to reliability, a major cause of suffering is the lack of emphasis on scalability concerns. As microservices are stateless, they are well suited for scaling, whereas “state management tools” represent challenges [5], once again, due to the “management of uncertainty of state”. Nonetheless, even the scalability of microservices can be easily compromised by a naive analysis. For example, in a real project, the default sizing in the supplier’s documentation [6] limits the number of pods per core (PodsPerCore) to 10 (ten). From this, one architect concluded that each pod must be limited to 100 millicores, which compromises scalability as well as responsiveness.

Finally, the responsiveness of the resulting distributed system is the combination of all the concerns previously explored and others, such as resource availability.
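The arithmetic behind that conclusion can be sketched as follows, assuming (as the text describes) that PodsPerCore only caps how many pods a node may host. The 4-core node, the 250 ms of CPU work per request, and the simplified throttling model are illustrative assumptions.

```python
MILLICORES_PER_CORE = 1000


def max_pods_on_node(pods_per_core, cores):
    """What the sizing setting expresses: a ceiling on the number of pods per node."""
    return pods_per_core * cores


def naive_cpu_limit_millicores(pods_per_core):
    """The architect's reading: split every core evenly among the allowed pods."""
    return MILLICORES_PER_CORE // pods_per_core


def min_wall_clock_ms(cpu_work_ms, limit_millicores):
    """Simplified model (assumption): a pod capped at L millicores gets at most
    L/1000 of a core over time, so CPU-bound work is stretched accordingly."""
    return cpu_work_ms * MILLICORES_PER_CORE / limit_millicores


print(max_pods_on_node(10, cores=4))     # ceiling on pod count for a 4-core node
print(naive_cpu_limit_millicores(10))    # per-pod limit derived by the naive reading
print(min_wall_clock_ms(250, 100))       # 250 ms of CPU work under that limit
print(min_wall_clock_ms(250, 1000))      # the same work with a full core available
```

Under this simplified model, a request that needs 250 ms of CPU takes at least 2.5 seconds in a pod capped at 100 millicores, even when the node is otherwise idle.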

Fifth Mistake — The avoidance of “Right Mindfulness” or the lack of observability

A practitioner committed to the “Right Mindfulness” is conscious of what he is doing. A distributed system “will always be more complex to implement and run than a simple, single-process application designed to perform the same logic” [7]. Thus, in the realm of microservices, a common mistake is the lack of observability or, in other words, the lack of awareness of what the distributed system is doing.

It is common to hear from architects, in the real world, that observability is desired but not a requirement, in the sense that the operations area has been working for years and perhaps decentralized logs provide all the information that one needs. However, as stated earlier, microservices require awareness of what the distributed system is doing, which is only achieved through a set of tools that provides observability.

In particular, observability is delivered by “service mesh” tools (e.g., Istio, Linkerd, etc.) working together with “distributed tracing tools” (e.g., Jaeger, Zipkin, etc.), which have become prevalent in mature microservices landscapes. Such tools are the place for (1) system health monitoring, (2) troubleshooting issues, (3) enforcing inter-service communication policies such as timeouts, automatic retries, circuit breakers, caching, throttling, etc., (4) reliability of production environments and (5) the evaluation of alternate versions in the production environment (e.g., blue/green deployments, A/B testing, etc.). They are the central places for monitoring, tracing and controlling inter-service communication: how services are connected, how they perform, and how they are secured.
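The core idea behind distributed tracing, a trace identifier propagated along the whole call chain, can be sketched without any tooling. This is a toy illustration only: the three functions simulate services in one process, the `spans` list stands in for a tracing backend, and the header names `trace-id` and `parent-span` are hypothetical (real tracers define their own header formats).

```python
import uuid

spans = []  # stand-in for a tracing backend


def record(trace_id, service, operation, parent=None):
    """Store one span and return its id so callees can point back to it."""
    span_id = uuid.uuid4().hex[:8]
    spans.append({"trace_id": trace_id, "span_id": span_id, "parent": parent,
                  "service": service, "operation": operation})
    return span_id


def genders_service(headers):
    record(headers["trace-id"], "genders", "GET /genders",
           parent=headers["parent-span"])
    return ["F", "M", "X"]


def customer_service(headers):
    span = record(headers["trace-id"], "customers", "GET /customers/42",
                  parent=headers["parent-span"])
    # The same trace id travels with the outgoing call.
    genders_service({"trace-id": headers["trace-id"], "parent-span": span})
    return {"id": 42}


def gateway():
    trace_id = uuid.uuid4().hex  # created once, at the edge
    span = record(trace_id, "gateway", "GET /profile/42")
    customer_service({"trace-id": trace_id, "parent-span": span})
    return trace_id


def call_path(trace_id):
    """Rebuild which services took part in one request, in call order."""
    return [s["service"] for s in spans if s["trace_id"] == trace_id]


trace_id = gateway()
print(call_path(trace_id))
```

With the identifier in place, a failure in the genders’ service can be tied back to the business request that triggered it instead of being hunted through decentralized logs.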

Conclusion

To wrap up, consider a scenario in which a combination of some of these common mistakes is in place. In such a possible scenario (indeed, a real-world one), a REST call to the genders’ service is used by multiple business transactions, whose responsiveness and reliability therefore depend on those of the genders’ service. Moreover, failures in calling the genders’ service propagate directly to the business transactions, increasing temporary state inconsistency (assuming the designed distributed system exhibits no flaws in its eventual consistency, which is business defined) and causing a global failure. In the presence of global failures, the business value of the distributed system is compromised and, consequently, some sort of suffering is imposed on the most sentient being in the value chain, the business user. Finally, as there is no concern regarding observability, it is time-consuming to trace such global failures back to the problem in calling the genders’ service, which worsens the suffering of business users. Such a disastrous scenario can be made even worse by those architects who actively work on “the legitimation of idle chatter”, for example, by suggesting a “warmed cache” for the genders’ service calls.

This second and final part of the series “Microservice: Everyone makes mistakes” presented the five most common and general mistakes made when applying the microservice pattern with regard to architectural concerns. Perhaps, for wise people, they are a shortcut to “learn from their mistakes” using five elements of the Eightfold Path.

References

[1] Microservice: Everyone Makes Mistakes — Part I

[2] Fallacies of Distributed Computing Explained

[3] Microservices Done Right, Part 1: Avoid the Antipatterns!

[4] Life Beyond Distributed Transactions, Pat Helland, 2007

[5] High-availability resides in the Backing Services

[6] Planning OpenShift — Sizing

[7] Designing Event-Driven Systems — Concepts and Patterns for Streaming Services with Apache Kafka, Ben Stopford, 2018

Written by Alessandro Gerlinger Romero