Is Your Organization Ready for a Service Mesh? (Service Mesh Part 3)
By now, we hope you’ve had the chance to read our Service Mesh blog series. In our previous posts, we shared how Service Meshes has the potential to cure many of your microservice pains by enabling canary deployments, ubiquitous monitoring, circuit breaking, and a bunch of other useful features. In fact, it can be tempting to put a few lower priority efforts on hold just to roll out a proof of concept in a non-prod cluster. Perhaps you have a new greenfield project that is begging for a service mesh to solve all the communication problems you had in the legacy system, and the timely rollout of a service mesh can kickstart the next wave of new features that your organization desperately needs.
Despite being one of the most powerful concepts to come out of the Cloud Native space, service meshes should be adopted with care. They are an operational subsystem that can take down your entire data plane (read: cause a catastrophic outage of your entire system) if not managed correctly. Furthermore, the location of Service Meshes on the data plane makes them tempting attack vectors and therefore high-maintenance in terms of keeping up-to-date with the latest releases and security patches. The purpose of this post is to provide simple criteria to help decide if your organization is prepared for the operational overhead of moving to a service mesh.
Across the OpsGuru team, we have assisted several organizations in adopting service mesh technology. Over the course of these projects, we have identified 3 DevOps capabilities that are essential to the success of service mesh adoption:
Continuous Deployment of Infrastructure
Independently Testable Microservices
Well Defined SLOs, SLAs, and SLIs
In the sections below, we'll elaborate on these capabilities and why they are critical to adopting a service mesh.
Continuous Deployment of Infrastructure
CI/CD is a well-known cornerstone of DevOps practices when it comes to application code. However, most organizations still struggle with introducing DevOps practices to their infrastructure. Infrequent changes, inherent statefulness, and the high cost of disruption are all factors that have stunted the maturity of infrastructure change automation in most organizations. Even when the components change infrequently or have a limited impact on other components, slowly but surely, deployed environments succumb to configuration drift, and automated processes are shelved and tagged for "refactoring" at some later date. Even when the drift is sub-optimal and error-prone, the drift is generally still manageable - that is till Service Mesh is introduced.
The nature of a service mesh means that it can be:
an attack vector for each microservice
an attack vector for your control plane
a point of failure for each microservice
a single point of failure for your entire system (within a failure domain)
Furthermore, given the newness of the space and competitive market, all service mesh products are undergoing aggressive development. In practice, this means you can expect a fast release cadence and high frequency of critical patches (due to CVE disclosures as well as regressions) from whichever service mesh you adopt.
Therefore, before adopting a service mesh. Your teams must have the ability to:
Rapidly deploy a new version of the mesh to a non-production environment
Adequately test the mesh for regressions across all microservices
Upgrade production environments to a new version of the mesh without unplanned downtime
Note, there is no one-size-fits-all approach for managing these deployments. It can be achieved via in-place upgrades, blue-green deployments, or phased rollouts. What matters is that the processes are fully automated, reproducible, reliable, and testable.
Full Stack Visibility
While a deficiency in infrastructure automation, like upstream vulnerabilities, can bring productivity to a halt, inadequate visibility can magnify simple application bugs into release-blocking time sinks. In this sense, visibility goes beyond simply setting up your favourite logging and monitoring systems. It also encompasses application and infrastructure design practices that enable a targeted analysis of each microservice's behaviour independent of the service mesh and underlying cloud infrastructure.
Obviously, proper visibility is a benefit even without a service mesh. Modern applications are typically built with some combination of managed cloud services, third-party services, and proprietary microservices. Troubleshooting a misbehaving transaction requires identifying which subset of components was involved, reasoning about the role each one played, and then hypothesizing and testing until a root cause is identified. Without adequate visibility, this process is challenging at the best of times. However, adding a service mesh into the mix compounds required effort due to its relatively unique position in the stack. Few, if any, third party components are more closely integrated with application processes than service mesh proxies. While there are real benefits to encapsulating communication logic in a local proxy, the practice makes it hard to distinguish between application behaviour and proxy behaviour, especially when it comes to troubleshooting performance degradation and communication errors.
One of the confounding factors with visibility is that most service meshes promise rich monitoring and tracing functionality out of the box. This often lulls teams into believing they can ignore all visibility concerns and let the service mesh take care of it. The unavoidable truth, however, is that the service mesh provides only one perspective on system health - as seen from the local proxies. For example, if the mesh control plane or proxies are disrupted, latency metrics may become unavailable for some workloads, or worse, they may become skewed due to time spent within proxy buffers, making them irrelevant to the application state. While this still provides sufficient signal to determine overall system health, it increases noise in root cause analysis. In other words, a service mesh does significantly increase the observability of your system, but it also adds a new failure mode. (It often requires external tooling to efficiently diagnose and debug. This particular concern varies quite a bit by implementation. Linkerd, for example; provides excellent tooling for visibility out of the box - but the general case still stands.)
In order to prevent your teams from losing all momentum in troubleshooting integration issues, it is critical that they have logs and metrics, including the Four Golden Signals (Latency, Traffic, Errors and Saturation) readily available for each:
microservice workload instance (excluding the local proxy)
service mesh proxy instance
service mesh control plane component
The availability of similar signals for cloud and third-party components goes without saying, but is typically provided out of the box and therefore less likely to be overlooked. These signals should be exposed via a system that is external and independent of all service mesh components.
Well Defined SLOs (Service Level Objectives)
Full-stack visibility provides the required insight into how each component is behaving but does not provide insight into how a microservice, or system as a whole, should behave. Defining desired behaviour is key to controlling the amount of effort as well as the implementation costs of any system. As soon as a service mesh is introduced, these controls quickly become essential as the complexity of a system increases.
SLOs are an efficient and measurable mechanism for describing desired behaviour. Typically they focus on the availability of a component, which includes some upper bound on latency. For example: "Microservice A will return a non-5XX status code in under 200ms, 99.95% of the time". The process of defining SLOs is beyond the scope of this post but once again, Google's SRE book is a great place to start.
Adopting a service mesh increases the necessity of SLOs for several reasons:
Guides the judicious use of service mesh features
Prevents unnecessary engineering costs due to over-tuning
Provides testable contracts between microservices
Encourages cost-benefit discussions around service mesh features
First and foremost, adopting a service mesh means adopting an abundance of mechanisms for controlling the performance and availability of your microservices and system as a whole. Without a fixed objective, performance-minded teams can easily fall into the trap of over-utilizing features like circuit breakers and rate limiters. This increased demand adds new failure modes and complexity to the system and drives up the overall effort to stabilize and maintain it. On the other hand, SLOs also help identify which service mesh features should be seen as a necessary complexity for achieving desired behaviour.
Well-defined SLOs also simplify the development of new services that depend on existing microservices. The contracts represented by those SLOs can be used to reason about the maximum achievable performance of the new service and, in turn, the SLOs it can support. Furthermore, the act of defining SLOs themselves drives conversations around the costs associated with achieving them. Costs will always increase super-linearly with improvements to availability and latency. These costs apply to underlying infrastructure as well as to service mesh components and features. Oftentimes, stateful features (like rate limiting) are much more costly in terms of compute resources than simple traffic management features. At any significant scale, understanding these costs in relation to the behaviour they support is critical.
Service meshes are extremely helpful solutions to very real problems; but, like most technologies, they have their trade-offs. For organizations that haven't reached a certain level of operational maturity, those trade-offs could be catastrophic. In this post, we've laid out key criteria you can use to evaluate your team's operational maturity. If your teams already have the ability to safely deploy infrastructure to live systems, quickly collect key metrics when troubleshooting, and understand the behaviour of each microservice, introducing a service mesh will not be a win. On the other hand, if they fall short in one or more of those areas, short term efforts are probably better spent improving operational maturity. Wherever you land, hopefully, this post helps save some time and stress during the evaluation process.