
COPYRIGHT © OPSGURU 2019

Long-term success on the cloud requires sound operational capabilities to manage constantly changing environments. That’s why Day2Ops is the core principle that drives OpsGuru’s projects. OpsGuru considers CI/CD, security and observability key tenets of our deliverables rather than secondary add-ons. Our cloud solutions are only complete when the team is operationally ready.

Day2Ops is closely related to the concept of site reliability engineering. Google has published two books on the site reliability engineering philosophy, both of which OpsGuru highly recommends. Day2Ops reframes operations from reactive issue fixing to proactive, continuous system enhancement shaped by metrics such as service level objectives (SLOs). It gives operators the agency to influence the efficiency of their systems and to engineer tools that continuously enhance delivery and operational processes, which ultimately improves their job satisfaction. It is a virtuous feedback loop: the output of the system and the end-user experience act as inputs to engineers’ efforts to enhance service reliability. This approach results in better tooling and processes, which lead to better operational metrics, which in turn propel further business success.
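As a rough illustration of how an SLO can steer operational work, the sketch below computes how much of an error budget is left for an availability target. The function name, the 99.9% target and the request counts are hypothetical examples, not figures or tooling from OpsGuru.

```python
def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Return the unspent fraction of the error budget (negative if overspent)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        return 1.0  # no traffic yet, so none of the budget has been spent
    # The budget is the number of failures the SLO tolerates for this traffic.
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


# Hypothetical month: 1,000,000 requests, 250 failures, 99.9% availability SLO.
remaining = error_budget_remaining(0.999, 1_000_000, 250)
print(f"error budget remaining: {remaining:.0%}")
```

A team might treat a shrinking budget as a signal to prioritise reliability work over new features, which is one way the feedback loop described above can be made concrete.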

Operating on the cloud has been a significant challenge for many companies, especially for teams that are new to cloud-native computing. Part of the challenge is unfamiliarity: it takes time to adopt new resource types and tooling running on a completely new system. Part of the challenge is how workloads are defined in cloud-native design: in the transition from legacy workloads to cloud-native computing, the system becomes a heterogeneous set of services, each with a particular purpose and its own unique properties. Managing workloads distributed across such drastically different services is very different from managing a number of bare-metal servers or virtual machines, so it takes new ways of thinking to manage operations successfully. Day2Ops focuses on equipping teams for cloud-native computing, so that they are ready not just for the resources but also for the end-to-end distributed workloads.

Cloud-native Adoption can be Challenging

Observability is the ability to infer the internal state of a system from its external outputs. It is the real-time insight into various data used to assess the state of a system. The insights are captured from four aspects:

 

  • Monitoring

  • Logging

  • Tracing 

  • Analytics 

At OpsGuru, we often deploy Prometheus, Fluentd and OpenTelemetry to support the various observability needs. We also endeavour to support our clients in the tools they have already chosen, because we understand that tooling decisions are not quick to change or reverse, and operational capabilities cannot always wait for lengthy deliberation over switching to or adopting new tools.

Observability

Logging in observability mostly refers to application logging. The main challenges are aggregating application logs and configuring the stream-processing mechanism so that streaming logs can still be processed even in extreme cases of errors and unexpectedly high demand. Storing the streamed messages for replay and diagnostics, accessible only to the people who should be granted access, is part of the challenge too.
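One way to keep a log pipeline alive during a burst is to bound the buffer and make any loss visible. The sketch below is a minimal illustration of that idea using a drop-oldest policy; the class name and the policy are assumptions for this example and do not describe how Fluentd or any OpsGuru deployment handles back-pressure.

```python
from collections import deque
from typing import List


class BoundedLogBuffer:
    """Keep the most recent log lines; evict the oldest when full and count the evictions."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._lines: deque = deque(maxlen=capacity)
        self.dropped = 0  # exposed so the loss itself can be monitored

    def append(self, line: str) -> None:
        if len(self._lines) == self._lines.maxlen:
            self.dropped += 1  # the deque is about to evict its oldest entry
        self._lines.append(line)

    def drain(self) -> List[str]:
        """Return the buffered lines in arrival order and empty the buffer."""
        lines = list(self._lines)
        self._lines.clear()
        return lines


# Hypothetical burst: five lines arrive while the buffer only holds three.
buffer = BoundedLogBuffer(capacity=3)
for n in range(5):
    buffer.append(f"error {n}")
kept = buffer.drain()
```

Dropping the oldest lines favours recent context during an incident; a pipeline that must not lose data would instead block the producer or spill to disk.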

Logging

Analytics may seem out of place in observability, but it plays a critical role in the system. Monitoring, logging and tracing produce time-based data on different aspects of a system, which together construct a comprehensive view of its state, while analytics performs the follow-up work of deriving meaning from that data. Analytics is responsible for uncovering the real value of the data, making it not only measurable but also actionable. It is no accident that AIOps has been a popular moniker in recent years; many application performance monitoring solutions attempt to extend the value and usage of their products by including “AI” capabilities in them to reduce the mean time to resolution (MTTR). These products use machine learning to detect outlier data and draw on previous actions to identify applicable future actions, so that anomalies can be resolved automatically. This is only possible once the capability to analyse the signals collected in the system has been built.
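To make "discerning outlier data" concrete, here is a minimal sketch that flags outliers with a plain z-score. This is a simple statistical stand-in, not machine learning and not the method of any particular AIOps product; the function name, threshold and latency values are hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def find_outliers(samples: List[float], threshold: float) -> List[int]:
    """Return indices of samples more than `threshold` standard deviations from the mean."""
    if len(samples) < 2:
        return []
    mu = mean(samples)
    sigma = pstdev(samples)  # population standard deviation
    if sigma == 0:
        return []  # all samples identical, nothing stands out
    return [i for i, x in enumerate(samples) if abs(x - mu) / sigma > threshold]


# Hypothetical request latencies in milliseconds, with one obvious spike.
latencies_ms = [102, 98, 101, 99, 100, 103, 97, 100, 950, 101]
outliers = find_outliers(latencies_ms, threshold=2.5)
```

In a small sample a single extreme value also inflates the standard deviation, so the threshold has to be chosen with the sample size in mind; robust measures such as the median absolute deviation are a common alternative.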

Analytics

DevOps, as defined within OpsGuru, is the philosophy of collaboration between development and operations from the inception of a product or service. It avoids silos and the tedious, unnecessary handoffs between Dev and Ops that slow down the pace of delivery and time to market and ultimately hurt the bottom line of the business. Day2Ops is the preparedness that readies the team to operate the services once they are delivered to production. There is a clear feedback loop in which the internal state of the system is captured in proper metrics, logs and traces, which are then used as requirements for tool and process improvements, resulting in long-term continuous improvement. With Day2Ops, operational objectives are clear, and near-real-time automatic alerts are issued when thresholds are reached (while noisy signals are rejected). The team is equipped to interpret the different signals to troubleshoot incidents and is ready to resolve application, infrastructure and ecosystem issues.
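The idea of alerting on thresholds while rejecting noisy signals can be sketched as follows: an alert fires only when a metric stays at or above its threshold for several consecutive samples. The function name, the 90% threshold and the sample values are assumptions made for this illustration.

```python
from typing import Iterable, List


def alert_points(values: Iterable[float], threshold: float, sustain: int) -> List[int]:
    """Return the sample indices at which an alert fires.

    An alert fires once when `sustain` consecutive samples are at or above
    `threshold`, and can fire again only after a sample drops below it.
    """
    if sustain < 1:
        raise ValueError("sustain must be at least 1")
    fired: List[int] = []
    streak = 0
    for i, value in enumerate(values):
        if value >= threshold:
            streak += 1
            if streak == sustain:
                fired.append(i)
        else:
            streak = 0  # a single healthy sample resets the count
    return fired


# Hypothetical CPU readings: the lone spike at index 1 is treated as noise.
cpu_percent = [40, 95, 42, 91, 93, 96, 97, 50, 92, 94, 99]
alerts = alert_points(cpu_percent, threshold=90, sustain=3)
```

Requiring a sustained breach trades a little detection latency for far fewer false alarms, which is usually the right trade for on-call teams.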

How does Day2Ops differ from DevOps?

Day2Ops is becoming increasingly popular, so much so that in 2019 Mesosphere renamed itself D2iQ in pursuit of the Day2Ops ideal. While a number of products list Day2Ops among their features, OpsGuru offers Day2Ops as the core deliverable of an engagement. Ultimately, our goal is to ensure that our clients are able to operate on Day 2, when the rubber meets the road and the product is running in production.


There are three fundamental features that we always consider in providing a cloud-native solution. 

 

  • Observability 

  • CI/CD 

  • Security Guardrails

How do we implement Day2Ops at OpsGuru to solve the Cloud-native Challenge?

Monitoring refers to the numerical or boolean metrics collected as telemetry about the health of a system. One challenge often faced with cloud infrastructure is having too many metrics but too little understanding. Many services provide default metrics (e.g. the CPU reading of a virtual machine), and there are many examples of application metrics that should be collected (e.g. requests per second, payload size). Making sense of all these metrics, and then determining whether additional metrics should be collected to gain a holistic understanding of the system, remains a difficult problem.
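The two application metrics mentioned above, requests per second and payload size, can be derived from raw request events. The sketch below shows one minimal way to do so; the `RequestEvent` type, field names and sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class RequestEvent:
    timestamp: float    # seconds, relative to the start of the window
    payload_bytes: int


def summarize(events: List[RequestEvent], window_seconds: float) -> Dict[str, float]:
    """Reduce the raw events observed in one window to two application metrics."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    if not events:
        return {"requests_per_second": 0.0, "mean_payload_bytes": 0.0}
    return {
        "requests_per_second": len(events) / window_seconds,
        "mean_payload_bytes": sum(e.payload_bytes for e in events) / len(events),
    }


# Hypothetical two-second window containing six requests.
events = [RequestEvent(0.3 * i, 100 * (i + 1)) for i in range(6)]
metrics = summarize(events, window_seconds=2.0)
```

Reducing raw events to a handful of named metrics per window is what keeps the volume manageable; deciding which metrics deserve a name is the harder question the paragraph describes.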

Monitoring

Tracing is an important tool, especially in distributed computing. Since a workload involves a multitude of services, being able to diagnose where the bottlenecks are and to understand the interactions between the various components of the workload is critical to resolving performance issues and overall system behavioural issues. We often see pushback on cloud-native computing because, when things break, technicians have trouble understanding how to operate the system and diagnose problems. Arguably, tracing will take on an even more important role as systems get more complex and the number of microservices keeps growing.
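To show how trace data helps locate a bottleneck, the sketch below models a trace as a list of spans and finds the service with the largest self time, meaning a span's duration minus the time spent in its direct children. It assumes child spans run one after another rather than in parallel; the `Span` type and the service names are hypothetical and far simpler than what a real tracing system records.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    service: str
    duration_ms: float


def self_times(spans: List[Span]) -> Dict[str, float]:
    """Map each span_id to its duration minus the total duration of its direct children."""
    child_total: Dict[str, float] = {}
    for span in spans:
        if span.parent_id is not None:
            child_total[span.parent_id] = child_total.get(span.parent_id, 0.0) + span.duration_ms
    return {s.span_id: s.duration_ms - child_total.get(s.span_id, 0.0) for s in spans}


def slowest_service(spans: List[Span]) -> str:
    """Return the service whose span has the largest self time."""
    times = self_times(spans)
    return max(spans, key=lambda s: times[s.span_id]).service


# Hypothetical trace of one request passing through five services.
trace = [
    Span("a", None, "gateway", 500.0),
    Span("b", "a", "orders", 450.0),
    Span("c", "b", "inventory", 380.0),
    Span("d", "c", "database", 40.0),
    Span("e", "b", "payments", 30.0),
]
bottleneck = slowest_service(trace)
```

The root span is the longest overall, but most of its time is spent waiting on downstream calls; looking at self time points to the service where the time is actually consumed.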

Tracing

CI/CD, generally known as continuous integration / continuous delivery, is both a process and a communication framework.

 

In a stripped-down sense, Continuous Integration is the automation that transforms source code into installable packages, while Continuous Delivery is the mechanism that deploys the installable packages into the various similarly-configured environments (e.g. development, staging and production) and runs tests on the corresponding environment to determine the quality of the packages. 

 

While automation is a key goal of CI/CD, so that fast and ad hoc delivery can be done in a methodical fashion, CI/CD must also encode the values the company cares about. For a healthcare provider, where security verification and component certification are vital to the business, a CI/CD automation that does not support security testing is nearly useless because it does not align with the company’s key check gates.
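The notion of check gates can be sketched as an ordered list of checks in which a failure stops everything after it. The example below is a toy model, not the API of Jenkins, CircleCI or Spinnaker; the gate names and the lambda checks stand in for real build, test and scan steps.

```python
from typing import Callable, List, Tuple

# A gate is a name plus a check that returns True when the gate passes.
Gate = Tuple[str, Callable[[], bool]]


def run_pipeline(gates: List[Gate]) -> Tuple[bool, List[str]]:
    """Run gates in order and stop at the first failure.

    Returns whether every gate passed, plus the names of the gates that ran.
    """
    executed: List[str] = []
    for name, check in gates:
        executed.append(name)
        if not check():
            return False, executed
    return True, executed


# Hypothetical pipeline in which the security scan finds a problem.
gates: List[Gate] = [
    ("build", lambda: True),
    ("unit-tests", lambda: True),
    ("security-scan", lambda: False),
    ("deploy-staging", lambda: True),
]
passed, executed = run_pipeline(gates)
```

Placing the security scan ahead of deployment means a failing scan blocks the release automatically, which is how a pipeline encodes a company's priorities instead of relying on manual review.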
 

A successful CI/CD process must be customized for the team and the company at large to reinforce the qualities that the company cares about, while following best practices in CI/CD pipeline design to ensure shared understanding and communication across different teams. A good CI/CD process forms a common language and puts forward the core values of the systems.

 

OpsGuru is well-versed in a number of tools that support CI/CD, including Jenkins, CircleCI and Spinnaker. We are contributors to Jenkins plugins. A number of CI/CD experts within OpsGuru have worked across different industries and can help you design a CI/CD automation system and streamline a process that works for your team.

CI/CD

Security is a major concern for any deployment, particularly for deployments on the cloud. This is not because the cloud is by any means insecure, but because, under the shared responsibility model, users are not always clear where the boundary of responsibility lies. The cloud can be secure, and can be even more secure than on-premises infrastructure, but this requires an understanding of the security perimeters and of best-practice designs that minimize attack surfaces.

 

Minimizing attack surfaces aside, it takes vigilant monitoring and actionable alerts to ensure prompt responses to threats. Security work is very similar in nature to system operational tasks. Because attacks evolve very quickly and the stakes for not maintaining the best security measures are very high, incorporating security guardrails, monitoring and alerts is essential to any cloud operation.

 

At OpsGuru we design with best-practice security in mind and operate with continuous security monitoring. Our solutions include constant tracking of access requests and their issuers so that anomalies are easy to detect. Keeping patches up to date and staying alert to vulnerabilities are also part of our normal flow. OpsGuru advocates continuous validation and detection, so that security is embedded in the design, implementation and daily operations of the solution.
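Tracking access requests by issuer can be illustrated with a very small rule-based check: flag any issuer that is not on a known list, and any known issuer whose request count exceeds a limit. This is a simplified sketch under those two assumed rules, not a description of OpsGuru's detection logic; the issuer names, resources and limit are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple


def flag_issuers(
    requests: Iterable[Tuple[str, str]],  # (issuer, resource) pairs
    known_issuers: Set[str],
    max_requests: int,
) -> Dict[str, str]:
    """Return a reason for every issuer whose access pattern looks anomalous."""
    counts = Counter(issuer for issuer, _resource in requests)
    flagged: Dict[str, str] = {}
    for issuer, count in counts.items():
        if issuer not in known_issuers:
            flagged[issuer] = "unknown issuer"
        elif count > max_requests:
            flagged[issuer] = "request volume above limit"
    return flagged


# Hypothetical access log for one monitoring interval.
access_log = [
    ("alice", "db"), ("alice", "db"), ("bob", "storage"),
    ("mallory", "secrets"), ("alice", "db"), ("alice", "queue"),
]
flagged = flag_issuers(access_log, known_issuers={"alice", "bob"}, max_requests=3)
```

Fixed rules like these are easy to reason about and audit; they would typically be a first layer, with baselines learned from historical behaviour added later.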

Security Guardrails