

CLOUD ENGINEERING

OpsGuru is founded on the cloud. Not just running on the cloud, but designing and operating with a set of cloud-native best practices that address availability, reliability, scalability, and efficiency, regardless of which vendor you deploy your workloads on. We can help you embed these principles in your workloads so that you can truly capitalize on the cloud.


There are many advantages to running workloads on the cloud, such as (near-)limitless computing and storage with almost no upfront costs, and reduced operational overhead because the underlying physical infrastructure is managed by the cloud provider. However, running workloads on the cloud also comes with its challenges, such as the increased possibility of unplanned downtime (since the physical infrastructure and hypervisor are managed by the cloud provider) and more complex security requirements. Frequent news of ransomware attacks, data theft, security breaches and outages of critical systems that have cost millions of dollars serves as a sober reminder of the importance of robust design, implementation and operations, along with the necessity of fault and anomaly detection and risk-mitigation plans.

Cloud providers and industry experts have published many lists of best practices, but it takes considerable work to understand the drivers behind them, to prioritize the items on these evolving lists and to translate the recommendations into technical implementations rolled out across systems. With years of experience helping clients across industries understand and implement such best practices, OpsGuru is a reliable partner that can work with your organization to apply them in a safe, secure, effective and cost-efficient manner.

Cloud-Native Best Practices


In the last decade, the cloud has progressed from a hosting platform for start-ups to the de-facto standard. The pace of migration keeps accelerating, so much so that 83% of enterprise workloads are estimated to run on the cloud by 2020. In the early days, businesses put all their workloads onto a single cloud for ease and simplicity, but as the threat of vendor lock-in becomes increasingly tangible, the risk of a one-vendor solution continues to grow. Meanwhile, duplicating workloads across multiple clouds -- on top of the existing workloads in data centres -- is a hefty cost and a massive endeavour, so complex that success is difficult to achieve.


At OpsGuru we support and implement multi-/hybrid-cloud solutions. We help clients identify the best place to run each part of a workload and set up interconnectivity between the different components to create an overarching system. For heavy on-site computing, it is perfectly reasonable to process the data locally in the data centers before uploading the cleansed data to the cloud, where batch analytics or model training takes advantage of near-limitless computing power and storage. If the majority of your teams are already working on an incumbent cloud, forcing them to move all workloads to another cloud because of short-term discounts can be counter-productive and operationally costly. For OpsGuru, the multi-/hybrid-cloud paradigm is to “run the right workload in the right place”.


With the advanced software-defined networking available today, connectivity from one cloud to another and back to the data center is no longer an insurmountable challenge. The rapid rise of service mesh technologies such as Istio over the past few years is yet another signal of strong interest in and demand for secure, predictable networking that transcends underlying systems and vendors without rewriting applications and orchestration end-to-end. It has been observed that traditional enterprises find cloud adoption far more palatable with a robust service-routing layer than by rewriting and containerizing every application.

Multi/Hybrid Cloud Adoption


The cloud is known for agility: users can scale workloads in and out quickly. However, agility is only possible with automation. If the infrastructure is provisioned manually by a technician following a step-by-step playbook, the process will be slow, tedious and error-prone.


Automated infrastructure orchestration is only possible once the infrastructure is codified. Codified infrastructure makes provisioning dramatically faster, and with that increase in speed the infrastructure can grow and shrink according to real-time needs, realizing the promise of the cloud as on-demand infrastructure.
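
As a concrete illustration, the sketch below (in Python, assuming an AWS account and the boto3 SDK) provisions a versioned storage bucket from a declarative template that can live in version control next to the application code. The stack name, resource and region are hypothetical placeholders, and the same principle applies to any provider or tooling.

    import json

    import boto3

    # Hypothetical example: the entire infrastructure definition is data that can
    # be reviewed, versioned and tested like any other code.
    TEMPLATE = {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Resources": {
            "ArtifactBucket": {
                "Type": "AWS::S3::Bucket",
                "Properties": {"VersioningConfiguration": {"Status": "Enabled"}},
            }
        },
    }

    def provision(stack_name: str = "example-artifact-store") -> str:
        """Create a CloudFormation stack from the codified template and wait for it."""
        cfn = boto3.client("cloudformation", region_name="us-west-2")  # region is illustrative
        stack = cfn.create_stack(StackName=stack_name, TemplateBody=json.dumps(TEMPLATE))
        cfn.get_waiter("stack_create_complete").wait(StackName=stack_name)
        return stack["StackId"]

    if __name__ == "__main__":
        print(provision())

Because the template itself is the source of truth, the same definition can be reviewed, tested and re-applied across environments instead of being rebuilt by hand.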

Beyond speed, risk reduction is another driver for infrastructure as code. Back in the days of shrink-wrapped products, the application was what engineers focused on: continuous integration (and later continuous integration/continuous delivery) and automated testing ensured that products were built, shipped and validated through a streamlined process. Now that shrink-wrapped products have been replaced by software as a service, the definition of a product has also changed. The underlying infrastructure on which the services run is now a key part of the product, because unavailability or malfunctioning of the infrastructure directly impacts customer experience and satisfaction. The same rigor and discipline used to manage software delivery has therefore been extended to the infrastructure, since both play an equally important role in the customer story. Automated software delivery has long been standard practice to ensure repeatability, testing and traceability; infrastructure as code and automated infrastructure testing are the means to apply the same principles to infrastructure.


Beyond repeatable and speedy execution, infrastructure as code is also an effective tool for spreading knowledge of the infrastructure across the organization. Traditionally, infrastructure and application teams have had completely segregated concerns. However, as time-to-market keeps shrinking, hand-offs of knowledge from team to team are no longer sufficient to ensure smooth operations. While documentation of the different components of a product stack helps, reading codified instructions built on well-known technology that capture the features of the infrastructure is often a more efficient way to gain understanding than wading through exhaustive documentation or relying on a support team.

Infrastructure as Code

Continuous Integration/Continuous Delivery (CI/CD) is the process that enables speedy, reliable, repeatable product delivery. Embedded within the process are frequent source-code commits and automated testing, enabling early discovery of defects and their efficient resolution. While CI/CD is most often discussed in the context of software delivery, it is analogous to the streamlined manufacturing processes popularized by “the Toyota way.”
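
A minimal sketch of what such a pipeline can look like when expressed as code is shown below. The commands, image tag and deploy script are hypothetical placeholders; in practice the same stages are usually declared in a CI system rather than a hand-rolled script.

    import subprocess
    import sys

    def run(stage: str, command: list[str]) -> None:
        """Run one pipeline stage and stop the pipeline on the first failure."""
        print(f"--- {stage} ---")
        if subprocess.run(command).returncode != 0:
            sys.exit(f"{stage} failed; stopping the pipeline early.")

    if __name__ == "__main__":
        # Every commit triggers the same small, repeatable sequence.
        run("unit tests", ["pytest", "-q"])                              # fast feedback first
        run("build image", ["docker", "build", "-t", "app:ci", "."])     # hypothetical image tag
        run("integration tests", ["pytest", "-q", "tests/integration"])
        run("deploy to staging", ["./scripts/deploy.sh", "staging"])     # hypothetical script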

To fully support CI/CD, production-like environments should be set up to allow realistic testing of the software and infrastructure, maximizing confidence in production rollouts. With multiple disciplines involved -- application development, feature testing, performance testing, infrastructure, product marketing, customer support and more -- each becomes a stakeholder in the changes introduced through the CI/CD pipelines, and CI/CD often forms the backbone of operations within the company. Even though some guiding principles, such as small incremental changes and frequent automated testing, are recommended for every organization, an effective CI/CD process needs to align with the business drivers of the organization, and is therefore unique to each one.

Continuous Delivery (CI/CD)


Running workloads on the cloud offers a myriad of possibilities, ranging from acquiring only compute power in the form of instances from the cloud provider and managing operations within your own organization, to leveraging the “serverless” model, whereby the cloud provider takes care of operational tasks such as maintaining backups and scaling in and out to match demand. The build-vs-buy choices among IaaS, PaaS and SaaS (Infrastructure/Platform/Software as a Service) often lead to drastically different cost profiles. Costs consist not only of the bill issued by the cloud provider but also of operational overhead and the cost of acquiring knowledge.


Even once the IaaS/PaaS/SaaS decision is made for specific components of a workload, there are other important deciding factors. Depending on the service, costs can be based on minutes or hours of usage, on the total volume of data processed or stored, or, more often than not, a combination thereof. On top of that, forecasted usage plays an important role: some services offer long-term usage discounts, while others offer a bidding model for excess compute capacity that can lead to steep discounts.
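
The back-of-the-envelope sketch below illustrates why the usage profile matters. The hourly rates are hypothetical, and real prices vary by provider, region, service and commitment term.

    HOURS_PER_MONTH = 730

    # Hypothetical hourly rates for one instance size.
    ON_DEMAND = 0.10   # pay-as-you-go
    RESERVED = 0.06    # effective rate with a one-year commitment, billed for the full month
    SPOT = 0.03        # interruptible excess capacity

    def monthly_cost(rate: float, utilization: float = 1.0) -> float:
        """Cost of one instance running the given fraction of the month."""
        return rate * HOURS_PER_MONTH * utilization

    # A steady 24/7 service favours the long-term commitment...
    print(f"steady service, on-demand: ${monthly_cost(ON_DEMAND):.2f}")
    print(f"steady service, reserved:  ${monthly_cost(RESERVED):.2f}")
    # ...while an interruptible batch job running 40% of the month favours on-demand or spot.
    print(f"batch at 40%, on-demand:   ${monthly_cost(ON_DEMAND, 0.4):.2f}")
    print(f"batch at 40%, reserved:    ${monthly_cost(RESERVED):.2f}  (paid whether used or not)")
    print(f"batch at 40%, spot:        ${monthly_cost(SPOT, 0.4):.2f}")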


Approximating costs correctly, deciding what and how to consume, and making improvements that minimize the total cost of ownership is a complex exercise. Calculators provided by the cloud providers are helpful, but they form only a small subset of the tools enabling cost efficiency on the cloud. More importantly, price calculators do not take the holistic design and purpose of the system into consideration, whereas when performing any kind of cost analysis for our clients, the team at OpsGuru prides itself on its ability to drill down into the core purpose of individual components. As cloud services keep evolving rapidly, OpsGuru's experience is that there is often room for optimization by using different combinations of services or by leveraging new pricing features offered by the vendors.

Cost Optimisation


Whether you are running on-premises or in the cloud, planning for backup (or data retention) and disaster recovery is a fundamental requirement. Besides being a security best practice -- many ransomware attacks can in fact be mitigated by a robust backup plan -- data retention is also a core requirement of numerous compliance standards such as SOC and HIPAA.


Even though backup and disaster recovery plans may seem like apples and oranges at first glance, backup is in fact one of the disaster recovery strategies. Backup tends to be the first step of any disaster recovery plan, while multi-site/multi-provider workloads tend to be the highest tier, where operations are no longer tied to the well-being of a specific provider's services or a specific location. Given the near-limitless scaling and storage of the cloud, and the fact that most cloud providers operate multiple geographic regions to offset the risk of local incidents, the cloud is an ideal location for backup and disaster recovery implementations.
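
As a small illustration, the sketch below (in Python with boto3, assuming AWS) copies a backup object into a second bucket in another region. The bucket names, region and key are hypothetical, and managed features such as cross-region replication can remove the need for hand-rolled scripts entirely.

    import boto3

    SOURCE_BUCKET = "prod-data-primary"      # hypothetical bucket holding primary backups
    BACKUP_BUCKET = "prod-data-dr"           # hypothetical bucket in a second region
    BACKUP_REGION = "us-west-2"              # illustrative region

    def copy_to_backup_region(key: str) -> None:
        """Copy one object into the backup bucket in the disaster recovery region."""
        s3 = boto3.client("s3", region_name=BACKUP_REGION)
        s3.copy_object(
            Bucket=BACKUP_BUCKET,
            Key=key,
            CopySource={"Bucket": SOURCE_BUCKET, "Key": key},
        )

    if __name__ == "__main__":
        # In practice this would be driven by an inventory, schedule or replication rules.
        copy_to_backup_region("backups/2019-09-11/db-dump.sql.gz")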


An effective disaster recovery plan depends on compliance requirements and service level objectives, which are often derived from the service level agreements with the end-users of the workloads and the existing performance of the services. To come up with an effective plan, it is necessary to understand where the weaknesses and bottlenecks of the existing workload are, along with the service level requirements and usage profile of each component. Beyond the operational steps and automation needed to execute the disaster recovery, it is also essential to run drills and game-days periodically to ensure the plan is effective and the team is equipped to execute it: the best plan is only useful when the team knows how to run it. Because most members of OpsGuru come from a site reliability engineering background with significant experience in contingency planning and execution, we have capitalized on our collective experience to help clients formulate and execute disaster recovery implementations.

Backup/Disaster Recovery

The “General Data Protection Regulation” (GDPR) came into effect on 25 May 2018, but both before and after it took effect, many questions have surrounded the legislation, such as to whom it applies, what measures an organization must implement at a minimum to achieve compliance, and whether any exemptions are available. More importantly, the GDPR may set a precedent for other countries and territories that have variations of similar legislation. For any company that operates globally, understanding the nuances of each set of legislation and implementing data security policies accordingly is a must.


Jurisdiction-based legislation aside, each industry tends to have its own compliance standards. In the United States, for example, the payment industry follows the Payment Card Industry Data Security Standard (PCI-DSS), while service and product providers to the healthcare industry must comply with the Health Insurance Portability and Accountability Act (HIPAA) and may seek HITRUST certification. Any company looking to do business with the federal government needs at least to be aware of the Federal Information Processing Standards (FIPS). Add pan-industry standards such as ISO and NIST, and navigating the compliance landscape whilst delivering features and developing markets becomes a difficult task. It is useful to work with partners who have experience building solutions that follow these standards and interpreting and implementing such regulations. OpsGuru has experience working with a variety of heavily regulated industries, such as healthcare and finance, and can help you navigate compliance implementations and audits.

GDPR and Compliance

DevOps has been a buzzword in the community for the past decade. The communication, agility and fail-fast mechanisms the framework offers work well with the cloud, because implementing DevOps accentuates the value of the cloud. In recent years, security has been added to the mix, as it has become a foremost concern of successful operations on the cloud. Adding security testing and access-policy validation to the workflow for automated verification is the natural progression of a fully automated pipeline that aims to streamline processes and reduce off-cycle, interrupt-driven work. On top of CI/CD (continuous integration and continuous delivery), continuous validation is increasingly becoming a standard.


At OpsGuru we believe that security should be implemented from day one. Role-based access control and logs that trace every action on a resource back to a person and an IP address are implemented as a baseline. On top of that, deployments should include not only functionality checks but also security verifications, so that the security posture is always upheld.
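
A minimal sketch of such a verification is shown below: a pipeline stage (in Python with boto3, assuming AWS) that fails the deployment if any security group leaves SSH open to the whole internet. The region and the specific check are illustrative only.

    import sys

    import boto3

    def find_open_ssh_groups(region: str = "us-west-2") -> list[str]:
        """Return the IDs of security groups that allow SSH (port 22) from 0.0.0.0/0."""
        ec2 = boto3.client("ec2", region_name=region)
        offenders = []
        for group in ec2.describe_security_groups()["SecurityGroups"]:
            for rule in group.get("IpPermissions", []):
                open_to_world = any(
                    r.get("CidrIp") == "0.0.0.0/0" for r in rule.get("IpRanges", [])
                )
                covers_ssh = (
                    rule.get("FromPort") is not None
                    and rule.get("ToPort") is not None
                    and rule["FromPort"] <= 22 <= rule["ToPort"]
                )
                if open_to_world and covers_ssh:
                    offenders.append(group["GroupId"])
        return offenders

    if __name__ == "__main__":
        open_groups = find_open_ssh_groups()
        if open_groups:
            # Failing the pipeline here keeps the security posture enforced automatically.
            sys.exit(f"Security check failed; SSH open to the world in: {open_groups}")
        print("Security check passed.")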

DevSecOps Rollout