Azure Site Reliability Engineering 1

Working Together IT/OT

When I first started working with Azure and GCP, it became evident that both platforms provide services to solve the same set of problems. Each platform chose a number of paths, paths that sometimes converged and sometimes diverged. It was with Google and GCP that I first learned about Site Reliability Engineering (SRE).

From Wikipedia, a short definition of SRE:

Site reliability engineering (SRE) is a set of principles and practices that incorporates aspects of software engineering and applies them to IT infrastructure and operations.[1] The main objectives are to create highly reliable and scalable software systems. Site reliability engineering has been described as a specific implementation of DevOps.

Google has published free books and papers that start with the basics and show the importance of breaking down the silos between information technology (IT) and operational technology (OT), the reasons why, and the rewards.

To Microsoft’s credit, there are papers and a book that describe how to apply SRE in an Azure environment. It is not too often that I have seen Microsoft reference/credit Google, but in the case of SRE, Microsoft does just that. Kudos.

Just because a large number of folks do something a certain way does not mean there is no win-win solution – an approach, tool, application, etc. – that can produce a more efficient outcome. Google has a history of looking at a workflow with literally a handful of folks who ask, “what if we did this?”.

I am not a cheerleader for Google. I can talk for hours about what I have seen Google do – leverage “free” and become the powerhouse that it is today. But I have also seen Google look at a problem, come up with a better solution, write about that solution, its positives and negatives, and give it to the open source community. Case in point: back in the stone age, I learned Beam, originally a Google platform. My problem needed more than what Beam could provide. That took me to Flink and to solving a real-world problem. My cost was the time to learn from others.

Breaking Down The Silos

On to some basics. For decades, on-premises data centers often followed the model of developers developing and operators keeping everything operating. The model worked, but not without obvious glitches. The model had two silos, one for developers and one for operators. The developers wanted to release often and the operators wanted to minimize change. At the same time, both groups wanted success in production. The yin and the yang.

The solution: build a team with folks who are developers and folks who are operators. Have the developers write code 50% of the time and write scripts to support operations 50% of the time, and have the operators write operational scripts 50% of the time and write code 50% of the time. Feel the pain and learn. It sounds easy. Bringing together developers and operators is a challenge – a people-integration problem – but if done successfully, the outcome is a much more effective production implementation.

SRE, like any other set of tools, has a set of words and phrases that define what SRE is and is not.

Perhaps what is a little unique is that the words and phrases that define SRE do not stand alone. There are strong dependencies among them. There are few, if any, absolutes.

Service Level Objective

There is the Service Level Objective (SLO).

SRE begins with the idea that a prerequisite to success is availability. A system that is unavailable cannot perform its function and will fail by default. Availability, in SRE terms, defines whether a system is able to fulfill its intended function at a point in time. In addition to being used as a reporting tool, the historical availability measurement can also describe the probability that your system will perform as expected in the future.
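
As a rough sketch of that idea (Python, with hypothetical probe data rather than output from a real monitoring system), historical availability can be computed as the fraction of probe intervals in which the system was able to do its job:

# Sketch: historical availability from per-minute probe results.
# probe_results is hypothetical; in practice it would come from a
# monitoring system that checks the service once per interval.
probe_results = [True] * 43000 + [False] * 200   # True = probe succeeded

availability = sum(probe_results) / len(probe_results)
print(f"Historical availability: {availability:.4%}")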

When we set out to define the terms of SRE, we wanted to set a precise numerical target for system availability. We term this target the availability Service-Level Objective (SLO) of our system. Any discussion we have in the future about whether the system is running sufficiently reliably and what design or architectural changes we should make to it must be framed in terms of our system continuing to meet this SLO.

Keep in mind that the more reliable the service, the more it costs to operate. Define the lowest level of reliability that you can get away with for each service, and state that as your SLO. Every service should have an availability SLO—without it, your team and your stakeholders cannot make principled judgments about whether your service needs to be made more reliable (increasing cost and slowing development) or less reliable (allowing greater velocity of development). Excessive availability can become a problem because now it’s the expectation. Don’t make your system overly reliable if you don’t intend to commit to it always being that reliable.
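
To make the trade-off concrete, here is a minimal sketch (Python; the numbers and names are illustrative, not from any SRE toolkit) of comparing measured availability against the availability SLO and reading off the remaining error budget:

# Sketch: compare measured availability with the availability SLO.
# slo_target and measured_availability are illustrative values.
slo_target = 0.999              # the lowest level of reliability we commit to
measured_availability = 0.9994

budget_total = 1.0 - slo_target             # unreliability the SLO allows
budget_used = 1.0 - measured_availability
budget_remaining = budget_total - budget_used

if budget_remaining < 0:
    print("SLO missed: prioritize reliability work over new features.")
else:
    print("Within SLO: remaining error budget can absorb releases or "
          "planned-downtime exercises.")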

Within Google, we implement periodic downtime in some services to prevent a service from being overly available. You might also try experimenting with planned-downtime exercises with front-end servers occasionally, as we did with one of our internal systems. We found that these exercises can uncover services that are using those servers inappropriately. With that information, you can then move workloads to somewhere more suitable and keep servers at the right availability level.

Service Level Agreement

At Google, we distinguish between an SLO and a Service-Level Agreement (SLA). An SLA normally involves a promise to someone using your service that its availability SLO should meet a certain level over a certain period, and if it fails to do so then some kind of penalty will be paid. This might be a partial refund of the service subscription fee paid by customers for that period, or additional subscription time added for free. The concept is that going out of SLO is going to hurt the service team, so they will push hard to stay within SLO. If you’re charging your customers money, you will probably need an SLA.

Because of this, and because of the principle that availability shouldn’t be much better than the SLO, the availability SLO in the SLA is normally a looser objective than the internal availability SLO. This might be expressed in availability numbers: for instance, an availability SLO of 99.9% over one month, with an internal availability SLO of 99.95%. Alternatively, the SLA might only specify a subset of the metrics that make up the internal SLO.
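
Those percentages are easier to reason about as downtime budgets. A small sketch (assuming a 30-day month; the two objectives are the example numbers above) translates each objective into allowed downtime:

# Sketch: translate monthly availability objectives into allowed downtime.
minutes_per_month = 30 * 24 * 60   # assuming a 30-day month

for label, objective in [("SLA availability SLO", 0.999),
                         ("internal availability SLO", 0.9995)]:
    allowed_downtime = (1.0 - objective) * minutes_per_month
    print(f"{label}: {objective:.2%} allows about "
          f"{allowed_downtime:.0f} minutes of downtime per month")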

If you have an SLO in your SLA that is different from your internal SLO, as it almost always is, it’s important for your monitoring to measure SLO compliance explicitly. You want to be able to view your system’s availability over the SLA calendar period, and easily see if it appears to be in danger of going out of SLO. You will also need a precise measurement of compliance, usually from logs analysis. Since we have an extra set of obligations (described in the SLA) to paying customers, we need to measure queries received from them separately from other queries. That’s another benefit of establishing an SLA—it’s an unambiguous way to prioritize traffic.
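
A sketch of that measurement, assuming request logs that carry a field marking queries from paying (SLA-covered) customers; the record layout and field names here are made up for illustration:

# Sketch: measure availability from request logs for the SLA period,
# counting queries from paying customers separately from everything else.
# The records and their fields (customer_tier, success) are hypothetical.
logs = [
    {"customer_tier": "paying", "success": True},
    {"customer_tier": "paying", "success": False},
    {"customer_tier": "free", "success": True},
]

def availability(records):
    return sum(r["success"] for r in records) / len(records) if records else None

paying = [r for r in logs if r["customer_tier"] == "paying"]
print("SLA-covered availability:", availability(paying))
print("Overall availability:", availability(logs))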

Service Level Indicator

We also have a direct measurement of a service’s behavior: the frequency of successful probes of our system. This is a Service-Level Indicator (SLI). When we evaluate whether our system has been running within SLO for the past week, we look at the SLI to get the service availability percentage. If it goes below the specified SLO, we have a problem and may need to make the system more available in some way, such as running a second instance of the service in a different city and load-balancing between the two. If you want to know how reliable your service is, you must be able to measure the rates of successful and unsuccessful queries as your SLIs.
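
In its simplest form, that availability SLI is just the ratio of successful queries (or probes) to total queries over the evaluation window, compared against the SLO; a minimal sketch with illustrative counts:

# Sketch: a simple availability SLI from query counts over the past week.
# The counts are illustrative; a real system would pull them from monitoring.
successful_queries = 998_700
total_queries = 1_000_000
slo = 0.999

sli = successful_queries / total_queries
print(f"Availability SLI: {sli:.4%}")
if sli < slo:
    print("Below SLO: consider running a second, load-balanced "
          "instance in a different city.")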

Since the original post was published, we’ve made some updates to Stackdriver that let you incorporate SLIs even more easily into your Google Cloud Platform (GCP) workflows. You can now combine your in-house SLIs with the SLIs of the GCP services that you use, all in the same Stackdriver monitoring dashboard. At Next ‘18, the Spotlight session with Ben Treynor and Snapchat will illustrate how Snap uses its dashboard to get insight into what matters to its customers and map it directly to what information it gets from GCP, for an in-depth view of customer experience.

Next up, less about Google and more about tools for the Site Reliability Engineer on the Azure platform.

