Monitoring: SLA, SLO & SLI
Monitoring is one layer in the hierarchy of practices that keep your system or app available, and it is fundamental to SRE.
It is nearly impossible to manage a service (system or app) well without understanding which behaviors really matter for that service, and without defined terms for measuring and evaluating those behaviors. A level of service should be defined and communicated to your users or clients so they know what availability to expect.
Three service-level concepts cover this: service level indicators (SLIs), service level objectives (SLOs), and service level agreements (SLAs). They describe the metrics that matter, the values we want those metrics to have, and how we’ll react or mitigate if we can’t provide the expected service.
Setting specific, measurable targets requires the right balance between product development and operations work. Choosing appropriate metrics helps drive the right action when something goes wrong, and gives a cloud or infrastructure engineer confidence that a service is healthy. Once the indicators are being measured, you can forecast and state a level of reliability. You can then tell your client with confidence, “My app should have 97% uptime in a rolling 30-day window”.
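To make a target like 97% uptime concrete, it helps to convert it into the downtime it permits. The sketch below is only an illustration: the function name allowed_downtime is made up for this example, and it assumes uptime is measured as a simple fraction of wall-clock time in the window.

```python
from datetime import timedelta


def allowed_downtime(target: float, window: timedelta) -> timedelta:
    """Return how much downtime fits in `window` while still meeting `target`.

    `target` is the uptime goal as a fraction, e.g. 0.97 for 97%.
    (Hypothetical helper, not part of any monitoring library.)
    """
    if not 0.0 <= target <= 1.0:
        raise ValueError("target must be a fraction between 0 and 1")
    return window * (1.0 - target)


# 97% uptime over a rolling 30-day window leaves 3% of 30 days as downtime,
# which is roughly 21.6 hours.
budget = allowed_downtime(0.97, timedelta(days=30))
print(f"Allowed downtime: {budget.total_seconds() / 3600:.1f} hours")
```

Seen this way, 97% is a fairly loose target: the service could be down for most of a day each month and still meet it.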
Service Level Indicator (SLI)
An SLI is a service level indicator: a carefully defined quantitative measure of some aspect of the level of service that is provided. Commonly measured SLIs are:
- Request Latency
- Error Rate
- Saturation
- Throughput
- Availability
Not all metrics make for good SLIs. You want to find metrics that accurately measure a user’s experience. High CPU or memory usage, for example, makes for a poor SLI because a user might not see any impact on their end during these events. Ideally, the SLI directly measures a service level of interest, but sometimes only a proxy is available because the desired measure may be hard to obtain or interpret.
For example, client-side latency is often the more user-relevant metric, but it might only be possible to measure latency at the server.
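As a rough illustration of how two of the SLIs listed above could be computed from server-side request records, here is a minimal sketch. The Request type, its fields, and the sample data are all assumptions made for this example. Availability is taken to mean the share of requests that did not fail with a 5xx status, and the percentile uses the simple nearest-rank method.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Request:
    """One served request (hypothetical record shape for this sketch)."""
    latency_ms: float
    status: int


def availability_sli(requests: List[Request]) -> Optional[float]:
    """Fraction of requests that did not fail with a server error (5xx)."""
    if not requests:
        return None  # no traffic, so nothing to measure
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_percentile(requests: List[Request], pct: float) -> Optional[float]:
    """Nearest-rank percentile of request latency, in milliseconds."""
    values = sorted(r.latency_ms for r in requests)
    if not values:
        return None
    rank = max(1, math.ceil(pct / 100 * len(values)))
    return values[rank - 1]


# Made-up sample data, only to exercise the functions.
sample = [
    Request(40, 200),
    Request(55, 200),
    Request(120, 200),
    Request(900, 503),
]
print(availability_sli(sample))        # 3 of 4 requests succeeded
print(latency_percentile(sample, 95))  # dominated by the one slow request
```

Because these numbers come from the server, they are a proxy for what the user sees: they miss network time and client rendering.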
Service Level Objectives (SLO)
An SLO is a target value or range of values for a service level that is measured by an SLI. A natural structure for SLOs is thus SLI ≤ target, or lower bound ≤ SLI ≤ upper bound. It might be tempting to set aggressive targets like 100% uptime, but this comes at a higher cost. Meanwhile, you don’t always get to choose a metric’s value! For incoming HTTP requests from the outside world to your service, the queries per second (QPS) metric is essentially determined by the desires of your users, and you can’t really set an SLO for that.
The goal is not to achieve perfection but to make customers happy with the right level of reliability. If a customer is happy with 99% reliability, increasing it any further adds no value for them. That’s why choosing an appropriate SLO is complex.
For example, we might decide that we will return Shakespeare search results “quickly,” adopting an SLO that our average search request latency should be less than 100 milliseconds.
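The "SLI ≤ target" and "lower bound ≤ SLI ≤ upper bound" structure can be sketched in a few lines. This is a minimal illustration, not a standard API: the SLO class, its field names, and the latency samples are invented for the example.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass(frozen=True)
class SLO:
    """A target range for one SLI; either bound may be left open."""
    name: str
    lower: Optional[float] = None
    upper: Optional[float] = None

    def is_met(self, sli: float) -> bool:
        """True when lower <= sli <= upper, ignoring any bound that is unset."""
        if self.lower is not None and sli < self.lower:
            return False
        if self.upper is not None and sli > self.upper:
            return False
        return True


# The search example: average request latency should stay within 100 ms.
search_latency = SLO(name="average search latency (ms)", upper=100.0)

# Made-up latency samples in milliseconds.
latencies_ms = [62.0, 71.0, 88.0, 140.0, 79.0]
average = mean(latencies_ms)
print(average, search_latency.is_met(average))
```

Note that the single 140 ms request does not break this SLO because the target is on the average. An SLO on a high percentile would be stricter about slow outliers.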
Choosing and publishing SLOs sets users’ expectations about how a service will perform. SLOs should be explicitly stated and transparent! Without an explicit SLO, users often develop their own beliefs about desired performance, which may be unrelated to the beliefs held by the people designing and operating the service.
Service Level Agreement (SLA)
An SLA is a contract between a vendor and a user that guarantees a certain SLO and spells out the consequences of not meeting it. The consequences are most easily recognized when they are financial (a rebate or a penalty), but they can take other forms. An easy way to tell the difference between an SLO and an SLA is to ask, “what happens if the SLOs aren’t met?”: if there is no explicit consequence, then you are almost certainly looking at an SLO.
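To show what a financial consequence might look like in practice, here is a small sketch of a tiered rebate. Every number in it is hypothetical: the thresholds and credit percentages are invented for illustration and do not come from any real contract.

```python
from typing import List, Tuple

# Hypothetical rebate schedule: (availability threshold, credit as % of the bill).
# A measurement below the threshold earns that credit. Invented values.
HYPOTHETICAL_TIERS: List[Tuple[float, float]] = [
    (0.95, 25.0),
    (0.99, 10.0),
]


def service_credit_percent(
    availability: float,
    tiers: List[Tuple[float, float]] = HYPOTHETICAL_TIERS,
) -> float:
    """Return the rebate owed for a billing period, as a percentage of the bill."""
    # Check the lowest threshold first so the worst breach gets the largest credit.
    for threshold, credit in sorted(tiers):
        if availability < threshold:
            return credit
    return 0.0  # the SLO was met, so no consequence applies


print(service_credit_percent(0.995))  # SLO met
print(service_credit_percent(0.97))   # missed the 99% tier
print(service_credit_percent(0.93))   # missed the 95% tier as well
```

The explicit mapping from a missed target to a consequence is what makes this an SLA rather than an SLO.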
Google Search is an example of an important service that doesn’t have an SLA for the public: Google wants everyone to use Search as fluidly and efficiently as possible, but it hasn’t signed a contract with the whole world. Even so, there are still consequences if Search isn’t available: unavailability results in a hit to Google’s reputation, as well as a drop in advertising revenue.