An Engineer’s Guide To SLA, SLO, and SLI.
Engineers want software systems to be massive yet agile, to perform at the highest level, and to not compromise on security. They want software that can scale, is simple in design, and is easy to develop and maintain. What they don’t want are more acronyms. 😀
SLA stands for Service Level Agreement. SLAs typically span the business domain. They are drawn up by cross-functional teams that include Legal, Tech, Sales, and Support functions. Beneath all the legal language and verbose conditions lies this basic premise: if the software does not function as expected, one of our engineers will fix it within a stipulated time. The rest of the document is several layers built over this engineering responsibility.
An SLA spans the definitions of accepted resolution time, performance expectations, uptime of the service/servers, and a myriad of other parameters. The inclusion and exclusion of parameters depend on what the vendor/company agrees is a good measure of usability. This suffers from a deplorable dichotomy between what is assumed and what is observed: these definitions are usually based on heuristics and assumptions, and only rarely on observing actual user behaviour or interaction with the application.
Having SLAs is not a guarantee of understanding user needs better. There are advantages to putting them in place, though. SLAs provide a common vocabulary during conflicts, and aid resolution when disputes arise. They help companies providing the service define their manpower requirements, and delineate responsibilities for engineers.
It is important that engineers have visibility into SLAs. They need to understand the definitions and the parameters in them. This makes individual engineers aware of their responsibilities, and creates a culture of awareness. It also helps engineers devise SLOs that take into account the implications for the business at large while designing software systems.
To engineers, SLOs are much less abstract. SLO stands for Service Level Objective. Each software component built by engineers effectively provides a service. This service has to meet certain requirements. These requirements could be generic, or very specific. An example of a generic requirement could be “This gateway has to connect to the API of the payment provider”. A specific one could be “99% of all API calls should be completed in under 100 ms”.
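The specific requirement above can be checked mechanically once latencies are recorded. The following is a minimal sketch, assuming latencies are already collected as a list of milliseconds; the function name `latency_slo_met`, its parameters, and the sample data are hypothetical and only illustrate the idea.

```python
def latency_slo_met(latencies_ms, threshold_ms=100.0, target=0.99):
    """Return True when at least `target` of the calls finished under `threshold_ms`."""
    if not latencies_ms:
        # Assumption: a window with no traffic is treated as not violating the SLO.
        return True
    fast_calls = sum(1 for ms in latencies_ms if ms < threshold_ms)
    return fast_calls / len(latencies_ms) >= target


# Hypothetical sample: 995 fast calls and 5 slow ones (99.5% under 100 ms).
sample = [40.0] * 995 + [250.0] * 5
print("SLO met:", latency_slo_met(sample))
```

Treating an empty window as compliant is a design choice, not a rule; some teams may prefer to flag missing data instead.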
Creating SLOs can quickly become a complex process involving probability theory, calculus, and statistical methods to accurately predict incidents. A simple and straightforward method to create them would be to monitor user interactions, set appropriate thresholds, and group the measurements into quantiles that are relevant. An example would be “Less than 1% of users should experience an idle time greater than 5 s while using app.plumbr.io”. SLOs often stem from specific behavior expected from the system. It is important to remember that these might be beyond the scope of analysts and product managers.
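As a rough illustration of the "monitor, set thresholds, look at quantiles" approach, the sketch below computes a nearest-rank percentile over per-user idle times and checks the example objective. It assumes one idle-time measurement (in seconds) per user; the function names and the sample data are hypothetical.

```python
import math


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def idle_slo_met(idle_seconds_per_user, threshold_s=5.0, max_fraction=0.01):
    """True when fewer than `max_fraction` of users saw idle time above `threshold_s`."""
    over = sum(1 for s in idle_seconds_per_user if s > threshold_s)
    return over / len(idle_seconds_per_user) < max_fraction


# Hypothetical sample: 200 users, one of whom waited 8 seconds.
idle_times = [1.0] * 199 + [8.0]
print("p99 idle time (s):", nearest_rank_percentile(idle_times, 99))
print("SLO met:", idle_slo_met(idle_times))
```

Nearest-rank is only one of several percentile definitions; monitoring tools may interpolate and report slightly different values for the same data.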
SLOs are an engineer’s best friend. They help engineers define important boundaries for the systems they are building. SLOs, when used effectively, help engineers build accurate systems and maneuver around architectural considerations. When working with interdependent systems, they help analyze feasibility. SLOs also help engineering teams balance technical activities, decide tradeoffs, and incorporate business considerations by deciding which SLIs to use when outlining SLOs.
SLIs are metrics. Numbers that tell the tale. SLI stands for Service Level Indicator. Some SLIs, such as throughput, latency, availability, and capacity, are very common. These indicators describe how servers are holding up under load. You could monitor SLIs differently for each system or sub-system.
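To make two of these indicators concrete, here is a small sketch that derives availability and throughput from a list of request records. The `Request` record, its fields, and the sample values are assumptions made for illustration; real systems would usually read these numbers from a metrics pipeline.

```python
from dataclasses import dataclass


@dataclass
class Request:
    timestamp_s: float  # when the request arrived (hypothetical field)
    latency_ms: float   # how long it took to serve
    ok: bool            # whether it succeeded


def availability(requests):
    """Fraction of requests that succeeded."""
    return sum(r.ok for r in requests) / len(requests)


def throughput_rps(requests, window_s):
    """Requests served per second over an observation window of `window_s` seconds."""
    return len(requests) / window_s


# Hypothetical four-second window with one failed request.
window = [
    Request(0.0, 20.0, True),
    Request(1.0, 30.0, True),
    Request(2.0, 500.0, False),
    Request(3.0, 25.0, True),
]
print("availability:", availability(window))
print("throughput (req/s):", throughput_rps(window, 4.0))
```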
Bad SLI choices might leave engineers with a very wrong picture of the user’s actual experience. One way to separate user behaviour is to divide requests into authenticated and unauthenticated ones, and to monitor different parameters for each. Read-write requests could be dealt with separately from read-only requests. The list goes on.
Each of these classes of users has different expectations, and thus needs to be measured against different parameters. The business implications of these users are also different. The impact of failures and bottlenecks on the interfaces that affect different classes of users is an important aspect in deciding where to invest engineering effort. If SLIs can gauge user behaviour accurately, they serve a much more active purpose in the larger scheme of things.
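The segmentation described above can be sketched as a simple grouping step: compute the same indicator separately for each class of request. The example below assumes each request is reduced to an `(authenticated, read_only, ok)` tuple; that shape, the segment labels, and the sample data are hypothetical.

```python
from collections import defaultdict


def success_rate_by_segment(requests):
    """Compute the success rate per (auth, access) segment.

    `requests` is an iterable of (authenticated, read_only, ok) tuples.
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for authenticated, read_only, ok in requests:
        segment = (
            "authenticated" if authenticated else "unauthenticated",
            "read-only" if read_only else "read-write",
        )
        totals[segment] += 1
        successes[segment] += int(ok)
    return {segment: successes[segment] / totals[segment] for segment in totals}


# Hypothetical traffic: authenticated writes fail more often than anonymous reads.
traffic = [
    (True, False, True),
    (True, False, False),
    (False, True, True),
    (False, True, True),
]
for segment, rate in success_rate_by_segment(traffic).items():
    print(segment, rate)
```

An overall success rate of 75% would hide the fact that, in this made-up sample, half of the authenticated writes fail, which is the kind of blind spot a poorly chosen SLI creates.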
1. SLIs are ways for engineers to communicate quantitative data about systems.
2. SLOs are designed to provide a certain level of service, defined using SLIs.
3. SLAs are exchanged on the basis of understanding the SLOs which teams adopt.
4. If user behaviour is not included in these definitions, they remain deficient.
Thank you Priit, for reading drafts, and helping edit this post.