Probably the most common metric used in Service Level Agreements (SLAs) for enterprise applications (and SOA infrastructure) is availability. While my primary interest is in figuring out how to define effective SLAs for distributed, transactional business applications (things like order management, reservations, etc.), I thought it would be interesting to explore the use of availability for more basic applications - for example, email servers.
Most IT organizations will provide an SLA to their business for email. A typical example would be "we guarantee at least 99.7% availability for our email systems." That allows a little over 26 hours of downtime in a year (0.3% of 8,760 hours).
So, this raises the question: Is this a measurement that business users care about? After all, the purpose of an SLA is to ensure you meet the expectations of your users.
Now, 26 hours a year could be 4 minutes a day at 4 a.m., or it could be one full day of downtime on the last day of a quarter. Do you think a business user considers these equivalent? Of course not. Having email go down for a full day on the last day of a quarter could be extremely damaging to a business, whereas 4 minutes of downtime in the middle of the night each day would be irrelevant.
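To make the arithmetic concrete, here is a minimal sketch that converts an availability percentage into a yearly downtime budget and its per-day equivalent. The function name and the 365-day year are my own simplifying assumptions, not part of any SLA standard.

```python
HOURS_PER_YEAR = 365 * 24  # 8760; assumes a non-leap year


def downtime_budget_hours(availability_pct: float) -> float:
    """Maximum yearly downtime, in hours, allowed by an availability target."""
    return (1 - availability_pct / 100) * HOURS_PER_YEAR


budget = downtime_budget_hours(99.7)
per_day_minutes = budget * 60 / 365  # the same budget spread evenly over every day

print(f"{budget:.1f} hours per year")             # 26.3 hours per year
print(f"{per_day_minutes:.1f} minutes per day")   # 4.3 minutes per day
```

The same number describes both a few harmless minutes every night and one very painful full-day outage, which is exactly the problem.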
Let's look at a different variation: let's say we have 15 minutes of email downtime, right in the middle of the day, every third day. Would that be an issue? Well, with Microsoft Exchange and Outlook, users wouldn't even notice this because Outlook keeps a local cache of mailbox data even while the server is online. So, even if the server goes down, the user can keep reading emails plus drafting and sending new ones. In fact, with Outlook, users have to look hard to even know whether they are connected to the server or not - it's not obvious. The end user perceives email as up or down based on whether they can interact with Outlook - not on whether the server is actually up or down. They also perceive that email is "slow" when Outlook is slow (not when the server is slow).
Now let's say the email server is up and running (so it's available), but the virus checker hooked into the email server is spinning in a loop, causing a 2-day delay for every sent or received email (something that happened to us at one point). Even though email is available, clearly end users would be very unhappy.
So, it's pretty clear that % availability of an email server is a useless metric for email SLAs. It doesn't correspond to what users care about. Users care that the emails they send arrive in their recipients' mailboxes in a reasonable amount of time (let's say less than 15 minutes). Similarly, they care that email sent to them arrives in a timely manner. So, the most meaningful email SLA would be something like "we guarantee that inbound or outbound email will arrive within 15 minutes." They also care that they can always interact with the interface to the application (for email this is Outlook).
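As a rough illustration of what measuring that kind of SLA could look like, the sketch below computes the fraction of messages delivered within the 15-minute target. The function name, the input format (pairs of sent and delivered timestamps) and the sample data are all hypothetical; how you would actually collect those timestamps depends on your mail infrastructure.

```python
from datetime import datetime, timedelta

DELIVERY_TARGET = timedelta(minutes=15)  # the 15-minute figure used in the text


def delivery_compliance(messages):
    """Fraction of messages delivered within DELIVERY_TARGET.

    `messages` is a list of (sent_at, delivered_at) datetime pairs;
    delivered_at is None for a message that has not arrived yet.
    """
    if not messages:
        return 1.0  # nothing sent, nothing late
    on_time = sum(
        1
        for sent_at, delivered_at in messages
        if delivered_at is not None and delivered_at - sent_at <= DELIVERY_TARGET
    )
    return on_time / len(messages)


# Hypothetical sample: the server is "available" the whole time,
# but one message is delayed by 2 days and one never arrives.
t0 = datetime(2024, 1, 15, 9, 0)
sample = [
    (t0, t0 + timedelta(minutes=1)),
    (t0, t0 + timedelta(minutes=14)),
    (t0, t0 + timedelta(days=2)),
    (t0, None),
]
compliance = delivery_compliance(sample)
print(f"{compliance:.0%} delivered within target")  # 50% delivered within target
```

Note that an availability monitor would report 100% for this sample, while the delivery metric shows half of the messages missing the target.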
To be clear, I'm not saying that measuring availability is wrong. Knowing when the email system is unavailable is a useful diagnostic. It lets IT know that they might end up with an SLA problem. But don't think that the business cares one bit about it as part of your SLA. It's just a diagnostic tool - an early warning sign of a subset of problems that might cause an SLA violation.
OK, now let's translate this to a few lessons about defining effective SLAs:
- The best SLAs are ones that measure metrics (and guarantee results) that directly correspond to the expectations of users or to measurable business metrics (e.g. revenue).
- The synchronous parts of an application (specifically, the part the user directly interacts with) should be measured with different metrics than the asynchronous/background parts of an application.
- For the synchronous parts of an application, outages and performance issues should be weighted by when they occur (mid-day or at night, beginning of quarter or end-of-quarter, etc.) and how long they last - a raw total or % is not that meaningful.
- For the asynchronous parts of an application, SLAs are most effective when they relate to missing important deadlines (for example, not being able to recognize revenue by the end of the quarter, not being able to ship a next-day-delivery-guaranteed product by the end of the day, the expectation of an email arriving within 15 minutes, etc.)
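One way to picture the weighting idea from the lessons above is to count each minute of an outage according to when it happens. The sketch below is only an illustration: the weights (1.0 during weekday business hours, 0.1 otherwise), the 8-to-18 business day and the function names are assumptions I made up, and a real scheme would also need to account for things like quarter-end.

```python
from datetime import datetime, timedelta


def impact_weight(moment: datetime) -> float:
    """Hypothetical weights: weekday business hours count fully, everything else barely."""
    if moment.weekday() >= 5:  # Saturday or Sunday
        return 0.1
    if 8 <= moment.hour < 18:  # assumed business hours
        return 1.0
    return 0.1


def weighted_outage_minutes(start: datetime, end: datetime) -> float:
    """Sum the per-minute weights over the outage interval [start, end)."""
    total = 0.0
    moment = start
    while moment < end:
        total += impact_weight(moment)
        moment += timedelta(minutes=1)
    return total


# Two outages with the same raw length (30 minutes) on a Tuesday.
night = weighted_outage_minutes(datetime(2024, 1, 16, 4, 0), datetime(2024, 1, 16, 4, 30))
midday = weighted_outage_minutes(datetime(2024, 1, 16, 12, 0), datetime(2024, 1, 16, 12, 30))

print(f"4 a.m. outage: {night:.1f} weighted minutes")   # 3.0 weighted minutes
print(f"noon outage:   {midday:.1f} weighted minutes")  # 30.0 weighted minutes
```

A raw total would treat both outages as the same 30 minutes; the weighted figure separates them by a factor of ten.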
So, yes, defining a good SLA is hard. You need to understand your users and your business. You need to measure things which aren't necessarily that easy to measure. Many IT organizations take the easy road out and measure metrics like % availability and call it an SLA (no wonder business people view IT as out-of-alignment with them). Providing SLAs which follow the lessons above will ensure that you meet the expectations of your users, customers, and partners. It may be hard, but the result is priceless.