Monday, April 28, 2008

SLAs and reporting - whose truth to believe?

One of the core concepts in SOA is the idea that a service should have a Service Level that it agrees to meet (the SLA). This might be technical, in terms of uptime, response times, or the amount of information to be handled. In more sophisticated services it could be more business oriented, in terms of cost to serve, order-to-ship time, conversion rate, stock levels, etc. These agreements can apply both ways: the service commits to respond within 20ms, but the consumer can't send more than 20 messages a second.
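As a rough illustration of an agreement that binds both sides, here is a minimal Python sketch using the 20ms and 20 messages a second figures from above. The class and function names are hypothetical, and the rule that a consumer violation takes priority over a producer violation is an assumption for the example, not something any standard dictates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoWaySla:
    """Hypothetical two-way agreement; field names are illustrative only."""
    max_response_ms: float = 20.0      # what the producer commits to
    max_requests_per_second: int = 20  # what the consumer commits to


def check_second(sla, response_times_ms):
    """Classify one second of traffic from the response time of each request."""
    consumer_ok = len(response_times_ms) <= sla.max_requests_per_second
    producer_ok = all(t <= sla.max_response_ms for t in response_times_ms)
    if not consumer_ok:
        # Assumption: the producer's commitment only holds while the
        # consumer stays inside its own side of the agreement.
        return "consumer violation"
    if not producer_ok:
        return "producer violation"
    return "ok"
```

Calling `check_second(TwoWaySla(), [5, 10, 25])` reports a producer violation, while sending 21 requests in the same second reports a consumer violation regardless of how quickly they were answered.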

SLAs are a guarantee that something will be done, and there should be penalties in place when a violation occurs. In theory this should be a simple case of measurement, but all too often it is overlooked. People define the SLAs but forget the old adage on KPIs: if you allow someone to measure their own KPIs, they will always be successful.

I've been looking for a simple demonstration of this problem for a while. Most of them are specific to a given business, so they don't work generically. But thanks to the folks in Redmond I've now got a great example. It may or may not be their fault, but it's a good example of measurement problems.

I run Windows XP under Parallels on a MacBook Pro. Now I just do basic office work so I've given it a C: drive with 15GB. This should be a decent amount for the basic office files (Outlook files are stored on a dedicated share). The trouble is I've run out of space....
So what files are taking up the space? Well a quick "WinDirStat" on the C drive (after doing the same exercise with Windows properties) came up with the stat that the files on the drive take up around 5.4GB. This leaves me with around 10GB unaccounted for.
Here is an example of a producer/consumer SLA. The producer (Windows XP) has committed to provide me with around 15GB of storage for my office files. It then reports that I have violated the client SLA for the service and it will now perform like a dog. Equally my information is that the performance of the producer has severely degraded and I am unable to add more files despite being significantly below the agreed limit.

In this case it is the producer who measures the KPIs, and even though the independent measure suggests the producer's figure is incorrect, it is not possible to challenge the producer's statement.
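For anyone who wants to repeat the comparison, here is a small Python sketch of the consumer-side measurement: it sets what the operating system reports as used against the total size of the files a directory walk can actually see. The function names are hypothetical. It only shows that a gap exists, not where the space went; system files, unreadable directories and filesystem overhead will all land in the "unaccounted" figure.

```python
import os
import shutil


def visible_file_bytes(root):
    """Sum the sizes of all regular files reachable by walking root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if not os.path.islink(path):
                    total += os.path.getsize(path)
            except OSError:
                # Files we cannot stat are skipped, so they end up
                # in the unaccounted figure rather than the visible one.
                pass
    return total


def unaccounted_bytes(root):
    """Producer's reported usage minus what the consumer can see."""
    reported_used = shutil.disk_usage(root).used
    return reported_used - visible_file_bytes(root)
```

On my setup, `unaccounted_bytes("C:\\")` would be the figure to argue about: the producer's number and the consumer's number, side by side.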

What this says is that when looking at KPIs and SLAs for services you need to make independent measurement part of the basic requirement. This implies that measurement is done at the service boundary by a third party, which must track these SLAs over time. Otherwise you'll just end up with Windows saying "disk space full please free up some space" and then telling you that you don't have enough files available to make a dent in the space.
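A minimal Python sketch of what measuring at the service boundary could look like: a recorder that neither the producer nor the consumer owns times every call and keeps the history. The `BoundaryMonitor` class and its method names are hypothetical, and a real third-party monitor would sit on the network rather than inside the caller's process.

```python
import time


class BoundaryMonitor:
    """Hypothetical independent recorder sitting at the service boundary."""

    def __init__(self, max_response_ms, clock=time.perf_counter):
        self.max_response_ms = max_response_ms
        self.clock = clock      # injectable so the monitor itself can be checked
        self.samples_ms = []    # the history of measurements over time

    def call(self, service, *args, **kwargs):
        """Invoke the service and record how long it took, even on failure."""
        start = self.clock()
        try:
            return service(*args, **kwargs)
        finally:
            self.samples_ms.append((self.clock() - start) * 1000.0)

    def violations(self):
        """Return every recorded sample that exceeded the agreed limit."""
        return [s for s in self.samples_ms if s > self.max_response_ms]
```

The point of the design is that the numbers in `samples_ms` come from neither party's own reporting, so both sides can be held to them.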

The final piece therefore is around arbitration. It is quite possible in the setup that I'm using that something is going screwy around Windows that isn't the fault of the producer or the consumer, but is due to a third party (e.g. the VM). Once you get to this stage you need to think about arbitration. The challenge here is that the consumer is still due compensation from the producer, but the producer may have a counterclaim against the third party. This is the final piece about SLAs in a professional SOA environment. It's important to think "back to back" in your SLAs, otherwise they won't actually mean anything. If a service commits to responding in 20ms but none of the things it relies upon will make any such guarantee, then it's just corporate optimism (at best) or fraud (at worst).
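The "back to back" check can be sketched in a few lines of Python. This assumes the service calls its dependencies one after another, so their guarantees add up; the function name and the use of `None` for "no guarantee offered" are illustrative choices, not part of any standard.

```python
def is_backed(commitment_ms, own_work_ms, dependency_guarantees_ms):
    """Return True if the commitment is covered by what the dependencies promise.

    dependency_guarantees_ms holds one entry per dependency: a number of
    milliseconds, or None when the dependency makes no guarantee at all.
    Assumes the dependencies are called sequentially.
    """
    if any(g is None for g in dependency_guarantees_ms):
        # A promise resting on an unguaranteed dependency is optimism, not an SLA.
        return False
    return own_work_ms + sum(dependency_guarantees_ms) <= commitment_ms
```

So `is_backed(20, 5, [5, 8])` holds, while `is_backed(20, 5, [5, None])` and `is_backed(20, 5, [10, 10])` do not: the first relies on something with no guarantee, the second adds up to more than the 20ms promised.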

SLA management is a core part of the shift of SOA away from technology and into the business domain. With all the WS-* arguments going around I'm still stunned that there is no WS-Contract or WS-SLA, because it's that which would really separate WS-* from other technical choices.

SLAs are about
  1. Defining the terms
  2. Defining the penalties
  3. Measuring the operation
  4. Arbitrating the violation
  5. Spreading the risk down the chain
It's the independent measurement that helps make all the rest of it honest.

(Oh, and if anyone has any idea about the 10GB, I'd love to know where it's gone!)


1 comment:

David Bressler said...

Steve,

Great post! I especially like your list of what SLAs are about.

I believe, however, that there are real challenges with standards that address the issue.

The problem is that any standard way of defining the metric is something that has to flow back into the business. Sometimes that may not matter, but other times it renders the standard metric useless.

Using your example, what you really want is Microsoft and Steve to be able to agree on what is measurable about disk usage. For instance, Microsoft may be including tmp files and maybe virtual memory, because that space is untouchable. Perhaps WinDirStat is not counting that, because they're not "user files" that are using disk space.

Let me give a more relevant SOA example. You want the average response time for a service. How do you measure the average? Do you give an average over 1 minute, or over 5 minutes? Is it the average response time across all uses of the service, or just across a single business application/process use? These are just some simple questions that come to mind, and for me they make "policy or contract standards" very difficult to define in a way that is pragmatically useful.
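To make the averaging question concrete, here is a small Python sketch with made-up numbers. The function name and the sample data are purely illustrative; the only point is that the same measurements can pass or fail depending on the window the two parties agreed on.

```python
from statistics import mean


def windowed_averages(samples, window_s):
    """Average response time per fixed window.

    samples is a list of (timestamp_seconds, response_ms) pairs.
    Returns a dict mapping each window's start time to its average.
    """
    buckets = {}
    for ts, ms in samples:
        start = int(ts // window_s) * window_s
        buckets.setdefault(start, []).append(ms)
    return {start: mean(values) for start, values in sorted(buckets.items())}


# Invented data: one request per minute, with one slow minute in the middle.
samples = [(0, 10), (60, 10), (120, 60), (180, 10), (240, 10)]
per_minute = windowed_averages(samples, 60)
per_five_minutes = windowed_averages(samples, 300)
```

With these numbers the 1-minute view shows a 60ms average in the third minute, while the 5-minute view shows a single average of 20ms. Against a hypothetical limit of 25ms, the first view is a violation and the second is not.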

Disclosure: I have been accused of being "anti-standards," something I don't believe is true. I'm just a realist.