Tuesday, February 24, 2009

Setting and measuring KPIs - or how Whistler rigs its SLA

Last week I was in Whistler. They've just had a large dump of snow, but last week conditions could be described as "spring-like" or, to put it another way, "rocks, ice and slush". During that time the resort was reporting a "base depth" of about 150cm and that 200 out of 200 runs were open.

What interested me was this: where on earth do they measure the depth? Clearly they are measuring it somewhere, but equally clearly the impression 150cm gives is that everywhere has a massive base and that this is why every run is open. The reality is that Whistler, like every resort, gets to pick its measurement spot so that it conveys the image it wants to project. The number of open runs appeared to mean "we might have put up a closed sign, but you could go down if you like rock climbing".

What is the point of this? Well, it might sound obvious, but if you are consuming a service then you need to know what the SLA means, either by specifying it yourself or by having real clarity on exactly what the provider is guaranteeing and how they are measuring it.
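
To make the point concrete, here is a minimal sketch showing how one set of readings can produce very different headline figures depending on how the provider chooses to measure. The station names, the depths and the 50cm "enough cover" threshold are all hypothetical, invented purely for illustration; they are not Whistler's data or Whistler's method.

```python
from statistics import median

# Hypothetical snow-depth readings in cm, one per measurement station.
# These numbers are invented for illustration only.
READINGS_CM = {
    "summit_plateau": 150,
    "upper_bowl": 90,
    "mid_mountain": 40,
    "lower_runs": 10,
    "valley_base": 0,
}

def headline_depth(readings):
    """The provider's view: report the single best-looking station."""
    return max(readings.values())

def typical_depth(readings):
    """A consumer's view: the median depth across all stations."""
    return median(readings.values())

def share_above(readings, threshold_cm):
    """Fraction of stations with at least threshold_cm of cover."""
    ok = sum(1 for depth in readings.values() if depth >= threshold_cm)
    return ok / len(readings)

if __name__ == "__main__":
    print(f"Headline base depth: {headline_depth(READINGS_CM)}cm")
    print(f"Median depth:        {typical_depth(READINGS_CM)}cm")
    print(f"Stations >= 50cm:    {share_above(READINGS_CM, 50):.0%}")
```

Run as-is, this reports a 150cm headline against a 40cm median, with only 40% of stations having 50cm or more. All three numbers are "true"; an SLA that doesn't say which one it is measuring tells you very little.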

SLAs are only worth something if you can trust what they say; otherwise they are worth as much as the 150cm base at Whistler last week.

Monday, February 02, 2009

Think Holistically, speak clearly

Recently I've had a couple of occasions where I've needed to prepare direct communications to execs. In all of these (both internal and external) there has been a whole list of issues but only a couple of genuinely core ones. People often like to hide in the detail of these conversations, particularly in change programmes, and the big picture gets lost as they sink into arguing about line 83 of the spreadsheet.

Most of the time the issues come down to something very specific from which the other issues stem, and I've found that ignoring the detail when engaging with people helps hugely in getting the result you want. So if the lack of a specific person means that you aren't engaging with the business, aren't getting the documentation and aren't getting the sort of clarity you need, then completely ignore those later points and just be specific: "lack of Bill = FAIL". If they want to get specific, just say "engagement, documentation, clarity: I can go into the specifics, but it all comes back to having Bill". Let the senior people ask for the detail; your job is just to provide the clarity.

If you don't provide that clarity, it's certain that the person you are talking to won't; after all, you are the expert in this area, so how are they meant to understand it better than you?

I've seen this on some SOA efforts, where people start saying things like "We need to reorganise the teams, set up a new procurement process, buy an ESB, get a rules engine, get finance engaged, agree on the KPIs" and the list goes on. The reality is that the first two items are the most important, and the rest will either drop out or be materially affected by them.

The problem is that engineers, and especially architects, like to "think holistically", which means "telling everyone all the problems"; in other words, there is a lack of filtering between the brain and the mouth.

So take that list of 20 "big issues" on your project and look at it. If you could fix just 2 (or at most 3), which would they be? Do they now seem quite a lot bigger than the rest of the list? Then go and be clear: "these mean FAIL".

When to switch to static

One of the things I always find funny is the "X scales" pitch; whether it's stateless EJBs, REST or anything else, it's always one of the magic phrases. Mainframes scale, really quite effectively; they handle some very impressive numbers. IBM's Blue Gene systems seem to scale pretty well too.

The claims of scaling in most of these things rest on being able to throw more tin at the problem. Often this ignores the fact that there is a large database behind the scenes where scaling is a bit trickier, and more expensive, unless they do something clever like Amazon's S3. The point, though, is that sometimes the unexpected happens, and then you have two options:

1) Scale to the peak that might occur in an exceptional circumstance
2) Prepare a static page for the exceptional circumstance

Sometimes, for instance if your website is how you handle customers in the exceptional case, you have to go for the peak. Often, however, it's just about getting information out.

As an example, the South East of the UK today was brought to a halt by the sort of snow levels that people in Boston would consider a "flurry" and that folks in Scandinavia would just shrug at and walk on. This brought lots of sites down; for instance, SouthEastern (my local rail company) had its site offline for most of the day.

What did they need to tell me? ALL TRAINS INTO LONDON ARE CANCELLED. But their dynamic site couldn't handle the load. Later they switched over to a PHP solution with a minimal (single) page, but that took a good half of the day.

This is why people should always think about the ultimate fail-over for their sites. Sure, you've scaled to some peak, but what if the worst happens and you get treble that peak? The answer is to switch to a file-based approach: load that file into memory and just serve it as fast as you can. It's amazing how many connections you can support when all you are doing is returning a single static page held in memory.
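
As a rough sketch of that "ultimate fail-over", here is a tiny server built on Python's standard-library `http.server` that reads one pre-prepared page into memory at start-up and returns the same bytes for every GET, whatever the path. The file name `emergency.html`, the port and the 60-second cache lifetime are hypothetical choices for illustration. `http.server` itself is fine for a sketch but isn't what you'd put in production; pointing a dedicated web server or CDN at the same static file is the more realistic version of the same idea.

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical path to the pre-prepared emergency page.
STATIC_PAGE_PATH = "emergency.html"

def load_page(path):
    """Read the emergency page once, at start-up, into memory."""
    with open(path, "rb") as f:
        return f.read()

def make_handler(body):
    """Build a handler class that returns the same in-memory bytes for every GET."""
    class StaticHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # No database, no templating: just the bytes already in memory.
            self.send_response(200)
            self.send_header("Content-Type", "text/html; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            # Let browsers and any caches in front of us reuse the page briefly.
            self.send_header("Cache-Control", "public, max-age=60")
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            # Skip per-request logging: under extreme load it is pure overhead.
            pass

    return StaticHandler

def make_server(body, host="0.0.0.0", port=8080):
    """Create (but don't start) a threaded server for the static page."""
    return ThreadingHTTPServer((host, port), make_handler(body))
```

Starting it is one line, `make_server(load_page(STATIC_PAGE_PATH)).serve_forever()`, and every request to any URL then gets the "all trains cancelled" page.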

Some people will say "scale to that extraordinary peak", but you know what? 99.99% of the people hitting the site were looking for the same single piece of information, and saying "normal service will be resumed once the snow has melted" would have been fine for the one random person planning to visit their aunt next June.

Handling failure conditions doesn't always mean that your site hasn't failed; it means that you've coped with that failure in a smart way.
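
The switch to static doesn't have to be a manual, half-day job. Below is a minimal sketch of one way to automate it, using a load-shedding technique swapped in for illustration (it is not what SouthEastern did). It assumes the dynamic page comes from some function (the hypothetical `render_dynamic`) and that you know roughly how many concurrent dynamic requests your back end can take (the hypothetical `max_dynamic` limit). Requests within capacity get the dynamic page; anything beyond it, or any request where the dynamic path throws, gets the pre-loaded static page immediately instead of queueing.

```python
import threading

class StaticFallback:
    """Serve dynamic content while capacity allows; otherwise serve a static page."""

    def __init__(self, render_dynamic, static_body, max_dynamic):
        self._render = render_dynamic      # hypothetical dynamic page builder
        self._static = static_body         # pre-loaded emergency page (bytes)
        self._slots = threading.BoundedSemaphore(max_dynamic)

    def handle(self, request):
        # Non-blocking: if every dynamic slot is busy, shed load at once.
        if not self._slots.acquire(blocking=False):
            return self._static
        try:
            return self._render(request)
        except Exception:
            # The dynamic path failed (e.g. a database timeout): degrade, don't error.
            return self._static
        finally:
            # Only reached when a slot was actually acquired above.
            self._slots.release()
```

In a real deployment the same decision often lives in the load balancer or web server configuration rather than in application code; the important part is that the fallback is decided in advance rather than improvised on the day.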