Tuesday, April 26, 2011

The forecast is briefly clear followed by heavy cloud


The weather in England has been unusually sunny for April, with not a cloud in the sky, but it was not the only place lacking clouds last week: for 36 hours the cloud services giant Amazon disappeared from the web (see here).

I know what will happen now, as I've seen it before with managers who found access to Google mail was offline for a few hours a couple of years ago.  They will state in their old-school IT manager way that this just proves the cloud to be unreliable: "we are safer to have the equipment in our own server room because then we have control".

I say they are wrong.  I've worked in IT for a long time and in that time have seen a lot of major outages, and they almost always take more than a couple of hours to fix, often multiple days.  This assertion is particularly true when it is the loss of a service such as GMail; I can guarantee that in most companies corporate email services are unavailable more often than that.  Corporates often have to do work on particular sites or servers that requires the equipment to be taken offline, but even if the servers did run continuously there would be people who lose connectivity to the servers, and this would be perceived as an email outage.  Not only that, can people really not live without email for a couple of hours?  Personally I did not see the GMail outage at all, probably because I was in a meeting or something for its whole duration.

The Amazon outage is more serious.  Many businesses have used these facilities to build online ordering sites vital to their business, and there is no doubt that shutting up shop for 36 hours will have hurt them.  At this point, though, we must consider the costs of running these sites the old-fashioned way.  We not only need to put up enough servers to run the maximum load, we must duplicate this at one or more other sites, possibly even across regions.  This will require investment in infrastructure and communications, not to mention a tribble load of techies in each location to keep it all running sweetly.  None of this will guarantee correct set-up and the avoidance of an outage at some future point; all it will do is ensure you are in control of that outage and entirely culpable.

Compare this to what has actually happened with the Amazon outage.  Users are pretty tolerant of web site failures and most will simply return the following day to place their order.  This is especially true when it wasn't the fault of the vendor to whom they are loyal but of some corporation on which that vendor relies.  Indeed, it's now less than a week since the issue and yet there are no stories of it in the IT press; it has been forgotten and consigned to the past.  The biggest loser in all of this will have been Amazon, who will not only have lost sales but will also probably have to compensate some of those sites against SLAs, a good incentive to track down the issue and make sure it never happens again.

I don't believe any of those companies will be planning to build their own internal infrastructure to avoid this rare occurrence, though some of the more affluent ones may consider implementing their sites using multiple cloud services.

Remember, errors are not completely undesirable: they are necessary to progress, and as long as they are not too major or too frequent they make us stronger.  If they become either of these for Amazon, people will move to another service, and Amazon know it.  Not only that, all of their competitors know it and will be making sure they too learn from it, and through the one incident the whole cloud market becomes stronger.
