Buyer's Guides

Fault tolerance, “always on” computing, zero downtime – what does it all mean?

Sunday, 1 May 2011

More and more companies are striving for 24/7 uptime, or 'always on' computing. But how do they achieve that? Here’s how it’s done, according to 24/7 Uptime Ltd.

"We keep any business/mission critical applications that run on a Windows environment up and running in the event of a component failure, a complete server failure, or a complete site outage. “So what?” you might say – there are many ways to do this. But you’d be wrong. There are many solutions that react to failures, but what makes our solutions different is that rather than reacting to outages, we actually prevent them. We achieve this by handling each type of failure (component, complete host, or full site) in the most appropriate manner. The best way to describe how we do this is by example…..

"Our simplest configuration is two servers sitting next to each other in the same server room or data centre, and working together to share components and resources. In other words, they are both running as production servers, but protecting each other from any outages. This is much different than older reactive technologies that have a single main production server and a standby server ready for a cumbersome failover procedure, typically incurring some downtime when a problem strikes. In our solution, if a server component fails (example complete disk array goes down, or network card fails) – then the load is instantly assumed by the disk array or network card of a connected server. This is called component fault tolerance – the result is zero downtime.

"Second example – if a complete server fails, the connected server instantly takes over – this is called system fault tolerance – the result, again, is zero downtime.

"Finally, the third example – (this is relevant when we introduce our off site disaster recovery module to the 2 server configuration) the complete server room has a long term power outage, or a fire hits the main site – the whole environment fails over to a standby server at a second site, connected by fibre, WAN extension, etc. This is extended disaster recovery and is the only condition where we would invoke a more traditional reactive / failover technology – the result is maybe 2 or 3 minutes of downtime. However, this is only ever invoked when a complete disaster strikes and hence is typically acceptable to even the most demanding of environments.

"All of the above is achieved by a simple software solution on standard HP / Dell etc hardware – very easy to manage and without the need for any shared storage.

"Since each failure is treated in the most appropriate manner – failover to the D/R site would only be invoked when a true disaster scenario strikes. This would hopefully be a once in a lifetime event – if at all. Local failures such as component or server problems are treated as local issues. Why would you want to invoke a full D/R failover to a second standby site if a disk fails or a network card goes down? Other solutions do just that – we keep it real and treat minor outages as minor problems all with zero downtime.

"Many organisations use reactive failover technology that requires shared storage. It’s costly, it’s complicated to manage, it regularly fails and it does not create a zero downtime computing environment. Our solution simplifies the network, treats outages in the most appropriate way and keeps the environment up and running through any type of failure in the most appropriate manner.

"This solution is now being used by many organisations such as local government, integrated system providers, manufacturing companies, security system providers, access system companies, finance and banking organisations, on line gaming companies etc. In other words, companies who like the idea of an ‘always on’ environment."
