Blame it on the data centre
03-07-2012 - John Hatcher
The data centre represents a multi-million-pound investment – typically around £15 million per MW of power – and requires significant coordination across multiple disciplines: real estate, mechanical and electrical engineering, and technology divisions. It is at this intersection that breakdowns often occur, as illustrated by the recent service interruptions at BT, RIM (BlackBerry), Amazon and Lloyds Banking Group, where outages were caused by site power, network and infrastructure failures.
Data centre downtime comes with a big price tag – $5,600 (£3,539) per minute according to Emerson – which explains why managing the complex array of dependencies spanning the physical build environment, the supporting mechanical and electrical infrastructure, and the technology platforms that uphold services or ‘applications’ should be an everyday priority.
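Emerson's per-minute figure makes the stakes easy to quantify. A minimal sketch – the outage durations below are illustrative; only the $5,600/minute rate comes from the article:

```python
# Cost of downtime at Emerson's estimate of $5,600 per minute.
COST_PER_MINUTE_USD = 5600

def outage_cost(minutes: float) -> float:
    """Return the estimated cost in USD of an outage of the given length."""
    return minutes * COST_PER_MINUTE_USD

# Illustrative outage durations (hypothetical, not from the article).
for minutes in (10, 90, 240):
    print(f"{minutes:>4} min outage -> ${outage_cost(minutes):,.0f}")
```

Even a 90-minute interruption, on this estimate, costs around half a million dollars – before any reputational damage is counted.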
Yet for many data centre operators, trends such as increasing compute density (more watts per sq ft, more kW per rack), network consolidation and convergence, storage growth, and the perpetual pursuit of more efficient power and cooling to satisfy the ‘green’ agenda are becoming distractions from the job in hand.
Designed to fail
Even more worrying, many organisations and their application teams implicitly rely on the design and resilience of the data centre to assure availability. Take the RIM outage, where the failure of a single switch triggered widespread disruption across the entire application suite.
Failing to consider the operational characteristics of each application or service represents wasted capital at the data centre design and build phase, and service interruptions are bound to occur. A better approach is to ensure that any outage – planned or unplanned – invokes ‘prescribed’ behaviours in the application or service so that no interruption is experienced by the user community. Sadly, however, this approach is rarely considered at the application or operational design stages.
On a positive note, in today’s application architecture the concepts of Active/Active (traffic intended for a failed node is passed to a surviving node or balanced across the remaining nodes) and Active/Passive (a fully redundant standby instance that takes over only when the active node fails) as a means to drive high availability and compensate for any data centre failure are becoming more commonplace.
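The two patterns can be sketched as follows – a toy illustration only, with invented class and node names, not a production failover implementation:

```python
# Toy sketch of the two high-availability patterns described above.
# Node names and the routing scheme are illustrative inventions.

class ActiveActiveCluster:
    """All healthy nodes serve traffic; a failed node's share is
    rebalanced across the survivors."""
    def __init__(self, nodes):
        self.healthy = list(nodes)

    def fail(self, node):
        self.healthy.remove(node)

    def route(self, request_id):
        # Simple hash-based balancing across the remaining nodes.
        return self.healthy[hash(request_id) % len(self.healthy)]


class ActivePassiveCluster:
    """One node serves all traffic; a fully redundant standby is
    promoted only when the active node fails."""
    def __init__(self, active, standby):
        self.active, self.standby = active, standby

    def fail(self, node):
        if node == self.active:
            self.active, self.standby = self.standby, None  # promote standby

    def route(self, request_id):
        return self.active
```

The trade-off is the familiar one: Active/Active keeps capacity working but requires every node to handle any request, while Active/Passive is simpler to reason about at the cost of idle redundant hardware.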
But a data centre is for life, so it’s disappointing that, once the initial design and build phase of a facility is over, many organisations are left with little more than a set of OEM manuals at handover – and little insight or guidance on the complexities of ensuring operational synchronicity across the entire data centre stack: applications, services, platform, physical infrastructure and the data centre premises/environment itself.
Clearly, there are highly significant planning and operational considerations involved in ensuring uptime, and finding a data centre provider that understands and applies these considerations at the ‘setting out’ phase is of equal, if not greater, importance than, say, selecting a partner purely on the basis of their PUE offering.
A different approach
Because all applications/services rely on platforms (servers or computers) that are dependent on physical infrastructure (network switches and the SAN), there is an absolute dependency on availability and performance that spans the entire data centre stack.
Which is why it is essential to ensure operational ‘mismatches’ don’t occur. And that depends on eliminating any point of failure at the data centre plant level and applying the same level of scrutiny to the programme and to the application owner’s operational use of the facility.
In infrastructure terms, this means that applications must adhere to carefully defined hosting rules. But understanding application affinity – which applications have affinity with each other and should be hosted within close proximity of one another – and application diversity – those which should not share any dependencies – requires a high level of domain skill.
Understanding capacity – the mix of servers, storage, network and SAN, and the capacities required to support each application, dependent on its operational state - is also key.
Finally, the operational context of placement – managing workload placement based on application affinity, diversity and capacity – and placing workload by requirement is essential.
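A placement check along these lines can be sketched as follows – the hall names, capacity figures and rule sets are hypothetical, chosen only to illustrate how the three concerns combine:

```python
# Hypothetical placement check combining the three concerns above:
# affinity (co-locate), diversity (separate), and capacity (fit).
# All names and figures below are invented for illustration.

HALL_CAPACITY_KW = {"hall-A": 500, "hall-B": 500}

AFFINITY = {("web", "cache")}                # pairs that should share a hall
DIVERSITY = {("db-primary", "db-replica")}   # pairs that must not

def _partner(app, pair):
    a, b = pair
    return b if app == a else a if app == b else None

def can_place(app, kw, hall, placements, used_kw):
    """placements: app -> hall; used_kw: hall -> kW already committed."""
    # Capacity rule: the workload must fit within the hall's power budget.
    if used_kw.get(hall, 0) + kw > HALL_CAPACITY_KW[hall]:
        return False
    # Diversity rule: these pairs must not share any dependencies.
    for pair in DIVERSITY:
        other = _partner(app, pair)
        if other and placements.get(other) == hall:
            return False
    # Affinity rule: these pairs should be hosted in close proximity.
    for pair in AFFINITY:
        other = _partner(app, pair)
        if other and other in placements and placements[other] != hall:
            return False
    return True
```

So, for example, with `db-primary` already in hall-A, the replica is rejected for hall-A on diversity grounds but accepted for hall-B – exactly the kind of rule a domain expert encodes once and then enforces on every placement decision.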
Getting physical and operational design right
Data centres are often conceived of as a single facility, measured in terms of space or power (kW). But, in order to ensure no single point of failure can impact application availability, the data centre should always be viewed in the context of space dependencies. Which means detailed spatial analysis and planning should be undertaken in parallel with the development of the application architecture strategy.
That means creating multiple data centre ‘halls’ within the data centre, surrounded by a fire-rated wall with dedicated MEP and detection/suppression systems. Each hall will contain domains, PODs, racks and clusters – all of which will be designed to eliminate any possible point of failure.
The way forward
The data centre industry is becoming distracted by current build and design trends – chasing the lowest PUE, developing ‘modular’ building blocks and improving time to market – and abandoning its focus on the ‘design, build, operate’ ecosystem that is so essential for delivering the five 9s availability target.
And that’s disappointing, because any failure to recover quickly from a service interruption in the data centre will put an organisation on the front pages of the Wall Street Journal or the Financial Times. Which is why organisations should ask themselves: ‘If a disaster occurred, how confident are we in our ability to recover – that critical data is protected and can be recovered in a timely manner?’
Because achieving five 9s availability depends on a dedicated focus on operational capability and extensive domain expertise to ensure that design, programme use, and operation all work in perfect harmony.
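And five 9s is a hard budget, not a slogan. The permitted downtime at each availability level is a straightforward calculation:

```python
# Downtime budget implied by an availability target (365-day year).
def downtime_per_year_minutes(availability: float) -> float:
    """Minutes of permitted downtime per year at the given availability."""
    return 365 * 24 * 60 * (1 - availability)

for nines, availability in ((3, 0.999), (4, 0.9999), (5, 0.99999)):
    print(f"{nines} nines: {downtime_per_year_minutes(availability):.2f} min/year")
```

At five 9s the entire stack – plant, platform and applications together – is allowed roughly five and a quarter minutes of downtime a year, which is why every layer of the ‘design, build, operate’ ecosystem has to hold up at once.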