Humans in data centres: risk and energy reduction opportunities
31-07-2012 - John Hatcher
Human error is reported to be the cause of most failures in data centres, a finding repeated in studies of other industries. Human interaction is the common element in all cases and forms part of the ‘homo-technical system’. A very complex or highly automated design may make operation more difficult and increase the likelihood of misoperation. Even where there has been a large investment in redundant infrastructure, the facility remains vulnerable to how the operators act in a failure scenario. The testing and commissioning programme prior to handover is important for identifying design and installation errors. This process, culminating in the integrated systems tests, provides a unique opportunity to prove that the systems operate in the manner specified, including in a variety of rare operating modes, and to uncover incipient faults and latent defects. Part load testing can show whether facility efficiency meets the expected performance.
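As an illustration of the part load check mentioned above, the following is a minimal sketch only. It assumes that facility efficiency is expressed as power usage effectiveness (PUE, total facility power divided by IT power), which the article does not specify. The function names, the 5% tolerance and all figures are hypothetical and chosen purely for illustration.

```python
def pue(total_facility_kw: float, it_load_kw: float) -> float:
    """Power usage effectiveness: total facility power divided by IT power."""
    if it_load_kw <= 0:
        raise ValueError("IT load must be positive")
    return total_facility_kw / it_load_kw


def check_part_load(measurements, expected_pue, tolerance=0.05):
    """Compare measured PUE with the expected figure at each load step.

    measurements: {load_fraction: (total_facility_kw, it_load_kw)}
    expected_pue: {load_fraction: expected PUE at that load}
    tolerance:    allowed relative excess over the expected PUE (assumed 5%)
    Returns {load_fraction: (measured_pue, meets_expectation)}.
    """
    results = {}
    for fraction, (total_kw, it_kw) in measurements.items():
        measured = pue(total_kw, it_kw)
        limit = expected_pue[fraction] * (1 + tolerance)
        results[fraction] = (measured, measured <= limit)
    return results


if __name__ == "__main__":
    # Hypothetical readings taken during commissioning, not real data.
    readings = {0.25: (500.0, 250.0), 0.50: (800.0, 500.0)}
    design = {0.25: 1.8, 0.50: 1.6}
    for fraction, (measured, ok) in check_part_load(readings, design).items():
        status = "meets" if ok else "misses"
        print(f"{fraction:.0%} load: PUE {measured:.2f} {status} expectation")
```

Checking several load steps rather than full load alone matters because a facility may spend much of its life partly loaded, where fixed plant overheads weigh more heavily on efficiency.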
Operational performance is a function of both organisational and individual operator experience. A well-managed facility should have clear, accessible, up-to-date and well-practised procedures for both standard and emergency modes, carried out by staff with a thorough understanding of the systems they are responsible for and how these support the operation. The data centre industry faces a shortage of skilled operators, in part caused by its rapid growth.
When budgets are restricted there may be pressure to reduce maintenance spend. This can increase operational risk and may prove to be a false economy if the result is operational outages and reputational damage. Reliability forms part of the data centre total cost of ownership; however, the cost of failure is difficult to estimate due to its intangible and unpredictable nature. Failures normally result not from a single abnormal event but from a confluence of circumstances in which a series of contributory factors trigger the outcome. With hindsight, the causes may have been evident, but they went undetected or uncorrected without previous ill effect. By having the right checks and balances in place, problems can be caught early and dealt with, minimising impact and preventing escalation. Risk tends to reduce with experience; however, risk due to complacency can be significant in older facilities with a more experienced site team.
The organisational learning environment is important in allowing knowledge, understanding and experience to develop. It should be recognised that mistakes are inevitable and that failure can never be completely eliminated; however, by learning from mistakes at both an individual and an organisational level, reliability can be improved. A spirit of continuous improvement fosters a positive attitude towards learning, with an openness to embrace new concepts and different ways of doing things. A blame culture is detrimental as it encourages problems to be hidden and has a negative effect on staff morale. A major failure is often followed by the dismissal of staff (scapegoating); however, this may not achieve the intended effect. Whoever replaces them will not have the benefit of learning from the failure event, and introducing less experienced people at this stage will increase risk. When failure root causes are investigated, managerial decisions, rather than specific operator actions, are in many cases identified as significant contributing factors.
Training has an important role in improving operator skills and motivation; it helps give people ownership of the systems they look after and provides the business with confidence that the team will perform effectively under pressure. A better understanding of the critical infrastructure also has additional benefits, such as identifying system and plant optimisation opportunities and energy savings. This knowledge is especially valuable when cost pressures mean that legacy facility lifetimes are extended. There is often a perception that energy efficiency improvements cannot be implemented because they will compromise reliability; however, with a thorough technical awareness of system design and operation, appropriate changes can be managed which reduce both energy consumption and risk.
Developing a highly skilled workforce and sharing knowledge and learning between operators is an important challenge the data centre industry needs to address to improve operational performance and reduce total cost of ownership. There is a strong business case for investing in this area.