Downtime in your data center can be costly. But failing to adequately maintain your facility means you’re in for unexpected downtime—which can be much worse than planned downtime. If your data center receives much lighter traffic at certain times of day or certain times of the year, scheduling a service break during those off hours is one possibility. For instance, if product sales are virtually nonexistent during late-night and early-morning hours, those times are a good opportunity to put operations on hold while you perform needed maintenance or other critical work.
But what if your facility hosts business transactions or provides services steadily, 24 hours a day and year round? In these cases, even short discontinuities in resource availability can annoy customers and drive business to competitors. When “always on” service is a business requirement, maintenance and other critical work on the data center cannot disrupt normal operations. Performing this work on a live data center, depending on the scope of the tasks at hand, can be a tremendous challenge. What if, for example, you need to upgrade or repair your uninterruptible power supply (UPS) deployment?
Working on a live data center takes careful planning and more than a little courage. The following are some tips to help reduce the risks associated with this kind of critical work and to keep IT resources available to both internal and external customers throughout the process.
Plan Ahead or Fail
More than anything else, the key to managing critical work in a live data center is planning. No one in his or her right mind starts replacing UPSs, for instance, without first either shutting everything down or, if the data center must keep running, carefully reviewing the potential contingencies. Planning, however, is more than just a matter of scheduling: it requires a comprehensive strategy for dealing with even the unexpected.
The planning phase should accept input from all parties that could be affected, particularly should the work run into complications. Generally, the more isolated or peripheral the system, the lower the risk that a failure or other complication will affect a wide swath of the company, its customers and its contractors. If you’re planning maintenance or an upgrade for a critical central system, such as the UPS, the consequences of an error or unexpected event are much broader.
Scheduling for critical work on a live data center should take into account the availability of all relevant parties. If a particular contingency necessitates bringing in an electrical contractor, for instance, ensure that the contractor is either present or available at the time of the work. Data center managers must “work with the facilities departments to coordinate the maintenance schedules for the supporting infrastructure with their asset-deployment activities,” said Kevin Lemke, Product Line Manager of Row and Small Systems Cooling for Schneider Electric’s IT Business. In addition, ensure that your schedule leaves enough padding to allow for unexpected delays. Cramming successive stages too closely together can jeopardize the entire project; for example, a delay at one stage could push a subsequent stage out to the point that a contractor involved in the effort becomes unavailable. Data center managers should give their employees credit for their competence—as well as some leeway for the unexpected.
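As a concrete illustration of that padding, consider a sketch like the following (the stages, durations and times are hypothetical, not drawn from any real project): it walks a worst-case schedule forward and flags any stage that could outrun a contractor’s availability window.

```python
from datetime import datetime, timedelta

# Hypothetical stages: (name, planned hours, padding hours for the unexpected)
stages = [
    ("isolate UPS module", 2, 1),
    ("swap batteries", 4, 2),
    ("contractor re-terminates feed", 3, 1),  # requires the electrical contractor
    ("load test and return to service", 2, 1),
]

start = datetime(2014, 8, 10, 1, 0)              # overnight window, lightest traffic
contractor_leaves = datetime(2014, 8, 10, 9, 0)  # contractor available until 09:00

elapsed_worst = timedelta(0)
for name, planned, padding in stages:
    elapsed_worst += timedelta(hours=planned + padding)
    finish = start + elapsed_worst
    print(f"{name}: worst-case finish {finish:%H:%M}")
    if "contractor" in name and finish > contractor_leaves:
        print("  WARNING: cumulative delays could outrun the contractor's availability")
```

Even this toy version makes the point: padding accumulates across stages, so a stage that looks safely inside a contractor’s window on paper may not be once earlier delays stack up.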
Extensive planning for critical work is a requisite for consistent success. Ideally, however, preparing for critical work on a live data center should go beyond one-time planning: it should begin at the design phase.
Designing for Live Data Center Upgrades and Maintenance
If the system you’re upgrading or repairing is a single point of failure in your data center, a live fix is all but impossible. Thus, this kind of critical work is most feasible in cases where the design phase of the facility looks ahead to maintaining uptime even when this work is performed. Victor Garcia, Director of Facilities at Brocade, notes that to maintain or upgrade a live data center, it “has to be designed for and planned in advance. Depending on the level of uptime required, either N+1, N+2 or 2N designs need to be incorporated into the plans and operations so that uptime can be achieved while performing maintenance.” This redundancy is critical: not only does it avoid single points of failure, which are a bane of data center uptime, but it enables live maintenance. Replacing a redundant UPS while keeping the facility running, for instance, is far easier than doing so when you have just a single UPS!
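To make that arithmetic concrete, here is a minimal sketch (with made-up module ratings and loads) of why redundancy enables live maintenance: it checks whether the remaining UPS modules can still carry the critical load once one module is pulled for service.

```python
import math

def modules_installed(load_kw, module_kw, redundancy):
    """Modules needed to carry the load (N) plus the redundancy margin."""
    n = math.ceil(load_kw / module_kw)
    extra = {"N": 0, "N+1": 1, "N+2": 2, "2N": n}[redundancy]
    return n + extra

def survives_live_swap(load_kw, module_kw, installed):
    """Can the remaining modules carry the load with one taken offline?"""
    return (installed - 1) * module_kw >= load_kw

load_kw, module_kw = 450, 200  # hypothetical critical load and module rating
for design in ("N", "N+1", "N+2", "2N"):
    installed = modules_installed(load_kw, module_kw, design)
    ok = survives_live_swap(load_kw, module_kw, installed)
    print(f"{design}: {installed} modules installed, live swap {'OK' if ok else 'NOT safe'}")
```

Under an N design the check fails: pull one module and the remainder can no longer carry the load, so there is no way to service it without a shutdown.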
In addition to the initial design, tracking changes made to systems is critical. Despite extensive planning, critical work can land in serious jeopardy if the configuration the plans assume turns out to differ from what’s actually in place because no one kept adequate records over time. “From an operational perspective, having a change-control process where any changes—whether IT or facilities related—are reviewed cross-functionally to ensure that none of them put the data center at risk,” said Garcia. Some changes, however, are inherently risky; for those, Garcia recommends mitigation or contingency plans to prevent downtime once the work starts.
Unfortunately, even if you’ve implemented a change-management policy, prudence demands verification of relevant system configurations before you begin critical work. Although doing so may cost some extra time and effort beforehand, you must weigh that perhaps unnecessary effort against the costs that your business might incur should you run into unexpected conditions once work begins. And because critical work requires careful scheduling, such an occurrence can easily throw off large chunks of the schedule, potentially ruining the entire plan.
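One lightweight way to do that verification is to diff the configuration your plan assumes against what an on-site audit actually finds before any work begins. The sketch below is purely illustrative; the inventory format and asset names are invented, standing in for, say, a DCIM export on one side and a walk-through audit on the other.

```python
# Hypothetical inventories: asset tag -> (location, power feed)
documented = {
    "ups-01": ("row A", "feed 1"),
    "ups-02": ("row A", "feed 2"),
    "pdu-07": ("row C", "feed 1"),
}
audited = {
    "ups-01": ("row A", "feed 1"),
    "ups-02": ("row B", "feed 2"),  # moved at some point, never recorded
    "pdu-09": ("row C", "feed 1"),  # swapped out, never recorded
}

# Flag drift before the work starts, not after it goes wrong.
for tag in sorted(documented.keys() | audited.keys()):
    doc, act = documented.get(tag), audited.get(tag)
    if doc != act:
        print(f"DRIFT {tag}: records say {doc}, audit found {act}")
```

Any line this prints is a question to resolve, and possibly a schedule to revise, before the first cable is touched.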
Consider Peripheral Effects on Operations
Electricity, airflow and network connectivity are critical concerns when performing maintenance on a live data center, but data center managers should be careful to remember other, more indirect ways in which their work might affect operations. If the work involves some kind of physical construction, for example, Swedish construction company Skanska recommends sealing the work area with plastic to prevent dust from reaching IT and other sensitive equipment. Furthermore, depending on the location, workers may need to wear booties or other coverings to prevent dust from hitching a ride into the data center proper.
In addition, temporary rearrangements of equipment or the presence of certain gear, if large enough, can cause changes in the normal airflow of the data center. The result can be dangerous hot spots that could lead to system failures. Depending on the scope and budget of the project, one option is to employ computational fluid dynamics (CFD) to model the airflow. Although CFD may not be practical for intermediate stages of the work, it can deliver solid returns when applied during the planning phase, particularly if the work involves a new cooling system or otherwise alters airflow dynamics, such as through a rearrangement of server rows.
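Short of full CFD, a first-order sanity check can still catch an obvious problem at an intermediate stage. The sketch below uses the common sensible-heat rule of thumb for air (CFM ≈ 3.16 × watts ÷ ΔT in °F); the rack loads and cooling capacity are hypothetical, and this kind of estimate is a screening tool, not a substitute for a proper CFD model.

```python
def required_cfm(load_watts, delta_t_f=20):
    """Airflow needed to remove a heat load at a given air temperature rise.

    Rule of thumb from the sensible-heat equation for air at sea level:
    CFM = 3.16 * watts / delta_T (degrees F).
    """
    return 3.16 * load_watts / delta_t_f

# Hypothetical: a temporary rearrangement parks two extra racks in row D.
racks_w = {"D1": 6000, "D2 (temporary)": 8000, "D3 (temporary)": 8000}
available_cfm = 3000  # what the row's cooling can currently deliver

needed = sum(required_cfm(w) for w in racks_w.values())
print(f"Needed: {needed:.0f} CFM, available: {available_cfm} CFM")
if needed > available_cfm:
    print("Potential hot spot: rethink the layout or add spot cooling")
```

If even this crude check fails, the temporary layout needs rework before any modeling budget is spent.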
Practice Where Possible
As far as possible, and practical given the associated costs in employee time and so forth, conduct a practice run of your plan. The more critical the systems you’re working on, the more beneficial practice can be in avoiding downtime. In addition to identifying potential trouble spots that you might not have considered or might otherwise have been unaware of, a dry run can help employees and other involved parties gain confidence in what they’ll be doing. It will also help data center managers govern the process more smoothly.
Safety
Uptime should always be second to employee and contractor safety. A dangerous shortcut might rescue a slipping schedule or even the entire project, but it can also put lives at risk. Practically speaking, a serious injury or death in the data center is likely to cause more trouble than some downtime caused by a problem that arises during the project. From a more compassionate perspective, it’s better to lose some business than to risk the lives of those working in the data center.
Safety considerations may not improve uptime, but they can improve morale and encourage responsibility among employees and data center managers alike. They can also help avoid regulatory hassles. Of course, there’s always a balance to be struck: it’s easy to go overboard with safety to the point of foolishness. Usually, however, a data center manager with some common sense will be able to identify areas of critical work on the live facility that require more care than others.
Learn From Past Experiences
Maybe your last effort at maintenance led to downtime. The only unforgivable failure is the failure to learn from the experience. If you’ve needed to perform live maintenance or upgrades in the past, you’ll need to do them again in the future. Even if your last effort wasn’t a resounding success, you can glean information from it that will help you avoid similar difficulties in future projects.
In addition, experience gained during maintenance projects, whether successful or not, creates opportunities to prepare for future projects. For instance, a data center manager might consider implementing a change-management policy to keep better track of the equipment configurations in the facility.
Conclusions
The central facet of any project involving critical work on a live data center should be planning. The more detailed the plan, and the more it accounts for contingencies likely to arise during the project, the more likely staff and contractors are to execute it successfully.
Apart from simply planning ahead of particular projects, however, companies should plan from the very start: the design phase of the data center. Appropriate redundancy in critical systems not only avoids single points of failure, it enables maintenance and upgrades while the data center is still running. Brocade’s Victor Garcia suggests, “From a design standpoint, future-proof your design to the next level by thinking through each discipline: what if you had to provide one more level of redundancy or what if your densities or number of racks had to increase, which increases your total system load. Make sure you can add an extra set of equipment from a space perspective, and being able to tie it into the distribution system without interruptions, for example installing maintenance bypass switches, bypass valves or isolation valves, putting in a tie breaker or simply reserving space in the mechanical or electrical room for expansion capabilities.”
This kind of planning and ongoing awareness of data center design and infrastructure not only enables scalability, it enables critical work that doesn’t interfere with uptime. If your customers, whether internal or external, demand always-on access to IT resources, you can expect to face live maintenance and upgrade projects. By taking some steps beforehand—including but not limited to detailed planning—you can avoid the high costs of downtime while improving your data center.