ITD improves and automates supporting systems to strengthen Data Center

In the winter of 2015, a plan began to form to provide a more robust support infrastructure for ITD's Data Center. Although the Data Center had its share of protections and provisions, the systems that fed into it, namely power and water, were aging, insufficient and poorly supervised. The Data Center was frequently exposed to downtime, and by industry standards, eliminating even one unscheduled Data Center crash was calculated to save nearly $3.8 million.

These calculations, from ETS Infrastructure Manager Pete Palacios, account for damage to mission-critical data, the impact of downtime on organizational productivity, damage to equipment, legal and regulatory repercussions, and lost confidence and trust amongst key stakeholders.

What began as a simple plan to "make it better" turned into a series of upgrades that drastically improved the dependability of the infrastructure and, in turn, the Data Center. In a collaborative effort with the Building Services Team and Enterprise Technology Services, work was scheduled and carried out to deliver the needed improvements with minimal impact on the user community.

According to Palacios, the series of improvements includes:

Alarm-point improvements – In an environment where minutes count, we wanted to place our alarm connections as near to the initial point of failure as possible. Setting an alarm on the cooling water supply rather than the room temperature provided notification 30-40 minutes before the system point of failure.

Real-time notification – Communication improvement was critical. In areas where we had pull communication (the end user must go get the information), we made efforts to upgrade or modify to push communication (the system pushes a notification to the end user via email, text or visual alert). In critical areas where we needed to guarantee a response, we added an interactive communication system (one that requires a response from the end user). These improvements were accomplished mostly through simple changes to the current Building Automation System (BAS) and the addition of an offsite monitoring service for critical alarm points.

Prevention and Recovery – We examined the support infrastructure through two lenses: prevention and recovery. For prevention, we asked, "Is it possible to eliminate or reduce the susceptibility to failure?" For recovery, in the areas where we have redundancy, we asked, "Can we initiate the recovery process utilizing the BAS?"

Trend data – Finally, we don't know what we don't measure. Utilizing the systems currently in place, we have compiled trend data to measure the overall health of the system as a whole. Measuring city and generator power quality, water temperatures, humidity, periodic maintenance records, and recorded alarms will provide critical information for the short-term and long-term health of the Data Center.
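
For readers who think in code, the alarm-point and notification ideas above can be illustrated with a minimal Python sketch. It is not ITD's BAS configuration: the point names, temperature limits, and action labels below are hypothetical and chosen only for illustration. It shows how an alarm placed upstream (on the cooling water supply) can trip while the room temperature still reads normal, and how each alarm point can be assigned a pull, push or interactive communication mode.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Tuple


class Mode(Enum):
    PULL = "pull"                # end user must go get the information
    PUSH = "push"                # system sends email, text or visual alert
    INTERACTIVE = "interactive"  # system requires a response from the end user


@dataclass(frozen=True)
class AlarmPoint:
    name: str
    high_limit: float  # hypothetical limit, in the same units as the reading
    mode: Mode


# Hypothetical action labels for each communication mode.
ACTIONS = {
    Mode.PULL: "record_only",
    Mode.PUSH: "send_notification",
    Mode.INTERACTIVE: "send_notification_and_await_ack",
}


def triggered_alarms(
    points: List[AlarmPoint], readings: Dict[str, float]
) -> List[Tuple[str, str]]:
    """Return (point name, action) for every reading above its limit."""
    alarms = []
    for point in points:
        reading = readings.get(point.name)
        if reading is not None and reading > point.high_limit:
            alarms.append((point.name, ACTIONS[point.mode]))
    return alarms


# Assumed example points: the upstream water alarm is treated as critical.
POINTS = [
    AlarmPoint("cooling_water_supply_temp_f", 55.0, Mode.INTERACTIVE),
    AlarmPoint("room_temp_f", 80.0, Mode.PUSH),
]

# Cooling water has drifted high, but the room has not warmed up yet.
early = triggered_alarms(
    POINTS, {"cooling_water_supply_temp_f": 58.0, "room_temp_f": 72.0}
)
print(early)
```

In this sketch only the cooling water point alarms, which mirrors the idea of being notified before the room itself reaches a failure condition. The same readings could also be appended to a log over time to build the kind of trend data described above.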

Palacios said the process was initiated more than 18 months ago.

"During the last 12 months, we experienced 10 independent events that historically would have placed us at a significantly elevated risk for a critical event," Palacios explained.

"In an industry driven by data, strengthening our access and bolstering our safeguards is just good business sense," he added.


Published 09-02-16