Plumbing Disaster Takes Out Dallas County, TX Systems

Posted by JD 06/08/2010 at 09:00

Your computer guys are always talking about some coming disaster and trying to get budget for a DR (Disaster Recovery) failover location. There never seems to be enough money. Heck, you may be lucky to get backups rotated to an offsite location.

A story in the Dallas Morning News, “Water-main break cripples Dallas County computers, operations,” explains what happened to take down the county computer systems.

They Were Lucky

The county was lucky. With this disaster they only lost power, not all their equipment. They dodged a bullet THIS TIME. Imagine if a tornado hit their building and it was completely destroyed. Computer parts would be spread all over a 10-block area. Applications would be down for weeks, and some would be down for months.

Future Plans

Two years earlier, eGov performed an assessment of the county IT infrastructure. That assessment pointed out many shortcomings, including the lack of a disaster recovery location. I hope a multi-purpose DR site is created, with the equipment placed there used not just for disaster recovery, but as a routine failover location for each application. Think of how Google services are placed in data centers around the world and, when one has a failure, the others pick up the requests. I’m not suggesting anything so elaborate for a county government. What I’m suggesting is that they have two data centers at least 50 miles apart and swap production systems between them as needed by the support teams. Weekly or monthly swaps would be ideal. Then the county commissioners and IT team would be able to sleep very well at night, knowing their DR plan works.
I’m not suggesting anything that I haven’t designed and implemented for fairly large-scale systems.
I’m not pretending it is free: each application swap needs perhaps 30 minutes of maintenance, and the networking is a little more complex. Still, it needs to be done.
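To make the routine-swap idea concrete, here is a minimal sketch in Python that reports whether each application answers at its primary and DR endpoints. The application names, hostnames, and ports are hypothetical placeholders, and a real check would verify application-level health and data replication, not just TCP reachability.

```python
#!/usr/bin/env python3
"""Minimal sketch: is each application reachable at both data centers?

The endpoints below are hypothetical placeholders; substitute your own.
"""
import socket

# Hypothetical: each app has a (host, port) at the primary and DR sites.
APPS = {
    "tax-records":  [("dc1.example.gov", 443), ("dc2.example.gov", 443)],
    "court-docket": [("dc1.example.gov", 8443), ("dc2.example.gov", 8443)],
}

def is_up(host, port, timeout=3):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for app, (primary, dr) in APPS.items():
    p_status = "UP" if is_up(*primary) else "DOWN"
    d_status = "UP" if is_up(*dr) else "DOWN"
    print(f"{app:15s} primary={p_status:4s} dr={d_status}")
```

Run something like this from a third location before and after every planned swap; if the DR side cannot even answer a connection, the monthly swap will catch it long before a real disaster does.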

How does your company look on DR planning? Ready, I hope.

Had Not Rebooted Some Systems in 4-5 Years!

Seriously? Are they crazy? What if something that was changed 3.5 years ago didn’t get changed correctly? By the time they are forced to reboot and test that change, the guy who made the change, the team that approved it, and all the consultants are gone. Even if they are still working there, they certainly do not remember what the change was. I’m all for long uptimes, but anything over a year is really pushing it. PC and midrange systems probably should never go longer than 3 months between reboots. Any longer period needs a very good reason and formal approval. Require the mayor’s or a corporate officer’s written approval on a document that outlines the risks. A key, legally accountable decision maker needs to be formally told of the issue. A quick script that flags overdue hosts is sketched after the list below.

  • For MS-Windows Servers, a monthly reboot cycle is the longest I’d allow. Weekly may be desired.
  • For UNIX/Linux Servers, a quarter is pushing it. Weekly may be desired.
  • For Mainframes, annual complete power cycles would be my initial suggestion, but with mainframes come the guys in white lab coats who take great notes to track all the system changes. That practice, along with the constant hardware health checks built into mainframe systems, reduces the risk.
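A quick way to enforce those intervals is a report that flags any host past its limit. The sketch below is Python with a hypothetical inventory; in practice you would pull the last-boot times from your monitoring system or from `who -b` / `uptime` output collected on each host.

```python
#!/usr/bin/env python3
"""Minimal sketch: flag hosts that have gone too long without a reboot.

The inventory is hypothetical; real last-boot dates would come from
monitoring or from 'who -b' / 'uptime' output collected per host.
"""
from datetime import date

# Longest reboot interval (days) per platform, from the list above.
POLICY_DAYS = {"windows": 31, "unix": 92, "mainframe": 366}

# Hypothetical inventory: hostname, platform, last reboot date.
HOSTS = [
    ("win-file01", "windows",   date(2010, 5, 2)),
    ("db-sun01",   "unix",      date(2009, 11, 15)),
    ("mf-prod",    "mainframe", date(2009, 8, 1)),
]

today = date.today()
for name, platform, last_reboot in HOSTS:
    days_up = (today - last_reboot).days
    limit = POLICY_DAYS[platform]
    flag = "OVERDUE" if days_up > limit else "ok"
    print(f"{name:12s} {platform:10s} {days_up:5d} days since reboot ({flag}, limit {limit})")
```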

Rebooting isn’t just to validate the shutdown and startup of the software. There are hardware tests that are only performed during a full power cycle. If you don’t reboot, a small, quickly fixed warning can turn into a not-gonna-boot issue later. A single issue can easily become 3-10 issues that are difficult to troubleshoot.

I’ve seen this with (XYZ) midrange systems and heard about it with many other vendors. The (XYZ) system I saw, SysA, had been running non-stop for about 2 years. I was not the admin for that box, but I had an identical server, SysB, for which I was the admin. SysB was patched quarterly and rebooted along with the other HP, Sun, Dell, SGI, and IBM systems (this was in the 1990s, so security patches weren’t as critical). For some reason, I became the admin on SysA, and after my first patch cycle, I dutifully rebooted it. SysA never came up again. We did not have a support contract at that company, so my budget to get it working was limited. We contacted a 3-letter support team (you know the name), who sent a hardware guy. A few days later, the system still did not boot. My boss decided we’d wasted enough time and money and told me to forget the system. I think multiple hardware failures were to blame, but since we never fixed it, there is no way to know. I heard a few years later from people still working at the company that the system never came back.

Reboot quarterly or more often for most systems. You won’t thank me until a small hardware fault happens and you can get it fixed relatively quickly. I know you won’t thank me when it happens at 2am during a maintenance period either, since most of you will be working another 7 hours trying to fix it yourself before calling the hardware support team. Still, you will be better off in the long term. Think of rebooting like going to the dentist every 6 months. Preventative care is best for small issues that can be fixed before they become big issues.

Want to Know More About Disaster Recovery Planning?

We’ve written a few articles here on Disaster Recovery Planning. Some are for a home, some for small businesses, and others for the largest companies and governments. Disaster Recovery Articles here.

The best way to ensure computer systems have a DR plan is to build it when the initial system is built, not to bolt a plan on later. Purchasing and installing new hardware for each new software system gets very expensive. Today, virtualization should be the first technology considered when building flexible, disaster-recovery-ready solutions.
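As a small illustration, here is a sketch that assumes a KVM host with the libvirt Python bindings installed (the connection URI is an assumption too). It simply lists the guests on one virtualization host, the inventory you would want replicated, or at least restorable, at the DR site.

```python
#!/usr/bin/env python3
"""Minimal sketch, assuming a KVM/libvirt host with the libvirt Python
bindings: list the guests that would need a home at the DR site."""
import libvirt

# Read-only connection to the local hypervisor; the URI is an assumption,
# adjust it for your environment.
conn = libvirt.openReadOnly("qemu:///system")

for dom in conn.listAllDomains():
    state = "running" if dom.isActive() else "shut off"
    # info() returns [state, maxMem(KiB), memory(KiB), vCPUs, cpuTime(ns)]
    _, max_mem_kib, _, vcpus, _ = dom.info()
    print(f"{dom.name():20s} {state:9s} {vcpus} vCPU  {max_mem_kib // 1024} MiB")

conn.close()
```

Because a guest is just files plus a little metadata, moving that list onto modest hardware at a second site is far cheaper than buying a dedicated box for every application.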
