Top 9 _Ooops_ Moments

Posted by JD 07/27/2010 at 10:30

Below are a few incidents that I’m personally aware of which impacted a few different projects. They range from my personal desktop, to production dispatching systems with 20K+ users, to space shuttle launch data.

People like Top 10 Lists, but I could think of only 9 near disasters. Perhaps something interesting will happen this week? ;)

Ooops – beep, beep, beep ….

  • What happened: I heard our network admin yell from his cave (non-raised floor area of the lab), “Oooops! Sorry Guys!”
  • Why it happened: At around 10am, the admin accidentally ran a reboot-all-systems command. If you have just a few systems, that isn’t so bad. This command rebooted every system inside the entire building, about 100 of them, with people actively working. Ever heard the phrase save early and often? I’ve lived by that advice ever since. The beeps of many, many systems (maybe 100 workstations and servers) going down, and the flicker of system after system dropping off, is something to remember. Hi Ron in the STL!
  • The Fix: None, short of beating the net-admin about the face. We did nothing but go get some coffee.

Who Needs Root Anyway?

  • What happened: The root password on a brand-new, 4-hour-old SPARC system was changed to an unknown value. The person who changed it couldn’t reproduce it.
  • Why it happened: At the company in question, only a few people had UNIX experience and most of them had not been on a UNIX system in years. I was hired, in part, to help them build cross-platform software. I was new to the company and didn’t want to be the administrator for any machines.
  • The Fix: I hacked into the SPARC system (physical access IS everything) and took control of the root password; a rough sketch of that kind of recovery is below. I refused to give it to anyone else, since it was clear that mistakes could happen. I became the UNIX server administrator for that company in addition to my other roles. After a few months, I shared the root password with another admin person AND we placed the critical passwords for all equipment into a sealed envelope and handed that to the VP of Development for safe keeping.
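
For anyone curious, this is roughly what physical-access recovery looks like on a Solaris-era SPARC box. It’s a minimal sketch, not the exact steps from that day, and the disk device name is just an example.

    # Stop-A at the physical console drops to the OpenBoot PROM prompt, then:
    #   ok boot cdrom -s              (single-user boot from the install media)

    # Mount the root slice of the internal disk (example device name):
    mount /dev/dsk/c0t0d0s0 /a

    # Blank the root password hash (second field of the root entry):
    vi /a/etc/shadow                  # root:XXXXXXXX:...  ->  root::...

    umount /a
    reboot                            # boot from disk, log in as root, run passwd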

I’d say the name of the person, but that is too specific. I’ll just say – he was both a guru and a swami.

My Files Are Gone!

  • What happened: Disk space was getting tight on a SPARC system. I wrote a small script to find all web browser cache files for every user on the system and remove them. My small script found every file on the machine, including /, and proceeded to wipe the disk of everything. SWEET!
  • Why it happened: A find clean-up script running as root found more than I really wanted. It found ./, which happened to be the / directory, and started running rm -rf /, wiping out the entire operating system and every other directory on the main development Solaris machine for my company.
  • The Fix: The day before, I had created a dump 0 tape. Talk about blind luck. I’d never restored a system, and without a running system or man pages, restoration was a mystery. I drove to the local big-box-computer-centric store and purchased a general UNIX administration book for $20. Two hours later, the system was back up and running, and I learned to check find results with the -print option before adding anything destructive (a safer version of that clean-up is sketched below). That was close. Backups became a priority from that point on.
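
Here is the shape of that clean-up done the careful way. It’s only a sketch under a couple of assumptions: home directories live under /export/home and the browsers of the day kept their files in a directory literally named cache; adjust both to taste.

    # Step 1: print only. Read the whole list before anything gets deleted.
    find /export/home -type d -name cache -print

    # Step 2: only when the list looks right, add the destructive part.
    # -prune keeps find from descending into directories it is about to remove.
    find /export/home -type d -name cache -prune -exec rm -rf {} \;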

This was me – all me.

Upgrade – Sure.

  • What happened: The OS was removed from an Ubuntu machine by an rsync script meant to create a mirror/backup of the system on another disk.
  • Why it happened: These days I like to mirror or backup a system prior to performing any upgrades. I wrote an rsync script that handled this efficiently over 10 yrs ago and it normally works exactly as desired. This time I confused the source and the target directories AND included the --delete-before option. Effectively, that caused the script to run rm -rf /. (A corrected sketch of the idea is below.)
  • The Fix: I was performing the backup just prior to upgrading the system from one LTS Ubuntu release to another. There was an offline mirror backup of the old machine from about a month earlier, but nothing current. A fresh install was performed, then specific files were copied from the older backup to the new machine and programs were manually loaded via Synaptic over the next few days.
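
This is a minimal sketch of that kind of mirror script with the guard rails I use now, not the original script. /mnt/mirror is an example mount point and the exclude list is abbreviated.

    #!/bin/sh
    # Mirror the live system to a second disk. Source first, target second;
    # getting those backwards plus --delete is exactly the rm -rf / above.
    SRC="/"
    DST="/mnt/mirror"

    # Refuse to run unless the backup disk is really mounted at $DST.
    mountpoint -q "$DST" || { echo "$DST is not mounted; aborting" >&2; exit 1; }

    # Rehearse with -n (--dry-run) so --delete shows what it WOULD remove.
    rsync -aAXHn --delete \
        --exclude='/dev/*' --exclude='/proc/*' --exclude='/sys/*' \
        --exclude='/tmp/*' --exclude='/run/*' --exclude='/mnt/*' \
        "$SRC" "$DST"

    # Drop the -n and re-run only after reviewing that output.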

This was me – all me.

Missing CPU

  • What happened: An IBM AIX system was transported between 2 offices about 4 hours apart. At the new office, some software refused to start.
  • Why it happened: After hours and hours of troubleshooting, professional support was called. They determined that one CPU was missing from the server. The machine was brought down and all the CPUs were reseated. At reboot all the software started up as expected. One of the CPUs had vibrated out of the socket during transport. That CPU was the primary for licensing.
  • The Fix: Avoid software licenses that are tied to specific CPUs. Use network licenses whenever possible, or software that isn’t licensed that way.

I was The Transporter here. That server barely fit into my hatchback.

Just Four Easy Commands

  • What happened: After a company-wide IP address change (before NAT existed), one UNIX server never booted again.
  • Why it happened: I spent weeks planning for an IP re-addressing project at our small company. DHCP and NAT didn’t exist back then, so every network device needed to have its IP address changed by hand. That was about 100 PCs, a few Windows Servers and 5-10 UNIX servers. I spent a few hours carefully creating sed and cp scripts to modify all the necessary files related to the IP address change for each impacted system (a sketch of the idea follows below). My scripts would require 4 commands to perform this change across all the UNIX machines inside the company. It would take less than 10 seconds to run, then automatically reboot each system to bring it up clean with a new IP. The Windows admin was jealous that it would be so simple under UNIX. This was forever known as the 4 Easy Commands Failure. When the time came, late on a Friday afternoon, we announced to the office that the network change was happening and that most people should just leave for the day. The Windows guy started his changes, manually pointing and clicking across the servers, then he was onto the 100+ desktops. I logged into the main control server, typed 1, 2, 3, 4 commands, and watched as the xterms previously connected to each UNIX machine disappeared when the final reboot in the script happened. Life seemed good. Then each machine started coming back up on the network, as expected. Sun-enchilada, HP-UX-fajita, AIX-burrito, AIX-guacamole, Irix, Linux (can’t recall all the hostnames anymore, but they were Mexican foods) …. where’s the other AIX box? It never found a boot sector. We called support, but the contract had expired. We called EDS (cheaper support) and they sent a HW engineer who spent 3 days working the issue. The drive was fine. The RAM, CPU, MB, controller, network, BIOS were all fine. Still it wouldn’t boot. That machine was maintained by a different team – ok, it wasn’t maintained at all and hadn’t been rebooted in over a year. It never came up, and after a week my boss told me to stop screwing around and get back to work on my development job.
  • The Fix: None. Unrelated to this issue, I had decided it was time to change jobs prior to another company buying us out. About 2 months later, I was working someplace else, but still maintained contact with my old friends. 6 months later, I heard the machine still had never booted.
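
The original scripts are long gone, so this is only a hedged sketch of the approach. The addresses are made up, and the file list varies by UNIX flavor (/etc/hosts everywhere, plus files like /etc/defaultrouter and /etc/netmasks on Solaris).

    #!/bin/sh
    # Swap an old IP address for a new one in the usual network config files,
    # keeping a copy of each file first. Run as root on each UNIX host.
    OLD="192.168.1.25"        # made-up addresses; the unescaped dots are
    NEW="10.10.4.25"          # close enough for a sketch

    for f in /etc/hosts /etc/defaultrouter /etc/netmasks; do
        [ -f "$f" ] || continue
        cp -p "$f" "$f.pre-readdress"                  # keep the original
        sed "s/$OLD/$NEW/g" "$f.pre-readdress" > "$f"
    done

    grep -n "$NEW" /etc/hosts    # sanity check before the reboot
    # reboot                     # the real scripts rebooted here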

This was me – all me.

To this day, friends ask, “How many easy commands was that?” Once, a new person that I hadn’t met before heard my name and asked, “What are those special commands that I need to avoid?” This was years later and at different companies.

Failover during Ascent

  • What happened: Every flight controller workstation in the JSC Mission Control Center in Houston became unresponsive for about 45 seconds during a space shuttle launch. The flight control team had ZERO telemetry data during a live launch. This was bad.
  • Why it happened: At the time, a new mission control center system architecture had just been completed and was in production. This was an early launch after other launches had already proven the new systems worked as designed. Every MCC around the world had been replaced with 1990s 64-bit systems, removing the 1970s systems. Only the UNIX boot was local to each server-class workstation (about 600 of them in the MCC/FCR at JSC), with the rest of the software, data, and configuration files served from a HUGE NFS server over the network. Accessing files on the NFS server was faster than accessing local files on the workstations, so this was a very good design AND implementation. The NFS server was so important to the function of the MCC that a failover NFS server had been deployed too. Each of those servers took up a long wall of a fairly large room – about 20 feet in length. On a quiet network, a failover took about 30 seconds, including the 10 second timeout to determine a failover was needed, but on a busy network with very busy disks it took about 45 seconds. A failover was triggered by accident during a space shuttle launch. I was doing my job during ascent, babysitting my project’s server in the data center room of Building 30. A few other administrators were in there too, watching their server performance closely on those consoles. It was not the best place to be during a launch, since it is like most other data centers around the world – cold and noisy, with no outside data access or graphical displays. The technical architect contractor who designed and implemented the NFS server systems was there too. He was mature, about 55 yrs old, and his eyesight wasn’t working so well. The location of the console relative to his view was less than ideal, so he physically grabbed the primary NFS server screen and pulled it about 4 inches closer so he could read it more easily. That was a mistake. The cable lengths in that server room were perfectly sized for each console and its specific location, with very little slack – definitely not 4 inches worth. It was a very good looking data center; no excess cables were lying around. The console cable became unplugged from the SunOS NFS server (50 ft away). Yes, the cabling didn’t have 4 inches of slack over that entire run, thanks to expert cable-tie deployment. The disconnected console made the primary NFS server fail over to the secondary. BTW, my server was still performing fine and doing what it needed to accomplish for the 600 clients, but the NFS server was in the process of migrating the network and disk connections from one machine to the other. For the first 10 seconds (or so), nothing happened; then telemetry stopped as all the programs lost access to their ISP (Information Sharing Protocol) data. If the console had been reconnected during those first 10 seconds, nothing more would have happened. The failover worked perfectly and as expected. Sadly, because this was during a space shuttle launch before SRB separation, the ground controllers were basically in the dark for the most critical part of ascent.
  • The Fix: The consultant was fired. To this day, I specifically ask for some extra slack in all my cables.

I don’t know the cable guy’s or the tech architect’s names. Sorry. Fortunately, the main part of his work was done and he needed to move on anyway. Sadly, he didn’t get the come-check-everything-pre-launch contract.

What Library?

  • What happened: A single space character was inserted at the wrong location in an rm -rf command, which caused the entire document library to be deleted 2 days prior to a space shuttle launch. This library of documents had taken the librarian 3 months to build.
  • Why it happened: Removal of old documents was partially manual in the electronic documentation library (EDP) used for space shuttle manuals. A key part was to recursively remove old files for completed launches to make room for new files needed for upcoming launches. The librarian, an older gentleman with 35 yrs of service as a NASA worker and less than 6 months until retirement, accidentally inserted a space at the wrong place in his
    rm -rf directory*

    command. What he entered was
    rm -rf directory *
    See the slight difference? Ouch. (A safer habit is sketched after this list.)
  • The Fix: The entire development team of 6 people became document librarians and helped manually rebuild the library, including the complex regexes needed to build TOC-to-page, table, figure, and index hyperlinks. We started with the pre-launch documents, continued to the ascent documents, and finished with the on-orbit, reentry and landing documents. Prior to launch, the library was nearly complete and the disk drive array was very warm. UNIX scripting rocks. Need I say more?
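
For what it’s worth, here is the habit that catches this class of mistake. It’s only a sketch; the directory names are made up, and -I is a GNU rm option, so it won’t exist on every UNIX.

    # Let the shell show exactly what the glob expands to before anything runs.
    echo rm -rf completed-mission-docs*

    # If the expansion looks right, recall the line and drop the "echo".
    # Where GNU rm is available, -I adds one confirmation prompt before any
    # recursive removal.
    rm -rI completed-mission-docs*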

Hi Ray! I hope retirement has been good to you and your family!

Dispatch Who to Where?

  • What happened: Massive database corruption during production hours when the software vendor told a highly experienced application support guy to run a specific Oracle command. 17K+ users were impacted at the height of production for an entire day: 17,000 technicians couldn’t perform their tasks because the dispatching system was down, and 4,000 call center people couldn’t give customers dispatch times. Break-fix became a manual process for about 8 hours across the SE USA telephone and data network.
  • Why it happened: A telecom software vendor told a software support person to type an Oracle command into the running production database. The vendor was deemed to be expert at this, but clearly was not. No Oracle DBA was involved or managing the DB. The command was not safe and caused major DB corruption.
  • The Fix: Recover from backups created earlier the morning of the incident. Upper management wanted the local contractor fired. The project management (rightly) took all necessary steps to protect that person, and work began to add EMC BCV mirroring to the system (it had been part of the system design, just delayed by the internal customer in favor of other priorities, over and over, for 2 yrs). The BCV mirrors were set up to run every 4 hours to limit data loss. Backups were still only performed daily, which included remote replication to a DR location hundreds of miles away. Professional DBAs were engaged to perform all Oracle work AND held accountable going forward. A highly experienced Oracle DBA joined the team and remained highly valued until I left the team.

Hi Mark! Vendor “X” sucks!

Lessons Learned?

  1. Be very careful with rm -rf commands.
  2. Be very careful with rsync commands.
  3. Be very careful with find commands, especially when they include rm commands.
  4. Be very careful with any DBMS command.
  5. There is no substitute for a good, verified backup. None. (A quick verification sketch follows this list.)
  6. Some small cable slack is good. Having a few excess loops on each end of a cable run is a really, really good idea.
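
On #5, “verified” just means proving you can read the backup before you need it. A minimal sketch, with made-up device and archive names (ufsrestore for an old Solaris dump tape, GNU tar for a modern archive):

    # Old-school Solaris dump tape (example tape device):
    ufsrestore tf /dev/rmt/0 > /tmp/tape-toc.txt   # list the tape's contents
    wc -l /tmp/tape-toc.txt                        # zero lines here means trouble

    # Modern tar archive: read every byte, then compare against the live files.
    tar -tzf /backups/home.tar.gz > /dev/null      # archive is readable end to end
    tar -dzf /backups/home.tar.gz -C /             # GNU tar: report any differences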

What are your top ooops moments? Any that are better than mine? Don’t post any complete names or companies, please.
