Outage-Hardware Failure 2
Over the last few days we had an outage here. A UPS used by the network infrastructure failed, and nobody could physically get to it until sometime yesterday. Until then, nobody could troubleshoot the problem to find out what actually caused the loss of connectivity.
Around here, we don’t have many failures and certainly not many failures that impact public-facing services for more than a few minutes. Sometimes the blog software is a little flaky and crashes, but since we are running a few instances behind a load balancer, the service shouldn’t be completely unavailable and should auto-restart if all instances fail. Overnight when we take backups, it is just easier to take down all instances of the blog for a few minutes, do the backup, then bring them back up. It takes just a few minutes – not a big deal in the middle of the night. Yes, there are other ways to do this in a non-impacting way.
Remember, this is a non-income-generating blog – like a front door to a small business. It is expected, but doesn’t make any money.
Perhaps a few readers would like to know more details?
The Situation
Only a few people have physical access to the servers, so when all of those people are unavailable and bad things happen, that means that a longer outage will happen. That is what happened here.
Sometime over the weekend, there was a hardware failure. From where I was, it seemed to happen before 3:41am on Saturday morning. Obviously, I didn’t notice it until much later, but automatic email messages weren’t being sent or received after that time. They were being queued internally, and external emails were being redirected to our secondary mail servers in other locations. At least the business emails were. Emails to my personal domains didn’t appear to be handled properly. Hmmm. I need to look into that.
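For anyone who wants to pin down an outage window the same way, from a gap in automatic messages, here is a minimal Python sketch. It assumes you already have the arrival times of the automatic messages as a list and that they normally arrive at a roughly fixed interval. The function name find_gap, the hourly interval, and the sample timestamps are all made up for illustration; they are not taken from the mail logs here.

```python
from datetime import datetime, timedelta


def find_gap(timestamps, expected_interval, tolerance=timedelta(minutes=5)):
    """Return (last_seen, next_seen) for the first gap between consecutive
    messages that is longer than expected_interval + tolerance.

    Returns None when no such gap exists.
    """
    ordered = sorted(timestamps)
    for earlier, later in zip(ordered, ordered[1:]):
        if later - earlier > expected_interval + tolerance:
            return earlier, later
    return None


if __name__ == "__main__":
    # Hypothetical arrival times: hourly messages with a long hole in them.
    seen = [
        datetime(2024, 1, 6, 1, 0),
        datetime(2024, 1, 6, 2, 0),
        datetime(2024, 1, 6, 3, 0),
        datetime(2024, 1, 8, 14, 0),  # first message after service came back
    ]
    print(find_gap(seen, timedelta(hours=1)))
```

The failure happened somewhere between the two returned times, most likely close to the first one. This only narrows the window; it does not say what failed.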
Since I wasn’t here, I don’t know what started the failure, but according to weather reports, a thunderstorm is my best guess. It could have simply been an old device that failed.
Troubleshooting
When I finally get to the physical location, I hear the server fans and see that the EGA router is up and working fine. The public wifi router is working too. Then I notice that the lights on the internal router are not blinking. Das ist kaput. I immediately think that the old router, which has served me so well all these years, has finally died. I can steal the public wifi router for a quick fix, so I start unplugging power and ethernet cables.
I move the other router into this location, connect the WAN cable, then the power ….. no blinking lights. Duh – it is a power thing, not a failed router. At least at this point, a power problem looks more likely. That’s good news, because a replacement router would have needed all the port forwards set up again. I have an old record of all the ports and where they are forwarded, and I can certainly get the SMTP and web ports to the correct places, but the other 15-25 ports would take a little more effort.
I follow the power cable to a power strip. Sometimes I use cheap ones, but not this time. It is a Tripp Lite. Ok, time to pull the rack out for easier access. The Tripp Lite breaker is not out, but it doesn’t have any LEDs with diagnostics. Where does it connect?
To an old UPS – perhaps 10 years old. That explains it. I’ve been watching for a UPS sale/deal for about a year, looking for a replacement. I expected this device to fail at some point, but some of the outlets seem to still be working … interesting. The router, switch, telephone and a TV device are connected to the battery backup side of the UPS. None of those are working, but on the other side, with just surge protection, everything seems to be working. Sadly, this is a cheaper UPS, so there’s no diagnostic software and there are no serviceable parts. The 3 LEDs are not lit. Other UPSes around here don’t have any open outlets.
First things first: get the email and public servers accessible again. Move the plug from a battery-backed outlet to a surge-only outlet. Lights on the router and the other devices come to life. A few seconds later, I’m pinging google.com from a desktop. Good enough.
Next Steps
Now we have time to think and plan the next steps.
- look for a good-enough deal on a UPS
- try to migrate the switch and router power to a UPS with battery backup
- create a careful list of the router port forwarding settings
And the public wifi router can go back to doing what it does here.
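One way to keep the careful list of port forwarding settings from the next steps above is as a small structured file rather than a note on paper. The Python below is only a sketch of that idea. The PortForward class, the helper names, and the 192.168.0.x addresses are hypothetical; the three sample rules use the standard SMTP, HTTP and HTTPS ports and stand in for the real 15-25 entries, which are not listed here.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass(frozen=True)
class PortForward:
    external_port: int
    protocol: str        # "tcp" or "udp"
    internal_host: str   # hypothetical LAN address
    internal_port: int
    note: str = ""


def find_conflicts(rules):
    """Return (external_port, protocol) pairs that appear more than once."""
    seen = set()
    conflicts = []
    for rule in rules:
        key = (rule.external_port, rule.protocol)
        if key in seen and key not in conflicts:
            conflicts.append(key)
        seen.add(key)
    return conflicts


def to_csv(rules):
    """Render the rules as CSV text, sorted by external port and protocol."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=[f.name for f in fields(PortForward)])
    writer.writeheader()
    for rule in sorted(rules, key=lambda r: (r.external_port, r.protocol)):
        writer.writerow(asdict(rule))
    return buf.getvalue()


# Placeholder rules: standard ports, made-up internal addresses.
RULES = [
    PortForward(25, "tcp", "192.168.0.10", 25, "SMTP"),
    PortForward(80, "tcp", "192.168.0.20", 80, "web (HTTP)"),
    PortForward(443, "tcp", "192.168.0.20", 443, "web (HTTPS)"),
]

if __name__ == "__main__":
    print(find_conflicts(RULES))
    print(to_csv(RULES))
```

Keeping a printed copy of the CSV next to the rack means the list is still readable when the network is the thing that is down.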
I hope this wasn’t too boring, but at least it was just a cheap UPS, not something expensive or difficult to swap out.
If you use “kaput” in a German sentence, please use the correct German spelling: “kaputt”. Take this from a German :-)
It is very difficult to edit any articles here, so I won’t fix it.
Just know that I’ll remember the German spelling next time.