What I learned and how I recovered from Partition Table corruption due to Power failure and fsck

Had an interesting experience. Our power blinked (rapidly off and on) a few times. Not only did the SuKam UPS pass the fluctuations through, it kept misbehaving after the power had stabilized. By the time I switched off the computer, the damage was done. It would only boot to the GRUB prompt.
I used an SSD from another computer to boot the machine and ran fsck. Instead of fixing the situation, it f*ck’ed up the three partitions of /dev/sda (/boot, swap and /) and created a single partition!
I hadn’t saved the partition information and didn’t remember how much I had allocated to swap, so I couldn’t safely guess the sizes and re-partition the disk myself.
I tried the rescue option of parted and part; neither helped.
Then I used testdisk from CGSecurity. It accurately detected the partitions and saved them. Then I fired up gparted. It couldn’t read /dev/sda3 but suggested I should reboot first, which I did.
Funnily, the system rebooted with /boot on /dev/sda1 rather than /dev/sdb1 (the SSD added for recovery), but then mounted /dev/sdb3 as / instead of /dev/sda3. This happened because menu.lst specified the root volume by label, and both sda3 and sdb3 had the identical label:

kernel /vmlinuz-2.6.18-408.el5 ro root=LABEL=/ rhgb quiet

While booting I edited the entry to use an ID instead: root=ID=yuwew…
The ID used was that of /dev/sda3.
It came up, but with an error from the nvidia driver.
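A label clash like this can be spotted before rebooting. The sketch below is not what I ran at the time; it assumes the input comes from `blkid -s LABEL` (one `device: LABEL="..."` line per filesystem), and the function name `duplicate_labels` is made up for illustration.

```shell
#!/bin/sh
# Report filesystem labels that appear on more than one device.
# Expected input on stdin: lines as printed by `blkid -s LABEL`, e.g.
#   /dev/sda3: LABEL="/"
# (duplicate_labels is a hypothetical helper, not a standard tool.)
duplicate_labels() {
    awk -F'"' '
        /LABEL="/ {
            dev = $1
            sub(/:.*/, "", dev)          # keep only the device path
            devs[$2] = devs[$2] " " dev  # group devices by label
            count[$2]++
        }
        END {
            for (label in count)
                if (count[label] > 1)
                    printf "%s:%s\n", label, devs[label]
        }
    '
}
```

Run as root, `blkid -s LABEL | duplicate_labels` would print something like `/: /dev/sda3 /dev/sdb3` for the situation above. From there the options are either to point the kernel line at an identifier unique to one partition (`blkid /dev/sda3` shows the label and UUID), or to give the rescue disk's filesystem a different label with `e2label`, assuming it is ext2/3/4.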
I decided to try again on the actual system. I removed the extra SSD to allow the actual SSD to boot properly.
It came up fine, but the nvidia driver was not loading. After working for so long with multiple monitors, it was strangely restrictive trying to get work done with a single, duplicated display. After some frantic searching I found that the best option was to download the driver installer from nvidia and run it in runlevel 3 (init 3).

# as root, from a text console after switching to runlevel 3 (init 3)
chmod 755 NVIDIA-Linux-x86_64-340.96.run
./NVIDIA-Linux-x86_64-340.96.run

Then I switched back to init 5 and rebooted, and multi-monitor started working again.
Takeaways from this experience:

  1. Always back up the partition table.
  2. Do not run fsck without backing up the partition table.
  3. Don’t expect fsck to always do the right thing. After all, the name is intentional, and the tool is to be used only as a last-ditch effort to save your disk and data.
  4. A UPS should not be your only protection. I am thinking of adding a spike buster in between.
  5. Regular backups are a must. I had some backups, but when it happened I realized they were much too old to be of much use. Cloud backup services like Dropbox are your friend.
  6. Backups are of no value unless you remember how to restore them in an emergency.
  7. The backup disk on the same machine was unharmed, so even a backup on the same machine is of some value.
  8. Next time around I will choose a graphics card with seamless Linux support.
  9. Always have a second computer around in running condition; even a Raspberry Pi will do. You may need lots of help from the Internet.
  10. Do not set up your Internet connection, router, firewall, DNS, DHCP etc. on your machine. Use your router and share with all your machines from there. Your router is less likely to fail than your machine. You may keep the settings on an unused NIC on a computer to use if your router fails.
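
For points 1 and 2, here is a minimal sketch of how the partition table could be saved, assuming an MBR-style disk and the stock `sfdisk` tool. The function name `backup_partition_table` and the file naming scheme are my own choices for illustration, not anything standard.

```shell
#!/bin/sh
# Dump a disk's partition table to a timestamped text file.
# Usage: backup_partition_table /dev/sda [output-directory]
# (hypothetical helper; needs root to read a real device)
backup_partition_table() {
    dev=$1
    outdir=${2:-.}
    # e.g. /dev/sda -> <outdir>/sda-YYYYmmdd-HHMMSS.sfdisk
    out="$outdir/$(basename "$dev")-$(date +%Y%m%d-%H%M%S).sfdisk"
    # sfdisk -d writes the table in a format sfdisk can read back later
    sfdisk -d "$dev" > "$out" || { rm -f "$out"; return 1; }
    echo "$out"
}
```

The dump is plain text, so it also answers the "how big was swap?" question at a glance. To restore, the saved file can be fed back with `sfdisk /dev/sda < sda-….sfdisk`. Keep the file somewhere other than the disk it describes (a USB stick, another machine, a printout), otherwise it is gone when you need it.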

PS. The SuKam tubular battery has failed, which SuKam support insists was the cause of this catastrophe. It was over 5 years old.

2 replies on “What I learned and how I recovered from Partition Table corruption due to Power failure and fsck”

Hi, I work for Su-Kam. I just noticed that you faced problems with our product. I would like to know more about this. May I please have your email?
