Two mistakes, a full day

My web servers have been quite stable, probably 3-4 months now without rebooting. I have 2 copies of a disk on the main server, but since I will be going to China soon, I thought I would create a third copy and put it in a third server as a further backup. If both servers go down, the third one can be rebooted quickly with the correct IP.

I use gmirror to duplicate the master disk as a backup, but for some reason, if I stop gmirror before rebooting the machine, the slave is not bootable. So I always reboot, take one drive out, and swap another one in. The risk is that if I take out the wrong drive (the master), gmirror may start automatically on the new drive (since it was not stopped the previous round) and destroy the good copy in a few seconds. gmirror copies bit by bit, so overwriting even a small part of the drive is enough to make it unbootable.

This is what happened today. The main server has 8 SATA ports, and one drive (which I thought was the slave) was on SATA6. But Unix saw it as ad0, not ad6, and the machine booted from this drive instead of the one on SATA4. After a while I realized the machine had booted from the wrong drive and rebooted again (I should have issued “gmirror forget gm0” to disable gmirror). I made sure the master was on SATA4 and the slave on SATA6, but this was still before I realized that this motherboard was confused about SATA numbering. So I rebooted with the other good copy, and it was destroyed in seconds. Now I had lost two main drives! I should have booted with a single drive (the good one) first, or tried a different machine.

Luckily, most data (WordPress posts, both HTML files and MySQL data) are backed up daily and automatically transferred (through rsync) to the backup server. I had to use a backup server disk on the main server. Unluckily, I had not backed up some files, e.g. httpd.conf! I had to create a new one, restore the web files, and try to remember what else was missing. This took most of the day.

Setting up rsync again took me hours. It simply refused to work! Finally I created a new user and it worked in a few minutes, so something was wrong with my regular user account.

What a horrible mistake! Actually, I made the same mistake twice!

I was trying to get a third copy and ended up destroying two good ones!

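The main and secondary servers had both been stable all along, so I could have just skipped the extra backup. But yesterday I made a third copy of the secondary server's disk, and it went smoothly: done in 5 minutes.
Worried that something might go wrong while I am back in China, I decided to back up the main server as well. It had only 2 copies, both in the same machine.
I saw a piece of tape on one hard drive and assumed it was the slave, so I swapped another drive in. The machine then booted from the swapped-in drive and destroyed the good copy (gmirror started automatically and copied bit by bit onto the good drive; within a few seconds the good drive was gone and could no longer boot).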

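I looked the machine over and put the other good drive in on SATA port 4, with the drive to be used as the backup on port 6, but the good one got overwritten again! Only then did I discover the cause: on most computers the drive on the lowest-numbered SATA port boots and becomes the master, but this machine is odd. The drive was clearly on SATA6, yet Unix recognized it as ad0, not ad6, and that caused the disaster. I had no other backup; only the web pages and the MySQL database are backed up automatically every night to the secondary server. Why had I never noticed this before? And why, after the first drive was destroyed, did I try again? Using a different machine would have avoided the whole thing.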

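But I did not have an up-to-date httpd.conf! I spent several hours rewriting it. Fortunately, the web pages all came back. In the end, though, I found that the photos uploaded in 2011 were gone (luckily only those from one conference had been uploaded). The HTTP logs are all gone too (so there are no access stats for this year). What else is missing? I have not found out yet.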

From now on, many config files need to be backed up regularly, such as /etc/rc.conf, httpd.conf, the named DB files, and so on. Otherwise recovering is a nightmare.
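
A regular config backup like this could be as small as the Python sketch below. It is only an illustration, not what I actually run: the function name `backup_configs`, the dated-folder layout, and the example paths in the comments are all assumptions. It copies each listed file into a dated staging directory and reports anything it could not find, so a missing httpd.conf shows up before the day it is needed.

```python
import shutil
import time
from pathlib import Path


def backup_configs(paths, dest_root, stamp=None):
    """Copy each existing file in `paths` under dest_root/<stamp>/,
    keeping its original directory layout.

    Returns (copied, missing): the destination paths written and the
    source paths that were not found.
    """
    stamp = stamp or time.strftime("%Y-%m-%d")
    target = Path(dest_root) / stamp
    copied, missing = [], []
    for p in map(Path, paths):
        if not p.is_file():
            missing.append(str(p))
            continue
        full = p.resolve()
        # /etc/rc.conf becomes <target>/etc/rc.conf
        out = target / full.relative_to(full.anchor)
        out.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(p, out)  # copy2 also keeps timestamps and mode bits
        copied.append(str(out))
    return copied, missing


# Hypothetical usage; the httpd.conf location and the staging directory
# are assumptions and differ between installations:
# copied, missing = backup_configs(
#     ["/etc/rc.conf", "/usr/local/etc/apache22/httpd.conf"],
#     "/home/backup/configs",
# )
# if missing:
#     print("not backed up:", missing)
```

If the staging directory sits inside a tree that the nightly rsync job already transfers, the config files would travel to the backup server together with the web files and database dumps.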

Author: Zachary Huang
