Posts Tagged ‘server farm’

Correct rack wiring tips & tricks with pictures

Sunday, April 15th, 2007

This post offers several tips and tricks for wiring cables into new rack closets, using pictures to demonstrate what to do and what not to do. It is based on a job I took part in, together with several other companies, moving a server farm to a new location. The job included wiring new rack closets, and the pictures show work done by me and by another team.

Since this post includes many pictures, I have decided to split it into this description and the actual contents.

A bug in restore in Centos4.1 and probably RHEL 4 update 1

Sunday, February 26th, 2006

I’ve been to Hostopia today, the land of hosting servers. I had an emergency job on one Linux server, due to a mistake I had made. It appears that the performance penalty of using raid0 instead of raid1 (the Centos/RH default raid setup is raid0 and not raid1, which led me to this mistake) for the root partition is terrible.

I tend to set up servers in the following way (a rough sketch of the commands appears after the list):

Small (100MB) raid1 partition (/dev/sda1 and /dev/sdb1, usually) for /boot.

Two separate partitions for swap (/dev/sda2 and /dev/sdb2), each holding half the required total swap.

One large raid1 (/dev/sda3 and /dev/sdb3) containing LVM, which, in turn, holds the “/” and the rest of the data partitions, if required.
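
For reference, here is a rough sketch of how such a layout can be created by hand, assuming the two disks are already partitioned as above. The md device numbers, volume group name and sizes are only placeholders, not what any particular installer would pick:

    # Mirror (raid1) for /boot and for the large LVM physical volume
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
    mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda3 /dev/sdb3

    # Two plain swap partitions, not mirrored
    mkswap /dev/sda2
    mkswap /dev/sdb2

    # LVM on top of the large mirror, holding "/" and any extra data volumes
    pvcreate /dev/md1
    vgcreate vg0 /dev/md1
    lvcreate -n root -L 10G vg0
    mkfs.ext3 /dev/md0
    mkfs.ext3 /dev/vg0/root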

In this specific case, I made a mistake and was not aware of it in time: I had set up the large LVM over a stripe (raid0). The server had degraded performance, and all disk access was slow. Very slow. Since it is impossible to break such a raid array without losing data, I had to back up the data currently there, and make sure I would be able to restore it. It’s an old habit of mine to use dump and restore; both ends of the procedure have so far worked perfectly on all *nix operating systems I’ve had experience with. I dumped the data, using one of the swap partitions as a container (formatted as ext3, of course), and was ready to continue.
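
The backup step looked roughly like the following; the mount point and dump file name are just examples, not necessarily the exact ones I used:

    # Reuse one of the swap partitions as a temporary ext3 container
    mkfs.ext3 /dev/sda2
    mkdir -p /mnt/backup
    mount /dev/sda2 /mnt/backup

    # Level-0 dump of the root filesystem into a file on that partition
    dump -0 -f /mnt/backup/root.dump /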

I reached the server farm, where all the hosting servers stood in long rows (I’m so sorry I did not take a picture; some of those so-called “servers” had color LEDs in their fans!), and got busy on this specific server. I had to back up everything from the start, as the earlier dump had failed to complete (this time I dumped to my laptop through the 2nd NIC), and then I booted into rescue mode, destroyed the LVM, destroyed the raid (md device), and recreated them. It went fine, except that restore refused to work. The claim was “. is not the root” or something similar. Checking restore via my laptop worked fine, but on the server itself it kept failing. Eventually, after a long waste of time, I installed a minimal Centos4.1 setup on the server, and tried to restore with overwrite from within a chroot environment. It failed as well, with the same error message. I suddenly decided to check whether there was an update to the dump package, and there was. Installing it solved the issue. I was able to restore the volume (using the “u” flag, to overwrite existing files), and all was fine.
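
For the record, the rescue-mode part went roughly along these lines; the device names, volume group name and sizes are placeholders, and the real layout was a bit richer:

    # Tear down the striped array and rebuild it as a mirror
    vgchange -an vg0
    vgremove -f vg0
    pvremove /dev/md1
    mdadm --stop /dev/md1
    mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda3 /dev/sdb3

    # Recreate the LVM layout and the root filesystem
    pvcreate /dev/md1
    vgcreate vg0 /dev/md1
    lvcreate -n root -L 10G vg0
    mkfs.ext3 /dev/vg0/root

    # Restore the dump into the new filesystem; the u flag lets restore
    # overwrite files that already exist in the target directory
    mount /dev/vg0/root /mnt/sysimage
    cd /mnt/sysimage
    restore -ruf /mnt/backup/root.dump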

I wasted over an hour on this stupid bug. Pity.

I’m keeping a static copy of the up-to-date restore binary, so I will not have these problems again. I hope 🙂

Process monitoring, keepalive, etc.

Sunday, October 23rd, 2005

My new Linux server-to-be will require some remote monitoring and process keepalive. I’ve noticed that nscd (which is required when dealing with hundreds of LDAP-based accounts) tends to die once in a while. I also made a mistake once, and managed to kill all the SSH daemons, including the running ones. I am happy to say it was solved by going down one floor, connecting a screen to the machine, and restarting the service; however, it would have been nasty had it happened in the colocation room, inside our ISP’s server farm…

So, trying to solve problems *before* they appear, I decided to search for a process keepalive daemon, or something that would ease my life and make sure I don’t get any phone calls.

At first, searching for "process keepalive" led me to some pages about HA servers, a.k.a. High Availability clusters. I don’t need multi-node keepalive, so I didn’t bother with it. Installing Centos’ or Dag’s keepalived proved to be exactly what I was not looking for, so I removed it and kept on searching.

In the process, I found this link, which should have been put into cron. Nice for one or two processes, but maintaining a full load of about 10 processes, which I must keep alive at all times, is a bit too much for it. Since I cannot code Perl, I needed something else, more scalable.
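
For context, a cron-driven keepalive is basically this kind of thing (a shell version, not the Perl script from that link; the daemon name and paths are just an example):

    #!/bin/sh
    # keep-nscd.sh - restart nscd if it is no longer running
    if ! pidof nscd > /dev/null; then
        /etc/init.d/nscd start
    fi

with an /etc/crontab entry such as:

    */5 * * * * root /usr/local/bin/keep-nscd.sh

which clearly does not scale well to ten daemons.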

I’ve seen lots of things, and some of them looked like they could interest me, but I wanted it as part of my package tree. I wanted it to be an RPM, and to be able to upgrade it if there are updates, all without actually tracking each package by hand (which is a good enough reason to have a package management system in the first place).

In Dag Wieers’ RPM repository I was able to find just the thing for me. It’s called "monit". It took me about 10 minutes to set it up, make it work, and test it for most of my more important daemons.

An example of a configuration file is here: monit.conf
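
For a rough idea of the format, a minimal monit.conf looks something like this; the poll interval, pidfile paths and init scripts here are assumptions and will differ per distribution:

    set daemon 120                  # poll every two minutes
    set logfile syslog

    check process sshd with pidfile /var/run/sshd.pid
        start program = "/etc/init.d/sshd start"
        stop program = "/etc/init.d/sshd stop"

    check process nscd with pidfile /var/run/nscd/nscd.pid
        start program = "/etc/init.d/nscd start"
        stop program = "/etc/init.d/nscd stop"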

It works, and it has made my life a lot easier. I can now easily recover from both human mistakes and machine errors. I might add some mail notification, but for now I will settle for logs only.