Archive for March, 2007

High load average due to hardware issues

Friday, March 30th, 2007

Performance tuning is something of an art. You know what you expect to reach, and you strive towards it through selective tuning of your OS memory utilization, your network settings, NFS mount parameters, and so on.

I visited a customer whose server was acting funny. First, it had a high load average: for an idle server with 2 CPUs, a load average which never drops below 1.0 can be considered high.

Viewing the logs, I saw lots of PS/2 error messages. It seems that the hotplug daemon had been kept very busy, respawning several times a second because of incorrect hardware detection triggered by these PS/2 errors, and this caused the high load average (many processes in the CPU queue). Disconnecting the PS/2 connection between the server and the KVM solved the issue, and within around 2 minutes the load average decreased to around 0.02.
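The kind of check described above can be sketched in shell. This is only an illustration: the function names are hypothetical, and the search pattern and log path are left as arguments, since the exact wording of the PS/2 messages is not given here.

```shell
#!/bin/sh
# Hypothetical helper: print the 1, 5 and 15 minute load averages from a
# loadavg-style file (defaults to /proc/loadavg on Linux).
show_load() {
    awk '{ printf "1min=%s 5min=%s 15min=%s\n", $1, $2, $3 }' "${1:-/proc/loadavg}"
}

# Hypothetical helper: count the lines in a log file that match a pattern
# (case-insensitive). $1 is the pattern, $2 is the log file.
count_log_matches() {
    grep -c -i -- "$1" "$2"
}

# Usage on the suspect server (pattern and log path are assumptions):
#   show_load
#   count_log_matches "ps/2" /var/log/messages
```

A 1-minute value that stays at or above 1.0 on an idle machine, together with a log match count that keeps growing between runs, would point in the same direction as the case above.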

Hardware-related problems are usually the most intensive performance hogs, and also the easiest to solve.

The cutest useless thing I could have wanted

Wednesday, March 28th, 2007

I was browsing the blog of a friend of mine, xslf, when I read this post about a cute wifi rabbit. I then went on to browse its website.

I want one. This rabbit just caught me so badly that I even set the icon for the "Not Really Technical" section to be its figure. I hope I don’t break any trademark rule… Still – I do have a link directly to their website…

HP MSA1000 controller failover

Tuesday, March 27th, 2007

HP MSA1000 is an entry-level disk storage device capable of communicating via different types of interfaces, such as SCSI and FC, and it can allow FC failover. This FC failover, however, is controller failover and not path failover. It means that if the primary controller fails entirely, the backup controller will “kick in”. However, if a multi-path capable client loses its primary interface, there is no guarantee that it will be able to communicate with the disks through the backup controller.

The symptom I encountered was that the secondary path, while exposing the disks to the server (while the primary path was down for one of the servers), did not allow any SCSI I/O operations. This prevented the Linux server’s SCSI layer from accessing the disks. They did appear when doing “cat /proc/scsi/scsi”, but they were not detected using, for example, “fdisk -l”, and the system logs filled up with “SCSI Error” messages.
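A rough way to spot this mismatch is to compare what the SCSI layer lists with what fdisk can actually read. The sketch below uses hypothetical function names. The counts will not necessarily be equal on a healthy system, since /proc/scsi/scsi also lists non-disk devices such as tapes and CD-ROM drives.

```shell
#!/bin/sh
# Hypothetical helper: count the devices listed in a /proc/scsi/scsi-style
# file (one "Host:" line per attached device).
count_scsi_listed() {
    grep -c '^Host:' "${1:-/proc/scsi/scsi}"
}

# Hypothetical helper: count the "Disk /dev/..." headers read on stdin,
# for example from the output of "fdisk -l" (which needs root).
count_fdisk_disks() {
    grep -c '^Disk /dev/'
}

# Usage on the affected server:
#   count_scsi_listed
#   fdisk -l 2>/dev/null | count_fdisk_disks
```

Disks that are listed by the first command but missing from the second are the ones that are exposed without being readable.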

About a month ago, after almost two years, a new firmware update was released (it can be found here). Two versions exist: Active/Passive and Active/Active.

I have upgraded the MSA1000 storage device.

After installing the Active/Active firmware upgrade (Linux users, take notice: you must have X to run the “msa1500flash” utility), and after power cycling the MSA1000 device, things started to look good.

I tested it with a person on-site disconnecting fiber connections on demand, and it worked great: about 2-5 seconds failover time.

Since this system runs Oracle RAC, and it uses OCFS2, I had to update the failed-node timeout to 31 seconds (per this Oracle OCFS site, which includes some really good tips).

So real High Availability can be achieved after upgrading the MSA1000 firmware.

New design to my blog!

Sunday, March 25th, 2007

Thanks to my charming wife, my blog has been redesigned to be somewhat more appealing. I have noticed that many of the techno-babble blogs or personal websites look bad. Usually – black on white at most. Sometimes, some awful design.

I am proudly not part of *this* group anymore 🙂

Compaq Proliant 360/370/380 G1 cpqarray problems with Ubuntu

Saturday, March 24th, 2007

Or, for that matter, any other Linux distribution that:

a. Uses kernel 2.6.x up to 2.6.18

b. Does not dynamically create the initrd as part of the installation

Ubuntu is one such example. While it does create the initrd, it doesn’t build it dynamically from the output of ‘lspci’, so every SCSI module which exists ends up being included.

The symptoms: you can install the system, but you are unable to boot it afterwards. You might be dropped into the Busybox shell of the initrd. The cpqarray module doesn’t detect any arrays. The error is "cpqarray: error sending ID controller". You will notice that the sym53c8xx module is loaded.

I searched for a solution and found an initial hint in this blog, although the entry was not completely accurate. Following the tips given in this page, I understood that there was a bug in the kernel which caused the sym53c8xx module to take over from cpqarray. I had to remove the offending modules from the initrd. I booted into rescue mode from the Ubuntu Server CD, and from there did the following:

1. Mount /boot

2. Add the following module list to your /etc/initramfs-tools/modules: modules-proliantG1.txt

3. Edit /etc/initramfs-tools/initramfs.conf to change "MODULES=most" to "MODULES=list"

4. Run "update-initramfs -k 2.6.17-11-server -c" (this is relevant in my case: an up-to-date Ubuntu Server 6.10. For other versions, check which is the latest installed kernel version. It can be found with a simple ls on /lib/modules/)

After the reboot I was pleased to discover that my system was able to boot correctly, and I know it will do so for updated versions of the kernel as well.
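Steps 3 and 4 of the list above can be sketched as a small shell script. The function names are hypothetical, and it assumes GNU sed and GNU sort, as found on Ubuntu. "Latest" here means highest by version ordering, which may not be the kernel you intend to boot, so check the result before using it.

```shell
#!/bin/sh
# Step 3 (hypothetical helper): switch initramfs-tools from "most" modules
# to an explicit list. Takes the path to initramfs.conf as $1, so it can be
# tried on a copy of the file first.
use_module_list() {
    sed -i 's/^MODULES=most/MODULES=list/' "$1"
}

# Step 4 (hypothetical helper): print the highest kernel version directory
# under /lib/modules, or under another directory given as $1.
latest_kernel() {
    ls "${1:-/lib/modules}" | sort -V | tail -n 1
}

# Usage (as root, in the rescue environment, after mounting /boot and
# adding the module list to /etc/initramfs-tools/modules):
#   use_module_list /etc/initramfs-tools/initramfs.conf
#   update-initramfs -k "$(latest_kernel)" -c
```

Keeping the file path as a parameter is a deliberate choice: the edit can be rehearsed on a scratch copy before touching the real configuration.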