Archive for May, 2006

IBM X306 (ServerRaid7e) and Linux Centos 4.3

Sunday, May 14th, 2006

It was no fun, and I hope I never have to experience such a bad setup again.

Summary: IBM X306. The X306, X206 and some of the others are equipped with the ServerRaid7e SATA controller. These controllers lack in-kernel Linux drivers, and thus make me a sad person. Drivers are available from IBM's web site for RHEL 3 Update 4 (similar to Centos 3.4), and, on a very rare occasion, for RHEL 4 and RHEL 4 Update 1.

Centos 4.3 is equivalent to RHEL 4 Update 3, so neither driver was quite a match.

Stage zero: Come ready.

I came as ready as one can be: equipped with a laptop capable of serving bootp requests and NFS images of both the 32bit and the 64bit versions of Centos 4.3. I discovered that tftpd alone (the Debian version) was not enough (or at least could not boot IBM's PXE implementation), and had to replace it with tftpd-hpa, which worked correctly.
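
For anyone trying the same, the setup boils down to something like this, assuming ISC dhcpd on the laptop – the tftpd-hpa defaults plus a dhcpd host entry pointing the server at pxelinux. The MAC and IP addresses below are placeholders, and your paths may differ:

# /etc/default/tftpd-hpa
RUN_DAEMON="yes"
OPTIONS="-l -s /var/lib/tftpboot"

# /etc/dhcp3/dhcpd.conf (excerpt)
host x306 {
    hardware ethernet 00:11:22:33:44:55;
    fixed-address 192.168.0.10;
    next-server 192.168.0.1;
    filename "pxelinux.0";
}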

Stage one: Assess the problem. The problem was assessed, and I understood that no disks were available for installation. The kernel was unable to access the local disks. Bad.

Stage two: Find a solution.

With Internet access, I was able to identify drivers on IBM's site, but they were for RHEL 3 only. Some forum (can't remember the link, sorry) led me to a hidden part of IBM's web site, which included a driver for RHEL 4 Update 1. I downloaded it.

Stage Three: Work harder.

To load a vendor module built for a specific kernel version, you need to have that specific version. I had Centos 4.3 and required the kernel from Centos 4.1. I visited the Centos old-releases download area and was able to download the kernel package for Centos 4.1 (kernel-2.6.9-11.EL.x86_64.rpm), as well as the installer kernel and initrd image.

I’ve started installation with the following command:

linux dd=nfs:192.168.0.1:/mnt/Source/40k8690.img

After a short boot sequence, an error message appeared, claiming I was not installing the correct version. Of course I was not; I had used an older kernel and initrd!

I’ve downloaded the old stage2.img file from cdroot/Centos/base, and replaced the current one.
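
In practice that meant overwriting the stage2.img in the NFS install tree with the one taken from the Centos 4.1 media; something like the following, where the paths are only an example of such a layout:

cp /mnt/Source/CentOS/base/stage2.img /mnt/Source/CentOS/base/stage2.img.43
cp /path/to/centos-4.1/CentOS/base/stage2.img /mnt/Source/CentOS/base/stage2.img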

There was some problem with the IBM driver obtained from the net, so I looked into IBM's CDROM, the one supplied with the server, and found a working driver there. I will add it here, just in case – both the RPM and the DriverDisk (dd) image.

Finally I managed to install the server, after a rather long time in the server room.

One note – do yourselves a favour and replace the kernel package with the Centos 4.1 one before you reboot the server at the end of the installation. It can save you both a rescue install and trouble with RPM.

For some reason, I could not install the alternate (older) kernel. It failed to install because it was older than the existing one, and failed with an error even when used with "--force". I had to put the files in place manually, which required some tweaking:

I used "rpm2cpio kernel….rpm | cpio -id" to extract the files, and then moved them to the respective directories. I had to do a similar trick for IBM's drivers, because their post-install script failed for some reason. They did create an entry for the raid controller in /etc/modprobe.conf, so I only had to recreate the initrd file. A command similar to this:

mv /boot/initrd-2.6.9-11.EL.img /boot/initrd-2.6.9-11.EL.img.old

mkinitrd /boot/initrd-2.6.9-11.EL.img 2.6.9-11.EL

did the trick. Adding the correct entry to /etc/grub.conf fixed it all up. I was able to boot the newly installed system.
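
The grub.conf entry itself is nothing special; assuming the default LVM root that anaconda creates (adjust the root= parameter and partition numbers to your actual layout), it looks roughly like this:

title CentOS (2.6.9-11.EL)
        root (hd0,0)
        kernel /vmlinuz-2.6.9-11.EL ro root=/dev/VolGroup00/LogVol00
        initrd /initrd-2.6.9-11.EL.img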

I don't know what the implications are, but I did not dare let Kudzu change these settings when it claimed to have found a new raid controller. I just told it to ignore the device and never ask again.

reason: 550 Requested action not taken: Nonstandard SMTP line terminator.

Tuesday, May 9th, 2006

I have encountered this problem on my own personal mail server only once in a while. It's rather rare, and happens only when sending mail to some specific domains.

My first notion was that if it works everywhere except for these one or two domains, the problem is with those one or two.

My second notion was to investigate it somewhat further, to be able to assist the owner of the domains in question in solving their problem, or supply them with links showing possible solutions.

I was very surprised to discover that the problem was on my side. It was a bug in spamass-milter, used by my server to glue ol' Sendmail and Spamassassin together, and it appeared only in version 0.3.0.

I’ve just upgraded to version 0.3.1, and it works correctly for these domains.

Case closed.

Ontap Simulator, and some insights about NetApp

Tuesday, May 9th, 2006

First and foremost – the Ontap simulator, a great tool which surely can assist in learning the NetApp interface and its usage, lacks in performance. It has some built-in limitations – no FCP, no (virtual) disks larger than 1GB (per my trial and error; if I find out I was wrong, I will post it on this website), and low performance. I got about 300KB/s transfer rate both on iSCSI and on NFS. To make sure it was not due to some network hog hiding somewhere on my net(s), I even tried it from the host running the simulator itself, but to no avail. Low performance. Don't try to use it as your own home iSCSI target. Better just use Linux for this purpose, with the drivers obtained from here (it's one of my next steps towards "shared storage(s) for all").
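
For the curious – if the Linux target in question ends up being the iSCSI Enterprise Target (I have not settled on one yet, so take that as an assumption), exporting a plain file as a LUN is a couple of lines in /etc/ietd.conf; the target name and backing file below are placeholders:

Target iqn.2006-05.lab.home:storage.disk1
        Lun 0 Path=/srv/iscsi/disk1.img,Type=fileio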

Another issue – After much reading through NetApp documentation, I’ve reached the following concepts of the product. Please correct me if you see fit:

The older method was to create a volume (vol create) directly from disks, using either raid_dp or raid4.

The current method is to create aggregates (aggr create) from disks. Each aggregate consists of raid groups. A raid group (rg) can be made up of up to eight physical disks. Each raid group has one or two parity disks, depending on the raid type (raid4 uses one parity disk, and raid_dp uses "double parity", as its name suggests).
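
In the simulator, building such an aggregate is a single command; something along these lines (the aggregate name, raid group size and disk count here are my own example, and the exact flags may vary between Ontap versions):

aggr create aggr1 -t raid_dp -r 8 8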

Actually, I can assume that each aggregation is formatted using the WAFL filesystem, which leads to the conclusion that modern (flex) volumes are logical "chunks" of this whole WAFL layout. In the past, each volume was a separate WAFL-formatted unit, and each size change required adding disks.
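
Creating a flex volume on top of that aggregate, and growing it later without touching any disks, then looks roughly like this (names and sizes are arbitrary):

vol create vol1 aggr1 2g
vol size vol1 +1g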

This separation of the flex volume from the aggregation suggests to me the possibility of a multiple-root capable WAFL. It would explain the lack of any requirement for contiguous space on the aggregation. This eases space management and allows for fast and easy "cloning" of volumes.

I believe that the new "clone" method is based on WAFL's built-in snapshot capabilities. Although WAFL snapshots are supposed to be space-conserving, a clone requires guaranteed space on the aggregation before it can be committed. If the aggregation is too crowded, the operation will fail with the error message "not enough space". If there is enough for the snapshots, but not enough to guarantee a full clone, you'll get a message saying "space not guaranteed".
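
If I understand the documentation correctly, a clone is created roughly like this (FlexClone needs its own license, and the -s option controls exactly the space guarantee behaviour described above):

vol clone create vol1_clone -s none -b vol1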

I see the flex volumes as some combination between filesystem (WAFL) and LVM, living together on the same level.

LUNs on NetApp: iSCSI and/or Fibre Channel LUNs are actually managed as a single (per-LUN) large file contained within a volume. This file has special permissions (I was not able to copy it or modify it while it was online, even with root permissions – though I am rather new to NetApp technology), and it is exported to the outside as a disk. Much like an ISO image (which is a large file containing a whole filesystem layout), these files contain a whole disk layout, including partition tables, LVM headers, etc. – just like a real disk.
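
Exposing such a file-backed LUN over iSCSI takes only a few commands in the simulator; something along these lines (the LUN path, group name and initiator IQN are, of course, placeholders):

lun create -s 2g -t linux /vol/vol1/lun0
igroup create -i -t linux linux_hosts iqn.1994-05.com.redhat:client1
lun map /vol/vol1/lun0 linux_hosts 0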

Thinking about it, it's neither impossible nor very surprising. A disk is no more than a container of blocks, and if you can speak the communication protocol used for accessing and managing those blocks (i.e., the transport layer through which a filesystem accesses the block data), you can, with just a little translation layer, set up a virtual disk which behaves just like any regular disk.

This brings us to the advantages of NetApp's WAFL – the ability to minimize I/O while maintaining a set of snapshots for the system – a per-block modification history. It means you can "snapshot" your LUN, which is physically no more than a file on a WAFL-based volume, and go back with your data to a previous point in time – an hour, a day, a week ago. Time travel for your data.
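
The commands for this time travel are on the same level of simplicity; roughly as follows (snap restore requires the SnapRestore license and reverts the whole volume, LUN file included, so treat this as a sketch rather than a recipe):

snap create vol1 before_change
snap list vol1
snap restore -t vol -s before_change vol1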

There are, unfortunately, some major side effects. If you've read the WAFL description from NetApp, you will find my summary inaccurate at best. If you haven't, it will be enough, but you are still most encouraged to read it. The idea is that this filesystem is made out of multiple layers of pointers and of blocks. A pointer can point to more than one block. When you commit a snapshot, you do not move or copy any data; you just preserve the current set of pointers. When data changes (meaning a block is modified), the active pointer is switched to a new block instead of the previous (historical) one, but the snapshot keeps a reference to the older block's location. This way, only modified blocks are actually written anew, while any unmodified data remains in the same spot on the physical disk. An additional claim of NetApp is that WAFL is optimized for the raid4 and raid_dp layouts they use, and utilizes them in a smart manner.

The problem with WAFL, as can easily be seen, is fragmentation. For CIFS and NFS it does not cause much of a problem, as the system is very capable of read-ahead, which largely hides the issue. However, a LUN (which is supposed to act as a contiguous layout, just like any hard drive or raid array in the world, and on which various filesystem-level operations occur) does get fragmented.

Unlike CIFS or NFS, LUN read-ahead is harder to predict, as the client tries to do just the same thing. Unlike real disks, NetApp LUNs do not behave, performance-wise, like the hard-drive layout any DB or filesystem has learned to expect and is optimized for. For example, a DB with lots of small changes will try to commit those changes in large write operations, flushed at some interval, and will strive to place them as close to each other, as contiguously, as possible. On a NetApp LUN this will cause fragmentation, and will result in lower write (and later read) performance.

That’s all for today.

NetApp Ontap 7.1

Thursday, May 4th, 2006

I’ve had the pleasure of playing with Ontap Simulator. This is a marvelous tool, designed to simulate a real NetApp appliance, in an easy and affordable manner.

I noticed a link to this simulator on Oracle's web site. I'm posting it here so you'll know what I'm talking about. I'm running it on a Linux host. I've created virtual disks (small ones only; the simulator does not allow disks larger than 1GB anyhow), and I'm playing with NFS, CIFS, snapshots, etc.

If I have some surprising views on the matter, I will share them here.