Archive for the ‘Clusters’ Category

VMware Fencing in RedHat Cluster 5 (RHCS5)

Thursday, June 14th, 2007

Cluster fencing – Unlike many common thoughts, high-availability is not the highest priority of an high-availability cluster, but only the 2nd one. The highest priority of an high-availability cluster is maintenance of data integrity by prevention of multiple concurrent access of nodes to the shared disk.

On different cluster, depending on the vendor, this can be achieved by different methods, either by prevention of access based on the status of the cluster (for example – Microsoft Cluster, which will not allow access to the disks without cluster management and coordination), by panicking the node in question (Oracle RAC, for example, or IBM HACMP), or by preventing failover unless the status of the other node, as well as all heartbeat links were ok up to the exact moment of failure (VCS, for example).

Another method is based on a fence, or “Shoot the Other Node in the Head”. This “fence” is usually based on an hardware device which has no dependencies for the node’s OS, and is capable of shutting it down, many times brutally, upon request. A good fencing device can be a UPS, which supports the other node. The whole idea is that in a case of uncertainty, either one of the nodes can attempt to ‘kill’ the other node, independently of any connectivity issue one of them might experience. This race result is quite obvious: one node remains alive, capable of taking over the resource groups, the other node is off, unable to access the disk in an uncontrolled manner.

Linux-based clusters will not force you to use fencing of any sort, however, for a production environments, setups without any fencing device will be unsupported, as the cluster cannot handle cases of split-brain or uncertainty. These hardware devices, which can be, as said before, a manageable UPS, a remote-control power-switch, the server’s own IPMI (or any other independent system such as HP ILO, IBM HMC, etc), and even the fiber switch – as long as it can prevent the node in question from accessing the disks, are quite expensive, but comparing to hours of restore-from-backup, they sure justify their price.

On many sites there is a demand for a “test” setup which will be as similar to the production setup as possible. This test setup can be used to test upgrades, configuration changes, etc. Using fencing in this environment is important, for two reasons:

1. Simulation of the production system behavior is achieved with as similar setup as possible, and fencing takes an important part in the cluster and its logic.

2. A replicated production environment contain data which might have some importance, and if not that, at least re-replicating it from the production environment after a case of uncontrolled access to the disk by a faulty node (and this test cluster is in a higher risk, as defined by its role), or restoring from tapes is unpleasant and time consuming.

So we agree that the test cluster should have some sort of fencing device, even if not similar to production’s one, for the sake of the cluster logic.

On some sites, there is a demand for more than one test environment. Both setups – a single test environment and multiple test environments can be defined to work as guests on a virtual server. Virtualization assists in saving hardware (and power, and cooling) costs, and allows for easy duplication and replication, so this is a case where it is ideal for the task. This said, it brings up a problem – fencing a virtual server has implications – we can kill all guest systems in one go. We wouldn’t want that to happen. Lucky for us, RedHat Cluster has a fencing device for VMware, which, although not recommended in a production environment, will suffice for a test environment. These are the steps required to setup one such VMware fencing device in RHCS5:

1. Download the latest CVS fence_vmware from here. You can use this direct link (use with “save target as”). Save it in your /sbin directory under the name fence_vmware, and give it execution permissions.

2. Edit fence_vmware. In line 249 change the string “port” to “vmname”.

3. Install VMware Perl API on both cluster nodes. You will need to have gcc and openssl-devel installed on your system to be able to do so.

4. Change your fencing based on this example:

<?xml version="1.0"?>
<cluster alias="Gfs-test" config_version="39" name="Gfs-test">
        <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
        <clusternodes>
                <clusternode name="cent2" nodeid="1" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="man2"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="cent1" nodeid="2" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="man1"/>
                                </method>
                                <method name="2">
                                        <device domain="22 " name="11 "/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <cman expected_votes="1" two_node="1"/>
        <fencedevices>
                <fencedevice agent="fence_vmware" name="man2"
                          ipaddr="192.168.88.1" login="user" passwd="password"
                          vmname="c:\vmware\virt2\rhel5.vmx"/>
                <fencedevice agent="fence_vmware" name="man1"
                          ipaddr="192.168.88.1" login="user" passwd="password"
                          vmname="c:\vmware\virt1\rhel5.vmx"/>
        </fencedevices>
        <rm>
                <failoverdomains/>
                <resources>
                        <fs device="/dev/sda" force_fsck="0" force_unmount="0"
				fsid="5" fstype="ext3" mountpoint="/data"
                                name="sda" options="" self_fence="0"/>
                </resources>
                <service autostart="1" name="smartd">
                        <ip address="192.168.88.201" monitor_link="1"/>
                </service>
                <service autostart="1" name="disk1">
                        <fs ref="sda"/>
                </service>
        </rm>
</cluster>

Change to your relevant VMware username and password.

If you have a Centos system, you will be required to perform these three steps:

1. ‘ln -s /usr/sbin/cman_tool /sbin/cman_tool

2. ‘cp /etc/redhat-release /etc/redhat-release.orig

3. ‘echo “Red Hat Enterprise Linux Server release 5 (Tikanga)” > /etc/redhat-release

This should do the trick. Good luck, and thanks again to Yoni who brought and fought the configuration steps.

***UPDATE***

Per comments (and a bit-late – common logic) I have broken lines in the XML quote for cluster.conf. In cases these line breaks might break something in RedHat Cluster, I have added the original xml file here: cluster.conf

HP MSA1000 controller failover

Tuesday, March 27th, 2007

HP MSA1000 is an entry-level disk storage capable of communicating via different types of interfaces, such as SCSI and FC, and can allow FC failover. This FC failover, however, is controller failover and not path failover. It means that if the primary controller fails entirely, the backup controller will “kick in”. However, if a multi-path capable client will fail its primary interface, there is no guarantee that communication with the disks through the backup controller.

The symptom I have encountered was that the secondary path, while exposing the disks (while the primary path was down for one of the servers) to the server, did not allow any SCSI I/O operations. This prevented the Linux server’s SCSI layer from accessing the disks. So they did appear when doing “cat /proc/scsi/scsi“, however, they were not detected using, for example, “fdisk -l“, and the system logs got filled with “SCSI Error” messages.

About a month ago, after almost two years, a new firmware update has been released (can be found here). Two versions exist – Active/Passive and Active/Active.

I have upgraded the MSA1000 storage device.

After installing the Active/Active firmware upgrade (Notice Linux users – You must have X to run the “msa1500flash” utility), and after power cycling the MSA1000 device, things start to look good.

I have tested performance with a person on-site disconnecting fiber connections on-demand, and it worked great. About 2-5 seconds failover time.

Since this system run Oracle RAC, and it uses OCFS2, I had to update the failed-node timeout to be 31 seconds (per this Oracle’s OCFS site, which includes some really good tips).

So real High Availability can be archived after upgrading MSA1000 firmware.

Single-Node Linux Heartbeat Cluster with DRBD on Centos

Monday, October 23rd, 2006

The trick is simple, and many of those who deal with HA cluster get at least once to such a setup – have HA cluster without HA.

Yep. Single node, just to make sure you know how to get this system to play.

I have just completed it with Linux Heartbeat, and wish to share the example of a setup single-node cluster, with DRBD.

First – get the packages.

It took me some time, but following Linux-HA suggested download link (funny enough, it was the last place I’ve searched for it) gave me exactly what I needed. I have downloaded the following RPMS:

heartbeat-2.0.7-1.c4.i386.rpm

heartbeat-ldirectord-2.0.7-1.c4.i386.rpm

heartbeat-pils-2.0.7-1.c4.i386.rpm

heartbeat-stonith-2.0.7-1.c4.i386.rpm

perl-Mail-POP3Client-2.17-1.c4.noarch.rpm

perl-MailTools-1.74-1.c4.noarch.rpm

perl-Net-IMAP-Simple-1.16-1.c4.noarch.rpm

perl-Net-IMAP-Simple-SSL-1.3-1.c4.noarch.rpm

I was required to add up the following RPMS:

perl-IO-Socket-SSL-1.01-1.c4.noarch.rpm

perl-Net-SSLeay-1.25-3.rf.i386.rpm

perl-TimeDate-1.16-1.c4.noarch.rpm

I have added DRBD RPMS, obtained from YUM:

drbd-0.7.21-1.c4.i386.rpm

kernel-module-drbd-2.6.9-42.EL-0.7.21-1.c4.i686.rpm (Note: Make sure the module version fits your kernel!)

As soon as I finished searching for dependent RPMS, I was able to install them all in one go, and so I did.

Configuring DRBD:

DRBD was a tricky setup. It would not accept missing destination node, and would require me to actually lie. My /etc/drbd.conf looks as follows (thanks to the great assistance of linux-ha.org):

resource web {
protocol C;
incon-degr-cmd “echo ‘!DRBD! pri on incon-degr’ | wall ; sleep 60 ; halt -f”; #Replace later with halt -f
startup { wfc-timeout 0; degr-wfc-timeout 120; }
disk { on-io-error detach; } # or panic, …
syncer {
group 0;
rate 80M; #1Gb/s network!
}
on p800old {
device /dev/drbd0;
disk /dev/VolGroup00/drbd-src;
address 1.2.3.4:7788; #eth0 network address!
meta-disk /dev/VolGroup00/drbd-meta[0];
}
on node2 {
device /dev/drbd0;
disk /dev/sda1;
address 192.168.99.2:7788; #eth0 network address!
meta-disk /dev/sdb1[0];
}
}

I have had two major problems with this setup:

1. I had no second node, so I left this “default” as the 2nd node. I never did expect to use it.

2. I had no free space (non-partitioned space) on my disk. Lucky enough, I tend to install Centos/RH using the installation defaults unless some special need arises, so using the power of the LVM, I have disabled swap (swapoff -a), decreased its size (lvresize -L -500M /dev/VolGroup00/LogVol01), created two logical volumes for DRBD meta and source (lvcreate -n drbd-meta -L +128M VolGroup00 && lvcreate -n drbd-src -L +300M VolGroup00), reformatted the swap (mkswap /dev/VolGroup00/LogVol01), activated the swap (swapon -a) and formatted /dev/VolGroup00/drbd-src (mke2fs -j /dev/VolGroup00/drbd-src). Thus I have now additional two volumes (the required minimum) and can operate this setup.

Solving the space issue, I had to start DRBD for the first time. Per Linux-HA DRBD Manual, it had to be done by running the following commands:

modprobe drbd

drbdadm up all

drbdadm — –do-what-I-say primary all

This has brought the DRBD up for the first time. Now I had to turn it off, and concentrate on Heartbeat:

drbdadm secondary all

Heartbeat settings were as follow:

/etc/ha.d/ha.cf:

use_logd on #?Or should it be used?
udpport 694
keepalive 1 # 1 second
deadtime 10
initdead 120
bcast eth0
node p800old #`uname -n` name
crm yes
auto_failback off #?Or no
compression bz2
compression_threshold 2

I have also created a relevant /etc/ha.d/haresources, although I’ve never used it (this file has no importance when using “crm=yes” in ha.cf). I did, however, use it as a source for /usr/lib/heartbeat/haresources2cib.py:

p800old IPaddr::1.2.3.10/8/1.255.255.255 drbddisk::web Filesystem::/dev/drbd0::/mnt::ext3 httpd

It is clear that the virtual IP will be 1.2.3.10 in my class A network, and DRBD would have to go up before mounting the storage. After all this, the application would kick in, and would bring up my web page. The application, Apache, was modified beforehand to use the IP 1.2.3.10:80, and to search for DocumentRoot in /mnt

Running /usr/lib/heartbeat/haresources2cib.py on the file (no need to redirect output, as it is already directed to /var/lib/heartbeat/crm/cib.xml), and I was ready to go.

/etc/init.d/heartbeat start (while another terminal is open with tail -f /var/log/messages), and Heartbeat is up. It took it few minutes to kick the resources up, however, I was more than happy to see it all work. Cool.

The logic is quite simple, the idea is very basic, and as long as the system is being managed correctly, there is no reason for it to get to a dangerous state. Moreover, since we’re using DRBD, Split Brain cannot actually endanger the data, so we get compensated for the price we might pay, performance-wise, on a real two-node HA environment following these same guidelines.

I cannot express my gratitude to http://www.linux-ha.org, which is the source of all this (adding up with some common sense). Their documents are more than required to setup a full working HA environment.

Setting up an AIX HA-CMP High Availability test Cluster

Tuesday, July 4th, 2006

This post will be divided into this common view part, and (the first in this blog) "click here for more" part.

The main reason I’ve created this blog was to document, both for myself and other technical persons, the acts required to perform set tasks.

My first idea was to document how to install HACMP on AIX. For those of you who do not know what I’m talking about, HACMP is a general-purpose high availability cluster made by IBM, which can work on AIX, and if I’m not mistaken, on other platforms as well. It is based, actually, on a large set of "event" scripts, which run in a predefined order.

Unlike other HA clusters, this is a fixed-order cluster. You can bring your application up after the disk has been up, and after IP has been up. You cannot change this predefined order, and you have no freedom to set this order. Unless.

Unless you create custom scripts, and use them as pre-event and post-event, naming them correctly and putting them in the right directories.

This is not an easy cluster to manage. It has no flashy features, it is not versatile like other HA Clusters are (VCS being the best one, to my opinion, and MSCS, despite its tendency to reach race conditions, is quite versatile itself).

It is a hard HA Cluster, for a hard working people. It is meant for a single method of operation, and for a single track of mind. It is rather stable, if you don’t go around adding volumes to VGs (know what you want before you do it.

Below is a step by step list of actions to do, based on my work experience. I’ve brought up two clusters (four nodes) while actually doing copy-paste into a text document of every action done by me, any package installed, etc.

It is meant for test purposes. It is not as HA as it could be (using the same network infrastructure), it doesn’t employ all heart-bit connections it might have had – it’s meant for lab purposes, and has done quite well there.

It was installed on P5 machines, P510, if I’m not mistaken, using FastT200 (single port) for shared storage (single logical drive, small size – about 10GB), with Storage Manager 8.2 and adequate firmware.

RedHat Cluster, and some more

Sunday, February 12th, 2006

It’s been a long while since I’ve written. I get to have, once a while, a period of time dedicated for laziness. I’ve had just one of these for the last few weeks, in which I’ve been almost completely idle. Usually, waking up from such idle time is a time dedicated to self studies and hard work, so I don’t fight my idle periods too hard. This time, I’ve had the pleasure of testing and playing, for personal reasons, both with VMWare GSX, in a “Cluster-In-a-Box” setup, based on recommendation regarding MSCS, altered for Linux (and later, Veritas Cluster Service) and both with RedHat Cluster Server, with the notion of playing with RedHat’s GFS, but, regrettably, without the last.

First, VMware. In their latest rivalty with Microsoft over the issue of Virtualization of servers and desktops, MS has gained an advantage lately. Due to the lower prices of “Virtual Server 2005″, comparing with “VMware GSX Server”, and due to their excellent marketing system (from which we should all learn, if I may say!), Not a few servers and virtual server farms, especially the ones running Windows/Windows setups, had moved to this MS solution, which is as capable as VMware GSX Server. Judging by the history of such rivalries, MS would have won. They always have. However, VMware, in an excellent move, has announced that the next generation of their GSX, simply called “Server”, would be for free. Free for everyone. In this they probably mean to invest more in their more robust ESX server, and give the GSX as a taste of their abilities. While MS do not have any more advanced product than their Virtual Server, it could mean a death blow to their effort in this direction. It could even mean they will just give away their product! While this will happen, we, the customers, will earn a selection of free, advanced and reliable products designed for virtualization. Could it be any better than that?

One more advantage of this “Virtualization for the People” is that community based virtual images, of even the most complicated to install setups can and would be widely available. Meaning to shorten installation time, and allow for a quick working system for everyone. It will require, however, better knowledge and understanding of the products themselves, as merely installing them will not be enough. To survive the future market, you won’t be able to just sell an installation of a product, but should be able to support an out-of-the-box setup of it. That’s for the freelances, and the partially freelances of us…

So, I’ve reinstalled my GSX, and started playing with it. The original goal was to actually run a working setup of RHEL, VCS and Oracle 10g. Unfortunately, VCS supports only RH3 (update 2?), and not RH4, which was a shame. At that point, I’ve considered using RH Cluster Server for the task at hand. It grew to the task of learning this cluster server, and nothing more, which I did, and I can and would share my concepts about it here.

First – Names – I’ve had the pleasure of working with numerous cluster solutions. I’m thrilled each time I get to play with another cluster solution the naming conventions, and name changes vendors do, just to keep themselves unique. I hate it. So here’s a little explanation:
All clusters contain a group of resources (Resource Group, as most vendors call them). This group contains a set of resources, and in some cases, relations (order of startup, dependencies, etc). Each resource could be any single element required for an application. Example – Resource could be an IP address, which without you won’t be able to contact the application. Resource could be a disk device, containing the application’s data. It could be an application start/stop script, and it could be a sub-application – an application required for the whole group to be up, such as a DB for DB driven web server. The order you would ask them to start would be IP, disk, DB, web server (in our case). You’d ask the IP to be brought up first because some of the cluster servers can trick an IP based clients into some delay, so the client hardly feels the short downtime of application failover. But this is for later. So, in a resource group, we have resources. If we can separate resources into different groups, if they have no required dependency between them, it is always better to do so. In our previous example, lets say our web server uses the DB, but it contacts it using IP address, or using hostname. In this case, we won’t need the DB to run on the same physical machine the web server is using, and in such a case, assuming the physical disk holding the DB and the one holding the rest of the web application are not the same disk, we could separate them.

The idea, if I can try to sum it up, is to split your application to the smallest self-maintained structures. Each structure will be called resource group, and each component in this structure is a resource. On some cluster servers, one could group and set dependencies between resource groups, which allows for even more scalability, but that is not our case.

So we had resource groups containing resources. Each computer, a member in the cluster, is called a node. Now, let’s assume our cluster containing three nodes, but we want our application (our resource group) to be able to run on only two specific? In this case, we need to define, for our resource group, which nodes are to be associated with it. In RH Cluster Server, a thing called “Domain” is designed for it. This Domain containes a list of nodes. This Domain can be associated with Resource Group, and thus set priority of failover, and set the group of nodes allowed to deal with the resource group.

All clusters have a single point of error (unlike failure). The whole purpose of the cluster is to allow for non-cluster-aware application the high-availability you could expect for a (relatively) low price. We’re great – we know how to bring an application up, we know how to bring it down. We can assume when the other node(s) is/are down. We cannot be sure of it. We try. We demand few means of communication, so that one link failure won’t cause us to corrupt our shared volumes (by trying multiple access into them). We set a whole system of logic, a heartbit, just name it, to avoid, at almost all cost, a status of split-head – two cluster nodes believing they are the only ones up. You can guess what it means, right?

In RH, there is a heartbit, sure. However, it is based on bonding, in the event of more than one NIC, and not on separated infrastructures. It is a simple network-based HB, with nothing special about it. In case of loss of connection, it would have reset the inactive node, if it saw fit, using a mechanism they call “Fence”. A “Fence” is a system by which the cluster can *know* for sure (or almost for sure) a node has been down, or the cluster can physically take a node down (poweroff if needs), such as control of the UPS the node is connected to, or its power switch, or alternate monitoring infrastructure, such as the Fibre Channel Switch, etc. In such an event, the cluster can know for sure, or can assume, at least, that the hung node has been reset, or it can force it to reset, to release some hung application.

Naming – Resource group is called Service. Resource remains resource, but an application resource *must* be defined by an rc-like script, which accepts start/stop (/restart?). Nothing complicated to it, really. The service contains all required resources.

I was not happy with the cluster, if I can sum up my issues with it. Monitoring machines (nodes) it did correctly, but in the simple enough example I’ve chosen to setup, using apache as a resource (only afterwards I’ve noticed it to be the example RedHat used in their documentation) it failed miserably to take the correct action when an application failed (unlike a failure of a node). I’ve defined my “Service” to contain the following three items:

1) IP Address – Unique for my testing purposes.

2) Shared partition (in my case, and thanks to VMware, /dev/sdb1, mounted at /var/www/html)

3) The Apache application – “/etc/init.d/httpd”

All in all, it was brought up correctly, and switch-over went just fine, including in a case of correct and incorrect reset of the active/passive node, however, when I’ve killed my apache (killall httpd), the cluster detected failure in the application, but was helpless with it. I was unable to bring down the “Service”, as it failed to turn off Apache (duh!), so it did not release neither the IP address, nor the shared volume. In so doing, I’ve had to restart the service rgmanager on both nodes, after manual removal of the remains of the “Service”. I didn’t like it. I expect the cluster to notice failure in the application, which it did, but I expect it to either try to restart the application (/etc/init.d/httpd stop && /etc/init.d/httpd start) before it fails completely, or to set a flag saying it is down, remove the remains of the “Service” from the node in question (release the IP address and the shared storage), and try to bring it up on the other node(s). It did nothing of the likes. It just failed, completely, and required manual intervention.

I expect HA-Cluster to be able to react to an application or resource failure, and not just to a node failure. Since HA-Clusters are meant for the non-ideal world, a place where computers crash, where hardware failures occure, and where applications just die, while servers remain working, I expect the Cluster Server to be able to handle the full variety of problems, but maybe i was expecting too much. I believe it to be better in their future versions, and I believe it could have been done quite easily right now, as long as detection of the failed application occurred, which it has, but it’s not for me to define the cluster’s abilities. This cluster is not mature enough for real-life production sites, if and only because of its failure to react correctly to a resource failure, without demanding manual intervention. A year from now, I’ll probably recommend it as a cheap and reliable solution for most common HA-related tasks, but not today.

That leaves me with VCS and Oracle, which I’ll deal with in the future, wether I like it or not :-)