Posts Tagged ‘cluster’

Tips and tricks for Redhat Cluster

Saturday, January 31st, 2009

Redhat Cluster is a nice HA product. I have been implementing it for a while now, lecturing about it, and yes – I like it. But like any other software product, it has a few flaws and issues which you should take into consideration – especially when you create custom “agents” – plugins which control (start/stop/status) your 3rd-party application.

I want to list several tips and good practices which will help you create your own agent or custom script, and will help you sleep better at night.

Nighty Night: Sleeping is easier when your cluster is quiet. That usually means you don’t want the cluster to suddenly fail over during the night – or, for that matter, unexpectedly at any other hour.
Below are some tips to help you sleep better, or to perform an easier postmortem of any cluster failure.

Chop the Logs: Since RedHat Cluster logging might be hidden and filled with lots of irrelevant information, make your agents nice about it. Have them log the result of running “status”, “stop” or even “start” somewhere useful. Of course – either recycle the output logs, or rotate them away. You could use

exec &>/tmp/my_script_name.out

much like HACMP does (or at least – behaves as if it does). You can also use a specific logging facility for the different subsystems of the cluster (cman, rg, qdiskd).
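For example, a minimal logging sketch for an agent – the log path and the one-generation rotation are my own choices, not cluster requirements:

```shell
#!/bin/bash
# Agent-logging sketch: keep the current run and the previous one.
# The log path is an arbitrary example - adjust to your environment.
LOG=/tmp/my_agent.out

log_run () {
  # Rotate the previous run away, then capture everything from this run
  [ -f "$LOG" ] && mv -f "$LOG" "$LOG.old"
  (
    exec &>"$LOG"      # all output inside this subshell lands in the log
    echo "$(date): agent invoked with action: ${1:-none}"
  )
}

log_run status
```

Running `exec` inside a subshell keeps the redirection local, so the rest of the agent can still talk to the cluster normally.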

Mind the Gap: Don’t trust unknown scripts or applications’ return codes. Your cluster will fail miserably if a script or file you expect to run is not there. Do not automatically assume that the vmware script, for example, will return sane values. Check the return codes and decide how to respond accordingly.
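A defensive sketch of that idea – the wrapper path below is hypothetical; the point is to check that the script exists before calling it, and to trust only an explicit zero:

```shell
#!/bin/bash
# Defensive wrapper sketch. THIRD_PARTY is a hypothetical path - point it
# at the real vmware (or other 3rd-party) control script.
THIRD_PARTY="/usr/local/bin/vmware-control"

run_checked () {
  local action="$1"
  if [[ ! -x "$THIRD_PARTY" ]]; then
    echo "ERROR: $THIRD_PARTY is missing or not executable" >&2
    return 1                  # fail loudly, don't let the cluster guess
  fi
  "$THIRD_PARTY" "$action"
  local rc=$?
  if [[ "$rc" -ne 0 ]]; then
    echo "ERROR: $action returned $rc" >&2
    return 1                  # anything but 0 is treated as failure
  fi
  return 0
}

run_checked status || echo "status check failed - handle it, do not crash"
```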

Speak the Language: A service in RedHat Cluster is a combination of one or more resources. This can be somewhat confusing, as we tend to refer to a (system) service as a resource. Use the correct lingo. I will try to do just that in this document, so heed the difference between a “service” and a “system service” – the latter can be a cluster resource.

Divide and Conquer: Split your services into the minimal sets of resources possible. If your service consists of hundreds of resources, a failure in any one of them could cause the entire service to restart, taking down all the other, working, resources. If you keep each service small, you actually protect yourself.
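As a sketch (all resource and service names here are invented for illustration), two small services survive a single resource failure better than one monolith:

```
<!-- Hypothetical cluster.conf fragment: two small services instead of one -->
<service autostart="1" domain="my_domain" name="db_srv">
  <fs ref="db_data"/>
  <script ref="db_script"/>
</service>
<service autostart="1" domain="my_domain" name="web_srv">
  <fs ref="web_data"/>
  <script ref="web_script"/>
</service>
```

If `web_script` fails its status check here, only `web_srv` restarts; the database service is left alone.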

Trust No One: To stress the “Mind the Gap” point above – don’t trust 3rd-party scripts or applications to return a correct error code. Don’t trust their configuration files, and don’t trust the users to “do it right”. They will not. Make your own services as fault-protected as possible. Don’t crash because some stupid user (or a stupid administrator – to a cluster implementer both are the same, right?) used incorrect input parameters, or kept an important configuration file under a different name than the one required.
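A sketch of that fault-protection – the accepted actions and the config path are examples, not anything RedHat Cluster mandates:

```shell
#!/bin/bash
# Input-validation sketch for a custom agent. The accepted actions and the
# default config path are examples - adjust to your application.
validate_args () {
  local action="$1"
  local config="${2:-/etc/myapp/myapp.conf}"

  case "$action" in
    start|stop|status) ;;                      # only known actions pass
    *) echo "Usage: agent {start|stop|status} [config]" >&2
       return 2 ;;
  esac

  if [[ ! -r "$config" ]]; then
    echo "ERROR: configuration file $config is missing or unreadable" >&2
    return 1                                   # refuse to run half-configured
  fi
  return 0
}

validate_args bogus 2>/dev/null || echo "bad action rejected cleanly"
```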

I have some special things I want to do with regard to RedHat Cluster Suite. Stay tuned 🙂

Protect Vmware guest under RedHat Cluster

Monday, November 17th, 2008

Most documentation on the net is about how to run a cluster-in-a-box under Vmware. Very few seem to care about protecting Vmware guests under a real RedHat cluster with shared storage.

This article is about just that. While I would not recommend using Vmware in such a setup, this was the case at hand, and the Vmware guest actually resides on the shared storage. Relocating it was out of the question, so migrating it together with the other resources was the only valid option.

To do so, I have created a simple script which accepts start/stop/status arguments. The Vmware guest VMX path is hard-coded into the script, but in an easy-to-change format. The script first attempts to freeze (suspend) the Vmware guest, and only if that fails, shuts it down. Mind you that the blog’s HTML formatting might turn quotation marks into UTF-8 marks which will not be understood by the shell.

#!/bin/bash
# This script will start/stop/status a VMware machine
# Written by Ez-Aton

# Hardcoded. Change to match your own settings!
VMWARE="/export/vmware/hosts/Windows_XP_Professional/Windows XP Professional.vmx"
# The vmrun path and the suspend timeout (in seconds) are assumptions -
# adjust them to your environment
VMRUN="/usr/bin/vmrun"

function status () {
  # This function will return success if the VM is up
  $VMRUN list | grep "$VMWARE" &>/dev/null
  if [[ "$?" -eq 0 ]]; then
    echo "VM is up"
    return 0
    echo "VM is down"
    return 1

function start () {
  # This function will start the VM
  $VMRUN start "$VMWARE"
  if [[ "$?" -eq 0 ]]; then
    echo "VM is starting"
    return 0
    echo "VM failed"
    return 1

function stop () {
  # This function will stop the VM
  $VMRUN suspend "$VMWARE"
  for i in `seq 1 $TIMEOUT`; do
    # We are done once the VM no longer shows up in the list
    if ! status &>/dev/null; then
      echo "VM Stopped"
      return 0
    sleep 1
  # Suspend did not complete in time - fall back to a soft power-off
  $VMRUN stop "$VMWARE" soft
  return $?

case "$1" in
  start)
    start
    RET=$?
  stop)
    stop
    RET=$?
  status)
    status
    RET=$?
    echo "Usage: $0 {start|stop|status}"
    RET=1

exit $RET

Since the formatting is killed by the blog, you can find the script here: vmware1

I intend on building a “real” RedHat Cluster agent script, but this should do for the time being.


Raw devices for Oracle on RedHat (RHEL) 5

Tuesday, October 21st, 2008

There is major confusion among DBAs regarding how to set up raw devices for Oracle RAC or Oracle Clusterware. This confusion is caused by the change RedHat made in how raw devices are defined.

Raw devices are actually a manifestation of character devices pointing to block devices. Character devices are non-buffered, so they act as a FIFO and have no OS cache, which is why Oracle likes them so much for Clusterware CRS and voting.

On other Unix flavors, there are commonly two device nodes for each disk device – a block device (e.g. /dev/dsk/c0d0t0s1) and a character device (e.g. /dev/rdsk/c0d0t0s1). This is not the case on Linux, and thus a special “raw” (character) device has to be defined for each partition we want to participate in the cluster, either as a CRS or a voting disk.

On RHEL4, raw devices were set up easily using the simple and coherent file /etc/sysconfig/rawdevices, which included an internal example. On RHEL5 this is no longer the case, and you are required to customize the udev subsystem, in a rather less documented manner.

Check out the source of this information, at this entry about raw devices. I will add it here, anyhow, with a slight explanation:

1. Add to /etc/udev/rules.d/60-raw.rules:

ACTION=="add", KERNEL=="sdb1", RUN+="/bin/raw /dev/raw/raw1 %N"

2. To set permission (optional, but required for Oracle RAC!), create a new /etc/udev/rules.d/99-raw-perms.rules containing lines such as:

KERNEL=="raw[1-2]", MODE="0640", GROUP="oinstall", OWNER="oracle"

Notice this:

  1. The raw-perms.rules file name has to begin with the number 99, which defines its order when the rules are applied, so that it is used after all the other rules take place. Using lower numbers might leave the permissions incorrect.
  2. The following permissions have to apply:
  • OCR Device(s): root:oinstall , mode 0640
  • Voting device(s): oracle:oinstall, mode 0666
  • You don’t have to use raw devices for ASM volumes on Linux, as the ASMLib library is very effective and easier to manage.
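Putting both rules together – a sketch assuming /dev/sdb1 holds the OCR and /dev/sdb2 the voting disk (device names are my example; the modes follow the list above):

```
# /etc/udev/rules.d/60-raw.rules
ACTION=="add", KERNEL=="sdb1", RUN+="/bin/raw /dev/raw/raw1 %N"
ACTION=="add", KERNEL=="sdb2", RUN+="/bin/raw /dev/raw/raw2 %N"

# /etc/udev/rules.d/99-raw-perms.rules
KERNEL=="raw1", MODE="0640", GROUP="oinstall", OWNER="root"    # OCR
KERNEL=="raw2", MODE="0666", GROUP="oinstall", OWNER="oracle"  # voting
```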

    RedHat 4 working cluster (on VMware) config

    Sunday, November 11th, 2007

    I have been struggling with RH Cluster 4 with a VMware fencing device. This was also good experience with qdiskd, the disk quorum daemon, and its utilization. I have several conclusions from this experience. First, the configuration, as is:

    <?xml version="1.0"?>
    <cluster alias="alpha_cluster" config_version="17" name="alpha_cluster">
      <quorumd interval="1" label="Qdisk1" min_score="3" tko="10" votes="3">
        <heuristic interval="2" program="ping vm-server -c1 -t1" score="10"/>
      </quorumd>
      <fence_daemon post_fail_delay="0" post_join_delay="3"/>
      <clusternodes>
        <clusternode name="clusnode1" nodeid="1" votes="1">
          <multicast addr="" interface="eth0"/>
          <fence>
            <method name="1">
              <device name="vmware"/>
            </method>
          </fence>
        </clusternode>
        <clusternode name="clusnode2" nodeid="2" votes="1">
          <multicast addr="" interface="eth0"/>
          <fence>
            <method name="1">
              <device name="vmware"/>
            </method>
          </fence>
        </clusternode>
      </clusternodes>
      <cman>
        <multicast addr=""/>
      </cman>
      <fencedevices>
        <fencedevice agent="fence_vmware" ipaddr="vm-server" login="cluster"
          name="vmware" passwd="clusterpwd"/>
      </fencedevices>
      <rm>
        <failoverdomains>
          <failoverdomain name="cluster_domain" ordered="1" restricted="1">
            <failoverdomainnode name="clusnode1" priority="1"/>
            <failoverdomainnode name="clusnode2" priority="1"/>
          </failoverdomain>
        </failoverdomains>
        <resources>
          <fs device="/dev/sdb2" force_fsck="1" force_unmount="1" fsid="62307"
            fstype="ext3" mountpoint="/mnt/sdb1" name="data"
            options="" self_fence="1"/>
          <ip address="" monitor_link="1"/>
          <script file="/usr/local/" name="My_Script"/>
        </resources>
        <service autostart="1" domain="cluster_domain" name="Test_srv">
          <fs ref="data">
            <ip ref="">
              <script ref="My_Script"/>
            </ip>
          </fs>
        </service>
      </rm>
    </cluster>

    Several notes:

    1. You should run mkqdisk -c /dev/sdb1 -l Qdisk1 (or whatever the device of your quorum disk is)
    2. qdiskd should be added to the chkconfig db (chkconfig --add qdiskd)
    3. qdiskd’s start order should be changed from 22 to 20, so it precedes cman
    4. Changes to fence_vmware according to the past directives, including Yoni’s comment for RH4
    5. A change in structure: instead of using two fence devices, I use only one fence device, but with different “ports”. A port is translated to “-n” in fence_vmware, just as it is translated to “-n” in fence_brocade – fenced does the translation
    6. lock_gulmd should be turned off using chkconfig

    A little about command-line version change:

    When you update the cluster.conf file, it is not enough to update ccsd using “ccs_tool update /etc/cluster/cluster.conf” – you also need to understand that cman is still on the older version. Using “cman_tool version -r <new version>”, you can force it to allow other nodes to join after a reboot, when they are using the latest config version. If you fail to do so, other nodes might be rejected.
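    The mechanics of the version bump can be sketched like this (shown against a scratch copy so it can run anywhere; on a real node you would edit /etc/cluster/cluster.conf itself, and the last two commands need a live cluster):

```shell
#!/bin/bash
# Sketch: bump config_version in cluster.conf before pushing it.
# Uses a scratch copy with a one-line stand-in for the real file.
CONF=/tmp/cluster.conf
echo '<cluster alias="alpha_cluster" config_version="17" name="alpha_cluster"/>' > "$CONF"

OLD=$(sed -n 's/.*config_version="\([0-9]*\)".*/\1/p' "$CONF")
NEW=$((OLD + 1))
sed -i "s/config_version=\"$OLD\"/config_version=\"$NEW\"/" "$CONF"

# On a real cluster node you would now run:
#   ccs_tool update /etc/cluster/cluster.conf
#   cman_tool version -r $NEW
grep -o 'config_version="[0-9]*"' "$CONF"
```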

    I will add additional information as I move along.

    Single-Node Linux Heartbeat Cluster with DRBD on Centos

    Monday, October 23rd, 2006

    The trick is simple, and many of those who deal with HA clusters arrive, at least once, at such a setup – an HA cluster without the HA.

    Yep. Single node, just to make sure you know how to get this system to play.

    I have just completed it with Linux Heartbeat, and wish to share this example of a single-node cluster setup, with DRBD.

    First – get the packages.

    It took me some time, but following the Linux-HA suggested download link (funny enough, it was the last place I searched) gave me exactly what I needed. I have downloaded the following RPMs:

    I was required to also add the following RPMs:

    I have added the DRBD RPMs, obtained from YUM:

    kernel-module-drbd-2.6.9-42.EL-0.7.21-1.c4.i686.rpm (Note: Make sure the module version fits your kernel!)

    As soon as I finished searching for dependent RPMS, I was able to install them all in one go, and so I did.

    Configuring DRBD:

    DRBD was a tricky setup. It would not accept missing destination node, and would require me to actually lie. My /etc/drbd.conf looks as follows (thanks to the great assistance of

    resource web {
      protocol C;
      incon-degr-cmd "echo '!DRBD! pri on incon-degr' | wall ; sleep 60 ; halt -f"; # Replace later with halt -f
      startup { wfc-timeout 0; degr-wfc-timeout 120; }
      disk { on-io-error detach; } # or panic, ...
      syncer {
        group 0;
        rate 80M; # 1Gb/s network!
      }
      on p800old {
        device /dev/drbd0;
        disk /dev/VolGroup00/drbd-src;
        address; # eth0 network address!
        meta-disk /dev/VolGroup00/drbd-meta[0];
      }
      on node2 {
        device /dev/drbd0;
        disk /dev/sda1;
        address; # eth0 network address!
        meta-disk /dev/sdb1[0];
      }
    }

    I have had two major problems with this setup:

    1. I had no second node, so I left this “default” as the 2nd node. I never did expect to use it.

    2. I had no free (non-partitioned) space on my disk. Luckily, I tend to install Centos/RH using the installation defaults unless some special need arises, so using the power of LVM, I disabled swap (swapoff -a), decreased its size (lvresize -L -500M /dev/VolGroup00/LogVol01), created two logical volumes for the DRBD meta and source (lvcreate -n drbd-meta -L +128M VolGroup00 && lvcreate -n drbd-src -L +300M VolGroup00), reformatted the swap (mkswap /dev/VolGroup00/LogVol01), activated it (swapon -a), and formatted /dev/VolGroup00/drbd-src (mke2fs -j /dev/VolGroup00/drbd-src). I thus had the two additional volumes (the required minimum) and could operate this setup.

    Having solved the space issue, I had to start DRBD for the first time. Per the Linux-HA DRBD manual, this had to be done by running the following commands:

    modprobe drbd

    drbdadm up all

    drbdadm -- --do-what-I-say primary all

    This has brought the DRBD up for the first time. Now I had to turn it off, and concentrate on Heartbeat:

    drbdadm secondary all

    Heartbeat settings were as follows:


    use_logd on #?Or should it be used?
    udpport 694
    keepalive 1 # 1 second
    deadtime 10
    initdead 120
    bcast eth0
    node p800old #`uname -n` name
    crm yes
    auto_failback off #?Or no
    compression bz2
    compression_threshold 2

    I have also created a relevant /etc/ha.d/haresources, although I’ve never used it (this file has no importance when using “crm yes” in). I did, however, use it as a source for /usr/lib/heartbeat/

    p800old IPaddr:: drbddisk::web Filesystem::/dev/drbd0::/mnt::ext3 httpd

    It is clear that the virtual IP will be in my class A network, and that DRBD has to go up before mounting the storage. After all this, the application kicks in and brings up my web page. The application, Apache, was modified beforehand to use the virtual IP, and to search for its DocumentRoot in /mnt.

    Running /usr/lib/heartbeat/ on the file (no need to redirect output, as it is already directed to /var/lib/heartbeat/crm/cib.xml), and I was ready to go.

    /etc/init.d/heartbeat start (while another terminal is open with tail -f /var/log/messages), and Heartbeat is up. It took a few minutes to kick the resources up; however, I was more than happy to see it all work. Cool.

    The logic is quite simple, the idea is very basic, and as long as the system is being managed correctly, there is no reason for it to get to a dangerous state. Moreover, since we’re using DRBD, Split Brain cannot actually endanger the data, so we get compensated for the price we might pay, performance-wise, on a real two-node HA environment following these same guidelines.

    I cannot express my gratitude to, which is the source of all this (adding up with some common sense). Their documents are more than required to setup a full working HA environment.