Posts Tagged ‘cluster’

Redhat Cluster and Citrix XenServer

Thursday, April 9th, 2015

I wanted to write down a guide for RHCS on RHEL/Centos6 and XenServer.

If you want to do that, you need to go through two major challenges which you will encounter. I want to save on the search and sum it all up together here.

The first difficulty is the shared disk. In order to set up most common cluster scenarios, you will need a shared storage. You could either map the VMs to an iSCSI LUNs external to the environment, however, if you do not have such infrastructure (either because everything is based on SAS/FC, or you do not have the ability to set up iSCSI storage with reasonable level of availability), you will want XenServer to allow you to share the VDI between two VMs.

In order to do so, you will need to add a flag to all your pool’s XenServers, and to create the VDI in a specific method. First – the flag – you need to create a file in /etc/xensource called “allow_multiple_vdi_attach”. Do not forget to add it to all your XenServers:

touch /etc/xensource/allow_multiple_vdi_attach

Next, you will need to create your VDI as “raw” type. This is an example. You need to change the SR UUID to the one you use:

xe vdi-create sm-config:type=raw sr-uuid=687a023b-0b20-5e5f-d1ef-3db777ce7ae4 name-label=”My Raw LVM VDI” virtual-size=8GiB type=user

You can find Citrix article about it here.

Following that, you can complete your cluster setup and configuration. I will not add details about it here, as this is not the focus of this article. However, when it comes to fencing, you will need a solution. The solution I used was a fencing agent which was written specifically for XenServer using XenAPI, by using the agent called fence-xenserver. I did not use the fencing agents repository (which this page also points to), because I was unable to compile the required components to run on Centos6. They just don’t compile well. This is, however, a simple Python script which actually works.

In order to make it work, I did the following:

  • Extracted the archive (version 0.8)
  • Placed fence_cxs* in /usr/sbin, and removed their ‘.py’ suffix
  • Placed XenAPI.py as-is in /usr/sbin
  • Verified /usr/sbin/fence_cxs* had execution permissions.

Now, I needed to add it to the cluster configuration. Since the agent cannot handle accessing a non-pool master, it had to be defined for each pool member (I cannot tell in advance which of them is going to have the pool master role when a failover should happen). So, this is my cluster.conf relevant parts:

<fencedevices>
<fencedevice agent=”fence_cxs_redhat” login=”root” name=”xenserver01″ passwd=”password” session_url=”https://xenserver01″/>
<fencedevice agent=”fence_cxs_redhat” login=”root” name=”xenserver02″ passwd=”password” session_url=”https://xenserver02″/>
<fencedevice agent=”fence_cxs_redhat” login=”root” name=”xenserver03″ passwd=”password” session_url=”https://xenserver03″/>
<fencedevice agent=”fence_cxs_redhat” login=”root” name=”xenserver04″ passwd=”password” session_url=”https://xenserver04″/>
</fencedevices>
<clusternodes>
<clusternode name=”clusternode1″ nodeid=”1″>
<fence>
<method name=”xenserver01″>
<device name=”xenserver01″ vm_name=”clusternode1″/>
</method>
<method name=”xenserver02″>
<device name=”xenserver02″ vm_name=”clusternode1″/>
</method>
<method name=”xenserver03″>
<device name=”xenserver03″ vm_name=”clusternode1″/>
</method>
<method name=”xenserver04″>
<device name=”xenserver04″ vm_name=”clusternode1″/>
</method>
</fence>
</clusternode>
<clusternode name=”clusternode2″ nodeid=”2″>
<fence>
<method name=”xenserver01″>
<device name=”xenserver01″ vm_name=”clusternode2″/>
</method>
<method name=”xenserver02″>
<device name=”xenserver02″ vm_name=”clusternode2″/>
</method>
<method name=”xenserver03″>
<device name=”xenserver03″ vm_name=”clusternode2″/>
</method>
<method name=”xenserver04″>
<device name=”xenserver04″ vm_name=”clusternode2″/>
</method>
</fence>
</clusternode>
</clusternodes>

Attached xenserver-fencing-cluster.xml for clarity (WordPress makes a mess out of that)

Note that I used four (4) entries, since my pool has four hosts. Also note the VM name (it is case sensitive), and your methods – one for each host, since you don’t want them running in parallel, but one at a time. Failover time is between 5-15 seconds on my tests, depending on who is the actually pool master (xenserver04 takes the longest, obviously). I did not test it with pool master down (before or without HA kicking in), nor with the hosts down and thus TCP timeout is longer (than when attempting to connect a host which responds immediately that it is not the pool master). However, if ILO fencing takes about 30-60 seconds, I am not complaining about the current timeouts.

Timeout when using Ricci as the backend for Corosync update in Redhat Cluster

Saturday, March 14th, 2015

When using Ricci as the engine for ‘cman_tool version -r’ command, you will experience timeouts (and practically – you will be unable to use ricci to update the cluster configuration across the nodes) when the ricci user password contains XML-sensitive characters, like <>&, etc.
As they say – FYI 🙂

ZFS with Redhat Cluster Suite

Friday, July 25th, 2014

This is a very nice project I have been working on. The hardware at hand - two servers, with a shared SAS bus containing several SAS disks. Since it's a shared bus, no RAID solution would cut it, and as I don't want to waste disks with ASM ("normal" redundancy meaning half the size...), I went to ZFS storage.

ZFS is a wonderful technology, with many advantages, but with some dangerous pitfalls. As I prefer Linux, I did not bother with any Sloaris solutions, and went directly to Centos 6. I will describe my cluster setup below.

I will disclose the entire setup, including hardware layout, Linux platform, ZFS module parameters, the Redhat Cluster Suite ZFS agent I wrote and the cluster.conf configuration file. I will also share my considerations regarding some of the choices I made. In addition, this system was designed to act as NFS storage for Citrix XenServer pool, so I will have to describe the changed I had to perform on the XenServer itself (which might make it unsupported, but I will have to live with it), to allow it to handle the timeouts resulting by server failover.

So first - the servers - each having a single CPU (quad core), 24GB RAM, and dual 1Gb/s NICs. Also - a tiny internal SATA disk is used for the OS. The shared disks - at the moment, 10 SAS disks, dual port (notice - older HP disks might mark in a very small letters that they are only a single-port SAS disks...), 72GB, 10K RPM. Zpool called 'share' with two 5 disks RaidZ1 vdevs. As I mentioned before - ZFS seemed like the best possible option allowing me to achieve my goals at minimal cost.

When I came to this project, I wanted to be able to use a native ZFS cluster agent, and not a 'script' agent, which takes a very long time to respond (30 seconds). Also - I wanted to be able to handle multiple storage pools concurrently - each floating on its own. While I have only one at the moment, I wanted the ability to have a fine-grained control over multiple pools. In addition - I am unable (or unwilling?) to handle the multiple filesystems introduced with each pool. I wanted to be able to import or export the pool silently, and with a clear head, thus I had to verify that the multiple filesystems are not in use as part of the export process.

As an agent, I wanted to comply with Redhat Cluster Suite (RHCS from now on) OCF syntax. I used the supplied fs.sh script as an inspiration for my agent script, so some of it might look familiar. All credit goes to the original authors, of course.

The operating system I selected was Centos 6. Centos is based on Redhat Linux, and I find it mature and stable, which is exactly what I want when I plan a production-ready, enterprise-class storage solution. The version had to be x86_64, due to ZFS requirements, and due to the amount of RAM in the server.

To handle ZFS options, I added a file called /etc/modprobe.d/zfs.conf, with the following content

install zfs /bin/rm -f /etc/zfs/zpool.cache && /sbin/modprobe --ignore-install zfs
options zfs zfs_arc_max=12593790976
options zfs zfs_arc_min=12593790975

I had to verify there is no zpool.cache file. Since my pool was rather small (planned for 24 disks max), I was not concerned by the longer import process caused by not having the zpool.cache file. I was more concerned with automatic import process which might happen, and had to prevent it at almost any cost. In addition, I learned from other systems that the arc memory should never exceed half the RAM, and it should be given just a little under that.

Of course, when changing such module settings, you need to recreate initrd (dracut -f) to be on the safe side later on.

The zfs.sh agent script was placed in /usr/share/cluster directory. You must have rgmanager installed for this directory to exist, and anyhow, without rgmanager, you will have no cluster whatsoever.

This is the contents of the zfs.sh file. Notice that it is not compatible with Luci, so if you're using it - them kids won't play well together.

#!/bin/bash

LC_ALL=C
LANG=C
PATH=/bin:/sbin:/usr/bin:/usr/sbin
export LC_ALL LANG PATH
# Private return codes
FAIL=2
NO=1
YES=0
YES_STR="yes"

. $(dirname $0)/ocf-shellfuncs

meta_data()
{
    cat < EOT

    1.0

	This script will import and export ZFS storage pools
	It will make sure to mount and umount all child filesystems

        This is a ZFS pool

                Symbolic name for this zfs pool

                File System Name

		ZFS Pool name or ID

                ZFS pool name

		ZFS Pool alternate mount

                ZFS pool alternate mount

                If set, the cluster will kill all processes using 
                this file system when the resource group is 
                stopped.  Otherwise, the unmount will fail, and
                the resource group will be restarted.

                Force Unmount

                If set and unmounting the file system fails, the node will
                immediately reboot.  Generally, this is used in conjunction
                with force-unmount support, but it is not required.

                Seppuku Unmount

	
        

        

EOT
}

ocf_log()
{
        echo $*
}

verify_driver() {
	ocf_log info "Verifying ZFS driver"
	lsmod | grep -w zfs > /dev/null >&1 && return 0
	ocf_log err "ZFS driver is not loaded"
	return $OCF_ERR_ARGS
}

verify_poolname() {
	ocf_log info "Verify pool name "
	if [ -z "$OCF_RESKEY_pool" ]
	then
		ocf_log err "Missing pool name"
		return $OCF_ERR_ARGS
	fi
	zpool import | grep pool: | grep -w $OCF_RESKEY_pool > /dev/null 2>&1 && return 0
	ocf_log err "Cannot identify pool name"
	return $OCF_ERR_ARGS
}

verify_mounted_poolname() {
	ocf_log info "Verify pool name "
	if [ -z "$OCF_RESKEY_pool" ]
	then
		ocf_log err "Missing pool name"
		return $OCF_ERR_ARGS
	fi
	zpool list $OCF_RESKEY_pool > /dev/null >&1 && return 0
	ocf_log err "Cannot identify pool name"
	return $OCF_ERR_ARGS
}

verify_mountpath() {
	ocf_log info "Verifying alternate root mount path"
	[ -z "$OCF_RESKEY_mount" ] && return 0
	declare mp="${OCF_RESKEY_mount}"
	case "$mp" in
		/*)    	# found it
                	;;
        	*)      # invalid format
			ocf_log err 
"verify_mountpath: Invalid mount point format (must begin with a '/'): '$mp'"
                return $OCF_ERR_ARGS
                ;;
        esac
}

pool_import() {
	ocf_log info "Importing pool"
	OPTS=""
	[ -n "$OCF_RESKEY_mount" ] && OPTS="-R $OCF_RESKEY_mount"
	zpool import $OCF_RESKEY_pool $OPTS
	RET="$?"
	if [ "$RET" -ne "0" ]
	then
		ocf_log info "Cannot import without applying force"
		zpool import -f $OCF_RESKEY_pool $OPTS
		RET="$?"
	fi
	if [ "$RET" -ne "0" ]
	then
		ocf_log err "Pool import failed for $OCF_RESKEY_pool. error=$RET"
		return 1
	fi
	ocf_log info "Imported ZFS pool"
	return $RET
}

check_and_release_fs() {
	ocf_log info "Checking and releasing FS"
	FS=""
	case ${OCF_RESKEY_force_unmount} in
        $YES_STR|on|true|1)	force_umount=$YES ;;
        *)		        force_umount="" ;;
        esac

	RET=0
	for i in `zfs list -t filesystem | grep ^${OCF_RESKEY_pool} | awk '{print $NF}'`
	do
		# To be on the safe side. Why not?
		sleep 1
		# Is it mounted?
		if ! df -l | grep -w "$i" > /dev/null 2>&1
		then
			ocf_log info "Filesystem $i is not mounted"
			continue
		fi 	
		if [ `lsof $i | wc -l` -gt "0" ]
		then
			ocf_log info "Filesystem $i is in use"
			if [ "$force_umount" ]
			then
				ocf_log info "Attempting to kill processes on $i filesystem"
				fuser -k $i
				sleep 2
				if [ `lsof $i | wc -l` -gt "0" ]
				then
					ocf_log err "Cannot umount filesystem $i - filesystem in use"
					return 1
				fi
			else
				ocf_log err "Cannot umount filesystem $i
 - filesystem in use"
                                return 1
			fi
		fi
	done
	return $RET	
}

self_fence() {
	ocf_log info "Should we validate and call self-fence?"
	case ${OCF_RESKEY_self_fence} in
		$YES_STR|on|true|1)       self_fence=$YES ;;
       		*)              self_fence="" ;;
        esac	

	if [ "$self_fence" ]; then
		ocf_log alert "umount failed - REBOOTING"
               	sync
                reboot -fn
	fi
	return $OCF_ERR_GENERIC
}

pool_export() {
	ocf_log info "Exporting zfs pool"
	check_and_release_fs || self_fence
	zpool export $OCF_RESKEY_pool
	RET="$?"
	if [ "$RET" -ne "0" ]
	then
		ocf_log err "Pool export failed for $OCF_RESKEY_pool. error=$RET"
		return 1
	fi
	return $RET
}

start() {
	ocf_log info "Starting ZFS"
	verify_driver || return $OCF_ERR_ARGS 
	verify_poolname || return $OCF_ERR_ARGS
	verify_mountpath || return $OCF_ERR_ARGS
	pool_import
	# Handle filesystem?
}

stop() {
	ocf_log info "Starting ZFS"
	verify_driver || return $OCF_ERR_ARGS 
	verify_mounted_poolname || return $OCF_ERR_ARGS
	verify_mountpath || return $OCF_ERR_ARGS
	# Handle filesystem?
	pool_export
}

is_imported() {
	ocf_log debug "Checking if $OCF_RESKEY_pool is imported"
	zpool list ${OCF_RESKEY_pool} > /dev/null >&1
	return $?
}

is_alive() {
	ocf_log debug "Checking ZFS pool read/write"
	declare file=".writable_test.$(hostname)"
	declare TIMEOUT="10s"
	[ -z "$OCF_CHECK_LEVEL" ] && export OCF_CHECK_LEVEL=0
	mount_point=`zfs list ${OCF_RESKEY_pool} | grep ${OCF_RESKEY_pool} | awk '{print $NF}'`
	test -d "$mount_point"
        if [ $? -ne 0 ]; then
                ocf_log err "${OCF_RESOURCE_INSTANCE}: is_alive: $mount_point is not a directory"
                return $FAIL
        fi
	[ $OCF_CHECK_LEVEL -lt 10 ] && return $YES

        # depth 10 test (read test)
        timeout -s 9 $TIMEOUT ls "$mount_point" > /dev/null 2> /dev/null
        errcode=$?
        if [ $errcode -ne 0 ]; then
                ocf_log err "${OCF_RESOURCE_INSTANCE}: is_alive: failed read test on [$mount_point]. Return code: $errcode"
                return $NO
        fi

	[ $OCF_CHECK_LEVEL -lt 20 ] && return $YES

        # depth 20 check (write test)
        rw=$YES
        for o in `echo $OCF_RESKEY_options | sed -e s/,/ /g`; do
                if [ "$o" = "ro" ]; then
                        rw=$NO
                fi
        done
	if [ $rw -eq $YES ]; then
                file="$mount_point"/$file
                while true; do
                        if [ -e "$file" ]; then
                                file=${file}_tmp
                                continue
                        else
                                break
                        fi
                done
                timeout -s 9 $TIMEOUT touch $file > /dev/null 2> /dev/null
                errcode=$?
                if [ $errcode -ne 0 ]; then
                        ocf_log err "${OCF_RESOURCE_INSTANCE}: is_alive: failed write test on [$mount_point]. Return code: $errcode"
                        return $NO
                fi
                rm -f $file > /dev/null 2> /dev/null
        fi

	return $YES
}

monitor() {
	ocf_log debug "Checking ZFS pool $OCF_RESKEY_pool, Level $OCF_CHECK_LEVEL"
	verify_driver || return $OCF_ERR_ARGS 
	is_imported
	RET=$?
	if [ "$RET" -ne $YES ]; then
                ocf_log err "${OCF_RESOURCE_INSTANCE}: ${OCF_RESKEY_device} is not mounted on ${OCF_RESKEY_mountpoint}"
                return $OCF_NOT_RUNNING
        fi
	is_alive
	return $RET
}

if [ -z "$OCF_CHECK_LEVEL" ]; then
	OCF_CHECK_LEVEL=0
fi

case $1 in
start)
	ocf_log info "zfs start $OCF_RESKEY_pooln"
	OCF_CHECK_LEVEL=0
	monitor
	[ "$?" -ne "0" ] && start || ocf_log info "$OCF_RESKEY_pool is already mounted"
	exit $?
	;;
stop)
	ocf_log info "zfs stop $OCF_RESKEY_pooln"
	OCF_CHECK_LEVEL=0
	monitor
	[ "$?" -eq "0" ] && stop || ocf_log info "$OCF_RESKEY_pool is not mounted"
	exit $?
	;;
status|monitor)
	ocf_log debug "ZFS monitor $OCF_RESKEY_pool"
	monitor
	exit $?
	;;
meta-data)
	echo -e "zfs metadat $OCF_RESKEY_addressn" >>/tmp/out
	meta_data
	exit 0
	;;
validate-all)
	exit 0
	;;
*)
	echo "usage: $0 {start|stop|status|monitor|restart|meta-data|validate-all}"
	exit $OCF_ERR_UNIMPLEMENTED
	;;
esac

All I had to do now was to build the cluster.conf file.

The reason I placed the IP address as the last to start and the first to stop was that the other way around, the NFS client would receive an ordered disconnection command, and would not bother to establish a connection with the remaining server. Abruptly taking away the clustered IP address causes the NFS clients to initiate a reconnection process, of which the systems are supposed to recover

I have left this article incomplete for a while now. It has some stuff I do like to share, so I am sharing it as-is. I will (some day) complete it.

Oracle Clusterware as a 3rd party HA framework

Friday, June 12th, 2009

Oracle begin to push their Clusterware as a 3rd party HA framework. In this article we will review a quick example of how to do it. I will refer to this post as a quick-guide, as this is by no means any full-scale guide.

This article assumes you have installed Oracle Clusterware following one of the few links and guides available on the net. This quick-guide applies to both Clusterware 10 and Clusterware 11.

We will discuss the method of adding an additional NFS service on Linux.

In order to do so, you will need a shared storage – assuming the goal of the exercise is to supply the clients with a consistent storage services based on NFS. I, for myself, prefer to use OCFS2 as the choice file system for shared disks. This goes well with Oracle Clusterware, as this cluster framework does not handle disk mounts very well, and unless you are to write/search an agent which will make sure that every mount and umount behave correctly (you wouldn’t want to get a file system corruption, would you?), you will probably prefer to do the same. The lack of need to manage the disk mount actions will both save time on planned failover, and will guarantee storage safety. If you have not placed your CRS and Vote on OCFS2, you will need to install OCFS2 from here and here, and then to configure it. We will not discuss OCFS2 configuration in this post.

We will need to assume the following prerequisites:

  • Service-related IP address: 1.2.3.4. Netmask 255.255.255.248. You need this IP to be member of the same class as your public network card is.
  • Shared Storage: Formatted to OCFS2, and mounted on both nodes on /shared
  • Oracle Clusterware installed and working
  • Cluster nodes names are “node1” and “node2”
  • Have $CRS_HOME point to your CRS installation
  • Have $CRS_HOME/bin in your $PATH

We need to create the service-related IP resource first. I would recommend to have an entry in /etc/hosts for this IP address on both nodes. Assuming the public NIC is eth0, The command would be

crs_profile -create nfs_ip -t application -a $CRS_HOME/bin/usrvip -o oi=eth0,ov=1.2.3.4,on=255.255.255.248

Now you will need to set running permissions for the oracle user. In my case, the user name is actually “oracle”:

crs_setperm nfs_ip -o root
crs_serperm nfs_ip -u user:oracle:r-x

Test that you can start the service as the oracle user:

crs_start nfs_ip

Now we need to setup NFS. For this to work, we need to setup the NFS daemon first. Edit /etc/exports and add a line such as this:

/shared *(rw,no_root_sqush,sync)

Make sure that nfs service is disabled during startup:

chkconfig nfs off
chkconfig nfslock off

Now is the time to setup Oracle Clusterware for the task:

crs_profile -create share_nfs -t application -B /etc/init.d/nfs -d “Shared NFS” -r nfs_ip -a sharenfs.scr -p favored -h “node1 node2” -o ci=30,ft=3,fi=12,ra=5
crs_register share_nfs

Deal with permissions:

crs_setperms share_nfs -o root
crs_setperms share_nfs -u user:oracle:r-x

Fix the “sharenfs.scr” script. First, find it. It should reside in $CRS_HOME/crs/scripts if everything is OK. If not, you will be able to find it in $CRS_HOME using find.

Edit the “sharenfs.scr” script and modify the following variables which are defined relatively in the beginning of the script:

PROBE_PROCS=”nfsd”
START_APPCMD=”/etc/init.d/nfs start
START_APPCMD2=”/etc/init.d/nfslock start”
STOP_APPCMD=”/etc/init.d/nfs stop”
STOP_APPCMD2=”/etc/init.d/nfslock stop”

Copy the modified script file to the other node. Verify this script has execution permissions on both nodes.

Start the service as the oracle user:

crs_start sharenfs

Test the service. The following command should return the export path:

showmount -e 1.2.3.4

Relocate the service and test again:

crs_relocate -f sharenfs
showmount -e 1.2.3.4

Done. You now have HA NFS service above Oracle Clusterware framework.

I used this web page as a reference. I thank him for his great work!

Tips and tricks for Redhat Cluster

Saturday, January 31st, 2009

Redhat Cluster is a nice HA product. I have been implementing it for a while now, lecturing about it, and yes – I like it. But like any other software product, it has few flaws and issues which you should take under consideration – especially when you create custom “agents” – plugins to control (start/stop/status) your 3rd party application.

I want to list several tips and good practices which will help you create your own agent or custom script, and will help you sleep better at night.

Nighty Night: Sleeping is easier when your cluster is quiet. It usually means that  you don’t want the cluster to suddenly failover during night time, or – for that matter, during any hour, unexpectedly.
Below are some tips to help you sleep better, or to perform an easier postmortem of any cluster failure.

Chop the Logs: Since RedHat Cluster logging might be hidden and filled with lots of irrelevant information, you want your agents to be nice about it. Let them log out somewhere the result of running “status” or “stop” or even “start”. Of course – either recycle the output logs, or rotate them away. You could use

exec &>/tmp/my_script_name.out

much like HACMP does (or at least – behaves as if it does). You can also use specific logging facility for different subsystems of the cluster (cman, rg, qdiskd)

Mind the Gap: Don’t trust unknown scripts or applications’ return codes. Your cluster will fail miserably if a script or a file you expect to run will not be there. Do not automatically assume that the vmware script, for example, will return normal values. Check the return codes and decide how to respond accordingly.

Speak the Language: A service in RedHat Cluster is a combination of one or more resources. This can be somewhat confusing as we tend to refer to resources as a (system) service. Use the correct lingo.  I will try to do just that in this document, so heed the difference between the terms “service” and “system service”, which can be a cluster resource.

Divide and Conquer: Split your services to the minimal set of resources possible. If your service consists of hundreds of resources failure to one of them could cause the entire service to restart, taking down all other working resources. If you keep it to the minimum, you actually protect yourself.

Trust No One: To stress out the “Mind the Gap” point above – don’t trust 3rd party scripts or applications to return a correct error code. Don’t trust their configuration files, and don’t trust the users to “do it right”. They will not. Try to create your own services as fault-protected as possible. Don’t crash because some stupid user (or a stupid administrator – for a cluster implementer both are the same, right?) used incorrect input parameters, or because he has kept  an important configuration file in a different name than was required.

I have some special things I want to do with regard to RedHat Cluster Suite. Stay tuned 🙂