Posts Tagged ‘Linux’

ZFS with Redhat Cluster Suite

Friday, July 25th, 2014

This is a very nice project I have been working on. The hardware at hand – two servers, with a shared SAS bus containing several SAS disks. Since it’s a shared bus, no RAID solution would cut it, and as I don’t want to waste disks with ASM (“normal” redundancy meaning half the size…), I went to ZFS storage.

ZFS is a wonderful technology, with many advantages, but with some dangerous pitfalls. As I prefer Linux, I did not bother with any Solaris solutions, and went directly to Centos 6. I will describe my cluster setup below.

I will disclose the entire setup, including hardware layout, Linux platform, ZFS module parameters, the Redhat Cluster Suite ZFS agent I wrote and the cluster.conf configuration file. I will also share my considerations regarding some of the choices I made. In addition, this system was designed to act as NFS storage for a Citrix XenServer pool, so I will have to describe the changes I had to perform on the XenServer itself (which might make it unsupported, but I will have to live with it), to allow it to handle the timeouts resulting from server failover.

So first – the servers – each having a single CPU (quad core), 24GB RAM, and dual 1Gb/s NICs. Also – a tiny internal SATA disk is used for the OS. The shared disks – at the moment, 10 SAS disks, dual port (notice – older HP disks might state in very small letters that they are only single-port SAS disks…), 72GB, 10K RPM. The zpool is called ‘share’ and consists of two 5-disk RaidZ1 vdevs. As I mentioned before – ZFS seemed like the best possible option allowing me to achieve my goals at minimal cost.
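
As an illustration only, a pool with this layout could be created along the following lines. The post does not show the command that was actually used, so this is a sketch under assumptions: the helper name build_zpool_create and the disk names are hypothetical, and the helper only prints the command so it can be reviewed before anything is run.

```shell
# Hypothetical helper: print (not run) a "zpool create" command that splits
# the given disks into RaidZ1 vdevs of a fixed width.
# Usage: build_zpool_create POOL DISKS_PER_VDEV DISK...
build_zpool_create() {
    local pool="$1" width="$2" cmd count disk
    shift 2
    cmd="zpool create $pool"
    count=0
    for disk in "$@"; do
        # Start a new raidz1 vdev every $width disks
        if [ $((count % width)) -eq 0 ]; then
            cmd="$cmd raidz1"
        fi
        cmd="$cmd $disk"
        count=$((count + 1))
    done
    echo "$cmd"
}

# Example with placeholder disk names (persistent names are a better
# choice than sdX on a shared bus):
#   build_zpool_create share 5 d0 d1 d2 d3 d4 d5 d6 d7 d8 d9
```

With two RaidZ1 vdevs of five 72GB disks each, one disk per vdev goes to parity, so the raw usable space is roughly 2 x 4 x 72GB = 576GB before ZFS overhead.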

When I came to this project, I wanted to be able to use a native ZFS cluster agent, and not a ‘script’ agent, which takes a very long time to respond (30 seconds). Also – I wanted to be able to handle multiple storage pools concurrently – each floating on its own. While I have only one at the moment, I wanted the ability to have a fine-grained control over multiple pools. In addition – I am unable (or unwilling?) to handle the multiple filesystems introduced with each pool. I wanted to be able to import or export the pool silently, and with a clear head, thus I had to verify that the multiple filesystems are not in use as part of the export process.

As an agent, I wanted to comply with Redhat Cluster Suite (RHCS from now on) OCF syntax. I used the supplied fs.sh script as an inspiration for my agent script, so some of it might look familiar. All credit goes to the original authors, of course.

The operating system I selected was Centos 6. Centos is based on Redhat Linux, and I find it mature and stable, which is exactly what I want when I plan a production-ready, enterprise-class storage solution. The version had to be x86_64, due to ZFS requirements, and due to the amount of RAM in the server.

To handle ZFS options, I added a file called /etc/modprobe.d/zfs.conf, with the following content

install zfs /bin/rm -f /etc/zfs/zpool.cache && /sbin/modprobe --ignore-install zfs
options zfs zfs_arc_max=12593790976
options zfs zfs_arc_min=12593790975

I had to verify there is no zpool.cache file. Since my pool was rather small (planned for 24 disks max), I was not concerned by the longer import process caused by not having the zpool.cache file. I was more concerned with automatic import process which might happen, and had to prevent it at almost any cost. In addition, I learned from other systems that the arc memory should never exceed half the RAM, and it should be given just a little under that.
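
The "a little under half the RAM" rule can be turned into a small calculation. The following is a sketch, not the way the values above were derived: the function names and the default 256MB margin are assumptions chosen for illustration.

```shell
# Hypothetical helper: suggest a zfs_arc_max value (in bytes) of half the
# RAM minus a safety margin.
# Usage: arc_max_bytes MEMTOTAL_KB [MARGIN_MB]
arc_max_bytes() {
    local mem_kb="$1" margin_mb="${2:-256}"
    echo $(( mem_kb * 1024 / 2 - margin_mb * 1024 * 1024 ))
}

# MemTotal as reported by the kernel, in kB
mem_total_kb() {
    awk '/^MemTotal:/ { print $2 }' /proc/meminfo
}

# Example, printing a line suitable for /etc/modprobe.d/zfs.conf:
#   echo "options zfs zfs_arc_max=$(arc_max_bytes "$(mem_total_kb)")"
```

The margin is a matter of taste. The point is only that the ARC limit stays below half of the physical memory.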

Of course, when changing such module settings, you need to recreate initrd (dracut -f) to be on the safe side later on.

The zfs.sh agent script was placed in /usr/share/cluster directory. You must have rgmanager installed for this directory to exist, and anyhow, without rgmanager, you will have no cluster whatsoever.

This is the contents of the zfs.sh file. Notice that it is not compatible with Luci, so if you’re using it – them kids won’t play well together.

#!/bin/bash
 
LC_ALL=C
LANG=C
PATH=/bin:/sbin:/usr/bin:/usr/sbin
export LC_ALL LANG PATH
# Private return codes
FAIL=2
NO=1
YES=0
YES_STR="yes"
 
. $(dirname $0)/ocf-shellfuncs
 
meta_data()
{
    cat <<EOT
 
    1.0
 
	This script will import and export ZFS storage pools
	It will make sure to mount and umount all child filesystems
 
        This is a ZFS pool
 
                Symbolic name for this zfs pool
 
                File System Name
 
		ZFS Pool name or ID
 
                ZFS pool name
 
		ZFS Pool alternate mount
 
                ZFS pool alternate mount
 
                If set, the cluster will kill all processes using 
                this file system when the resource group is 
                stopped.  Otherwise, the unmount will fail, and
                the resource group will be restarted.
 
                Force Unmount
 
                If set and unmounting the file system fails, the node will
                immediately reboot.  Generally, this is used in conjunction
                with force-unmount support, but it is not required.
 
                Seppuku Unmount
 
	<!-- Note: active monitoring is constant and supplants all              check depths -->
        <!-- Checks to see if we can read from the mountpoint -->
 
        <!-- Checks to see if we can write to the mountpoint (if !ROFS) -->
 
EOT
}
 
ocf_log()
{
        echo $*
}
 
verify_driver() {
	ocf_log info "Verifying ZFS driver"
	lsmod | grep -w zfs > /dev/null 2>&1 && return 0
	ocf_log err "ZFS driver is not loaded"
	return $OCF_ERR_ARGS
}
 
verify_poolname() {
	ocf_log info "Verify pool name "
	if [ -z "$OCF_RESKEY_pool" ]
	then
		ocf_log err "Missing pool name"
		return $OCF_ERR_ARGS
	fi
	zpool import | grep pool: | grep -w "$OCF_RESKEY_pool" > /dev/null 2>&1 && return 0
	ocf_log err "Cannot identify pool name"
	return $OCF_ERR_ARGS
}
 
verify_mounted_poolname() {
	ocf_log info "Verify pool name "
	if [ -z "$OCF_RESKEY_pool" ]
	then
		ocf_log err "Missing pool name"
		return $OCF_ERR_ARGS
	fi
	zpool list $OCF_RESKEY_pool > /dev/null 2>&1 && return 0
	ocf_log err "Cannot identify pool name"
	return $OCF_ERR_ARGS
}
 
verify_mountpath() {
	ocf_log info "Verifying alternate root mount path"
	[ -z "$OCF_RESKEY_mount" ] && return 0
	declare mp="${OCF_RESKEY_mount}"
	case "$mp" in
		/*)    	# found it
                	;;
        	*)      # invalid format
			ocf_log err \
"verify_mountpath: Invalid mount point format (must begin with a '/'): \'$mp\'"
                return $OCF_ERR_ARGS
                ;;
        esac
}
 
pool_import() {
	ocf_log info "Importing pool"
	OPTS=""
	[ -n "$OCF_RESKEY_mount" ] && OPTS="-R $OCF_RESKEY_mount"
	zpool import $OCF_RESKEY_pool $OPTS
	RET="$?"
	if [ "$RET" -ne "0" ]
	then
		ocf_log info "Cannot import without applying force"
		zpool import -f $OCF_RESKEY_pool $OPTS
		RET="$?"
	fi
	if [ "$RET" -ne "0" ]
	then
		ocf_log err "Pool import failed for $OCF_RESKEY_pool. error=$RET"
		return 1
	fi
	ocf_log info "Imported ZFS pool"
	return $RET
}
 
check_and_release_fs() {
	ocf_log info "Checking and releasing FS"
	FS=""
	case ${OCF_RESKEY_force_unmount} in
        $YES_STR|on|true|1)	force_umount=$YES ;;
        *)		        force_umount="" ;;
        esac
 
	RET=0
	for i in `zfs list -t filesystem | grep ^${OCF_RESKEY_pool} | awk '{print $NF}'`
	do
		# To be on the safe side. Why not?
		sleep 1
		# Is it mounted?
		if ! df -l | grep -w "$i" > /dev/null 2>&1
		then
			ocf_log info "Filesystem $i is not mounted"
			continue
		fi 	
		if [ `lsof $i | wc -l` -gt "0" ]
		then
			ocf_log info "Filesystem $i is in use"
			if [ "$force_umount" ]
			then
				ocf_log info "Attempting to kill processes on $i filesystem"
				fuser -k $i
				sleep 2
				if [ `lsof $i | wc -l` -gt "0" ]
				then
					ocf_log err "Cannot umount filesystem $i - filesystem in use"
					return 1
				fi
			else
				ocf_log err "Cannot umount filesystem $i - filesystem in use"
                                return 1
			fi
		fi
	done
	return $RET	
}
 
self_fence() {
	ocf_log info "Should we validate and call self-fence?"
	case ${OCF_RESKEY_self_fence} in
		$YES_STR|on|true|1)       self_fence=$YES ;;
       		*)              self_fence="" ;;
        esac	
 
	if [ "$self_fence" ]; then
		ocf_log alert "umount failed - REBOOTING"
               	sync
                reboot -fn
	fi
	return $OCF_ERR_GENERIC
}
 
pool_export() {
	ocf_log info "Exporting zfs pool"
	check_and_release_fs || self_fence
	zpool export $OCF_RESKEY_pool
	RET="$?"
	if [ "$RET" -ne "0" ]
	then
		ocf_log err "Pool export failed for $OCF_RESKEY_pool. error=$RET"
		return 1
	fi
	return $RET
}
 
start() {
	ocf_log info "Starting ZFS"
	verify_driver || return $OCF_ERR_ARGS 
	verify_poolname || return $OCF_ERR_ARGS
	verify_mountpath || return $OCF_ERR_ARGS
	pool_import
	# Handle filesystem?
}
 
stop() {
	ocf_log info "Stopping ZFS"
	verify_driver || return $OCF_ERR_ARGS 
	verify_mounted_poolname || return $OCF_ERR_ARGS
	verify_mountpath || return $OCF_ERR_ARGS
	# Handle filesystem?
	pool_export
}
 
is_imported() {
	ocf_log debug "Checking if $OCF_RESKEY_pool is imported"
	zpool list ${OCF_RESKEY_pool} > /dev/null 2>&1
	return $?
}
 
is_alive() {
	ocf_log debug "Checking ZFS pool read/write"
	declare file=".writable_test.$(hostname)"
	declare TIMEOUT="10s"
	[ -z "$OCF_CHECK_LEVEL" ] && export OCF_CHECK_LEVEL=0
	mount_point=`zfs list ${OCF_RESKEY_pool} | grep ${OCF_RESKEY_pool} | awk '{print $NF}'`
	test -d "$mount_point"
        if [ $? -ne 0 ]; then
                ocf_log err "${OCF_RESOURCE_INSTANCE}: is_alive: $mount_point is not a directory"
                return $FAIL
        fi
	[ $OCF_CHECK_LEVEL -lt 10 ] && return $YES
 
        # depth 10 test (read test)
        timeout -s 9 $TIMEOUT ls "$mount_point" > /dev/null 2> /dev/null
        errcode=$?
        if [ $errcode -ne 0 ]; then
                ocf_log err "${OCF_RESOURCE_INSTANCE}: is_alive: failed read test on [$mount_point]. Return code: $errcode"
                return $NO
        fi
 
	[ $OCF_CHECK_LEVEL -lt 20 ] && return $YES
 
        # depth 20 check (write test)
        rw=$YES
        for o in `echo $OCF_RESKEY_options | sed -e s/,/\ /g`; do
                if [ "$o" = "ro" ]; then
                        rw=$NO
                fi
        done
	if [ $rw -eq $YES ]; then
                file="$mount_point"/$file
                while true; do
                        if [ -e "$file" ]; then
                                file=${file}_tmp
                                continue
                        else
                                break
                        fi
                done
                timeout -s 9 $TIMEOUT touch "$file" > /dev/null 2> /dev/null
                errcode=$?
                if [ $errcode -ne 0 ]; then
                        ocf_log err "${OCF_RESOURCE_INSTANCE}: is_alive: failed write test on [$mount_point]. Return code: $errcode"
                        return $NO
                fi
                rm -f "$file" > /dev/null 2> /dev/null
        fi
 
	return $YES
}
 
monitor() {
	ocf_log debug "Checking ZFS pool $OCF_RESKEY_pool, Level $OCF_CHECK_LEVEL"
	verify_driver || return $OCF_ERR_ARGS 
	is_imported
	RET=$?
	if [ "$RET" -ne $YES ]; then
                ocf_log err "${OCF_RESOURCE_INSTANCE}: ${OCF_RESKEY_pool} is not imported"
                return $OCF_NOT_RUNNING
        fi
	# Report the outcome of the read/write checks, not of is_imported
	is_alive || return $OCF_ERR_GENERIC
	return 0
}
 
if [ -z "$OCF_CHECK_LEVEL" ]; then
	OCF_CHECK_LEVEL=0
fi
 
case $1 in
start)
	ocf_log info "zfs start $OCF_RESKEY_pool"
	# Only the import state matters here, so the is_alive checks are skipped
	if is_imported; then
		ocf_log info "$OCF_RESKEY_pool is already imported"
		exit 0
	fi
	start
	exit $?
	;;
stop)
	ocf_log info "zfs stop $OCF_RESKEY_pool"
	# Export whenever the pool is imported, even if it is not healthy
	if is_imported; then
		stop
		exit $?
	fi
	ocf_log info "$OCF_RESKEY_pool is not imported"
	exit 0
	;;
status|monitor)
	ocf_log debug "ZFS monitor $OCF_RESKEY_pool"
	monitor
	exit $?
	;;
meta-data)
	echo -e "zfs metadata $OCF_RESKEY_address\n" >>/tmp/out
	meta_data
	exit 0
	;;
validate-all)
	exit 0
	;;
*)
	echo "usage: $0 {start|stop|status|monitor|restart|meta-data|validate-all}"
	exit $OCF_ERR_UNIMPLEMENTED
	;;
esac
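
Before handing the agent to the cluster, it helps to call it by hand on one node. rgmanager passes resource attributes to agents as OCF_RESKEY_* environment variables, which is what the script above reads. The wrapper below is a sketch with a hypothetical name (run_agent), and it assumes the agent sits in /usr/share/cluster next to ocf-shellfuncs.

```shell
# Hypothetical wrapper: call a resource agent the way the cluster would,
# passing the resource attributes as OCF_RESKEY_* environment variables.
# Usage: run_agent AGENT_PATH ACTION POOL [ALTERNATE_ROOT]
run_agent() {
    local agent="$1" action="$2" pool="$3" mount="${4:-}"
    OCF_RESKEY_pool="$pool" \
    OCF_RESKEY_mount="$mount" \
    OCF_CHECK_LEVEL=0 \
    "$agent" "$action"
}

# Example on a node where the pool is expected to be imported:
#   run_agent /usr/share/cluster/zfs.sh monitor share
#   echo "exit code: $?"
```

An exit code of 0 from monitor means the pool is imported. Anything else is treated by the cluster as a resource that is not running or has failed.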

All I had to do now was to build the cluster.conf file.

The reason I placed the IP address as the last to start and the first to stop was that the other way around, the NFS client would receive an ordered disconnection command, and would not bother to establish a connection with the remaining server. Abruptly taking away the clustered IP address causes the NFS clients to initiate a reconnection process, from which the systems are supposed to recover.

I have left this article incomplete for a while now. It has some stuff I do like to share, so I am sharing it as-is. I will (some day) complete it.

Extracting/Recreating RHEL/Centos6 initrd.img and install.img

Tuesday, October 1st, 2013

A quick note about extracting and recreating RHEL6 or Centos6 (and their derivations) installation media components:

Initrd:

Extract:

mv initrd.img /tmp/initrd.img.xz
cd /tmp
xz --format=lzma initrd.img.xz --decompress
mkdir initrd
cd initrd
cpio -ivdum < ../initrd.img

Archive (after you applied your changes):

cd /tmp/initrd
find . | cpio -o -H newc | xz -9 --format=lzma > ../new-initrd.img

/images/install.img:

Extract:

mount -o loop install.img /mnt
mkdir /tmp/install.img.dir
cd /mnt ; tar cf - --one-file-system . | ( cd /tmp/install.img.dir ; tar xf - )
umount /mnt

Archive (after you applied your changes):

cd /tmp
mksquashfs install.img.dir/ install-new.img

Additional note for Anaconda installation parameters:

I did not test it, however there is a boot flag called stage2= which should lead to a new install.img file, other than the hardcoded one. I don’t know if it will accept /images/install-new.img as its value, but it can be a good start.

One more thing:

Make sure that the names of any custom vmlinuz and initrd files in $CDROOT/isolinux do not exceed the 8.3 format. Longer names didn’t work for me. I assume (without any further checks) that this is an isolinux limitation.
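
A quick way to catch names that break the 8.3 rule before burning the ISO is to test them in a loop. This is a minimal sketch: the function name is_83_name is hypothetical, and it checks only the length pattern (up to 8 characters, then optionally a dot and up to 3 characters), not the exact character set isolinux accepts.

```shell
# Hypothetical check: succeed if a file name fits the 8.3 pattern
# (1-8 characters, optionally followed by a dot and 1-3 more characters).
is_83_name() {
    printf '%s\n' "$1" | grep -Eq '^[^./]{1,8}(\.[^./]{1,3})?$'
}

# Example: report offending files in the isolinux directory
#   for f in "$CDROOT"/isolinux/*; do
#       is_83_name "${f##*/}" || echo "Name too long: $f"
#   done
```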

XenServer – increase LVM over iSCSI LUN size – online

Wednesday, September 4th, 2013

The following procedure was tested by me, and was found to be working. The version of the XenServer I am using in this particular case is 6.1, however, I believe that this method is generic enough so that it could work for every version of XS, assuming you're using iSCSI and LVM (aka - not NetApp, CSLG, NFS and the likes). It might act as a general guideline for fiber channel communication, but this was not tested by me, and thus - I have no idea how it will work. It should work with some modifications when using Multipath, however, regarding multipath, you can find in this particular blog some notes on increasing multipath disks. Check the comments too - they might offer some better and simplified way of doing it.

So - let's begin.

First - increase the size of the LUN through the storage. For NetApp, it involves something like:

lun resize /vol/XenServer/luns/SR1.lun +1t

You should always make sure your storage volume, aggregate, raid group, pool or whatever is capable of holding the data, or - if using thin provisioning - that a well tested monitoring system is available to alert you when running low on storage disk space.

Now, we should identify the LUN. From now on - every action should be performed on all XS pool nodes, one after the other.

cat /proc/partitions

We should keep the output of this command somewhere. We will use it later on to identify the expanded LUN.

Now - let's scan for storage changes:

iscsiadm -m node -R

Now, running the previous command again will have a slightly different output. We can now identify the modified LUN:

cat /proc/partitions

We should increase it in size. XenServer uses LVM, so we should harness it to our needs. Let's assume that the modified disk is /dev/sdd.

pvresize /dev/sdd

After completing this task on all pool hosts, we should run sr-scan command. Either by CLI, or through the GUI. When the scan operation completes, the new size would show.
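
Comparing the two /proc/partitions outputs by eye gets tedious with many LUNs. The comparison can be scripted, as in the sketch below. The function name changed_devices and the file names in the example are assumptions, not part of the tested procedure.

```shell
# Hypothetical helper: print the devices whose size differs between two
# saved copies of /proc/partitions.
# Usage: changed_devices BEFORE_FILE AFTER_FILE
changed_devices() {
    awk '
        # Data lines look like: major minor #blocks name
        $1 !~ /^[0-9]+$/ { next }
        FNR == NR { before[$4] = $3; next }
        ($4 in before) && before[$4] != $3 { print $4 }
    ' "$1" "$2"
}

# Example, wrapped around the rescan step:
#   cat /proc/partitions > /tmp/partitions.before
#   iscsiadm -m node -R
#   cat /proc/partitions > /tmp/partitions.after
#   changed_devices /tmp/partitions.before /tmp/partitions.after
```

Each name it prints is a candidate for pvresize, after checking that it really is the LUN that was extended.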

Hope it helps!

Juniper NetworkConnect (NC) and 64bit Linux

Tuesday, June 25th, 2013

Due to a major disk crash, I have had to use my ‘other’ computer for VPN connections. It meant that I have had to prepare it for the operation. I attempted to login to a Juniper-based SSL-VPN connection, however, I got a message saying that my 64bit Java was inadequate. The error message included a link to a Juniper KB article, to which I must link here (remembering how I have had to search for possible solutions in the past).

The nice thing about this solution is that it does not replace your default Java version on the system, which was always a problem, as I was using Java for various purposes, but it recognizes that it’s part of the (update-)alternatives list, and makes use of the correct Java version.

Juniper did it right this time!

Oh – and the link to their KB

And to Oracle Java versions, to make life slightly easier for you. You will need Oracle login, however (you can register for free).

Target-based persistent device naming

Saturday, June 22nd, 2013

When connecting Linux to a large array of SAS disks (JBOD), udev creates default persistent names in /dev/disk/by-* . These names are based on LUN ID (all disks take lun0 by default), and by path, which includes, for a pure SAS bus – the PWWN of the disks. An example of such naming looks like this (slightly trimmed for ease of view):

/dev/disk/by-id:
scsi-35000c50055924207 -> ../../sde
scsi-35000c50055c5138b -> ../../sdd
scsi-35000c50055c562eb -> ../../sda
scsi-35000c500562ffd73 -> ../../sdc
scsi-35001173100134654 -> ../../sdn
scsi-3500117310013465c -> ../../sdk
scsi-35001173100134688 -> ../../sdj
scsi-35001173100134718 -> ../../sdo
scsi-3500117310013490c -> ../../sdg
scsi-35001173100134914 -> ../../sdh
scsi-35001173100134a58 -> ../../sdp
scsi-3500117310013671c -> ../../sdm
scsi-35001173100136740 -> ../../sdl
scsi-350011731001367ac -> ../../sdi
scsi-350011731001cdd58 -> ../../sdf
wwn-0x5000c50055924207 -> ../../sde
wwn-0x5000c50055c5138b -> ../../sdd
wwn-0x5000c50055c562eb -> ../../sda
wwn-0x5000c500562ffd73 -> ../../sdc
wwn-0x5001173100134654 -> ../../sdn
wwn-0x500117310013465c -> ../../sdk
wwn-0x5001173100134688 -> ../../sdj
wwn-0x5001173100134718 -> ../../sdo
wwn-0x500117310013490c -> ../../sdg
wwn-0x5001173100134914 -> ../../sdh
wwn-0x5001173100134a58 -> ../../sdp
wwn-0x500117310013671c -> ../../sdm
wwn-0x5001173100136740 -> ../../sdl
wwn-0x50011731001367ac -> ../../sdi
wwn-0x50011731001cdd58 -> ../../sdf

/dev/disk/by-path:
pci-0000:03:00.0-sas-0x5000c50055924206-lun-0 -> ../../sde
pci-0000:03:00.0-sas-0x5000c50055c5138a-lun-0 -> ../../sdd
pci-0000:03:00.0-sas-0x5000c50055c562ea-lun-0 -> ../../sda
pci-0000:03:00.0-sas-0x5000c500562ffd72-lun-0 -> ../../sdc
pci-0000:03:00.0-sas-0x5001173100134656-lun-0 -> ../../sdn
pci-0000:03:00.0-sas-0x500117310013465e-lun-0 -> ../../sdk
pci-0000:03:00.0-sas-0x500117310013468a-lun-0 -> ../../sdj
pci-0000:03:00.0-sas-0x500117310013471a-lun-0 -> ../../sdo
pci-0000:03:00.0-sas-0x500117310013490e-lun-0 -> ../../sdg
pci-0000:03:00.0-sas-0x5001173100134916-lun-0 -> ../../sdh
pci-0000:03:00.0-sas-0x5001173100134a5a-lun-0 -> ../../sdp
pci-0000:03:00.0-sas-0x500117310013671e-lun-0 -> ../../sdm
pci-0000:03:00.0-sas-0x5001173100136742-lun-0 -> ../../sdl
pci-0000:03:00.0-sas-0x50011731001367ae-lun-0 -> ../../sdi
pci-0000:03:00.0-sas-0x50011731001cdd5a-lun-0 -> ../../sdf

Real port (connection) persistence is not possible in that manner. A map of PWWN-to-Slot is required, and handling the system in case of a disk failure by a non-expert is nearly impossible. A solution for that is to create matching udev rules which will allow handling disks per-port.

While there are (absolutely) better ways of doing it, time constraints require that I get it to work quick&dirty. The solution is based on the lsscsi command as the backend engine of the system, so make sure it exists on the system. I tend to believe that the system will not be able to scale out to hundreds of disks in its current design, but for my 16 disks (and probably for several tens as well) – it works fine.

Add 60-persistent-disk-ports.rules to /etc/udev/rules.d/ (and omit the .txt suffix)

 

# By Ez-Aton, based partially on the built-in udev block device rule
# forward scsi device event to corresponding block device
ACTION=="change", SUBSYSTEM=="scsi", ENV{DEVTYPE}=="scsi_device", TEST=="block", ATTR{block/*/uevent}="change"

ACTION!="add|change", GOTO="persistent_storage_end"
SUBSYSTEM!="block", GOTO="persistent_storage_end"

# skip rules for inappropriate block devices
KERNEL=="fd*|mtd*|nbd*|gnbd*|btibm*|dm-*|md*", GOTO="persistent_storage_end"

# never access non-cdrom removable ide devices, the drivers are causing event loops on open()
KERNEL=="hd*[!0-9]", ATTR{removable}=="1", SUBSYSTEMS=="ide", ATTRS{media}=="disk|floppy", GOTO="persistent_storage_end"
KERNEL=="hd*[0-9]", ATTRS{removable}=="1", GOTO="persistent_storage_end"

# ignore partitions that span the entire disk
TEST=="whole_disk", GOTO="persistent_storage_end"

# for partitions import parent information
ENV{DEVTYPE}=="partition", IMPORT{parent}="ID_*"

# Deal only with SAS disks
KERNEL=="sd*[!0-9]|sr*", ENV{ID_SERIAL}!="?*", IMPORT{program}="/usr/local/sbin/detect_disk.sh $tempnode", ENV{ID_BUS}="scsi"
KERNEL=="sd*|sr*|cciss*", ENV{DEVTYPE}=="disk", ENV{TGT_PATH}=="?*", SYMLINK+="disk/by-target/disk-$env{TGT_PATH}"
#KERNEL=="sd*|cciss*", ENV{DEVTYPE}=="partition", ENV{ID_SERIAL}!="?*", IMPORT{program}="/usr/local/sbin/detect_disk.sh $tempnode"
KERNEL=="sd*|cciss*", ENV{DEVTYPE}=="partition", ENV{ID_SERIAL}=="?*", IMPORT{program}="/usr/local/sbin/detect_disk.sh $tempnode", SYMLINK+="disk/by-target/disk-$env{TGT_PATH}p%n"

ENV{DEVTYPE}=="disk", KERNEL!="xvd*|sd*|sr*", ATTR{removable}=="1", GOTO="persistent_storage_end"
LABEL="persistent_storage_end"

 
You will need to add (and make executable) the script detect_disk.sh in /usr/local/sbin. Again – remove the .txt suffix
 

#!/bin/bash
# Written by Ez-Aton to assist with disk-to-port mapping
# $1 - disk device name
name=$1
name=${name##*/}
# Full disk
TGT_PATH=`/usr/bin/lsscsi | grep -w "/dev/$name" | awk '{print $1}' | tr -d '[]'`
if [ -z "$TGT_PATH" ]
then
	# This is a partition, so our grep fails
	name=`echo "$name" | tr -d '0-9'`
	TGT_PATH=`/usr/bin/lsscsi | grep -w "/dev/$name" | awk '{print $1}' | tr -d '[]'`
fi
echo "TGT_PATH=$TGT_PATH"

 
The result of this addition to udev would be a directory called /dev/disk/by-target containing links as follow:

/dev/disk/by-target:
disk-0:0:0:0 -> ../../sda
disk-0:0:1:0 -> ../../sdb
disk-0:0:10:0 -> ../../sdk
disk-0:0:11:0 -> ../../sdl
disk-0:0:12:0 -> ../../sdm
disk-0:0:13:0 -> ../../sdn
disk-0:0:14:0 -> ../../sdo
disk-0:0:15:0 -> ../../sdp
disk-0:0:2:0 -> ../../sdc
disk-0:0:3:0 -> ../../sdd
disk-0:0:4:0 -> ../../sde
disk-0:0:5:0 -> ../../sdf
disk-0:0:6:0 -> ../../sdg
disk-0:0:7:0 -> ../../sdh
disk-0:0:8:0 -> ../../sdi
disk-0:0:9:0 -> ../../sdj

The result is a persistent naming, based on real device ports.
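
One possible way around running lsscsi for every udev event is to read the same host:channel:target:lun address from sysfs, where the device link of a SCSI disk points at a directory named after that address. The sketch below has not been tested on the system described here. The function name tgt_path_from_sysfs is hypothetical, and the partition handling assumes sdX-style names.

```shell
# Hypothetical alternative to parsing lsscsi output: resolve the
# host:channel:target:lun address of a disk from sysfs.
# Usage: tgt_path_from_sysfs DEVICE [SYSFS_ROOT]
tgt_path_from_sysfs() {
    local name="${1##*/}" sysroot="${2:-/sys}" disk target
    # Strip a trailing partition number (assumes sdX-style names)
    disk=$(printf '%s\n' "$name" | sed 's/[0-9]*$//')
    [ -e "$sysroot/block/$disk/device" ] || return 1
    # The link resolves to a directory named H:C:T:L
    target=$(readlink -f "$sysroot/block/$disk/device") || return 1
    echo "TGT_PATH=${target##*/}"
}

# Example:
#   tgt_path_from_sysfs /dev/sdd
```

The output format matches what detect_disk.sh prints, so it could be tried as a drop-in body for that script.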
 
I hope it helps. If you get to read it and have some suggestions (or a better use of udev, which I know is far from perfect in this case), I would love to hear about it.

RedHat cluster on RHEL6 and KVM-based VMs

Wednesday, August 1st, 2012

The concept of running a virtual machine, KVM-based, in this case, under RHCS is acceptable and reasonable. The interesting part is that the <vm/> directive replaces the <service/> directive and acts as a high-level directive for VMs. This allows for things which cannot be performed with regular 'service', such as live migration. There are probably more, but this is not the current issue.

An example of how it can be done can be shown in this excellent explanation. You can grab whatever parts of it relevant to you, as there is an excellent combination of DRBD, CLVM, GFS and of course, KVM-based VMs.

This whole guide assumes that the VMs reside on a shared storage, which is concurrently accessible by both (all?) hosts. This is not always the case, for example when the shared filesystem is ext3/4 and not GFS, and the virtual disk image file is located on it. In this particular case, you would want to connect the VM to the mount. This cannot be performed, however, when using <vm/> as a top-level directive (like <service/>), as it does not allow for child resources.

As the <vm/> directive can be defined (with some limitations) as a child resource in a <service/> group, it inherits some properties from its parent (the <service/> directive), while some other properties are not mandatory and will be ignored. A sample configuration would be this:

<resources>
     <fs device="/dev/mapper/mpathap1" force_fsck="1" force_unmount="1" fstype="ext4" mountpoint="/images" name="vmfs" self_fence="0"/>
</resources>
<service autostart="1" domain="vm1_domain" max_restarts="2" name="vm1" recovery="restart">
     <fs ref="vmfs"/>
     <vm migrate="pause" name="vm1" restart_expire_time="600" use_virsh="1" xmlfile="/images/vm1.xml"/>
</service>

This would do the trick. However, the VM will not be able to live migrate, but will have to shutdown/startup for each cluster takeover.

Attach USB disks to XenServer VM Guest

Saturday, May 5th, 2012

There is a very nice script for Windows dealing with attaching XenServer USB disk to a guest. It can be found here.

This script has several problems, as I see it. The first – this is a Windows batch script, which is a very limited language, and it can handle only a single VDI disk in the SR group called “Removable Storage”.

As I am a *nix guy, and can hardly handle Windows batch scripts, I have rewritten this script to run from Linux CLI (focused on running from the XenServer Domain0), and allowed it to handle multiple USB disks. My assumption is that running this script will map/unmap *all* local USB disks to the VM.

Following downloading this script, you should make sure it is executable, and run it with the arguments “attach” or “detach”, per your needs.

And here it is:

#!/bin/bash
# This script will map USB devices to a specific VM
# Written by Ez-Aton, http://run.tournament.org.il , with the concepts
# taken from http://jamesscanlonitkb.wordpress.com/2012/03/11/xenserver-mount-usb-from-host/
# and http://support.citrix.com/article/CTX118198
 
# Variables
# Need to change them to match your own!
REMOVABLE_SR_UUID=d03f247d-6fc6-a396-e62b-a4e702aabcf0
VM_UUID=b69e9788-8cd2-0074-5bc1-63cf7870fa0d
DEVICE_NAMES="hdc hde" # Local disk mapping for the VM
XE=/opt/xensource/bin/xe
 
function attach() {
        # Here we attach the disks
        # Check if storage is attached to VBD
        VBDS=`$XE vdi-list sr-uuid=${REMOVABLE_SR_UUID} params=vbd-uuids --minimal | tr , ' '`
        if [ `echo $VBDS | wc -w` -ne 0 ]
        then
                echo "Disks are already attached. Check VBD $VBDS for details"
                exit 1
        fi
        # Get devices!
        VDIS=`$XE vdi-list sr-uuid=${REMOVABLE_SR_UUID} --minimal | tr , ' '`
        INDEX=0
        DEVICE_NAMES=( $DEVICE_NAMES )
        for i in $VDIS
        do
                VBD=`$XE vbd-create vm-uuid=${VM_UUID} device=${DEVICE_NAMES[$INDEX]} vdi-uuid=${i}`
                if [ $? -ne 0 ]
                then
                        echo "Failed to connect $i to ${DEVICE_NAMES[$INDEX]}"
                        exit 2
                fi
                $XE vbd-plug uuid=$VBD
                if [ $? -ne 0 ]
                then
                        echo "Failed to plug $VBD"
                        exit 3
                fi
                let INDEX++
        done
}
 
function detach() {
        # Here we detach the disks
        VBDS=`$XE vdi-list sr-uuid=${REMOVABLE_SR_UUID} params=vbd-uuids --minimal | tr , ' '`
        for i in $VBDS
        do
                $XE vbd-unplug uuid=${i}
                $XE vbd-destroy uuid=${i}
        done
        echo "Storage Detached from VM"
}
case "$1" in
        attach) attach
                ;;
        detach) detach
                ;;
        *)      echo "Usage: $0 [attach|detach]"
                exit 1
esac
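
The attach function takes one name from DEVICE_NAMES per VDI, so it needs at least as many names as there are USB disks. A guard for that could look like the sketch below. The function names are hypothetical, and the check would have to run before DEVICE_NAMES is converted into an array.

```shell
# Hypothetical helper: count the words in a space-separated list
count_words() {
    # Intentionally unquoted: split the list on whitespace
    set -- $1
    echo "$#"
}

# Hypothetical guard: succeed only if there are at least as many device
# names as VDIs. Both arguments are space-separated lists.
# Usage: enough_device_names "$DEVICE_NAMES" "$VDIS"
enough_device_names() {
    [ "$(count_words "$1")" -ge "$(count_words "$2")" ]
}

# Example, inside attach() right after VDIS is set:
#   enough_device_names "$DEVICE_NAMES" "$VDIS" || {
#       echo "Not enough device names for the detected USB disks"
#       exit 4
#   }
```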

 

Cheers!

Bonding + VLAN tagging + Bridge – updated

Wednesday, April 25th, 2012

In the past I hacked around a problem with the order of starting (and with several bugs) a network stack combined of network bonding (teaming) + VLAN tagging, and then with network bridging (aka – Xen bridges). This kind of setup is very useful for introducing VLAN networks to guest VMs. This works well on Xen (community, Server), however, on RHEL/Centos 5 versions, the startup scripts (ifup and ifup-eth) are buggy, and do not handle this operation correctly. It means that, depending on the update release you use, results might vary from “everything works” to “I get bridges without VLANs” to “I get VLANs without bridges”.

I have hacked a solution in the past, modifying /etc/sysconfig/network-scripts/ifup-eth and fixing some bugs in it. However, maintaining the fix on every release of the ‘initscripts’ package has proven, well, not to happen…

So, instead, I present you with a smarter solution, better adapted to updates supplied from time to time by RedHat or Centos, using predefined ‘hooks’ in the ifup scripts.

Create the file /sbin/ifup-pre-local with the following contents:

 

#!/bin/bash
# $1 is the config file
# $2 is not interesting
# We will start the vlan bonding before any bridge
 
DIR=/etc/sysconfig/network-scripts
 
[ -z "$1" ] && exit 0
. $1
 
if [ "${DEVICE%%[0-9]*}" == "xenbr" ]
then
    for device in $(LANG=C egrep -l "^[[:space:]]*BRIDGE=\"?${DEVICE}\"?" /etc/sysconfig/network-scripts/ifcfg-*) ; do
        /sbin/ifup $device
    done
fi

You can download this script. Don’t forget to make it executable. It will call ifup for any parent device of the xenbr* device being brought up. If the parent device is already up, no harm is done. If the parent device is not up, it will be brought up, and then the xenbr device can start normally.
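
To see in advance which interfaces the hook would bring up for a given bridge, the same lookup can be run by hand. This is a sketch with a hypothetical function name (bridge_members). Unlike the hook, its pattern is anchored at the end of the line, so that xenbr1 does not also match xenbr10.

```shell
# Hypothetical helper: list the ifcfg files that declare the given bridge
# in their BRIDGE= line.
# Usage: bridge_members BRIDGE [CONFIG_DIR]
bridge_members() {
    local bridge="$1" dir="${2:-/etc/sysconfig/network-scripts}"
    LANG=C grep -lE "^[[:space:]]*BRIDGE=\"?${bridge}\"?[[:space:]]*\$" \
        "$dir"/ifcfg-* 2>/dev/null
}

# Example:
#   bridge_members xenbr1
```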

Things to remember…

Monday, October 24th, 2011

As my work takes me to various places (where technology is concerned), I collect lots of browser tabs of things I want to keep for later reference.
I have to admit, sadly, that I lack the time to sort them out, to make a real good and nice post about them. I do not want to lose them, however, so I am posting now those which I find or found in the past as more useful to me. I might expand either of them one day into a full post, or elaborate further on them. Either or none. For now – let’s clean up some tab space:
Reading IPMI sensors. Into Cacti, and into Nagios, with some minor modifications by myself (to be disclosed later, I believe):
Cacti
Nagios
This is somewhat info of the plugin check_ipmi_sensor
And its wiki (in German. Use Google for translation)
XenServer checks:
check_xen_pool
Checking XenServer using NRPE
But I did not care about Dom0 performance parameters, as they meant very little regarding the hypervisor’s behavior. So I have combined into it the following XenServer License Check. Unfortunately, I could run it only on the XenServer domain0, due to python version limitations on my Cacti /Nagios server.
You can obtain XenServer SDK
This plugin looks interesting for various XenServer checks, but I have never tried it myself.
Backing up (exporting) XenServer VMs as a scheduled task. I have had it modified extensively to match my requirements, but I am allowed to, it has some of its sources based on my blog :-)
Installing Dell OpenManage on XenServer 5.6.1, and the nice thing is that it works fine on XenServer 6 as well.
Oracle ASM recovery tips . One day I will take it further, and investigate possible human errors and methods of fixing them. Experience, they say, has a value :-)
A guide dealing with changing from raw to block devices in Oracle ASM . This is only a small part of it, but it’s the thing that interests me.
Understanding Steal Time in Linux Xen-based VMs.
Because I always forget, and I’m too lazy to search again and again (and reach the same page again and again): Upgrading PHP to 5.2 on Centos 5
And last – a very nice remote-control software for my Android phone. Don’t leave home without it. Seriously.

Being down to only 23 tabs is excellent. This was a very nice job, and these links will be useful. To me, for sure. To you as well, I hope.

Hot resize Multipath Disk – Linux

Friday, August 19th, 2011

This post is for the users of the great dm-multipath system in Linux, who encounter a major availability problem when attempting a resize of mpath devices (and their partitions), and find themselves scheduling a reboot.

This document is based on a document created by IBM called "Hot Resize Multipath Storage Volume on Linux with SVC", and its contents are good for any other storage. However - the IBM document does not cover the procedure required when there is a partition on the mpath device (for example - an mpath1p1 device).

I will demonstrate with only two paths, but once you understand the process, it can be applied to a device with any number of paths.

I do not explain how to reduce a LUN size, but the apt reader will be able to work out a method from this document. I, for myself, avoid shrinking LUNs as much as I can. I prefer backup, LUN recreation, and then restore. In many cases - it's just faster.

So - back to our topic - first - increase the size of your LUN on the storage.

Now, you need to collect the paths used for your mpath device. Check this example:

mpath1 (360a980005033644b424a6276516c4251) dm-2 NETAPP,LUN
[size=200G][features=1 queue_if_no_path][hwhandler=0][rw]
\_ round-robin 0 [prio=4][active]
\_ 2:0:0:0 sdc 8:32  [active][ready]
\_ round-robin 0 [prio=1][enabled]
\_ 1:0:0:0 sdb 8:16  [active][ready]
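
If you have more than two paths, picking the sd* names out by eye gets tedious. Here is a minimal sketch of a filter that prints the path device names from 'multipath -ll' style output. The function name list_paths is my own, and it assumes the output layout shown above (an H:C:T:L address followed by the device name):

```shell
#!/bin/sh
# list_paths (hypothetical helper): read 'multipath -ll' style text on stdin
# and print the device name that follows every H:C:T:L address.
list_paths() {
    awk '{
        for (i = 1; i < NF; i++)
            if ($i ~ /^[0-9]+:[0-9]+:[0-9]+:[0-9]+$/)
                print $(i + 1)
    }'
}

# Demonstration on the sample output from this post.
list_paths <<'EOF'
mpath1 (360a980005033644b424a6276516c4251) dm-2 NETAPP,LUN
[size=200G][features=1 queue_if_no_path][hwhandler=0][rw]
\_ round-robin 0 [prio=4][active]
\_ 2:0:0:0 sdc 8:32  [active][ready]
\_ round-robin 0 [prio=1][enabled]
\_ 1:0:0:0 sdb 8:16  [active][ready]
EOF
```

On a live system you would feed it with 'multipath -ll mpath1 | list_paths'.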

The path devices, sdb and sdc here, are the ones we will need to change. Let's get their current size:

blockdev --getsz /dev/sdb
419430400

Keep this number somewhere safe. We can (and should!) assume that sdc has the same values, otherwise, this is not the same exact path.

Collect this info for the partition as well. It will be smaller by a tiny bit:

blockdev --getsz /dev/sdb1
419424892

Keep this number as well.

Now we need to reread the current (storage-based) size parameters of the devices. We will run

blockdev --rereadpt /dev/sdb
blockdev --rereadpt /dev/sdc

Now, the reported size will be different:

blockdev --getsz /dev/sdb
734003200

Of course, the partition size will not change. We will deal with it later. Keep the updated values as well. Of course, the multipath still holds the disks with their original size values, so running 'multipath -ll' will not reveal any size change. Not yet.
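
A quick sanity check on these numbers: 'blockdev --getsz' reports 512-byte sectors, so dividing by 2097152 (2 x 1024 x 1024) gives the size in GiB. A small sketch, with a helper name (sectors_to_gib) of my own choosing:

```shell
#!/bin/sh
# sectors_to_gib (hypothetical helper): convert a count of 512-byte sectors,
# as printed by 'blockdev --getsz', into whole GiB.
sectors_to_gib() {
    echo $(( $1 / 2097152 ))
}

sectors_to_gib 419430400   # the size before the LUN was grown
sectors_to_gib 734003200   # the size after the re-read
```

The first value matches the 200G that multipath reports for the original LUN. The second is the size multipath should show once the whole procedure is done.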

We now need to create an editable dmsetup map. Use the following command to create two files, 'cur' and 'org', containing this map:

dmsetup table mpath1 | tee org cur
0 419424892 multipath 1 queue_if_no_path 0 2 1 round-robin 0 1 1 8:32 128 round-robin 0 1 1 8:16 128

Important part - explaining some of these values. The map shows the device's size in 512-byte sectors - 419424892. It shows some parameters, it shows the path groups info (0 2 1, where the 2 is the number of path groups), and both sub devices - sdc being 8:32 and sdb being 8:16. Try 'ls -la /dev/sdb' to see the major and minor numbers. At this point, if you are not familiar with majors and minors, I would recommend you do some reading about them. Not mandatory, but it will make your life here safer.
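
To make the field layout concrete, here is a rough sketch that walks a multipath table line and prints the size, the number of path groups and the major:minor of every path. The function name parse_mpath_table is mine. It assumes the usual field order of the multipath target (features, hardware handler, group count, then one selector and path list per group), which matches the line above:

```shell
#!/bin/sh
# parse_mpath_table (hypothetical helper): $1 is one 'dmsetup table' line
# for a multipath map. Prints the size, group count and each path.
parse_mpath_table() {
    set -- $1                           # split the line into fields
    size=$2
    shift 3                             # drop start, length and target name
    nfeat=$1;  shift; shift "$nfeat"    # feature count and the features
    nhw=$1;    shift; shift "$nhw"      # hw handler count and its arguments
    ngroups=$1; shift 2                 # group count and initial group
    echo "size=$size groups=$ngroups"
    g=0
    while [ "$g" -lt "$ngroups" ]; do
        shift                           # selector name, e.g. round-robin
        nsel=$1; shift; shift "$nsel"   # selector arguments
        npaths=$1; nargs=$2; shift 2    # paths in group, arguments per path
        p=0
        while [ "$p" -lt "$npaths" ]; do
            echo "path=$1"              # major:minor of the path device
            shift; shift "$nargs"
            p=$((p + 1))
        done
        g=$((g + 1))
    done
}

parse_mpath_table "0 419424892 multipath 1 queue_if_no_path 0 2 1 round-robin 0 1 1 8:32 128 round-robin 0 1 1 8:16 128"
```

It is only meant as a reading aid. Keep editing the 'cur' file by hand, or at least compare it with 'org' before you reload anything.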

We need to delete one of the paths, so we can refresh it. I have decided to remove sdb first:

multipathd -k"del path sdb"

Now, running the multipath command, we will get:

mpath1 (360a980005033644b424a6276516c4251) dm-2 NETAPP,LUN
[size=200G][features=1 queue_if_no_path][hwhandler=0][rw]
\_ round-robin 0 [prio=4][active]
\_ 2:0:0:0 sdc 8:32  [active][ready]

Only one path. Good. We will need to edit the 'cur' file created earlier to reflect the new settings we are to introduce:

0 419424892 multipath 1 queue_if_no_path 0 1 1 round-robin 0 1 1 8:32 128

The only group left is the one containing 'sdc' (8:32), and since one group is gone, the path group count (the first number of the '0 2 1' triplet's last two) was changed from 2 to 1, as there is only a single path group now!

We need to reload multipath with these settings:

dmsetup suspend mpath1; dmsetup reload mpath1 cur; dmsetup resume mpath1

The correct response for this line is 'ok'. We suspend mpath1, reload it and then resume it. It is best to run all three on a single line, as this process freezes IO on the device for a short period of time, and we prefer it to be as short as possible.
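
Since the same suspend, reload and resume sequence is needed three times in this procedure, it can be wrapped in a small function. This is only a sketch under my own assumptions: the names reload_map and DMSETUP are hypothetical, and the function always attempts the resume, even when the reload fails, so the device is not left suspended:

```shell
#!/bin/sh
# DMSETUP can be overridden, e.g. DMSETUP=echo for a dry run.
DMSETUP=${DMSETUP:-dmsetup}

# reload_map (hypothetical helper): $1 is the map name, $2 the table file.
reload_map() {
    "$DMSETUP" suspend "$1" || return 1
    "$DMSETUP" reload "$1" "$2"
    rc=$?
    # Resume in any case: a failed reload leaves the old table live.
    "$DMSETUP" resume "$1" || rc=$?
    return "$rc"
}

# Usage on a real system (as root):
#   reload_map mpath1 cur
```

Running it once with DMSETUP=echo prints the three commands instead of executing them, which is a cheap way to check the map name and file name before freezing IO.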

Now, as /dev/sdb is no longer a part of the multipath managed devices, we can modify it. I usually use 'fdisk' - deleting the old partition, and recreating it in the new size, but you must make sure, if your device requires LUN alignment, that you recreate the partition from the same start point. I will dedicate a post some time to LUN alignment, but not at this particular time. Just a hint - if you're not sure, run fdisk in expert mode and get a printout of your partition table (fdisk /dev/sdb and then x and then p). If your partition starts at 128 or 64, it is aligned. If not (usually for large LUNs - at 63), it is not aligned, and you should either worry about it, but not now, or not care at all.
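
The hint above can be turned into a tiny check. This sketch is my own generalisation, not a rule from any storage vendor: it treats a start sector as aligned when it is a multiple of a boundary, 64 sectors by default, which covers the 64 and 128 cases mentioned above. The name is_aligned is hypothetical:

```shell
#!/bin/sh
# is_aligned (hypothetical helper): $1 is the partition start sector,
# $2 an optional boundary in sectors (default 64, an assumed value).
is_aligned() {
    [ $(( $1 % ${2:-64} )) -eq 0 ]
}

for start in 63 64 128; do
    if is_aligned "$start"; then
        echo "$start aligned"
    else
        echo "$start not aligned"
    fi
done
```

The start sector of an existing partition can be read from /sys/block/sdb/sdb1/start, so 'is_aligned "$(cat /sys/block/sdb/sdb1/start)"' checks the partition used in this post. Note the old start sector before deleting the partition, so you can recreate it at the same place.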

Back to our task.

We need to grab the size of the newly created partition, for later use. Write it down somewhere.

blockdev --getsz /dev/sdb1
733993657

Following the partition recreation, we need to introduce the device to the multipath daemon. We do this by:

multipathd -k"add path sdb"

followed by immediately removing the remaining device:

multipathd -k"del path sdc"

We need to have our 'cur' file updated, so we can release the device to our uses. This time, we update both the size section with the new size, and the new, remaining path. Now, the file looks like this:

0 734003200 multipath 1 queue_if_no_path 0 1 1 round-robin 0 1 1 8:16 128

As mentioned before - the large number (734003200) is the new size of the block device, taken from /dev/sdb. The number of path groups is one (1), and the remaining device is 'sdb', which is 8:16. Save this modified file, and run:
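
If you prefer not to edit the file by hand, the single-path line can be generated. A minimal sketch, with a function name (single_path_table) that I made up. The features, selector and the trailing 128 are copied from the example map in this post, so check them against your own 'org' file before using it:

```shell
#!/bin/sh
# single_path_table (hypothetical helper): $1 is the size in 512-byte
# sectors, $2 the major:minor of the one remaining path device.
# The fixed fields are those of this post's example map only.
single_path_table() {
    printf '0 %s multipath 1 queue_if_no_path 0 1 1 round-robin 0 1 1 %s 128\n' "$1" "$2"
}

single_path_table 734003200 8:16
```

Redirecting the output into 'cur' (single_path_table 734003200 8:16 > cur) gives the same file as the manual edit.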

dmsetup suspend mpath1; dmsetup reload mpath1 cur; dmsetup resume mpath1

Running the command 'multipath -ll' you will get the real size of the device.

mpath1 (360a980005033644b424a6276516c4251) dm-2 NETAPP,LUN
[size=350G][features=1 queue_if_no_path][hwhandler=0][rw]
\_ round-robin 0 [prio=1][active]
\_ 1:0:0:0 sdb 8:16  [active][ready]

We will need to reread the partition layout of /dev/sdc. The quickest way is by running:

partprobe

This should do it. We can now add it back in:

multipathd -k"add path sdc"

and then run

multipath

(which should result in all the available paths, and the correct size).

Our last task is to update the partition size. The partition, normally, is called mpath1p1, so we need to read its parameters. Lets keep it in a file:

dmsetup table mpath1p1 | tee partorg partcur

We should now edit the newly created file 'partcur' with the new size. You should not change anything else. Originally, it looked like this:

0 419424892 linear 253:2 128

and it was modified to look like this:

0 733993657 linear 253:2 128

Notice that the size (the second field) is the one obtained from /dev/sdb1 (!!!) and not /dev/sdb.
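
Since only the second field changes, the edit can also be scripted. A small sketch, assuming the one-line table format shown above. The helper name set_table_size is my own:

```shell
#!/bin/sh
# set_table_size (hypothetical helper): $1 is a one-line dmsetup table,
# $2 the new size in 512-byte sectors. Only the second field is replaced.
set_table_size() {
    echo "$1" | awk -v size="$2" '{ $2 = size; print }'
}

set_table_size "0 419424892 linear 253:2 128" 733993657
```

On a live system the new size would come from 'blockdev --getsz /dev/sdb1', and the result would be written to 'partcur'.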

We need to reload the device. Again - attempt to do it in a single line, or else you will freeze IO for a long time (which might cause things to crash):

dmsetup suspend mpath1p1; dmsetup reload mpath1p1 partcur; dmsetup resume mpath1p1

Do not confuse mpath1 with mpath1p1.

Our device is updated, our paths are online. We are happy. All that is left to do is to resize the file system online. With ext3, this is done like this:

resize2fs /dev/mapper/mpath1p1

The mounted file system will grow online, and all that is left for us is to wait for it to complete, and then go home.

I hope this helps. It helped me.