Connecting EMC/NetApp shelves as JBOD to a Linux machine

Let’s say you have old EMC or NetApp shelves with SAS or SATA disks in them, and you want to connect them via FC to a Linux machine and build a nice ZFS machine/cluster, or whatever else. There are a few things to know and attend to in order for it to work.

The first is the sector size. For NetApp this applies only to non-SATA disks (I don’t know about SSDs, though), and for EMC, as far as I noticed, it can apply to all disks: the sector size is not 512 bytes but 520 – the additional 8 bytes are used for a block checksum. Linux does not handle 520-byte sectors well, and the following error message will appear in the logs:

Unsupported sector size 520.

To solve it, we need to identify the disks using sg3_utils (on CentOS-like systems: yum install sg3_utils) and then reformat them to a block size of 512 bytes. To identify the disks, run:

sg_scan -i
/dev/sg0: scsi0 channel=3 id=0 lun=0
HP P410i 3.66 [rmb=0 cmdq=1 pqual=0 pdev=0xc]
/dev/sg1: scsi0 channel=0 id=0 lun=0
HP LOGICAL VOLUME 3.66 [rmb=0 cmdq=1 pqual=0 pdev=0x0]
/dev/sg2: scsi3 channel=0 id=0 lun=0 [em]
hp DVD A DS8A5LH 1HE3 [rmb=1 cmdq=0 pqual=0 pdev=0x5]
/dev/sg3: scsi1 channel=0 id=0 lun=0
SEAGATE SX3500071FC DA04 [rmb=0 cmdq=1 pqual=0 pdev=0x0]
/dev/sg4: scsi1 channel=0 id=1 lun=0
SEAGATE SX3500071FC DA04 [rmb=0 cmdq=1 pqual=0 pdev=0x0]
/dev/sg5: scsi1 channel=0 id=2 lun=0
SEAGATE SX3500071FC DA04 [rmb=0 cmdq=1 pqual=0 pdev=0x0]
/dev/sg6: scsi1 channel=0 id=3 lun=0
SEAGATE SX3500071FC DA04 [rmb=0 cmdq=1 pqual=0 pdev=0x0]
/dev/sg7: scsi1 channel=0 id=4 lun=0
SEAGATE SX3500071FC DA04 [rmb=0 cmdq=1 pqual=0 pdev=0x0]
/dev/sg8: scsi1 channel=0 id=5 lun=0
SEAGATE SX3500071FC DA04 [rmb=0 cmdq=1 pqual=0 pdev=0x0]
/dev/sg9: scsi1 channel=0 id=6 lun=0
SEAGATE SX3500071FC DA04 [rmb=0 cmdq=1 pqual=0 pdev=0x0]
/dev/sg10: scsi1 channel=0 id=7 lun=0
SEAGATE SX3500071FC DA04 [rmb=0 cmdq=1 pqual=0 pdev=0x0]
/dev/sg11: scsi1 channel=0 id=8 lun=0
FUJITSU MXW3300FE 0906 [rmb=0 cmdq=1 pqual=0 pdev=0x0]
/dev/sg12: scsi1 channel=0 id=9 lun=0
FUJITSU MXW3300FE 0906 [rmb=0 cmdq=1 pqual=0 pdev=0x0]
/dev/sg13: scsi1 channel=0 id=10 lun=0
SEAGATE SX3300007FC D41B [rmb=0 cmdq=1 pqual=0 pdev=0x0]
/dev/sg14: scsi1 channel=0 id=11 lun=0
SEAGATE SX3300007FC D41B [rmb=0 cmdq=1 pqual=0 pdev=0x0]
/dev/sg15: scsi1 channel=0 id=12 lun=0
SEAGATE SX3300007FC D41B [rmb=0 cmdq=1 pqual=0 pdev=0x0]
/dev/sg16: scsi1 channel=0 id=13 lun=0
SEAGATE SX3300007FC D41B [rmb=0 cmdq=1 pqual=0 pdev=0x0]
/dev/sg17: scsi1 channel=0 id=14 lun=0
SEAGATE SX3300007FC D41B [rmb=0 cmdq=1 pqual=0 pdev=0x0]

So, for each sg device (member of our batch of disks) we need to modify the sector size.

There are two ways to do so. The first, suggested by this post here, is by using sg_format in the following manner:

sg_format --format --size=512 /dev/sg2
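To run this over the whole batch of disks, a small loop over the sg devices can help. This is only a sketch: the grep pattern matches the Seagate/Fujitsu FC disks from the scan above and must be adjusted to your own disks, and the operation is destructive – it wipes the disks.

for dev in $(sg_scan -i | grep -B1 'SEAGATE\|FUJITSU' | grep '^/dev/sg' | cut -d: -f1); do
    echo "Reformatting $dev to 512-byte sectors"
    sg_format --format --size=512 "$dev"
done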

Another post suggested using a dedicated program called ‘setblocksize’. I followed this one, and it worked fine. I had to power-cycle the disks before Linux could use them.

I did notice that disk performance was not great. I got about 45MB/s write and about 65-70MB/s read for large sequential operations, using something like:

dd if=/dev/sdf of=/dev/null bs=1M count=10000
dd if=/dev/zero of=/dev/sdf bs=1M oflag=direct count=10000 # WARNING – this writes to the disk. Do not use on disks with data!

Fairly disappointing. Also, using multipath, with the shelf connected to one FC port and looped back to another, I found that with the setting:

path_grouping_policy multibus

I got about 10MB/s less than with the “failover” policy (the default on CentOS 6). Whatever modification I made to multipath.conf, I was unable to exceed this number when using multiple active paths. The results were consistent with both multibus and group_by_serial; however, when a single path was active and the other passive, performance was clearly better. I also modified rr_min_io and rr_min_io_rq, with no effect.
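For reference, the kind of stanza I was experimenting with in /etc/multipath.conf looked roughly like this. It is a sketch, not my exact file – the vendor/product strings are assumptions and should match whatever ‘multipath -ll’ reports for your disks:

devices {
        device {
                vendor                  "SEAGATE"
                product                 "SX3500071FC"
                path_grouping_policy    multibus        # "failover" performed better in my tests
                rr_min_io               100
        }
}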

The low disk performance could suggest I need to flash the disks with different firmware; however, I am not sure I will do so. If anyone reading this has had different results – I would love to hear about it.

XenServer 6.5 PCI-Passthrough

While searching the web for how to perform PCI-Passthrough on XenServer, you mostly get information about previous versions. Since I have just completed setting up PCI-Passthrough on XenServer version 6.5 (with the recent update 8, just to give you some notion of the exact time frame), I am sharing it here.

Hardware: Cisco UCS blades, with fNIC. I wish to pass two FC HBAs through to a VM (it is going to act as a backup server, and I need it to access the FC tape). While all my XenServers in this pool have four (4) FC HBAs, this particular XenServer node has six (6). I intend the first four for SR communication and the remaining two for PCI passthrough.

This is the output of ‘lspci | grep Fibre’:

0b:00.0 Fibre Channel: Cisco Systems Inc VIC FCoE HBA (rev a2)
0c:00.0 Fibre Channel: Cisco Systems Inc VIC FCoE HBA (rev a2)
0d:00.0 Fibre Channel: Cisco Systems Inc VIC FCoE HBA (rev a2)
0e:00.0 Fibre Channel: Cisco Systems Inc VIC FCoE HBA (rev a2)
0f:00.0 Fibre Channel: Cisco Systems Inc VIC FCoE HBA (rev a2)
10:00.0 Fibre Channel: Cisco Systems Inc VIC FCoE HBA (rev a2)

So, I want to pass through 0f:00.0 and 10:00.0. I had to add to /boot/extlinux.conf the following two entries after the word ‘splash’ and before the three dashes:

pciback.hide=(0f:00.0)(10:00.0) xen-pciback.hide=(0f:00.0)(10:00.0)
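For illustration only – the real line is much longer, and the elided parts (“…”) must remain whatever they already are on your system – the resulting append line in /boot/extlinux.conf ends up looking roughly like this:

append /boot/xen.gz ... --- /boot/vmlinuz-... ... splash pciback.hide=(0f:00.0)(10:00.0) xen-pciback.hide=(0f:00.0)(10:00.0) --- /boot/initrd-...img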

Initially, and contrary to the documentation, the parameter pciback.hide alone had no effect. As soon as the VM started, the command ‘multipath -l‘ would hang forever (or until a hard reset of the host).

To apply the settings above, run ‘extlinux -i /boot‘ (for good measure – I don’t think it is required, but I did not read anything saying otherwise) and then reboot.

Now, when the host is back, we need to add the devices to the VM. Make sure that the VM is in ‘off’ state before doing that. Your command would look like this:

xe vm-param-set uuid=<VM UUID> other-config:pci=0/0000:0f:00.0,0/0000:10:00.0

The expression '0/0000' is required. You can search for its exact purpose; however, in most cases your value will look exactly like mine – '0/0000'.

Since my VM is Windows, this is almost the end of it: start the VM, and if it boots correctly, install the Cisco VIC drivers into it, as if it were a physical host. You’re done.

Redhat Cluster and Citrix XenServer

I wanted to write down a guide for RHCS on RHEL/CentOS 6 and XenServer.

If you want to do that, you will encounter two major challenges. I want to save you the search and sum it all up here.

The first difficulty is the shared disk. To set up most common cluster scenarios, you will need shared storage. You could map the VMs to iSCSI LUNs external to the environment; however, if you do not have such infrastructure (either because everything is based on SAS/FC, or because you cannot set up iSCSI storage with a reasonable level of availability), you will want XenServer to allow you to share a VDI between two VMs.

In order to do so, you will need to add a flag to all your pool’s XenServers, and to create the VDI in a specific manner. First – the flag – you need to create a file in /etc/xensource called "allow_multiple_vdi_attach". Do not forget to add it to all your XenServers:

touch /etc/xensource/allow_multiple_vdi_attach

Next, you will need to create your VDI as "raw" type. This is an example; change the SR UUID to the one you use:

xe vdi-create sm-config:type=raw sr-uuid=687a023b-0b20-5e5f-d1ef-3db777ce7ae4 name-label="My Raw LVM VDI" virtual-size=8GiB type=user

You can find Citrix article about it here.
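Once the raw VDI exists, it can be attached to both cluster VMs. A minimal sketch, where the UUIDs and the device position are placeholders you must adjust:

xe vbd-create vm-uuid=<clusternode1 VM UUID> vdi-uuid=<raw VDI UUID> device=1 mode=RW type=Disk
xe vbd-create vm-uuid=<clusternode2 VM UUID> vdi-uuid=<raw VDI UUID> device=1 mode=RW type=Disk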

Following that, you can complete your cluster setup and configuration. I will not go into details here, as this is not the focus of this article. However, when it comes to fencing, you will need a solution. The one I used was a fencing agent written specifically for XenServer using XenAPI, called fence-xenserver. I did not use the fencing agents repository (which this page also points to), because I was unable to compile the required components on CentOS 6 – they just don’t compile well. fence-xenserver, however, is a simple Python script which actually works.

In order to make it work, I did the following (a rough command sketch follows the list):

  • Extracted the archive (version 0.8)
  • Placed fence_cxs* in /usr/sbin, and removed their ‘.py’ suffix
  • Placed as-is in /usr/sbin
  • Verified /usr/sbin/fence_cxs* had execution permissions.
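Roughly, the list above translates to the following commands. This is a sketch – the archive file name is an assumption, and the additional file placed as-is in /usr/sbin is not shown:

tar xzf fence-xenserver-0.8.tar.gz        # archive name is an assumption
cp fence-xenserver-0.8/fence_cxs*.py /usr/sbin/
cd /usr/sbin
for f in fence_cxs*.py ; do mv "$f" "${f%.py}" ; done    # drop the '.py' suffix
chmod +x /usr/sbin/fence_cxs*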

Now, I needed to add it to the cluster configuration. Since the agent cannot handle accessing a non-pool-master host, it had to be defined for each pool member (I cannot tell in advance which of them will hold the pool master role when a failover happens). So, these are the relevant parts of my cluster.conf:

<fencedevice agent="fence_cxs_redhat" login="root" name="xenserver01" passwd="password" session_url="https://xenserver01"/>
<fencedevice agent="fence_cxs_redhat" login="root" name="xenserver02" passwd="password" session_url="https://xenserver02"/>
<fencedevice agent="fence_cxs_redhat" login="root" name="xenserver03" passwd="password" session_url="https://xenserver03"/>
<fencedevice agent="fence_cxs_redhat" login="root" name="xenserver04" passwd="password" session_url="https://xenserver04"/>
<clusternode name="clusternode1" nodeid="1">
  <fence>
    <method name="xenserver01">
      <device name="xenserver01" vm_name="clusternode1"/>
    </method>
    <method name="xenserver02">
      <device name="xenserver02" vm_name="clusternode1"/>
    </method>
    <method name="xenserver03">
      <device name="xenserver03" vm_name="clusternode1"/>
    </method>
    <method name="xenserver04">
      <device name="xenserver04" vm_name="clusternode1"/>
    </method>
  </fence>
</clusternode>
<clusternode name="clusternode2" nodeid="2">
  <fence>
    <method name="xenserver01">
      <device name="xenserver01" vm_name="clusternode2"/>
    </method>
    <method name="xenserver02">
      <device name="xenserver02" vm_name="clusternode2"/>
    </method>
    <method name="xenserver03">
      <device name="xenserver03" vm_name="clusternode2"/>
    </method>
    <method name="xenserver04">
      <device name="xenserver04" vm_name="clusternode2"/>
    </method>
  </fence>
</clusternode>

Attached xenserver-fencing-cluster.xml for clarity (WordPress makes a mess out of that)

Note that I used four (4) entries, since my pool has four hosts. Also note the VM name (it is case sensitive), and the methods – one for each host, since you don’t want them running in parallel, but one at a time. Failover time was between 5 and 15 seconds in my tests, depending on which host is actually the pool master (xenserver04 takes the longest, obviously). I did not test it with the pool master down (before or without HA kicking in), nor with hosts down, where the TCP timeout is longer (than when attempting to connect to a host which responds immediately that it is not the pool master). However, if iLO fencing takes about 30-60 seconds, I am not complaining about the current timeouts.

Timeout when using Ricci as the backend for Corosync update in Redhat Cluster

When using Ricci as the engine for the ‘cman_tool version -r’ command, you will experience timeouts (and, practically, you will be unable to use ricci to update the cluster configuration across the nodes) when the ricci user’s password contains XML-sensitive characters such as <, > or &.
As they say – FYI :-)

NetApp LUN Serial and SCSI Word 83

I was wondering for a long while about the connection between NetApp’s LUN Serial and the identifier the host sees, aka "Word 83". There was an obvious connection, but I figured it out only today.

The LUN Serial is an ASCII representation of the hexadecimal Word 83, or, to be exact, the last 22 hex characters of it.

lun serial /vol/volume/qtree/lun
Serial#:  7S1PW?Bym7B

When querying the corresponding multipath device on the host, we get:

360a9800037533150573f4279316d3742 dm-7 NETAPP,LUN
\_ round-robin 0 [prio=4][active]
 \_ 1:0:0:30 sdm  8:192  [active][ready]
 \_ 2:0:0:30 sdz  65:144 [active][ready]

Using a simple hex-to-text web calculator (example: Use this), we can see that 7S1PW?Bym7B translates to 37533150573f42796d3742, which represents the last 22 characters of the reported Word 83. I assume that the leftmost nine hex characters identify the storage device itself. So, easy to identify.
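Instead of a web calculator, a shell one-liner can do the same conversion (assuming xxd is available, as it usually is on Linux with vim installed):

echo 37533150573f42796d3742 | xxd -r -p ; echo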

An additional nice trick is to ask the NetApp to represent the LUN Serial in hex:

lun serial -x /vol/volume/qtree/lun
Serial (hex)#: 0x37533150573f4279316d3742

which represents the same Word 83 we’ve seen before. However, NetApp will not allow you to set (under priv mode) the LUN Serial directly to a hex value. Hence the importance of the hex-to-text conversion tools.

Clone corrupted disk in XenServer

Following some unknown problems, I recently had several XenServer machines (different clusters, different sites and customers, and even different versions) with VDI-END-of-File issues. It means that while you can start the VM correctly and perform XenMotion to another server, you are unable to do any storage-migration task – neither Storage XenMotion, nor VDI copy, nor VM-move commands. In some cases, snapshots taken from the “ill” disks misbehaved just the same. This is rather frustrating, because the way to solve it is by cloning the disk into a new one, and your hands are bound.

The method I devised for the task is rather simple: create a new VDI (on the target storage), map the original VDI and the new VDI to a control domain (dom0), and copy block-by-block using the ‘dd’ command. This is slow and crude, but it works.

How to do it? The steps, in general are:

  • Create a new VDI of the same size as, or larger than, the original VDI
  • Obtain the UUIDs of the old and new VDIs
  • Obtain the UUID of the control domain you intend to use for this task (it has to be one which has access to both VDIs)
  • Turn off the ‘ill’ VM, mark the ‘ill’ VDI in a way that lets you identify it easily (a unique name label, for example), and unmap it from the VM
  • Create VBDs for both VDIs on the control domain, and plug them
  • Create Linux device files for the VBDs on the control domain
  • Perform ‘dd’ between the old and new disks (do not get confused with the direction, or you will overwrite your data!)
  • Unplug and destroy the VBDs
  • Map the new VDI to the VM
  • Start the VM

I won’t go over how to create a VDI. Use the XenCenter GUI to do it. Place it on the desired SR, and give it a noticeable name, so you will be able to recognise it.

Get the UUID of the new VDI: xe vdi-list name-label="The name label I used" | grep ^uuid | awk '{print $NF}'
Do the same for the source VDI: use its name label, or use xe vbd-list to obtain its VDI UUID.

Get the UUID of the control domain you want to use: xe vm-list is-control-domain=true

Unmap the VM’s VDI from it (after setting some very noticeable name for it, and noting the disk number/ID it had on the VM)

On the control domain, run:
xe vbd-create vdi-uuid=<'Ill' VDI UUID> vm-uuid=<Control domain UUID> device=xvda
This command will result in a UUID. Note this UUID, as the source device UUID.

Run again for the target VDI. This time, use device=xvdb

Note this UUID as well. This is the target UUID.

We need to connect the VBDs and create a device node for them:
xe vbd-plug uuid=<UUID of source VBD created above>

There is now a new block device available to the XenServer host’s control domain. To identify it, run:
tail -1 /proc/partitions
The resulting line would look something like this:

253 10 40960000 tdk

The interesting fields are the first, the second, and the last – the major number, the minor number, and the device name. We will use them to create a block device node using the ‘mknod’ command:

mknod /dev/tdk b 253 10

The result will be a block device file called /dev/tdk with the major 253 and minor 10.

We repeat the process for the target VBD, and then we have two additional disks on the control domain.

We can (and should) copy using dd from the source to the target (don’t mix them up!). Assuming /dev/tdk is the source and /dev/tdl is the target, it would look like this:

dd if=/dev/tdk of=/dev/tdl bs=1M oflag=direct

We use oflag=direct to enforce direct writes and avoid saturating the control domain’s caches.
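Putting the dom0-side steps together, a rough sketch looks like this. The UUID variables and the major/minor numbers are placeholders – use the values you gathered above and what /proc/partitions actually shows:

# SRC_VDI, DST_VDI and DOM0 hold the UUIDs gathered earlier
SRC_VBD=$(xe vbd-create vdi-uuid=$SRC_VDI vm-uuid=$DOM0 device=xvda)
DST_VBD=$(xe vbd-create vdi-uuid=$DST_VDI vm-uuid=$DOM0 device=xvdb)
xe vbd-plug uuid=$SRC_VBD
xe vbd-plug uuid=$DST_VBD
tail -2 /proc/partitions                  # note major, minor and name of both new devices
mknod /dev/tdk b 253 10                   # placeholder major/minor for the source
mknod /dev/tdl b 253 11                   # placeholder major/minor for the target
dd if=/dev/tdk of=/dev/tdl bs=1M oflag=direct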

Following the operation, to release the disks and get back to business, we do:

  • xe vbd-unplug uuid=<SOURCE VBD UUID>
  • xe vbd-destroy uuid=<SOURCE VBD UUID>
  • xe vbd-unplug uuid=<TARGET VBD UUID>
  • xe vbd-destroy uuid=<TARGET VBD UUID>
  • Map the new disk to the VM, to the correct device number
  • Start the VM

If it starts OK, we can destroy the old VDI and celebrate. If it doesn’t, we can always map the previous (source) VDI back to the VM and start it anew.

I hope it helps.

ZFS with Redhat Cluster Suite

This is a very nice project I have been working on. The hardware at hand: two servers with a shared SAS bus containing several SAS disks. Since it’s a shared bus, no hardware RAID solution would cut it, and as I don’t want to waste disks with ASM (“normal” redundancy means half the usable size…), I went with ZFS storage.

ZFS is a wonderful technology, with many advantages, but with some dangerous pitfalls. As I prefer Linux, I did not bother with any Solaris solutions, and went directly to CentOS 6. I will describe my cluster setup below.

I will disclose the entire setup, including the hardware layout, the Linux platform, the ZFS module parameters, the Redhat Cluster Suite ZFS agent I wrote and the cluster.conf configuration file. I will also share my considerations regarding some of the choices I made. In addition, this system was designed to act as NFS storage for a Citrix XenServer pool, so I will have to describe the changes I had to perform on the XenServer itself (which might make it unsupported, but I will have to live with it) to allow it to handle the timeouts resulting from server failover.

So first – the servers: each has a single quad-core CPU, 24GB RAM, and dual 1Gb/s NICs. A tiny internal SATA disk is used for the OS. The shared disks – at the moment, 10 dual-port SAS disks (notice – older HP disks might state in very small letters that they are single-port SAS disks…), 72GB, 10K RPM. The zpool, called ‘share’, consists of two 5-disk RAIDZ1 vdevs. As I mentioned before, ZFS seemed like the best option for achieving my goals at minimal cost.
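For reference, creating such a layout would look roughly like this. This is a sketch – the device names are placeholders, and on a shared bus you would normally use the persistent /dev/disk/by-id paths rather than sdX names:

zpool create share \
        raidz1 sdb sdc sdd sde sdf \
        raidz1 sdg sdh sdi sdj sdk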

When I came to this project, I wanted a native ZFS cluster agent, and not a ‘script’ agent, which takes a very long time to respond (30 seconds). I also wanted to be able to handle multiple storage pools concurrently, each floating on its own. While I have only one at the moment, I wanted fine-grained control over multiple pools. In addition, I am unable (or unwilling?) to manage the multiple filesystems introduced with each pool individually; I wanted to be able to import or export the pool silently and with a clear head, so I had to verify that those filesystems are not in use as part of the export process.

As an agent, I wanted to comply with Redhat Cluster Suite (RHCS from now on) OCF syntax. I used the supplied script as an inspiration for my agent script, so some of it might look familiar. All credit goes to the original authors, of course.

The operating system I selected was CentOS 6. CentOS is based on Red Hat Enterprise Linux, and I find it mature and stable, which is exactly what I want when I plan a production-ready, enterprise-class storage solution. The version had to be x86_64, due to ZFS requirements and due to the amount of RAM in the server.

To handle ZFS options, I added a file called /etc/modprobe.d/zfs.conf, with the following content

install zfs /bin/rm -f /etc/zfs/zpool.cache && /sbin/modprobe --ignore-install zfs
options zfs zfs_arc_max=12593790976
options zfs zfs_arc_min=12593790975

I had to verify there is no zpool.cache file. Since my pool is rather small (planned for 24 disks max), I was not concerned by the longer import process caused by not having zpool.cache. I was more concerned about an automatic import that might happen, and had to prevent it at almost any cost. In addition, I learned from other systems that the ARC memory should never exceed half the RAM, so it is given just a little under that.

Of course, when changing such module settings, you need to recreate initrd (dracut -f) to be on the safe side later on.
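After a reboot, a quick sanity check that the module picked up these values – the ZFS-on-Linux module exposes its parameters under /sys/module/zfs/parameters:

cat /sys/module/zfs/parameters/zfs_arc_max /sys/module/zfs/parameters/zfs_arc_min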

The agent script was placed in /usr/share/cluster directory. You must have rgmanager installed for this directory to exist, and anyhow, without rgmanager, you will have no cluster whatsoever.

These are the contents of the file. Notice that it is not compatible with Luci, so if you’re using Luci – these two won’t play well together.

# Private return codes
. $(dirname $0)/ocf-shellfuncs
    cat <
	This script will import and export ZFS storage pools
	It will make sure to mount and umount all child filesystems
        This is a ZFS pool
                Symbolic name for this zfs pool
                File System Name
		ZFS Pool name or ID
                ZFS pool name
		ZFS Pool alternate mount
                ZFS pool alternate mount
                If set, the cluster will kill all processes using 
                this file system when the resource group is 
                stopped.  Otherwise, the unmount will fail, and
                the resource group will be restarted.
                Force Unmount
                If set and unmounting the file system fails, the node will
                immediately reboot.  Generally, this is used in conjunction
                with force-unmount support, but it is not required.
                Seppuku Unmount
	<!-- Note: active monitoring is constant and supplants all              check depths -->
        <!-- Checks to see if we can read from the mountpoint -->
        <!-- Checks to see if we can write to the mountpoint (if !ROFS) -->
        echo $*
verify_driver() {
	ocf_log info "Verifying ZFS driver"
	lsmod | grep -w zfs > /dev/null 2>&1 && return 0
	ocf_log err "ZFS driver is not loaded"
	return $OCF_ERR_ARGS
verify_poolname() {
	ocf_log info "Verify pool name "
	if [ -z "$OCF_RESKEY_pool" ]
		ocf_log err "Missing pool name"
		return $OCF_ERR_ARGS
	zpool import | grep pool: | grep -w $OCF_RESKEY_pool > /dev/null 2>&1 && return 0
	ocf_log err "Cannot identify pool name"
	return $OCF_ERR_ARGS
verify_mounted_poolname() {
	ocf_log info "Verify pool name "
	if [ -z "$OCF_RESKEY_pool" ]
		ocf_log err "Missing pool name"
		return $OCF_ERR_ARGS
	zpool list $OCF_RESKEY_pool > /dev/null 2>&1 && return 0
	ocf_log err "Cannot identify pool name"
	return $OCF_ERR_ARGS
verify_mountpath() {
	ocf_log info "Verifying alternate root mount path"
	[ -z "$OCF_RESKEY_mount" ] &amp;&amp; return 0
	declare mp="${OCF_RESKEY_mount}"
	case "$mp" in
		/*)    	# found it
        	*)      # invalid format
			ocf_log err \
"verify_mountpath: Invalid mount point format (must begin with a '/'): \'$mp\'"
                return $OCF_ERR_ARGS
pool_import() {
	ocf_log info "Importing pool"
	[ -n "$OCF_RESKEY_mount" ] &amp;&amp; OPTS="-R $OCF_RESKEY_mount"
	zpool import $OCF_RESKEY_pool $OPTS
	if [ "$RET" -ne "0" ]
		ocf_log info "Cannot import without applying force"
		zpool import -f $OCF_RESKEY_pool $OPTS
	if [ "$RET" -ne "0" ]
		ocf_log err "Pool import failed for $OCF_RESKEY_pool. error=$RET"
		return 1
	ocf_log info "Imported ZFS pool"
	return $RET
check_and_release_fs() {
	ocf_log info "Checking and releasing FS"
	case ${OCF_RESKEY_force_unmount} in
        $YES_STR|on|true|1)	force_umount=$YES ;;
        *)		        force_umount="" ;;
	for i in `zfs list -t filesystem | grep ^${OCF_RESKEY_pool} | awk '{print $NF}'`
		# To be on the safe side. Why not?
		sleep 1
		# Is it mounted?
		if ! df -l | grep -w "$i" > /dev/null 2>&1
			ocf_log info "Filesystem $i is not mounted"
		if [ `lsof $i | wc -l` -gt "0" ]
			ocf_log info "Filesystem $i is in use"
			if [ "$force_umount" ]
				ocf_log info "Attempting to kill processes on $i filesystem"
				fuser -k $i
				sleep 2
				if [ `lsof $i | wc -l` -gt "0" ]
					ocf_log err "Cannot umount filesystem $i - filesystem in use"
					return 1
				ocf_log err "Cannot umount filesystem $i
 - filesystem in use"
                                return 1
	return $RET	
self_fence() {
	ocf_log info "Should we validate and call self-fence?"
	case ${OCF_RESKEY_self_fence} in
		$YES_STR|on|true|1)       self_fence=$YES ;;
       		*)              self_fence="" ;;
	if [ "$self_fence" ]; then
		ocf_log alert "umount failed - REBOOTING"
                reboot -fn
pool_export() {
	ocf_log info "Exporting zfs pool"
	check_and_release_fs || self_fence
	zpool export $OCF_RESKEY_pool
	if [ "$RET" -ne "0" ]
		ocf_log err "Pool export failed for $OCF_RESKEY_pool. error=$RET"
		return 1
	return $RET
start() {
	ocf_log info "Starting ZFS"
	verify_driver || return $OCF_ERR_ARGS 
	verify_poolname || return $OCF_ERR_ARGS
	verify_mountpath || return $OCF_ERR_ARGS
	# Handle filesystem?
stop() {
	ocf_log info "Starting ZFS"
	verify_driver || return $OCF_ERR_ARGS 
	verify_mounted_poolname || return $OCF_ERR_ARGS
	verify_mountpath || return $OCF_ERR_ARGS
	# Handle filesystem?
is_imported() {
	ocf_log debug "Checking if $OCF_RESKEY_pool is imported"
	zpool list ${OCF_RESKEY_pool} > /dev/null 2>&1
	return $?
is_alive() {
	ocf_log debug "Checking ZFS pool read/write"
	declare file=".writable_test.$(hostname)"
	declare TIMEOUT="10s"
	[ -z "$OCF_CHECK_LEVEL" ] &amp;&amp; export OCF_CHECK_LEVEL=0
	mount_point=`zfs list ${OCF_RESKEY_pool} | grep ${OCF_RESKEY_pool} | awk '{print $NF}'`
	test -d "$mount_point"
        if [ $? -ne 0 ]; then
                ocf_log err "${OCF_RESOURCE_INSTANCE}: is_alive: $mount_point is not a directory"
                return $FAIL
	[ $OCF_CHECK_LEVEL -lt 10 ] && return $YES
        # depth 10 test (read test)
        timeout -s 9 $TIMEOUT ls "$mount_point" > /dev/null 2> /dev/null
        if [ $errcode -ne 0 ]; then
                ocf_log err "${OCF_RESOURCE_INSTANCE}: is_alive: failed read test on [$mount_point]. Return code: $errcode"
                return $NO
	[ $OCF_CHECK_LEVEL -lt 20 ] && return $YES
        # depth 20 check (write test)
        for o in `echo $OCF_RESKEY_options | sed -e s/,/\ /g`; do
                if [ "$o" = "ro" ]; then
	if [ $rw -eq $YES ]; then
                while true; do
                        if [ -e "$file" ]; then
                timeout -s 9 $TIMEOUT touch $file > /dev/null 2> /dev/null
                if [ $errcode -ne 0 ]; then
                        ocf_log err "${OCF_RESOURCE_INSTANCE}: is_alive: failed write test on [$mount_point]. Return code: $errcode"
                        return $NO
                rm -f $file > /dev/null 2> /dev/null
	return $YES
monitor() {
	ocf_log debug "Checking ZFS pool $OCF_RESKEY_pool, Level $OCF_CHECK_LEVEL"
	verify_driver || return $OCF_ERR_ARGS 
	if [ "$RET" -ne $YES ]; then
                ocf_log err "${OCF_RESOURCE_INSTANCE}: ${OCF_RESKEY_device} is not mounted on ${OCF_RESKEY_mountpoint}"
                return $OCF_NOT_RUNNING
	return $RET
if [ -z "$OCF_CHECK_LEVEL" ]; then
case $1 in
	ocf_log info "zfs start $OCF_RESKEY_pool\n"
	[ "$?" -ne "0" ] &amp;&amp; start || ocf_log info "$OCF_RESKEY_pool is already mounted"
	exit $?
	ocf_log info "zfs stop $OCF_RESKEY_pool\n"
	[ "$?" -eq "0" ] &amp;&amp; stop || ocf_log info "$OCF_RESKEY_pool is not mounted"
	exit $?
	ocf_log debug "ZFS monitor $OCF_RESKEY_pool"
	exit $?
	echo -e "zfs metadat $OCF_RESKEY_address\n" &gt;&gt;/tmp/out
	exit 0
	exit 0
	echo "usage: $0 {start|stop|status|monitor|restart|meta-data|validate-all}"

All I had to do now was to build the cluster.conf file.

The reason I placed the IP address as the last resource to start and the first to stop is that, the other way around, the NFS client would receive an orderly disconnection and would not bother to re-establish a connection with the remaining server. Abruptly taking away the clustered IP address causes the NFS clients to initiate a reconnection process, from which the systems are supposed to recover.

I have left this article incomplete for a while now. It has some stuff I do like to share, so I am sharing it as-is. I will (some day) complete it.

Two advanced bash tricks

Well, ‘tricks’ is not the right word for advanced shell scripting usage; however, it does make some sense. These two topics are relevant to Bash version 4.0 and above, which is common to all modern-enough Linux distributions – probably including yours.

These ‘tricks’ are for advanced Bash scripting, and assume you know how to handle other advanced Bash topics. I will not teach the basics here.

Trick #1 – redirected variable

What it means is the following.

Let’s assume that I have a list of objects, say: 'LIST="a b c d"', and you want to create a set of new variables by these names, holding data. For example:

a=$RANDOM
b=$RANDOM
c=$RANDOM
d=$RANDOM

How can you iterate through the contents of $LIST and do it right? If you have only four objects, you can live with stating them manually; however, for a dynamic list (for example, the results of /dev/sd*1 on your system), you might find it a bit problematic.

A solution is to use redirected variables. Up until recently, the method involved a very complex ‘expr’ command which was unpleasant at best and hard to figure out at worst. Now we can use normal redirected variables, using the exclamation mark. See here:

# Place data into the list
for OBJECT in $LIST; do export $OBJECT=$RANDOM ; done

# Read it!
for OBJECT in $LIST; do echo ${!OBJECT} ; done

Firstly – to assign a value to the redirected variable, we must use the ‘export’ prefix; a plain $OBJECT=$RANDOM will not work.
Secondly – to show the content, we need to use the exclamation mark inside the variable’s curly brackets, meaning we cannot call it $!OBJECT, but ${!OBJECT}.
We cannot dynamically create the variable name inside the curly brackets either, so ${!abc_$SUFFIX} won’t work. We can create the name beforehand and then use it, like this: DynName=abc_$SUFFIX ; echo ${!DynName}

Trick #2 – Using strings as an array index

It was impossible in the past, but now one of the most useful features of smart lists is accessible in the shell: associative arrays. We can now index an array with a string label. For example:

for FILE in $( ls ); do
	array["$FILE"]=$( ls -la "$FILE" | awk '{print $5}' )
done

In this example we create array cells labelled with the name of each file, and populate them with the file’s size (the 5th field of ls -la output).

This will work only if the array was declared beforehand using the following command (using the array name ‘array’ here):

declare -A array

Later on, it is easy to query data out of the array, as long as you know its index name. For example:

echo ${array["ez-aton.txt"]}

Of course – assuming there is an entry for ez-aton.txt in this array.

The best use I have found for this feature so far was for comparing large lists, without the need to reorder the objects in the array. I find it boosts the capabilities of arrays in Bash, and arrays in general are very powerful tools for handling long and complex lists when you need to keep track of position.
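As a rough sketch of that comparison idea (the list file names here are made up for illustration):

declare -A seen
for ITEM in $( cat list_a.txt ); do
	seen["$ITEM"]=1
done
for ITEM in $( cat list_b.txt ); do
	[ -z "${seen[$ITEM]}" ] && echo "$ITEM is in list_b.txt but not in list_a.txt"
done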

That’s all, folks. Note that the blog editor might change quotes (single and double) and dashes to their UTF-8 versions, which will not go well in a copy/paste attempt to experiment with the code examples placed here. You might need to edit the contents and fix the quotes/dashes manually.

If you have any questions, comment here and I will be happy to elaborate. I hope to be able to add more complex Bash stuff I run into once in a while :-)

NetApp internals – how to add SSH keys without C$ nor NFS shares

This post describes the process of placing SSH keys using NetApp’s internal ‘systemshell’ command. As always – when doing something the vendor did not intend you to do, do it very carefully. This data was obtained from the NetApp forums, and while I do not have the original post to link to (I usually link to the original, as a courtesy to its author), this is the content, as is.

First, set to advanced mode:
filer> priv set advanced

Then, unlock and set a password to diag account:
filer*> useradmin diaguser unlock
filer*> useradmin diaguser password

Start the systemshell, create the directory you need and put the pubkey generated in the authorized_keys file:
filer*> systemshell

login: diag
Password: the same you set in the previous step

filer% mkdir -p /mroot/etc/sshd/root/.ssh
filer% vi /mroot/etc/sshd/root/.ssh/authorized_keys
filer% sudo chown -R root:wheel /mroot/etc/sshd/root
filer% sudo chmod -R 0600 /mroot/etc/sshd/root

Last, exit systemshell, lock diag account and exit advanced mode:
filer% exit
filer*> useradmin diaguser lock
filer*> priv set admin

If you want to do it for any other user, just replace the word ‘root’ with the said user.

An additional note – I had to create a user that can perform ‘df’ operations only. The purpose was to be able to obtain data over SSH without disclosing the keys used for root SSH access, by having a very limited user designed just for that.

So the commands to create such a user are as follows:

useradmin role add df -a cli-df*,login-ssh
useradmin group add df_users -r df
useradmin user add df -g df_users
(here you will be asked to enter the user’s password)
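With an SSH key placed for this user (same systemshell procedure as above, replacing the word ‘root’ with ‘df’), obtaining the data from a remote host would look something like this – the filer name and volume path are placeholders:

ssh df@filer01 df -h /vol/volume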

Hope it helps!



Windows 7 hammering dnsmasq

I migrated to dnsmasq just yesterday, and discovered that a Windows 7 machine was hammering the server with messages like this:

Feb  1 11:06:07 dns dnsmasq-dhcp[1078]: DHCPINFORM(eth0) 91:de:87:7b:e5:a8
Feb  1 11:06:07 dns dnsmasq-dhcp[1078]: DHCPACK(eth0) 91:de:87:7b:e5:a8 winpc

Googling a bit, I found this link (with an explanation). The solution is fairly simple – add the following line to your dnsmasq.conf file to solve the problem: