Posts Tagged ‘Disk Storage’

Redhat Cluster and Citrix XenServer

Thursday, April 9th, 2015

I wanted to write down a guide for RHCS on RHEL/Centos6 and XenServer.

If you want to do that, you need to go through two major challenges which you will encounter. I want to save on the search and sum it all up together here.

The first difficulty is the shared disk. In order to set up most common cluster scenarios, you will need a shared storage. You could either map the VMs to an iSCSI LUNs external to the environment, however, if you do not have such infrastructure (either because everything is based on SAS/FC, or you do not have the ability to set up iSCSI storage with reasonable level of availability), you will want XenServer to allow you to share the VDI between two VMs.

In order to do so, you will need to add a flag to all your pool’s XenServers, and to create the VDI in a specific method. First – the flag – you need to create a file in /etc/xensource called “allow_multiple_vdi_attach”. Do not forget to add it to all your XenServers:

touch /etc/xensource/allow_multiple_vdi_attach

Next, you will need to create your VDI as “raw” type. This is an example. You need to change the SR UUID to the one you use:

xe vdi-create sm-config:type=raw sr-uuid=687a023b-0b20-5e5f-d1ef-3db777ce7ae4 name-label=”My Raw LVM VDI” virtual-size=8GiB type=user

You can find Citrix article about it here.

Following that, you can complete your cluster setup and configuration. I will not add details about it here, as this is not the focus of this article. However, when it comes to fencing, you will need a solution. The solution I used was a fencing agent which was written specifically for XenServer using XenAPI, by using the agent called fence-xenserver. I did not use the fencing agents repository (which this page also points to), because I was unable to compile the required components to run on Centos6. They just don’t compile well. This is, however, a simple Python script which actually works.

In order to make it work, I did the following:

  • Extracted the archive (version 0.8)
  • Placed fence_cxs* in /usr/sbin, and removed their ‘.py’ suffix
  • Placed as-is in /usr/sbin
  • Verified /usr/sbin/fence_cxs* had execution permissions.

Now, I needed to add it to the cluster configuration. Since the agent cannot handle accessing a non-pool master, it had to be defined for each pool member (I cannot tell in advance which of them is going to have the pool master role when a failover should happen). So, this is my cluster.conf relevant parts:

<fencedevice agent=”fence_cxs_redhat” login=”root” name=”xenserver01″ passwd=”password” session_url=”https://xenserver01″/>
<fencedevice agent=”fence_cxs_redhat” login=”root” name=”xenserver02″ passwd=”password” session_url=”https://xenserver02″/>
<fencedevice agent=”fence_cxs_redhat” login=”root” name=”xenserver03″ passwd=”password” session_url=”https://xenserver03″/>
<fencedevice agent=”fence_cxs_redhat” login=”root” name=”xenserver04″ passwd=”password” session_url=”https://xenserver04″/>
<clusternode name=”clusternode1″ nodeid=”1″>
<method name=”xenserver01″>
<device name=”xenserver01″ vm_name=”clusternode1″/>
<method name=”xenserver02″>
<device name=”xenserver02″ vm_name=”clusternode1″/>
<method name=”xenserver03″>
<device name=”xenserver03″ vm_name=”clusternode1″/>
<method name=”xenserver04″>
<device name=”xenserver04″ vm_name=”clusternode1″/>
<clusternode name=”clusternode2″ nodeid=”2″>
<method name=”xenserver01″>
<device name=”xenserver01″ vm_name=”clusternode2″/>
<method name=”xenserver02″>
<device name=”xenserver02″ vm_name=”clusternode2″/>
<method name=”xenserver03″>
<device name=”xenserver03″ vm_name=”clusternode2″/>
<method name=”xenserver04″>
<device name=”xenserver04″ vm_name=”clusternode2″/>

Attached xenserver-fencing-cluster.xml for clarity (WordPress makes a mess out of that)

Note that I used four (4) entries, since my pool has four hosts. Also note the VM name (it is case sensitive), and your methods – one for each host, since you don’t want them running in parallel, but one at a time. Failover time is between 5-15 seconds on my tests, depending on who is the actually pool master (xenserver04 takes the longest, obviously). I did not test it with pool master down (before or without HA kicking in), nor with the hosts down and thus TCP timeout is longer (than when attempting to connect a host which responds immediately that it is not the pool master). However, if ILO fencing takes about 30-60 seconds, I am not complaining about the current timeouts.

NetApp “Broken disk label”

Thursday, August 1st, 2013

When using ‘disk show -v’ on a NetApp filer version 7.3.x, following replacement or addition of disk(s), you might see the above mentioned message. It is caused by incorrect disk label – of OnTap version 8, on an OnTap version 7.3.x system. The system cannot handle the incorrect label, and thus – ignores the disk.

A set of actions is required to clean the label and allow the NetApp to use this specific disk. The easiest method (although it will not be described here) would be to place the disk back in an OnTap 8 NetApp device, and clean the label from there, however, it is not always possible.

On your OnTap 7.3.x system, do the following (assuming you know the address of the disk, right?) – taken from NetApp’s forums here.

disk assign <diskid>
priv set diag
labelmaint isolate <diskid>
label wipe <diskid>
label wipev1 <diskid>
label makespare <diskid>
labelmaint unisolate
priv set

The fifth or sixth lines might fail to run, but still – the process will succeed as a whole.


Monday, July 14th, 2008

Linux works perfectly well with multiple storage links using dm-multipath. Not only that, but HP has released their own spawn of dm-multipath, which is optimized (or so claimed, but, anyhow, well configured) to work with EVA and MSA storage devices.

This is great, however, what do you do when mapping volume snapshots through dm-multipath? For each new snapshot, you enjoy a new WWID, which will remap to a new “mpath” name, or raw wwid (if “user_friendly_name” is not set). This can, and will set chaos to remote scripts. On each reboot, the new and shiny snapshot will aquire a new name, thus making scripting a hellish experience.

For the time being I have not tested ext3 labels. I suspect that using labels will fail, as the dm-multipath over layer device does not hide the under layered sd devices, and thus – the system might detect the same label more than once – once for each under layered device, and once for the dm-multipath over layer.

A solution which is both elegant and useful is to fixate the snapshots’ WWID through a small alteration to SSSU command. Append a string such as this to the snap create command:


Don’t use the numbers supplied here. “invent” your own 🙂

Mind you that you must use dashes, else the command will fail.

Doing so will allow you to always use the same WWID for the snapshots, and thus – save tons of hassle after system reboot when accessing snapshots through dm-multipath.

HP MSA1000 controller failover

Tuesday, March 27th, 2007

HP MSA1000 is an entry-level disk storage capable of communicating via different types of interfaces, such as SCSI and FC, and can allow FC failover. This FC failover, however, is controller failover and not path failover. It means that if the primary controller fails entirely, the backup controller will “kick in”. However, if a multi-path capable client will fail its primary interface, there is no guarantee that communication with the disks through the backup controller.

The symptom I have encountered was that the secondary path, while exposing the disks (while the primary path was down for one of the servers) to the server, did not allow any SCSI I/O operations. This prevented the Linux server’s SCSI layer from accessing the disks. So they did appear when doing “cat /proc/scsi/scsi“, however, they were not detected using, for example, “fdisk -l“, and the system logs got filled with “SCSI Error” messages.

About a month ago, after almost two years, a new firmware update has been released (can be found here). Two versions exist – Active/Passive and Active/Active.

I have upgraded the MSA1000 storage device.

After installing the Active/Active firmware upgrade (Notice Linux users – You must have X to run the “msa1500flash” utility), and after power cycling the MSA1000 device, things start to look good.

I have tested performance with a person on-site disconnecting fiber connections on-demand, and it worked great. About 2-5 seconds failover time.

Since this system run Oracle RAC, and it uses OCFS2, I had to update the failed-node timeout to be 31 seconds (per this Oracle’s OCFS site, which includes some really good tips).

So real High Availability can be archived after upgrading MSA1000 firmware.

Hitachi HW100 limitations

Sunday, March 19th, 2006

Not too long ago I have purchased a brand new Hitachi HW100 Workgroup Storage. 14+1 400GB SATA disks, dual-interface, each holding two fiber ports (2Gb/s LC). A nice machine. However, two limitations I have discovered are clouding my day:

1. It supports only Raid0 and Raid5. It will not allow me no-parity Raid, aka, Raid1. It is a pity, because my storage is meant for the lab, and not for production, and I wish to decide my own Raid level. That is why it’s 14+1 disks, and not 15. You must have a spare disk.

2. While not connected directly to hosts (Private Loop), only one port out of two ports in an interface works. It means I bought 4 ports, and actually got two. Why would I need four? I need to simulate multi-path, load balancing and utilize maximum flexibility when connecting this storage to my fiber network. After all, this network is all lab, no production network, and I need to maximize my output, with as little redundancy as possible.

These two limitations are somewhat hidden, and, especially the second one, was discovered only after I’ve questioned HDS’ support person. Pity. I will want to return it, and get a more capable one. I need to see what I can do with it.