Posts Tagged ‘redhat’

Connecting EMC/NetApp shelves as JBOD to a Linux machine

Wednesday, April 29th, 2015

Let’s say you have old shelves of either EMC or NetApp with SAS or SATA disks in them. And let’s say you want to connect them via FC to a Linux machine and have some nice ZFS machine/cluster, or whatever else. There are few things to know, and to attend in order for it to work.

The first one is the sector size. For NetApp – this applies only to non SATA disks (I don’t know about SSDs, though), and for EMC this could apply, as far as I noticed, to all disks – sector size is not 512 bytes, but 520 – the additional 8 bytes are used for block checksum. Linux does not handle well 520 blocks – the following error message will appear in the logs:

Unsupported sector size 520.

To solve it, we will need to identify the disks – using sg3_utils (in Centos-like – yum install sg3_utils) and then modify them to block size of 512 bytes. To identify the disks, run:

sg_scan -i
/dev/sg0: scsi0 channel=3 id=0 lun=0
HP P410i 3.66 [rmb=0 cmdq=1 pqual=0 pdev=0xc] /dev/sg1: scsi0 channel=0 id=0 lun=0
HP LOGICAL VOLUME 3.66 [rmb=0 cmdq=1 pqual=0 pdev=0x0] /dev/sg2: scsi3 channel=0 id=0 lun=0 [em] hp DVD A DS8A5LH 1HE3 [rmb=1 cmdq=0 pqual=0 pdev=0x5] /dev/sg3: scsi1 channel=0 id=0 lun=0
SEAGATE SX3500071FC DA04 [rmb=0 cmdq=1 pqual=0 pdev=0x0] /dev/sg4: scsi1 channel=0 id=1 lun=0
SEAGATE SX3500071FC DA04 [rmb=0 cmdq=1 pqual=0 pdev=0x0] /dev/sg5: scsi1 channel=0 id=2 lun=0
SEAGATE SX3500071FC DA04 [rmb=0 cmdq=1 pqual=0 pdev=0x0] /dev/sg6: scsi1 channel=0 id=3 lun=0
SEAGATE SX3500071FC DA04 [rmb=0 cmdq=1 pqual=0 pdev=0x0] /dev/sg7: scsi1 channel=0 id=4 lun=0
SEAGATE SX3500071FC DA04 [rmb=0 cmdq=1 pqual=0 pdev=0x0] /dev/sg8: scsi1 channel=0 id=5 lun=0
SEAGATE SX3500071FC DA04 [rmb=0 cmdq=1 pqual=0 pdev=0x0] /dev/sg9: scsi1 channel=0 id=6 lun=0
SEAGATE SX3500071FC DA04 [rmb=0 cmdq=1 pqual=0 pdev=0x0] /dev/sg10: scsi1 channel=0 id=7 lun=0
SEAGATE SX3500071FC DA04 [rmb=0 cmdq=1 pqual=0 pdev=0x0] /dev/sg11: scsi1 channel=0 id=8 lun=0
FUJITSU MXW3300FE 0906 [rmb=0 cmdq=1 pqual=0 pdev=0x0] /dev/sg12: scsi1 channel=0 id=9 lun=0
FUJITSU MXW3300FE 0906 [rmb=0 cmdq=1 pqual=0 pdev=0x0] /dev/sg13: scsi1 channel=0 id=10 lun=0
SEAGATE SX3300007FC D41B [rmb=0 cmdq=1 pqual=0 pdev=0x0] /dev/sg14: scsi1 channel=0 id=11 lun=0
SEAGATE SX3300007FC D41B [rmb=0 cmdq=1 pqual=0 pdev=0x0] /dev/sg15: scsi1 channel=0 id=12 lun=0
SEAGATE SX3300007FC D41B [rmb=0 cmdq=1 pqual=0 pdev=0x0] /dev/sg16: scsi1 channel=0 id=13 lun=0
SEAGATE SX3300007FC D41B [rmb=0 cmdq=1 pqual=0 pdev=0x0] /dev/sg17: scsi1 channel=0 id=14 lun=0
SEAGATE SX3300007FC D41B [rmb=0 cmdq=1 pqual=0 pdev=0x0]

So, for each sg device (member of our batch of disks) we need to modify the sector size.

Two ways to do so – the first suggested by this post here, is by using sg_format in the following manner:

sg_format –format –size=512 /dev/sg2

Another post suggested using a dedicated program called ‘setblocksize’. I followed this one, and it worked fine. I had to power cycle the disks before the Linux could use them.

I did notice that disk performance were not bright. I got about 45MB/s write, and about 65-70 MB/s read for large sequential operations, using something like:

dd bs=1M if=/dev/sdf of=/dev/null bs=1M count=10000
dd bs=1M if=/dev/null of=/dev/sdf oflag=direct count=10000 # WARNING – this writes on the disk. Do not use for disks with data!

Fairly disappointing. Also, using multipath, when the shelf is connected to one FC port, and then back to another, showed me that with the setting:

path_grouping_policy multibus

I got about 10MB/s less compared to using “failover” flag (the default for Centos 6). Whatever modification I did to the multipathd.conf, I was unable to exceed this number when using multiple access. These results were consistent when using multibus or group_by_serial, however, when a single path was active and the other was passive, It clearly showed better. I did modify rr_min_io and rr_min_io_rq, but with no effect.

The low disk performance could suggest I need to flush the original disk firmware, however, I am not sure I will do so. If anyone is reading this and had different results – I would love to hear about it.

Timeout when using Ricci as the backend for Corosync update in Redhat Cluster

Saturday, March 14th, 2015

When using Ricci as the engine for ‘cman_tool version -r’ command, you will experience timeouts (and practically – you will be unable to use ricci to update the cluster configuration across the nodes) when the ricci user password contains XML-sensitive characters, like <>&, etc.
As they say – FYI :-)

ZFS with Redhat Cluster Suite

Friday, July 25th, 2014

This is a very nice project I have been working on. The hardware at hand - two servers, with a shared SAS bus containing several SAS disks. Since it's a shared bus, no RAID solution would cut it, and as I don't want to waste disks with ASM ("normal" redundancy meaning half the size...), I went to ZFS storage.

ZFS is a wonderful technology, with many advantages, but with some dangerous pitfalls. As I prefer Linux, I did not bother with any Sloaris solutions, and went directly to Centos 6. I will describe my cluster setup below.

I will disclose the entire setup, including hardware layout, Linux platform, ZFS module parameters, the Redhat Cluster Suite ZFS agent I wrote and the cluster.conf configuration file. I will also share my considerations regarding some of the choices I made. In addition, this system was designed to act as NFS storage for Citrix XenServer pool, so I will have to describe the changed I had to perform on the XenServer itself (which might make it unsupported, but I will have to live with it), to allow it to handle the timeouts resulting by server failover.

So first - the servers - each having a single CPU (quad core), 24GB RAM, and dual 1Gb/s NICs. Also - a tiny internal SATA disk is used for the OS. The shared disks - at the moment, 10 SAS disks, dual port (notice - older HP disks might mark in a very small letters that they are only a single-port SAS disks...), 72GB, 10K RPM. Zpool called 'share' with two 5 disks RaidZ1 vdevs. As I mentioned before - ZFS seemed like the best possible option allowing me to achieve my goals at minimal cost.

When I came to this project, I wanted to be able to use a native ZFS cluster agent, and not a 'script' agent, which takes a very long time to respond (30 seconds). Also - I wanted to be able to handle multiple storage pools concurrently - each floating on its own. While I have only one at the moment, I wanted the ability to have a fine-grained control over multiple pools. In addition - I am unable (or unwilling?) to handle the multiple filesystems introduced with each pool. I wanted to be able to import or export the pool silently, and with a clear head, thus I had to verify that the multiple filesystems are not in use as part of the export process.

As an agent, I wanted to comply with Redhat Cluster Suite (RHCS from now on) OCF syntax. I used the supplied script as an inspiration for my agent script, so some of it might look familiar. All credit goes to the original authors, of course.

The operating system I selected was Centos 6. Centos is based on Redhat Linux, and I find it mature and stable, which is exactly what I want when I plan a production-ready, enterprise-class storage solution. The version had to be x86_64, due to ZFS requirements, and due to the amount of RAM in the server.

To handle ZFS options, I added a file called /etc/modprobe.d/zfs.conf, with the following content

install zfs /bin/rm -f /etc/zfs/zpool.cache && /sbin/modprobe --ignore-install zfs
options zfs zfs_arc_max=12593790976
options zfs zfs_arc_min=12593790975

I had to verify there is no zpool.cache file. Since my pool was rather small (planned for 24 disks max), I was not concerned by the longer import process caused by not having the zpool.cache file. I was more concerned with automatic import process which might happen, and had to prevent it at almost any cost. In addition, I learned from other systems that the arc memory should never exceed half the RAM, and it should be given just a little under that.

Of course, when changing such module settings, you need to recreate initrd (dracut -f) to be on the safe side later on.

The agent script was placed in /usr/share/cluster directory. You must have rgmanager installed for this directory to exist, and anyhow, without rgmanager, you will have no cluster whatsoever.

This is the contents of the file. Notice that it is not compatible with Luci, so if you're using it - them kids won't play well together.


# Private return codes

. $(dirname $0)/ocf-shellfuncs

    cat < EOT


	This script will import and export ZFS storage pools
	It will make sure to mount and umount all child filesystems

        This is a ZFS pool

                Symbolic name for this zfs pool

                File System Name

		ZFS Pool name or ID

                ZFS pool name

		ZFS Pool alternate mount

                ZFS pool alternate mount

                If set, the cluster will kill all processes using 
                this file system when the resource group is 
                stopped.  Otherwise, the unmount will fail, and
                the resource group will be restarted.

                Force Unmount

                If set and unmounting the file system fails, the node will
                immediately reboot.  Generally, this is used in conjunction
                with force-unmount support, but it is not required.

                Seppuku Unmount




        echo $*

verify_driver() {
	ocf_log info "Verifying ZFS driver"
	lsmod | grep -w zfs > /dev/null >&1 && return 0
	ocf_log err "ZFS driver is not loaded"
	return $OCF_ERR_ARGS

verify_poolname() {
	ocf_log info "Verify pool name "
	if [ -z "$OCF_RESKEY_pool" ]
		ocf_log err "Missing pool name"
		return $OCF_ERR_ARGS
	zpool import | grep pool: | grep -w $OCF_RESKEY_pool > /dev/null 2>&1 && return 0
	ocf_log err "Cannot identify pool name"
	return $OCF_ERR_ARGS

verify_mounted_poolname() {
	ocf_log info "Verify pool name "
	if [ -z "$OCF_RESKEY_pool" ]
		ocf_log err "Missing pool name"
		return $OCF_ERR_ARGS
	zpool list $OCF_RESKEY_pool > /dev/null >&1 && return 0
	ocf_log err "Cannot identify pool name"
	return $OCF_ERR_ARGS

verify_mountpath() {
	ocf_log info "Verifying alternate root mount path"
	[ -z "$OCF_RESKEY_mount" ] && return 0
	declare mp="${OCF_RESKEY_mount}"
	case "$mp" in
		/*)    	# found it
        	*)      # invalid format
			ocf_log err 
"verify_mountpath: Invalid mount point format (must begin with a '/'): '$mp'"
                return $OCF_ERR_ARGS

pool_import() {
	ocf_log info "Importing pool"
	[ -n "$OCF_RESKEY_mount" ] && OPTS="-R $OCF_RESKEY_mount"
	zpool import $OCF_RESKEY_pool $OPTS
	if [ "$RET" -ne "0" ]
		ocf_log info "Cannot import without applying force"
		zpool import -f $OCF_RESKEY_pool $OPTS
	if [ "$RET" -ne "0" ]
		ocf_log err "Pool import failed for $OCF_RESKEY_pool. error=$RET"
		return 1
	ocf_log info "Imported ZFS pool"
	return $RET

check_and_release_fs() {
	ocf_log info "Checking and releasing FS"
	case ${OCF_RESKEY_force_unmount} in
        $YES_STR|on|true|1)	force_umount=$YES ;;
        *)		        force_umount="" ;;

	for i in `zfs list -t filesystem | grep ^${OCF_RESKEY_pool} | awk '{print $NF}'`
		# To be on the safe side. Why not?
		sleep 1
		# Is it mounted?
		if ! df -l | grep -w "$i" > /dev/null 2>&1
			ocf_log info "Filesystem $i is not mounted"
		if [ `lsof $i | wc -l` -gt "0" ]
			ocf_log info "Filesystem $i is in use"
			if [ "$force_umount" ]
				ocf_log info "Attempting to kill processes on $i filesystem"
				fuser -k $i
				sleep 2
				if [ `lsof $i | wc -l` -gt "0" ]
					ocf_log err "Cannot umount filesystem $i - filesystem in use"
					return 1
				ocf_log err "Cannot umount filesystem $i
 - filesystem in use"
                                return 1
	return $RET	

self_fence() {
	ocf_log info "Should we validate and call self-fence?"
	case ${OCF_RESKEY_self_fence} in
		$YES_STR|on|true|1)       self_fence=$YES ;;
       		*)              self_fence="" ;;

	if [ "$self_fence" ]; then
		ocf_log alert "umount failed - REBOOTING"
                reboot -fn

pool_export() {
	ocf_log info "Exporting zfs pool"
	check_and_release_fs || self_fence
	zpool export $OCF_RESKEY_pool
	if [ "$RET" -ne "0" ]
		ocf_log err "Pool export failed for $OCF_RESKEY_pool. error=$RET"
		return 1
	return $RET

start() {
	ocf_log info "Starting ZFS"
	verify_driver || return $OCF_ERR_ARGS 
	verify_poolname || return $OCF_ERR_ARGS
	verify_mountpath || return $OCF_ERR_ARGS
	# Handle filesystem?

stop() {
	ocf_log info "Starting ZFS"
	verify_driver || return $OCF_ERR_ARGS 
	verify_mounted_poolname || return $OCF_ERR_ARGS
	verify_mountpath || return $OCF_ERR_ARGS
	# Handle filesystem?

is_imported() {
	ocf_log debug "Checking if $OCF_RESKEY_pool is imported"
	zpool list ${OCF_RESKEY_pool} > /dev/null >&1
	return $?

is_alive() {
	ocf_log debug "Checking ZFS pool read/write"
	declare file=".writable_test.$(hostname)"
	declare TIMEOUT="10s"
	[ -z "$OCF_CHECK_LEVEL" ] && export OCF_CHECK_LEVEL=0
	mount_point=`zfs list ${OCF_RESKEY_pool} | grep ${OCF_RESKEY_pool} | awk '{print $NF}'`
	test -d "$mount_point"
        if [ $? -ne 0 ]; then
                ocf_log err "${OCF_RESOURCE_INSTANCE}: is_alive: $mount_point is not a directory"
                return $FAIL
	[ $OCF_CHECK_LEVEL -lt 10 ] && return $YES

        # depth 10 test (read test)
        timeout -s 9 $TIMEOUT ls "$mount_point" > /dev/null 2> /dev/null
        if [ $errcode -ne 0 ]; then
                ocf_log err "${OCF_RESOURCE_INSTANCE}: is_alive: failed read test on [$mount_point]. Return code: $errcode"
                return $NO

	[ $OCF_CHECK_LEVEL -lt 20 ] && return $YES

        # depth 20 check (write test)
        for o in `echo $OCF_RESKEY_options | sed -e s/,/ /g`; do
                if [ "$o" = "ro" ]; then
	if [ $rw -eq $YES ]; then
                while true; do
                        if [ -e "$file" ]; then
                timeout -s 9 $TIMEOUT touch $file > /dev/null 2> /dev/null
                if [ $errcode -ne 0 ]; then
                        ocf_log err "${OCF_RESOURCE_INSTANCE}: is_alive: failed write test on [$mount_point]. Return code: $errcode"
                        return $NO
                rm -f $file > /dev/null 2> /dev/null

	return $YES

monitor() {
	ocf_log debug "Checking ZFS pool $OCF_RESKEY_pool, Level $OCF_CHECK_LEVEL"
	verify_driver || return $OCF_ERR_ARGS 
	if [ "$RET" -ne $YES ]; then
                ocf_log err "${OCF_RESOURCE_INSTANCE}: ${OCF_RESKEY_device} is not mounted on ${OCF_RESKEY_mountpoint}"
                return $OCF_NOT_RUNNING
	return $RET

if [ -z "$OCF_CHECK_LEVEL" ]; then

case $1 in
	ocf_log info "zfs start $OCF_RESKEY_pooln"
	[ "$?" -ne "0" ] && start || ocf_log info "$OCF_RESKEY_pool is already mounted"
	exit $?
	ocf_log info "zfs stop $OCF_RESKEY_pooln"
	[ "$?" -eq "0" ] && stop || ocf_log info "$OCF_RESKEY_pool is not mounted"
	exit $?
	ocf_log debug "ZFS monitor $OCF_RESKEY_pool"
	exit $?
	echo -e "zfs metadat $OCF_RESKEY_addressn" >>/tmp/out
	exit 0
	exit 0
	echo "usage: $0 {start|stop|status|monitor|restart|meta-data|validate-all}"

All I had to do now was to build the cluster.conf file.

The reason I placed the IP address as the last to start and the first to stop was that the other way around, the NFS client would receive an ordered disconnection command, and would not bother to establish a connection with the remaining server. Abruptly taking away the clustered IP address causes the NFS clients to initiate a reconnection process, of which the systems are supposed to recover

I have left this article incomplete for a while now. It has some stuff I do like to share, so I am sharing it as-is. I will (some day) complete it.

Extracting/Recreating RHEL/Centos6 initrd.img and install.img

Tuesday, October 1st, 2013

A quick note about extracting and recreating RHEL6 or Centos6 (and their derivations) installation media components:



mv initrd.img /tmp/initrd.img.xz
cd /tmp
xz –format=lzma initrd.img.xz –decompress
mkdir initrd
cd initrd
cpio -ivdum < ../initrd.img

Archive (after you applied your changes):

cd /tmp/initrd
find . | cpio -o -H newc | xz -9 –format=lzma > ../new-initrd.img



mount -o loop install.img /mnt
mkdir /tmp/install.img.dir
cd /mnt ; tar cf – –one-file-system . | ( cd /tmp/install.img.dir ; tar xf – )
umount /mnt

Archive (after you applied your changes):

cd /tmp
mksquashfs install.img.dir/ install-new.img

Additional note for Anaconda installation parameters:

I did not test it, however there is a boot flag called stage2= which should lead to a new install.img file, other than the hardcoded one. I don’t if it will accept /images/install-new.img as its flag, but it can be a good start there.

One more thing:

Make sure that the vmlinuz and initrd used for any custom properties, in $CDROOT/isolinux do not exceed 8.3 format. Longer names didn’t work for me. I assume (without any further checks) that this is isolinux limitation.

IPSec VPN for mobile devices on Linux

Saturday, December 8th, 2012

I have had recently the pleasure and challenge of setting up VPN server for mobile devices on top of Linux. the common method to do so would be by using IPSec + L2TP, as these are to more common methods mobile devices allow, and it should work quite fine with other types of clients (although I did not test it) like Linux, Windows and Mac.

I have decided to use PSK (Pre Shared Key) due to its relative simplicity when handling multiple clients (compared to managing certificate per-device), and its relative simplicity of setup.

My VPN server platform is Linux, x86_64 (64 bit), Centos 6. Latest release, which is for the time being 6.3 and some updates.

I have used the following link as a baseline, and added some extra about IPTables, which was a little challenge, where I wanted good-enough security around this setup.

Initially, I wanted to use OpenSWAN, however, it does not allow easy integration with dynamic IP address, and its policy, while capable of being very precise, was not flexible enough to handle varying local IP address.

First – Add the following two repositories: Nikoforge, for Racoon (ipsec-tools), and EPEL for xl2tpd.

You can add them the following way:

rpm -ivH
yum -y install
yum -y install ipsec-tools xl2tpd

Following that, create a script called /etc/racoon/

# set security policies
echo -e "flush;n
        spdadd[0][1701] udp -P in  ipsec esp/transport//require;n
        spdadd[1701][0] udp -P out ipsec esp/transport//require;n"
        | setkey -c
# enable IP forwarding
echo 1 > /proc/sys/net/ipv4/ip_forward

Make sure this script allows execution, and add it to /etc/rc.local

Racoon config /etc/racoon/racoon.conf looks like this for my setup:

path include "/etc/racoon";
path pre_shared_key "/etc/racoon/psk.txt";
path certificate "/etc/racoon/certs";
path script "/etc/racoon/scripts";
#log debug;
remote anonymous
      exchange_mode    aggressive,main;
      #exchange_mode    main;
      passive          on;
      proposal_check   obey;
      support_proxy    on;
      nat_traversal    on;
      ike_frag         on;
      dpd_delay        20;
      #generate_policy unique;
      generate_policy on;
      verify_identifier on;
            encryption_algorithm  aes;
            hash_algorithm        sha1;
            authentication_method pre_shared_key;
            dh_group              modp1024;
            encryption_algorithm  3des;
            hash_algorithm        sha1;
            authentication_method pre_shared_key;
            dh_group              modp1024;
sainfo anonymous
      encryption_algorithm     aes,3des;
      authentication_algorithm hmac_sha1;
      compression_algorithm    deflate;
      pfs_group                modp1024;

The PSK is kept inside /etc/racoon/psk.txt. It looks like this for me (changed password, duh!):

myHome   ApAssPhR@se

Both said files (/etc/racoon/racoon.conf and /etc/racoon/psk.txt) should have only-root permissions, aka 600.

Notice the bold myHome identifier. As the local address might change (either ppp dialup, or DHCP client), this one will be used instead of the local address identifier, as the common identifier of the connection. For Android devices, it will be defined as the ‘IPSec Identifier’ value.

We need to setup xl2tpd: Edit /etc/xl2tpd/xl2tpd.conf and have it look like this:

[global] debug tunnel = no
debug state = no
debug network = no
ipsec saref = yes
force userspace = yes
[lns default] ip range =
local ip =
refuse pap = yes
require authentication = yes
name = l2tpd
ppp debug = no
pppoptfile = /etc/ppp/options.xl2tpd
length bit = yes

The IP range will define the client VPN interface address. The amount should match the expected number of clients, or be somewhat larger, to be on the safe side. Don’t try to be a smart ass with it. Use explicit IP addresses. Easier that way. The “local” IP address should be external to the pool defined, or else a client might collide with it. I haven’t tried checking if xl2tpd allowed such configuration. You are invited to test, although it’s rather pointless.

Create the file /etc/ppp/options.xl2tpd with the following contents:

asyncmap 0
name l2tpd
lcp-echo-interval 10
lcp-echo-failure 100

The ms-dns option should specify the desired DNS the client will use. In my case – I have an internal DNS server, so I wanted it to use it. You can either use your internal, if you have any, or Google’s, for example.

Almost done – you should add the relevant login info to /etc/ppp/chap-secrets. It should look like: “username” * “password” * . In my case, it would look like this:

# Secrets for authentication using CHAP
# client    server    secret            IP addresses
ez-aton        *    “SomePassw0rd”        *

Select a good password. Security should not be taken lightly.

We’re almost done – we need to define IP forwarding, which can be done by adding the following line to /etc/sysctl.conf:

# Controls IP packet forwarding
net.ipv4.ip_forward = 1

and then running ‘sysctl -p’ to load these values.

Run the following commands, and your system is ready to accept connections:

chkconfig racoon on
chkconfig xl2tpd on
service racoon start
service xl2tpd start

That said – we have not configured IPTables, in case this server acts as the firewall as well. It does, in my case, so I have had to take special care for the IPTables rules.

As my rules are rather complex, I will only show the rules relevant to the system, assuming (and this is important!) it is both the firewall/router and the VPN endpoint. If this is not the case, you should search for more details about forwarding IPSec traffic to backend VPN server.

So, my IPTables rules would be these three:

iptables -A INPUT -p udp --dport 500 -j ACCEPT
iptables -A INPUT -p udp --dport 4500 -j ACCEPT
iptables -A INPUT -p esp -j ACCEPT
iptables -A INPUT -p 51 -j ACCEPT # Not sure it's required, but too lazy to test without

This covers the IPSec part, however, we would not want the L2TP server to accept connections from the net, just like that, so it has its own rule, for port 1701:

iptables -A INPUT -p udp -m policy –dir in –pol ipsec -m udp –dport 1701 -j ACCEPT

You can save your current iptables rules (after checking that they work correctly) using ‘service iptables save’, or manually (backup your original rules to be on the safe side), and you’re all ready to go.

Good luck!