Archive for the ‘Disk Storage’ Category

NetApp SnapMirror monitor script

Sunday, December 13th, 2009

I have had some work done lately with NetApp SnapMirror. I have snapped-mirrored some volumes and qtrees and I wanted to monitor their use and behavior over the line.

As you can expect, site-to-site replication of data is a fragile thing, especially when done on the level of the storage device, which is agnostic to the data kept on it. When replicating volumes, I should expect the relevant employees to be responsible regarding what’s placed there, because the storage does not filter out the junk. If someone had decided to add a new DVD image on the DB storage space, well – the DB won’t care, as long as there is enough free space, but the storage will attempt to replicate the added data to the alternate site, which means that if you are around your bandwidth limits, which is never a good thing, you will just create a delay gap you would hardly (if at all) be able to close.

For that, and since I don’t tend to trust people not to do stupid things, I have written this script.

What does it do?

This script will perform the following:

Alerting about non-idle SnapMirror session

Use with ‘-m alert’

Assuming SnapMirror is scheduled to a specific time, the script will alert if a session is active. With the flag ‘-a no’, it will not send an e-mail (if possible, see the configuration section below). With ‘-r yes’, it will react, setting throttle for each non-idle session, but then ‘-t VALUE’ should be specified, where VALUE is the numeric throttle in KB/s.

Limiting throttle to a SnapMirror session

Use with ‘-m throttle_limit’

The script will set a throttle for SnapMirror session(s). Setting limit by the flag ‘-t VALUE’, where VALUE is the numeric throttle in KB/s per each session.

Cancelling throttle limit

Use with ‘-m throttle_unlimit’

The script will set unlimited throttle for SnapMirror session(s).

Checking SnapMirror lag

Use with the ‘-m check_lag’

Since replication has a purpose of recovering, the lag of each SnapMirror session would show how far back we are. Use with ‘-d VALUE’, VALUE being numeric time in minutes to set alert threshold. The default threshold delay is one day (1440 minutes).

Checking snapshots size

Use with the ‘-m check_size’

This reports the expected delta to transfer. This can help estimate the success or failure of a future sync of data (snapmirror update) before it begins. Use with ‘-l’ flag to set it to log date/time of measure and the expected sizes into a file. By default, in /tmp/target_name.txt, where the target is the SnapMirror target.

General Options

Use with ‘-c filename’ for alternate configuration file.

Use with ‘-h’ to get general help.

Use with a list target names in the format of storage:/vol/volname/qtree or storage:volname to ignore targets in configuration file and use your own.

Configuration File

The configuration file is rather simple. By default it should be called “/etc/snapmirror_monitor.conf“. It consists of two main variables for the system:

TGTS=”storage2:/vol/volname/qtree

storage3:volname2

storage1:/vol/volnew/qtr2″

EMAIL=”user@domain.com another_user@domain.com”

Prerequisites

This script will run on any modern Linux machine. For it to communicate with the NetApp devices, you will need SSH enabled on the NetApps, and ssh key exchange so that the Linux would be able to access the NetApp without using passwords.

The Script

Below is the script. You can download it and use it as you like.

#!/bin/bash
# This script will monitor snapmirror status
# Assumption: Access through ssh to root on all storage devices involved
# This will also attempt to detect the diff which is to sync
 
# Written by Ez-Aton. Check http://run.tournament.org.il for updates or
# additional information
 
# Modes: 
# alert -> alert if snapmirror is still active
# throttle_limit -> Limit throttle to a given number (default or manually set)
# throttle_unlimit -> Open throttle limitation
# check_lag -> Report the snapmirror lage
# check_size -> Report the estimated data size to move
 
# Global variables
CONF=/etc/snapmirror_monitor.conf
LOG_PREFIX=/tmp
 
test_connection () {
        # Test to see that you can access the storage device
        # Arguments: NetApp name
        SSH_OPTS="-o ConnectTimeout=2"
        if ! ssh $SSH_OPTS $1 hostname &>/dev/null
        then
                echo "Cannot communicate via SSH to $1"
                exit 1
        fi
}
 
abort () {
        # Exit with a predefined error message
        echo $*
        exit 1
}
 
get_arguments () {
        # Get all arguments and define options
        # Argument: $@
        [ -z "$1" ] && set -- -h
        while [ -n "$1" ]
        do
                case "$1" in
                        -m)     shift
                                case "$1" in
                                        alert|throttle_limit|throttle_unlimit|check_lag|check_size)     MODE=$1
                                        ;;
                                        *)      abort "Mode is mandatory. Use -h flag to get list of avialable flags"
                                        ;;
                                esac
                                ;;
                        -a)     shift
                                case "$1" in
                                        [nN][oO])       NOMAIL=1
                                                        ;;
                                        *)              NOMAIL=0
                                                        ;;
                                esac
                                ;;
                        -r)     shift
                                case "$1" in
                                        [yY][eE][sS])   REACT=1
                                                        ;;
                                        *)              REACT=0
                                                        ;;
                                esac
                                ;;
                        -d)     shift
                                declare -i DELAY_TMP
                                DELAY_TMP=$1
                                [ "$DELAY_TMP" != "$1" ] && abort "Delay needs to be a number in minutes"
                                DELAY=$DELAY_TMP
                                ;;
                        -t)     shift
                                declare -i THROTTLE_TMP
                                THROTTLE_TMP=$1
                                [ "$THROTTLE_TMP" != "$1" ] && abort "Throttle needs to be a number"
                                THROTTLE=$THROTTLE_TMP
                                ;;
                        -c)     shift
                                [ -f "$1" ] || abort "Cannot find specified conf file"
                                CONF="$1"
                                ;;
                        -l)     LOG=1
                                ;;
                        -h)     echo "Usage: $0 -m [alert|throttle_limit|throttle_unlimit|check_lag|check_size] (-c CONF_FILE) [tgt_filer:volume tgt_filer:/vol/vol/qtree]"
                                echo "Alert if SnapMirror is still running: $0 -m alert [-a no] (-r yes) [tgt_filer:volume tgt_filer:/vol/vol/qtree]"
                                echo "Alert and throttle (react): $0 -m alert [-a no] -r yes -t [throttle_in_kb] [tgt_filer:volume tgt_filer:/vol/vol/qtree]"
                                echo "Throttle a running SnapMirror: $0 -m throttle_limit -t throttle_in_kb [tgt_filer:volume tgt_filer:/vol/vol/qtree]"
                                echo "Unlimit SnapMirror throttle: $0 -m throttle_unlimit [tgt_filer:volume tgt_filer:/vol/vol/qtree]"
                                echo "To check lag: $0 -m check_lag -d delay_in_minutes (-a no) [tgt_filer:volume tgt_filer:/vol/vol/qtree]"
                                echo "To check delta: $0 -m check_size [tgt_filer:volume tgt_filer:/vol/vol/qtree]"
                                exit 0
                                ;;
                        *)      [ -z "$MODE" ] && abort "$0 mode required"
                                TGTS="$*"
                                ;;
                esac
                shift
        done
}
 
notify () {
        # Send an e-mail notification
        # Arguments: $@ - the subject
        # Contents are empty
        # And yes - one e-mail per event
        mail -s "$@" $EMAIL < /dev/null
}
 
idle () {
        # Check if transaction is idle
        # Arguments: Target name (example: storage:/vol/volname/qtree
 
        # Get the storage name out
        NETAPP=${1%%:*}
        test_connection $NETAPP #Verify this netapp is accessible
        ssh $NETAPP snapmirror status $1 | tail -1 | grep Idle$ &>/dev/null #Checks if the snapmirror is idle. If so, return true
        return $?
}
 
set_throttle () {
        # Sets throttle for target
        # Arguments: $1 Target name (example: storage:/vol/volname/qtree)
        # Arguments: $2 throttle value (number)
 
        # Get the storage name out
        NETAPP=${1%%:*}
        test_connection $NETAPP #Verify this netapp is accessible
        ssh $NETAPP snapmirror throttle $2 $1
}
 
get_lag () {
        # Gets the lag of snapmirror relationship in minutes
        # Arguments: Target name (example: storage:/vol/volname/qtree)
 
        # Get the storage name out
        NETAPP=${1%%:*}
        test_connection $NETAPP #Verify this netapp is accessible
        LAG=`ssh $NETAPP snapmirror status $1 | tail -1 | awk '{print $4}'`
        # LAG is in hh:mm:ss. We need to transfer it to minutes only
        H=`echo $LAG | cut -f 1 -d :`
        M=`echo $LAG | cut -f 2 -d :`
        let M=$M+$H*60
        echo $M
}
 
check_size () {
        # Checks the size of the snapshot to copy (diff)
        # Arguments: Target name (example: storage:/vol/volname/qtree)
 
        # Get the storage name out
        NETAPP=${1%%:*}
        test_connection $NETAPP #Verify this netapp is accessible
        # Get source storage name and path
        SRC=`ssh $NETAPP snapmirror status $1 | tail -1 | awk '{print $1}'`
        # Get the source filer and vol name from that
        NETAPP=${SRC%%:*}
        SPATH=${SRC##*:}
        SPATH=`echo $SPATH | sed s/'\/vol\/'//`
        SPATH=${SPATH%%/*}
 
        test_connection $NETAPP # Verify the target NetApp is accessible
        SNAP=`ssh $NETAPP snap list -n $SPATH | grep snapmirror | tail -1 | awk '{print $4}'`
        DELTA=`ssh $NETAPP snap delta $SPATH $SNAP | tail -2 | head -1 | awk '{print $5}'`
        echo "Snap delta for $1 is $DELTA KB"  
        LOG_TARGET=`echo $1 | tr / _`.txt
        [ -n "$LOG" ] && echo "`date` $DELTA" >> $LOG_PREFIX/$LOG_TARGET
}
 
 
### MAIN ###
get_arguments $@
. $CONF &>/dev/null
# if e-mail is not set, don't try to send
[ -z "$EMAIL" ] && NOMAIL=1
 
[ -z "$TGTS" ] && abort "You need at least one snapmirror target"
 
case $MODE in
        alert)  if [ "$REACT" == "1" ]
                then
                        [ -z "$THROTTLE" ] && abort "When setting 'react' flag, you must specify throttle"
                fi
                for i in $TGTS
                do
                        if ! idle $i
                        then
                                echo -n "$i is not idle. "
                                [ "$NOMAIL" != "1" ] && notify "$i is not idle"
                                if [ "$REACT" == "1" ]
                                then
                                        echo -n "We are set to react. Limiting throttle"
                                        set_throttle $i $THROTTLE
                                fi
                                echo
                        fi
                done
                ;;
        throttle_limit) [ -z "$THROTTLE" ] && abort "Throttle requires throttle value"
                        for i in $TGTS
                        do
                                echo "Setting throttle for $i to $THROTTLE"
                                set_throttle $i $THROTTLE
                        done
                        ;;
        throttle_unlimit)       for i in $TGTS
                                do
                                        echo "Setting throttle for $i to unlimited"
                                        set_throttle $i 0
                                done
                        ;;
        check_lag)      [ -z "$DELAY" ] && DELAY=1440
                        for i in $TGTS
                        do
                                LAG=`get_lag $i`
                                if [ "$LAG" -gt "$DELAY" ]
                                then
                                        echo "Failure: The delay for $i is $LAG minutes"
                                        [ "$NOMAIL" != "1" ] && notify "$i is lagged $LAG minutes, above the threshold $DELAY"
                                else
                                        echo "Normal: The delay for $i is $LAG minutes"
                                fi
                        done
                        ;;
        check_size)     for i in $TGTS
                        do
                                check_size $i
                        done
                        ;;
        *)      echo "Option $MODE is not implemented yet"
                exit 0
                ;;
esac

iSCSI persistent configurations agains us all

Thursday, November 19th, 2009

Using iSCSI with dm-multipath is rather common setup. With iSCSI running over Ethernet cables, which are too easy to disconnect (either on purpose or by mistake), being cheap and common technology – multipath becomes a must. If you have multiple network links, this is only expected that you use multipath for your iSCSI configuration. It’s cheap, it’s easy and it works.

This, however, comes with a price tag. Not money – the components are cheap and common, but there are configuration acts which should take place.

It is easy to find info either in the open-iscsi documentations, the Internet, whatever, and I will go over them just below, but there are some catches which one should be aware of.

Per the common documentation, unlike regular iSCSI communication, when dealing with multipath, you would like iSCSI to fail rather quickly and let the SCSI layer handle the errors, thus letting dm-multipath handle the errors and do its work.

The configuration directives are rather simple. In the iscsid.conf file (on RHEL5 located in /etc/iscsi/iscsid.conf ), you need to change the value

node.session.timeo.replacement_timeout =

To a very short period of time. By default, it is set to 120 seconds, which are two minutes before anyone will notify the SCSI subsystem of any disk IO errors. A good value would be 5 seconds, which should allow for very short network disconnection (which could happen) and still – let the dm-multipath manage errors fast enough so that applications would not fail on disk timeouts.

Another two parameters which should be defined are the following (read the comments above them in the config file):

node.conn[0].timeo.noop_out_interval =
node.conn[0].timeo.noop_out_timeout =

These values control the interval in which the iSCSI layer tests communication to the targets.

Also, in multipath.conf you will need to set the following feature, so that IOPs will not be lost:

features		"1 queue_if_no_path"

These configuration directives can be found in these two pages from RedHat:

iscsi-modifying-link-loss-behavior-dmmultipath

iscsi-replacements_timeout

This is nice and pretty. However, if you have failed to do so at start, and defined your iSCSI targets based on the default configurations, you will notice that it still takes very long for iSCSI to notify the SCSI subsystem of the errors. You could check the values used by iSCSI through running the command:

iscsiadm -m node -T <target name>

Check out especially the line called node.session.timeo.replacement_timeout. Its value is the one which decides the actual behavior of iSCSI.

To change it, there are several methods. One of them is to clean up the iSCSI persistent configurations, located in /var/lib/iscsi and then re-login to the iSCSI targets. Only then you will have the new target configuration.

Check again with iscsiadm as described above, and check that this value matches.

XenServer create snapshots for all machines

Friday, August 7th, 2009

XenServer is a wonderful tool. One of the better parts of it is its powerful scripting language, powered by the ‘xe’ command.

In order to capture a mass of snapshots, you can either do it manually from the GUI, or scripted. The script supplied below will include shell functions to capture Quiesce snapshots, and it that fails, normal snapshots of every running VM on the system.

Reason: NetApp SnapMirror, or other backup (maybe for later export) scheduled actions.

#!/bin/bash
# This script will supply functions for snapshotting and snapshot destroy including disks
# Written by Ez-Aton
# Visit my web blog for more stuff, at http://run.tournament.org.il
 
# Global variables:
UUID_LIST_FILE=/tmp/SNAP_UUIDS.txt
 
# Function
function assign_all_uuids () {
	# Construct artificial non-indexed list with name (removing annoying characters) and UUID
	LIST=""
	for UUID in `xe vm-list power-state=running is-control-domain=false | grep uuid | awk '{print $NF}'`
	do
		NAME=`xe vm-param-get param-name=name-label uuid=$UUID | tr ' ' _ | tr -d '(' | tr -d ')'`
		LIST="$LIST $NAME:$UUID"
	done
	echo $LIST
}
 
function take_snap_quiesce () {
	# We attempt to take a snapshot with quench
	# Arguments: $1 name ; $2 uuid
	# We attempt to snapshot the machine and set the value of snap_uuid to the snapshot uuid, if successful.
	# Return 1 if failed
 
	if SNAP_UUID=`xe vm-snapshot-with-quiesce vm=$2 new-name-label=${1}_snapshot`
	then
		# echo "Snapshot-with-quiesce for $1 successful"
		return 0
	else
		echo "Snapshot-with-quiesce for $1 failed"
		return 1
	fi
}
 
function take_snap () {
	# We attempt to take a snapshot
	# Arguments: $1 name ; $2 uuid
	# We attempt to snapshot the machine and set the value of snap_uuid to the snapshot uuid, if successful.
	# Return 1 if failed
 
	if SNAP_UUID=`xe vm-snapshot vm=$2 new-name-label=${1}_snapshot`
	then
		#echo "Snapshot for $1 successful"
		echo $SNAP_UUID
		return 0
	else
		echo "Snapshot-with-quiesce for $1 failed"
		return 1
	fi
}
 
function stop_ha_template () {
	# Templates inherit their settings from the origin
	# We need to turn off HA
	# $1 : Template UUID
	if [ -z "$1" ]
	then
		echo "Missing template UUID"
		return 1
	fi
	xe template-param-set ha-always-run=false uuid=$1
}
 
function get_vdi () {
	# This function will get a space delimited list of VDI UUIDs of a given snapshot/template UUID
	# Arguments: $1 template UUID
	# It will also verify that each VBD is an actual snapshot
	if [ -z "$1" ]
	then
		echo "No arguments? We need the template UUID"
		return 1
	fi
	VDIS=""
	for VBD in `xe vbd-list vm-uuid=$1 | grep ^uuid | awk '{print $NF}'`
	do
		echo "VBD: $VBD"
		if [ ! `xe vbd-param-get param-name=type uuid=$VBD` = "CD" ]
		then
			CUR_VDI=`xe vdi-list vbd-uuids=$VBD | grep ^uuid | awk '{print $NF}'`
			if `xe vdi-param-get uuid=$CUR_VDI param-name=is-a-snapshot`
			then
				VDIS="$VDIS $CUR_VDI"
			else
				echo "VDI is not a snapshot!"
				return 1
			fi
			CUR_VDI=""
		fi
	done
	echo $VDIS
}
 
function remove_vdi () {
	# This function will get a list of VDIs and remove them
	# Carefull!
	for VDI in $@
	do
		if xe vdi-destroy uuid=$VDI
		then
			echo "Success in removing VDI $VDI"
		else
			echo "Failure in removing VDI $VDI"
			return 1
		fi
	done
}
 
function remove_template () {
	# This funciton will remove a template
	# $1 template UUID
	if [ -z "$1" ]
	then
		echo "Required UUID"
		return 1
	fi
	xe template-param-set is-a-template=false uuid=$1
	if ! xe vm-uninstall force=true uuid=$1
	then
		echo "Failure to remove VM/Template"
		return 1
	fi
}
 
function remove_all_template () {
	# This function will completely remove a template
	# The steps are as follow:
	# $1 is the UUID of the template
	# Calculate its VDIs
	# Remove the template
	# Remove the VDIs
	if [ -z "$1" ]
	then
		echo "No Template UUID was supplied"
		return 1
	fi
	# We now collect the value of $VDIS
	get_vdi $1
	if [ "$?" -ne "0" ]
	then
		echo "Failed to get VDIs for Template $1"
		return 1
	fi
	if ! remove_template $1
	then
		echo "Failure to remove template $1"
		return 1
	fi
	if ! remove_vdi $VDIS
	then
		return 1
	fi
}
 
function create_all_snapshots () {
	# In this function we will run all over $LIST and create snapshots of each machine, keeping the UUID of it inside a file
	# $@ - list of machines in the $LIST format
	if [ -f $UUID_LIST_FILE ]
	then
		mv $UUID_LIST_FILE $UUID_LIST_FILE.$$
	fi
	for i in $@
	do
		SNAP_UUID=`take_snap_quiesce ${i%%:*} ${i##*:}`
		if [ "$?" -ne "0" ]
		then
			echo "Problem taking snapshot with quiesce for ${i%%:*}"
			echo "Attempting normal snapshot"
			SNAP_UUID=`take_snap ${i%%:*} ${i##*:}`
			if [ "$?" -ne "0" ]
                	then
                        	echo "Problem taking snapshot for ${i%%:*}"
				SNAP_UUID=""
			fi
		fi
		stop_ha_template $SNAP_UUID
		echo $SNAP_UUID >> $UUID_LIST_FILE
	done
}

Possible use will be like this:

. /usr/local/bin/xen_functions.sh

create_all_snapshots `assign_all_uuids` &> /tmp/snap_create.log

LVM Recovery

Friday, May 29th, 2009

A friend of mine made a grieve mistake – partition a disk containing Linux LVM directly on it, without any partition table. Oops.

When dealing with multi-Tera sized disks, one gets to encounter limitations not known on smaller scales – the 2TB limitation. Normal partition table can contain only around 2TB mapping, meaning that to create larger partitions, or even smaller partitions which exceed that specific limit, you have to take one of two actions:

  • Use GPT partition tables, which is meant for large disks, and partition the disk to the size limits you desire
  • Define LVM PV directly on the block device (the command would look like ‘pvcreate /dev/sdb -> see? No partitions)

“Surprisingly” and for no good reason, it appears that the disk which was used completely for the LVM PV suddenly had a single GPT partition on it. Hmmmm.

This is/was a single disk in a two-PV VG continging a single LV spanned all over the VG space. Following the “mysterious” actions, the VG refused to start, claiming that it could not find PV with PVID <some UID>.

This is a step where one should stop and call a professional if he doesn’t know for sure how to continue. These following actions are very risky to your data, and could result in you either recovering from tapes (if exist) or seeking a new job, if this is/was some mission-critical data.

First – go to /etc/lvm/archive and find the latest file named after the VG which has been destroyed. Look into it – you should see the PV is in there. Search the PV based on the UID reported not to repond on the logs.

Second – you need to remove the GPT partition from the disk. The PV will be recreated exactly as it was suppoed to be before. Replace /dev/some_disk with your own device file.

fdisk /dev/some_disk

d

w

Third – Reread the VG archive file, to be on the safe side. Verify again that the PV you are about to recreate is the one you are to. When done, run the following command

pvcreate -u <UID> /dev/some_disk

Again – the name of the device file has been changed in this example to prevent copy-paste incidents from happening.

Fourth – Run vgcfgrestore with the name of the VG as parameter. This command would restore your meta information into the PV and VG.

vgcfgrestore VG_TEST

Fifth – Activate the VG:

vgchange -ay VG_TEST

Now the volumes should be up, and you have the ability to attempt to mount these volumes.

Notice that the data might be corrupted in some way. Running fsck is recommended, although time-consuming.

Good luck!

IBM DS3400 expand Logical Drive (LUN)

Wednesday, May 6th, 2009

I have always liked IBM DS series management suite. I have claimed once that your first storage (and with it – your way of thinking about storage abstraction, I assume) is your favorite storage. I have been using the Storage Manager 9 for years now, even before it was 9 (I think that it was 7 then), and way before it has become 10-something version.

SM9 is the tool to manage the DS4xxx storage series (previously fastT series, if you can remember…), however, their “smaller” brothers, the DS3xxx series have arrived with something else claimed to be “Storage Manager” as well. It has began as SM2, and I was sure surprised to see it under the name “Storage Manager 10″ today.

It seems as if someone wanted to make life easy, so everything is hidden behind some messy-looking dashboard. Finding the so-well-known abstraction logic is almost impossible for me, no matter how many times I will use this disaster of a product. Can’t complain about the storage devices – they are rather good, and although IBM has recently had some firmware issues and stability issues with some of its storage products – I can’t say I hate this device. Storage Manager 2 (or 10, today), however, is something I prefer to avoid.

After all this ranting and crying, I can share this piece of technical information. Storage Manager 10 (2, whatever) does not allow for Logical Drive to be expanded. You can expand the Raid Array – no problem. Just add some more disks to it, however, the free space created in that specific array will not be available for your reallocation pleasure – you cannot expand any of your existing Logical Drives with it.

Google has pointed me to that specific post which enlightened me with a new understanding of the DS series – the SM10, while being nothing more than a limited GUI for the storage-handicap IT persons, does not completely cancel the storage device’s abilities. The device can expand LUNs, so SM10 can allow it, secretly.

You need to know what you’re doing. I will copy/paste the contents of that blog entry for future use:

Expanding LUNs on a DS3000 series disk system:

First make sure you have enough space in your array. Next go to the script editor in the Storage Manager gui. Choose script editor. Type in a command like this:

set logicalDrive ["FINKOPAS01-BOOT"] addCapacity=50GB;

This would add 50GB’s to the disk “FINKOPAS01-BOOT”.

Now choose run command from the menu. After the disk system has expanded your LUN you should see the extra capacity in the disk management console window.

*NOTE*: This information was originally found here: http://www.systemx.co.nz/xcelerator/?q=node/99

The target link is dead there, unfortunately, however, the “secret” information is available for all.

Do not forget the semi-comma at the end of the line. It won’t work otherwise. IBM… :-)

Ubuntu Hardy 8.04 USB performance issues

Monday, February 9th, 2009

As the owner of a nice laptop running Hardy, I had a huge performance degradation when accessing USB storage devices. Speed could reach 1MB/s at most, and usually, half that speed.

The trick that solved my problem was suggested in this post, and after I have tested it, I was happy with the results.

The trick is to backup the old ehci_hcd.ko module aside, and to replace it with the one from kernel 2.6.24-19-generic. Following that action, you need to make sure this module can force load, else version mismatch whill prevent it from loading.

To do so, I have added the following line to /etc/modprobe.d/options:

install ehci_hcd /sbin/modprobe –ignore-install –force ehci_hcd

This solves the force load issues.

Relocating LVs with snapshots

Monday, February 2nd, 2009

Linux LVM is a wonderful thing. It is scalable, flexible, and truly, almost enterprise-class in every details. It lacks, of course, at IO performance for LVM snapshots, but this can be worked-around in several creative ways (if I haven’t shown here before, I will sometime).

What it can’t do is dealing with a mixture of Stripes, Mirrors and Snapshots in a single logical volume. It cannot allow you to mirror a stripped LV (even if you can follow the requirementes), it cannot allow you to snapshot a mirrored or a stripped volume. You get the idea. A volume you can protect, you cannot snapshot. A volume with snapshots cannot be mirrored or altered.

For the normal user, what you get is usually enough. For storage management per-se, this is just not enough. When I wanted to reduce a VG – remove a disk from an existing volume group,  I had to evacuate it from any existing logical volume. The command to perform this actions is ‘pvmove‘ which is capable of relocating data from within a PV to other PVs. This is done through mirroring each logical volume and then removing the origin.

Mirroring, however, cannot be performed on LVs with snapshots, or on an already mirrored LV, so these require different handling.

We can detect which LVs reside on our physical volume by issuing the following command

pvdisplay -m /dev/sdf1

/dev/sdf1 was only an example. You will see the contents of this PV. So next, performing

pvmove /dev/sdf1

would attempt to relocate every existing LV from this specific PV to any other available PV. We can use this command to change the disk balance and allocations on multi-disk volume groups. This will be discussed on a later post.

Following a ‘pvmove‘ command, all linear volumes are relocated, if space permits, to another PVs. The remaining LVs are either mirrored or LVs with snapshots.

To relocate a mirrored LV, you need to un-mirror it first. To do so, first detect using ‘pvdisplay‘ which LV is belongs to (the name should be easy to follow) and then change it to non-mirrored.

lvconvert -m0 /dev/VolGroup00/test-mirror

This will convert it to be a linear volume instead of a mirror, so you could move it, if it still resides on the PV you are to remove.

Snapshot volumes are more complicated, due to their nature. Since all my snapshots are of a filesystem, I could allow myself to use tar to perform the action.

The steps are as follow:

  1. tar the contents of the snapshot source to nowhere, but save an incremental file
  2. Copy the source incremental file to a new name, and tar the contents of a snapshot according to this copy.
  3. Repeat the previous step for each snapshot.
  4. Remove all snapshots
  5. Relocate the snapshot source using ‘pvmove
  6. Build the snapshots and then recover the data into them

This is a script to do steps 1 to 3. It will not remove LVs, for obvious reasons. This script was not tested, but should work, of course :-)

None of the LVs should be mounted for it to function. It’s better to have harder requirements than to destroy data by double-mounting it, or accessing it while it is being changed.

#!/bin/bash
# Get: VG Base-LV, snapshot name, snapshot name, snapshot name...
# Example:
# ./backup VolGroup00 base snap1 snap2 snap3
# Written by Ez-Aton
 
TARGET=/tmp
if [ "$@" -le 3 ]
then
   echo "Parameters: $0 VG base snap snap snap snap"
   exit 1
fi
VG=$1
BASE=$2
shift 2
 
function check_not_mounted () {
   # Check if partition is mounted
   if mount | grep /dev/mapper/${VG}-${1}
   then
      return 0
   else
      return 1
   fi
}
 
function create_base_diff () {
   # This function will create the diff file for the base
   mount /dev/${VG}/${BASE} $MNT
   if [ $? -ne 0 ]
   then
      echo "Failed to mount base"
      exit 1
   fi
   cd $MNT
   tar -g $TARGET/${BASE}.tar.gz.diff -czf - . &gt; /dev/null
   cd -
   umount $MNT
}
 
function create_snap_diff () {
   mount /dev/${VG}/${1} $MNT
   if [ $? -ne 0 ]
   then
      echo "Failed to mount base"
      exit 1
   fi
   cp $TARGET/${BASE}.tar.gz.diff $TARGET/$1.tar.gz.diff
   cd $MNT
   tar -g $TARGET/${1}.tar.gz.diff -czf $TARGET/${1}.tar.gz .
   cd -
   umount $MNT
}
 
function create_mount () {
   # Creates a temporary mount point
   if [ ! -d /mnt/$$ ]
   then
      mkdir /mnt/$$
   fi
   MNT=/mnt/$$
}
 
create_mount
if check_not_mounted $BASE
then
   create_base_diff
else
   echo "$BASE is mounted. Exiting now"
   exit 1
fi
for i in $@
do
   if check_not_mounted $i
   then
      create_snap_diff $i
   else
      echo "$i is mounted! I will not touch it!"
   fi
done

The remaining steps should be rather easy – just mount the newly created snapshots and restore the tar file on them.

Persistent raw devices for Oracle RAC with iSCSI

Saturday, December 6th, 2008

If you’re into Oracle RAC over iSCSI, you should be rather content – this configuration is a simple and supported. However, working with some iSCSI target devices, order and naming is not consistent between both Oracle nodes.

The simple solutions are by using OCFS2 labels, or by using ASM, however, if you decide to place your voting disks and cluster registry disks on raw devices, you are to face a problem.

iSCSI on RHEL5:

There are few guides, but the simple method is this:

  1. Configure mapping in your iSCSI target device
  2. Start the iscsid and iscsi services on your Linux
    • service iscsi start
    • service iscsid start
    • chkconfig iscsi on
    • chkconfig iscsid on
  3. Run “iscsiadm -m discovery -t st -p target-IP
  4. Run “iscsiadm -m node -L all
  5. Edit /etc/iscsi/send_targets and add to it the IP address of the target for automatic login on restart

You need to configure partitioning according to the requirements.

If you are to setup OCFS2 volumes for the voting and for the cluster registry, there should not be a problem as long as you use labels, however, if you require raw volumes, you need to change udev to create your raw devices for you.

On a system with persistent disk naming, follow this process, however, on a system with changing disk names (every reboot names are different), the process can become a bit more complex.

First, detect your scsi_id for each device. While names might change upon reboots, scsi_ids do not.

scsi_id -g -u -s /block/sdc

Replace sda with the device name you are looking for. Notice that /block/sda is a reference to /sys/block/sdc

Use the scsi_id generated by that to create the raw devices. Edit /etc/udev/rules.d/50-udev.rules and find line 298. Add a line below with the following contents:

KERNEL==”sd*[0-9]“, ENV{ID_SERIAL}==”14f70656e66696c000000000004000000010f00000e000000″, SYMLINK+=”disk/by-id/$env{ID_BUS}-$env{ID_SERIAL}-part%n” ACTION==”add” RUN+=”/bin/raw /dev/raw/raw%n %N”

Things to notice:

  1. The ENV{ID_SERIAL} is the same scsi_id obtained earlier
  2. This line will create a raw device in the name of raw and number in /dev/raw for each partition
  3. If you want to differtiate between two (or more) disks, change the name from raw to an aduqate name, like “crsa”, “crsb”, etc, for example:

KERNEL==”sd*[0-9]“, ENV{ID_SERIAL}==”14f70656e66696c000000000005000000010f00000e000000″, SYMLINK+=”disk/by-id/$env{ID_BUS}-$env{ID_SERIAL}-part%n” ACTION==”add” RUN+=”/bin/raw /dev/raw/crs%n %N”

Following these changes, run “udevtrigger” to reload the rules. Be advised that “udevtrigger” might reset network connection.

MySQL permissions for LVM Snapshots

Thursday, October 23rd, 2008

aking LVM snapshots as a mean of backing up MySQL is rather simple, as can be described here. However, if you are into security, you would strive to grant minimal permissions for the action to the MySQL user. Per MySQL Documentation, the required privileges is “RELOAD”. That should be enough, granted on *.*, of course.

Raw devices for Oracle on RedHat (RHEL) 5

Tuesday, October 21st, 2008

There is a major confusion among DBAs regarding how to setup raw devices for Oracle RAC or Oracle Clusterware. This confusion is caused by the turn RedHat took in how to define raw devices.

Raw devices are actually a manifestation of character devices pointing to block devices. Character devices are non-buffered, so they act as FIFO, and have no OS cache, which is why Oracle likes them so much for Clusterware CRS and voting.

On other Unix types, commonly there are two invocations for each disk device – a block device (i.e /dev/dsk/c0d0t0s1) and a character device (i.e. /dev/rdsk/c0d0t0s1). This is not the case for Linux, and thus, a special “raw”, aka character, device is to be defined for each partition we want to participate in the cluster, either as CRS or voting disk.

On RHEL4, raw devices were setup easily using the simple and coherent file /etc/sysconfig/rawdevices, which included an internal example. On RHEL5 this is not the case, and customizing in a rather less documented method the udev subsystem is required.

Check out the source of this information, at this entry about raw devices. I will add it here, anyhow, with a slight explanation:

1. Add to /etc/udev/rules.d/60-raw.rules:

ACTION==”add”, KERNEL==”sdb1″, RUN+=”/bin/raw /dev/raw/raw1 %N”

2. To set permission (optional, but required for Oracle RAC!), create a new /etc/udev/rules.d/99-raw-perms.rules containing lines such as:

KERNEL==”raw[1-2]“, MODE=”0640″, GROUP=”oinstall”, OWNER=”oracle”

Notice this:

  1. The raw-perms.rules file name has to begin with the number 99, which defines its order during rules apply, so that it will be used after all other rules take place. Using lower numbers might cause permissions to be incorrect.
  2. The following permissions have to apply:
  • OCR Device(s): root:oinstall , mode 0640
  • Voting device(s): oracle:oinstall, mode 0666
  • You don’t have to use raw devices for ASM volumes on Linux, as the ASMLib library is very effective and easier to manage.