NetApp SnapMirror monitor script
Sunday, December 13th, 2009
I have been doing some work lately with NetApp SnapMirror. I have mirrored some volumes and qtrees, and I wanted to monitor their use and behavior over the line.
As you can expect, site-to-site replication of data is a fragile thing, especially when done at the level of the storage device, which is agnostic to the data kept on it. When replicating volumes, I have to rely on the relevant employees to be responsible about what is placed there, because the storage does not filter out the junk. If someone decides to put a new DVD image on the DB storage space, the DB won’t care, as long as there is enough free space, but the storage will attempt to replicate the added data to the alternate site. If you are already close to your bandwidth limit, which is never a good thing, you will create a lag you would hardly (if at all) be able to close.
For that, and since I don’t tend to trust people not to do stupid things, I have written this script.
What does it do?
This script will perform the following:
Alerting about non-idle SnapMirror session
Use with ‘-m alert’
Assuming SnapMirror is scheduled to run at a specific time, the script will alert if a session is active when it runs. With the flag ‘-a no’, it will not send an e-mail (sending is only possible if an address is configured, see the configuration section below). With ‘-r yes’, it will react by setting a throttle for each non-idle session; in that case ‘-t VALUE’ must also be specified, where VALUE is the numeric throttle in KB/s.
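The idle test behind the alert mode can be sketched roughly as follows. This is only an illustration, not the script's own code: is_idle_line is a hypothetical helper, and the two sample lines are made up. It assumes the "snapmirror status" columns are source, destination, state, lag and status, which is consistent with the script reading the source from field 1 and the lag from field 4.

```shell
# Hypothetical helper: succeed if one "snapmirror status" line reports Idle.
# Assumed column layout: Source Destination State Lag Status
is_idle_line () {
    [ "$(echo "$1" | awk '{print $5}')" = "Idle" ]
}

# Made-up sample lines, for illustration only
idle_line="storage1:volname storage2:volname Snapmirrored 00:05:12 Idle"
busy_line="storage1:volname storage2:volname Snapmirrored 26:10:03 Transferring (120 MB done)"

for line in "$idle_line" "$busy_line"
do
    if is_idle_line "$line"
    then
        echo "idle"
    else
        echo "not idle"
    fi
done
```

Comparing the whole status field, rather than searching the line for the word, avoids a false match on a volume that happens to have "Idle" in its name.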
Limiting throttle to a SnapMirror session
Use with ‘-m throttle_limit’
The script will set a throttle for SnapMirror session(s). Set the limit with the flag ‘-t VALUE’, where VALUE is the numeric throttle in KB/s for each session.
Cancelling throttle limit
Use with ‘-m throttle_unlimit’
The script will set unlimited throttle for SnapMirror session(s).
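Both throttle modes come down to one command per target, run over SSH on the target storage. The sketch below only prints that command instead of running it, so it can be tried without a NetApp at hand. throttle_command is a hypothetical name; the "snapmirror throttle VALUE TARGET" form and the use of 0 for unlimited are taken from the script further down.

```shell
# Hypothetical dry-run helper: print the command that would set a throttle.
# $1 = target (storage:/vol/volname/qtree or storage:volname)
# $2 = throttle in KB/s, 0 meaning unlimited
throttle_command () {
    filer=${1%%:*}    # the storage name is everything before the first colon
    echo "ssh $filer snapmirror throttle $2 $1"
}

throttle_command storage2:/vol/volname/qtree 500
throttle_command storage3:volname2 0
```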
Checking SnapMirror lag
Use with ‘-m check_lag’
Since the purpose of replication is recovery, the lag of each SnapMirror session shows how far behind the source the replica is. Use with ‘-d VALUE’, VALUE being the alert threshold in minutes. The default threshold is one day (1440 minutes).
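The lag is reported as hh:mm:ss, while the threshold is in minutes, so a conversion is needed before comparing. Here is a minimal sketch of that step, assuming the hh:mm:ss format; lag_to_minutes and lag_exceeds are hypothetical names. It uses awk for the arithmetic because awk reads values such as 08 as decimal, whereas bash arithmetic would treat a leading zero as octal.

```shell
# Hypothetical helpers, not part of the script.
# Convert a lag of the form hh:mm:ss to whole minutes (seconds are dropped).
lag_to_minutes () {
    echo "$1" | awk -F: '{print $1 * 60 + $2}'
}

# Succeed if the lag ($1, hh:mm:ss) is above the threshold ($2, in minutes).
lag_exceeds () {
    [ "$(lag_to_minutes "$1")" -gt "$2" ]
}

lag_to_minutes 26:08:30
if lag_exceeds 26:08:30 1440
then
    echo "lag is above the one-day threshold"
fi
```

For 26:08:30 this prints 1568, which is above the default threshold of 1440.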
Checking snapshots size
Use with ‘-m check_size’
This reports the expected delta to transfer, which helps estimate whether a future sync of data (snapmirror update) will succeed before it begins. Use the ‘-l’ flag to log the date/time of the measurement and the expected size to a file. By default the file is /tmp/target_name.txt, where target_name is the SnapMirror target with every ‘/’ replaced by ‘_’.
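To see which file a given target will be logged to, the naming rule can be reproduced on its own. This is a small sketch; log_file_for is a hypothetical name, and its optional second argument stands in for the script's LOG_PREFIX variable.

```shell
# Hypothetical helper mirroring the script's naming rule: slashes in the
# target name become underscores and ".txt" is appended.
# $1 = target, $2 = log directory (defaults to /tmp)
log_file_for () {
    echo "${2:-/tmp}/$(echo "$1" | tr / _).txt"
}

log_file_for storage2:/vol/volname/qtree
log_file_for storage3:volname2
```

The first call prints /tmp/storage2:_vol_volname_qtree.txt and the second /tmp/storage3:volname2.txt.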
General Options
Use with ‘-c filename’ for alternate configuration file.
Use with ‘-h’ to get general help.
Use with a list of target names in the format storage:/vol/volname/qtree or storage:volname to ignore the targets in the configuration file and use your own.
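Both target formats can be taken apart with plain shell parameter expansion. The sketch below is illustrative only; split_target is a hypothetical name. It prints the storage name and the volume, dropping the qtree if there is one.

```shell
# Hypothetical helper: print "storage volume" for either target format.
split_target () {
    filer=${1%%:*}               # text before the first colon
    path=${1#*:}                 # text after the first colon
    path=${path#/vol/}           # drop a leading /vol/ if present
    echo "$filer ${path%%/*}"    # keep only the volume, dropping any qtree
}

split_target storage2:/vol/volname/qtree
split_target storage3:volname2
```

This prints "storage2 volname" and "storage3 volname2".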
Configuration File
The configuration file is rather simple. By default it should be called “/etc/snapmirror_monitor.conf”. It consists of two main variables for the system:
TGTS="storage2:/vol/volname/qtree
storage3:volname2
storage1:/vol/volnew/qtr2"
EMAIL="[email protected] [email protected]"
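Since the configuration file is plain shell, it is read by sourcing it, and the targets are then walked one by one. The sketch below tries this on a throwaway file; the file comes from mktemp and the e-mail address is a placeholder, neither is taken from a real setup.

```shell
# Write a throwaway config in the same format (placeholder values only).
conf=$(mktemp)
cat > "$conf" <<'EOF'
TGTS="storage2:/vol/volname/qtree
storage3:volname2"
EMAIL="admin@example.com"
EOF

# The config is plain shell, so it is read by sourcing it.
. "$conf"

# Word splitting on the unquoted variable yields one target per iteration.
count=0
for tgt in $TGTS
do
    echo "target: $tgt"
    count=$((count + 1))
done
echo "$count targets, notifications go to: $EMAIL"
rm -f "$conf"
```

Targets can be separated by newlines or spaces, as both are split the same way. Because the file is sourced, anything in it is executed as shell code, so it should be writable only by a trusted user.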
Prerequisites
This script will run on any modern Linux machine. For it to communicate with the NetApp devices, you will need SSH enabled on the NetApps, and an SSH key exchange so that the Linux machine can access the NetApps without passwords.
The Script
Below is the script. You can download it and use it as you like.
#!/bin/bash
# This script will monitor snapmirror status
# Assumption: Access through ssh to root on all storage devices involved
# This will also attempt to detect the diff which is to sync
# Written by Ez-Aton. Check http://run.tournament.org.il for updates or
# additional information

# Modes:
# alert -> alert if snapmirror is still active
# throttle_limit -> Limit throttle to a given number (default or manually set)
# throttle_unlimit -> Open throttle limitation
# check_lag -> Report the snapmirror lag
# check_size -> Report the estimated data size to move

# Global variables
CONF=/etc/snapmirror_monitor.conf
LOG_PREFIX=/tmp

test_connection () {
    # Test to see that you can access the storage device
    # Arguments: NetApp name
    SSH_OPTS="-o ConnectTimeout=2"
    if ! ssh $SSH_OPTS $1 hostname &>/dev/null
    then
        echo "Cannot communicate via SSH to $1"
        exit 1
    fi
}

abort () {
    # Exit with a predefined error message
    echo "$*"
    exit 1
}

get_arguments () {
    # Get all arguments and define options
    # Argument: $@
    [ -z "$1" ] && set -- -h
    while [ -n "$1" ]
    do
        case "$1" in
            -m) shift
                case "$1" in
                    alert|throttle_limit|throttle_unlimit|check_lag|check_size)
                        MODE=$1
                        ;;
                    *)  abort "Mode is mandatory. Use -h flag to get list of available flags"
                        ;;
                esac
                ;;
            -a) shift
                case "$1" in
                    [nN][oO]) NOMAIL=1 ;;
                    *)        NOMAIL=0 ;;
                esac
                ;;
            -r) shift
                case "$1" in
                    [yY][eE][sS]) REACT=1 ;;
                    *)            REACT=0 ;;
                esac
                ;;
            -d) shift
                declare -i DELAY_TMP
                DELAY_TMP=$1
                [ "$DELAY_TMP" != "$1" ] && abort "Delay needs to be a number in minutes"
                DELAY=$DELAY_TMP
                ;;
            -t) shift
                declare -i THROTTLE_TMP
                THROTTLE_TMP=$1
                [ "$THROTTLE_TMP" != "$1" ] && abort "Throttle needs to be a number"
                THROTTLE=$THROTTLE_TMP
                ;;
            -c) shift
                [ -f "$1" ] || abort "Cannot find specified conf file"
                CONF="$1"
                ;;
            -l) LOG=1
                ;;
            -h) echo "Usage: $0 -m [alert|throttle_limit|throttle_unlimit|check_lag|check_size] (-c CONF_FILE) [tgt_filer:volume tgt_filer:/vol/vol/qtree]"
                echo "Alert if SnapMirror is still running: $0 -m alert [-a no] (-r yes) [tgt_filer:volume tgt_filer:/vol/vol/qtree]"
                echo "Alert and throttle (react): $0 -m alert [-a no] -r yes -t [throttle_in_kb] [tgt_filer:volume tgt_filer:/vol/vol/qtree]"
                echo "Throttle a running SnapMirror: $0 -m throttle_limit -t throttle_in_kb [tgt_filer:volume tgt_filer:/vol/vol/qtree]"
                echo "Unlimit SnapMirror throttle: $0 -m throttle_unlimit [tgt_filer:volume tgt_filer:/vol/vol/qtree]"
                echo "To check lag: $0 -m check_lag -d delay_in_minutes (-a no) [tgt_filer:volume tgt_filer:/vol/vol/qtree]"
                echo "To check delta: $0 -m check_size [tgt_filer:volume tgt_filer:/vol/vol/qtree]"
                exit 0
                ;;
            *)  [ -z "$MODE" ] && abort "$0 mode required"
                # Everything left on the command line is a target. Stop
                # parsing here, otherwise every further pass of the loop
                # would overwrite the list with a shorter one
                TGTS="$*"
                break
                ;;
        esac
        shift
    done
}

notify () {
    # Send an e-mail notification
    # Arguments: $@ - the subject
    # Contents are empty
    # And yes - one e-mail per event
    mail -s "$*" $EMAIL < /dev/null
}

idle () {
    # Arguments: Target name (example: storage:/vol/volname/qtree)
    # NOTE: the beginning of this function was lost when the page was
    # published. The lines down to the grep are a reconstruction modelled
    # on get_lag below, assuming an idle session shows "Idle" as its status
    NETAPP=${1%%:*}
    test_connection $NETAPP #Verify this netapp is accessible
    ssh $NETAPP snapmirror status $1 | tail -1 | grep -w Idle &>/dev/null
    #Checks if the snapmirror is idle. If so, return true
    return $?
}

set_throttle () {
    # Sets throttle for target
    # Arguments: $1 Target name (example: storage:/vol/volname/qtree)
    # Arguments: $2 throttle value (number)
    # Get the storage name out
    NETAPP=${1%%:*}
    test_connection $NETAPP #Verify this netapp is accessible
    ssh $NETAPP snapmirror throttle $2 $1
}

get_lag () {
    # Gets the lag of snapmirror relationship in minutes
    # Arguments: Target name (example: storage:/vol/volname/qtree)
    # Get the storage name out
    NETAPP=${1%%:*}
    test_connection $NETAPP #Verify this netapp is accessible
    LAG=`ssh $NETAPP snapmirror status $1 | tail -1 | awk '{print $4}'`
    # LAG is in hh:mm:ss. We need to transfer it to minutes only
    H=`echo $LAG | cut -f 1 -d :`
    M=`echo $LAG | cut -f 2 -d :`
    # Force base 10, so that values such as 08 are not read as octal
    M=$(( 10#$M + 10#$H * 60 ))
    echo $M
}

check_size () {
    # Checks the size of the snapshot to copy (diff)
    # Arguments: Target name (example: storage:/vol/volname/qtree)
    # Get the storage name out
    NETAPP=${1%%:*}
    test_connection $NETAPP #Verify this netapp is accessible
    # Get source storage name and path
    SRC=`ssh $NETAPP snapmirror status $1 | tail -1 | awk '{print $1}'`
    # Get the source filer and vol name from that
    NETAPP=${SRC%%:*}
    SPATH=${SRC##*:}
    SPATH=`echo $SPATH | sed 's|/vol/||'`
    SPATH=${SPATH%%/*}
    test_connection $NETAPP # Verify the source NetApp is accessible
    SNAP=`ssh $NETAPP snap list -n $SPATH | grep snapmirror | tail -1 | awk '{print $4}'`
    DELTA=`ssh $NETAPP snap delta $SPATH $SNAP | tail -2 | head -1 | awk '{print $5}'`
    echo "Snap delta for $1 is $DELTA KB"
    LOG_TARGET=`echo $1 | tr / _`.txt
    [ -n "$LOG" ] && echo "`date` $DELTA" >> $LOG_PREFIX/$LOG_TARGET
}

### MAIN ###
get_arguments "$@"
# Targets given on the command line take precedence over the conf file,
# so keep them aside while the conf file is loaded
CLI_TGTS="$TGTS"
. $CONF &>/dev/null
[ -n "$CLI_TGTS" ] && TGTS="$CLI_TGTS"
# if e-mail is not set, don't try to send
[ -z "$EMAIL" ] && NOMAIL=1
[ -z "$TGTS" ] && abort "You need at least one snapmirror target"

case $MODE in
    alert)
        if [ "$REACT" == "1" ]
        then
            [ -z "$THROTTLE" ] && abort "When setting 'react' flag, you must specify throttle"
        fi
        for i in $TGTS
        do
            if ! idle $i
            then
                echo -n "$i is not idle. "
                [ "$NOMAIL" != "1" ] && notify "$i is not idle"
                if [ "$REACT" == "1" ]
                then
                    echo -n "We are set to react. Limiting throttle"
                    set_throttle $i $THROTTLE
                fi
                echo
            fi
        done
        ;;
    throttle_limit)
        [ -z "$THROTTLE" ] && abort "Throttle requires throttle value"
        for i in $TGTS
        do
            echo "Setting throttle for $i to $THROTTLE"
            set_throttle $i $THROTTLE
        done
        ;;
    throttle_unlimit)
        for i in $TGTS
        do
            echo "Setting throttle for $i to unlimited"
            set_throttle $i 0
        done
        ;;
    check_lag)
        [ -z "$DELAY" ] && DELAY=1440
        for i in $TGTS
        do
            LAG=`get_lag $i`
            if [ "$LAG" -gt "$DELAY" ]
            then
                echo "Failure: The delay for $i is $LAG minutes"
                [ "$NOMAIL" != "1" ] && notify "$i is lagged $LAG minutes, above the threshold $DELAY"
            else
                echo "Normal: The delay for $i is $LAG minutes"
            fi
        done
        ;;
    check_size)
        for i in $TGTS
        do
            check_size $i
        done
        ;;
    *)
        echo "Option $MODE is not implemented yet"
        exit 0
        ;;
esac