NetApp SnapMirror monitor script

I have had some work done lately with NetApp SnapMirror. I have snapped-mirrored some volumes and qtrees and I wanted to monitor their use and behavior over the line.

As you can expect, site-to-site replication of data is a fragile thing, especially when done on the level of the storage device, which is agnostic to the data kept on it. When replicating volumes, I should expect the relevant employees to be responsible regarding what’s placed there, because the storage does not filter out the junk. If someone had decided to add a new DVD image on the DB storage space, well – the DB won’t care, as long as there is enough free space, but the storage will attempt to replicate the added data to the alternate site, which means that if you are around your bandwidth limits, which is never a good thing, you will just create a delay gap you would hardly (if at all) be able to close.

For that, and since I don’t tend to trust people not to do stupid things, I have written this script.

What does it do?

This script will perform the following:

Alerting about non-idle SnapMirror session

Use with ‘-m alert’

Assuming SnapMirror is scheduled to a specific time, the script will alert if a session is active. With the flag ‘-a no’, it will not send an e-mail (if possible, see the configuration section below). With ‘-r yes’, it will react, setting throttle for each non-idle session, but then ‘-t VALUE’ should be specified, where VALUE is the numeric throttle in KB/s.

Limiting throttle to a SnapMirror session

Use with ‘-m throttle_limit’

The script will set a throttle for SnapMirror session(s). Setting limit by the flag ‘-t VALUE’, where VALUE is the numeric throttle in KB/s per each session.

Cancelling throttle limit

Use with ‘-m throttle_unlimit’

The script will set unlimited throttle for SnapMirror session(s).

Checking SnapMirror lag

Use with the ‘-m check_lag’

Since replication has a purpose of recovering, the lag of each SnapMirror session would show how far back we are. Use with ‘-d VALUE’, VALUE being numeric time in minutes to set alert threshold. The default threshold delay is one day (1440 minutes).

Checking snapshots size

Use with the ‘-m check_size’

This reports the expected delta to transfer. This can help estimate the success or failure of a future sync of data (snapmirror update) before it begins. Use with ‘-l’ flag to set it to log date/time of measure and the expected sizes into a file. By default, in /tmp/target_name.txt, where the target is the SnapMirror target.

General Options

Use with ‘-c filename’ for alternate configuration file.

Use with ‘-h’ to get general help.

Use with a list target names in the format of storage:/vol/volname/qtree or storage:volname to ignore targets in configuration file and use your own.

Configuration File

The configuration file is rather simple. By default it should be called “/etc/snapmirror_monitor.conf“. It consists of two main variables for the system:

TGTS=”storage2:/vol/volname/qtree

storage3:volname2

storage1:/vol/volnew/qtr2″

EMAIL=”user@domain.com another_user@domain.com”

Prerequisites

This script will run on any modern Linux machine. For it to communicate with the NetApp devices, you will need SSH enabled on the NetApps, and ssh key exchange so that the Linux would be able to access the NetApp without using passwords.

The Script

Below is the script. You can download it and use it as you like.

#!/bin/bash
# This script will monitor snapmirror status
# Assumption: Access through ssh to root on all storage devices involved
# This will also attempt to detect the diff which is to sync
 
# Written by Ez-Aton. Check http://run.tournament.org.il for updates or
# additional information
 
# Modes: 
# alert -> alert if snapmirror is still active
# throttle_limit -> Limit throttle to a given number (default or manually set)
# throttle_unlimit -> Open throttle limitation
# check_lag -> Report the snapmirror lage
# check_size -> Report the estimated data size to move
 
# Global variables
CONF=/etc/snapmirror_monitor.conf
LOG_PREFIX=/tmp
 
test_connection () {
        # Test to see that you can access the storage device
        # Arguments: NetApp name
        SSH_OPTS="-o ConnectTimeout=2"
        if ! ssh $SSH_OPTS $1 hostname &>/dev/null
        then
                echo "Cannot communicate via SSH to $1"
                exit 1
        fi
}
 
abort () {
        # Exit with a predefined error message
        echo $*
        exit 1
}
 
get_arguments () {
        # Get all arguments and define options
        # Argument: $@
        [ -z "$1" ] && set -- -h
        while [ -n "$1" ]
        do
                case "$1" in
                        -m)     shift
                                case "$1" in
                                        alert|throttle_limit|throttle_unlimit|check_lag|check_size)     MODE=$1
                                        ;;
                                        *)      abort "Mode is mandatory. Use -h flag to get list of avialable flags"
                                        ;;
                                esac
                                ;;
                        -a)     shift
                                case "$1" in
                                        [nN][oO])       NOMAIL=1
                                                        ;;
                                        *)              NOMAIL=0
                                                        ;;
                                esac
                                ;;
                        -r)     shift
                                case "$1" in
                                        [yY][eE][sS])   REACT=1
                                                        ;;
                                        *)              REACT=0
                                                        ;;
                                esac
                                ;;
                        -d)     shift
                                declare -i DELAY_TMP
                                DELAY_TMP=$1
                                [ "$DELAY_TMP" != "$1" ] && abort "Delay needs to be a number in minutes"
                                DELAY=$DELAY_TMP
                                ;;
                        -t)     shift
                                declare -i THROTTLE_TMP
                                THROTTLE_TMP=$1
                                [ "$THROTTLE_TMP" != "$1" ] && abort "Throttle needs to be a number"
                                THROTTLE=$THROTTLE_TMP
                                ;;
                        -c)     shift
                                [ -f "$1" ] || abort "Cannot find specified conf file"
                                CONF="$1"
                                ;;
                        -l)     LOG=1
                                ;;
                        -h)     echo "Usage: $0 -m [alert|throttle_limit|throttle_unlimit|check_lag|check_size] (-c CONF_FILE) [tgt_filer:volume tgt_filer:/vol/vol/qtree]"
                                echo "Alert if SnapMirror is still running: $0 -m alert [-a no] (-r yes) [tgt_filer:volume tgt_filer:/vol/vol/qtree]"
                                echo "Alert and throttle (react): $0 -m alert [-a no] -r yes -t [throttle_in_kb] [tgt_filer:volume tgt_filer:/vol/vol/qtree]"
                                echo "Throttle a running SnapMirror: $0 -m throttle_limit -t throttle_in_kb [tgt_filer:volume tgt_filer:/vol/vol/qtree]"
                                echo "Unlimit SnapMirror throttle: $0 -m throttle_unlimit [tgt_filer:volume tgt_filer:/vol/vol/qtree]"
                                echo "To check lag: $0 -m check_lag -d delay_in_minutes (-a no) [tgt_filer:volume tgt_filer:/vol/vol/qtree]"
                                echo "To check delta: $0 -m check_size [tgt_filer:volume tgt_filer:/vol/vol/qtree]"
                                exit 0
                                ;;
                        *)      [ -z "$MODE" ] && abort "$0 mode required"
                                TGTS="$*"
                                ;;
                esac
                shift
        done
}
 
notify () {
        # Send an e-mail notification
        # Arguments: $@ - the subject
        # Contents are empty
        # And yes - one e-mail per event
        mail -s "$@" $EMAIL < /dev/null
}
 
idle () {
        # Check if transaction is idle
        # Arguments: Target name (example: storage:/vol/volname/qtree
 
        # Get the storage name out
        NETAPP=${1%%:*}
        test_connection $NETAPP #Verify this netapp is accessible
        ssh $NETAPP snapmirror status $1 | tail -1 | grep Idle$ &>/dev/null #Checks if the snapmirror is idle. If so, return true
        return $?
}
 
set_throttle () {
        # Sets throttle for target
        # Arguments: $1 Target name (example: storage:/vol/volname/qtree)
        # Arguments: $2 throttle value (number)
 
        # Get the storage name out
        NETAPP=${1%%:*}
        test_connection $NETAPP #Verify this netapp is accessible
        ssh $NETAPP snapmirror throttle $2 $1
}
 
get_lag () {
        # Gets the lag of snapmirror relationship in minutes
        # Arguments: Target name (example: storage:/vol/volname/qtree)
 
        # Get the storage name out
        NETAPP=${1%%:*}
        test_connection $NETAPP #Verify this netapp is accessible
        LAG=`ssh $NETAPP snapmirror status $1 | tail -1 | awk '{print $4}'`
        # LAG is in hh:mm:ss. We need to transfer it to minutes only
        H=`echo $LAG | cut -f 1 -d :`
        M=`echo $LAG | cut -f 2 -d :`
        let M=$M+$H*60
        echo $M
}
 
check_size () {
        # Checks the size of the snapshot to copy (diff)
        # Arguments: Target name (example: storage:/vol/volname/qtree)
 
        # Get the storage name out
        NETAPP=${1%%:*}
        test_connection $NETAPP #Verify this netapp is accessible
        # Get source storage name and path
        SRC=`ssh $NETAPP snapmirror status $1 | tail -1 | awk '{print $1}'`
        # Get the source filer and vol name from that
        NETAPP=${SRC%%:*}
        SPATH=${SRC##*:}
        SPATH=`echo $SPATH | sed s/'\/vol\/'//`
        SPATH=${SPATH%%/*}
 
        test_connection $NETAPP # Verify the target NetApp is accessible
        SNAP=`ssh $NETAPP snap list -n $SPATH | grep snapmirror | tail -1 | awk '{print $4}'`
        DELTA=`ssh $NETAPP snap delta $SPATH $SNAP | tail -2 | head -1 | awk '{print $5}'`
        echo "Snap delta for $1 is $DELTA KB"  
        LOG_TARGET=`echo $1 | tr / _`.txt
        [ -n "$LOG" ] && echo "`date` $DELTA" >> $LOG_PREFIX/$LOG_TARGET
}
 
 
### MAIN ###
get_arguments $@
. $CONF &>/dev/null
# if e-mail is not set, don't try to send
[ -z "$EMAIL" ] && NOMAIL=1
 
[ -z "$TGTS" ] && abort "You need at least one snapmirror target"
 
case $MODE in
        alert)  if [ "$REACT" == "1" ]
                then
                        [ -z "$THROTTLE" ] && abort "When setting 'react' flag, you must specify throttle"
                fi
                for i in $TGTS
                do
                        if ! idle $i
                        then
                                echo -n "$i is not idle. "
                                [ "$NOMAIL" != "1" ] && notify "$i is not idle"
                                if [ "$REACT" == "1" ]
                                then
                                        echo -n "We are set to react. Limiting throttle"
                                        set_throttle $i $THROTTLE
                                fi
                                echo
                        fi
                done
                ;;
        throttle_limit) [ -z "$THROTTLE" ] && abort "Throttle requires throttle value"
                        for i in $TGTS
                        do
                                echo "Setting throttle for $i to $THROTTLE"
                                set_throttle $i $THROTTLE
                        done
                        ;;
        throttle_unlimit)       for i in $TGTS
                                do
                                        echo "Setting throttle for $i to unlimited"
                                        set_throttle $i 0
                                done
                        ;;
        check_lag)      [ -z "$DELAY" ] && DELAY=1440
                        for i in $TGTS
                        do
                                LAG=`get_lag $i`
                                if [ "$LAG" -gt "$DELAY" ]
                                then
                                        echo "Failure: The delay for $i is $LAG minutes"
                                        [ "$NOMAIL" != "1" ] && notify "$i is lagged $LAG minutes, above the threshold $DELAY"
                                else
                                        echo "Normal: The delay for $i is $LAG minutes"
                                fi
                        done
                        ;;
        check_size)     for i in $TGTS
                        do
                                check_size $i
                        done
                        ;;
        *)      echo "Option $MODE is not implemented yet"
                exit 0
                ;;
esac

Tags: , , , ,

4 Responses to “NetApp SnapMirror monitor script”

  1. Alex F. Says:

    I am trying to download the snapshot_mirror.sh script, with no success.
    Could you please update/provide the path to this script?

    thanks,
    alexf

  2. ez-aton Says:

    Weird. Probably a problem with the plugin.
    Use the little down-facing arrow to the right of the snapshot_mirror.sh header in the post, and you will see the entire script. Just copy/paste its content. I will try to see how to solve it.

    Thanks for letting me know about this problem!
    Ez

  3. Landes Arkady Says:

    #!/bin/bash
    ##############################
    # Written by Arkadi Landes
    # 07/2013
    ##############################

    # GLOBAL Variables

    EMAIL=”arkadlan@pelephone.co.il”
    MAX_DELAY_MINUTES=600
    EHTRIX_VOL_NUMBER=17
    SSH_SERVER=bzika
    DEST_NETAPP=nalab01
    EMAIL_TMP_FILE=/tmp/check_ethrix.tmp

    # FUNCTIONS

    get_lag_for_volume () {
    vol=$1

    #echo “Calculating lag for Volume $vol”
    snap_time=`grep $vol /tmp/ethrix.txt| awk ‘{print $2}’`

    H=`echo $snap_time | cut -f1 -d”:”`
    M=`echo $snap_time | cut -f2 -d”:”`

    let TOTAL_LAG_MINUTES=$H*60+$M

    echo -e “The lag for $vol is $TOTAL_LAG_MINUTES minutes\n”

    }
    ##############################
    # MAIN
    ##############################
    # Empty email tmp file
    :>$EMAIL_TMP_FILE
    :>/tmp/ethrix.txt

    echo -e “Welcome\n”
    echo “Checking snapmirror status in ${DEST_NETAPP} (thru ${SSH_SERVER})”
    ssh ${SSH_SERVER} ssh ${DEST_NETAPP} snapmirror status | grep -i ethrix | awk -F” ” ‘{print $1″\t”$4}’ > /tmp/ethrix.txt
    # Check if all the volumes are in the output
    if [ `wc -l /tmp/ethrix.txt | awk '{print$1}'` -ne $EHTRIX_VOL_NUMBER ]; then
    echo “Some volumes are missing …”
    exit 1
    fi

    for volume in `cat /tmp/ethrix.txt | awk ‘{print $1}’`
    do
    get_lag_for_volume $volume

    if [[ $TOTAL_LAG_MINUTES -gt $MAX_DELAY_MINUTES ]]
    then # LAG is LARGER than the MAX valu
    echo “ERROR in volume $volume”
    echo -e “Lag in volume $volume is $TOTAL_LAG_MINUTES. Its larger than the MAX value ($MAX_DELAY_MINUTES) \n” >> ${EMAIL_TMP_FILE}
    else # Lag is Smaller – Everything is OK
    echo “Lag for volume $volume is OK…”
    fi
    done

    # Send error report
    if [ `wc -l ${EMAIL_TMP_FILE} | awk '{print$1}'` -gt 0 ]; then
    cat ${EMAIL_TMP_FILE} | mutt -s “Ethrix Snapmirror Error” $EMAIL
    fi

  4. ez-aton Says:

    Thanks Arkady! I will try to implement it and use it. Thanks for sharing!
    Ez

Leave a Reply