## Posts Tagged ‘cloning’

### ZFS clone script

Sunday, March 28th, 2021

ZFS has some magical features, comparable to NetApp’s WAFL capabilities. One of the less-used ones is ZFS send/receive, which can be utilised as an engine below something much like NetApp’s SnapMirror or SnapVault.

The idea, if you are not familiar with NetApp’s products, is to take a snapshot of a dataset on the source, and clone it to remote storage. Then, take another snapshot, and clone only the delta between the two snapshots, and so on. This allows for cloning block-level changes only, which reduces the clone payload and the time required to clone it.
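In plain ZFS terms, the engine looks roughly like this (a minimal sketch; the pool, filesystem and snapshot names are made up, and the script below automates this flow, over nc sockets rather than an SSH pipe):

```
# Initial full replication: snapshot the source and send it whole
zfs snapshot share/data@snap1
zfs send share/data@snap1 | ssh backupsrv zfs receive -F backup/data

# Subsequent runs: snapshot again and send only the delta between the two
zfs snapshot share/data@snap2
zfs send -i share/data@snap1 share/data@snap2 | ssh backupsrv zfs receive backup/data
```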

Copy and save this file as clone_zfs_snapshots.sh. Give it execution permissions.
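For example:

```
chmod +x clone_zfs_snapshots.sh
```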

```
#!/bin/bash
# This script will clone ZFS snapshots incrementally over SSH to a target server
# Snapshot name structure: ${SRC_FS}@${TGT_HASH}_INT ; where INT is an increment number
# Written by Etzion. Feel free to use. See more stuff in my blog at https://run.tournament.org.il
# Arguments:
# $1: ZFS filesystem name
# $2: (target ZFS system):(target ZFS filesystem)

IAM=$0
ZFS=/sbin/zfs
LOCKDIR=/dev/shm
LOCAL_SNAPS_TO_LEAVE=3
RESUME_LIMIT=3

### FUNCTIONS ###

# Sanity and usage
function usage() {
  echo "Usage: $IAM SRC REMOTE_SERVER:ZFS_TARGET (port=SSH_PORT)"
  echo "ZFS_TARGET is the parent of filesystems which will be created with the original source names"
  echo "Example: $IAM share/test backupsrv:backup"
  echo "It will create a filesystem 'test' under the pool 'backup' on 'backupsrv' with clone"
  echo "of the current share/test ZFS filesystem"
  echo "This script is (on purpose) not a recursive script"
  echo "For the script to work correctly, it *must* have SSH key exchanged from source to target"
  exit 0
}

function abort() {
  # exit erroneously with a message
  echo "$@"
  pkill -P $$
  remove_lock
  exit 1
}

function parse_parameters() {
  # Parses command line parameters
  # called with $*
  SRC_FS=$1
  shift
  TGT=$1
  shift
  for i in $*
  do
    case ${i} in
      port=*) PORT=${i##*=}
        ;;
      hash=*) HASH=${i##*=}
        ;;
    esac
  done
  TGT_SYS=${TGT%%:*}
  TGT_FS=${TGT##*:}
  # Use a short substring of MD5sum of the target name for later unique identification
  SRC_DIRNAME_FS=${SRC_FS#*/}
  if [ -z "$HASH" ]
  then
    TGT_FULLHASH="$(echo ${TGT_FS}/${SRC_DIRNAME_FS} | md5sum -)"
    TGT_HASH=${TGT_FULLHASH:1:7}
  else
    TGT_HASH=${HASH}
  fi
}

function sanity() {
  # Verify we have all details
  [ -z "$SRC_FS" ] && usage
  [ -z "$TGT_FS" ] && usage
  [ -z "$TGT_SYS" ] && usage
  $ZFS list -H -o name $SRC_FS > /dev/null 2>&1 || abort "Source filesystem $SRC_FS does not exist"
  # check_target_fs || abort "Target ZFS filesystem $TGT_FS on $TGT_SYS does not exist, or not imported"
}

function remove_lock() {
  # Removes the lock file
  \rm -f ${LOCKDIR}/$SRC_LOCK
}

function construct_ssh_cmd() {
  # Construct the remote SSH command
  # Here is a good place to put atomic parameters used for the SSH
  [ -z "${PORT}" ] && PORT=22
  SSH="ssh -p $PORT $TGT_SYS -o ConnectTimeout=3"
  CONTROL_SSH="$SSH -f"
}

function get_last_remote_snapshots() {
  # Gets the last snapshot name on a remote system, to match it to our snapshots
  remoteSnapTmpObj=$($SSH "$ZFS list -H -t snapshot -r -o name ${TGT_FS}/${SRC_DIRNAME_FS}" | grep ${SRC_DIRNAME_FS}@ | grep ${TGT_HASH})
  # Create a list of all snapshot indexes. Empty means it is the first one
  remoteSnaps=""
  for snapIter in ${remoteSnapTmpObj}
  do
    remoteSnaps="$remoteSnaps ${snapIter##*@${TGT_HASH}_}"
  done
}

function check_if_remote_snapshot_exists() {
  # Checks if the snapshot with the current newLocalIndex exists on the remote node
  $SSH "$ZFS list -H -t snapshot -r -o name ${TGT_FS}/${SRC_DIRNAME_FS}@${TGT_HASH}_${newLocalIndex}"
  return $?
}

function get_last_local_snapshots() {
  # This function will return an array of local existing snapshots using the existing TGT_HASH
  localSnapTmpObj=$($ZFS list -H -t snapshot -r -o name $SRC_FS | grep ${SRC_FS}@ | grep $TGT_HASH)
  # Convert into a list and remove the HASH and everything before it. We should have a clear list of indexes
  localSnapList=""
  for snapIter in ${localSnapTmpObj}
  do
    localSnapList="$localSnapList ${snapIter##*@${TGT_HASH}_}"
  done
  # Convert object to array
  localSnapList=( $localSnapList )
  # Get the last object
  let localSnapArrayObj=${#localSnapList[@]}-1
}

function delete_snapshot() {
  # This function will delete a snapshot
  # arguments: $1 -> snapshot name
  [ -z "$1" ] && abort "Cleanup snapshot got no arguments"
  $ZFS destroy $1
  #$ZFS destroy ${SRC_FS}@${TGT_HASH}_${newLocalIndex}
}

function find_matching_snapshot() {
  # This function will attempt to find a matching snapshot as a replication baseline
  # Gets the latest local snapshot index
  localRecentIndex=${localSnapList[$localSnapArrayObj]}
  # Gets the latest mutual snapshot index
  while [ $localSnapArrayObj -ge 0 ]
  do
    # Check if the current counter already exists
    if echo "$remoteSnaps" | grep -w ${localSnapList[$localSnapArrayObj]} > /dev/null 2>&1
    then
      # We know the mutual index.
      commonIndex=${localSnapList[$localSnapArrayObj]}
      return 0
    fi
    let localSnapArrayObj--
  done
  # If we've reached here - there is no mutual index!
  abort "There is no mutual snapshot index, you will have to resync"
}

function cleanup_snapshots() {
  # Creates a list of snapshots to delete and then calls delete_snapshot function
  # We are using the most recent common index, localSnapArrayObj, as the latest reference for deletion
  let deleteArrayObj=$localSnapArrayObj-${LOCAL_SNAPS_TO_LEAVE}
  snapsToDelete=""
  # Construct a list of snapshots to delete, and delete it in reverse order
  while [ $deleteArrayObj -ge 0 ]
  do
    # Construct snapshot name
    snapsToDelete="$snapsToDelete ${SRC_FS}@${TGT_HASH}_${localSnapList[$deleteArrayObj]}"
    let deleteArrayObj--
  done
  snapsToDelete=( $snapsToDelete )
  snapDelete=0
  while [ $snapDelete -lt ${#snapsToDelete[@]} ]
  do
    # Delete snapshot
    delete_snapshot ${snapsToDelete[$snapDelete]}
    let snapDelete++
  done
}

function initialize() {
  # This is a unique case where we initialize the first sync
  # We will call this procedure when remoteSnaps is empty (meaning that there was no snapshot whatsoever)
  # We have to verify that the target has no existing old snapshots here
  # is it empty?
  echo "Going to perform an initialization replication. It might wipe the target $TGT_FS completely"
  echo "Press Enter to proceed, or Ctrl+C to abort"
  read abc
  ### Decided to remove this check ###
  # [ -n "$LOCSNAP_LIST" ] && abort "No target snapshots while local history snapshots exist. Clean up history and try again"
  RECEIVE_FLAGS="-sFdvu"
  newLocalIndex=1
  # NEW_LOC_INDEX=1
  create_local_snapshot $newLocalIndex
  open_remote_socket
  sleep 1
  $ZFS send -ce ${SRC_FS}@${TGT_HASH}_${newLocalIndex} | nc $TGT_SYS $NC_PORT 2>&1
  if [ "$?" -ne "0" ]
  then
    # Do not cleanup current snapshot
    # delete_snapshot ${SRC_FS}@${TGT_HASH}_${newLocalIndex}
    abort "Failed to send initial snapshot to target system"
  fi
  sleep 1
  # Set target to RO
  $SSH $ZFS set readonly=on $TGT_FS
  [ "$?" -ne "0" ] && abort "Failed to set remote filesystem $TGT_FS to read-only"
  # No need to remove local snapshot
}

function create_local_snapshot() {
  # Creates snapshot on local storage
  # uses argument $1
  [ -z "$1" ] && abort "Failed to get new snapshot index"
  $ZFS snapshot ${SRC_FS}@${TGT_HASH}_${1}
  [ "$?" -ne "0" ] && abort "Failed to create local snapshot. Check error message"
}

function open_remote_socket() {
  # Starts remote socket via SSH (as the control operation)
  # port is 3000 + three-digit random number
  let NC_PORT=3000+$RANDOM%1000
  $CONTROL_SSH "nc -l -i 90 $NC_PORT | $ZFS receive ${RECEIVE_FLAGS} $TGT_FS > /tmp/output 2>&1 ; sync"
  #$CONTROL_SSH "socat tcp4-listen:${NC_PORT} - | $ZFS receive ${RECEIVE_FLAGS} $TGT_FS > /tmp/output 2>&1 ; sync"
  #zfs send -R pool@snapshot | zfs receive -Fdvu zpnew
}

function send_zfs() {
  # Do the heavy lifting of opening remote socket and starting ZFS send/receive
  open_remote_socket
  sleep 1
  $ZFS send -ce -I ${SRC_FS}@${TGT_HASH}_${commonIndex} ${SRC_FS}@${TGT_HASH}_${newLocalIndex} | nc -i 90 $TGT_SYS $NC_PORT
  #$ZFS send -ce -I ${SRC_FS}@${TGT_HASH}_${commonIndex} ${SRC_FS}@${TGT_HASH}_${newLocalIndex} | socat tcp4-connect:${TGT_SYS}:${NC_PORT} -
  sleep 20
}

function increment() {
  # Create a new snapshot with the index localRecentIndex+1, and replicate it to the remote system
  # Baseline is the most recent common snapshot index commonIndex
  RECEIVE_FLAGS="-Fsdvu" # With an 'F' flag maybe?
  # Handle the case of the latest snapshot in DR being newer than the current latest snapshot, due to mistaken deletion
  remoteSnaps=( $remoteSnaps )
  let remoteIndex=${#remoteSnaps[@]} # Get last snapshot on DR
  if [ ${localRecentIndex} -lt ${remoteIndex} ]
  then
    let newLocalIndex=${remoteIndex}+1
  else
    let newLocalIndex=$localRecentIndex+1
  fi
  create_local_snapshot $newLocalIndex
  send_zfs
  # if [ "$?" -ne "0" ]
  # then
  #   # Cleanup current snapshot
  #   delete_snapshot ${SRC_FS}@${TGT_HASH}_${newLocalIndex}
  #   abort "Failed to send incremental snapshot to target system"
  # fi
  if ! verify_correctness
  then
    if ! loop_resume # If we can
    then
      # We either could not resume operation or failed to run with the required amount of iterations
      # For now we abort.
      echo "Deleting local snapshot"
      delete_snapshot ${SRC_FS}@${TGT_HASH}_${newLocalIndex}
      abort "Remote snapshot should have the index of the latest snapshot, but it does not. The current remote snapshot index is ${commonIndex}"
    fi
  fi
}

function loop_resume() {
  # Attempts to loop over resuming until the attempt limit has been reached
  REMOTE_TOKEN=$($SSH "$ZFS get -Ho value receive_resume_token ${TGT_FS}/${SRC_DIRNAME_FS}")
  if [ "$REMOTE_TOKEN" == "-" ]
  then
    return 1
  fi
  # We have a valid resume token. We will retry
  COUNT=1
  while [ "$COUNT" -le "$RESUME_LIMIT" ]
  do
    # For ease of handling - for each iteration, we will request the token again
    echo "Attempting resume operation"
    REMOTE_TOKEN=$($SSH "$ZFS get -Ho value receive_resume_token ${TGT_FS}/${SRC_DIRNAME_FS}")
    let COUNT++
    open_remote_socket
    $ZFS send -e -t $REMOTE_TOKEN | nc -i 90 $TGT_SYS $NC_PORT
    #$ZFS send -e -t $REMOTE_TOKEN | socat tcp4-connect:${TGT_SYS}:${NC_PORT} -
    sleep 20
    if verify_correctness
    then
      echo "Done"
      return 0
    fi
  done
  # If we've reached here, we have failed to run the required iterations. Lets just verify again
  return 1
}

function verify_correctness() {
  # Check remote index, and verify it is correct with the current, latest snapshot
  if check_if_remote_snapshot_exists
  then
    echo "Replication Successful"
    return 0
  else
    echo "Replication failed"
    return 1
  fi
}

### MAIN ###
[ "$(whoami)" != "root" ] && abort "This script has to be called by the root user"
[ -z "$1" ] && usage
parse_parameters $*
SRC_LOCK=$(echo $SRC_FS | tr / _)
if [ -f ${LOCKDIR}/$SRC_LOCK ]
then
  echo "Already locked. If it should not be the case - remove ${LOCKDIR}/$SRC_LOCK"
  exit 1
fi
sanity
touch ${LOCKDIR}/$SRC_LOCK
construct_ssh_cmd
get_last_remote_snapshots # Have a string list of remoteSnaps
# If we don't have a remote snapshot, it should be initialization
if [ -z "$remoteSnaps" ]
then
  initialize
  echo "completed initialization. Done"
  remove_lock
  exit 0
fi
# We can get here only if it is not initialization
get_last_local_snapshots # Have a list (array) of localSnaps
find_matching_snapshot # Get the latest local index and the latest common index available
increment # Creates a new snapshot and sends/receives it
cleanup_snapshots # Cleans up old local snapshots
pkill -P $$
remove_lock
echo "Done"
```


The initial run should be performed manually. If you expect a very long initial sync, run it inside tmux or screen, to avoid failing in the middle.

To run the script, call it like this:

```
./clone_zfs_snapshots.sh share/my-data backuphost:share
```


This will create, under the pool ‘share’ on the host ‘backuphost’, a filesystem matching the source (in this case: share/my-data) and set it to read-only. The script will create a snapshot with a unique name based on a shortened hash of the destination, with an incrementing number suffix, and start cloning the snapshot to the remote host. When called again, it will create a snapshot with the same name but a different index, and clone only the delta to the remote host. In case of a disconnection, the clone will retry a few times before failing.
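Since every additional run clones only the delta, it lends itself to scheduling. A possible cron entry (the path and schedule here are hypothetical, of course):

```
# /etc/cron.d/zfs-clone: incremental clone every night at 02:00
0 2 * * * root /root/clone_zfs_snapshots.sh share/my-data backuphost:share
```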

Note that the receiving side does not remove snapshots, so handling (too) old snapshots on the backup host remains up to you.

### Ontap Simulator, and some insights about NetApp

Tuesday, May 9th, 2006

First and foremost – the Ontap simulator, a great tool which surely can assist in learning NetApp interface and utilization, lacks in performance. It has some built-in limitations – no FCP, no (virtual) disks larger than 1GB (per my trial-and-error; I might find out I was wrong somehow, and put it on this website), and low performance. I’ve got about 300KB/s transfer rate, both on iSCSI and on NFS. To make sure it was not due to some network hog hiding somewhere on my net(s), I’ve even tried it from the host of the simulator itself, but to no avail. Low performance. Don’t try to use it as your own home iSCSI target. Better just use Linux for this purpose, with the drivers obtained from here (it’s one of my next steps into “shared storage(s) for all”).

Another issue – After much reading through NetApp documentation, I’ve reached the following concepts of the product. Please correct me if you see fit:

The older method was to create a volume (vol create) directly from disks, using either raid_dp or raid4.

The current method is to create aggregations (aggr create) from disks. Each aggregate consists of raid groups. A raid group (rg) can be made up of up to eight physical disks. Each group of disks (an rg) has one or two parity disks, depending on the type of raid (raid4 uses one parity disk, and raid_dp uses “double parity”, as its name suggests).
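For example, if my memory of the syntax serves (the aggregate name, raid group size and disk count are arbitrary):

```
aggr create aggr1 -t raid_dp -r 8 16   # 16 disks, raid_dp, raid groups of up to 8 disks
vol create flexvol0 aggr1 20g          # a 20GB flex volume carved out of the aggregate
```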

Actually, I can assume that each aggregation is formatted using the WAFL filesystem, which leads to the conclusion that modern (flex) volumes are logical “chunks” of this whole WAFL layout. In the past, each volume was a separate WAFL-formatted unit, and each size change required adding disks.

This separation of the flex volume from the aggregation suggests to me the possibility of a multiple-root capable WAFL. It can explain why a flex volume does not require continuous space on the aggregation. This eases space management, and allows for fast and easy “cloning” of volumes.

I believe that the new “clone” method is based on WAFL’s built-in snapshot capabilities. Although WAFL snapshots are supposed to be space-conservative, they require guaranteed space on the aggregation prior to committing the clone itself. If the aggregation is too crowded, the clone will fail with the error message “not enough space”. If there is enough for snapshots, but not enough to guarantee a full clone, you’ll get a message saying “space not guaranteed”.
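The clone operation itself, if I recall the syntax correctly (the volume and snapshot names are made up):

```
snap create flexvol0 base_snap                  # snapshot to serve as the clone's backing
vol clone create clone0 -b flexvol0 base_snap   # a writable clone backed by that snapshot
```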

I see the flex volumes as some combination between filesystem (WAFL) and LVM, living together on the same level.

LUNs on NetApp: iSCSI and/or Fibre LUNs are actually managed as a single (per-LUN) large file contained within a volume. This file has special permissions (I was not able to copy or modify it while it was online, even with root permissions. However, I am rather new to NetApp technology), and it is exported outwards as a disk. Much like an ISO image (which is a large file containing a whole filesystem layout), these files contain a whole disk layout, including partition tables, LVM headers, etc – just like a real disk.
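Creating one demonstrates the point: the LUN is addressed as a file path inside a volume (again from memory; the size, OS type and path are arbitrary):

```
lun create -s 10g -t linux /vol/flexvol0/lun0   # a 10GB LUN, backed by a file in flexvol0
```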

Thinking about it, it’s neither impossible nor very surprising. A disk is no more than a container of data, of blocks, and if you can utilize the required communication protocol used for accessing it and managing its blocks (i.e., the transport layer on which a filesystem can access the block data), you can, with just a little translation interface, set up a virtual disk which will behave just like any regular disk.

This brings us to the advantages of NetApp’s WAFL – the ability to minimize I/O while maintaining a set of snapshots for the system – a per-block modification history. It means you can “snapshot” your LUN, which is physically no more than a file on a WAFL-based volume, and you can go back with your data to a previous date – an hour, a day, a week. Time travel for your data.
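In command terms (again, as far as I remember the syntax; the names are hypothetical):

```
snap create flexvol0 nightly.0                         # snapshot the volume holding the LUN
snap restore -t file -s nightly.0 /vol/flexvol0/lun0   # later: revert just the LUN file to it
```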

There are, unfortunately, some major side effects. If you’ve read the WAFL description from NetApp, my summary will be inaccurate at best. If you haven’t, it will be enough, but you are still most encouraged to read it. The idea is that this filesystem is made out of multiple layers of pointers, and of blocks. A pointer can point to more than one block. When you commit a snapshot, you do not move data and you do not copy blocks; you merely preserve the current set of pointers. When there is any change in the data (meaning a block is changed), the active pointer is switched to the alternate block instead of the previous (historical) block, but a reference to the older block’s location is kept. This way, only modified blocks are actually recreated, while any unmodified data remains on the same spot on the physical disk. An additional claim of NetApp is that their WAFL is optimized for the raid4 and raid_dp they use, and utilizes them in a smart manner.

The problem with WAFL, as can be easily seen, is fragmentation. For CIFS and NFS, it does not cause much of a problem, as the system is very capable of read-ahead, which largely solves this issue. However, a LUN (which is supposed to act as a continuous layout, just like any hard-drive or raid-array in the world, and on which various filesystem-related operations occur) gets fragmented.

Unlike CIFS or NFS, LUN read-ahead is harder to predict, as the client tries to do just the same. Unlike real disks, NetApp LUNs do not behave, performance-wise, like the hard-drive layout any DB or FS has learned to expect and was best optimized for. It means, in my example, that on a DB with lots of small changes, the DB itself would try to commit changes in large write operations, committed at regular intervals, striving to keep them as close to each other, as continuous, as possible. On a NetApp LUN this will cause fragmentation, and will result in lower write (and later read) performance.

That’s all for today.