Posts Tagged ‘snapshot technology’

ZFS clone script

Sunday, March 28th, 2021

ZFS has some magical features, comparable to NetApp’s WAFL capabilities. One of the less-used on is the ZFS send/receive, which can be utilised as an engine below something much like NetApp’s SnapMirror or SnapVault.

The idea, if you are not familiar with NetApp’s products, is to take a snapshot of a dataset on the source, and clone it to a remote storage. Then, take another snapshot, and clone only the delta between both snapshots, and so on. This allows for cloning block-level changes only, which reduces clone payload and the time required to clone it.

Copy and save this file as Give it execution permissions.

# This script will clone ZFS snapshots incrementally over SSH to a target server
# Snapshot name structure: [email protected]${TGT_HASH}_INT ; where INT is an increment number
# Written by Etzion. Feel free to use. See more stuff in my blog at
# Arguments:
# $1: ZFS filesystem name
# $2: (target ZFS system):(target ZFS filesystem)



# Sanity and usage
function usage() {
	echo "ZFS_TARGET is the parent of filesystems which will be created with the original source names"
	echo "Example: $IAM share/test backupsrv:backup"
	echo "It will create a filesystem 'test' under the pool 'backup' on 'backupsrv' with clone"
	echo "of the current share/test ZFS filesystem"
	echo "This script is (on purpose) not a recursive script"
	echo "For the script to work correctly, it *must* have SSH key exchanged from source to target"
	exit 0

function abort() {
	# exit errorously with a message
	echo "[email protected]"
	pkill -P $$
	exit 1

function parse_parameters() {
	# Parses command line parameters
	# called with $*
	for i in $*
		case ${i} in
			port=*)	PORT=${i##*=}
			hash=*)	HASH=${i##*=}
	# Use a short substring of MD5sum of the target name for later unique identification
	if [ -z "$hash" ]
		TGT_FULLHASH="`echo $TGT_FS/${SRC_DIRNAME_FS} | md5sum -`"


function sanity() {
	# Verify we have all details
	[ -z "$SRC_FS" ] && usage
	[ -z "$TGT_FS" ] && usage
	[ -z "$TGT_SYS" ] && usage
	$ZFS list -H -o name $SRC_FS > /dev/null 2>&1 || abort "Source filesystem $SRC_FS does not exist"
	# check_target_fs || abort "Target ZFS filesystem $TGT_FS on $TGT_SYS does not exist, or not imported"

function remove_lock() {
	# Removes the lock file
	\rm -f ${LOCKDIR}/$SRC_LOCK

function construct_ssh_cmd() {
	# Constract the remote SSH command
	# Here is a good place to put atomic parameters used for the SSH
	[ -z "${PORT}" ] && PORT=22
	SSH="ssh -p $PORT $TGT_SYS -o ConnectTimeout=3"

function get_last_remote_snapshots() {
	# Gets the last snapshot name on a remote system, to match it to our snapshots
	remoteSnapTmpObj=`$SSH "$ZFS list -H -t snapshot -r -o name ${TGT_FS}/${SRC_DIRNAME_FS}" | grep ${SRC_DIRNAME_FS}@ | grep ${TGT_HASH}`
	# Create a list of all snapshot indexes. Empty means its the first one
	for snapIter in ${remoteSnapTmpObj}
	  remoteSnaps="$remoteSnaps ${snapIter##*@${TGT_HASH}_}"

function check_if_remote_snapshot_exists() {
	# Argument: $1 ->; Name of snapshot
	# Checks if this snapshot exists on remote node
	$SSH "$ZFS list -H -t snapshot -r -o name ${TGT_FS}/${SRC_DIRNAME_FS}@${TGT_HASH}_${newLocalIndex}"
	return $?

function get_last_local_snapshots() {
	# This function will return an array of local existing snapshots using the existing TGT_HASH
    localSnapTmpObj=`$ZFS list -H -t snapshot -r -o name $SRC_FS | grep [email protected] | grep $TGT_HASH `
    # Convert into a list and remove the HASH and everything before it. We should have clear list of indexes
    for snapIter in ${localSnapTmpObj}
    	localSnapList="$localSnapList ${snapIter##*@${TGT_HASH}_}"
    # Convert object to array
    localSnapList=( $localSnapList )
    # Get the last object
    let localSnapArrayObj=${#localSnapList[@]}-1

function delete_snapshot() {
	# This function will delete a snapshot
	# arguments: $1 -> snapshot name
	[ -z "$1" ] && abort "Cleanup snapshot got no arguments"
	$ZFS destroy $1
	#$ZFS destroy ${SRC_FS}@${TGT_HASH}_${newLocalIndex}

function find_matching_snapshot() {
	# This function will attempt to find a matching snapshot as a replication baseline
	# Gets the latest local snapshot index
    # Gets the latest mutual snapshot index
    while [ $localSnapArrayObj -ge 0 ]
    	# Check if the current counter already exists
    	if echo "$remoteSnaps" | grep -w ${localSnapList[$localSnapArrayObj]} > /dev/null 2>&1
    		# We know the mutual index.
    		return 0
    	let localSnapArrayObj--
    # If we've reached here - there is no mutual index!
    abort "There is no mutual snapshot index, you will have to resync"

function cleanup_snapshots() {
	# Creates a list of snapshots to delete and then calls delete_snapshot function
	# We are using the most recent common index, $localSnapArrayObj as the latest reference for deletion
	let deleteArrayObj=$localSnapArrayObj-${LOCAL_SNAPS_TO_LEAVE}
	# Construct a list of snapshots to delete, and delete it in reverse order
	while [ $deleteArrayObj -ge 0 ]
		# Construct snapshot name
		snapsToDelete="$snapsToDelete ${SRC_FS}@${TGT_HASH}_${localSnapList[$deleteArrayObj]}"
		let deleteArrayObj--
	snapsToDelete=( $snapsToDelete )

	while [ $snapDelete -lt ${#snapsToDelete[@]} ]
		# Delete snapshot
		delete_snapshot ${snapsToDelete[$snapDelete]}
		let snapDelete++

function initialize() {
	# This is a unique case where we initialize the first sync
	# We will call this procedure when $remoteSnaps is empty (meaning that there was no snapshot whatsoever)
	# We have to verify that the target has no existing old snapshots here
	# is it empty?
	echo "Going to perform an initialization replication. It might wipe the target $TGT_FS completely"
	echo "Press Enter to proceed, or Ctrl+C to abort"
	read "abc"
	### Decided to remove this check
	### [ -n "$LOCSNAP_LIST" ] && abort "No target snapshots while local history snapshots exists. Clean up history and try again"
	create_local_snapshot $newLocalIndex
	sleep 1
	$ZFS send -ce ${SRC_FS}@${TGT_HASH}_${newLocalIndex} | nc $TGT_SYS $NC_PORT 2>&1
	if [ "$?" -ne "0" ]
		# Do no cleanup current snapshot
		# delete_snapshot ${SRC_FS}@${TGT_HASH}_${newLocalIndex}
		abort "Failed to send initial snapshot to target system"
	sleep 1
	# Set target to RO
	$SSH $ZFS set readonly=on $TGT_FS
	[ "$?" -ne "0" ] && abort "Failed to set remote filesystem $TGT_FS to read-only" # No need to remove local snapshot

function create_local_snapshot() {
	# Creates snapshot on local storage
	# uses argument $1
	[ -z "$1" ] && abort "Failed to get new snapshot index"
	$ZFS snapshot ${SRC_FS}@${TGT_HASH}_${1}
	[ "$?" -ne "0" ] && abort "Failed to create local snapshot. Check error message"

function open_remote_socket() {
	# Starts remote socket via SSH (as the control operation)
	# port is 3000 + three-digit random number
	let NC_PORT=3000+$RANDOM%1000
	$CONTROL_SSH "nc -l -i 90 $NC_PORT | $ZFS receive ${RECEIVE_FLAGS} $TGT_FS > /tmp/output 2>&1 ; sync"
	#$CONTROL_SSH "socat tcp4-listen:${NC_PORT} - | $ZFS receive ${RECEIVE_FLAGS} $TGT_FS > /tmp/output 2>&1 ; sync"
	#zfs send -R [email protected] | zfs receive -Fdvu zpnew

function send_zfs() {
	# Do the heavy lifting of opening remote socket and starting ZFS send/receive
	sleep 1
	$ZFS send -ce -I ${SRC_FS}@${TGT_HASH}_${commonIndex} ${SRC_FS}@${TGT_HASH}_${newLocalIndex} | nc -i 90 $TGT_SYS $NC_PORT 
	#$ZFS send -ce -I ${SRC_FS}@${TGT_HASH}_${commonIndex} ${SRC_FS}@${TGT_HASH}_${newLocalIndex} | socat tcp4-connect:${TGT_SYS}:${NC_PORT} -
	sleep 20


function increment() {
	# Create a new snapshot with the index $localRecentIndex+1, and replicate it to the remote system
	# Baseline is the most recent common snapshot index $commonIndex
	RECEIVE_FLAGS="-Fsdvu" # With an 'F' flag maybe?
	# Handle the case of latest snapshot in DR is newer than current latest snapshot, due to mistaken deletion
	remoteSnaps=( $remoteSnaps )
	let remoteIndex=${#remoteSnaps[@]} # Get last snapshot on DR
	if [ ${localRecentIndex} -lt ${remoteIndex} ]
		let newLocalIndex=${remoteIndex}+1
		let newLocalIndex=localRecentIndex+1
	create_local_snapshot $newLocalIndex


	# if [ "$?" -ne "0" ]
	# then

		# Cleanup current snapshot
		#delete_snapshot ${SRC_FS}@${TGT_HASH}_${newLocalIndex}
		#abort "Failed to send incremental snapshot to target system"
	# fi
	if ! verify_correctness

		if ! loop_resume # If we can
			# We either could not resume operation or failed to run with the required amount of iterations
			# For now we abort. 
			echo "Deleting local snapshot"
			delete_snapshot ${SRC_FS}@${TGT_HASH}_${newLocalIndex}
			abort "Remote snapshot should have the index of the latest snapshot, but it is not. The current remote snapshot index is ${commonIndex}"

function loop_resume() {
	# Attempts to loop over resuming until limit attempt has been reached
	REMOTE_TOKEN=$($SSH "$ZFS get -Ho value receive_resume_token ${TGT_FS}/${SRC_DIRNAME_FS}")
	if [ "$REMOTE_TOKEN" == "-" ]
		return 1
	# We have a valid resume token. We will retry
	while [ "$COUNT" -le "$RESUME_LIMIT" ]
		# For ease of handline - for each iteration, we will request the token again
		echo "Attempting resume operation" 
		REMOTE_TOKEN=$($SSH "$ZFS get -Ho value receive_resume_token ${TGT_FS}/${SRC_DIRNAME_FS}")
		let COUNT++
		$ZFS send -e -t $REMOTE_TOKEN | nc -i 90 $TGT_SYS $NC_PORT
		#$ZFS send -e -t $REMOTE_TOKEN | socat tcp4-connect:${TGT_SYS}:${NC_PORT} -
		sleep 20
		if verify_correctness
			echo "Done"
			return 0
	# If we've reached here, we have failed to run the required iterations. Lets just verify again
	return 1

function verify_correctness() {
	# Check remote index, and verify it is correct with the current, latest snapshot

    if check_if_remote_snapshot_exists
    	echo "Replication Successful"
    	return 0
    	echo "Replication failed"
    	return 1

### MAIN ###
[ `whoami` != "root" ] && abort "This script has to be called by the root user"
[ -z "$1" ] && usage
parse_parameters $*
SRC_LOCK=`echo $SRC_FS | tr / _`
if [ -f ${LOCKDIR}/$SRC_LOCK ] 
	echo "Already locked. If should not be the case - remove ${LOCKDIR}/$SRC_LOCK"
	exit 1
get_last_remote_snapshots # Have a string list of remoteSnaps
# If we dont have remote snapshot it should be initialization
if [ -z "$remoteSnaps" ]
	echo "completed initialization. Done"
	exit 0

# We can get here only if it is not initialization
get_last_local_snapshots # Have a list (array) of localSnaps
find_matching_snapshot # Get the latest local index and the latest common index available
increment # Creates a new snapshot and sends/receives it
cleanup_snapshots # Cleans up old local snapshots
pkill -P $$
echo "Done"

A manual initial run should be called manually. If you expect a very long initial sync, you should run it in tmux to screen, to avoid failing in the middle.

To run the command, run it like this:

./ share/my-data backuphost:share

This will create under the pool ‘share’ in the host ‘backuphost’ a filesystem matching the source (in this case: share/my-data) and set it to read-only. The script will create a snapshot with a unique name based on a shortened hash of the destination, with a counting number suffix, and start cloning the snapshot to the remote host. When called again, it will create a snapshot with the same name, but different index, and clone the delta to the remote host. In case of a disconnection, the clone will retry a few times before failing.

Note that the receiving side does not remove snapshots, so handling (too) old snapshots on the backup host remains up to you.

Poor Man’s DRP + Snapshots – Linux only

Friday, October 6th, 2006

When you own a data storage, one of your major considerations is how to backup your data. Several solutions exist to answer this question.

When your data grows to a certain size, you encounter an additional issues – How to backup the data with minimum performance impact.

It is quite obvious that backup devices has a specific speed and performance. It is quite obvious that is you have more data than you can stream into your tape deviced during night, your backup would probably continue during working hours.

Several solutions exist to deal with this problem, amongst you can find the solution of faster backup tapes, broader bandwidth between your storage container and your backup devices. The issue I will demonstrate has to do with a third option – create a real-time replica on another server, and backup the replica only.

When it comes to Linux, I’ve always felt that the backup/restore software companies were rather slow to supply solutions fit for Linux, especailly compared to the widening usage of Linux-based systems in the market.

One of the more intriging solutions which grew in the OpenSource community is called DRBD – Distributed Redundant Block Device. It allows the creation of a logical block device which overlayes two physical block devices – one local and one remotely accessible via network. It can be easilly described as network Raid-1 solution.

The wonders of real-time volume replica between two servers should not be discussed here. The advantages are well known, as are the disadvantages, of which the largest one is the heavy performance toll on such a system.

The wonders of snapshots are also well known. NetApp gains its main capital based on their sophisticated snapshot technology (WAFL, etc). Other storage vendors have added the abilities to take snapshots with higher or lower effeciancy, however, one of the newer players in this under-the-spotlight area is the OpenSource LVM2 for Linux, with its snapshot capabilities. Although still not perfect, it does show a promise I will soon demonstrate, combined with DRBD, described above.

The combined wonders of volume replication together with scheduled snapshots can offer the ability to execute backup of consistant snapshot data, the ability to get back to a desired volume’s point-in-time and the power to reduce the load of backing up on mission-critical datacenters. All these, at the price of internet connection which will allow you to download the latest DRBD software.

I have tested it on a home-made setup – Two Virtual Linux server running on a single VMware-Server machine.

The host is Pentium4 1.8GHz, with 1GB RDRAM, and three IDE harddrives, running Centos 4.4

The guests are two Centos 4.4 machines, with 160MB RAM each, two virtual NICs – one public and one private, minimal installation, and Dag Wieers‘ YUM repositories added to them.

The guest will be called DRBD-test1 and DRBD-test2. The first will act as the mission-critical server, and the second will be the replica (target) server.

Both guests were updated to the latest updates available at this time. Both are running kernel version 2.6.9-42.0.2.EL, DRBD version 0.7.21-1.c4, and kernel-module-drbd-2.6.9-42.EL-0.7.21-1.c4

Installing the kernel-module package put the drbd.ko modules in /lib/modules/2.6.9-42.EL instead of my running kernel (2.6.9-42.0.2.EL), so after verifying that the modules were able to load into my running kernel, I have moved them to the kernel/drivers/block directory inside the modules tree, and run ‘depmod -a‘.

I decided to use a consistant configuraion, and defined the storage to replicate in a similar manner:

On /dev/sdb I’ve created PV (pvcreate /dev/sdb). Assigned this PV to VG named vg00, and created two LVs on it: meta (256MB) and source (2GB) on the guest acting as the mission critical server, and meta (256MB) and target (2GB) on the one acting as replica.

I have created the device /dev/drbd0, per DRBD’s Howto, built the configuration file drbd.conf, and loaded the modules.

Forced the Source guest to act as the primary, and replication began.

When replication has finished, I have created a snapshot of the LV target and mounted it correctly: "lvcreate -L 200M -s -n snap /dev/vg00/target && mount /dev/vg00/snap /mnt"

I was able to access the data inside the volume, without changing the Primary/Secondary order of the servers. I have created a script which used DD to stress the I/O of the DRBD volume on the source server, and created a script which took scheduled (every minute) snapshots of the target volume. I have learned the following:

1. It works, but

2. The size limitaion forced on snapshot (200MB in my case) should never be filled up. When running DD on the source volume (creating 50MB empty files), the space consumed by the snapshots increases, and if/when a snapshot exceeds the 100% utilization, it is inaccessible anymore. To view the current usage of a snapshot, run "lvdisplay /dev/vg00/snap" (in my example).

During that evaluation, one of my virtual server crushed, due to LVM2 snapshot problem. LVM2 is not yet perfect on RH based systems…

Performance on another time. I wan’t too happy with it, however on this experiment my goal was to find out if such a setup can be built rather than to measure the performance impact.

Generally speaking – I was rather happy with the results – It showed that this setup can actually work. It proved to me again that OSS innovations elevate Linux to the enterprise.

Now that I know that such a setup can be done, all left to do is to fine-tune it to minimum performance impact, and test again to see if it can actually be a well-suited solution for the questions I’ve started with.