Posts Tagged ‘LVM’

LVM Recovery

Friday, May 29th, 2009

A friend of mine made a grieve mistake – partition a disk containing Linux LVM directly on it, without any partition table. Oops.

When dealing with multi-Tera sized disks, one gets to encounter limitations not known on smaller scales – the 2TB limitation. Normal partition table can contain only around 2TB mapping, meaning that to create larger partitions, or even smaller partitions which exceed that specific limit, you have to take one of two actions:

  • Use GPT partition tables, which is meant for large disks, and partition the disk to the size limits you desire
  • Define LVM PV directly on the block device (the command would look like ‘pvcreate /dev/sdb -> see? No partitions)

“Surprisingly” and for no good reason, it appears that the disk which was used completely for the LVM PV suddenly had a single GPT partition on it. Hmmmm.

This is/was a single disk in a two-PV VG continging a single LV spanned all over the VG space. Following the “mysterious” actions, the VG refused to start, claiming that it could not find PV with PVID <some UID>.

This is a step where one should stop and call a professional if he doesn’t know for sure how to continue. These following actions are very risky to your data, and could result in you either recovering from tapes (if exist) or seeking a new job, if this is/was some mission-critical data.

First – go to /etc/lvm/archive and find the latest file named after the VG which has been destroyed. Look into it – you should see the PV is in there. Search the PV based on the UID reported not to repond on the logs.

Second – you need to remove the GPT partition from the disk. The PV will be recreated exactly as it was suppoed to be before. Replace /dev/some_disk with your own device file.

fdisk /dev/some_disk

d

w

Third – Reread the VG archive file, to be on the safe side. Verify again that the PV you are about to recreate is the one you are to. When done, run the following command

pvcreate -u <UID> /dev/some_disk

Again – the name of the device file has been changed in this example to prevent copy-paste incidents from happening.

Fourth – Run vgcfgrestore with the name of the VG as parameter. This command would restore your meta information into the PV and VG.

vgcfgrestore VG_TEST

Fifth – Activate the VG:

vgchange -ay VG_TEST

Now the volumes should be up, and you have the ability to attempt to mount these volumes.

Notice that the data might be corrupted in some way. Running fsck is recommended, although time-consuming.

Good luck!

Relocating LVs with snapshots

Monday, February 2nd, 2009

Linux LVM is a wonderful thing. It is scalable, flexible, and truly, almost enterprise-class in every details. It lacks, of course, at IO performance for LVM snapshots, but this can be worked-around in several creative ways (if I haven’t shown here before, I will sometime).

What it can’t do is dealing with a mixture of Stripes, Mirrors and Snapshots in a single logical volume. It cannot allow you to mirror a stripped LV (even if you can follow the requirementes), it cannot allow you to snapshot a mirrored or a stripped volume. You get the idea. A volume you can protect, you cannot snapshot. A volume with snapshots cannot be mirrored or altered.

For the normal user, what you get is usually enough. For storage management per-se, this is just not enough. When I wanted to reduce a VG – remove a disk from an existing volume group,  I had to evacuate it from any existing logical volume. The command to perform this actions is ‘pvmove‘ which is capable of relocating data from within a PV to other PVs. This is done through mirroring each logical volume and then removing the origin.

Mirroring, however, cannot be performed on LVs with snapshots, or on an already mirrored LV, so these require different handling.

We can detect which LVs reside on our physical volume by issuing the following command

pvdisplay -m /dev/sdf1

/dev/sdf1 was only an example. You will see the contents of this PV. So next, performing

pvmove /dev/sdf1

would attempt to relocate every existing LV from this specific PV to any other available PV. We can use this command to change the disk balance and allocations on multi-disk volume groups. This will be discussed on a later post.

Following a ‘pvmove‘ command, all linear volumes are relocated, if space permits, to another PVs. The remaining LVs are either mirrored or LVs with snapshots.

To relocate a mirrored LV, you need to un-mirror it first. To do so, first detect using ‘pvdisplay‘ which LV is belongs to (the name should be easy to follow) and then change it to non-mirrored.

lvconvert -m0 /dev/VolGroup00/test-mirror

This will convert it to be a linear volume instead of a mirror, so you could move it, if it still resides on the PV you are to remove.

Snapshot volumes are more complicated, due to their nature. Since all my snapshots are of a filesystem, I could allow myself to use tar to perform the action.

The steps are as follow:

  1. tar the contents of the snapshot source to nowhere, but save an incremental file
  2. Copy the source incremental file to a new name, and tar the contents of a snapshot according to this copy.
  3. Repeat the previous step for each snapshot.
  4. Remove all snapshots
  5. Relocate the snapshot source using ‘pvmove
  6. Build the snapshots and then recover the data into them

This is a script to do steps 1 to 3. It will not remove LVs, for obvious reasons. This script was not tested, but should work, of course :-)

None of the LVs should be mounted for it to function. It’s better to have harder requirements than to destroy data by double-mounting it, or accessing it while it is being changed.

#!/bin/bash
# Get: VG Base-LV, snapshot name, snapshot name, snapshot name...
# Example:
# ./backup VolGroup00 base snap1 snap2 snap3
# Written by Ez-Aton
 
TARGET=/tmp
if [ "$@" -le 3 ]
then
   echo "Parameters: $0 VG base snap snap snap snap"
   exit 1
fi
VG=$1
BASE=$2
shift 2
 
function check_not_mounted () {
   # Check if partition is mounted
   if mount | grep /dev/mapper/${VG}-${1}
   then
      return 0
   else
      return 1
   fi
}
 
function create_base_diff () {
   # This function will create the diff file for the base
   mount /dev/${VG}/${BASE} $MNT
   if [ $? -ne 0 ]
   then
      echo "Failed to mount base"
      exit 1
   fi
   cd $MNT
   tar -g $TARGET/${BASE}.tar.gz.diff -czf - . &gt; /dev/null
   cd -
   umount $MNT
}
 
function create_snap_diff () {
   mount /dev/${VG}/${1} $MNT
   if [ $? -ne 0 ]
   then
      echo "Failed to mount base"
      exit 1
   fi
   cp $TARGET/${BASE}.tar.gz.diff $TARGET/$1.tar.gz.diff
   cd $MNT
   tar -g $TARGET/${1}.tar.gz.diff -czf $TARGET/${1}.tar.gz .
   cd -
   umount $MNT
}
 
function create_mount () {
   # Creates a temporary mount point
   if [ ! -d /mnt/$$ ]
   then
      mkdir /mnt/$$
   fi
   MNT=/mnt/$$
}
 
create_mount
if check_not_mounted $BASE
then
   create_base_diff
else
   echo "$BASE is mounted. Exiting now"
   exit 1
fi
for i in $@
do
   if check_not_mounted $i
   then
      create_snap_diff $i
   else
      echo "$i is mounted! I will not touch it!"
   fi
done

The remaining steps should be rather easy – just mount the newly created snapshots and restore the tar file on them.

Quick provisioning of virtual machines

Friday, February 1st, 2008

When one wants to achieve fast provisioning of virtual machines, some solutions might come into account. The one I prefer uses Linux LVM snapshot capabilities to duplicate one working machine into few.

This can happen, of course, only if the host running VMware-Server is Linux.

LVM snapshots have one vast disadvantage – performance. When a block on the source of the snapshot is being changed for the first time, the original block is being replicated to each and every snapshot COOW space. It means that a creation of a 1GB file on a volume having ten snapshots means a total copy of 10GB of data across your disks. You cannot ignore this performance impact.

LVM2 has support for read/write snapshots. I have come up with a nice way of utilizing this capability to my benefit. An R/W snapshot which is being changed does not replicate its changes to any other snapshot. All changes are considered local to this snapshot, and are being maintained only in its COOW space. So adding a 1GB file to a snapshot has zero impact on the rest of the snapshots or volumes.

The idea is quite simple, and it works like this:

1. Create adequate logical volume with a given size (I used 9GB for my own purposes). The name of the LV in my case will be /dev/VGVM3/centos-base

2. Mount this LV on a directory, and create a VM inside it. In my case, it’s in /vmware/centos-base

3. Install the VM as the baseline for all your future VMs. If you might not want Apache on some of them, don’t install it on the baseline.

4. Install vmware-tools on the baseline.

5. Disable the service “kudzu”

6. Update as required

7. In my case I always use DHCP. You can set it to obtain its IP once from a given location, or whatever you feel like.

8. Shut down the VM.

9. In the VM’s .vmx file add a line like this:

uuid.action = “create”

I have added below (expand to read) two scripts which will create the snapshot, mount it and register it, including new MAC and UUID.

Press below for the scripts I have used to create and destroy VMs

create-replica.sh:

#!/bin/sh
# This script will replicate vms from a given (predefined) source to a new system
# Written by Ez-Aton, http://www.tournament.org.il/run
# Arguments: name

# FUNCITONS BE HERE
test_can_do () {
# To be able to snapshot, we need a set of things to happen
if [ -d $DIR/$TARGET ] ; then
echo “Directory already exists. You don’t want to do it…”
exit 1
fi
if [ -f $VG/$TARGET ] ; then
echo “Target snapshot exists”
exit 1
fi
if [ `vmrun list | grep -c $DIR/$SRC/$SRC.vmx` -gt "0" ] ; then
echo “Source VM is still running. Shut it down before proceeding”
exit 1
fi
if [ `vmware-cmd -l | grep -c $DIR/$TARGET/$SRC.vmx` -ne "0" ] ; then
echo “VM already registered. Unregister first”
exit 1
fi
}

do_snapshot () {
# Take the snapshot
lvcreate -s -n $TARGET -L $SNAPSIZE $VG/$SRC
RET=$?
if [ "$RET" -ne "0" ]; then
echo “Failed to create snapshot”
exit 1
fi
}

mount_snapshot () {
# This function creates the required directories and mounts the snapshot there
mkdir $DIR/$TARGET
mount $VG/$TARGET $DIR/$TARGET
RET=$?
if [ "$RET" -ne "0" ]; then
echo “Failed to mount snapshot”
exit 1
fi
}

alter_snap_vmx () {
# This function will alter the name in the VMX and make it the $TARGET name
cat $DIR/$TARGET/$SRC.vmx | grep -v “displayName” > $DIR/$TARGET/$TARGET.vmx
echo “displayName = \”$TARGET\”" >> $DIR/$TARGET/$TARGET.vmx
cat $DIR/$TARGET/$TARGET.vmx > $DIR/$TARGET/$SRC.vmx
\rm $DIR/$TARGET/$TARGET.vmx
}

register_vm () {
# This function will register the VM to VMWARE
vmware-cmd -s register $DIR/$TARGET/$SRC.vmx
}

# MAIN
if [ -z "$1" ]; then
echo “Arguments: The target name”
exit 1
fi

# Parameters:
SRC=centos-base         #The name of the source image, and the source dir
PREFIX=centos             #All targets will be created in the name centos-$NAME
DIR=/vmware               #My VMware VMs default dir
SNAPSIZE=6G              #My COOW space
VG=/dev/VGVM3           #The name of the VG
TARGET=”$PREFIX-$1″

test_can_do
do_snapshot
mount_snapshot
alter_snap_vmx
register_vm
exit 0

remove-replica.sh:

#!/bin/sh
# This script will remove a snapshot machine
# Written by Ez-Aton, http://www.tournament.org.il/run
# Arguments: machine name

#FUNCTIONS
does_it_exist () {
# Check if the described VM exists
if [ `vmware-cmd -l | grep -c $DIR/$TARGET/$SRC.vmx` -eq "0" ]; then
echo “No such VM”
exit 1
fi
if [ ! -e $VG/$TARGET ]; then
echo “There is no matching snapshot volume”
exit 1
fi
if [ `lvs $VG/$TARGET | awk '{print $5}' | grep -c $SRC` -eq "0" ]; then
echo “This is not a snapshot, or a snapshot of the wrong LV”
exit 1
fi
}

ask_a_thousand_times () {
# This function verifies that the right thing is actually done
echo “You are about to remove a virtual machine and an LVM. Details:”
echo “Machine name: $TARGET”
echo “Logical Volume: $VG/$TARGET”
echo -n “Are you sure? (y/N): ”
read RES
if [ "$RES" != "Y" ]&&[ "$RES" != "y" ]; then
echo “Decided not to do it”
exit 0
fi
echo “”
echo “You have asked to remove this machine”
echo -n “Again: Are you sure? (y/N): ”
read RES
if [ "$RES" != "Y" ]&&[ "$RES" != "y" ]; then
echo “Decided not to do it”
exit 0
fi
echo “Removing VM and snapshot”
}

shut_down_vm () {
# Shut down the VM and unregister it
vmware-cmd $DIR/$TARGET/$SRC.vmx stop hard
vmware-cmd -s unregister $DIR/$TARGET/$SRC.vmx
}

remove_snapshot () {
# Umount and remove the snapshot
umount $DIR/$TARGET
RET=$?
if [ "$RET" -ne "0" ]; then
echo “Cannot umount $DIR/$TARGET”
exit 1
fi
lvremove -f $VG/$TARGET
RET=$?
if [ "$RET" -ne "0" ]; then
echo “Cannot remove snapshot LV”
exit 1
fi
}

remove_dir () {
# Removes the mount point
rmdir $DIR/$TARGET
}

#MAIN
if [ -z "$1" ]; then
echo “No machine name. Exiting”
exit 1
fi

#PARAMETERS:
DIR=/vmware                #VMware default VMs location
VG=/dev/VGVM3            #The name of the VG
PREFIX=centos              #Prefix to the name. All these VMs will be called centos-$NAME
TARGET=”$PREFIX-$1″
SRC=centos-base           #The name of the baseline image, LVM, etc. All are the same

does_it_exist
ask_a_thousand_times
shut_down_vm
remove_snapshot
remove_dir

exit 0

Pros:

1. Very fast provisioning. It takes almost five seconds, and that’s because my server is somewhat loaded.

2. Dependable: KISS at its marvel.

3. Conservative on space

4. Conservative on I/O load (unlike the traditional use of LVM snapshot, as explained in the beginning of this section).

Cons:

1. Cannot streamline the contents of snapshot into the main image (LVM team will implement it in the future, I think)

2. Cannot take a snapshot of a snapshot (same as above)

3. If the COOW space of any of the snapshots is full (viewable through the command ‘lvs‘) then on boot, the source LV might not become active (confirmed RH4 bug, and this is the system I have used)

4. My script does not edit/alter /etc/fstab (I have decided it to be rather risky, and it was not worth the effort at this time)

5. My script does not check if there is enough available space in the VG. Not required, as it will fail if creation of LV will fail

You are most welcome to contribute any further changes done to this script. Please maintain my URL in the script if you decide to use it.

Thanks!

Linux LVM performace measurement

Sunday, June 10th, 2007

Modern Linux LVM offers great abilities to maintain snapshots of existing logical volumes. Unlike NetApp “Write Anywhere File Layout” (WAFL), Linux LVM uses “Copy-on-Write” (COW) to allow snapshots. The process, in general, can be described in this pdf document.

I have issues several small tests, just to get real-life estimations of what is the actual performance impact such COW method can cause.

Server details:

1. CPU: 2x Xion 2.8GHz

2. Disks: /dev/sda – system disk. Did not touch it; /dev/sdb – used for the LVM; /dev/sdc – used for the LVM

3. Mount: LV is mounted (and remains mounted) on /vmware

Results:

1. No snapshot, Using VG on /dev/sdb only:

# time dd if=/dev/zero of=/vmware/test.2GB bs=1M count=2048
2048+0 records in
2048+0 records out

real 0m16.088s
user 0m0.009s
sys 0m8.756s

2. With snapshot on the same disk (/dev/sdb):

# time dd if=/dev/zero of=/vmware/test.2GB bs=1M count=2048
2048+0 records in
2048+0 records out

real 6m5.185s
user 0m0.008s
sys 0m11.754s

3. With snapshot on 2nd disk (/dev/sdc):

# time dd if=/dev/zero of=/vmware/test.2GB bs=1M count=2048
2048+0 records in
2048+0 records out

real 5m17.604s
user 0m0.004s
sys 0m11.265s

4. Same as before, creating a new empty file on the disk:

# time dd if=/dev/zero of=/vmware/test2.2GB bs=1M count=2048
2048+0 records in
2048+0 records out

real 3m24.804s
user 0m0.006s
sys 0m11.907s

5. Removed the snapshot. Created a 3rd file:

LVM Snapshots with MySQL

Saturday, December 2nd, 2006

Nowadays, when LVM2 is common and is actually the default in installation of RedHat based distributions, using its snapshot capabilities can save lots of grief when files are deleted or when you need to revert to a day in the past – both for your files and for your MySQL DB.

I have created a script which is based on the following assumptions:

1. Inside /etc/samba/smb.conf there is a directive such as: include /etc/samba/smb.conf

2. There is a single LV containing all the system’s data. It doesn’t occupy all the physical disk (or, for the matter, the entire VG space). Free space is 10-20% of disk size

3. Specific share directives are located inside /etc/samba/smb.conf.snapshot.full. An empty file /etc/samba/smb.conf.snapshot.empty exists.

4. I do not trust all places to hold a password for their MySQL (although it is advised!). This script assumes such password doesn’t always exist

5. The script mounts the snapshot read-only just after creating an empty file with the date of the snapshot inside its root.

The script is attached here. take-snapshot.txt

Single-Node Linux Heartbeat Cluster with DRBD on Centos

Monday, October 23rd, 2006

The trick is simple, and many of those who deal with HA cluster get at least once to such a setup – have HA cluster without HA.

Yep. Single node, just to make sure you know how to get this system to play.

I have just completed it with Linux Heartbeat, and wish to share the example of a setup single-node cluster, with DRBD.

First – get the packages.

It took me some time, but following Linux-HA suggested download link (funny enough, it was the last place I’ve searched for it) gave me exactly what I needed. I have downloaded the following RPMS:

heartbeat-2.0.7-1.c4.i386.rpm

heartbeat-ldirectord-2.0.7-1.c4.i386.rpm

heartbeat-pils-2.0.7-1.c4.i386.rpm

heartbeat-stonith-2.0.7-1.c4.i386.rpm

perl-Mail-POP3Client-2.17-1.c4.noarch.rpm

perl-MailTools-1.74-1.c4.noarch.rpm

perl-Net-IMAP-Simple-1.16-1.c4.noarch.rpm

perl-Net-IMAP-Simple-SSL-1.3-1.c4.noarch.rpm

I was required to add up the following RPMS:

perl-IO-Socket-SSL-1.01-1.c4.noarch.rpm

perl-Net-SSLeay-1.25-3.rf.i386.rpm

perl-TimeDate-1.16-1.c4.noarch.rpm

I have added DRBD RPMS, obtained from YUM:

drbd-0.7.21-1.c4.i386.rpm

kernel-module-drbd-2.6.9-42.EL-0.7.21-1.c4.i686.rpm (Note: Make sure the module version fits your kernel!)

As soon as I finished searching for dependent RPMS, I was able to install them all in one go, and so I did.

Configuring DRBD:

DRBD was a tricky setup. It would not accept missing destination node, and would require me to actually lie. My /etc/drbd.conf looks as follows (thanks to the great assistance of linux-ha.org):

resource web {
protocol C;
incon-degr-cmd “echo ‘!DRBD! pri on incon-degr’ | wall ; sleep 60 ; halt -f”; #Replace later with halt -f
startup { wfc-timeout 0; degr-wfc-timeout 120; }
disk { on-io-error detach; } # or panic, …
syncer {
group 0;
rate 80M; #1Gb/s network!
}
on p800old {
device /dev/drbd0;
disk /dev/VolGroup00/drbd-src;
address 1.2.3.4:7788; #eth0 network address!
meta-disk /dev/VolGroup00/drbd-meta[0];
}
on node2 {
device /dev/drbd0;
disk /dev/sda1;
address 192.168.99.2:7788; #eth0 network address!
meta-disk /dev/sdb1[0];
}
}

I have had two major problems with this setup:

1. I had no second node, so I left this “default” as the 2nd node. I never did expect to use it.

2. I had no free space (non-partitioned space) on my disk. Lucky enough, I tend to install Centos/RH using the installation defaults unless some special need arises, so using the power of the LVM, I have disabled swap (swapoff -a), decreased its size (lvresize -L -500M /dev/VolGroup00/LogVol01), created two logical volumes for DRBD meta and source (lvcreate -n drbd-meta -L +128M VolGroup00 && lvcreate -n drbd-src -L +300M VolGroup00), reformatted the swap (mkswap /dev/VolGroup00/LogVol01), activated the swap (swapon -a) and formatted /dev/VolGroup00/drbd-src (mke2fs -j /dev/VolGroup00/drbd-src). Thus I have now additional two volumes (the required minimum) and can operate this setup.

Solving the space issue, I had to start DRBD for the first time. Per Linux-HA DRBD Manual, it had to be done by running the following commands:

modprobe drbd

drbdadm up all

drbdadm — –do-what-I-say primary all

This has brought the DRBD up for the first time. Now I had to turn it off, and concentrate on Heartbeat:

drbdadm secondary all

Heartbeat settings were as follow:

/etc/ha.d/ha.cf:

use_logd on #?Or should it be used?
udpport 694
keepalive 1 # 1 second
deadtime 10
initdead 120
bcast eth0
node p800old #`uname -n` name
crm yes
auto_failback off #?Or no
compression bz2
compression_threshold 2

I have also created a relevant /etc/ha.d/haresources, although I’ve never used it (this file has no importance when using “crm=yes” in ha.cf). I did, however, use it as a source for /usr/lib/heartbeat/haresources2cib.py:

p800old IPaddr::1.2.3.10/8/1.255.255.255 drbddisk::web Filesystem::/dev/drbd0::/mnt::ext3 httpd

It is clear that the virtual IP will be 1.2.3.10 in my class A network, and DRBD would have to go up before mounting the storage. After all this, the application would kick in, and would bring up my web page. The application, Apache, was modified beforehand to use the IP 1.2.3.10:80, and to search for DocumentRoot in /mnt

Running /usr/lib/heartbeat/haresources2cib.py on the file (no need to redirect output, as it is already directed to /var/lib/heartbeat/crm/cib.xml), and I was ready to go.

/etc/init.d/heartbeat start (while another terminal is open with tail -f /var/log/messages), and Heartbeat is up. It took it few minutes to kick the resources up, however, I was more than happy to see it all work. Cool.

The logic is quite simple, the idea is very basic, and as long as the system is being managed correctly, there is no reason for it to get to a dangerous state. Moreover, since we’re using DRBD, Split Brain cannot actually endanger the data, so we get compensated for the price we might pay, performance-wise, on a real two-node HA environment following these same guidelines.

I cannot express my gratitude to http://www.linux-ha.org, which is the source of all this (adding up with some common sense). Their documents are more than required to setup a full working HA environment.

HP-UX and Software Raid1

Tuesday, August 8th, 2006

I have installed today an HP-UX 11i V2 on PA-Risc server, and it went quite fine. I have used the “Technical Environment” DVDs for installation, and it went fine. I was unable to find, however, the Raid1 (Mirror) tools for the LVM.

Symptoms: There is no parameter “-m” to “lvextend“. According to documentaion (or even better, HP Forum1 and HP Forum2), it is plain simple, using the lvextend. Only here I got to figure that it was part of the LVM package for Enterprise Servers.

I finally found it in the CD called “Mission Critical Operating Environment DVD 1″. Inside,in a bundle called “HPUX11i-OE-Ent”. I have selected “LVM” from the list there, installed, let the system recompile the kernel, and reboot. Then lvextend will started accepting the “-m” flag.

Per the posts described above, I run:

for LVOL in `ls /dev/vg00/l*` ; do

lvextend -m 1 $LVOL

done

Took a while, but at least it worked.

Ontap Simulator, and some insights about NetApp

Tuesday, May 9th, 2006

First and foremost – the Ontap simulator, a great tool which surely can assist in learning NetApp interface and utilization, lacks in performance. It has some built-in limitations – No FCP, no disks (virtual disks) larger than 1GB (per my trial-and-error. I might find out I was wrong somehow, and put in on this website), and low performance. I’ve got about 300KB/s transfer rate both on iSCSI and on NFS. To make sure it was not due to some network hog hiding somewhere on my net(s), I’ve even tried it from the host of the simulator itself, but to no avail. Low performance. Don’t try to use it as your own home iSCSI Target. Better just use Linux for this purpose, with the drivers obtained from here (It’s one of my next steps into “shared storage(s) for all”).

Another issue – After much reading through NetApp documentation, I’ve reached the following concepts of the product. Please correct me if you see fit:

The older method was to create a volume (vol create) directly from disks. Either using raid_dp or raid4.

The current method is to create aggregations (aggr create) from disks. Each aggregate consists of raid groups. A raid group (rg) can be made up of up to eight physical disks. Each group of disks (an rg) has one or two parity disks, depending on the type of raid (raid 4 uses one parity, and raid_dp uses “double parity”, as its name can suggest).

Actually, I can assume that each aggregation is formatted using the WAFL filesystem, which leads to the conclusion that modern (flex) volumes are logical “chunks” of this whole WAFL layout. In the past, each volume was a separated WAFL formatted unit, and each size change required adding disks.

This separation of the flex volume from the aggregation suggests to me the possibility of multiple-root capable WAFL. It can explain the lack of requirement for a continuous space on the aggregation. This eases the space management, and allows for fast and easy “cloning” of volumes.

I believe that the new “clone” method is based on the WAFL built-in snapshot capabilities. Although WAFL Snapshots are supposed to be space conservatives, they require a guaranteed space on the aggregation prior to committing the clone itself. If the aggregation is too crowded, they will fail with the error message “not enough space”. If there is enough for snapshots, but not enough to guarantee a full clone, you’ll get a message saying “space not guaranteed”.

I see the flex volumes as some combination between filesystem (WAFL) and LVM, living together on the same level.

LUNs on NetApp: iSCSI and/or Fibre LUNs are actually managed as a single (per-LUN) large file contained within a volume. This file has special permissions (I was not able to copy it or modify it while it was online and I had root permissions. However, I am rather new to NetApp technology), and it is being exported as a disk outside. Much like an ISO image (which is a large file containing a whole filesystem layout) these files contain a whole disk layout, including partition tables, LVM headers, etc – just like a real disk.

Thinking about it, it’s neither impossible nor very surprising. A disk is no more than a container of data, of blocks, and if you can utilize the required communication protocol used for accessing it and managing its blocks (aka, the transport layer on which filesystem can access the block data), you can, with just a little translation interface, set up a virtual disk which will behave just like any regular disk.

This brings us to the advantages of NetApp’s WAFL – the ability to minimize I/O while maintaining a set of snapshots for the system – a list of per-block modification history. It means you can “snapshot” your LUN, being physically no more than a file on a WAFL-based volume, and you can go back with your data to a previous date – an hour, a day, a week. Time travel for your data.

There are, unfortunately, some major side effects. If you’ve read the WAFL description from NetApp, my summary will be inaccurate at best. If you haven’t, it will be enough, but still you are most encouraged to read it. The idea is that this filesystem is made out of multi-layers of pointers, and of blocks. A pointer can point to more than one block. When you commit a snapshot, you do not change the pointers, you do not move data, you just modify the set of pointers. When there is any change in the data (meaning a block is changed), the pointer points to the alternate block instead of the previous (historical) block, but keeps reference of the older block’s location. This way, only modified blocks are actually recreated, while any unmodified data remains on the same spot on the physical disk. An additional claim of NetApp is that their WAFL is optimized for the raid4 and raid_dp they use, and utilizes it in a smart manner.

The problem with WAFL, as can be easily seen, is fragmentation. For CIFS and NFS, it does not cause much of a problem, as the system is very capable of read-ahead just to solve this issue. However, A LUN (which is supposed to act as a continuous layout, just like any hard-drive or raid-array in the world and on which various file-system related operations occur) gets fragmented.

Unlike CIFS or NFS, LUN read-ahead is harder to predict, as the client tries to do just the same. Unlike real disks, NetApp LUNs do not behave, performance-wise, like the hard-drive layout any DB or FS has learned to expect and was best optimized for. It means, for my example, that on a DB with lots of small changes, that the DB itself would have tried to commit changes in large write operations, committed every so and so interval, and would thrive to commit them as close to each other, as continuous as possible. On NetApp LUN this will cause fragmentation, and will result in lower write (and later read) performance.

That’s all for today.