Posts Tagged ‘performance’

XenServer 6.0 with DRBD

Wednesday, January 18th, 2012

DRBD is a shared-SAN-like solution with several great benefits, among them no single point of failure and a very low cost (local storage and a network cable). Its main disadvantage is the need to constantly monitor it and make sure it does what is expected. Also – in some cases – performance might be affected greatly.

If you need a XenServer pool with XenMotion for your VMs (it used to be called LiveMigration; I liked that name better), but you cannot afford, or do not want, classic shared storage acting as a single point of failure, DRBD could be for you. You have to understand its limitations, however.

The most important limitation concerns data consistency. If you aim at using it as Active/Active, as I have, you need to make sure that under no circumstance you end up with split brain, as that means losing data (you will recover to an older point in time). If you aim at Active/Passive, or if all your VMs run on a single host, the danger is lower; for A/A with VMs spread across both hosts, however, the danger is imminent, and you should be aware of it.

This does not mean you will have to run away crying in case of split brain. It means you might be required to export/import VMs to regain consistent data, and that you will have a very long downtime. Kinda defies the purpose of XenMotion and all…

Using the DRBD guide here, you will find an excellent solution, but not a complete one. I will describe my additions to this document.

So, first, you need to download the DRBD packages. I have re-packaged them, as they did not match XenServer with the XS60E003 update. You can grab this particular tar.gz here: drbd-8.3.12-xenserver6.0-xs003.tar.gz . I did not use DRBD 8.4.1, as it has shown great instability and liked getting split-brained all the time. We don’t want that in our system, do we?

Make sure you have defined the private link between your hosts, both as a network interface, as described, and in both servers’ /etc/hosts file. It will make things easier later. Verify that each host’s hostname matches the configuration file, or DRBD will not start.
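For example, the relevant /etc/hosts entries on both servers, matching the hostnames and addresses used in my drbd.conf below, would be:

10.1.1.1    xenserver1
10.1.1.2    xenserver2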

Next, follow the mentioned guide.

Unlike this guide, I did not define DRBD to be Active/Active in the configuration file. I have noticed that upon reboot of the pool master (and always the master), it would start in split-brain mode, probably due to timing issues with the XE Toolstack not releasing the DRBD device, and I was incapable of handling it correctly. No matter how early in the boot sequence I set the service to start, it would always come up in split-brain mode.

The workaround was to let it start in passive (secondary) mode; while it is a read-only device, the XE Toolstack cannot use it. Then I wait (in /etc/rc.local) for it to complete its sync, and plug the PBD.

You will need each host’s PBD UUID for this specific SR.

You can do it by running:

for i in $(xe host-list --minimal | tr ',' ' ') ; do
  echo -n "host $(xe host-param-get param-name=hostname uuid=$i)  "
  echo "PBD $(xe pbd-list sr-uuid=$(xe sr-list name-label=drbd-sr1 --minimal) host-uuid=$i --minimal)"
done

This will result in one line per host, with that host’s PBD UUID for the DRBD SR. Replace drbd-sr1 with your actual DRBD SR name.
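The output will look roughly like this (the UUIDs here are placeholders, of course):

host xenserver1  PBD 01234567-89ab-cdef-0123-456789abcdef
host xenserver2  PBD 89abcdef-0123-4567-89ab-cdef01234567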

You will require this info later.

My drbd.conf file looks like this:

# You can find an example in  /usr/share/doc/drbd.../drbd.conf.example

#include "drbd.d/global_common.conf";
#include "drbd.d/*.res";

resource drbd-sr1 {
    protocol C;
    startup {
        degr-wfc-timeout 120; # 2 minutes.
        outdated-wfc-timeout 2; # 2 seconds.
        #become-primary-on both;
    }

    handlers {
        split-brain "/usr/lib/drbd/notify.sh root";
    }

    disk {
        max-bio-bvecs 1;
        no-md-flushes;
        no-disk-flushes;
        no-disk-barrier;
    }

    net {
        allow-two-primaries;
        cram-hmac-alg "sha1";
        shared-secret "Secr3T";
        after-sb-0pri discard-zero-changes;
        after-sb-1pri consensus;
        #after-sb-1pri discard-secondary; # only one after-sb-1pri policy may be active
        after-sb-2pri disconnect;
        #after-sb-2pri call-pri-lost-after-sb;
        max-buffers 8192;
        max-epoch-size 8192;
        sndbuf-size 1024k;
    }

    syncer {
        rate 1G;
        al-extents 2099;
    }

    on xenserver1 {
        device /dev/drbd1;
        disk /dev/sda3;
        address 10.1.1.1:7789;
        meta-disk internal;
    }
    on xenserver2 {
        device /dev/drbd1;
        disk /dev/sda3;
        address 10.1.1.2:7789;
        meta-disk internal;
    }
}
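For reference, and assuming you follow the guide above, first-time initialization of this resource goes roughly like this – the first two commands on both nodes, the last one on one node only, to kick off the initial sync:

drbdadm create-md drbd-sr1
service drbd start
drbdadm -- --overwrite-data-of-peer primary drbd-sr1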

I did not force them both to become primary, as split-brain handling in A/A mode is very complex. I have forced them to start as secondary.
Then, in /etc/rc.local, I have added the following lines:

echo 1 > /sys/devices/system/cpu/cpu1/online
while grep sync /proc/drbd > /dev/null 2>&1
do
        sleep 5
done
/sbin/drbdadm primary all
/opt/xensource/bin/xe pbd-plug uuid=dfb02709-2483-a11a-cb0e-eac0fb05d8e2

This performs the following:

  • Adds an additional core to Domain 0, to reduce the chance of CPU overload with DRBD
  • Waits for any sync to complete (if DRBD failed, the script will continue, but you will have a split brain, or no DRBD at all)
  • Brings the DRBD device to primary mode. I have had only one DRBD device, but this can be performed selectively for each device
  • Plugs the PBD which, until this point in the boot sequence, was disconnected. An important note – replace the UUID with the one discovered above for each host – each host should plug its own PBD.

To sum it up – until sync has been completed, the PBD will not be plugged, and until then, no VMs can run on this SR. Split brain handling for A/P configuration is so much easier.
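For reference, manual recovery from split brain with DRBD 8.3 looks roughly like this. Decide carefully which node holds the stale data, as its changes will be lost:

# On the node whose data is to be discarded:
drbdadm secondary drbd-sr1
drbdadm -- --discard-my-data connect drbd-sr1
# On the surviving node, if it is in StandAlone state:
drbdadm connect drbd-sr1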

Some additional notes:

  • I have failed horribly when the interconnect cable was down. I did not implement hardware fencing mechanisms, but it would probably be a very good practice for production systems. Disconnecting the cross cable will result in a split brain.
  • For this system to be worthwhile, it has to have external monitoring. DRBD must be monitored at all times – see the sketch after this list.
  • Practice and document cases of single node failure, both nodes failure, host master failure, etc. Make sure you know how to react before it happens in real-life.
  • Performance was measured on a RHEL6 Linux VM at about 82MB/s. The hardware it was tested on was a Dell PE R610 with a very nice RAID5 array, etc. When the 2nd host was down, performance went up to about 450MB/s, so the interconnect bandwidth, in this particular case, matters.
  • Performance test was done using the command:
    dd if=/dev/zero bs=1M of=/tmp/test_file.dd oflag=direct
    Without oflag=direct, you would be testing the write cache of the OS, and not the disk itself (at least – not immediately).
  • I did not test random-access performance.
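As a starting point for such monitoring, here is a minimal check sketch, assuming a single resource and the /proc/drbd format of DRBD 8.3 – wire it into Nagios, Cacti, or whatever you use:

#!/bin/sh
# Alert unless DRBD is connected and both sides are UpToDate
if grep -q 'cs:Connected' /proc/drbd && grep -q 'ds:UpToDate/UpToDate' /proc/drbd ; then
        echo "DRBD OK"
        exit 0
fi
echo "DRBD DEGRADED: `grep cs: /proc/drbd`"
exit 2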
Hope this helps.

Xen VMs performance collection

Saturday, October 18th, 2008

Unlike VMware Server, Xen’s hypervisor does not allow easy collection of performance information. The management machine, called “Domain-0”, is actually a privileged virtual machine, and thus gets its own small share of CPUs and RAM. Collecting performance information on it will lead to, well, collecting performance information for a single VM, and not for the whole bunch.

Local tools, such as “xentop”, allow collection of this information; however, combining them with Cacti, or any other SNMP-based collection tool, is a bit tricky.

A great solution is provided by Ian P. Christian in his blog post about Xen monitoring. He has created a Perl script to collect information. I have taken the liberty to fix several minor things with his permission. The modified scripts are presented below. Name the script (according to your version of Xen) “/usr/local/bin/xen_stats.pl” and set it to be executable:

For Xen 3.1

#!/usr/bin/perl -w

use strict;

# declare...
sub trim($);

# we need to run 2 iterations because CPU stats show 0% on the first, and I'm putting .1 second between them to speed it up
my @result = split(/\n/, `xentop -b -i 2 -d.1`);

# remove everything up to the header of the second iteration
shift(@result);
shift(@result) while @result && $result[0] !~ /^xentop - /;

# the next lines are headings..
shift(@result);
shift(@result);
shift(@result);
shift(@result);

foreach my $line (@result)
{
  my @xenInfo = split(/[\t ]+/, trim($line));
  printf("name: %s, cpu_sec: %d, cpu_percent: %.2f, vbd_rd: %d, vbd_wr: %d\n",
    $xenInfo[0],
    $xenInfo[2],
    $xenInfo[3],
    $xenInfo[14],
    $xenInfo[15]
    );
}

# trims leading and trailing whitespace
sub trim($)
{
  my $string = shift;
  $string =~ s/^\s+//;
  $string =~ s/\s+$//;
  return $string;
}
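Its output is a single line per domain, looking roughly like this (the values are illustrative):

name: Domain-0, cpu_sec: 12345, cpu_percent: 3.20, vbd_rd: 0, vbd_wr: 0
name: vm-web01, cpu_sec: 67890, cpu_percent: 12.50, vbd_rd: 123456, vbd_wr: 654321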

For Xen 3.2 and Xen 3.3

#!/usr/bin/perl -w

use strict;

# declare...
sub trim($);

# we need to run 2 iterations because CPU stats show 0% on the first, and I'm putting .1 second between them to speed it up
my @result = split(/\n/, `/usr/sbin/xentop -b -i 2 -d.1`);

# remove the first line
shift(@result);
shift(@result) while @result && $result[0] !~ /^[\t ]+NAME/;
shift(@result);

foreach my $line (@result)
{
        my @xenInfo = split(/[\t ]+/, trim($line));
        printf("name: %s, cpu_sec: %d, cpu_percent: %.2f, vbd_rd: %d, vbd_wr: %d\n",
        $xenInfo[0],
        $xenInfo[2],
        $xenInfo[3],
        $xenInfo[14],
        $xenInfo[15]
        );
}

# trims leading and trailing whitespace
sub trim($)
{
        my $string = shift;
        $string =~ s/^\s+//;
        $string =~ s/\s+$//;
        return $string;
}

Cron settings for Domain-0

Create a file “/etc/cron.d/xenstat” with the following contents:

# This will run xen_stats.pl every minute
*/1 * * * * root /usr/local/bin/xen_stats.pl > /tmp/xen-stats.new && cat /tmp/xen-stats.new > /var/run/xen-stats

SNMP settings for Domain-0

Add the line below to “/etc/snmp/snmpd.conf” and then restart the snmpd service

extend xen-stats   /bin/cat /var/run/xen-stats
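To verify the extend line works (assuming an SNMP v2c community named “public”), query it with something like:

snmpwalk -v2c -c public localhost 'NET-SNMP-EXTEND-MIB::nsExtendOutputFull."xen-stats"'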

Cacti

I reduced Ian’s Cacti script to a per-server setup, meaning the script gets the host (dom-0) name from Cacti, but cannot support live migrations. I will try to deal with combining live migrations with Cacti in the future.

Download and extract my modified xen_cloud.tar.gz file. Place the script and config in their relevant locations, and import the template into Cacti. It should work like a charm.

A note – the PHP script will work only on PHP5 and above. It works flawlessly on CentOS 5.2 for me.

Graphing on-demand Linux system performance parameters

Tuesday, May 20th, 2008

Current servers are way more powerful than we could have imagined before. With quad-core CPUs, even simple dual-socket servers contain lots of horsepower. Remember our attitude towards CPU power five years ago, and see that we’re now way beyond our needs.

When modern servers are equipped with at least eight cores, other, non-CPU-related issues become noticeable. Storage, as always, remains a common bottleneck, and, as an increase in expectations always accompanies an increase in abilities, memory and other elements can be the cause of performance degradation.

‘sar’ is a well-known tool for Linux and other Unix flavors; however, understanding the contexts within it is not trivial, and while the data is there, figuring out what is relevant for the issue at hand becomes, with more disk devices and more CPUs, even more complicated.

kSar is a simple Java utility which turns this whole mess into simple, readable graphs, capable of being exported to PDF for the pleasure of the customers (where applicable). It parses existing sar output, or the extracted contents of ‘sa’ files (from, by default, /var/log/sa/). It is a useful tool, and I recommend it with all my heart.

Alas, when it comes to parsing ‘sa’ files, you will, in most cases, need either to export the file into text on the source machine, or to use a similar version of the sysstat tools, as changes in versions are reflected in changes in the binary format used by sar.
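A typical export on the source machine looks like this. Forcing the C locale avoids localized timestamps, which kSar might fail to parse; adjust the sa file name to the day you need:

LC_ALL=C sar -A -f /var/log/sa/sa15 > /tmp/sar-sa15.txt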

You can obtain the sysstat utils from here, and compile them for your needs. You will need only ‘sar’ on your own machine.

An important note – you will not be able to compile the sysstat utils using GCC 4.x; only 3.x will do. The error would look like:

warning: ‘packed’ attribute ignored for field of type `unsigned char…

followed by compilation errors. Using GCC version 3.x will work just fine.
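If your distribution carries a 3.x compiler next to the default one (the binary name varies; gcc33 here is only an example), overriding CC on the make command line is usually enough:

make CC=gcc33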

Some few small insights

Wednesday, March 12th, 2008

Lately I have been overloaded way above my capacity. This did not prevent me from doing all kinds of things, but most of them are too small to justify a real entry here, so I have decided to make a small collection of small stuff someone might need to know, in order to get it indexed in search engines. These small insights might save someone some time. This is a noble cause.

1. Oreon is a nice overlay for Nagios; however, it is poorly documented, and some of the existing docs are in French. I have put hours into getting it to a working setup, and I hope to be able to write down the process as it is.

2. “Sun Java System Active Server Pages” does not support 64bit Linux installations – at least not if you’re interested in using it with your existing Apache server. Look here. Seems nothing has changed.

3. Under Ubuntu 7.10, Compiz suffers from a major memory leak when using NVidia display adapters. You can read about it in the bug page. I was able, thanks to this link, to work around it using compiz --indirect-rendering . It does not seem to cause any ill effect on my display performance.

4. Suse 10 and wireless cards – This one is a great guide, which I would happily recommend.

5. Flushing the existing read cache on your Linux machine (should never be done, unless you’re testing performance) can be done by running the following command:

echo 1 > /proc/sys/vm/drop_caches
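Note that echo 1 drops only the page cache. The documented variants also cover dentries and inodes – echo 2 for those alone, echo 3 for everything – and it is a good idea to run sync first:

sync
echo 3 > /proc/sys/vm/drop_caches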

Seems to be enough for today. Hope these tips help.

Quick provisioning of virtual machines

Friday, February 1st, 2008

When one wants to achieve fast provisioning of virtual machines, several solutions come to mind. The one I prefer uses Linux LVM snapshot capabilities to duplicate one working machine into several.

This can happen, of course, only if the host running VMware-Server is Linux.

LVM snapshots have one vast disadvantage – performance. When a block on the source of the snapshot is changed for the first time, the original block is replicated to each and every snapshot’s COW (copy-on-write) space. It means that the creation of a 1GB file on a volume with ten snapshots results in a total copy of 10GB of data across your disks. You cannot ignore this performance impact.

LVM2 has support for read/write snapshots. I have come up with a nice way of utilizing this capability to my benefit. An R/W snapshot which is being changed does not replicate its changes to any other snapshot. All changes are considered local to that snapshot, and are maintained only in its COW space. So adding a 1GB file to a snapshot has zero impact on the rest of the snapshots or volumes.

The idea is quite simple, and it works like this:

1. Create an adequate logical volume of a given size (I used 9GB for my own purposes). The name of the LV in my case will be /dev/VGVM3/centos-base

2. Mount this LV on a directory, and create a VM inside it. In my case, it’s in /vmware/centos-base

3. Install the VM as the baseline for all your future VMs. If you might not want Apache on some of them, don’t install it on the baseline.

4. Install vmware-tools on the baseline.

5. Disable the service “kudzu”

6. Update as required

7. In my case I always use DHCP. You can set it to obtain its IP once from a given location, or whatever you feel like.

8. Shut down the VM.

9. In the VM’s .vmx file add a line like this:

uuid.action = "create"

This setting makes VMware Server answer the “moved or copied” question by itself and generate a new UUID (and MAC address) for each clone, instead of prompting. I have added below two scripts which will create the snapshot, mount it and register it, including the new MAC and UUID.


create-replica.sh:

#!/bin/sh
# This script will replicate vms from a given (predefined) source to a new system
# Written by Ez-Aton, http://www.tournament.org.il/run
# Arguments: name

# FUNCTIONS BE HERE
test_can_do () {
# To be able to snapshot, we need a set of things to happen
if [ -d $DIR/$TARGET ] ; then
echo "Directory already exists. You don't want to do it..."
exit 1
fi
if [ -f $VG/$TARGET ] ; then
echo "Target snapshot exists"
exit 1
fi
if [ `vmrun list | grep -c $DIR/$SRC/$SRC.vmx` -gt "0" ] ; then
echo "Source VM is still running. Shut it down before proceeding"
exit 1
fi
if [ `vmware-cmd -l | grep -c $DIR/$TARGET/$SRC.vmx` -ne "0" ] ; then
echo "VM already registered. Unregister first"
exit 1
fi
}

do_snapshot () {
# Take the snapshot
lvcreate -s -n $TARGET -L $SNAPSIZE $VG/$SRC
RET=$?
if [ "$RET" -ne "0" ]; then
echo "Failed to create snapshot"
exit 1
fi
}

mount_snapshot () {
# This function creates the required directories and mounts the snapshot there
mkdir $DIR/$TARGET
mount $VG/$TARGET $DIR/$TARGET
RET=$?
if [ "$RET" -ne "0" ]; then
echo "Failed to mount snapshot"
exit 1
fi
}

alter_snap_vmx () {
# This function will alter the name in the VMX and make it the $TARGET name
cat $DIR/$TARGET/$SRC.vmx | grep -v "displayName" > $DIR/$TARGET/$TARGET.vmx
echo "displayName = \"$TARGET\"" >> $DIR/$TARGET/$TARGET.vmx
cat $DIR/$TARGET/$TARGET.vmx > $DIR/$TARGET/$SRC.vmx
rm $DIR/$TARGET/$TARGET.vmx
}

register_vm () {
# This function will register the VM to VMware
vmware-cmd -s register $DIR/$TARGET/$SRC.vmx
}

# MAIN
if [ -z "$1" ]; then
echo "Arguments: The target name"
exit 1
fi

# Parameters:
SRC=centos-base         #The name of the source image, and the source dir
PREFIX=centos           #All targets will be created in the name centos-$NAME
DIR=/vmware             #My VMware VMs default dir
SNAPSIZE=6G             #My COW space
VG=/dev/VGVM3           #The name of the VG
TARGET="$PREFIX-$1"

test_can_do
do_snapshot
mount_snapshot
alter_snap_vmx
register_vm
exit 0
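Usage is a single argument – the suffix of the new VM. For example, to clone the baseline into a machine called centos-web1, mounted under /vmware/centos-web1 and registered with VMware:

./create-replica.sh web1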

remove-replica.sh:

#!/bin/sh
# This script will remove a snapshot machine
# Written by Ez-Aton, http://www.tournament.org.il/run
# Arguments: machine name

#FUNCTIONS
does_it_exist () {
# Check if the described VM exists
if [ `vmware-cmd -l | grep -c $DIR/$TARGET/$SRC.vmx` -eq "0" ]; then
echo "No such VM"
exit 1
fi
if [ ! -e $VG/$TARGET ]; then
echo "There is no matching snapshot volume"
exit 1
fi
if [ `lvs $VG/$TARGET | awk '{print $5}' | grep -c $SRC` -eq "0" ]; then
echo "This is not a snapshot, or a snapshot of the wrong LV"
exit 1
fi
}

ask_a_thousand_times () {
# This function verifies that the right thing is actually done
echo "You are about to remove a virtual machine and an LVM. Details:"
echo "Machine name: $TARGET"
echo "Logical Volume: $VG/$TARGET"
echo -n "Are you sure? (y/N): "
read RES
if [ "$RES" != "Y" ] && [ "$RES" != "y" ]; then
echo "Decided not to do it"
exit 0
fi
echo ""
echo "You have asked to remove this machine"
echo -n "Again: Are you sure? (y/N): "
read RES
if [ "$RES" != "Y" ] && [ "$RES" != "y" ]; then
echo "Decided not to do it"
exit 0
fi
echo "Removing VM and snapshot"
}

shut_down_vm () {
# Shut down the VM and unregister it
vmware-cmd $DIR/$TARGET/$SRC.vmx stop hard
vmware-cmd -s unregister $DIR/$TARGET/$SRC.vmx
}

remove_snapshot () {
# Umount and remove the snapshot
umount $DIR/$TARGET
RET=$?
if [ "$RET" -ne "0" ]; then
echo "Cannot umount $DIR/$TARGET"
exit 1
fi
lvremove -f $VG/$TARGET
RET=$?
if [ "$RET" -ne "0" ]; then
echo "Cannot remove snapshot LV"
exit 1
fi
}

remove_dir () {
# Removes the mount point
rmdir $DIR/$TARGET
}

#MAIN
if [ -z "$1" ]; then
echo "No machine name. Exiting"
exit 1
fi

#PARAMETERS:
DIR=/vmware               #VMware default VMs location
VG=/dev/VGVM3             #The name of the VG
PREFIX=centos             #Prefix to the name. All these VMs will be called centos-$NAME
TARGET="$PREFIX-$1"
SRC=centos-base           #The name of the baseline image, LVM, etc. All are the same

does_it_exist
ask_a_thousand_times
shut_down_vm
remove_snapshot
remove_dir

exit 0

Pros:

1. Very fast provisioning. It takes almost five seconds, and that’s because my server is somewhat loaded.

2. Dependable: KISS at its finest.

3. Conservative on space

4. Conservative on I/O load (unlike the traditional use of LVM snapshot, as explained in the beginning of this section).

Cons:

1. Cannot merge the contents of a snapshot back into the origin image (the LVM team will implement it in the future, I think)

2. Cannot take a snapshot of a snapshot (same as above)

3. If the COW space of any of the snapshots fills up (viewable through the command ‘lvs’), then on boot the source LV might not become active (confirmed RH4 bug, and this is the system I have used)

4. My script does not edit/alter /etc/fstab (I decided it would be rather risky, and it was not worth the effort at this time)

5. My script does not check whether there is enough available space in the VG. This is not required, as it will simply fail when the creation of the LV fails

You are most welcome to contribute any further changes done to this script. Please maintain my URL in the script if you decide to use it.

Thanks!