Posts Tagged ‘Virtualization’

XenServer “Internal error: Failure… no loader found”

Saturday, October 24th, 2009

It has been a long time since I last had the time to write here. Lately I have been involved more and more with XenServer virtualization, as you might have seen on the blogs, and having found a solution to a rather common problem, I decided to post it here.

The problem: When attempting to boot a Linux VM on XenServer (5.0 and 5.5), you get the following error message:

Error: Starting VM ‘Cacti’ – Internal error: Failure(“Error from xenguesthelper: caught exception: Failure(\\\”Subprocess failure: Failure(\\\\\\\”xc_dom_linux_build: [2] xc_dom_find_loader: no loader found\\\\\\\\n\\\\\\\”)\\\”)”)

This is very common with Linux VMs which were converted from physical (or other, non-PV virtualization) to XenServer.

It usually shows up either right after the P2V process, or after a successful update of the Linux VM.

The cause is that the original, non-PV-aware kernel has not been removed, and GRUB still defaults to it. XenServer reads the GRUB menu, but does not display it, so we cannot select the kernel we want.

With no chance to intervene, XenServer will attempt to boot a PV-enabled machine using a non-PV kernel, and will fail.

Preventing the problem is quite simple: remove your non-PV (non-xen) kernel, so that future updates will not update it as well and set it back as the default kernel.

Solving the problem in less than two minutes is a bit more tricky. Let’s see how to solve it.

All operations are performed from within the control domain. This guide does not apply to StorageLink or NetApp/Equalogic devices, as they behave differently. This applies only to LVM-over-something, whatever it may be.

First, we will need to find the name of the VDI we are to work on. Use xe in the following manner, using the VM’s name:

xe vbd-list vm-name-label=Cacti

uuid ( RO)             : 128f29dc-4a14-1a2d-75d1-8674d3d2403b
vm-uuid ( RO): eae053de-4a20-28a5-f335-f5a18dd79993
vm-name-label ( RO): Cacti
vdi-uuid ( RO): 90524af4-5b20-4412-9bfe-f1fe27f220b1
empty ( RO): false
device ( RO): xvda

uuid ( RO)             : de177727-b28a-8b79-e73e-d08366d56277
vm-uuid ( RO): eae053de-4a20-28a5-f335-f5a18dd79993
vm-name-label ( RO): Cacti
vdi-uuid ( RO): <not in database>
empty ( RO): true
device ( RO): xvdd

It is very common for xvdd to be the CD-ROM, so we can safely ignore the second section. The first section is the more interesting one. The name of the LV (logical volume) on the disk is derived from the UUID of the VDI, so we can find this specific LV using the following command. Notice that the VDI UUID is used here as the argument for the ‘grep’ command:

lvs | grep 90524af4-5b20-4412-9bfe-f1fe27f220b1

LV-90524af4-5b20-4412-9bfe-f1fe27f220b1 VG_XenStorage-4aa20fc2-fd92-20c2-c549-bed2597c622b -wi-a-  10.00G
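The two lookups above can be chained so the UUID does not have to be copied by hand. This is only a minimal sketch: the function names are my own, and it assumes the `xe vbd-list` and `lvs` output formats shown above, with the LV named `LV-<vdi-uuid>` as in this example.

```shell
#!/bin/sh
# Hypothetical helpers; they only parse text in the formats shown above.

# Print the vdi-uuid of the VBD attached as the given device (default: xvda).
# Reads `xe vbd-list vm-name-label=<name>` output on stdin.
vdi_for_device() {
    awk -v dev="${1:-xvda}" '
        $1 == "vdi-uuid"             { vdi = $NF }   # remember the VDI of the current record
        $1 == "device" && $NF == dev { print vdi }   # the device line comes later in the record
    '
}

# Print /dev/<VG>/<LV> for a VDI uuid. Reads `lvs` output on stdin.
# Assumes the LV is called LV-<vdi-uuid>, as in the example above.
lv_path_for_vdi() {
    awk -v vdi="$1" '$1 == ("LV-" vdi) { print "/dev/" $2 "/" $1 }'
}

# Intended use in the control domain:
#   vdi=$(xe vbd-list vm-name-label=Cacti | vdi_for_device xvda)
#   lvs | lv_path_for_vdi "$vdi"
```

The printed path is the one used as the argument to the commands in the following steps.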

We now have our LV path! The LV must be active (online) before we can use it; in the attribute column of ‘lvs’, an “a” in the fifth position marks an active LV. Using both the VG and the LV name, we can activate it like this:

lvchange -ay /dev/VG_XenStorage-4aa20fc2-fd92-20c2-c549-bed2597c622b/LV-90524af4-5b20-4412-9bfe-f1fe27f220b1

Now we can access the volume. We can check that the problem is indeed the one we are looking for, using pygrub:

pygrub /dev/VG_XenStorage-4aa20fc2-fd92-20c2-c549-bed2597c622b/LV-90524af4-5b20-4412-9bfe-f1fe27f220b1

We should now see the GRUB menu of the VM in question. If you don’t see any menu, either you have missed a step or used the wrong disk.

The menu should list all the kernels. The default one is highlighted, and if its name does not include the word “xen”, we have most likely found the problem.

We now need to change to a PV-capable kernel. We will need to access the “/boot” partition of the Linux VM, and change GRUB’s options there.

First we map the disk to a loop device, so we can access its partitions:

losetup /dev/loop1 /dev/VG_XenStorage-4aa20fc2-fd92-20c2-c549-bed2597c622b/LV-90524af4-5b20-4412-9bfe-f1fe27f220b1

Notice that you need to use the entire path to the LV, that the LV must be online, and that loop1 must not be in use. If it is, you will get a message like “LOOP_SET_FD: Device or resource busy”.

Now we need to access its partitions. We will map them using ‘kpartx’ to /dev/mapper/ devices. Notice we’re using the same loop device name:

kpartx -a /dev/loop1

Now, new files present themselves in /dev/mapper:

ls -la /dev/mapper/
total 0
drwxr-xr-x  2 root root     220 Oct 24 12:39 .
drwxr-xr-x 14 root root   16560 Oct 24 12:31 ..
crw-------  1 root root  10, 62 Sep 29 10:15 control
brw-rw----  1 root disk 252,  5 Oct 24 12:39 loop1p1
brw-rw----  1 root disk 252,  6 Oct 24 12:39 loop1p2
brw-rw----  1 root disk 252,  7 Oct 24 12:39 loop1p3

Usually, the first partition represents /boot, so we can now mount it and work on it:

mount /dev/mapper/loop1p1 /mnt
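With /boot mounted, the GRUB configuration should be at /mnt/grub/menu.lst. Before editing it, a small helper can report which entry is currently the default. This is a sketch under assumptions: it expects a GRUB legacy menu.lst with a numeric `default=N` line (entries counted from 0, in file order), and the function names are my own.

```shell
#!/bin/sh
# Print the title of the default entry in a GRUB legacy menu.lst.
# Entries are numbered from 0; a missing or non-numeric default counts as 0.
grub_default_title() {
    awk '
        BEGIN { def = 0 }
        /^[ \t]*default[ \t=]/ { sub(/^[ \t]*default[ \t=]+/, ""); def = $1 + 0 }
        /^[ \t]*title[ \t]/    { sub(/^[ \t]*title[ \t]+/, ""); titles[n++] = $0 }
        END { if (def in titles) print titles[def] }
    ' "$1"
}

# Succeeds when the default entry mentions "xen", i.e. looks PV-capable.
default_is_xen() {
    grub_default_title "$1" | grep -qi xen
}

# Intended use:
#   default_is_xen /mnt/grub/menu.lst || echo "default entry is not a xen kernel"
```

If the check fails, changing the number on the `default=` line to the position of the xen entry is usually the whole fix.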

All we need to do is edit /mnt/grub/menu.lst to match our requirements, and then wrap everything back up:

umount /mnt

kpartx -d /dev/loop1

losetup -d /dev/loop1

We don’t have to set the LV back to offline, because XenServer will activate it when needed. We can still do it, to be on the safe side:

lvchange -an /dev/VG_XenStorage-4aa20fc2-fd92-20c2-c549-bed2597c622b/LV-90524af4-5b20-4412-9bfe-f1fe27f220b1

Now we can activate the VM, and see it boot successfully.

This whole process takes several minutes the first time, and even less later.
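To shorten the later runs, the steps can be collected into two functions. This is a sketch, not a tested tool: the function names and the DRY_RUN switch are my own, it assumes the first partition is /boot (as above) and that the chosen loop device is free, and it removes the partition mappings with `kpartx -d`.

```shell
#!/bin/sh
# Sketch of the whole procedure. Run as root in the control domain.
# With DRY_RUN=1 the commands are printed instead of executed.

run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then echo "$*"; else "$@"; fi
}

# $1 = full LV path, $2 = free loop device (default: /dev/loop1)
open_guest_boot() {
    lv=$1; loop=${2:-/dev/loop1}
    run lvchange -ay "$lv" &&
    run losetup "$loop" "$lv" &&
    run kpartx -a "$loop" &&
    # assumption: partition 1 is /boot
    run mount "/dev/mapper/$(basename "$loop")p1" /mnt
}

# Undo everything in reverse order.
close_guest_boot() {
    lv=$1; loop=${2:-/dev/loop1}
    run umount /mnt
    run kpartx -d "$loop"
    run losetup -d "$loop"
    run lvchange -an "$lv"
}

# Intended use:
#   open_guest_boot /dev/VG_XenStorage-.../LV-...
#   vi /mnt/grub/menu.lst
#   close_guest_boot /dev/VG_XenStorage-.../LV-...
```

Running it once with DRY_RUN=1 shows the exact commands before anything touches the disk.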

I hope this helps.

Xen VMs performance collection

Saturday, October 18th, 2008

Unlike VMware Server, Xen’s hypervisor does not allow easy collection of performance information. The management machine, called “Domain-0”, is actually a privileged virtual machine, and thus gets its own small share of CPUs and RAM. Collecting performance information on it will lead to, well, collecting performance information for a single VM, and not the whole bunch.

Local tools such as “xentop” allow collection of this information; however, combining them with Cacti, or any other SNMP-based collection tool, is a bit tricky.

A great solution is provided by Ian P. Christian in his blog post about Xen monitoring. He has created a Perl script to collect information. I have taken the liberty to fix several minor things with his permission. The modified scripts are presented below. Name the script (according to your version of Xen) “/usr/local/bin/xen_stats.pl” and set it to be executable:

For Xen 3.1

#!/usr/bin/perl -w
 
use strict;
 
# declare...
sub trim($);
# we need to run 2 iterations because CPU stats show 0% on the first, and I'm putting .1 second between them to speed it up
my @result = split(/\n/, `xentop -b -i 2 -d.1`);
 
# remove the first line
shift(@result);
 
shift(@result) while @result && $result[0] !~ /^xentop - /;
 
# drop the "xentop - " line itself, then the 3 heading lines that follow it
shift(@result);
shift(@result);
shift(@result);
shift(@result);
 
foreach my $line (@result)
{
  my @xenInfo = split(/[\t ]+/, trim($line));
  printf("name: %s, cpu_sec: %d, cpu_percent: %.2f, vbd_rd: %d, vbd_wr: %d\n",
    $xenInfo[0],
    $xenInfo[2],
    $xenInfo[3],
    $xenInfo[14],
    $xenInfo[15]
    );
}
 
# trims leading and trailing whitespace
sub trim($)
{
  my $string = shift;
  $string =~ s/^\s+//;
  $string =~ s/\s+$//;
  return $string;
}

For Xen 3.2 and Xen 3.3

#!/usr/bin/perl -w
 
use strict;
 
# declare…
sub trim($);
 
# we need to run 2 iterations because CPU stats show 0% on the first, and I’m putting .1 second between them to speed it up
my @result = split(/\n/, `/usr/sbin/xentop -b -i 2 -d.1`);
 
# remove the first line
shift(@result);
shift(@result) while @result && $result[0] !~ /^[\t ]+NAME/;
shift(@result);
 
foreach my $line (@result)
{
        my @xenInfo = split(/[\t ]+/, trim($line));
        printf("name: %s, cpu_sec: %d, cpu_percent: %.2f, vbd_rd: %d, vbd_wr: %d\n",
        $xenInfo[0],
        $xenInfo[2],
        $xenInfo[3],
        $xenInfo[14],
        $xenInfo[15]
        );
}
# trims leading and trailing whitespace
sub trim($)
{
        my $string = shift;
        $string =~ s/^\s+//;
        $string =~ s/\s+$//;
        return $string;
}

Cron settings for Domain-0

Create a file “/etc/cron.d/xenstat” with the following contents:

# This will run xen_stats.pl every minute
*/1 * * * * root /usr/local/bin/xen_stats.pl > /tmp/xen-stats.new && cat /tmp/xen-stats.new > /var/run/xen-stats

SNMP settings for Domain-0

Add the line below to “/etc/snmp/snmpd.conf” and then restart the snmpd service

extend xen-stats   /bin/cat /var/run/xen-stats
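Before involving Cacti, it can help to confirm that the stats file holds what you expect. The sketch below pulls one field for one domain out of the file written by the cron job; the function name is my own, and it assumes the exact “name: …, cpu_sec: …” line format printed by xen_stats.pl above. Over SNMP, the same text should be visible through the NET-SNMP-EXTEND-MIB objects (for example nsExtendOutputFull), assuming a net-snmp build that supports “extend”.

```shell
#!/bin/sh
# Print one field for one domain from xen_stats.pl output, e.g. a line like
#   name: Domain-0, cpu_sec: 1234, cpu_percent: 2.50, vbd_rd: 0, vbd_wr: 0
# $1 = domain name, $2 = field name, $3 = stats file (default: /var/run/xen-stats)
xen_stat() {
    awk -F', ' -v dom="$1" -v key="$2" '
        $1 == ("name: " dom) {
            for (i = 2; i <= NF; i++) {
                split($i, kv, ": ")          # kv[1] = field name, kv[2] = value
                if (kv[1] == key) print kv[2]
            }
        }
    ' "${3:-/var/run/xen-stats}"
}

# Intended use on Domain-0:
#   xen_stat Domain-0 cpu_percent
```

An empty result means either the domain name does not match or the cron job has not written the file yet.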

Cacti

I reduced Ian’s Cacti script to a per-server setup, meaning the script gets the host (dom-0) name from Cacti, but cannot support live migrations. I will try to deal with combining live migrations with Cacti in the future.

Download and extract my modified xen_cloud.tar.gz file, place the script and config in their relevant locations, and import the template into Cacti. It should work like a charm.

A note – the PHP script will work only on PHP5 and above. Works flawlessly on Centos5.2 for me.

New version of Cacti, and using spine

Monday, January 21st, 2008

A while ago, a newer version of Cacti became available through Dag’s RPM repository. An upgrade went without any special events, and was nothing to write home about.

A failure in one of my customers’ Cacti systems led me to test the system using “spine”, the new generation of “cactid”.

It felt faster and better, but I had no measurable results (as the broken Cacti system did not work at all). I decided to propagate the change to a local system I have, which runs Cacti locally. This is a virtual machine, dedicated only to this task.

Almost a day later I can see the results. Not only are the measurements continuous, but the load on the system dropped, and the load on the VM server dropped accordingly. Check the graphs below!

[Graph] MySQL CPU load drops at around midnight,
[Graph] as does the number of MySQL locks
[Graph] and InnoDB I/O.
[Graph] A small increase in the number of table locks.
[Graph] A graph which didn’t function starts working.
[Graph] System load average drops dramatically.
[Graph] The same, compared over a longer period of time.
[Graph] The virtual host (the carrier), which runs several other guests in addition to this one, shows a great improvement in CPU consumption without any other change.

These measurements speak for themselves. From now on (unless it’s really vital), spine is my preferred engine.

A note about VMware-Server machine security

Saturday, November 10th, 2007

VMware allows setting a virtual machine as a private machine. Doing so adds a line to “/etc/vmware/vm-list-private”, stating who the owner of the machine is. For example:

cat /etc/vmware/vm-list-private
# This file is automatically generated.
# Hand-editing this file is not recommended.
config "/vmware/Centos4-01/Centos4-01.vmx|root"
config "/vmware/Centos4-02/Centos4-02.vmx|user"

While this is very effective when used with VMware-Console (the nice GUI), where you cannot see machines which are not owned by your own user (in our example, “user”), it has nothing to do with actual permissions on the machine.

Using vmware-cmd you can control machines which are not yours, and are supposed to be private. For example, using

vmware-cmd /vmware/Centos4-01/Centos4-01.vmx stop

as the user “user”, you might be able to turn it off, overriding the obvious (or so you think) permission scheme set up by VMware through the “private guest” setting above.

This actually has to do with the permissions and ownership of the vmx file itself. To revoke an unauthorized user’s ability to control your machines with vmware-cmd, or even to list them, access has to be restricted at the filesystem level.

The best practice I can suggest is to set up a directory for each user or purpose (for example: /vmware for production machines, /qa for QA machines, /user1 for user1’s machines, etc.), and to grant permissions on each directory, recursively, only to the user or group who should be able to control these machines. That way, even “vmware-cmd -l”, which lists the available guests on a host, will not show guests the invoking user has no access to.

To sum things up, private guests are all about how the GUI decides if and when to display them. eXecute permissions on the vmx files will set who can actually control a guest machine.
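The per-directory layout can be applied with ordinary chown and chmod. A minimal sketch, assuming one directory per user or group as suggested above; the function name and the example owner are my own, and the chown step needs root.

```shell
#!/bin/sh
# Restrict a directory of guests to its owning user and group.
# $1 = directory (e.g. /qa)
# $2 = optional "user:group" to hand the directory to (hypothetical example: qa:qa)
restrict_vm_dir() {
    dir=$1
    if [ -n "${2:-}" ]; then
        chown -R "$2" "$dir"
    fi
    # Remove every permission for "others", on the directory and all the
    # files under it, including the .vmx files.
    chmod -R o-rwx "$dir"
}

# Intended use:
#   restrict_vm_dir /qa qa:qa
```

After this, users outside the owning user and group cannot reach the vmx files at all, which is what removes the guests from their view.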

RedHat Cluster, and some more

Sunday, February 12th, 2006

It’s been a long while since I’ve written. Once in a while I get a period of time dedicated to laziness. I’ve had just one of these for the last few weeks, in which I’ve been almost completely idle. Usually, waking up from such idle time leads to a period of self-study and hard work, so I don’t fight my idle periods too hard. This time, I’ve had the pleasure of testing and playing, for personal reasons, both with VMware GSX, in a “Cluster-in-a-Box” setup based on the recommendations for MSCS, altered for Linux (and later, Veritas Cluster Server), and with RedHat Cluster Server, with the notion of playing with RedHat’s GFS as well but, regrettably, without getting to that last one.

First, VMware. In their latest rivalry with Microsoft over virtualization of servers and desktops, MS has gained an advantage lately. Due to the lower price of “Virtual Server 2005” compared with “VMware GSX Server”, and due to their excellent marketing system (from which we should all learn, if I may say!), quite a few servers and virtual server farms, especially the ones running Windows/Windows setups, have moved to this MS solution, which is as capable as VMware GSX Server. Judging by the history of such rivalries, MS would have won. They always have. However, VMware, in an excellent move, has announced that the next generation of their GSX, simply called “Server”, will be free. Free for everyone. By this they probably mean to invest more in their more robust ESX Server, and to give GSX away as a taste of their abilities. Since MS does not have any product more advanced than their Virtual Server, it could mean a death blow to their effort in this direction. It could even mean they will just give away their product! While this happens, we, the customers, will gain a selection of free, advanced and reliable products designed for virtualization. Could it be any better than that?

One more advantage of this “Virtualization for the People” is that community-based virtual images, of even the most complicated setups to install, can and will be widely available. This should shorten installation time and give everyone a quick working system. It will require, however, better knowledge and understanding of the products themselves, as merely installing them will not be enough. To survive the future market, you won’t be able to just sell an installation of a product, but should be able to support an out-of-the-box setup of it. That’s for the freelancers, and the partial freelancers, among us…

So, I’ve reinstalled my GSX, and started playing with it. The original goal was to actually run a working setup of RHEL, VCS and Oracle 10g. Unfortunately, VCS supports only RH3 (update 2?), and not RH4, which was a shame. At that point, I considered using RH Cluster Server for the task at hand. The task turned into learning this cluster server, and nothing more, which I did, and I can and will share my thoughts about it here.

First, names. I’ve had the pleasure of working with numerous cluster solutions. Each time I get to play with another one, I am thrilled anew by the naming conventions, and the name changes vendors make just to keep themselves unique. I hate it. So here’s a little explanation:

All clusters contain a group of resources (Resource Group, as most vendors call them). This group contains a set of resources and, in some cases, relations between them (order of startup, dependencies, etc). Each resource can be any single element required for an application. For example, a resource could be an IP address, without which you won’t be able to contact the application. A resource could be a disk device containing the application’s data. It could be an application start/stop script, and it could be a sub-application: an application required for the whole group to be up, such as a DB for a DB-driven web server.

The order in which you would ask them to start would be IP, disk, DB, web server (in our case). You’d ask the IP to be brought up first because some cluster servers can trick IP-based clients into some delay, so the client hardly feels the short downtime of an application failover. But this is for later.

So, in a resource group, we have resources. If resources have no required dependency between them, it is always better to separate them into different groups. In our previous example, let’s say our web server uses the DB, but contacts it using an IP address or a hostname. In this case, we don’t need the DB to run on the same physical machine as the web server, and, assuming the physical disk holding the DB and the one holding the rest of the web application are not the same disk, we could separate them.

The idea, if I can try to sum it up, is to split your application into the smallest self-maintained structures. Each structure is called a resource group, and each component in this structure is a resource. On some cluster servers, one can group resource groups and set dependencies between them, which allows for even more scalability, but that is not our case.

So we have resource groups containing resources. Each computer that is a member of the cluster is called a node. Now, let’s assume our cluster contains three nodes, but we want our application (our resource group) to be able to run on only two specific ones. In this case, we need to define, for our resource group, which nodes are to be associated with it. In RH Cluster Server, a thing called “Domain” is designed for this. A Domain contains a list of nodes. It can be associated with a Resource Group, and thus sets the failover priority and the group of nodes allowed to run the resource group.

All clusters have a single point of error (unlike failure). The whole purpose of the cluster is to give a non-cluster-aware application the high availability you would expect, for a (relatively) low price. We’re great: we know how to bring an application up, we know how to bring it down. We can assume when the other node(s) is/are down. We cannot be sure of it. We try. We demand several means of communication, so that one link failure won’t cause us to corrupt our shared volumes (by accessing them from more than one node). We set up a whole system of logic, a heartbeat, you name it, to avoid, at almost all costs, a state of split-brain: two cluster nodes each believing they are the only ones up. You can guess what it means, right?

In RH, there is a heartbeat, sure. However, it is based on bonding, in the event of more than one NIC, and not on separate infrastructures. It is a simple network-based heartbeat, with nothing special about it. On loss of connection, the cluster resets the inactive node, if it sees fit, using a mechanism they call “Fence”. A “Fence” is a system by which the cluster can *know* for sure (or almost for sure) that a node is down, or by which it can physically take a node down (power it off if need be): control of the UPS the node is connected to, or of its power switch, or an alternate monitoring infrastructure such as the Fibre Channel switch, etc. In such an event, the cluster can know for sure, or at least assume, that the hung node has been reset, or it can force it to reset, to release some hung application.

Naming – Resource group is called Service. Resource remains resource, but an application resource *must* be defined by an rc-like script, which accepts start/stop (/restart?). Nothing complicated to it, really. The service contains all required resources.

I was not happy with the cluster, if I can sum up my issues with it. It monitored machines (nodes) correctly, but in the simple enough example I chose to set up, using Apache as a resource (only afterwards did I notice it is the example RedHat used in their documentation), it failed miserably to take the correct action when an application failed (as opposed to a node failing). I defined my “Service” to contain the following three items:

1) IP Address – Unique for my testing purposes.

2) Shared partition (in my case, and thanks to VMware, /dev/sdb1, mounted at /var/www/html)

3) The Apache application – “/etc/init.d/httpd”

All in all, it was brought up correctly, and switch-over went just fine, including after both clean and unclean resets of the active/passive node. However, when I killed my Apache (killall httpd), the cluster detected the failure in the application, but was helpless with it. I was unable to bring down the “Service”, as it failed to turn off Apache (duh!), so it released neither the IP address nor the shared volume. I had to restart the rgmanager service on both nodes, after manually removing the remains of the “Service”. I didn’t like it. I expect the cluster to notice a failure in the application, which it did, but I expect it either to try to restart the application (/etc/init.d/httpd stop && /etc/init.d/httpd start) before it fails completely, or to set a flag saying it is down, remove the remains of the “Service” from the node in question (release the IP address and the shared storage), and try to bring it up on the other node(s). It did nothing of the kind. It just failed, completely, and required manual intervention.
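The behaviour I expected can be written down as a short policy. This is only a sketch of that expectation, not of how rgmanager works: app_stop, app_start, app_status, release_resources and relocate_service are hypothetical hooks which the caller would have to define (for Apache, the first three could wrap /etc/init.d/httpd).

```shell
#!/bin/sh
# Sketch of the expected recovery policy; NOT how rgmanager behaves.
# All five hooks used below are hypothetical and must be defined by the caller.

recover_service() {
    # 1. Try a local restart first. The stop may fail because the
    #    application is already dead; that must not block the recovery.
    app_stop || true
    if app_start && app_status; then
        echo "restarted locally"
        return 0
    fi
    # 2. Give up locally: release the IP address and the shared volume,
    #    then try to bring the service up on another node.
    release_resources
    if relocate_service; then
        echo "relocated"
        return 0
    fi
    # 3. Only now is manual intervention justified.
    echo "failed"
    return 1
}
```

The point of the ordering is that a failed stop is treated as information and not as a dead end, which is exactly where my test setup got stuck.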

I expect an HA cluster to be able to react to an application or resource failure, and not just to a node failure. Since HA clusters are meant for the non-ideal world, a place where computers crash, where hardware failures occur, and where applications just die while servers remain working, I expect the Cluster Server to be able to handle the full variety of problems, but maybe I was expecting too much. I believe it will be better in future versions, and I believe it could have been done quite easily right now, since the failed application was detected, but it’s not for me to define the cluster’s abilities. This cluster is not mature enough for real-life production sites, if only because of its failure to react correctly to a resource failure without demanding manual intervention. A year from now, I’ll probably recommend it as a cheap and reliable solution for most common HA-related tasks, but not today.

That leaves me with VCS and Oracle, which I’ll deal with in the future, whether I like it or not :-)