Posts Tagged ‘performance’

Max your CPU performance on Linux servers

Friday, July 2nd, 2021

I have had an interesting experience with HPE servers, where the BIOS was defined to allow max performance (in contrast to ‘balanced’ mode), but still – the CPU was not at max all the time.

While we generally, strive to a greener computing, when having low-latency workload, we expect a deterministic performance. We want our database to provide results in a timely, and more important – a consistent way. We want to expect a certain query/job to take the same time, whenever we run it. CPU throttling and frequency manipulation, which are part of CPU power-management are not our friends. A certain query can take longer, if the CPU is currently throttled, and can switch to another, slower, core mid-task. This can result in a very complex performance tuning and server behaviour troubleshooting.

The goal of setting the BIOS to ‘max-performance’ is obvious, then, however – it does not have the desired effect. The CPU keeps on throttling, and, while some power-reducing features are disabled, not all. This is not our goal.

Adding to the boot options of the server (via the GRUB/GRUB2/GRUB-EFI/Whatever) the following parameters, should disable Intel CPU throttling, and all (i)relevant C_states. I have not tested, nor investigated AMD behaviour:

nosoftlockup intel_idle.max_cstate=0 mce=ignore_ce

Of course – a reboot is required for these settings to take effect.

If you want to control power management afterwards, you can manually disable a certain core/CPU by running a command such as this:

echo 0 > /sys/devices/system/cpu/cpu<number>/online

It will disable the CPU (and place it in deep power save) until woken by another CPU, by echoing ‘1’ to that same ‘file’.

Connecting EMC/NetApp shelves as JBOD to a Linux machine

Wednesday, April 29th, 2015

Let’s say you have old shelves of either EMC or NetApp with SAS or SATA disks in them. And let’s say you want to connect them via FC to a Linux machine and have some nice ZFS machine/cluster, or whatever else. There are few things to know, and to attend in order for it to work.

The first one is the sector size. For NetApp – this applies only to non SATA disks (I don’t know about SSDs, though), and for EMC this could apply, as far as I noticed, to all disks – sector size is not 512 bytes, but 520 – the additional 8 bytes are used for block checksum. Linux does not handle well 520 blocks – the following error message will appear in the logs:

Unsupported sector size 520.

To solve it, we will need to identify the disks – using sg3_utils (in Centos-like – yum install sg3_utils) and then modify them to block size of 512 bytes. To identify the disks, run:

sg_scan -i
/dev/sg0: scsi0 channel=3 id=0 lun=0
HP P410i 3.66 [rmb=0 cmdq=1 pqual=0 pdev=0xc]
/dev/sg1: scsi0 channel=0 id=0 lun=0
HP LOGICAL VOLUME 3.66 [rmb=0 cmdq=1 pqual=0 pdev=0x0]
/dev/sg2: scsi3 channel=0 id=0 lun=0 [em]
hp DVD A DS8A5LH 1HE3 [rmb=1 cmdq=0 pqual=0 pdev=0x5]
/dev/sg3: scsi1 channel=0 id=0 lun=0
SEAGATE SX3500071FC DA04 [rmb=0 cmdq=1 pqual=0 pdev=0x0]
/dev/sg4: scsi1 channel=0 id=1 lun=0
SEAGATE SX3500071FC DA04 [rmb=0 cmdq=1 pqual=0 pdev=0x0]
/dev/sg5: scsi1 channel=0 id=2 lun=0
SEAGATE SX3500071FC DA04 [rmb=0 cmdq=1 pqual=0 pdev=0x0]
/dev/sg6: scsi1 channel=0 id=3 lun=0
SEAGATE SX3500071FC DA04 [rmb=0 cmdq=1 pqual=0 pdev=0x0]
/dev/sg7: scsi1 channel=0 id=4 lun=0
SEAGATE SX3500071FC DA04 [rmb=0 cmdq=1 pqual=0 pdev=0x0]
/dev/sg8: scsi1 channel=0 id=5 lun=0
SEAGATE SX3500071FC DA04 [rmb=0 cmdq=1 pqual=0 pdev=0x0]
/dev/sg9: scsi1 channel=0 id=6 lun=0
SEAGATE SX3500071FC DA04 [rmb=0 cmdq=1 pqual=0 pdev=0x0]
/dev/sg10: scsi1 channel=0 id=7 lun=0
SEAGATE SX3500071FC DA04 [rmb=0 cmdq=1 pqual=0 pdev=0x0]
/dev/sg11: scsi1 channel=0 id=8 lun=0
FUJITSU MXW3300FE 0906 [rmb=0 cmdq=1 pqual=0 pdev=0x0]
/dev/sg12: scsi1 channel=0 id=9 lun=0
FUJITSU MXW3300FE 0906 [rmb=0 cmdq=1 pqual=0 pdev=0x0]
/dev/sg13: scsi1 channel=0 id=10 lun=0
SEAGATE SX3300007FC D41B [rmb=0 cmdq=1 pqual=0 pdev=0x0]
/dev/sg14: scsi1 channel=0 id=11 lun=0
SEAGATE SX3300007FC D41B [rmb=0 cmdq=1 pqual=0 pdev=0x0]
/dev/sg15: scsi1 channel=0 id=12 lun=0
SEAGATE SX3300007FC D41B [rmb=0 cmdq=1 pqual=0 pdev=0x0]
/dev/sg16: scsi1 channel=0 id=13 lun=0
SEAGATE SX3300007FC D41B [rmb=0 cmdq=1 pqual=0 pdev=0x0]
/dev/sg17: scsi1 channel=0 id=14 lun=0
SEAGATE SX3300007FC D41B [rmb=0 cmdq=1 pqual=0 pdev=0x0]

So, for each sg device (member of our batch of disks) we need to modify the sector size.

Two ways to do so – the first suggested by this post here, is by using sg_format in the following manner:

sg_format –format –size=512 /dev/sg2

Another post suggested using a dedicated program called ‘setblocksize’. I followed this one, and it worked fine. I had to power cycle the disks before the Linux could use them.

I did notice that disk performance were not bright. I got about 45MB/s write, and about 65-70 MB/s read for large sequential operations, using something like:

dd bs=1M if=/dev/sdf of=/dev/null bs=1M count=10000
dd bs=1M if=/dev/null of=/dev/sdf oflag=direct count=10000 # WARNING – this writes on the disk. Do not use for disks with data!

Fairly disappointing. Also, using multipath, when the shelf is connected to one FC port, and then back to another, showed me that with the setting:

path_grouping_policy multibus

I got about 10MB/s less compared to using “failover” flag (the default for Centos 6). Whatever modification I did to the multipathd.conf, I was unable to exceed this number when using multiple access. These results were consistent when using multibus or group_by_serial, however, when a single path was active and the other was passive, It clearly showed better. I did modify rr_min_io and rr_min_io_rq, but with no effect.

The low disk performance could suggest I need to flush the original disk firmware, however, I am not sure I will do so. If anyone is reading this and had different results – I would love to hear about it.

XenServer 6.5 PCI-Passthrough

Thursday, April 16th, 2015

While searching the web for how to perform PCI-Passthrough on XenServers, we mostly get info about previous versions. Since I have just completed setting up PCI-Passthrough on XenServer version 6. 5 (with recent update 8, just to give you some notion of the exact time frame), I am sharing it here.

Hardware: Cisco UCS blades, with fNIC. I wish to pass through two FC HBAs into a VM (it is going to act as a backup server, and I need it accessing the FC tape). While all my XenServers in this pool have four (4) FC HBAs, this particular XenServer node has six (6). I am intending the first four for SR communication and the remaining two for the PCI Passthrough process.

This is the output of ‘lspci | grep Fibre’:

0b:00.0 Fibre Channel: Cisco Systems Inc VIC FCoE HBA (rev a2)
0c:00.0 Fibre Channel: Cisco Systems Inc VIC FCoE HBA (rev a2)
0d:00.0 Fibre Channel: Cisco Systems Inc VIC FCoE HBA (rev a2)
0e:00.0 Fibre Channel: Cisco Systems Inc VIC FCoE HBA (rev a2)
0f:00.0 Fibre Channel: Cisco Systems Inc VIC FCoE HBA (rev a2)
10:00.0 Fibre Channel: Cisco Systems Inc VIC FCoE HBA (rev a2)

So, I want to pass through 0f:00.0 and 10:00.0. I had to add to /boot/extlinux.conf the following two entries after the word ‘splash’ and before the three dashes:

pciback.hide=(0f:00.0)(10:00.0) xen-pciback.hide=(0f:00.0)(10:00.0)

Initially, and contrary to the documentation, the parameter pciback.hide had no effect. As soon as the VM started, the command ‘multipath -l‘ would hang forever (or until hard reset to the host).

To apply the settings above, run (for a good measure. Don’t think we need it, but did not read anything about it): ‘extlinux -i /boot‘ and then reboot.

Now, when the host is back, we need to add the devices to the VM. Make sure that the VM is in ‘off’ state before doing that. Your command would look like this:

xe vm-param-set uuid=<VM UUID> other-config:pci=0/0000:0f:00.0,0/0000:10:00.0

The expression ‘0/0000’ is required. You can search for its purpose, however, in most cases, your value would look exactly like mine – ‘0/0000’

Since my VM is Windows, here it almost ends: Start the VM, and if it boots correctly, Install Cisco VIC into it, as if it were a physical host. You’re done.

XenServer 6.0 with DRBD

Wednesday, January 18th, 2012

DRBD is a low-cost shared-SAN-like solution, which has several great benefits, among which are no single point of failure, and very low cost (local storage and network cable). Its main disadvantages are in the need to constantly monitor it, and make sure it does what’s expected. Also – in some cases – performance might be affected greatly.

If you need XenServer pool with VMs XenMotion (used to call it LiveMigration. I liked it better then…), but you cannot afford or do not want classic shared storage acting a single point of failure, DRBD could be for you. You have to understand the limitations, however.

The most important limitation is with data consistency. If you aim at using it as Active/Active, as I have, you need to make sure that under any circumstance you will not have split brain, as it will mean losing data (you will recover to an older point in time). If you aim at Active/Passive, or all your VMs will run on a single host, then the danger is lower, however – for A/A, and VMs spread across both hosts – the danger is imminent, and you should be aware of it.

This does not mean that you will have to run crying in case of split brain. It means you might be required to export/import VMs to maintain consistent data, and that you will have a very long downtime. Kinda defies the purpose of XenMotion and all…

Using the DRBD guid here, you will find an excellent solution, but not a complete one. I will describe my additions to this document.

So, first, you need to download the DRBD packages. I have re-packaged them, as they did not match XenServer with XS60E003 update. You can grub this particular tar.gz here: drbd-8.3.12-xenserver6.0-xs003.tar.gz . I did not use DRBD 8.4.1, as it has shown great instability and liked getting split-brained all the time. Don’t want it with our system, do we?

Make sure you have defined the private link between your hosts, both as a network interface, as described, and in both servers’ /etc/hosts file. It will be easier later. Verify that the host hostname matches the configuration file, else DRBD will not start.

Next, follow the mentioned guide.

Unlike this guide, I did not define DRBD to be Active/Active in the configuration file. I have noticed that upon reboot of the pool master (and always it), probably due to timing issues, as the XE Toolstack did not release the DRBD device, it would have started in split-brain mode, and I was incapable of handling it correctly. No matter when I have tried to set the service to start, as early as possible, it would have always start in split-brain mode.

The workaround was to let it start in passive mode, and while being read-only device, XE Toolstack cannot use it. Then I wait (in /etc/rc.local) for it to complete sync, and connect the PBD.

You will need each host PBD for this specific SR.

You can do it by running:

for i in `xe host-list --minimal` ; do 
echo -n "host `xe host-param-get param-name=hostname uuid=$i`  "
echo "PBD `xe pbd-list sr-uuid=$(xe  sr-list name-label=drbd-sr1 --minimal) --minimal`"
done

This will result in a line per host with the DRBD PBD uuid. Replace drbd-sr1 with your actual DRBD SR name.

You will require this info later.

My drbd.conf file looks like this:

# You can find an example in  /usr/share/doc/drbd.../drbd.conf.example

#include "drbd.d/global_common.conf";
#include "drbd.d/*.res";

resource drbd-sr1 {
protocol C;
startup {
degr-wfc-timeout 120; # 2 minutes.
outdated-wfc-timeout 2; # 2 seconds.
#become-primary-on both;
}

handlers {
    split-brain "/usr/lib/drbd/notify.sh root";
}

disk {
max-bio-bvecs 1;
no-md-flushes;
no-disk-flushes;
no-disk-barrier;
}

net {
allow-two-primaries;
cram-hmac-alg "sha1";
shared-secret "Secr3T";
after-sb-0pri discard-zero-changes;
after-sb-1pri discard-secondary;
after-sb-1pri consensus;
after-sb-2pri disconnect;
#after-sb-2pri call-pri-lost-after-sb;
max-buffers 8192;
max-epoch-size 8192;
sndbuf-size 1024k;
}

syncer {
rate 1G;
al-extents 2099;
}

on xenserver1 {
device /dev/drbd1;
disk /dev/sda3;
address 10.1.1.1:7789;
meta-disk internal;
}
on xenserver2 {
device /dev/drbd1;
disk /dev/sda3;
address 10.1.1.2:7789;
meta-disk internal;
}
}

I did not force them both to become primary, as split-brain handling in A/A mode is very complex. I have forced them to start as secondary.
Then, in /etc/rc.local, I have added the following lines:

echo 1 > /sys/devices/system/cpu/cpu1/online
while grep sync /proc/drbd > /dev/null 2>&1
do
        sleep 5
done
/sbin/drbdadm primary all
/opt/xensource/bin/xe pbd-plug uuid=dfb02709-2483-a11a-cb0e-eac0fb05d8e2

This performs the following:

  • Add an additional core to Domain 0, to reduce chances of CPU overload with DRBD
  • Waits for any sync to complete (if DRBD failed, it will continue, but you will have a split brain, or no DRBD at all)
  • Brings the DRBD device to primary mode. I have had only one DRBD device, but this can be performed selectively for each device
  • Reconnects the PBD which, till this point in the boot sequence, was disconnected. An important note – replace the uuid with the one discovered above for each host – each host should unplug its own PBD.

To sum it up – until sync has been completed, the PBD will not be plugged, and until then, no VMs can run on this SR. Split brain handling for A/P configuration is so much easier.

Some additional notes:

  • I have failed horribly when the interconnect cable was down. I did not implement hardware fencing mechanisms, but it would probably be a very good practice for production systems. Disconnecting the cross cable will result in a split brain.
  • For this system to be worthy, it has to have external monitoring. DRBD must be monitored at all times.
  • Practice and document cases of single node failure, both nodes failure, host master failure, etc. Make sure you know how to react before it happens in real-life.
  • Performance was measured on a Linux RHEL6 VM to be about 82MB/s. The hardware it was tested on was Dell PE R610 with a very nice RAID5 array, etc. When the 2nd host was down, performance resulted in abour 450MB/s, so the bandwidth, in this particular case, matters.
  • Performance test was done using the command:
    dd if=/dev/zero bs=1M of=/tmp/test_file.dd oflag=direct
    Without the oflag=direct, the system will overload the disk write cache of the OS, and not the disk itself (at least – not immediately).
  • I did not test random-access performance.
Hope it helps

Xen VMs performance collection

Saturday, October 18th, 2008

Unlike VMware Server, Xen’s HyperVisor does not allow an easy collection of performance information. The management machine, called “Domain-0” is actually a privileged virtual machine, and thus – get its own small share of CPUs and RAM. Collecting performance information on it will lead to, well, collecting performance information for a single VM, and not the whole bunch.

Local tools, such as “xentop” allows collection of information, however, combining this with Cacti, or any other SNMP-based collection tool is a bit tricky.

A great solution is provided by Ian P. Christian in his blog post about Xen monitoring. He has created a Perl script to collect information. I have taken the liberty to fix several minor things with his permission. The modified scripts are presented below. Name the script (according to your version of Xen) “/usr/local/bin/xen_stats.pl” and set it to be executable:

For Xen 3.1

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
#!/usr/bin/perl -w
 
use strict;
 
# declare…
sub trim($);
 
# we need to run 2 iterations because CPU stats show 0% on the first, and I’m putting .1 second between them to speed it up
my @result = split(/n/, `/usr/sbin/xentop -b -i 2 -d.1`);
 
# remove the first line
shift(@result);
shift(@result) while @result &amp;&amp; $result[0] !~ /^[t ]+NAME/;
shift(@result);
 
foreach my $line (@result)
{
        my @xenInfo = split(/[t ]+/, trim($line));
        printf(“name: %s, cpu_sec: %d, cpu_percent: %.2f, vbd_rd: %d, vbd_wr: %dn,
        $xenInfo[0],
        $xenInfo[2],
        $xenInfo[3],
        $xenInfo[14],
        $xenInfo[15]
        );
}
# trims leading and trailing whitespace
sub trim($)
{
        my $string = shift;
        $string =~ s/^s+//;
        $string =~ s/s+$//;
        return $string;
}

For Xen 3.2 and Xen 3.3

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
#!/usr/bin/perl -w
 
use strict;
 
# declare…
sub trim($);
 
# we need to run 2 iterations because CPU stats show 0% on the first, and I’m putting .1 second between them to speed it up
my @result = split(/n/, `/usr/sbin/xentop -b -i 2 -d.1`);
 
# remove the first line
shift(@result);
shift(@result) while @result &amp;&amp; $result[0] !~ /^[t ]+NAME/;
shift(@result);
 
foreach my $line (@result)
{
        my @xenInfo = split(/[t ]+/, trim($line));
        printf(“name: %s, cpu_sec: %d, cpu_percent: %.2f, vbd_rd: %d, vbd_wr: %dn,
        $xenInfo[0],
        $xenInfo[2],
        $xenInfo[3],
        $xenInfo[14],
        $xenInfo[15]
        );
}
# trims leading and trailing whitespace
sub trim($)
{
        my $string = shift;
        $string =~ s/^s+//;
        $string =~ s/s+$//;
        return $string;
}

Cron settings for Domain-0

Create a file “/etc/cron.d/xenstat” with the following contents:

# This will run xen_stats.pl every minute
*/1 * * * * root /usr/local/bin/xen_stats.pl > /tmp/xen-stats.new && cat /tmp/xen-stats.new > /var/run/xen-stats

SNMP settings for Domain-0

Add the line below to “/etc/snmp/snmpd.conf” and then restart the snmpd service

extend xen-stats   /bin/cat /var/run/xen-stats

Cacti

I reduced Ian Cacti script to be based on a per-server setup, meaning this script gets the host (dom-0) name from Cacti, but cannot support live migrations. I will try to deal with combining live migrations with Cacti in the future.

Download and extract my modified xen_cloud.tar.gz file. Extract it, place the script and config in its relevant location, and import the template into Cacti. It should work like charm.

A note – the PHP script will work only on PHP5 and above. Works flawlessly on Centos5.2 for me.