Posts Tagged ‘Linux’

Extracting multi-layered initramfs

Thursday, December 5th, 2019

Modern Kernel specification (can be seen here) defined the initial ramdisk (initrd or initramfs, depends on who you ask) to allow stacking of compressed or uncompressed CPIO archives. It means, in fact, that you can extend your current initramfs by appending a cpio.gz (or cpio) file at the end, containing the additions or changes to the filesystem (be it directories, files, links and anything else you can think about).

An example of this action:

1
2
3
4
5
mkdir /tmp/test
cd /tmp/test
tar -C /home/ezaton/test123 -cf - . | tar xf - # Clones the contests of /home/ezaton/test123 to this location
find ./ | cpio -o -H newc > ../test.cpio.gz # Creates a compressed CPIO file
cat ../test.cpio.gz >> /boot/initramfs-`uname -r`.img

This should work (I haven’t tried, and if you do it – make sure you have a copy of the original initramfs file!), and the contents of the directory /tmp/test would be reflected in the initramfs.

This method allows us to quickly modify existing ramdisk, replacing files (the stacked cpio files are extracted by order), and practically – doing allot of neat tricks.

The trickier question, however, is how to extract the stacked CPIO files.
If you create a file containing multiple cpio.gz files, appended, and just try to extract them, only the contents of the first CPIO file would be extracted.

The Kernel can do it, and so are we. The basic concept we need to understand is that GZIP compresses a stream. It means that there is no difference between a file structured of stacked CPIO files, and then compressed altogether, or a file constructed by appending cpio.gz files. The result would be similar, and so is the handling of the file. It also means that we do not need to run a loop of zcat/un-cpio and then again zcat/un-cpio on the file chunk by chunk, but when we decompress the file, we decompress it in whole.

Let’s create an example file:

1
2
3
4
5
6
cd /tmp for i in {1..10} ; do
    mkdir test${i}
    touch test${i}/test${i}-file
    find ./test${i} | cpio -o -H newc | gzip > test${i}.cpio.gz
    cat test${i}.cpio.gz >> test-of-all.cpio.gz 
done

This script will create ten directories called test1 to test10, each containing a single file called test<number>-file. Each of them will both be archived into a dedicated cpio.gz file (named the same) and appended to a larger file called test-of-all.cpio.gz

If we run the following script to extract the contents, we will get only the first CPIO contents:

1
2
3
mkdir /tmp/extract
cd /tmp/extract
zcat ../test-of-all.cpio.gz | cpio -id # Format is newc, but it is auto detected

The resulting would be the directory ‘test1’ with a single file in it, but with nothing else. The trick to extract all files would be to run the following command:

1
2
3
4
rm -Rf /tmp/extrac # Cleanup
mkdir /tmp/extract
cd /tmp/extract
zcat ../test-of-all.cpio.gz | while cpio -id ; do : ; done

This will extract all files, until there is no more cpio format remaining. Then the ‘cpio’ command will fail and the loop would end.

Some additional notes:
The ‘:’ is a place holder (does nothing) because ‘while’ loop requires a command. It is a legitimate command in shell.

So – now you can extract even complex CPIO structures, such as can be found in older Foreman “Discovery Image” (very old implementation), Tiny Core Linux (see this forum post, and this wiki note as reference on where this stacking is invoked) and more. This said, for extracting Centos/RHEL7 initramfs, which is structured of uncompressed CPIO appended by a cpio.gz file, a different command is required, and a post about it (works for Ubuntu and RHEL) can be found here.

Auto mapping USB Disk on Key to KVM VM using libvirt and udev

Monday, November 19th, 2018

I was required to auto map a USB DoK to a KVM VM (specific VM, mind you!), as a result of connecting this device to the host. I’ve looked it up on the Internet, and the closest I could get there was this link. It was almost a complete solution, but it had a few bugs, so I will re-describe the whole process, with the fixes I’ve added to the process and udev rules file. While this guide is rather old, it did solve my requirement, which was to map a specific set of devices (“known USB devices”) to the VM, and not any and every USB device (or even – USB DoK) connected to the system.

In my example, I’ve used SanDisk Corp. Ultra Fit, which its USB identifier is 0781:5583, as can be seen using ‘lsusb’ command:

[[email protected] ~]# lsusb
Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 001 Device 020: ID 0781:5583 SanDisk Corp. Ultra Fit

My VM is called “centos7.0” in this example. I am using integrated KVM+QEMU+LIBVIRT on a generic CentOS 7.5 system.

Preparation

You will need to prepare two files:

  • USB definitions file (for easier config of libvirt)
  • UDEV rules file (which will be triggered by add/remove operation, and will call the USB definitions file)
USB Definitions file

I’ve placed it in /opt/autousb/hostdev-0781:5583.xml , and it holds the following (mind the USB device identifiers!)

1
2
3
4
5
6
<hostdev mode='subsystem' type='usb'>
  <source>
    <vendor id='0x0781'/>
    <product id='0x5583'/>
  </source>
</hostdev>

I’ve created a file /etc/udev/rules.d/90-libvirt-usb.rules with the content below. Note that the device identifiers are there, but in the “remove” section they appear differently. Remove leading zero(s) and change the string. This is caused because on removal, the device does not report all its properties to the OS. Also – you cannot connect more than three (3) such devices to a VM, so when you fail to detach three devices (following a consecutive insert/remove operations, for example), you will not be able to attach a fourth time.

ACTION=="add", \
    SUBSYSTEM=="usb", \
    ENV{ID_VENDOR_ID}=="0781", \
    ENV{ID_MODEL_ID}=="5583", \
    RUN+="/usr/bin/virsh attach-device centos7.0 /opt/autousb/hostdev-0781:5583.xml"
ACTION=="remove", \
    SUBSYSTEM=="usb", \
    ENV{PRODUCT}=="781/5583/100", \
    RUN+="/usr/bin/virsh detach-device centos7.0 /opt/autousb/config/hostdev-0781:5583.xml"

Now, all that’s left to do is to reload udev using the following command:

udevadm trigger

To monitor the system behaviour, run either of these commands:

udevadm monitor --property --udev 

or

udevadm monitor --environment --udev 

Linux answers to ARP who-is on the wrong network interface

Friday, April 14th, 2017

Assume a server has two network interfaces as follows:

  • eth0 : 192.168.0.1/24
  • eth1 : 192.168.10.1/24

Let’s assume these interfaces reside on the different VLANs. Lets assume they were connected incorrectly, in such a way that eth0 is connected to VLAN 10, which servers 192.168.10.0/24 and eth1 is connected to VLAN 2, which serves 192.168.0.0/24.

You would expect that queries by other hosts on VLAN 2 (which is connected to eth1, but serves 192.168.0.0/24!) would not get responses from the server. You are wrong.

Linux will answer who-is queries on VLAN 2, replying with eth1’s MAC address to queries for 192.168.0.1 IP address.

This example is a simple example, but it can get ugly if your eth0 mimics a different network, and you want the server to be disconnected. I have had to “forge” a network setup on a different VLAN, mimicking the original network and subnet. However – a “backdoor” I have opened (on an additional NIC) between the mimicking server and the original server on a different IP class (a private one) resulted in the mimicking server answering to ARP queries, causing the clients to attempt connecting to the mimicking server instead of the production server. The clients could not complete the TCP handcheck because the mimicking server attempted to contact them via eth0, which was on the fake network, and did not actually reach anywhere.

This was a more complex example, however – the result is the same – the response on the “wrong” network interface to ARP who-is queries might hijack data which should be delivered elsewhere.

There is a solution! You need to setup the sysctl parameter arp_ignore  to either of the following values. The parameter is hidden in /proc/sys/net/ipv4/conf/<NIC>/arp_ignore

The parameters documentation is as follows:

arp_ignore – INTEGER
Define different modes for sending replies in response to received ARP requests that resolve local target IP addresses:
0 – (default): reply for any local target IP address, configured on any interface
1 – reply only if the target IP address is local address configured on the incoming interface
2 – reply only if the target IP address is local address configured on the incoming interface and both with the sender’s IP address are part from same subnet on this interface
3 – do not reply for local addresses configured with scope host, only resolutions for global and link addresses are replied

The value “1” or “2” would do the trick in such cases.

Connecting EMC/NetApp shelves as JBOD to a Linux machine

Wednesday, April 29th, 2015

Let’s say you have old shelves of either EMC or NetApp with SAS or SATA disks in them. And let’s say you want to connect them via FC to a Linux machine and have some nice ZFS machine/cluster, or whatever else. There are few things to know, and to attend in order for it to work.

The first one is the sector size. For NetApp – this applies only to non SATA disks (I don’t know about SSDs, though), and for EMC this could apply, as far as I noticed, to all disks – sector size is not 512 bytes, but 520 – the additional 8 bytes are used for block checksum. Linux does not handle well 520 blocks – the following error message will appear in the logs:

Unsupported sector size 520.

To solve it, we will need to identify the disks – using sg3_utils (in Centos-like – yum install sg3_utils) and then modify them to block size of 512 bytes. To identify the disks, run:

sg_scan -i
/dev/sg0: scsi0 channel=3 id=0 lun=0
HP P410i 3.66 [rmb=0 cmdq=1 pqual=0 pdev=0xc]
/dev/sg1: scsi0 channel=0 id=0 lun=0
HP LOGICAL VOLUME 3.66 [rmb=0 cmdq=1 pqual=0 pdev=0x0]
/dev/sg2: scsi3 channel=0 id=0 lun=0 [em]
hp DVD A DS8A5LH 1HE3 [rmb=1 cmdq=0 pqual=0 pdev=0x5]
/dev/sg3: scsi1 channel=0 id=0 lun=0
SEAGATE SX3500071FC DA04 [rmb=0 cmdq=1 pqual=0 pdev=0x0]
/dev/sg4: scsi1 channel=0 id=1 lun=0
SEAGATE SX3500071FC DA04 [rmb=0 cmdq=1 pqual=0 pdev=0x0]
/dev/sg5: scsi1 channel=0 id=2 lun=0
SEAGATE SX3500071FC DA04 [rmb=0 cmdq=1 pqual=0 pdev=0x0]
/dev/sg6: scsi1 channel=0 id=3 lun=0
SEAGATE SX3500071FC DA04 [rmb=0 cmdq=1 pqual=0 pdev=0x0]
/dev/sg7: scsi1 channel=0 id=4 lun=0
SEAGATE SX3500071FC DA04 [rmb=0 cmdq=1 pqual=0 pdev=0x0]
/dev/sg8: scsi1 channel=0 id=5 lun=0
SEAGATE SX3500071FC DA04 [rmb=0 cmdq=1 pqual=0 pdev=0x0]
/dev/sg9: scsi1 channel=0 id=6 lun=0
SEAGATE SX3500071FC DA04 [rmb=0 cmdq=1 pqual=0 pdev=0x0]
/dev/sg10: scsi1 channel=0 id=7 lun=0
SEAGATE SX3500071FC DA04 [rmb=0 cmdq=1 pqual=0 pdev=0x0]
/dev/sg11: scsi1 channel=0 id=8 lun=0
FUJITSU MXW3300FE 0906 [rmb=0 cmdq=1 pqual=0 pdev=0x0]
/dev/sg12: scsi1 channel=0 id=9 lun=0
FUJITSU MXW3300FE 0906 [rmb=0 cmdq=1 pqual=0 pdev=0x0]
/dev/sg13: scsi1 channel=0 id=10 lun=0
SEAGATE SX3300007FC D41B [rmb=0 cmdq=1 pqual=0 pdev=0x0]
/dev/sg14: scsi1 channel=0 id=11 lun=0
SEAGATE SX3300007FC D41B [rmb=0 cmdq=1 pqual=0 pdev=0x0]
/dev/sg15: scsi1 channel=0 id=12 lun=0
SEAGATE SX3300007FC D41B [rmb=0 cmdq=1 pqual=0 pdev=0x0]
/dev/sg16: scsi1 channel=0 id=13 lun=0
SEAGATE SX3300007FC D41B [rmb=0 cmdq=1 pqual=0 pdev=0x0]
/dev/sg17: scsi1 channel=0 id=14 lun=0
SEAGATE SX3300007FC D41B [rmb=0 cmdq=1 pqual=0 pdev=0x0]

So, for each sg device (member of our batch of disks) we need to modify the sector size.

Two ways to do so – the first suggested by this post here, is by using sg_format in the following manner:

sg_format –format –size=512 /dev/sg2

Another post suggested using a dedicated program called ‘setblocksize’. I followed this one, and it worked fine. I had to power cycle the disks before the Linux could use them.

I did notice that disk performance were not bright. I got about 45MB/s write, and about 65-70 MB/s read for large sequential operations, using something like:

dd bs=1M if=/dev/sdf of=/dev/null bs=1M count=10000
dd bs=1M if=/dev/null of=/dev/sdf oflag=direct count=10000 # WARNING – this writes on the disk. Do not use for disks with data!

Fairly disappointing. Also, using multipath, when the shelf is connected to one FC port, and then back to another, showed me that with the setting:

path_grouping_policy multibus

I got about 10MB/s less compared to using “failover” flag (the default for Centos 6). Whatever modification I did to the multipathd.conf, I was unable to exceed this number when using multiple access. These results were consistent when using multibus or group_by_serial, however, when a single path was active and the other was passive, It clearly showed better. I did modify rr_min_io and rr_min_io_rq, but with no effect.

The low disk performance could suggest I need to flush the original disk firmware, however, I am not sure I will do so. If anyone is reading this and had different results – I would love to hear about it.

XenServer 6.5 PCI-Passthrough

Thursday, April 16th, 2015

While searching the web for how to perform PCI-Passthrough on XenServers, we mostly get info about previous versions. Since I have just completed setting up PCI-Passthrough on XenServer version 6. 5 (with recent update 8, just to give you some notion of the exact time frame), I am sharing it here.

Hardware: Cisco UCS blades, with fNIC. I wish to pass through two FC HBAs into a VM (it is going to act as a backup server, and I need it accessing the FC tape). While all my XenServers in this pool have four (4) FC HBAs, this particular XenServer node has six (6). I am intending the first four for SR communication and the remaining two for the PCI Passthrough process.

This is the output of ‘lspci | grep Fibre’:

0b:00.0 Fibre Channel: Cisco Systems Inc VIC FCoE HBA (rev a2)
0c:00.0 Fibre Channel: Cisco Systems Inc VIC FCoE HBA (rev a2)
0d:00.0 Fibre Channel: Cisco Systems Inc VIC FCoE HBA (rev a2)
0e:00.0 Fibre Channel: Cisco Systems Inc VIC FCoE HBA (rev a2)
0f:00.0 Fibre Channel: Cisco Systems Inc VIC FCoE HBA (rev a2)
10:00.0 Fibre Channel: Cisco Systems Inc VIC FCoE HBA (rev a2)

So, I want to pass through 0f:00.0 and 10:00.0. I had to add to /boot/extlinux.conf the following two entries after the word ‘splash’ and before the three dashes:

pciback.hide=(0f:00.0)(10:00.0) xen-pciback.hide=(0f:00.0)(10:00.0)

Initially, and contrary to the documentation, the parameter pciback.hide had no effect. As soon as the VM started, the command ‘multipath -l‘ would hang forever (or until hard reset to the host).

To apply the settings above, run (for a good measure. Don’t think we need it, but did not read anything about it): ‘extlinux -i /boot‘ and then reboot.

Now, when the host is back, we need to add the devices to the VM. Make sure that the VM is in ‘off’ state before doing that. Your command would look like this:

xe vm-param-set uuid=<VM UUID> other-config:pci=0/0000:0f:00.0,0/0000:10:00.0

The expression ‘0/0000’ is required. You can search for its purpose, however, in most cases, your value would look exactly like mine – ‘0/0000’

Since my VM is Windows, here it almost ends: Start the VM, and if it boots correctly, Install Cisco VIC into it, as if it were a physical host. You’re done.