Posts Tagged ‘Linux’

Linux LVM explained

Saturday, July 11th, 2020

You can find a bazillion sites explaining Linux LVM, however, I am preparing my next article, about partition resizing for the advanced user, and it requires a deep understanding of LVM, so I have decided to explain some of the internals of LVM for the advanced user. This article explains how LVM is built more than how to use it, so if you’re looking for the right commands – you are not likely to find them here. If you are looking for a theoretical understanding of how LVM is structured, what a PV, PE, LE and so on are – this is probably an article you want to read.

In general, a block device – a disk, a partition, an SSD, a RAM disk, a file mapped as a block device (loop) or whatever – can be assigned as a ‘physical volume’ (PV) for the purpose of LVM. A physical volume (from now on – PV) is a block device which can hold data and allows random access to it. For ease of definition – a disk or its equivalent. If you can format and mount it – it can act as a PV. The data this PV is required to hold is both the LVM metadata and the PV’s ‘physical extents’ (PE). I will use the term PE.

The ‘Physical Extents’ are small partitions (a logical definition – there is no ‘fdisk’-like tool to create them) that the PV is split into. It means that if we define a PE as a 32MB chunk (this is a logical parameter set when creating the Volume Group – more on that later), the PV will be split into many small 32MB chunks, each with its own sequential number within that PV. We will have PE #0, PE #1 and so on. We, as humans, have (almost) no interaction with this numbering, but it is important that we understand it.
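
As a quick sanity check of the arithmetic (a sketch only – the real number is slightly lower, because the PV also keeps its metadata at its beginning):

# How many 32MB PEs fit in a ~100GB PV? (integer shell arithmetic)
echo $(( 100 * 1024 / 32 ))    # => 3200
# The metadata dump further below reports pe_count = 3198 for a 99.94GB PV - the small
# difference is the metadata area plus rounding down to whole extents.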

All these ‘physical extents’ (PE) which reside on a ‘physical volume’ (PV) are mapped to a logical object called a ‘logical volume’ (LV). A logical volume is the actual object we place our data on. It behaves like any other block device or partition – we can format it, partition it (heaven knows why, but it can be done), mount it (when it has a file system), put our important data on it – and so on. More about what the mapping looks like – later in this article.

The connection between the PEs residing on a PV and the LV is kept in a logical object called a “Volume Group” (VG). A “volume group” (VG) is a logical and theoretical object which merges the PEs provided by multiple PVs into a logical group of objects with a mapping to the LVs. This sounds complicated, I am sure, but we’ll get deeper into it soon.

As said – a VG is a logical object holding PVs (with their PEs) on one hand, and LVs (with their LEs – about those later) on the other hand. It has no ‘real’ existence, except as a group of objects. A PV can be a member of a single VG (but a single VG can have many PVs), and an LV can be a member of a single VG (but again – a single VG can have many LVs). When we look at the metadata, later in this article, it should become clearer.
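
If you want to see these relationships on a live system (a hedged aside – the exact output layout differs between LVM versions), the standard display commands summarise them:

vgs               # one line per VG: the number of member PVs and LVs, and total/free space
pvs               # one line per PV, including the VG it belongs to
lvs               # one line per LV, including the VG it belongs to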

In order to understand how PEs are located on a disk, let’s take a look at this nice drawing.
The drawing shows a disk with basic partitioning: a Master Boot Record (MBR) and two partitions, of which the 2nd is used as an LVM PV.
The PV holds a small metadata signature, and many PEs.

We can ask the LVM mechanism nicely to export the metadata configuration. Since a volume group (VG) can hold multiple PVs (physical volumes, aka – block devices), a copy of the metadata resides at the beginning of each member disk (PV) for the sake of redundancy. This is important when we want to recover a failed LVM setup caused by human error or missing disk(s).
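
The export itself can be produced with ‘vgcfgbackup’ – in fact, this is the exact command recorded inside the dump shown below – and ‘vgcfgrestore’ is its counterpart for recovery:

vgcfgbackup -f /tmp/VG-export.txt VG00     # dump the VG00 metadata into a readable text file
# vgcfgrestore -f /tmp/VG-export.txt VG00  # restore it - only when you fully understand what you are doing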

Moreover – because the LV has only a logical mapping to the PEs residing on the disks (there can be more than one, and even more than three!), the order of the PEs mapped to a single LV doesn’t have to be continuous, nor do they have to reside on a single disk. This is a flexible system, and we’ll get to that later.

I would like to show an exported (backed-up) VG metadata file for the sake of our observation. I will add comments inline for your viewing pleasure.

# Generated by LVM2 version 2.02.98(2)-RHEL6 (2012-10-15): Thu Jun  5 00:00:00 2019

contents = "Text Format Volume Group"
version = 1

### This is the description of the command used to create this file ###
description = "vgcfgbackup -f /tmp/VG-export.txt VG00"

### Some information about the creation host and time ###
creation_host = "localhost.localdomain"	# Linux localhost.localdomain 2.6.32-358.el6.x86_64 #1 SMP Fri Feb 22 00:31:26 UTC 2013 x86_64
creation_time = 1594292258	# Thu Jun  5 00:00:00 2019

### Volume group information ###
VG00 {  ### Name of the Volume Group ###
	id = "8svbhm-euN1-d7Hr-PGIo-yHnH-kIIa-yxECBa"  ### Each object has unique ID to prevent confusion ###
	seqno = 8
	format = "lvm2" # informational
	status = ["RESIZEABLE", "READ", "WRITE"]
	flags = []
	extent_size = 65536		# 32 Megabytes ### The size of a single PE, in sectors. This is the same across the whole VG (all the member PVs), regardless of the PV size! ###
	max_lv = 0   ### Configurable limitations. None.
	max_pv = 0
	metadata_copies = 0

	physical_volumes { ### The list of the member PVs ###

		pv0 {  ### This is the first PV. They will have names like 'pv0' or 'pv1'. Nothing very artistic ###
			id = "FRDFDw-fMrG-ma1d-2rP5-bqck-cFsz-fr2OWf"   ### UUID. A unique identifier allowing for easy scan
			device = "/dev/sda2"	# Hint only ### This is only a hint. Device-mapper (LVM kernel engine) scans for LVM metadata on all disk partitions ###

			status = ["ALLOCATABLE"]  ### Can we allocate PEs from this PV? Why not? We can prevent it from allocating space. On that - some other time ###
			flags = []
			dev_size = 209590272	# 99.9404 Gigabytes ### The PV size in Sectors. This is very important. ###
			pe_start = 2048	### The offset of the first PE (#0) from the beginning of the PV, in sectors ###
			pe_count = 3198	# 99.9375 Gigabytes ### How many PEs do we have here? The size can easily be calculated by multiplying the number of PEs (pe_count) by the size of each PE (extent_size) ###
		}
	}

I will go further into the LV topic shortly, but in the meanwhile – let’s see what we have here. This is the global definition of a Volume Group (VG) and its physical volume(s) (PV). The VG name is ‘VG00’ and it has a unique ID (which is why you do not want to map a storage snapshot of an LVM disk to the same machine in parallel, without fully understanding what you are doing). We have the size of the PE – 32MB in our case. Once the VG has been created – this value cannot be changed. A note – the PEs don’t have an on-disk header, meaning you cannot binary-dump a hard drive and look for the beginning or end of each PE. The PEs are defined as a mapping, and the driver can jump to the right location on the disk. It is fairly easy – calculate the position of the PE you aim at by multiplying the PE size by the sequential number of the PE, jump to that offset relative to the beginning of the PE area in the partition, and you’re there.
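
Here is a small sketch of that arithmetic, using the values from the dump above (extent_size and pe_start are in 512-byte sectors, so the result is a byte offset from the beginning of the PV):

# Byte offset of a given PE inside the PV, using the metadata values shown above
PE_NUM=1813          # for example, the first PE of 'lvswap', which we will meet later
PE_START=2048        # pe_start, in sectors
EXTENT_SIZE=65536    # extent_size, in sectors (32MB)
echo $(( (PE_START + PE_NUM * EXTENT_SIZE) * 512 ))   # byte offset from the start of /dev/sda2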

Let’s look at the PV definition here – we have its UUID, which is extremely important, as it identifies the PV to the VG. Since there is no order constraint on the devices (you can reverse the disk order in a multiple-PV system, and LVM will not be affected) – the only way LVM identifies the member PVs is by looking at their metadata copy, containing their UUID. If the metadata is damaged, missing or has an incorrect UUID, we get to data recovery! (or metadata recovery, which is easier, but still unpleasant).
Since the physical OS disk mapping doesn’t matter – LVM relies on the PV UUID – the block device name is only a hint, for the human who might read this config backup file.
We have the status. A PV can be set to ‘not allocatable’ – let’s say we want to evict a PV from a VG – this can be done; however, in the meanwhile, we would not want anyone allocating data on this soon-to-be-removed PV – so we set it to ‘not allocatable’ to keep it empty.
It can have additional flags, used in cases of external lock management, such as in HA clusters.
Next, it shows the size of the device in sectors; the location where the PEs begin (relative to the beginning of the PV); and the number of PEs present in it.
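
The same per-PV information can be viewed on a live system with ‘pvdisplay’ (a hedged pointer – field names may differ slightly between versions):

pvdisplay /dev/sda2   # shows the PV size, PE size, total/free/allocated PE count and the PV UUID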

Now, let’s look at how an LV is defined. Again – comments inline:

logical_volumes {

		lvroot {  ### The name of the LV ###
			id = "dmaQ5x-eTX0-JRsR-aMhG-Ldz5-SlR6-lAT6EB"  ### A unique identifier.  ###
			status = ["READ", "WRITE", "VISIBLE"] ### It is available R/W and visible. It can be none of these too ###
			flags = [] ### Special arguments. None defined ###
			creation_host = "localhost.localdomain"
			creation_time = 1594157738	# 2019-01-01 08:42:18 +0000
			segment_count = 1 ### An LV can be continuous or split in multiple ways. I will demonstrate that later ###

			segment1 { ### The first continuous area (and the only one, in our case) ###
				start_extent = 0 ### Where does it start with the LOGICAL extent? On that later ###
				extent_count = 875	# 27.3438 Gigabytes ### The amount of LEs used by this segment, meaning - the segment size or length ###

				type = "striped" 	# linear  # There are multiple types. striped is the common one - a linear setup
				stripe_count = 1 ###

				stripes = [ ### Where does this segment reside *physically*? ###
					"pv0", 0 ### On 'pv0' we've seen before! And where does it start? On PE 0 (the first one) ###
				]
			}
		}

		lvswap { ### Another LV ###
			id = "E3Ei62-j0h6-cGu5-w9OB-l9tU-0Qf5-f09bvh"
			status = ["READ", "WRITE", "VISIBLE"]
			flags = []
			creation_host = "localhost.localdomain"
			creation_time = 1594157749	# 2019-01-01 08:42:29 +0000
			segment_count = 1

			segment1 {
				start_extent = 0  ### The first LE of the LV. On LEs - later ###
				extent_count = 94	# 2.9375 Gigabytes

				type = "striped"
				stripe_count = 1	# linear

				stripes = [
					"pv0", 1813 ### Here we start at PE number 1813. More details below ###
				]
			}
		}
	}

Before I explain the LV settings, I need to explain what a ‘Logical Extent’ is. A block device has to be presented to the operating system as a continuous device with random-access capabilities. So, logically, an LV has to be continuous. However – we do know that LVM allows us to modify, migrate and even resize an existing LV onto split areas of a disk or disks (PVs). This is achieved by defining the LV as a set of small chunks, ordered in a continuous manner. They are kept in order, however, since they are logical, they can be mapped to any PEs we have, in a non-ordered fashion. In practice, this means that such a ‘chunk’, called a “Logical Extent” (LE), is the same size as a PE, and maps to one PE (or more, in the case of LVM RAID, which is not covered in this article). So an LV is a continuous array of LEs mapped to a non-continuous list of PEs. This way, LVM satisfies the OS requirement for a block device with the relevant properties, while maintaining flexibility in the actual on-disk positioning.

Here is another image to elaborate some more on the LE-to-PE mapping. This image was taken, with permission, from ‘thegeekdiary’ article explaining Linux LVM basics. If you want to know how to do stuff – you should check this article. I am just explaining how things look internally.

So – back to our configuration. What do we have here? A Logical Volume (LV) is a logical unit with parameters, like name, UUID, status and so on. We can see that the LV called ‘lvroot’ has one ‘segment’ (called ‘segment1’). A segment is an uninterrupted run of continuous blocks, with a logical starting point and a length, plus a mapping of its “extents” (in this configuration – meaning LEs) to their starting point on the PV, written as “PV”, PE_number. In this configuration, we can see that ‘lvroot’ block (LE) 0 begins at the PV ‘pv0’ block (PE) 0.

Here is a configuration dump of the same LV after I have migrated the first 10 PEs to another location on the disk (PV), using the command:
pvmove --alloc anywhere /dev/sda2:0-9

lvroot {
                        id = "dmaQ5x-eTX0-JRsR-aMhG-Ldz5-SlR6-lAT6EB"
                        status = ["READ", "WRITE", "VISIBLE"]
                        flags = []
                        creation_host = "localhost.localdomain"
                        creation_time = 1594157738	# 2019-01-01 08:42:18 +0000
                        segment_count = 2 ### We now have two segments! ###

                        segment1 {  ### This is the beginning of the LV - mapped as LE 0-9 (the first 10, which I have migrated) ###
                                start_extent = 0
                                extent_count = 10       # 320 Megabytes

                                type = "striped"
                                stripe_count = 1        # linear

                                stripes = [
                                        "pv0", 1907 ### They are on pv0, but somewhere further back the disk, on PE 1907 and onwards! ###
                                ]
                        }
                        segment2 { ### This is the next segment - blocks 10 to the end ###
                                start_extent = 10
                                extent_count = 865      # 27.0312 Gigabytes

                                type = "striped"
                                stripe_count = 1        # linear

                                stripes = [
                                        "pv0", 10 ### It resides at the original location, which was PE 10 and onwards ###
                                ]
                        }
                }

The LV mapping has changed to match. The first 10 blocks (LEs) of lvroot are now somewhere else on the disk – on PV ‘pv0’, starting at PE 1907 – while the rest remains in its original position, PE 10 and onwards; but because I’ve split the LV into two chunks, it now requires a second ‘segment’ definition.
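
You can watch this segment-to-PE mapping without exporting the metadata at all; the display commands have a ‘maps’ mode (again, a sketch – the exact column layout depends on your LVM version):

lvdisplay --maps /dev/VG00/lvroot   # lists each segment of the LV with its physical extent range
pvdisplay --maps /dev/sda2          # the same mapping, seen from the PV side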

This concludes my explanation of disk positioning and what it looks like in LVM internals. We went through what a PV is, what a PE is, what an LV and an LE are, and how they are related to each other. Just to stress – a VG is the logical construct tying the PVs and their PEs to the LEs and LVs.

If you find anything incorrect, not clear enough or want me to go further into any detail – drop me a note. I will be happy to hear from you.

Multiple users with the same UID/GID

Monday, February 3rd, 2020

First, let me state that this is not a desirable setup. It can be done because, as root, there are so many things considered “bad practice” that you can still do – this is part of what ‘root’ is all about – you know what your system needs, and you know how to do it, even if it’s in a twisted, weird way.

In this case, there are two users. One of them is an application user, used by the application administrators, who do not share their password (which is good). The other account is used for file transfers into the application’s directory by an external system which does not support SSH keys. So – the first team won’t share their password (which is fine), the second team needs to place files, and a very complex process was devised: copy the files as the second user, and then chown them to the application user.

A quick solution: make both users have the same UID and the same GID. The result is that the first user (the application user) keeps its own password and continues doing whatever it is doing now, while the second user can simply drop files where they should be, and they will remain there with the correct permissions.

A reminder – Linux cares little for user names. They are used in many forward and reverse translations; however, on the filesystem, the user ID and group ID (UID and GID, in that order) are what matter. The file’s metadata holds the numbers, not the names.

A simple solution would be to create the second user with ‘useradd’ and the flag ‘-o’, which means “non-unique”. This is very simple to do, and poses no problem.
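
A minimal sketch of such a creation (the user name, UID and GID below are made up for the example – use the values of your existing application user):

# Assumption: the application user already exists with UID/GID 1000
useradd -o -u 1000 -g 1000 -m transferuser   # -o allows a duplicate (non-unique) UID
passwd transferuser                          # give the transfer account its own password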

However, the application users might see, when running ‘ls’, that the files belong to the other (transfer) user, and vice versa. This is caused not by the current login information, but by the local name-service cache – in particular ‘nscd’, the Name Service Caching Daemon.

So – we would like both users to see their own “name” when listing files, because otherwise it will create some user unrest, which we strive to avoid.

The trick is to disable caching, by editing the file /etc/nscd.conf with these values:

enable-cache passwd no
persistent passwd no

Following that, restart the ‘nscd.service’ on your system, and your users should see their “own” name when listing files.

Extracting multi-layered initramfs

Thursday, December 5th, 2019

The modern kernel specification (can be seen here) defines the initial ramdisk (initrd or initramfs, depending on whom you ask) to allow stacking of compressed or uncompressed CPIO archives. It means, in fact, that you can extend your current initramfs by appending a cpio.gz (or cpio) file at its end, containing the additions or changes to the filesystem (be it directories, files, links or anything else you can think of).

An example of this action:

mkdir /tmp/test
cd /tmp/test
tar -C /home/ezaton/test123 -cf - . | tar xf - # Clones the contents of /home/ezaton/test123 to this location
find ./ | cpio -o -H newc | gzip > ../test.cpio.gz # Creates a compressed CPIO file
cat ../test.cpio.gz >> /boot/initramfs-`uname -r`.img

This should work (I haven’t tried, and if you do it – make sure you have a copy of the original initramfs file!), and the contents of the directory /tmp/test would be reflected in the initramfs.

This method allows us to quickly modify an existing ramdisk, replacing files (the stacked cpio files are extracted in order), and practically – do a lot of neat tricks.

The trickier question, however, is how to extract the stacked CPIO files.
If you create a file containing multiple cpio.gz files, appended, and just try to extract them, only the contents of the first CPIO file would be extracted.

The kernel can do it, and so can we. The basic concept to understand is that GZIP compresses a stream. It means that there is no difference between a file structured as stacked CPIO archives which are then compressed altogether, and a file constructed by appending cpio.gz files. The result is similar, and so is the handling of the file. It also means that we do not need to run a loop of zcat/un-cpio on the file chunk by chunk – when we decompress the file, we decompress it in whole.
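
Here is a tiny demonstration of that property (just a sketch – any two gzip streams will do):

echo hello | gzip >  /tmp/joined.gz
echo world | gzip >> /tmp/joined.gz
zcat /tmp/joined.gz    # prints both 'hello' and 'world' - the appended streams decompress as a single stream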

Let’s create an example file:

cd /tmp
for i in {1..10} ; do
    mkdir test${i}
    touch test${i}/test${i}-file
    find ./test${i} | cpio -o -H newc | gzip > test${i}.cpio.gz
    cat test${i}.cpio.gz >> test-of-all.cpio.gz 
done

This script will create ten directories, called test1 to test10, each containing a single file called test<number>-file. Each of them will both be archived into a dedicated cpio.gz file (named accordingly) and appended to a larger file called test-of-all.cpio.gz.

If we run the following script to extract the contents, we will get only the first CPIO contents:

mkdir /tmp/extract
cd /tmp/extract
zcat ../test-of-all.cpio.gz | cpio -id # Format is newc, but it is auto detected

The result would be the directory ‘test1’ with a single file in it, but nothing else. The trick to extracting all the files is to run the following command:

rm -Rf /tmp/extract # Cleanup
mkdir /tmp/extract
cd /tmp/extract
zcat ../test-of-all.cpio.gz | while cpio -id ; do : ; done

This will extract all the files, until no cpio-formatted data remains. Then the ‘cpio’ command will fail and the loop will end.

Some additional notes:
The ‘:’ is a placeholder (it does nothing), because a ‘while’ loop requires a command. It is a legitimate shell command.

So – now you can extract even complex CPIO structures, such as those found in the older Foreman “Discovery Image” (a very old implementation), Tiny Core Linux (see this forum post, and this wiki note, as references on where this stacking is invoked) and more. That said, for extracting a CentOS/RHEL7 initramfs, which is structured as an uncompressed CPIO archive followed by a cpio.gz file, a different command is required, and a post about it (works for Ubuntu and RHEL) can be found here.

EDIT: It seems the kernel-integrated CPIO extraction method will not “overwrite” a file with a later layer of cpio.gz contents, so I will have to investigate a different approach to that. FYI.

Auto mapping USB Disk on Key to KVM VM using libvirt and udev

Monday, November 19th, 2018

I was required to auto-map a USB DoK to a KVM VM (a specific VM, mind you!) as a result of connecting this device to the host. I’ve looked it up on the Internet, and the closest I could get was this link. It was almost a complete solution, but it had a few bugs, so I will re-describe the whole process, with the fixes I’ve added to the process and the udev rules file. While that guide is rather old, it did solve my requirement, which was to map a specific set of devices (“known USB devices”) to the VM, and not any and every USB device (or even USB DoK) connected to the system.

In my example, I’ve used a SanDisk Corp. Ultra Fit, whose USB identifier is 0781:5583, as can be seen using the ‘lsusb’ command:

# lsusb
Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 001 Device 020: ID 0781:5583 SanDisk Corp. Ultra Fit

My VM is called “centos7.0” in this example. I am using integrated KVM+QEMU+LIBVIRT on a generic CentOS 7.5 system.

Preparation

You will need to prepare two files:

  • USB definitions file (for easier config of libvirt)
  • UDEV rules file (which will be triggered by add/remove operation, and will call the USB definitions file)
USB Definitions file

I’ve placed it in /opt/autousb/hostdev-0781:5583.xml, and it holds the following (mind the USB device identifiers!):

<hostdev mode='subsystem' type='usb'>
  <source>
    <vendor id='0x0781'/>
    <product id='0x5583'/>
  </source>
</hostdev>
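
Before wiring this into udev, you can verify the definition file works by attaching and detaching manually (a sanity-check sketch, using the VM and file names from above):

virsh attach-device centos7.0 /opt/autousb/hostdev-0781:5583.xml   # the device should appear inside the guest
virsh detach-device centos7.0 /opt/autousb/hostdev-0781:5583.xml   # and disappear again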

I’ve created a file /etc/udev/rules.d/90-libvirt-usb.rules with the content below. Note that the device identifiers are there, but in the “remove” section they appear in a different form – the leading zero(s) are removed and the string changes. This is because, on removal, the device does not report all its properties to the OS. Also – you cannot connect more than three (3) such devices to a VM, so if you fail to detach three devices (following consecutive insert/remove operations, for example), you will not be able to attach a fourth one.

ACTION=="add", \
    SUBSYSTEM=="usb", \
    ENV{ID_VENDOR_ID}=="0781", \
    ENV{ID_MODEL_ID}=="5583", \
    RUN+="/usr/bin/virsh attach-device centos7.0 /opt/autousb/hostdev-0781:5583.xml"
ACTION=="remove", \
    SUBSYSTEM=="usb", \
    ENV{PRODUCT}=="781/5583/100", \
    RUN+="/usr/bin/virsh detach-device centos7.0 /opt/autousb/config/hostdev-0781:5583.xml"

Now, all that’s left to do is to reload udev using the following command:

udevadm trigger

To monitor the system behaviour, run either of these commands:

udevadm monitor --property --udev 

or

udevadm monitor --environment --udev 

Linux answers to ARP who-is on the wrong network interface

Friday, April 14th, 2017

Assume a server has two network interfaces as follows:

  • eth0 : 192.168.0.1/24
  • eth1 : 192.168.10.1/24

Let’s assume these interfaces reside on different VLANs. Let’s also assume they were connected incorrectly, in such a way that eth0 is connected to VLAN 10, which serves 192.168.10.0/24, and eth1 is connected to VLAN 2, which serves 192.168.0.0/24.

You would expect that queries by other hosts on VLAN 2 (which is connected to eth1, but serves 192.168.0.0/24!) would not get responses from the server. You are wrong.

Linux will answer who-is queries on VLAN 2, replying with eth1’s MAC address to queries for 192.168.0.1 IP address.

This is a simple example, but it can get ugly if your eth0 mimics a different network and you want the server to be disconnected from the real one. I once had to “forge” a network setup on a different VLAN, mimicking the original network and subnet. However – a “backdoor” I had opened (on an additional NIC) between the mimicking server and the original server, on a different, private IP class, resulted in the mimicking server answering ARP queries, causing the clients to attempt to connect to the mimicking server instead of the production server. The clients could not complete the TCP handshake, because the mimicking server attempted to contact them via eth0, which was on the fake network and did not actually lead anywhere.

This was a more complex example; however, the result is the same – a response on the “wrong” network interface to ARP who-is queries might hijack traffic which should be delivered elsewhere.

There is a solution! You need to set the sysctl parameter arp_ignore to one of the following values. The parameter is hidden in /proc/sys/net/ipv4/conf/<NIC>/arp_ignore.

The parameter’s documentation is as follows:

arp_ignore – INTEGER
Define different modes for sending replies in response to received ARP requests that resolve local target IP addresses:
0 – (default): reply for any local target IP address, configured on any interface
1 – reply only if the target IP address is local address configured on the incoming interface
2 – reply only if the target IP address is local address configured on the incoming interface and both with the sender’s IP address are part from same subnet on this interface
3 – do not reply for local addresses configured with scope host, only resolutions for global and link addresses are replied

The value “1” or “2” would do the trick in such cases.
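
A minimal sketch of applying it (the kernel uses the higher of the ‘all’ and per-interface values, so setting ‘all’ is usually enough; the /etc/sysctl.d/ file name below is just an example):

sysctl -w net.ipv4.conf.all.arp_ignore=1      # apply to all interfaces at runtime
sysctl -w net.ipv4.conf.eth1.arp_ignore=1     # or only to a specific interface
echo 'net.ipv4.conf.all.arp_ignore = 1' > /etc/sysctl.d/90-arp-ignore.conf   # persist across reboots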