Posts Tagged ‘storage management’

Extend /boot from within a Linux system

Saturday, March 27th, 2021

This is a tricky one. To resize /boot, which is commonly the first partition, you need to push forward the beginning of the next partition. This is not an easy task, especially if you are not using LVM: in that case you have to use external partition-editing tools, like PQMagic (if it still exists, who knows?), or other such offline tools.

However, if you are using LVM, there is a (complex) trick to it. We need to evict some of the first few PEs, resize the partition to begin at a new location, and then re-sign (and restore) the LVM meta-data in a way which will reflect the relative change in data blocks position (aka – the new PEs). To have some additional grasp of LVM and its meta-data, I recommend you read my article here.

Also, and this is an important note – you cannot change an open (in-use) partition on systems prior to RHEL 8 (on which I have not tested my solution just yet) – meaning – you can change the partition layout on the disk, but the kernel will not refresh that information and would not act accordingly until reboot.

If you have not tried this before, or are not sure about all the details in this post, I urge you to use a VM for testing purposes. A failure in this process might leave your data inaccessible, and you do not want that.

So, we have a complex set of tasks:

  • If there is some empty space somewhere on the LVM PV, migrate the first X blocks out.
  • Export the LVM meta-data so we could edit it afterwards
  • Recreate the partition (delete and recreate) with a new starting location
  • (here comes the tricky part) – sign the partition’s updated beginning with LVM meta-data, with the updated relative block locations.

Assumptions:

  • The disk partition layout is /boot as /dev/sda1 and LVM PV on /dev/sda2
  • The LVM VG name is ‘VG’
  • We are using a modern dracut-capable system, such as RHEL/CentOS version 6 and above (not tested on version 8 yet)
  • We use basic (msdos) partition layout and not GPT

Clear 500MB for further use, if not enough free space in PV:

In order to do so, we will need 500MB of free space in our PV. If space is an issue, you can easily clean up space from the swap space, by stopping swap, reducing the LV size, signing swap with ‘mkswap’ and starting swap again. This is in a nutshell, and I will not go further into it.

Move the first 500MB out of the beginning of the PV:

We need to do some math. The size of a single PE is defined in the LVM VG settings. By default it is 4MB today, and it can be checked using ‘vgdisplay’ command. Look for the field ‘PE Size’. So – 500MB is 125 PEs. So our command would be:

pvmove --alloc anywhere /dev/sda2:0-124

This will migrate the first 125 PEs (positions 0 to 124) to somewhere else in the VG.
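The arithmetic above can be sketched as a small shell helper. This is only an illustration: the function name pe_range_for_mb is made up for this example, and it assumes the amount to free is a whole multiple of the PE size reported by 'vgdisplay'.

```shell
#!/bin/sh
# Hypothetical helper (not part of LVM): print the PE range "first-last"
# covering the first <size_mb> megabytes of a PV, given the PE size in MB.
pe_range_for_mb() {
    size_mb=$1
    pe_size_mb=$2
    if [ $(( size_mb % pe_size_mb )) -ne 0 ]; then
        echo "size must be a multiple of the PE size" >&2
        return 1
    fi
    count=$(( size_mb / pe_size_mb ))
    echo "0-$(( count - 1 ))"
}

# 500MB with 4MB PEs is 125 PEs, numbered 0 to 124
pe_range_for_mb 500 4
```

The printed range can then be appended to the device name in the pvmove call, as in /dev/sda2:0-124.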

Export LVM meta-data to a file, and edit it for future handling:

vgcfgbackup -f /tmp/vg-orig.txt VG 

This command will create a file called /tmp/vg-orig.txt which will contain the original VG meta-data copy. We will clone this file and edit it:

cp /tmp/vg-orig.txt /tmp/vg.txt

Now comes the more complex part. We need to adjust the meta-data file to reflect the relative change in the block location. We will edit the new /tmp/vg.txt file. Initially – find the block describing ‘pv0’, which is the first PV in your VG (and maybe the only one), and verify that ‘pv0’ is the correct device, by verifying the ‘device’ directive in this block.
Now comes the harder part. Each LV block in the meta-data file has a sub-section describing its disk segments. These segments describe the relative location of the LV on the PVs. I have already pointed at my article describing the meta-data file and how to read it. The task is to find the ‘stripes’ directive in each LV segment, and reduce the starting PE number it lists by the number of PEs we have evicted, in our case 125. It needs to be done for every segment of every LV which resides on our ‘pv0’, one after the other. An example would look like this:

lvswap { ### Another LV
			id = "E3Ei62-j0h6-cGu5-w9OB-l9tU-0Qf5-f09bvh"
			status = ["READ", "WRITE", "VISIBLE"]
			flags = []
			creation_host = "localhost.localdomain"
			creation_time = 1594157749	# 2019-01-01 08:42:29 +0000
			segment_count = 1

			segment1 {
				start_extent = 0  ### The LE of the LV. On LEs - later ###
				extent_count = 94	# 2.9375 Gigabytes

				type = "striped"
				stripe_count = 1	# linear

				stripes = [
					# Was: "pv0", 1813. Now:
					"pv0", 1688 ### reduced 125 PEs ###
				]
			}
		}
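Editing many segments by hand is error-prone, so here is a minimal sketch of how the same subtraction could be scripted. It is not a tested part of the procedure: the function name shift_pv_stripes is hypothetical, it assumes every stripe entry appears as "pv0", NUMBER on a single line, and it will also rewrite any comment that happens to contain that pattern. Always compare the result with the original file before using it.

```shell
#!/bin/sh
# Hypothetical helper: read LVM text meta-data on stdin and write it to
# stdout with every '"<pv>", <PE>' stripe entry lowered by <delta> PEs.
shift_pv_stripes() {
    awk -v pv="$1" -v delta="$2" '
    {
        re = "\"" pv "\",[ \t]*[0-9]+"
        if (match($0, re)) {
            head = substr($0, 1, RSTART - 1)
            hit  = substr($0, RSTART, RLENGTH)
            tail = substr($0, RSTART + RLENGTH)
            n = hit
            sub(/^.*,[ \t]*/, "", n)    # keep only the PE number
            if (n - delta < 0) {
                # A segment still starts inside the area we meant to evict
                print "PE " n " is below " delta ", aborting" | "cat 1>&2"
                exit 1
            }
            $0 = head "\"" pv "\", " (n - delta) tail
        }
        print
    }'
}
```

A possible use would be to run it as shift_pv_stripes pv0 125 with /tmp/vg-orig.txt as input and /tmp/vg.txt as output, followed by a 'diff' of the two files to review every changed line.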

Copy the resulting file /tmp/vg.txt (after double-checking it!) to /boot. We will use /boot later on to re-sign the PV meta-data.

Recreate the partition:

Another tricky part. You cannot just resize a partition, or at least not without the tool (parted or fdisk, depending on your OS version) attempting to resize the layer on top of it, and failing to do so. Most tools do not allow changing the size of a partition at all, so we will need to delete and recreate the partitions. Your tools might vary depending on whether you are using a GPT or an msdos partition layout. Here I handle only the msdos layout, and the tools are chosen accordingly. Other tools can apply for the GPT layout, and the process, in general, will work on GPT as well.

So – we will backup the partition layout before we change it. The command ‘sfdisk’ will allow us to do so, so we can call

sfdisk -d /dev/sda > /boot/original-disk-layout.txt

I am leaving quite a lot of stuff on /boot partition, because this partition is not a member of the LVM volume group, and will remain, mostly, unaffected during our process. You can use an external USB disk, or any other non-LVM partition, as long as you verify you can access it from within the boot process, directly from initrd/initramfs, or dracut. /boot is, commonly, accessible from within the boot process.

Now we modify the partition layout. To do so, I recommend documenting the original start point of the two interesting partitions: /boot (usually /dev/sda1) and our PV (in this example: /dev/sda2). I prefer using ‘sector’ directives. An example would be:

parted -s /dev/sda "unit s p"

It is common, for modern Linux systems, to have /boot starting at sector 2048 (which is 1MB into the disk). This is due to block alignment, which I will not discuss here. The interesting part is the size of a sector (commonly 512 bytes, but it can be 4K for ‘advanced format’ disks), so we will be able to calculate the new partitions' starting positions and sizes.
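If you prefer to capture the start and end sectors in variables instead of copying them by hand, parted's machine-readable mode (-m) can help. The sketch below parses such output; the helper name part_field and the sample listing are assumptions made for illustration, so check the output of your own parted version first.

```shell
#!/bin/sh
# Hypothetical helper: read "parted -sm <disk> unit s print" output on stdin
# and print one field of partition <number>, without the trailing "s".
# Field 2 is the start sector, field 3 is the end sector.
part_field() {
    awk -F: -v num="$1" -v field="$2" '
        $1 == num { v = $field; sub(/s$/, "", v); print v }'
}

# Sample listing, invented for illustration (not taken from a real disk)
sample='BYT;
/dev/sda:209715200s:scsi:512:512:msdos:Example disk:;
1:2048s:411647s:409600s:xfs::boot;
2:411648s:209715199s:209303552s:::lvm;'

StartOfPart=$(printf '%s\n' "$sample" | part_field 2 2)
EndOfPart=$(printf '%s\n' "$sample" | part_field 2 3)
echo "$StartOfPart $EndOfPart"
```

On a real system the sample text would be replaced by piping the parted output into part_field.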

Now, using ‘parted’, we need to remove the 2nd partition (in my example; it might vary on your setup) and recreate it at a new location: 125 PEs further, or 500MB further, or 1024000 sectors ahead when a sector is 512 bytes. So, if our starting sector is 411648(s), then we will have to create the partition starting at sector 1435648 (=411648+1024000), with the original ending location. Don’t forget to set this partition to LVM. Assuming you have saved the starting point of the partition in the variable StartOfPart, and the original ending in EndOfPart, your command would look like this:

parted -s /dev/sda "unit s rm 2 mkpart primary $(( StartOfPart + 1024000 )) ${EndOfPart} set 2 lvm on"
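The sector arithmetic used in this command can be checked with a short sketch. The helper name mb_to_sectors is made up for this example, and the starting sector is the example value from the text; substitute your own numbers.

```shell
#!/bin/sh
# Hypothetical helper: convert megabytes to sectors for a given sector size.
mb_to_sectors() {
    echo $(( $1 * 1024 * 1024 / $2 ))
}

StartOfPart=411648                 # example value from the text
Shift=$(mb_to_sectors 500 512)     # 500MB expressed in 512-byte sectors
NewStart=$(( StartOfPart + Shift ))
echo "$Shift $NewStart"
```

With 512-byte sectors this prints 1024000 and 1435648, matching the numbers above. On a disk with 4K sectors the shift would be 128000 sectors instead.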

Now, we need to recreate the /boot partition (partition #1 in my example) to include the new size. Again, we need to document its beginning and its original end, and now recreate it. Assuming we reuse the same variable names as before, this time holding the original start and end sectors of partition #1, the command would look like this:

parted -s /dev/sda "unit s rm 1 mkpart primary ${StartOfPart} $(( EndOfPart + 1024000 - 1 )) set 1 boot on"

The kernel will not update the new partition sizes because the partitions are in use. We will need a reboot. However, when we reboot (do not do that just yet), we will no longer have access to our LVM. This is because the PV meta-data was located at the old beginning of the partition, so the new beginning carries no meta-data, and we will need to recreate it.

Prepare a script to place in /boot, called vgrecover.sh which will hold the following lines:

#!/bin/sh
sed -i 's/locking_type = 4/locking_type = 0/g' /etc/lvm/lvm.conf
lvm pvcreate -u ${PVID} --restorefile /mnt/vg.txt /dev/sda2
lvm vgcfgrestore -f /mnt/vg.txt VG

You need to save the PVID for /dev/sda2 and replace this value in this script. This is the field ‘PV UUID’ in the output of the command:

pvdisplay /dev/sda2
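To avoid copy-and-paste mistakes with the UUID, it can be extracted from the ‘pvdisplay’ output by a script. This is a minimal sketch: the helper name is hypothetical, and the sample text reuses the UUID from my LVM article purely for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read "pvdisplay" output on stdin, print the PV UUID.
pv_uuid_from_pvdisplay() {
    awk '$1 == "PV" && $2 == "UUID" { print $3 }'
}

# Sample output, shortened and invented for illustration
sample='  PV Name               /dev/sda2
  VG Name               VG
  PV UUID               FRDFDw-fMrG-ma1d-2rP5-bqck-cFsz-fr2OWf'

PVID=$(printf '%s\n' "$sample" | pv_uuid_from_pvdisplay)
echo "$PVID"
```

On a real system the output of ‘pvdisplay /dev/sda2’ would be piped into the helper, and the resulting value written into vgrecover.sh in place of ${PVID}.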

Some more explanations: The device in our example is /dev/sda2 (change it to match your device name), and the VG name is ‘VG’ (again – change to match your setup). This script needs to be placed on /boot and be made executable.

Before our reboot:

We need to verify the following files exist on our /boot:

  • vg.txt
  • vgrecover.sh
  • original-disk-layout.txt

If any of these files is missing, you will not be able to boot, you will not be able to recover your system, and you will not be able to access the data there ever again!
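Since a missing file is fatal here, it may be worth checking for the files with a script rather than by eye. The following is a sketch under the assumptions of this post (the three file names above, stored in one directory); the function name check_boot_files is made up.

```shell
#!/bin/sh
# Hypothetical pre-reboot check: verify the recovery files exist in <dir>,
# are not empty, and that vgrecover.sh is executable.
# Returns non-zero if anything is wrong.
check_boot_files() {
    dir=$1
    rc=0
    for f in vg.txt vgrecover.sh original-disk-layout.txt; do
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $dir/$f" >&2
            rc=1
        fi
    done
    if [ ! -x "$dir/vgrecover.sh" ]; then
        echo "not executable: $dir/vgrecover.sh" >&2
        rc=1
    fi
    return $rc
}
```

Calling check_boot_files /boot before the reboot should print nothing and return success. It only checks that the files are present, not that their content is correct.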

I also recommend you keep your original-disk-layout.txt file somewhere external. If you have made a partitioning mistake and changed the beginning of /boot, you will not have access to /boot and all its files, and having this file elsewhere (on external disk, for example) will help you recover the partition layout quickly and with no frustration.

Now comes another risky part: reboot and get into the recovery shell used by GRUB. See my article here to understand how to enter the recovery shell. If you have a different OS version, your boot arguments might differ. An external boot media (like RHEL/CentOS recovery boot, or Ubuntu live) could also suffice to complete the task, but it is preferred to use the GRUB recovery console to reduce the chance of some unknown automatic task or detection process doing stuff for you.

We need to break the boot sequence in the pre-mount phase. We will have a minimal shell on which we need to run the following commands:

mkdir /mnt
mount /dev/sda1 /mnt
/mnt/vgrecover.sh

We are mounting /dev/sda1 (our /boot) on /mnt, which we have just created. Then we call the vgrecover.sh script we have created before. It will use LVM recovery commands to re-sign the PV on /dev/sda2, and then recover the VG meta-data using our modified meta-data file, which describes the new relative positions of the LVs.

When done, assuming no problems happened there, just umount /mnt and reboot. The system should boot up successfully, however, /boot will not have the designated size just yet.

Extending /boot :

The partition /dev/sda1 is of the updated size now, however, the filesystem is not. You can verify that using ‘fdisk -l /dev/sda’ or ‘parted -s /dev/sda unit s p’ or any other command. If this is not the case, then check your process.

Extending the filesystem depends on the type of filesystem. You can run ‘df -hPT /boot’ to identify the filesystem type. If it is XFS, use the command:

xfs_growfs /boot

If the filesystem is of type ext3 or ext4, use

resize2fs /dev/sda1

Other filesystems will require different tools, and since I cannot cover it all, I leave it to you. This is an online process, and as soon as it is over, the new size will show in the ‘df’ command.
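The choice between the two commands can be sketched as a small dry-run helper that only prints the command to run. The function name grow_command is hypothetical, and only the filesystem types covered in this post are handled.

```shell
#!/bin/sh
# Hypothetical helper: print (not run) the grow command for a filesystem type.
# Arguments: filesystem type, mount point, block device.
grow_command() {
    fstype=$1
    mountpoint=$2
    device=$3
    case "$fstype" in
        xfs)        echo "xfs_growfs $mountpoint" ;;
        ext3|ext4)  echo "resize2fs $device" ;;
        *)          echo "unsupported filesystem: $fstype" >&2
                    return 1 ;;
    esac
}

grow_command xfs /boot /dev/sda1
grow_command ext4 /boot /dev/sda1
```

Note the difference the helper encodes: xfs_growfs takes the mount point, while resize2fs takes the block device.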

Recovery:

If, for some reason, the disk partitioning or PV re-signing failed, and the system cannot boot, you can use the original-disk-layout.txt file in /boot to recover the original disk layout. Boot into GRUB rescue mode as shown above, and run:

mkdir /mnt
mount /dev/sda1 /mnt
sfdisk -f /dev/sda < /mnt/original-disk-layout.txt

If your /boot is inaccessible, and the file original-disk-layout.txt was kept on an external storage, you can use a live Ubuntu, or any other live system to run the ‘sfdisk’ command as shown above to recover /dev/sda original partitioning layout.

Bottom line:

This is a possible, although complex, task, and you should practice it on a VM, with disk snapshots before you attempt to kill production servers. Leave me a comment if it worked, or if there is anything I need to add or correct in this post. Thanks, and good luck!

Linux LVM explained

Saturday, July 11th, 2020

You can find a bazillion sites explaining Linux LVM. However, I am preparing for my next article, about partition resize for the advanced user, which requires a deep understanding of LVM, so I have decided to explain some of the internals of LVM for the advanced user. This explains how it is built more than how to use it, so if you’re looking for the right commands, you are not likely to find them here. If you are looking for the theoretical understanding of how LVM is structured, what is PV, PE, LE and so on, this is probably an article you want to read.

In general, a block device, be it a disk, a partition, an SSD, a RamDisk, a file mapped as a block device (loop) or whatever, can be signed as a ‘physical volume’ (PV) for the purpose of LVM. A physical volume (from now on, PV) is a block device which can hold data and allow random access to it. For ease of definition: a disk or its equivalent. If you can format and mount it, it can act as a PV. The data this PV is required to hold is both the LVM metadata, and the PV ‘physical extents’ (PE). I will use the term PE.

The ‘Physical Extents’ are small partitions (a logical definition, there is no ‘fdisk’ like tool to create them) the PV is split into. It means that if we define a PE as a 32MB chunk (this is a logical parameter set when creating the Volume Group, more on that later), the PV will be split into many small 32MB chunks, each with its own sequential number in this PV. We will have PE #0, PE #1 and so on. We, as humans, have (almost) no interaction with this numbering, but it is important that we understand it.

All these ‘physical extents’ (PE) which reside on a ‘physical volume’ (PV) are mapped to a logical object called ‘logical volume’ (LV). A logical volume is the actual object we can use to place our data on. It behaves like any other block device or partition: we can format it, partition it (heaven knows why, but it can be done), mount it (when it has a file system), put our important data on it, and so on. What the mapping looks like is covered later in this article.

The connection between PE residing on a PV to the LV is kept in a logical object called “Volume Group” (VG). A “volume group” (VG) is a logical and theoretical object which merges the PE provided by multiple PVs into a logical group of objects with a mapping to the LV. This sounds complicated, I am sure, but we’ll get deeper into it soon.

As said, a VG is a logical object holding PVs (with their PEs) on one hand, and LVs (with their LEs, more on them later) on the other hand. It has no ‘real’ existence, except as a group of objects. A PV can be a member of a single VG (but a single VG can have many PVs), and an LV can be a member of a single VG (but again, a single VG can have many LVs). When we look at the metadata, later in this article, it should become clearer.

In order to understand how PEs are located on a disk, let’s take a look at this drawing.
The drawing shows a disk with basic partitioning: a Master Boot Record (MBR) and two partitions, of which the 2nd is used as an LVM PV.
The PV has a small metadata signature, and many PEs.

We can ask the LVM mechanism nicely to export the metadata configuration. Since a volume group (VG) can hold multiple PVs (physical volumes, aka – block devices) the metadata will reside in the beginning of each disk (PV) for the sake of redundancy. This is important when we want to recover a failed LVM caused by human error or missing disk(s).

Moreover, because the LV has only a logical mapping to the PEs residing on disks (there can be more than one, and even more than three!), the order of the PEs mapped to a single LV doesn’t have to be continuous, nor does it have to reside on a single disk. This is a flexible system, and we’ll get to that later.

I would like to show an exported (backed-up) VG metadata file for the sake of our observation. I will add comments inline for your viewing pleasure.

# Generated by LVM2 version 2.02.98(2)-RHEL6 (2012-10-15): Thu Jun  5 00:00:00 2019

contents = "Text Format Volume Group"
version = 1

### This is the description of the command used to create this file ###
description = "vgcfgbackup -f /tmp/VG-export.txt VG00"

### Some information about the creation host and time ###
creation_host = "localhost.localdomain"	# Linux localhost.localdomain 2.6.32-358.el6.x86_64 #1 SMP Fri Feb 22 00:31:26 UTC 2013 x86_64
creation_time = 1594292258	# Thu Jun  5 00:00:00 2019

### Volume group information ###
VG00 {  ### Name of the Volume Group ###
	id = "8svbhm-euN1-d7Hr-PGIo-yHnH-kIIa-yxECBa"  ### Each object has unique ID to prevent confusion ###
	seqno = 8
	format = "lvm2" # informational
	status = ["RESIZEABLE", "READ", "WRITE"]
	flags = []
	extent_size = 65536		# 32 Megabytes ### The size of a single PE in Sectors. This is across all VG (all the member PVs), regardless of the PV size! ###
	max_lv = 0   ### Configurable limitations. None.
	max_pv = 0
	metadata_copies = 0

	physical_volumes { ### The list of the member PVs ###

		pv0 {  ### This is the first PV. They will have names like 'pv0' or 'pv1'. Nothing very artistic ###
			id = "FRDFDw-fMrG-ma1d-2rP5-bqck-cFsz-fr2OWf"   ### UUID. A unique identifier allowing for easy scan
			device = "/dev/sda2"	# Hint only ### This is only a hint. Device-mapper (LVM kernel engine) scans for LVM metadata on all disk partitions ###

			status = ["ALLOCATABLE"]  ### Can we allocate PEs from this PV? Why not? We can prevent it from allocating space. On that - some other time ###
			flags = []
			dev_size = 209590272	# 99.9404 Gigabytes ### The PV size in Sectors. This is very important. ###
			pe_start = 2048 # The offset of the first PE, #0, from the beginning of the PV, in Sectors ###
			pe_count = 3198	# 99.9375 Gigabytes # How many PEs do we have here? The size can be easily calculated by multiplying the amount of PEs (pe_count) with the size of each PE (extent_size)
		}
	}

I will go further into the LV topic shortly, but in the meanwhile, let’s see what we have here. This is the global definition of a Volume Group (VG) and its physical volume(s) (PV). The VG name is ‘VG00’ and it has a unique ID (which is why you do not want to map storage snapshots of an LVM to the same machine in parallel, without fully understanding what you are doing). We have the size of the PE, 32MB in our case. Once the VG has been created, this size cannot easily be changed. A note: the PEs don’t have a header on-disk, meaning you cannot binary-dump a hard drive and look for the beginning or end of each PE. The PEs are defined as a mapping, and the driver can jump to the right location on the disk. It is fairly easy: calculate the position of the PE you aim at by multiplying the PE size by the sequential number of the PE, add the offset of the first PE (pe_start), jump to this position relative to the beginning of the partition, and you’re there.
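As a worked example of that calculation, here is a minimal sketch using the values from the dump above (pe_start = 2048 and extent_size = 65536, both in sectors). The function name is made up, and the result is an offset in sectors from the beginning of the PV, not from the beginning of the disk.

```shell
#!/bin/sh
# Hypothetical helper: first sector of a PE, relative to the start of the PV.
# Arguments: pe_start (sectors), extent_size (sectors), PE number.
pe_first_sector() {
    echo $(( $1 + $2 * $3 ))
}

pe_first_sector 2048 65536 0       # PE #0 starts right at pe_start
pe_first_sector 2048 65536 1813    # an arbitrary PE further into the PV
```

The second call prints 118818816, that is 2048 + 1813 * 65536.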

Let’s look at the PV definition here. We have its UUID, which is extremely important, as it identifies the PV for the VG. Since there is no order constraint on the devices (you can reverse the disk order for a multiple-PV system, and LVM will not get affected), the only way LVM identifies the member PVs is by looking at their metadata copy, containing their UUID. If the metadata is damaged, missing or has an incorrect UUID, we get to data recovery! (or metadata recovery, which is easier, but still unpleasant).
Since the physical OS disk mapping doesn’t matter, because LVM makes use of PV UUID, the block device name is only a hint, for the human who might read this config backup file.
We have the status. A PV can be set to ‘not allocatable’ – let’s say we want to evict a PV from a VG – this can be done, however, in the meanwhile, we would not want anyone allocating data on this soon-to-be-removed PV – so we set it to ‘not allocatable’ to keep it empty.
It can have additional flags, used in cases of external lock management like in HA clusters.
Next, it shows the size of the device in sectors ; the PE beginning location (relative to the beginning of the PV), and the amount of PEs present in it.
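These three numbers can be cross-checked against each other. The sketch below uses the values from the dump (dev_size, pe_start, pe_count and extent_size) and assumes 512-byte sectors; the function names are invented for this example.

```shell
#!/bin/sh
# Hypothetical helpers, all sizes in sectors unless stated otherwise.

# Space covered by the PEs, in MiB: pe_count * extent_size * sector bytes
pes_size_mib() {
    echo $(( $1 * $2 * $3 / 1048576 ))
}

# Sectors left over after the last PE: dev_size - pe_start - pe_count * extent_size
pv_unused_tail() {
    echo $(( $1 - $2 - $3 * $4 ))
}

pes_size_mib 3198 65536 512                 # 102336 MiB, i.e. 99.9375 GiB
pv_unused_tail 209590272 2048 3198 65536    # 4096 sectors, less than one PE
```

The leftover at the end is smaller than a single PE, which is why it cannot be allocated.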

Now, let’s look at how an LV is defined. Again – comments inline:

logical_volumes {

		lvroot {  ### The name of the LV ###
			id = "dmaQ5x-eTX0-JRsR-aMhG-Ldz5-SlR6-lAT6EB"  ### A unique identifier.  ###
			status = ["READ", "WRITE", "VISIBLE"] ### It is available R/W and visible. It can be none of these too ###
			flags = [] ### Special arguments. None defined ###
			creation_host = "localhost.localdomain"
			creation_time = 1594157738	# 2019-01-01 08:42:18 +0000
			segment_count = 1 ### An LV can be continuous or split in multiple ways. I will demonstrate that later ###

			segment1 { ### The first continuous area (and the only one, in our case) ###
				start_extent = 0 ### Where does it start with the LOGICAL extent? On that later ###
				extent_count = 875	# 27.3438 Gigabytes ### The amount of LEs used by this segment, meaning - the segment size or length ###

				type = "striped" 	# linear  # There are multiple types. striped is the common one - a linear setup
				stripe_count = 1 ###

				stripes = [ ### Where does this segment reside *physically*? ###
					"pv0", 0 ### On 'pv0' we've seen before! And where does it start? On PE 0 (the first one) ###
				]
			}
		}

		lvswap { ### Another LV
			id = "E3Ei62-j0h6-cGu5-w9OB-l9tU-0Qf5-f09bvh"
			status = ["READ", "WRITE", "VISIBLE"]
			flags = []
			creation_host = "localhost.localdomain"
			creation_time = 1594157749	# 2019-01-01 08:42:29 +0000
			segment_count = 1

			segment1 {
				start_extent = 0  ### The LE of the LV. On LEs - later ###
				extent_count = 94	# 2.9375 Gigabytes

				type = "striped"
				stripe_count = 1	# linear

				stripes = [
					"pv0", 1813 ### Here we start at PE number 1813. More details below ###
				]
			}
		}
	}

Before I explain the LV settings, I need to explain what a ‘Logical Extent’ is. A block device has to be presented to the operating system as a continuous device with random-access capabilities. So, logically, an LV has to be continuous. However, we do know that LVM allows us to modify, migrate and even resize an existing LV into split areas of a disk or disks (PVs). This is achieved by defining an LV as made out of a set of small chunks, ordered in a continuous manner. Since these chunks are logical, they can be mapped to any PEs we have, in no particular order. It means, practically, that this ‘chunk’, called a “Logical Extent” (LE), is the size of a PE, and maps to one PE (or more, in cases of LVM RAID, which is not covered in this article). So an LV has a continuous array of LEs mapped to a non-continuous list of PEs. This way, LVM can satisfy the OS requirement for a block device, with the relevant properties, while maintaining flexibility with the actual disk positioning.

Here is another image to elaborate some more on the LE-to-PE mapping. This image was taken, with permission, from ‘thegeekdiary’ article explaining Linux LVM basics. If you want to know how to do stuff – you should check this article. I am just explaining how things look internally.

So – Back to our configuration. What do we have here? A Logical Volume (LV) is a logical unit with parameters, like name, UUID, status and so on. We can see that the LV called ‘lvroot’ has one ‘segment’ (called ‘segment1’). A segment is an uninterrupted list of continuous blocks, with a logical starting point and length (aka – uninterrupted list) with mapping of “extents” (in the configuration – meaning LE) to the starting point on the PV, defined as “PV”, PE_number. In this configuration, we can see that ‘lvroot’ block (LE) 0 begins at the PV ‘pv0’ block (PE) 0.

Here is a configuration dump of the same LV after I have migrated the first 10 PEs to another location on the disk (PV), using the command
pvmove --alloc anywhere /dev/sda2:0-9

lvroot {
                        id = "dmaQ5x-eTX0-JRsR-aMhG-Ldz5-SlR6-lAT6EB"
                        status = ["READ", "WRITE", "VISIBLE"]
                        flags = []
                        creation_host = "localhost.localdomain"
                        creation_time = 1594157738	# 2019-01-01 08:42:18 +0000
                        segment_count = 2 ### We now have two segments! ###

                        segment1 {  ### This is the beginning of the LV - mapped as LE 0-9 (the first 10, which I have migrated) ###
                                start_extent = 0
                                extent_count = 10       # 320 Megabytes

                                type = "striped"
                                stripe_count = 1        # linear

                                stripes = [
                                        "pv0", 1907 ### They are on pv0, but somewhere further back the disk, on PE 1907 and onwards! ###
                                ]
                        }
                        segment2 { # This is the next segment, of blocks 10 to the end ###
                                start_extent = 10
                                extent_count = 865      # 27.0312 Gigabytes

                                type = "striped"
                                stripe_count = 1        # linear

                                stripes = [
                                        "pv0", 10 ### It resides at the original location, which was PE 10 and onwards ###
                                ]
                        }
                }

The LV mapping has changed to match the change. The first 10 blocks (LEs) of lvroot are somewhere else on the disk on PV ‘pv0’ at location 1907, and the next segment of blocks remains in its original position – blocks 10 and onwards, except that because I’ve split the LV into two chunks, it has to have a new ‘segment’ definition.
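To make the LE-to-PE lookup concrete, here is a minimal sketch that walks a list of segments the same way the mapping is read from the metadata. The helper name le_to_pe and the one-line-per-segment input format are my own simplifications, not an LVM format.

```shell
#!/bin/sh
# Hypothetical helper: map a logical extent to its PV and physical extent.
# stdin holds one segment per line: start_extent extent_count pv first_pe
le_to_pe() {
    awk -v le="$1" '
        le >= $1 && le < $1 + $2 { print $3, $4 + (le - $1); found = 1; exit }
        END { if (!found) exit 1 }'
}

# The two lvroot segments from the dump above
segments='0 10 pv0 1907
10 865 pv0 10'

printf '%s\n' "$segments" | le_to_pe 5     # inside segment1
printf '%s\n' "$segments" | le_to_pe 10    # first LE of segment2
```

LE 5 resolves to PE 1912 on pv0 (1907 + 5), while LE 10 resolves to PE 10, its original location. An LE beyond the end of the LV makes the helper return a non-zero status.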

This concludes my explanation of disk positioning and what it looks like, with LVM internals. We went through what PV is, what PE is, what LV and LE are, and how they are related to each other. Just to stress: a VG is a logical construct combining the PVs and PEs with the LEs and LVs.

If you find anything incorrect, not clear enough or want me to go further into any detail – drop me a note. I will be happy to hear from you.

Hot-resize disks on Linux

Monday, April 6th, 2020

After major investigations around, I came to the conclusion that a full guide describing the procedure required for online disk resize on Linux (especially expanding disks) is needed. I have created a guide for RHEL5/6/7/8 (works the same for Centos or OEL or ScientificLinux – RHEL-based Linux systems) which takes into account the following four scenarios:

  • Expanding a disk where there is a filesystem directly on disk (no partitioning used)
  • Expanding a disk where there is LVM PV directly on disk (no partitioning used)
  • Expanding a disk where there is a filesystem on partition (a single partition taking all the disk’s space)
  • Expanding a disk where there is an LVM PV on partition (a single partition taking all the disk’s space)

All four scenarios were tested with and without use of multipath (device-mapper-multipath). Also – notes about using GPT compared to MBR are given. The purpose is to provide a full guideline for hot-extending disks.

This document does not describe the process of extending disks on the storage/virtualisation/NAS/whatever end. Updating the storage client configuration to refresh the disk topology might differ in various versions of Linux and storage communication methods – iSCSI, FC, FCoE, AoE, local virtualised disk (VMware/KVM/Xen/XenServer/HyperV) and so on. Each connectivity/OS combination might require different refresh method called on the client. In this lab, I use iSCSI and iSCSI software initiator.

The Lab

A storage server running Linux (Centos 7) with targetcli tools exporting a 5GB (or more) LUN through iSCSI to Linux clients running Centos5, Centos6, Centos7 and Centos8, with the latest updates (5.11, 6.10, 7.7, 8.1). See some interesting insights on iSCSI target disk expansion using Linux LIO (targetcli command line) in my previous post.

The iSCSI clients all see the disk as ‘/dev/sda’ block device. When using LVM, the volume group name is tempvg and the logical volume name is templv. When using multipath, the mpath name is mpatha. On some systems the mpath partition would appear as mpatha1 and on others as mpathap1.

The iSCSI client disks/partitions were prepared like this:

Centos5:

* Filesystem on disk

mkfs.ext3 /dev/sda
mount /dev/sda /mnt

* LVM on disk

pvcreate /dev/sda
vgcreate tempvg /dev/sda
lvcreate -l 100%FREE -n templv tempvg
mkfs.ext3 /dev/tempvg/templv
mount /dev/tempvg/templv /mnt

* Filesystem on partition

parted -s /dev/sda "mklabel msdos mkpart primary 1 -1"
mkfs.ext3 /dev/sda1
mount /dev/sda1 /mnt

* LVM on partition

parted -s /dev/sda "mklabel msdos mkpart primary 1 -1 set 1 lvm on"
pvcreate /dev/sda1
vgcreate tempvg /dev/sda1
lvcreate -l 100%FREE -n templv tempvg
mkfs.ext3 /dev/tempvg/templv
mount /dev/tempvg/templv /mnt

Centos6:

* Filesystem on disk

mkfs.ext4 /dev/sda
mount /dev/sda /mnt

* LVM on disk

pvcreate /dev/sda
vgcreate tempvg /dev/sda
lvcreate -l 100%FREE -n templv tempvg
mkfs.ext4 /dev/tempvg/templv
mount /dev/tempvg/templv /mnt

* Filesystem on partition

parted -s /dev/sda "mklabel msdos mkpart primary 1 -1"
mkfs.ext4 /dev/sda1
mount /dev/sda1 /mnt

* LVM on partition

parted -s /dev/sda "mklabel msdos mkpart primary 1 -1 set 1 lvm on"
pvcreate /dev/sda1
vgcreate tempvg /dev/sda1
lvcreate -l 100%FREE -n templv tempvg
mkfs.ext4 /dev/tempvg/templv
mount /dev/tempvg/templv /mnt

Centos7/8:

* Filesystem on disk

mkfs.xfs /dev/sda
mount /dev/sda /mnt

* LVM on disk

pvcreate /dev/sda
vgcreate tempvg /dev/sda
lvcreate -l 100%FREE -n templv tempvg
mkfs.xfs /dev/tempvg/templv
mount /dev/tempvg/templv /mnt

* Filesystem on partition

parted -a optimal -s /dev/sda "mklabel msdos mkpart primary 1 -1"
mkfs.xfs /dev/sda1
mount /dev/sda1 /mnt

* LVM on partition

parted -a optimal -s /dev/sda "mklabel msdos mkpart primary 1 -1 set 1 lvm on"
pvcreate /dev/sda1
vgcreate tempvg /dev/sda1
lvcreate -l 100%FREE -n templv tempvg
mkfs.xfs /dev/tempvg/templv
mount /dev/tempvg/templv /mnt

Some variations might exist. For example, use of ‘GPT’ partition layout would result in a parted command like this:

parted -s /dev/sda "mklabel gpt mkpart ' ' 1 -1"

Also, for multipath devices, replace the block device /dev/sda with /dev/mapper/mpatha, like this:

parted -a optimal -s /dev/mapper/mpatha "mklabel msdos mkpart primary 1 -1"

There are several common tasks, such as expanding filesystems – for XFS, using xfs_growfs <mount target> ; for ext3fs and ext4fs using resize2fs <device path>. Same goes for LVM expansion – using pvresize <device path>, followed by lvextend command, followed by the filesystem expanding command as noted above.

The document layout

The document will describe the client commands for each OS, sorted by action. The process would be as following:

  • Expand the visualised storage layout (storage has already expanded LUN. Now we need the OS to update to the change)
  • (if in use) Expand the multipath device
  • (if partitioned) Expand the partition
  • Expand the LVM PV
  • Expand the filesystem
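The order of these steps, and which of them apply to a given setup, can be sketched as a dry-run helper. It only prints step names and runs nothing; the function name expansion_plan and its yes/no arguments are assumptions made for this illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the ordered expansion steps for a given setup.
# Arguments are "yes" or "no": multipath, partitioned, lvm.
expansion_plan() {
    multipath=$1
    partitioned=$2
    lvm=$3
    echo "rescan storage"
    if [ "$multipath" = yes ]; then echo "resize multipath map"; fi
    if [ "$partitioned" = yes ]; then echo "expand partition"; fi
    if [ "$lvm" = yes ]; then echo "expand LVM PV and LV"; fi
    echo "expand filesystem"
}

# LVM PV on a partition, without multipath
expansion_plan no yes yes
```

The simplest scenario, a filesystem directly on a non-multipath disk, reduces to the first and last steps only.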

Actions

For each OS/scenario/multipath combination, we will format and mount the relevant block device, and attempt an online expansion.

Operations following disk expansion

Expanding the visualised storage layout

For iSCSI, it works quite the same for all OS versions. For other transport types, actions might differ.

iscsiadm -m node -R

Expanding multipath device

If using multipath device (device-mapper-multipath), an update to the multipath device layout is required. Run the following command (for all OSes)

multipathd -k"resize map mpatha"

Expanding the partition (if disk partitions are in use)

This part is a bit complicated. Both the capabilities and the commands in use differ greatly between different versions of the operating system.

Centos 5/6

Online expansion of partition is impossible, except if used with device-mapper-multipath, in which case we force the multipath device to refresh its paths to recreate the device. It will result in an I/O error if there is only a single path defined. For non-multipath setup, a umount and re-mount is required. Disk partition layout cannot be read while the disk is in use.

Without Multipath
fdisk /dev/sda # Delete and recreate the partition from the same starting point
partprobe # Run when disk is not mounted, or else it will not refresh partition size

With Multipath
fdisk /dev/mapper/mpatha # Delete and recreate the partition from the same starting point
partprobe
multipathd -k"reconfigure" # Sufficient for Centos 6
multipathd -k"remove path sda" # Required for Centos 5
multipathd -k"add path sda" # Required for Centos 5
# Repeat for all sub-paths of expanded device

Centos 7/8

Without Multipath
fdisk /dev/sda # Delete and recreate partition from the same starting point. Sufficient for Centos 8
partx -u /dev/sda # Required for Centos 7

With Multipath
fdisk /dev/mapper/mpatha # Delete and recreate the partition from the same starting point. Sufficient for Centos 8
kpartx -u /dev/mapper/mpatha # Can use partx
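
The version and multipath matrix above is easy to get wrong, so here is a small lookup sketch that prints which follow-up command goes with which combination. The function name and its arguments are my own, and `<path>` is a placeholder for each sub-path of the multipath device. It only prints commands and does not run them.

```shell
# Print the follow-up command(s) to run after fdisk has rewritten the partition.
# $1 = CentOS major version, $2 = yes/no for multipath, $3 = block device
refresh_after_fdisk () {
   local major=$1 multipath=$2 dev=$3
   case "$major:$multipath" in
      5:no|6:no) echo "partprobe   # only works while $dev is not mounted" ;;
      5:yes)     echo "partprobe"
                 echo "multipathd -k\"remove path <path>\"   # per sub-path"
                 echo "multipathd -k\"add path <path>\"      # per sub-path" ;;
      6:yes)     echo "partprobe"
                 echo "multipathd -k\"reconfigure\"" ;;
      7:no)      echo "partx -u $dev" ;;
      7:yes)     echo "kpartx -u $dev" ;;
      8:*)       echo "# nothing further, fdisk alone is sufficient" ;;
      *)         echo "unsupported combination: $major/$multipath" >&2
                 return 1 ;;
   esac
}

refresh_after_fdisk 7 no /dev/sda
refresh_after_fdisk 6 yes /dev/mapper/mpatha
```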

Expanding LVM PV and LV

pvresize DEVICE
DEVICE can be /dev/sda ; /dev/sda1 ; /dev/mapper/mpatha ; /dev/mapper/mpathap1 ; /dev/mapper/mpatha1 – according to the disk layout and LVM choice. Then extend the LV:
lvextend -l +100%FREE /dev/tempvg/templv
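
A dry-run wrapper can help when reviewing the PV and LV steps before touching a production volume. The sketch below is my own arrangement, not part of the tested procedure: the `run` helper and the `DRY_RUN` variable are hypothetical names, and /dev/sda2 is an example device. The `-r` (`--resizefs`) flag asks lvextend to grow the filesystem as well; whether your release supports it is an assumption to verify with `man lvextend`.

```shell
# Echo the commands unless DRY_RUN=0 is set, so the sequence can be reviewed first
run () {
   if [ "${DRY_RUN:-1}" = "1" ]
   then
      echo "+ $*"
   else
      "$@"
   fi
}

# $1 = PV device, $2 = LV path
expand_pv_and_lv () {
   run pvresize "$1" &&
   run lvextend -r -l +100%FREE "$2"   # -r also grows the filesystem on the LV
}

expand_pv_and_lv /dev/sda2 /dev/tempvg/templv
```

If `-r` is used, the separate filesystem expansion step that follows is not needed for that LV.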

Expanding filesystem

For ext3fs and ext4fs
resize2fs DEVICE
DEVICE can be /dev/sda ; /dev/sda1 ; /dev/mapper/mpatha ; /dev/mapper/mpathap1 ; /dev/mapper/mpatha1 ; /dev/tempvg/templv – according to the disk layout and LVM choice.
For xfs
xfs_growfs /mnt
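
The main thing to remember is that resize2fs takes the block device while xfs_growfs takes the mount point. A minimal sketch of that choice follows; the function name is mine and the paths are only examples. It prints the command rather than running it.

```shell
# Print the grow command for a filesystem type.
# $1 = filesystem type, $2 = block device, $3 = mount point
grow_fs_cmd () {
   case "$1" in
      ext3|ext4) echo "resize2fs $2" ;;
      xfs)       echo "xfs_growfs $3" ;;
      *)         echo "no grow command known for '$1'" >&2
                 return 1 ;;
   esac
}

# The type could come from: blkid -o value -s TYPE /dev/tempvg/templv
grow_fs_cmd ext4 /dev/tempvg/templv /mnt
grow_fs_cmd xfs  /dev/tempvg/templv /mnt
```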

Additional Considerations

MBR vs GPT

On most Linux versions (for CentOS, up to and including version 7) the ‘fdisk’ command is incapable of handling a GPT partition layout. If using a GPT partition layout, gdisk is recommended, if it exists for the OS. If not, parted is a decent, although somewhat limited, alternative.

The gdisk command can also convert a partition layout (at your own risk, of course) from MBR to GPT and vice versa. This is very useful for avoiding large data migrations where a legacy MBR partition layout was used on disks which are to be expanded beyond the 2TB limit.
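
The 2TB figure comes from MBR storing sector addresses in 32 bits. A quick arithmetic sketch, assuming 512-byte sectors (disks with larger logical sectors shift the limit); the variable and function names are mine:

```shell
# MBR stores sector addresses in 32 bits, so with 512-byte sectors:
MBR_MAX_BYTES=$(( 4294967296 * 512 ))   # 2^32 sectors = 2 TiB

# Succeeds if a disk of $1 bytes cannot be fully addressed by MBR
needs_gpt () {
   [ "$1" -gt "$MBR_MAX_BYTES" ]
}

echo "$MBR_MAX_BYTES"   # prints 2199023255552
needs_gpt $(( 3 * 1024 * 1024 * 1024 * 1024 )) && echo "a 3 TiB disk needs GPT"
```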

The GPT backup table is located at the end of the disk, so when extending a GPT disk, it is required to repair the GPT backup table. Based on my lab tests, it is impossible to both extend the partition and repair the GPT backup table location in a single call to gdisk. Two runs are required: one to fix the GPT backup table, and then, after the changes were saved, another to extend the partition.
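
For scripting, the same two passes can be sketched with sgdisk, the non-interactive tool from the gdisk package. Note that this is a substitution on my part: the lab tests above used interactive gdisk, and I have not verified this sequence on every release. The function name, the start sector and the partition number are hypothetical. The sketch only prints the commands.

```shell
# Print (not run) a two-pass sgdisk sequence for growing a partition.
# $1 = disk, $2 = partition number, $3 = original start sector, $4 = type code
gpt_extend_plan () {
   # Pass 1: move the backup GPT structures to the new end of the disk
   echo "sgdisk -e $1"
   # Pass 2: delete and recreate the partition from the same start sector;
   # an end value of 0 means the end of the largest available block
   echo "sgdisk -d $2 -n $2:$3:0 -t $2:$4 $1"
}

# 8e00 is the gdisk type code for Linux LVM; the start sector is made up here
# and should be read from the real table first (sgdisk -i 2 /dev/sda)
gpt_extend_plan /dev/sda 2 1050624 8e00
```

Recreating a partition this way assigns it a new unique GUID, so anything that refers to the partition by PARTUUID would need updating.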

Storage transport

I have demonstrated the use of the iSCSI software initiator on Linux. Different storage transports exist – each may require its own method of ‘notifying’ the OS of a changed storage layout. See RedHat’s article about disk resizing (RHN access required). This article explains how to refresh the storage transport for a combination of various transports and RHEL versions and sub-versions.

Relocating LVs with snapshots

Monday, February 2nd, 2009

Linux LVM is a wonderful thing. It is scalable, flexible, and truly, almost enterprise-class in every detail. It falls short, of course, in IO performance for LVM snapshots, but this can be worked around in several creative ways (if I haven’t shown them here before, I will sometime).

What it can’t do is deal with a mixture of stripes, mirrors and snapshots in a single logical volume. It cannot allow you to mirror a striped LV (even if you can meet the requirements), and it cannot allow you to snapshot a mirrored or a striped volume. You get the idea. A volume you can protect, you cannot snapshot. A volume with snapshots cannot be mirrored or altered.

For the normal user, what you get is usually enough. For storage management per se, this is just not enough. When I wanted to reduce a VG (remove a disk from an existing volume group), I had to evacuate it from any existing logical volume. The command to perform this action is ‘pvmove‘, which is capable of relocating data from within a PV to other PVs. This is done by mirroring each logical volume and then removing the origin.

Mirroring, however, cannot be performed on LVs with snapshots, or on an already mirrored LV, so these require different handling.

We can detect which LVs reside on our physical volume by issuing the following command:

pvdisplay -m /dev/sdf1

/dev/sdf1 was only an example. You will see the contents of this PV. So next, performing

pvmove /dev/sdf1

would attempt to relocate every existing LV from this specific PV to any other available PV. We can use this command to change the disk balance and allocations on multi-disk volume groups. This will be discussed in a later post.

Following a ‘pvmove‘ command, all linear volumes are relocated, if space permits, to other PVs. The remaining LVs are either mirrored or LVs with snapshots.
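
To see which LVs are still left on the PV after the move, the ‘pvdisplay -m‘ output can be filtered. This is a small sketch with a function name of my own; the sample input mimics the shape of the "Physical Segments" section, and its values are made up.

```shell
# List the LVs that still own extents on a PV, from 'pvdisplay -m' output on stdin
lvs_on_pv () {
   awk '$1 == "Logical" && $2 == "volume" { print $3 }' | sort -u
}

# Sample input in the shape pvdisplay -m prints (the values are made up)
lvs_on_pv <<'EOF'
  --- Physical Segments ---
  Physical extent 0 to 24:
    Logical volume      /dev/VolGroup00/test-mirror_mimage_0
    Logical extents     0 to 24
  Physical extent 25 to 49:
    Logical volume      /dev/VolGroup00/base
    Logical extents     0 to 24
  Physical extent 50 to 99:
    FREE
EOF
# Real use: pvdisplay -m /dev/sdf1 | lvs_on_pv
```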

To relocate a mirrored LV, you need to un-mirror it first. To do so, first detect, using ‘pvdisplay‘, which LV it belongs to (the name should be easy to follow) and then change it to non-mirrored.

lvconvert -m0 /dev/VolGroup00/test-mirror

This will convert it to be a linear volume instead of a mirror, so you could move it, if it still resides on the PV you are to remove.

Snapshot volumes are more complicated, due to their nature. Since all my snapshots are of a filesystem, I could allow myself to use tar to perform the action.

The steps are as follows:

  1. tar the contents of the snapshot source to nowhere, but save an incremental file
  2. Copy the source incremental file to a new name, and tar the contents of a snapshot according to this copy.
  3. Repeat the previous step for each snapshot.
  4. Remove all snapshots
  5. Relocate the snapshot source using ‘pvmove‘
  6. Build the snapshots and then recover the data into them

This is a script to do steps 1 to 3. It will not remove LVs, for obvious reasons. This script was not tested, but should work, of course 🙂

None of the LVs should be mounted for it to function. It’s better to have harder requirements than to destroy data by double-mounting it, or accessing it while it is being changed.

#!/bin/bash
# Get: VG Base-LV, snapshot name, snapshot name, snapshot name...
# Example:
# ./backup VolGroup00 base snap1 snap2 snap3
# Written by Ez-Aton

TARGET=/tmp
if [ "$#" -lt 3 ]
then
   echo "Parameters: $0 VG base snap snap snap snap"
   exit 1
fi
VG=$1
BASE=$2
shift 2

function check_not_mounted () {
   # Succeeds (returns 0) only if the LV is NOT mounted
   # Note: device-mapper doubles any '-' inside VG/LV names
   if mount | grep -q "/dev/mapper/${VG}-${1} "
   then
      return 1
   else
      return 0
   fi
}

function create_base_diff () {
   # This function will create the diff file for the base
   mount /dev/${VG}/${BASE} $MNT
   if [ $? -ne 0 ]
   then
      echo "Failed to mount base"
      exit 1
   fi
   cd $MNT
   tar -g $TARGET/${BASE}.tar.gz.diff -czf - . > /dev/null
   cd -
   umount $MNT
}

function create_snap_diff () {
   mount /dev/${VG}/${1} $MNT
   if [ $? -ne 0 ]
   then
      echo "Failed to mount snapshot $1"
      exit 1
   fi
   cp $TARGET/${BASE}.tar.gz.diff $TARGET/$1.tar.gz.diff
   cd $MNT
   tar -g $TARGET/${1}.tar.gz.diff -czf $TARGET/${1}.tar.gz .
   cd -
   umount $MNT
}

function create_mount () {
   # Creates a temporary mount point
   if [ ! -d /mnt/$$ ]
   then
      mkdir /mnt/$$
   fi
   MNT=/mnt/$$
}

create_mount
if check_not_mounted $BASE
then
   create_base_diff
else
   echo "$BASE is mounted. Exiting now"
   exit 1
fi
for i in "$@"
do
   if check_not_mounted $i
   then
      create_snap_diff $i
   else
      echo "$i is mounted! I will not touch it!"
   fi
done

The remaining steps should be rather easy – just mount the newly created snapshots and restore the tar file on them.
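
The restore side can be rehearsed safely in a scratch directory before trying it on real volumes. The sketch below assumes GNU tar and uses plain directories as stand-ins for the mounted LVs; all paths and names in it are hypothetical. It walks through the same idea as the script: a level-0 run that only keeps the incremental file, a level-1 archive made against a copy of that file, and a restore with `-g /dev/null`, which tells GNU tar to treat the archive as incremental.

```shell
#!/bin/bash
# Rehearsal of the incremental dump and restore in a scratch directory.
# No LVM volumes are touched; directories stand in for mounted LVs.
set -e
WORK=$(mktemp -d)
mkdir "$WORK/src"
echo one > "$WORK/src/keep.txt"
echo two > "$WORK/src/gone.txt"

# Level 0: the archive goes nowhere, only the incremental (.diff) file is kept
(cd "$WORK/src" && tar -g "$WORK/base.diff" -czf - .) > /dev/null

# Stand-in for the rebuilt snapshot LV: it starts as a copy of the base
cp -a "$WORK/src" "$WORK/restore"

sleep 1   # make sure the changes below get a newer timestamp
# Stand-in for how the snapshot differs from the base
rm "$WORK/src/gone.txt"
echo three > "$WORK/src/new.txt"

# Level 1: work on a copy of the base .diff file, as the script does
cp "$WORK/base.diff" "$WORK/snap.diff"
(cd "$WORK/src" && tar -g "$WORK/snap.diff" -czf "$WORK/snap.tar.gz" .)

# Restore onto the copy: new files are added, deleted ones are removed
(cd "$WORK/restore" && tar -g /dev/null -xzf "$WORK/snap.tar.gz")
ls "$WORK/restore"
```

On real volumes the `cd` targets would be the mount points of the base and of each newly built snapshot.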