Archive for the ‘Disk Storage’ Category

Enable blk_mq on Redhat (OEL/Centos) 7 and 8

Saturday, December 18th, 2021

The IO scheduler blk_mq was designed to increase disk performance – better IOps and lower latency – on flash disk systems, with emphasis on NVME devices. A nice by-product, is that it increases performance on the “older” SCSI/SAS/SATA layer as well, when the disks in the back are SSD, and even when they are rotational disks, in some cases.

When attempting to increase system disk performance (where and when this is the bottleneck, such as in the case of databases), this is a valid and relevant configuration option. If you have a baseline performance metrics to compare to – great ; But even if you do not have other, previous performance details – in most cases – for a modern system – your performance will improve after this change.

NVME devices get this scheduler enabled by default, so no worry there. However, your system might not be aware of the backend storage type of devices, nor of the backend capabilities when dealing with SCSI subsystem, so – by default – these settings are not enabled.

To enable these settings on RHEL/OEL/Centos version 7 and 8, you should do the following:

Edit /etc/default/grub and append to your GRUB_CMDLINE_LINUX the following string:

scsi_mod.use_blk_mq=1 dm_mod.use_blk_mq=y

This will enable blk_mq both on the SCSI subsystem, and the device mapper (DM), which includes software RAID, LVM and device-mapper-multipath (aka – multipath).

Following this, run:

grub2-mkconfig -o /boot/grub2/grub.cfg  # for legacy BIOS

or

grub2-mkconfig -o /boot/efi/EFI/redhat/grub.cfg  # for EFI

Of course – a reboot is required for these changes to take effect. Following the reboot, you will be able to see that the contents of /sys/block/sda/queue/scheduler (for /dev/sda, in my example. Check your relevant disks, of course) shows “[mq-deadline]”

If you want to read more about blk_mq – you can find it in great detail in this blog post.

Reduce traverse time of large directory trees on Linux

Tuesday, December 14th, 2021

Every Linux admin is familiar with the long time running through a large directory tree (with hundred of thousands of files and more) can take. Most are aware that if you re-run the same run-through, it will be shorter.

This is caused by a short-valid filesystem cache, where the memory is allocated to other tasks, or the metadata required cache exceeds the available for this task.

If the system is focused on files, meaning that its prime task is holding files (like NFS server, for example) and the memory is largely available, a certain tunable can reduce recurring directory dives (like the ‘find’ or ‘rsync’ commands, which run huge amounts of attribute queries):

sysctl vm.vfs_cache_pressure=10

The default value is 100. Lower values will cause the system to prefer keeping this cache. A quote from kernel’s memory tunables page:

vfs_cache_pressure
——————————–
This percentage value controls the tendency of the kernel to reclaim the memory which is used for caching of directory and inode objects.

At the default value of vfs_cache_pressure=100 the kernel will attempt to reclaim dentries and inodes at a “fair” rate with respect to pagecache and swapcache reclaim. Decreasing vfs_cache_pressure causes the kernel to prefer to retain dentry and inode caches. When vfs_cache_pressure=0, the kernel will never reclaim dentries and inodes due to memory pressure and this can easily lead to out-of-memory conditions. Increasing vfs_cache_pressure beyond 100 causes the kernel to prefer to reclaim dentries and inodes.

Increasing vfs_cache_pressure significantly beyond 100 may have negative performance impact. Reclaim code needs to take various locks to find freeable directory and inode objects. With vfs_cache_pressure=1000, it will look for ten times more freeable objects than there are.

Better iostat visibility of ZFS vdevs

Sunday, July 4th, 2021

All ZFS users are familiar with ‘zpool iostat’ command, however, it is not easily translated into Linux ‘iostat’ command. Using large pools with many disks will result in a mess, where it’s hard to identify which disk is which, and going to a translation table from time to time, to identify a suspect slow disk.

Linux ‘iostat’ command allows to use aliases, and if you’r using vdev_id.conf file, and you are using ZFS aliased names, you can harness the same naming to your ‘iostat’ command. See my command example below – note that in this setup I do not use multipath or other DM devices, but a direct approach to /dev/sd devices. Also – in this case – I have a few (slightly more than a dozen) disks, so no need to address /dev/sd[a-z][a-z] devices:

iostat -kt 5 -j vdev -x /dev/sd? /dev/nvme0n1

You can chain more devices to the end of this line. The result should be something like this:

07/04/2021 10:36:54 AM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.26    0.00    0.87    1.49    0.00   97.39

     r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util Device
    0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00 nvme0n1
   42.80   15.40   5461.60   1938.40     0.00     0.00   0.00   0.00    0.50    0.87   0.01   127.61   125.87   1.12   6.50 sata500g2
   52.20    0.00   6631.20      0.00     0.00     0.00   0.00   0.00    0.43    0.00   0.00   127.03     0.00   1.40   7.30 sata500g1
   40.60    0.40   5196.80     51.20     0.00     0.00   0.00   0.00    0.44    0.50   0.00   128.00   128.00   1.41   5.80 sata500g4
   71.60    6.20   9148.00    776.80     0.00     0.00   0.00   0.00    0.45    0.48   0.00   127.77   125.29   1.34  10.46 sata500g3
   35.00    6.00   4463.20    768.00     0.00     0.00   0.00   0.00    0.44    0.53   0.00   127.52   128.00   1.24   5.08 sata1t
    0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00 mid1
   28.80   10.60    748.00     84.00     0.00     0.00   0.00   0.00    1.55    0.53   0.04    25.97     7.92   1.11   4.36 top4
    5.00   18.00    122.40    106.40     0.00     0.00   0.00   0.00   13.32   18.32   0.39    24.48     5.91   1.18   2.72 top5
    4.60   27.40    124.00    160.80     0.00     0.00   0.00   0.00   10.35   15.71   0.46    26.96     5.87   1.09   3.48 top2
   26.40   12.20    676.80     88.80     0.00     0.00   0.00   0.00    2.14    0.52   0.05    25.64     7.28   1.07   4.12 bot3
    4.60   25.40    104.80    137.60     0.00     0.00   0.00   0.00    5.26    0.64   0.04    22.78     5.42   0.31   0.94 mid4
    5.40   19.00    130.40    119.20     0.00     0.00   0.00   0.00    3.81    0.52   0.02    24.15     6.27   0.57   1.38 mid5
   25.00   12.00    596.80     80.00     0.00     0.20   0.00   1.64    3.61    0.13   0.08    23.87     6.67   1.03   3.80 mid2
   28.00   11.20    678.40     81.60     0.00     0.00   0.00   0.00    3.67    0.59   0.10    24.23     7.29   1.23   4.84 bot2
    5.00   23.40    114.40    140.80     0.00     0.00   0.00   0.00    9.28    0.43   0.05    22.88     6.02   0.51   1.44 bot4
    5.00   27.00    120.80    151.20     0.00     0.00   0.00   0.00    4.04    0.75   0.03    24.16     5.60   0.49   1.56 bot5
    0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00 mid3
   27.40    8.60    661.60     77.60     0.00     0.00   0.00   0.00    1.77    0.49   0.04    24.15     9.02   1.16   4.18 bot1
   27.00    9.60    692.00     84.80     0.00     0.00   0.00   0.00    2.10    0.52   0.05    25.63     8.83   1.15   4.20 top1

ZFS clone script

Sunday, March 28th, 2021

ZFS has some magical features, comparable to NetApp’s WAFL capabilities. One of the less-used on is the ZFS send/receive, which can be utilised as an engine below something much like NetApp’s SnapMirror or SnapVault.

The idea, if you are not familiar with NetApp’s products, is to take a snapshot of a dataset on the source, and clone it to a remote storage. Then, take another snapshot, and clone only the delta between both snapshots, and so on. This allows for cloning block-level changes only, which reduces clone payload and the time required to clone it.

Copy and save this file as clone_zfs_snapshots.sh. Give it execution permissions.

#!/bin/bash
# This script will clone ZFS snapshots incrementally over SSH to a target server
# Snapshot name structure: [email protected]${TGT_HASH}_INT ; where INT is an increment number
# Written by Etzion. Feel free to use. See more stuff in my blog at https://run.tournament.org.il
# Arguments:
# $1: ZFS filesystem name
# $2: (target ZFS system):(target ZFS filesystem)

IAM=$0
ZFS=/sbin/zfs
LOCKDIR=/dev/shm
LOCAL_SNAPS_TO_LEAVE=3
RESUME_LIMIT=3

### FUNCTIONS ###

# Sanity and usage
function usage() {
	echo "Usage: $IAM SRC REMOTE_SERVER:ZFS_TARGET (port=SSH_PORT)"
	echo "ZFS_TARGET is the parent of filesystems which will be created with the original source names"
	echo "Example: $IAM share/test backupsrv:backup"
	echo "It will create a filesystem 'test' under the pool 'backup' on 'backupsrv' with clone"
	echo "of the current share/test ZFS filesystem"
	echo "This script is (on purpose) not a recursive script"
	echo "For the script to work correctly, it *must* have SSH key exchanged from source to target"
	exit 0
}

function abort() {
	# exit errorously with a message
	echo "[email protected]"
	pkill -P $$
	remove_lock
	exit 1
}

function parse_parameters() {
	# Parses command line parameters
	# called with $*
	SRC_FS=$1
	shift
	TGT=$1
	shift
	for i in $*
	do
		case ${i} in
			port=*)	PORT=${i##*=}
			;;
			hash=*)	HASH=${i##*=}
			;;
		esac
	done
	TGT_SYS=${TGT%%:*}
	TGT_FS=${TGT##*:}
	# Use a short substring of MD5sum of the target name for later unique identification
	SRC_DIRNAME_FS=${SRC_FS#*/}
	if [ -z "$hash" ]
	then
		TGT_FULLHASH="`echo $TGT_FS/${SRC_DIRNAME_FS} | md5sum -`"
		TGT_HASH=${TGT_FULLHASH:1:7}
	else
		TGT_HASH=${hash}
	fi

}

function sanity() {
	# Verify we have all details
	[ -z "$SRC_FS" ] && usage
	[ -z "$TGT_FS" ] && usage
	[ -z "$TGT_SYS" ] && usage
	$ZFS list -H -o name $SRC_FS > /dev/null 2>&1 || abort "Source filesystem $SRC_FS does not exist"
	# check_target_fs || abort "Target ZFS filesystem $TGT_FS on $TGT_SYS does not exist, or not imported"
}

function remove_lock() {
	# Removes the lock file
	\rm -f ${LOCKDIR}/$SRC_LOCK
}

function construct_ssh_cmd() {
	# Constract the remote SSH command
	# Here is a good place to put atomic parameters used for the SSH
	[ -z "${PORT}" ] && PORT=22
	SSH="ssh -p $PORT $TGT_SYS -o ConnectTimeout=3"
	CONTROL_SSH="$SSH -f"
}

function get_last_remote_snapshots() {
	# Gets the last snapshot name on a remote system, to match it to our snapshots
	remoteSnapTmpObj=`$SSH "$ZFS list -H -t snapshot -r -o name ${TGT_FS}/${SRC_DIRNAME_FS}" | grep ${SRC_DIRNAME_FS}@ | grep ${TGT_HASH}`
	# Create a list of all snapshot indexes. Empty means its the first one
	remoteSnaps=""
	for snapIter in ${remoteSnapTmpObj}
	do
	  remoteSnaps="$remoteSnaps ${snapIter##*@${TGT_HASH}_}"
	done
}

function check_if_remote_snapshot_exists() {
	# Argument: $1 ->; Name of snapshot
	# Checks if this snapshot exists on remote node
	$SSH "$ZFS list -H -t snapshot -r -o name ${TGT_FS}/${SRC_DIRNAME_FS}@${TGT_HASH}_${newLocalIndex}"
	return $?
}

function get_last_local_snapshots() {
	# This function will return an array of local existing snapshots using the existing TGT_HASH
    localSnapTmpObj=`$ZFS list -H -t snapshot -r -o name $SRC_FS | grep [email protected] | grep $TGT_HASH `
    # Convert into a list and remove the HASH and everything before it. We should have clear list of indexes
    localSnapList=""
    for snapIter in ${localSnapTmpObj}
    do
    	localSnapList="$localSnapList ${snapIter##*@${TGT_HASH}_}"
    done
    # Convert object to array
    localSnapList=( $localSnapList )
    # Get the last object
    let localSnapArrayObj=${#localSnapList[@]}-1
}

function delete_snapshot() {
	# This function will delete a snapshot
	# arguments: $1 -> snapshot name
	[ -z "$1" ] && abort "Cleanup snapshot got no arguments"
	$ZFS destroy $1
	#$ZFS destroy ${SRC_FS}@${TGT_HASH}_${newLocalIndex}
}

function find_matching_snapshot() {
	# This function will attempt to find a matching snapshot as a replication baseline
	# Gets the latest local snapshot index
	localRecentIndex=${localSnapList[$localSnapArrayObj]}
    # Gets the latest mutual snapshot index
    while [ $localSnapArrayObj -ge 0 ]
    do
    	# Check if the current counter already exists
    	if echo "$remoteSnaps" | grep -w ${localSnapList[$localSnapArrayObj]} > /dev/null 2>&1
    	then
    		# We know the mutual index.
    		commonIndex=${localSnapList[$localSnapArrayObj]}
    		return 0
    	fi
    	let localSnapArrayObj--
    done
    # If we've reached here - there is no mutual index!
    abort "There is no mutual snapshot index, you will have to resync"
}

function cleanup_snapshots() {
	# Creates a list of snapshots to delete and then calls delete_snapshot function
	# We are using the most recent common index, $localSnapArrayObj as the latest reference for deletion
	let deleteArrayObj=$localSnapArrayObj-${LOCAL_SNAPS_TO_LEAVE}
	snapsToDelete=""
	# Construct a list of snapshots to delete, and delete it in reverse order
	while [ $deleteArrayObj -ge 0 ]
	do
		# Construct snapshot name
		snapsToDelete="$snapsToDelete ${SRC_FS}@${TGT_HASH}_${localSnapList[$deleteArrayObj]}"
		let deleteArrayObj--
	done
	snapsToDelete=( $snapsToDelete )

	snapDelete=0
	
	while [ $snapDelete -lt ${#snapsToDelete[@]} ]
	do
		# Delete snapshot
		delete_snapshot ${snapsToDelete[$snapDelete]}
		let snapDelete++
	done
}

function initialize() {
	# This is a unique case where we initialize the first sync
	# We will call this procedure when $remoteSnaps is empty (meaning that there was no snapshot whatsoever)
	# We have to verify that the target has no existing old snapshots here
	# is it empty?
	echo "Going to perform an initialization replication. It might wipe the target $TGT_FS completely"
	echo "Press Enter to proceed, or Ctrl+C to abort"
	read "abc"
	### Decided to remove this check
	### [ -n "$LOCSNAP_LIST" ] && abort "No target snapshots while local history snapshots exists. Clean up history and try again"
	RECEIVE_FLAGS="-sFdvu"
	newLocalIndex=1
	# NEW_LOC_INDEX=1
	create_local_snapshot $newLocalIndex
	open_remote_socket
	sleep 1
	$ZFS send -ce ${SRC_FS}@${TGT_HASH}_${newLocalIndex} | nc $TGT_SYS $NC_PORT 2>&1
	if [ "$?" -ne "0" ]
	then
		# Do no cleanup current snapshot
		# delete_snapshot ${SRC_FS}@${TGT_HASH}_${newLocalIndex}
		abort "Failed to send initial snapshot to target system"
	fi
	sleep 1
	# Set target to RO
	$SSH $ZFS set readonly=on $TGT_FS
	[ "$?" -ne "0" ] && abort "Failed to set remote filesystem $TGT_FS to read-only" # No need to remove local snapshot
}

function create_local_snapshot() {
	# Creates snapshot on local storage
	# uses argument $1
	[ -z "$1" ] && abort "Failed to get new snapshot index"
	$ZFS snapshot ${SRC_FS}@${TGT_HASH}_${1}
	[ "$?" -ne "0" ] && abort "Failed to create local snapshot. Check error message"
}

function open_remote_socket() {
	# Starts remote socket via SSH (as the control operation)
	# port is 3000 + three-digit random number
	let NC_PORT=3000+$RANDOM%1000
	$CONTROL_SSH "nc -l -i 90 $NC_PORT | $ZFS receive ${RECEIVE_FLAGS} $TGT_FS > /tmp/output 2>&1 ; sync"
	#$CONTROL_SSH "socat tcp4-listen:${NC_PORT} - | $ZFS receive ${RECEIVE_FLAGS} $TGT_FS > /tmp/output 2>&1 ; sync"
	#zfs send -R [email protected] | zfs receive -Fdvu zpnew
}

function send_zfs() {
	# Do the heavy lifting of opening remote socket and starting ZFS send/receive
	open_remote_socket
	sleep 1
	$ZFS send -ce -I ${SRC_FS}@${TGT_HASH}_${commonIndex} ${SRC_FS}@${TGT_HASH}_${newLocalIndex} | nc -i 90 $TGT_SYS $NC_PORT 
	#$ZFS send -ce -I ${SRC_FS}@${TGT_HASH}_${commonIndex} ${SRC_FS}@${TGT_HASH}_${newLocalIndex} | socat tcp4-connect:${TGT_SYS}:${NC_PORT} -
	sleep 20

}

function increment() {
	# Create a new snapshot with the index $localRecentIndex+1, and replicate it to the remote system
	# Baseline is the most recent common snapshot index $commonIndex
	RECEIVE_FLAGS="-Fsdvu" # With an 'F' flag maybe?
	# Handle the case of latest snapshot in DR is newer than current latest snapshot, due to mistaken deletion
	remoteSnaps=( $remoteSnaps )
	let remoteIndex=${#remoteSnaps[@]} # Get last snapshot on DR
	if [ ${localRecentIndex} -lt ${remoteIndex} ]
	then
		let newLocalIndex=${remoteIndex}+1
	else
		let newLocalIndex=localRecentIndex+1
	fi
	create_local_snapshot $newLocalIndex

	send_zfs

	# if [ "$?" -ne "0" ]
	# then

		# Cleanup current snapshot
		#delete_snapshot ${SRC_FS}@${TGT_HASH}_${newLocalIndex}
		#abort "Failed to send incremental snapshot to target system"
	# fi
	if ! verify_correctness
	then

		if ! loop_resume # If we can
		then
			# We either could not resume operation or failed to run with the required amount of iterations
			# For now we abort. 
			echo "Deleting local snapshot"
			delete_snapshot ${SRC_FS}@${TGT_HASH}_${newLocalIndex}
			abort "Remote snapshot should have the index of the latest snapshot, but it is not. The current remote snapshot index is ${commonIndex}"
		fi
	fi
}

function loop_resume() {
	# Attempts to loop over resuming until limit attempt has been reached
	REMOTE_TOKEN=$($SSH "$ZFS get -Ho value receive_resume_token ${TGT_FS}/${SRC_DIRNAME_FS}")
	if [ "$REMOTE_TOKEN" == "-" ]
	then
		return 1
	fi
	# We have a valid resume token. We will retry
	COUNT=1
	while [ "$COUNT" -le "$RESUME_LIMIT" ]
	do
		# For ease of handline - for each iteration, we will request the token again
		echo "Attempting resume operation" 
		REMOTE_TOKEN=$($SSH "$ZFS get -Ho value receive_resume_token ${TGT_FS}/${SRC_DIRNAME_FS}")
		let COUNT++
		open_remote_socket
		$ZFS send -e -t $REMOTE_TOKEN | nc -i 90 $TGT_SYS $NC_PORT
		#$ZFS send -e -t $REMOTE_TOKEN | socat tcp4-connect:${TGT_SYS}:${NC_PORT} -
		sleep 20
		if verify_correctness
		then
			echo "Done"
			return 0
		fi
	done
	# If we've reached here, we have failed to run the required iterations. Lets just verify again
	return 1
}

function verify_correctness() {
	# Check remote index, and verify it is correct with the current, latest snapshot

    if check_if_remote_snapshot_exists
    then
    	echo "Replication Successful"
    	return 0
    else
    	echo "Replication failed"
    	return 1
    fi
}

### MAIN ###
[ `whoami` != "root" ] && abort "This script has to be called by the root user"
[ -z "$1" ] && usage
parse_parameters $*
SRC_LOCK=`echo $SRC_FS | tr / _`
if [ -f ${LOCKDIR}/$SRC_LOCK ] 
then
	echo "Already locked. If should not be the case - remove ${LOCKDIR}/$SRC_LOCK"
	exit 1
fi
sanity
touch ${LOCKDIR}/$SRC_LOCK
construct_ssh_cmd
get_last_remote_snapshots # Have a string list of remoteSnaps
# If we dont have remote snapshot it should be initialization
if [ -z "$remoteSnaps" ]
then
	initialize
	echo "completed initialization. Done"
	remove_lock
	exit 0
fi

# We can get here only if it is not initialization
get_last_local_snapshots # Have a list (array) of localSnaps
find_matching_snapshot # Get the latest local index and the latest common index available
increment # Creates a new snapshot and sends/receives it
cleanup_snapshots # Cleans up old local snapshots
pkill -P $$
remove_lock
echo "Done"

A manual initial run should be called manually. If you expect a very long initial sync, you should run it in tmux to screen, to avoid failing in the middle.

To run the command, run it like this:

./clone_zfs_snapshots.sh share/my-data backuphost:share

This will create under the pool ‘share’ in the host ‘backuphost’ a filesystem matching the source (in this case: share/my-data) and set it to read-only. The script will create a snapshot with a unique name based on a shortened hash of the destination, with a counting number suffix, and start cloning the snapshot to the remote host. When called again, it will create a snapshot with the same name, but different index, and clone the delta to the remote host. In case of a disconnection, the clone will retry a few times before failing.

Note that the receiving side does not remove snapshots, so handling (too) old snapshots on the backup host remains up to you.

Extend /boot from within a Linux system

Saturday, March 27th, 2021

This is a tricky one. In order to resize /boot, which is, commonly the first partition, you need to push forward the beginning of the next partition. This is not an easy task, especially if you are not using LVM – then you have to use external partitioning modification tools, like PQMagic (if it still exists, who knows?), or other such offline tools.

However, if you are using LVM, there is a (complex) trick to it. We need to evict some of the first few PEs, resize the partition to begin at a new location, and then re-sign (and restore) the LVM meta-data in a way which will reflect the relative change in data blocks position (aka – the new PEs). To have some additional grasp of LVM and its meta-data, I recommend you read my article here.

Also, and this is an important note – you cannot change an open (in-use) partition on systems prior to RHEL 8 (on which I have not tested my solution just yet) – meaning – you can change the partition layout on the disk, but the kernel will not refresh that information and would not act accordingly until reboot.

If you have not tried this before, or not sure about all the details in this post, I urge you to use a VM for testing purposes. A failure in this process might leave your data inaccessible, and you do not want that.

So, we have a complex set of tasks:

  • If there is some empty space somewhere on the LVM PV, migrate the first X blocks out.
  • Export the LVM meta-data so we could edit it afterwards
  • Recreate the partition (delete and recreate) with a new starting location
  • (here comes the tricky part) – sign the partition’s updated beginning with LVM meta-data, with the updated relative block locations.

Assumptions:

  • The disk partition layout is /boot as /dev/sda1 and LVM PV on /dev/sda2
  • The LVM VG name is ‘VG’
  • We are using modern dracut-capable system, such as RHEL/CentOS version 6 and above (not tested on version 8 yet)
  • We use basic (msdos) partition layout and not GPT

Clear 500MB for further use, if not enough free space in PV:

In order to do so, we will need 500MB of free space in our PV. If space is an issue, you can easily clean up space from the swap space, by stopping swap, reducing the LV size, signing swap with ‘mkswap’ and starting swap again. This is in a nutshell, and I will not go further into it.

Move the first 500MB out of the beginning of the PV:

We need to do some math. The size of a single PE is defined in the LVM VG settings. By default it is 4MB today, and it can be checked using ‘vgdisplay’ command. Look for the field ‘PE Size’. So – 500MB is 125 PEs. So our command would be:

pvmove --alloc anywhere /dev/sda2:0-124

Which will migrate the first 125 PEs starting at position 0 to 124 away somewhere in the VG.

Export LVM meta-data to a file, and edit it for future handling:

vgcfgbackup -f /tmp/vg-orig.txt VG 

This command will create a file called /tmp/vg-orig.txt which will contain the original VG meta-data copy. We will clone this file and edit it:

cp /tmp/vg-orig.txt /tmp/vg.txt

Now comes the more complex part. We need to adjust the meta-data file to reflect the relative change in the block location. We will edit the new /tmp/vg.txt file. Initially – find the block describing ‘pv0’, which is the first PV in your VG (and maybe the only one), and verify that ‘pv0’ is the correct device, by verifying the ‘device’ directive in this block.
Now comes the harder part – Each LV block in the meta-data file has a sub-section describing disk segments. These blocks describe the relative location of the LV in the PVs. I have already pointed at my article describing the meta-data file and how to read it. The task is to find the ‘stripes’ directive in each LV sub-segment, and reduce the amount of PEs – in our case – 125. It needs to be done for all LVs which reside on our ‘pv0’ – one after the other. An example would look like this:

lvswap { ### Another LV
			id = "E3Ei62-j0h6-cGu5-w9OB-l9tU-0Qf5-f09bvh"
			status = ["READ", "WRITE", "VISIBLE"]
			flags = []
			creation_host = "localhost.localdomain"
			creation_time = 1594157749	# 2019-01-01 08:42:29 +0000
			segment_count = 1

			segment1 {
				start_extent = 0  ### Tee LE of the LV. On LEs - later ###
				extent_count = 94	# 2.9375 Gigabytes

				type = "striped"
				stripe_count = 1	# linear

				stripes = [
					# Was: "pv0", 1813. Now:
					"pv0", 1688 ### reduced 125 PEs ###
				]
			}
		}

Copy the resulting file /tmp/vg.txt (after double-checking it!) to /boot. We will use /boot later on to re-sign the PV meta-data.

Recreate the partition:

Another tricky part. You cannot just resize a partition, or at least – without the tool (parted, fdisk – depending on your OS version) attempting to resize the over-layer, and failing to do so. Most tools do not allow changes to the size of the partitions at all, so we will need to delete and recreate the partition layout. Now, depending if you are using GPT or msdos partition, your tools might vary, but in this case, I handle only msdos partition layout, so the tools will be in accordance. Other tools can apply for GPT layout, and the process, in general, will work on GPT as well.

So – we will backup the partition layout before we change it. The command ‘sfdisk’ will allow us to do so, so we can call

sfdisk -d /dev/sda > /boot/original-disk-layout.txt

I am leaving quite a lot of stuff on /boot partition, because this partition is not a member of the LVM volume group, and will remain, mostly, unaffected during our process. You can use an external USB disk, or any other non-LVM partition, as long as you verify you can access it from within the boot process, directly from initrd/initramfs, or dracut. /boot is, commonly, accessible from within the boot process.

Now we modify the partition layout. To do so, I recommend to document the original start point of the two interesting partitions – /boot (usually /dev/sda1) and our PV (in this example: /dev/sda2). I prefer using ‘sector’ directives. An example would be:

parted -s /dev/sda "unit s p"

It is common, for modern Linux systems, to have /boot starting at sector 2048 (which is 1MB into the disk). This is due to block alignment, however, I will not discuss this here. The interesting part is the size of a sector (commonly 512b, but can be 4K for ‘advanced format’ disks), so we will be able to calculate the new partitions starting positions and sizes.

Now, using ‘parted’ we need to remove the 2nd partition (in my example, note. It might vary on your setup) and recreate it at a newer location – 125PEs further, or 500MB further, or 1024000 sectors ahead. So, if our starting sector is 411648(s), then we will have to create the partition starting at sector 1435648 (=411648+1024000), with the original ending location. Don’t forget to set this partition to LVM. Assuming you have saved the starting point of the partition in the variable StartOfPart, and the original ending in EndOfPart, your command would look like this:

parted -s /dev/sda "unit s rm 2 mkpart primary $(( StartOfPart + 1024000 )) ${EndOfPart} set 2 lvm on"

Now, we need to recreate the /boot partition (partition #1 in my example) to include the new size. Again – we need to document its beginning, and now recreate it. Assuming we have kept the same variables as before, the command would look like this:

parted -s /dev/sda "unit s rm 1 mkpart primary ${StartOfPart} $(( EndOfPart + 1024000 - 1 )) set 1 boot on"

The kernel will not update the new partitions sizes because they are in use. We will need a reboot, however – when we reboot (do not do that just yet), we will no longer have access to our LVM. This is because it will not have meta-data anymore, and we will need to recreate it.

Prepare a script to place in /boot, called vgrecover.sh which will hold the following lines:

#!/bin/sh
sed -i 's/locking_type = 4/locking_type = 0/g' /etc/lvm/lvm.conf
lvm pvcreate -u ${PVID} --restorefile /mnt/vg.txt /dev/sda2
lvm vgcfgrestore -f /mnt/vg.txt VG

You need to save the PVID for /dev/sda2 and replace this value in this script. This is the field ‘PV UUID’ in the output of the command:

pvdisplay /dev/sda2

Some more explanations: The device in our example is /dev/sda2 (change it to match your device name), and the VG name is ‘VG’ (again – change to match your setup). This script needs to be placed on /boot and be made executable.

Before our reboot:

We need to verify the following files exist on our /boot:

  • vg.txt
  • vgrecover.sh
  • original-disk-layout.txt

If any of these files is missing, you will not be able to boot, you will not be able to recover your system, and you will not be able to access the data there ever again!

I also recommend you keep your original-disk-layout.txt file somewhere external. If you have made a partitioning mistake and changed the beginning of /boot, you will not have access to /boot and all its files, and having this file elsewhere (on external disk, for example) will help you recover the partition layout quickly and with no frustration.

Now comes another risky part: reboot and get into recovery shell used by GRUB. See my article here to understand how to enter recovery shell. If you have a different OS version, your boot arguments might differ. An external boot media (like RHEL/CentOS recovery boot, or Ubuntu live) could also suffice to complete the task, but it is preferred to use the GRUB recovery console to reduce the change of some unknown automatic task or detection process doing stuff for you.

We need to break the boot sequence in the pre-mount phase. We will have a minimal shell on which we need to run the following commands:

mkdir /mnt
mount /dev/sda1 /mnt
/mnt/vgrecover.sh

We are mounting /dev/sda1 (our /boot) on /mnt, which we have just created. Then we call the vgrecover.sh script we have created before. It will use LVM recovery commands to re-sign the PV on /dev/sda2, and then recover the VG meta-data using our modified meta-data file, describing a new relative positions of LVs.

When done, assuming no problems happened there, just umount /mnt and reboot. The system should boot up successfully, however, /boot will not have the designated size just yet.

Extending /boot :

The partition /dev/sda1 is of the updated size now, however, the filesystem is not. You can verify that using ‘fdisk -l /dev/sda’ of ‘parted -s /dev/sda unit s p’ or any other command. If this is not the case, then check your process.

Extending the filesystem depends on the type of filesystem. You can run ‘df -hPT /boot’ to identify the filesystem type. If it is XFS, use the command:

xfs_growfs /boot

If the filesystem is of type ext3 or ext4, use

resize2fs /dev/sda1

Other filesystems will require different tools, and since I cannot cover it all, I leave it to you. This is an online process, and as soon as it is over, the new size will show in the ‘df’ command.

Recovery:

If, for some reason, the disk partitioning or PV re-signing failed, and the system cannot boot, you can use the original-disk-layout.txt file in /boot to recover the original disk layout. Boot into GRUB rescue mode as shown above, and run:

mkdir /mnt
mount /dev/sda1 /mnt
sfdisk -f /dev/sda < /mnt/original-disk-layout.txt

If your /boot is inaccessible, and the file original-disk-layout.txt was kept on an external storage, you can use a live Ubuntu, or any other live system to run the ‘sfdisk’ command as shown above to recover /dev/sda original partitioning layout.

Bottom line:

This is a possible, although complex, task, and you should practice it on a VM, with disk snapshots before you attempt to kill production servers. Leave me a comment if it worked, or if there is anything I need to add or correct in this post. Thanks, and good luck!