Posts Tagged ‘Disk Storage’

Hot-resize disks on Linux

Monday, April 6th, 2020

After much investigation, I came to the conclusion that a full guide describing the procedure required for online disk resize on Linux (especially expanding disks) was missing. I have created a guide for RHEL5/6/7/8 (it works the same for Centos, OEL, ScientificLinux and other RHEL-based Linux systems) which takes into account the following four scenarios:

  • Expanding a disk where there is a filesystem directly on disk (no partitioning used)
  • Expanding a disk where there is LVM PV directly on disk (no partitioning used)
  • Expanding a disk where there is a filesystem on partition (a single partition taking all the disk’s space)
  • Expanding a disk where there is an LVM PV on partition (a single partition taking all the disk’s space)

All four scenarios were tested with and without the use of multipath (device-mapper-multipath). Notes about using GPT compared to MBR are also given. The purpose is to provide a full guideline for hot-extending disks.

This document does not describe the process of extending disks on the storage/virtualisation/NAS/whatever end. The way the storage client refreshes the disk topology differs between Linux versions and storage communication methods – iSCSI, FC, FCoE, AoE, local virtualised disk (VMware/KVM/Xen/XenServer/HyperV) and so on. Each connectivity/OS combination might require a different refresh method on the client. In this lab, I use iSCSI and the iSCSI software initiator.
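For example, for a local virtualised disk (where there is no iSCSI session to rescan), the common way to make the kernel re-read the device size is a per-device rescan through sysfs. A minimal sketch, assuming the disk is /dev/sda (not part of the lab below, which uses iSCSI):

# Ask the SCSI layer to re-read the capacity of /dev/sda
echo 1 > /sys/block/sda/device/rescan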

The Lab

A storage server running Linux (Centos 7) with targetcli tools, exporting a 5GB (or larger) LUN through iSCSI to Linux clients running Centos5, Centos6, Centos7 and Centos8, with the latest updates (5.11, 6.10, 7.7, 8.1). See some interesting insights on iSCSI target disk expansion using Linux LIO (the targetcli command line) in my previous post.

The iSCSI clients all see the disk as ‘/dev/sda’ block device. When using LVM, the volume group name is tempvg and the logical volume name is templv. When using multipath, the mpath name is mpatha. On some systems the mpath partition would appear as mpatha1 and on others as mpathap1.

The iSCSI client disks/partitions were set up like this:

Centos5:

* Filesystem on disk

mkfs.ext3 /dev/sda
mount /dev/sda /mnt

* LVM on disk

pvcreate /dev/sda
vgcreate tempvg /dev/sda
lvcreate -l 100%FREE -n templv tempvg
mkfs.ext3 /dev/tempvg/templv
mount /dev/tempvg/templv /mnt

* Filesystem on partition

parted -s /dev/sda "mklabel msdos mkpart primary 1 -1"
mkfs.ext3 /dev/sda1
mount /dev/sda1 /mnt

* LVM on partition

parted -s /dev/sda "mklabel msdos mkpart primary 1 -1 set 1 lvm on"
pvcreate /dev/sda1
vgcreate tempvg /dev/sda1
lvcreate -l 100%FREE -n templv tempvg
mkfs.ext3 /dev/tempvg/templv
mount /dev/tempvg/templv /mnt

Centos6:

* Filesystem on disk

mkfs.ext4 /dev/sda
mount /dev/sda /mnt

* LVM on disk

pvcreate /dev/sda
vgcreate tempvg /dev/sda
lvcreate -l 100%FREE -n templv tempvg
mkfs.ext4 /dev/tempvg/templv
mount /dev/tempvg/templv /mnt

* Filesystem on partition

parted -s /dev/sda "mklabel msdos mkpart primary 1 -1"
mkfs.ext4 /dev/sda1
mount /dev/sda1 /mnt

* LVM on partition

parted -s /dev/sda "mklabel msdos mkpart primary 1 -1 set 1 lvm on"
pvcreate /dev/sda1
vgcreate tempvg /dev/sda1
lvcreate -l 100%FREE -n templv tempvg
mkfs.ext4 /dev/tempvg/templv
mount /dev/tempvg/templv /mnt

Centos7/8:

* Filesystem on disk

mkfs.xfs /dev/sda
mount /dev/sda /mnt

* LVM on disk

pvcreate /dev/sda
vgcreate tempvg /dev/sda
lvcreate -l 100%FREE -n templv tempvg
mkfs.xfs /dev/tempvg/templv
mount /dev/tempvg/templv /mnt

* Filesystem on partition

parted -a optimal -s /dev/sda "mklabel msdos mkpart primary 1 -1"
mkfs.xfs /dev/sda1
mount /dev/sda1 /mnt

* LVM on partition

parted -a optimal -s /dev/sda "mklabel msdos mkpart primary 1 -1 set 1 lvm on"
pvcreate /dev/sda1
vgcreate tempvg /dev/sda1
lvcreate -l 100%FREE -n templv tempvg
mkfs.xfs /dev/tempvg/templv
mount /dev/tempvg/templv /mnt

Some variations might exist. For example, using a ‘GPT’ partition layout would result in a parted command like this:

parted -s /dev/sda "mklabel gpt mkpart ' ' 1 -1"

Also, for multipath devices, replace the block device /dev/sda with /dev/mapper/mpatha, like this:

parted -a optimal -s /dev/mapper/mpatha "mklabel msdos mkpart primary 1 -1"

There are several common tasks, such as expanding filesystems – for XFS, using xfs_growfs <mount target>; for ext3 and ext4, using resize2fs <device path>. The same goes for LVM expansion – pvresize <device path>, followed by an lvextend command, followed by the filesystem expansion command noted above.
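As a quick illustration, the full chain for the ‘LVM on partition’ layout used in this lab looks like this (a sketch only – it assumes the partition /dev/sda1 was already expanded and that the filesystem is ext4; for XFS, replace resize2fs with xfs_growfs /mnt):

pvresize /dev/sda1                         # let LVM see the enlarged PV
lvextend -l +100%FREE /dev/tempvg/templv   # grow the LV into the new free space
resize2fs /dev/tempvg/templv               # grow an ext3/ext4 filesystem online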

The document layout

The document will describe the client commands for each OS, sorted by action. The process is as follows:

  • Expand the visualised storage layout (the storage has already expanded the LUN; now the OS needs to pick up the change)
  • (if in use) Expand the multipath device
  • (if partitioned) Expand the partition
  • Expand the LVM PV
  • Expand the filesystem

Actions

For each OS/scenario/multipath combination, we will format and mount the relevant block device, and attempt an online expansion.

Operations following disk expansion

Expanding the visualised storage layout

For iSCSI, it works quite the same for all OS versions. For other transport types, actions might differ.

iscsiadm -m node -R

Expanding multipath device

If using a multipath device (device-mapper-multipath), an update to the multipath device layout is required. Run the following command (on all OSes):

multipathd -k"resize map mpatha"

Expanding the partition (if disk partitions are in use)

This is the complicated part. Both the capabilities and the commands involved differ greatly between operating system versions.

Centos 5/6

Online expansion of a partition is impossible, except when used with device-mapper-multipath, in which case we force the multipath device to refresh its paths and recreate the device. This will result in an I/O error if there is only a single path defined. For a non-multipath setup, a umount and re-mount is required – the disk partition layout cannot be re-read while the disk is in use.

Without Multipath
fdisk /dev/sda # Delete and recreate the partition from the same starting point
partprobe # Run when disk is not mounted, or else it will not refresh partition size
With Multipath
fdisk /dev/mapper/mpatha # Delete and recreate the partition from the same starting point
partprobe
multipathd -k"reconfigure" # Sufficient for Centos 6
multipathd -k"remove path sda" # Required for Centos 5
multipathd -k"add path sda" # Required for Centos 5
# Repeat for all sub-paths of expanded device

Centos 7/8

Without Multipath
fdisk /dev/sda # Delete and recreate partition from the same starting point. Sufficient for Centos 8
partx -u /dev/sda # Required for Centos 7
With Multipath
fdisk /dev/mapper/mpatha # Delete and recreate the partition from the same starting point. Sufficient for Centos 8
kpartx -u /dev/mapper/mpatha # Can use partx

Expanding LVM PV and LV

pvresize DEVICE
The device can be /dev/sda, /dev/sda1, /dev/mapper/mpatha, /dev/mapper/mpathap1 or /dev/mapper/mpatha1, according to the disk layout and LVM choice. Then extend the logical volume:

lvextend -l +100%FREE /dev/tempvg/templv

Expanding filesystem

For ext3fs and ext4fs
resize2fs DEVICE
The device can be /dev/sda, /dev/sda1, /dev/mapper/mpatha, /dev/mapper/mpathap1 or /dev/mapper/mpatha1 – or /dev/tempvg/templv when LVM is in use – according to the disk layout and LVM choice.
For xfs
xfs_growfs /mnt

Additional Considerations

MBR vs GPT

On most Linux versions (for Centos – up to and including version 7), the ‘fdisk’ command is incapable of handling a GPT partition layout. If you use a GPT partition layout, gdisk is recommended, if it exists for the OS. If not, parted is a decent, although somewhat limited, alternative.

The gdisk command can also convert a partition layout (at your own risk, of course) from MBR to GPT and vice versa. This is very useful for avoiding large data migrations where a legacy MBR partition layout was used on disks which are to be expanded beyond the 2TB limit.

The GPT backup table is located at the end of the disk, so when extending a GPT disk, the GPT backup table must be repaired as well. Based on my lab tests, it is impossible to both extend the partition and repair the GPT backup table location in a single call to gdisk. Two runs are required – one to fix the GPT backup table, and then – after the changes were saved – another to extend the partition.
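If you prefer a non-interactive approach, sgdisk (shipped in the same package as gdisk) can relocate the backup GPT structures before you re-run the partition resize. A minimal sketch, assuming the expanded disk is /dev/sda:

sgdisk -e /dev/sda    # relocate the backup GPT header/table to the new end of the disk
sgdisk -v /dev/sda    # verify the GPT structures before resizing the partition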

Storage transport

I have demonstrated the use of the iSCSI software initiator on Linux. Different storage transports exist, and each may require its own method of ‘notifying’ the OS of a changed storage layout. See RedHat’s article about disk resizing (RHN access required), which explains how to refresh the storage transport for combinations of various transports and RHEL versions and sub-versions.

Ubuntu 18.04 and TPM2 encrypted system disk

Monday, May 27th, 2019

When you encrypt your entire disk, you are required to enter your passphrase every time you boot your computer. The TPM device has a purpose – keeping your secrets secure (available only to your running system). Combined with SecureBoot, which prevents any unknown kernel/disk from booting, and with a BIOS password, you should be fully protected against data theft, even when an attacker has physical access to your computer in the comfort of her home.

It is easier said than done. For the passphrase to work, you need to make sure your initramfs (the initial RAM disk) has the means to extract the passphrase from the TPM and hand it to the LUKS disk-encryption mechanism.

I am assuming you are installing Ubuntu 18 (tested on 18.04) from scratch, you have a TPM2 device (Dell Latitude 7490, in my case), and you know your way around Linux a bit. I will not dwell on explaining how to edit a file in a restricted location and so on. Also – these steps do not include hardening your BIOS settings – passwords and the like.

Install an Ubuntu 18.04 System

Install an Ubuntu system, and make sure to select ‘encrypt disk’. We aim at full disk encryption. The installer might ask if it should enable SecureBoot – it should, so let it do so. Make sure you remember the passphrase you used in the disk encryption process. We will generate a more complex one later on, but this should do for now. Reboot the system, enter the passphrase, and let’s get to work.

Make sure TPM2 works

The TPM2 device is /dev/tpm0, in most cases. I did not go into the TPM Resource Manager, because it felt like overkill for this task. However, you will need to install (using ‘apt’) the tpm2-tools package. Use this opportunity to install ‘perl’ as well.
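For reference, the installation boils down to a single apt call (assuming the Ubuntu 18.04 package name tpm2-tools):

sudo apt install tpm2-tools perl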

You can, following that, check that your TPM is working by running the command:

sudo tpm2_nvdefine -x 0x1500016 -a 0x40000001 -s 64 -t 0x2000A -T device

This command will define a 64-byte space at address 0x1500016, and will not go through the resource manager, but directly to the device.

Generate secret passphrase

To generate your 64-character secret passphrase, you can use the following command (taken from here):

cat /dev/urandom | tr -dc 'a-zA-Z0-9' | fold -w 64 | head -n 1 > root.key

Add this key to the LUKS disk

To add this passphrase, you will need to identify the device. Take a look at /etc/crypttab (we will edit it later) and identify the first field – the label – which relates to the device. Since you have to be in EFI mode, it is most likely /dev/sda3. Note that you will be required to enter the current (and known) passphrase. You can later (when everything is working fine) remove this passphrase.

sudo cryptsetup luksAddKey /dev/sda3 root.key

Save the key inside your TPM slot

Now, we have an additional passphrase, which we will save with the TPM device. Run the following command:

sudo tpm2_nvwrite -x 0x1500016 -a 0x40000001 -f root.key -T device

This will save the key to the slot we defined earlier. If the command succeeded, remove the root.key file from your system, to prevent easy access to the decryption key.
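A slightly more careful way to get rid of the file is shred (a suggestion on my part – note that its effectiveness depends on the underlying storage, especially on SSDs):

sudo shred -u root.key    # overwrite the key file, then remove it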

Create a key recovery script

To read the key from the TPM device, we will need to run a script. The following script does the trick. Save it to /usr/local/sbin/key and give it execution permissions for root (I used mode 750, which worked well and did not cause errors later on).

#!/bin/sh
# Read 64 bytes from the TPM NV index defined earlier (tail keeps the last output line)
key=$(tpm2_nvread -x 0x1500016 -a 0x40000001 -s 64 -o 0 -T device | tail -n 1)
# Strip spaces, then convert the hex dump back into the raw passphrase characters
key=$(echo $key | tr -d ' ')
key=$(echo $key | /usr/bin/perl -ne 's/([0-9a-f]{2})/print chr hex $1/gie')
printf $key

Run the script, and you should get your key printed back, directly from the TPM device.
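You can also verify that the key coming out of the TPM actually matches a LUKS keyslot before relying on it at boot. A sketch, assuming the same /dev/sda3 device as above:

sudo /usr/local/sbin/key | sudo cryptsetup luksOpen --test-passphrase --key-file=- /dev/sda3 && echo "TPM key matches a LUKS keyslot"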

Adding required commands to initramfs

For the system to be capable of running the script, it needs several commands, and their required libraries and so on. For that, we will use a hook file called /etc/initramfs-tools/hooks/decryptkey

#!/bin/sh
PREREQ=""
prereqs()
 {
     echo "$PREREQ"
 }
case $1 in
 prereqs)
     prereqs
     exit 0
     ;;
esac
. /usr/share/initramfs-tools/hook-functions
copy_exec /usr/sbin/tpm2_nvread
copy_exec /usr/bin/perl
exit 0

The file has 755 permissions, and is owned by root:root

Crypttab

To combine all of this together, we will need to edit /etc/crypttab and modify the options field of our relevant LUKS entry by appending the directive ‘keyscript=/usr/local/sbin/key’.

My file looks like this:

sda3_crypt UUID=d4a5a9a4-a2da-4c2e-a24c-1c1f764a66d2 none luks,discard,keyscript=/usr/local/sbin/key

Of course – make sure to save a copy of this file before changing it.

Combining everything together

We should create a new initramfs file; however, we should keep the existing one for fault handling, if required. I suggest you run the following command before anything else, to keep the original initramfs file:

sudo cp /boot/initrd.img-`uname -r` /boot/initrd.img-`uname -r`.orig

If we are required to use the regular mechanism, we should edit our GRUB entry, change the initrd entry, and append ‘.orig’ to its name.

Now comes the time when we create the new initramfs, and make ourselves ready for the big change. Do not do it if the TPM read script failed! If it failed, your system will not boot using this initramfs, and you will have to use the alternate (.orig) one. You should continue only if your whole work so far has been error-free. Be warned!

sudo mkinitramfs -o /boot/initrd.img-`uname -r` `uname -r`

You are now ready to reboot and to test your new automated key pull.

Troubleshooting

Things might not work well. There are a few methods to debug.

Boot the system with the original initramfs

If you discover your system cannot boot and you want to boot your system as it was, use your original initramfs file by changing GRUB during the initial menu (you might need to press the ‘Esc’ key at a critical point). Change initrd.img-<your kernel here> to initrd.img-<your kernel here>.orig and boot the system. It should prompt for the passphrase, where you can enter the passphrase used during installation.

Debug the boot process

By editing your GRUB menu and appending the word ‘debug=vc’ (without the quotes) to the kernel line, as well as removing the ‘quiet’, ‘splash’ and ‘$vt_handoff’ directives, you will be able to view the boot process on-screen. It can help a lot.

Stop the boot process at a certain stage

The initramfs goes through a series of “steps”, and you can stop it wherever you want. A reasonable place would be right before the ‘premount’ stage, which should be just right for you. Add the directive ‘break’ (without the quotes) to the kernel line in the GRUB menu, and remove the excessive directives as mentioned above. This flag can be combined with ‘debug’ mentioned above, and might give you a lot of insight into your problems. You can read more here about initramfs and how to debug it.
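Putting the last two tips together, the edited kernel line in the GRUB editor might look roughly like this (the kernel version and root device are illustrative – keep whatever your own entry already contains, minus ‘quiet’, ‘splash’ and ‘$vt_handoff’):

linux /vmlinuz-4.15.0-50-generic root=/dev/mapper/ubuntu--vg-root ro debug=vc break=premount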

Conclusion

These directives should work well for you. TPM2 enabled, SecureBoot enabled, fresh Ubuntu installation – you should be just fine. Please let me know how it works for you, and hope it helps!

Redhat Cluster and Citrix XenServer

Thursday, April 9th, 2015

I wanted to write down a guide for RHCS on RHEL/Centos6 and XenServer.

If you want to do that, you will encounter two major challenges. I want to save you the search and sum it all up here.

The first difficulty is the shared disk. In order to set up most common cluster scenarios, you will need shared storage. You could map the VMs to iSCSI LUNs external to the environment; however, if you do not have such infrastructure (either because everything is based on SAS/FC, or you do not have the ability to set up iSCSI storage with a reasonable level of availability), you will want XenServer to allow you to share a VDI between two VMs.

In order to do so, you will need to add a flag to all your pool’s XenServers, and to create the VDI in a specific way. First – the flag – you need to create a file in /etc/xensource called “allow_multiple_vdi_attach”. Do not forget to add it to all your XenServers:

touch /etc/xensource/allow_multiple_vdi_attach

Next, you will need to create your VDI as “raw” type. This is an example. You need to change the SR UUID to the one you use:

xe vdi-create sm-config:type=raw sr-uuid=687a023b-0b20-5e5f-d1ef-3db777ce7ae4 name-label="My Raw LVM VDI" virtual-size=8GiB type=user

You can find Citrix article about it here.

Following that, you can complete your cluster setup and configuration. I will not add details about it here, as this is not the focus of this article. However, when it comes to fencing, you will need a solution. The solution I used was a fencing agent written specifically for XenServer using XenAPI, called fence-xenserver. I did not use the fencing agents repository (which this page also points to), because I was unable to compile the required components to run on Centos6. They just don’t compile well. This is, however, a simple Python script which actually works.

In order to make it work, I did the following (a shell sketch of these steps appears right after the list):

  • Extracted the archive (version 0.8)
  • Placed fence_cxs* in /usr/sbin, and removed their ‘.py’ suffix
  • Placed XenAPI.py as-is in /usr/sbin
  • Verified /usr/sbin/fence_cxs* had execution permissions.
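As shell commands, those steps look roughly like this (a sketch – the archive file name and the exact script names are assumptions and may differ in your copy of version 0.8):

tar xzf fence-xenserver-0.8.tar.gz    # archive file name is an assumption
cd fence-xenserver-0.8
cp fence_cxs*.py XenAPI.py /usr/sbin/
cd /usr/sbin
rename .py '' fence_cxs*.py           # util-linux rename on Centos6: strip the .py suffix
chmod 755 fence_cxs*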

Now, I needed to add it to the cluster configuration. Since the agent cannot handle accessing a non-pool-master host, it had to be defined for each pool member (I cannot tell in advance which of them will hold the pool master role when a failover happens). So, these are the relevant parts of my cluster.conf:

<fencedevices>
<fencedevice agent="fence_cxs_redhat" login="root" name="xenserver01" passwd="password" session_url="https://xenserver01"/>
<fencedevice agent="fence_cxs_redhat" login="root" name="xenserver02" passwd="password" session_url="https://xenserver02"/>
<fencedevice agent="fence_cxs_redhat" login="root" name="xenserver03" passwd="password" session_url="https://xenserver03"/>
<fencedevice agent="fence_cxs_redhat" login="root" name="xenserver04" passwd="password" session_url="https://xenserver04"/>
</fencedevices>
<clusternodes>
<clusternode name="clusternode1" nodeid="1">
<fence>
<method name="xenserver01">
<device name="xenserver01" vm_name="clusternode1"/>
</method>
<method name="xenserver02">
<device name="xenserver02" vm_name="clusternode1"/>
</method>
<method name="xenserver03">
<device name="xenserver03" vm_name="clusternode1"/>
</method>
<method name="xenserver04">
<device name="xenserver04" vm_name="clusternode1"/>
</method>
</fence>
</clusternode>
<clusternode name="clusternode2" nodeid="2">
<fence>
<method name="xenserver01">
<device name="xenserver01" vm_name="clusternode2"/>
</method>
<method name="xenserver02">
<device name="xenserver02" vm_name="clusternode2"/>
</method>
<method name="xenserver03">
<device name="xenserver03" vm_name="clusternode2"/>
</method>
<method name="xenserver04">
<device name="xenserver04" vm_name="clusternode2"/>
</method>
</fence>
</clusternode>
</clusternodes>

Attached xenserver-fencing-cluster.xml for clarity (WordPress makes a mess out of that)

Note that I used four (4) entries, since my pool has four hosts. Also note the VM name (it is case sensitive), and the methods – one for each host, since you don’t want them running in parallel, but one at a time. Failover time is between 5-15 seconds in my tests, depending on which host is actually the pool master (xenserver04 takes the longest, obviously). I did not test it with the pool master down (before or without HA kicking in), nor with hosts down, where the TCP timeout is longer (than when attempting to connect to a host which responds immediately that it is not the pool master). However, if iLO fencing takes about 30-60 seconds, I am not complaining about the current timeouts.

NetApp “Broken disk label”

Thursday, August 1st, 2013

When using ‘disk show -v’ on a NetApp filer running version 7.3.x, following the replacement or addition of disk(s), you might see the above-mentioned message. It is caused by an incorrect disk label – an OnTap version 8 label on an OnTap version 7.3.x system. The system cannot handle the incorrect label, and thus ignores the disk.

A set of actions is required to clean the label and allow the NetApp to use this specific disk. The easiest method (although it will not be described here) would be to place the disk back in an OnTap 8 NetApp device and clean the label from there; however, that is not always possible.

On your OnTap 7.3.x system, do the following (assuming you know the address of the disk, right?) – taken from NetApp’s forums here.

disk assign <diskid>
priv set diag
labelmaint isolate <diskid>
label wipe <diskid>
label wipev1 <diskid>
label makespare <diskid>
labelmaint unisolate
priv set

The fifth or sixth line might fail to run, but the process as a whole will still succeed.

HP EVA SSSU and fixed LUN WWID

Monday, July 14th, 2008

Linux works perfectly well with multiple storage links using dm-multipath. Not only that, but HP has released their own spawn of dm-multipath, which is optimized (or so claimed, but, anyhow, well configured) to work with EVA and MSA storage devices.

This is great; however, what do you do when mapping volume snapshots through dm-multipath? Each new snapshot gets a new WWID, which will map to a new “mpath” name, or a raw WWID (if “user_friendly_names” is not set). This can, and will, wreak havoc on remote scripts. On each reboot, the new and shiny snapshot will acquire a new name, making scripting a hellish experience.

For the time being I have not tested ext3 labels. I suspect that using labels will fail, as the dm-multipath overlay device does not hide the underlying sd devices, and thus the system might detect the same label more than once – once for each underlying device, and once for the dm-multipath overlay.

A solution which is both elegant and useful is to fix the snapshots’ WWID through a small addition to the SSSU command. Append a string such as this to the snap create command:

WORLD_WIDE_LUN_Name="6300-0000-0000-0000-0010-0000"

Don’t use the numbers supplied here – “invent” your own 🙂

Mind you that you must use dashes, else the command will fail.
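For context, inside an SSSU script the complete snapshot-creation line would then look roughly like the following. This is only a sketch – the vdisk path and snapshot name are made up, and the exact ADD SNAPSHOT options depend on your Command View EVA version:

ADD SNAPSHOT my_snap VDISK="\Virtual Disks\my_vdisk\ACTIVE" WORLD_WIDE_LUN_Name="6300-0000-0000-0000-0010-0000"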

Doing so will allow you to always use the same WWID for the snapshots, and thus save tons of hassle after a system reboot when accessing snapshots through dm-multipath.
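Once the WWID is fixed, it can also be pinned to a stable friendly name in /etc/multipath.conf, so scripts can refer to a constant device path. A minimal sketch – the wwid value below is illustrative; use the exact string reported by ‘multipath -ll’ on your system:

multipaths {
    multipath {
        # use the exact WWID string reported by 'multipath -ll' for this snapshot LUN
        wwid   36300000000000000001000000000000
        alias  eva_snap
    }
}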