ZFS with Redhat Cluster Suite
This is a very nice project I have been working on. The hardware at hand – two servers, with a shared SAS bus containing several SAS disks. Since it’s a shared bus, no conventional RAID solution would cut it, and as I didn’t want to waste disks with ASM (“normal” redundancy meaning half the usable size…), I went with ZFS storage.
ZFS is a wonderful technology, with many advantages, but also with some dangerous pitfalls. As I prefer Linux, I did not bother with any Solaris solutions, and went directly to CentOS 6. I will describe my cluster setup below.
I will disclose the entire setup, including hardware layout, Linux platform, ZFS module parameters, the Redhat Cluster Suite ZFS agent I wrote and the cluster.conf configuration file. I will also share my considerations regarding some of the choices I made. In addition, this system was designed to act as NFS storage for a Citrix XenServer pool, so I will have to describe the changes I had to make on the XenServer itself (which might make it unsupported, but I will have to live with that), to allow it to handle the timeouts resulting from server failover.
So first – the servers – each with a single quad-core CPU, 24GB RAM and dual 1Gb/s NICs. Also – a tiny internal SATA disk is used for the OS. The shared disks – at the moment, 10 dual-port SAS disks (notice – older HP disks might state in very small letters that they are only single-port SAS disks…), 72GB, 10K RPM. The zpool, called ‘share’, consists of two 5-disk RAIDZ1 vdevs. As I mentioned before – ZFS seemed like the best possible option for achieving my goals at minimal cost.
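For reference, a pool with that layout would be created along these lines – a sketch only, the device names are placeholders (on a shared SAS bus you would rather use persistent /dev/disk/by-id/ paths), and cachefile=none keeps the pool out of zpool.cache, which matters later on:

# placeholder device names – two RAIDZ1 vdevs of five disks each
zpool create -o cachefile=none share \
    raidz1 sdb sdc sdd sde sdf \
    raidz1 sdg sdh sdi sdj sdk
zpool status share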
When I came to this project, I wanted to use a native ZFS cluster agent, and not a ‘script’ agent, which takes a very long time to respond (30 seconds). Also – I wanted to be able to handle multiple storage pools concurrently – each floating between the nodes on its own. While I have only one at the moment, I wanted fine-grained control over multiple pools. In addition – I was unable (or unwilling?) to manage each of the multiple filesystems a pool introduces by hand. I wanted to be able to import or export the pool silently, and with a clear head, so I had to verify, as part of the export process, that none of those filesystems is in use.
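Manually, the check the agent automates boils down to something like this (a sketch, using the ‘share’ pool – any lsof output means a dataset is still in use and the export would fail):

for mp in $(zfs list -H -r -t filesystem -o mountpoint share); do
    lsof "$mp"
done
zpool export share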
For the agent, I wanted to comply with the Redhat Cluster Suite (RHCS from now on) OCF syntax. I used the supplied fs.sh script as inspiration for my agent script, so some of it might look familiar. All credit goes to the original authors, of course.
The operating system I selected was CentOS 6. CentOS is based on Redhat Linux, and I find it mature and stable, which is exactly what I want when planning a production-ready, enterprise-class storage solution. The version had to be x86_64, due to ZFS requirements and the amount of RAM in the servers.
To handle ZFS options, I added a file called /etc/modprobe.d/zfs.conf, with the following content:
install zfs /bin/rm -f /etc/zfs/zpool.cache && /sbin/modprobe --ignore-install zfs
options zfs zfs_arc_max=12593790976
options zfs zfs_arc_min=12593790975
I had to verify there is no zpool.cache file. Since my pool was rather small (planned for 24 disks max), I was not concerned by the longer import process caused by not having the zpool.cache file. I was far more concerned with an automatic import happening at boot, which I had to prevent at almost any cost. In addition, I learned from other systems that the ARC memory should never exceed half the RAM, and it should be given just a little under that.
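For reference, 12,593,790,976 bytes is roughly 11.7GiB – just under half of the 24GB RAM – and with zfs_arc_min set one byte below zfs_arc_max, the ARC is effectively pinned at that size.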
Of course, when changing such module settings, you need to recreate initrd (dracut -f) to be on the safe side later on.
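To be explicit, rebuilding the initramfs for the running kernel looks like this:

dracut -f /boot/initramfs-$(uname -r).img $(uname -r)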
The zfs.sh agent script was placed in the /usr/share/cluster directory. You must have rgmanager installed for this directory to exist, and anyhow, without rgmanager you will have no cluster whatsoever.
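Getting the prerequisites and the agent in place is then a matter of (assuming the script below is saved locally as zfs.sh):

yum install -y cman rgmanager
install -m 0755 zfs.sh /usr/share/cluster/zfs.sh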
This is the contents of the zfs.sh file. Notice that it is not compatible with Luci, so if you’re using it – them kids won’t play well together.
#!/bin/bash

LC_ALL=C
LANG=C
PATH=/bin:/sbin:/usr/bin:/usr/sbin
export LC_ALL LANG PATH

# Private return codes
FAIL=2
NO=1
YES=0

YES_STR="yes"

. $(dirname $0)/ocf-shellfuncs

meta_data()
{
    # NOTE: the XML tags here were lost in the original post's formatting; the
    # skeleton is reconstructed following the standard rgmanager resource-agent
    # layout (as in fs.sh), while the descriptive text is the original.
    cat <<EOT
<?xml version="1.0"?>
<resource-agent version="rgmanager 2.0" name="zfs">
    <version>1.0</version>

    <longdesc lang="en">
        This script will import and export ZFS storage pools
        It will make sure to mount and umount all child filesystems
    </longdesc>
    <shortdesc lang="en">
        This is a ZFS pool
    </shortdesc>

    <parameters>
        <parameter name="name" primary="1">
            <longdesc lang="en">
                Symbolic name for this zfs pool
            </longdesc>
            <shortdesc lang="en">
                File System Name
            </shortdesc>
            <content type="string"/>
        </parameter>

        <parameter name="pool" required="1">
            <longdesc lang="en">
                ZFS Pool name or ID
            </longdesc>
            <shortdesc lang="en">
                ZFS pool name
            </shortdesc>
            <content type="string"/>
        </parameter>

        <parameter name="mount">
            <longdesc lang="en">
                ZFS Pool alternate mount
            </longdesc>
            <shortdesc lang="en">
                ZFS pool alternate mount
            </shortdesc>
            <content type="string"/>
        </parameter>

        <parameter name="force_unmount">
            <longdesc lang="en">
                If set, the cluster will kill all processes using this
                file system when the resource group is stopped. Otherwise,
                the unmount will fail, and the resource group will be
                restarted.
            </longdesc>
            <shortdesc lang="en">
                Force Unmount
            </shortdesc>
            <content type="boolean"/>
        </parameter>

        <parameter name="self_fence">
            <longdesc lang="en">
                If set and unmounting the file system fails, the node will
                immediately reboot. Generally, this is used in conjunction
                with force-unmount support, but it is not required.
            </longdesc>
            <shortdesc lang="en">
                Seppuku Unmount
            </shortdesc>
            <content type="boolean"/>
        </parameter>
    </parameters>
</resource-agent>
EOT
}

ocf_log()
{
    echo $*
}

verify_driver()
{
    ocf_log info "Verifying ZFS driver"
    lsmod | grep -w zfs > /dev/null 2>&1 && return 0
    ocf_log err "ZFS driver is not loaded"
    return $OCF_ERR_ARGS
}

verify_poolname()
{
    ocf_log info "Verify pool name"
    if [ -z "$OCF_RESKEY_pool" ]
    then
        ocf_log err "Missing pool name"
        return $OCF_ERR_ARGS
    fi
    zpool import | grep pool: | grep -w $OCF_RESKEY_pool > /dev/null 2>&1 && return 0
    ocf_log err "Cannot identify pool name"
    return $OCF_ERR_ARGS
}

verify_mounted_poolname()
{
    ocf_log info "Verify pool name"
    if [ -z "$OCF_RESKEY_pool" ]
    then
        ocf_log err "Missing pool name"
        return $OCF_ERR_ARGS
    fi
    zpool list $OCF_RESKEY_pool > /dev/null 2>&1 && return 0
    ocf_log err "Cannot identify pool name"
    return $OCF_ERR_ARGS
}

verify_mountpath()
{
    ocf_log info "Verifying alternate root mount path"
    [ -z "$OCF_RESKEY_mount" ] && return 0
    declare mp="${OCF_RESKEY_mount}"
    case "$mp" in
    /*)    # found it
        ;;
    *)     # invalid format
        ocf_log err "verify_mountpath: Invalid mount point format (must begin with a '/'): '$mp'"
        return $OCF_ERR_ARGS
        ;;
    esac
}

pool_import()
{
    ocf_log info "Importing pool"
    OPTS=""
    [ -n "$OCF_RESKEY_mount" ] && OPTS="-R $OCF_RESKEY_mount"
    zpool import $OCF_RESKEY_pool $OPTS
    RET="$?"
    if [ "$RET" -ne "0" ]
    then
        ocf_log info "Cannot import without applying force"
        zpool import -f $OCF_RESKEY_pool $OPTS
        RET="$?"
    fi
    if [ "$RET" -ne "0" ]
    then
        ocf_log err "Pool import failed for $OCF_RESKEY_pool. error=$RET"
        return 1
    fi
    ocf_log info "Imported ZFS pool"
    return $RET
}

check_and_release_fs()
{
    ocf_log info "Checking and releasing FS"
    FS=""
    case ${OCF_RESKEY_force_unmount} in
    $YES_STR|on|true|1)
        force_umount=$YES
        ;;
    *)
        force_umount=""
        ;;
    esac

    RET=0
    for i in `zfs list -t filesystem | grep ^${OCF_RESKEY_pool} | awk '{print $NF}'`
    do
        # To be on the safe side. Why not?
        sleep 1
        # Is it mounted?
        if ! df -l | grep -w "$i" > /dev/null 2>&1
        then
            ocf_log info "Filesystem $i is not mounted"
            continue
        fi
        if [ `lsof $i | wc -l` -gt "0" ]
        then
            ocf_log info "Filesystem $i is in use"
            if [ "$force_umount" ]
            then
                ocf_log info "Attempting to kill processes on $i filesystem"
                fuser -k $i
                sleep 2
                if [ `lsof $i | wc -l` -gt "0" ]
                then
                    ocf_log err "Cannot umount filesystem $i - filesystem in use"
                    return 1
                fi
            else
                ocf_log err "Cannot umount filesystem $i - filesystem in use"
                return 1
            fi
        fi
    done
    return $RET
}

self_fence()
{
    ocf_log info "Should we validate and call self-fence?"
    case ${OCF_RESKEY_self_fence} in
    $YES_STR|on|true|1)
        self_fence=$YES
        ;;
    *)
        self_fence=""
        ;;
    esac

    if [ "$self_fence" ]; then
        ocf_log alert "umount failed - REBOOTING"
        sync
        reboot -fn
    fi
    return $OCF_ERR_GENERIC
}

pool_export()
{
    ocf_log info "Exporting zfs pool"
    check_and_release_fs || self_fence
    zpool export $OCF_RESKEY_pool
    RET="$?"
    if [ "$RET" -ne "0" ]
    then
        ocf_log err "Pool export failed for $OCF_RESKEY_pool. error=$RET"
        return 1
    fi
    return $RET
}

start()
{
    ocf_log info "Starting ZFS"
    verify_driver || return $OCF_ERR_ARGS
    verify_poolname || return $OCF_ERR_ARGS
    verify_mountpath || return $OCF_ERR_ARGS
    pool_import
    # Handle filesystem?
}

stop()
{
    ocf_log info "Stopping ZFS"
    verify_driver || return $OCF_ERR_ARGS
    verify_mounted_poolname || return $OCF_ERR_ARGS
    verify_mountpath || return $OCF_ERR_ARGS
    # Handle filesystem?
    pool_export
}

is_imported()
{
    ocf_log debug "Checking if $OCF_RESKEY_pool is imported"
    zpool list ${OCF_RESKEY_pool} > /dev/null 2>&1
    return $?
}

is_alive()
{
    ocf_log debug "Checking ZFS pool read/write"
    declare file=".writable_test.$(hostname)"
    declare TIMEOUT="10s"
    [ -z "$OCF_CHECK_LEVEL" ] && export OCF_CHECK_LEVEL=0
    mount_point=`zfs list ${OCF_RESKEY_pool} | grep ${OCF_RESKEY_pool} | awk '{print $NF}'`
    test -d "$mount_point"
    if [ $? -ne 0 ]; then
        ocf_log err "${OCF_RESOURCE_INSTANCE}: is_alive: $mount_point is not a directory"
        return $FAIL
    fi
    [ $OCF_CHECK_LEVEL -lt 10 ] && return $YES

    # depth 10 test (read test)
    timeout -s 9 $TIMEOUT ls "$mount_point" > /dev/null 2> /dev/null
    errcode=$?
    if [ $errcode -ne 0 ]; then
        ocf_log err "${OCF_RESOURCE_INSTANCE}: is_alive: failed read test on [$mount_point]. Return code: $errcode"
        return $NO
    fi

    [ $OCF_CHECK_LEVEL -lt 20 ] && return $YES

    # depth 20 check (write test)
    rw=$YES
    for o in `echo $OCF_RESKEY_options | sed -e s/,/ /g`; do
        if [ "$o" = "ro" ]; then
            rw=$NO
        fi
    done
    if [ $rw -eq $YES ]; then
        file="$mount_point"/$file
        while true; do
            if [ -e "$file" ]; then
                file=${file}_tmp
                continue
            else
                break
            fi
        done
        timeout -s 9 $TIMEOUT touch $file > /dev/null 2> /dev/null
        errcode=$?
        if [ $errcode -ne 0 ]; then
            ocf_log err "${OCF_RESOURCE_INSTANCE}: is_alive: failed write test on [$mount_point]. Return code: $errcode"
            return $NO
        fi
        rm -f $file > /dev/null 2> /dev/null
    fi

    return $YES
}

monitor()
{
    ocf_log debug "Checking ZFS pool $OCF_RESKEY_pool, Level $OCF_CHECK_LEVEL"
    verify_driver || return $OCF_ERR_ARGS
    is_imported
    RET=$?
    if [ "$RET" -ne $YES ]; then
        ocf_log err "${OCF_RESOURCE_INSTANCE}: ${OCF_RESKEY_device} is not mounted on ${OCF_RESKEY_mountpoint}"
        return $OCF_NOT_RUNNING
    fi
    is_alive
    RET=$?
    return $RET
}

if [ -z "$OCF_CHECK_LEVEL" ]; then
    OCF_CHECK_LEVEL=0
fi

case $1 in
start)
    ocf_log info "zfs start $OCF_RESKEY_pool\n"
    OCF_CHECK_LEVEL=0
    monitor
    [ "$?" -ne "0" ] && start || ocf_log info "$OCF_RESKEY_pool is already mounted"
    exit $?
    ;;
stop)
    ocf_log info "zfs stop $OCF_RESKEY_pool\n"
    OCF_CHECK_LEVEL=0
    monitor
    [ "$?" -eq "0" ] && stop || ocf_log info "$OCF_RESKEY_pool is not mounted"
    exit $?
    ;;
status|monitor)
    ocf_log debug "ZFS monitor $OCF_RESKEY_pool"
    monitor
    exit $?
    ;;
meta-data)
    echo -e "zfs metadata $OCF_RESKEY_address\n" >> /tmp/out
    meta_data
    exit 0
    ;;
validate-all)
    exit 0
    ;;
*)
    echo "usage: $0 {start|stop|status|monitor|restart|meta-data|validate-all}"
    exit $OCF_ERR_UNIMPLEMENTED
    ;;
esac
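Before handing the agent to rgmanager, it can be exercised by hand the same way the cluster would call it – a quick sanity check, with the OCF_RESKEY_* variables mirroring the parameters defined in the metadata above (the values here are just examples for the ‘share’ pool):

export OCF_RESKEY_pool=share OCF_RESKEY_mount="" \
       OCF_RESKEY_force_unmount=1 OCF_RESKEY_self_fence=0
/usr/share/cluster/zfs.sh start
/usr/share/cluster/zfs.sh status
/usr/share/cluster/zfs.sh stop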
All I had to do now was to build the cluster.conf file.
The reason I placed the IP address last to start and first to stop is that the other way around, the NFS client would receive an orderly disconnection command and would not bother to re-establish a connection with the remaining server. Abruptly taking away the clustered IP address causes the NFS clients to initiate a reconnection process, from which the systems are supposed to recover.
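Since the actual cluster.conf never made it into the article, here is a minimal sketch of what such a configuration could look like with this agent – the node names, the IP address and the absence of real fence devices are all placeholders, and nesting the ip resource inside the zfs resource is what makes the IP start last and stop first, as rgmanager starts children after their parent and stops them before it:

<?xml version="1.0"?>
<cluster config_version="1" name="zfs-cluster">
  <cman expected_votes="1" two_node="1"/>
  <clusternodes>
    <clusternode name="node1.example.com" nodeid="1"/>
    <clusternode name="node2.example.com" nodeid="2"/>
  </clusternodes>
  <rm>
    <failoverdomains>
      <failoverdomain name="zfsdom" nofailback="1" restricted="1">
        <failoverdomainnode name="node1.example.com"/>
        <failoverdomainnode name="node2.example.com"/>
      </failoverdomain>
    </failoverdomains>
    <service domain="zfsdom" name="share-nfs" recovery="relocate">
      <zfs name="share" pool="share" mount="/share" force_unmount="1" self_fence="1">
        <ip address="192.168.0.100" monitor_link="1"/>
      </zfs>
    </service>
  </rm>
</cluster>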
I have left this article incomplete for a while now. It has some stuff I would like to share, so I am sharing it as-is. I will (some day) complete it.