
ZFS with Redhat Cluster Suite

This is a very nice project I have been working on. The hardware at hand: two servers with a shared SAS bus containing several SAS disks. Since it’s a shared bus, no RAID solution would cut it, and as I didn’t want to waste disks with ASM (“normal” redundancy means half the usable size), I went with ZFS storage.

ZFS is a wonderful technology with many advantages, but also with some dangerous pitfalls. As I prefer Linux, I did not bother with any Solaris solutions, and went directly to CentOS 6. I will describe my cluster setup below.

I will disclose the entire setup, including the hardware layout, the Linux platform, the ZFS module parameters, the Redhat Cluster Suite ZFS agent I wrote and the cluster.conf configuration file. I will also share my considerations regarding some of the choices I made. In addition, this system was designed to act as NFS storage for a Citrix XenServer pool, so I will have to describe the changes I had to perform on the XenServer itself (which might make it unsupported, but I will have to live with it) to allow it to handle the timeouts resulting from server failover.

So first, the servers: each has a single quad-core CPU, 24GB RAM, and dual 1Gb/s NICs. A tiny internal SATA disk is used for the OS. The shared disks: at the moment, 10 dual-port SAS disks, 72GB, 10K RPM (notice that older HP disks might state in very small letters that they are only single-port SAS disks). The zpool is called ‘share’ and has two 5-disk RaidZ1 vdevs. As I mentioned before, ZFS seemed like the best possible option, allowing me to achieve my goals at minimal cost.
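
A pool with this layout could be created along the following lines. This is only a sketch, not the exact command used here: the helper name and the device names (d0 to d9) are placeholders I made up, and the helper just prints the command so it can be reviewed before anything touches the disks.

```shell
#!/bin/sh
# Sketch: print a "zpool create" command that groups disks into raidz1 vdevs.
# The helper name and the device names are hypothetical; nothing is executed.
build_zpool_create() (
    pool=$1      # pool name, e.g. "share"
    width=$2     # disks per raidz1 vdev, e.g. 5
    shift 2
    cmd="zpool create $pool"
    i=0
    for dev in "$@"; do
        # Start a new raidz1 vdev every $width disks
        if [ $((i % width)) -eq 0 ]; then
            cmd="$cmd raidz1"
        fi
        cmd="$cmd $dev"
        i=$((i + 1))
    done
    echo "$cmd"
)

# Ten placeholder disks -> two 5-disk raidz1 vdevs
build_zpool_create share 5 d0 d1 d2 d3 d4 d5 d6 d7 d8 d9
```

Each RaidZ1 vdev gives up roughly one disk to parity, so ten 72GB disks in this layout leave about eight disks' worth of usable space before filesystem overhead.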

When I came to this project, I wanted to use a native ZFS cluster agent, and not a ‘script’ agent, which takes a very long time to respond (30 seconds). I also wanted to be able to handle multiple storage pools concurrently, each floating on its own. While I have only one at the moment, I wanted fine-grained control over multiple pools. In addition, I am unable (or unwilling?) to handle individually the multiple filesystems introduced with each pool. I wanted to be able to import or export the pool silently and with peace of mind, so I had to verify, as part of the export process, that none of its filesystems is in use.

I wanted the agent to comply with the Redhat Cluster Suite (RHCS from now on) OCF syntax. I used the supplied fs.sh script as an inspiration for my agent script, so some of it might look familiar. All credit goes to the original authors, of course.

The operating system I selected was CentOS 6. CentOS is based on Redhat Linux, and I find it mature and stable, which is exactly what I want when I plan a production-ready, enterprise-class storage solution. The version had to be x86_64, due to ZFS requirements and due to the amount of RAM in the server.

To handle ZFS options, I added a file called /etc/modprobe.d/zfs.conf, with the following content:

install zfs /bin/rm -f /etc/zfs/zpool.cache && /sbin/modprobe --ignore-install zfs
options zfs zfs_arc_max=12593790976
options zfs zfs_arc_min=12593790975

The install line makes sure there is no zpool.cache file whenever the module is loaded. Since my pool was rather small (planned for 24 disks at most), I was not concerned by the longer import process caused by the missing zpool.cache file. I was more concerned about an automatic import, which had to be prevented at almost any cost, because a pool imported by both nodes at the same time is likely to get corrupted. In addition, I learned from other systems that the ARC memory should never exceed half the RAM, so it was given just a little under that.
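
The two ARC values can be derived from the amount of RAM instead of being typed by hand. The sketch below assumes "half of MemTotal" as the ceiling and sets zfs_arc_min one byte below it, mirroring the two option lines above. The function names are hypothetical, and whether this is the right ratio for another workload is a separate question.

```shell
#!/bin/sh
# Sketch: derive ARC limits from total RAM. Assumption: half of MemTotal is
# the ceiling, and zfs_arc_min sits one byte below it. Function names are
# hypothetical.
arc_max_bytes() {
    # $1 = total memory in kB (the unit used by /proc/meminfo)
    echo $(( $1 * 1024 / 2 ))
}

print_arc_options() {
    max=$(arc_max_bytes "$1")
    echo "options zfs zfs_arc_max=$max"
    echo "options zfs zfs_arc_min=$((max - 1))"
}

# Read MemTotal from the running system, if available
if [ -r /proc/meminfo ]; then
    mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
    print_arc_options "$mem_kb"
fi
```

For a MemTotal of 24597248 kB (a hypothetical value for a 24GB machine), `print_arc_options 24597248` prints 12593790976 and 12593790975, the same numbers as in the file above.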

Of course, when changing such module settings, you need to recreate the initrd (dracut -f) to be on the safe side later on.
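
After the next reboot or module reload it is worth checking that the module really picked up the configured values. The sketch below compares the "options zfs" lines of the config file with what the loaded module reports. The function name is hypothetical, and the second default path assumes the usual sysfs location for module parameters (/sys/module/zfs/parameters).

```shell
#!/bin/sh
# Sketch: compare "options zfs key=value" lines in a modprobe config file
# against the values reported by the loaded module. The function name is
# hypothetical; both paths can be overridden by arguments.
check_zfs_options() (
    conf=${1:-/etc/modprobe.d/zfs.conf}
    params=${2:-/sys/module/zfs/parameters}
    rc=0
    # Keep only "options zfs ..." lines and emit one key=value per word
    for kv in $(awk '$1 == "options" && $2 == "zfs" { for (i = 3; i <= NF; i++) print $i }' "$conf"); do
        key=${kv%%=*}
        want=${kv#*=}
        have=$(cat "$params/$key" 2>/dev/null)
        if [ "$have" = "$want" ]; then
            echo "OK   $key=$have"
        else
            echo "DIFF $key: configured=$want loaded=${have:-unknown}"
            rc=1
        fi
    done
    exit $rc
)

# Example, using the default paths:
#   check_zfs_options
```

The function returns non-zero if any value differs, so it can be dropped into a post-boot sanity check.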

The zfs.sh agent script was placed in the /usr/share/cluster directory. You must have rgmanager installed for this directory to exist, and anyhow, without rgmanager, you will have no cluster whatsoever.

These are the contents of the zfs.sh file. Notice that it is not compatible with Luci, so if you’re using it, them kids won’t play well together.

#!/bin/bash
 
LC_ALL=C
LANG=C
PATH=/bin:/sbin:/usr/bin:/usr/sbin
export LC_ALL LANG PATH
# Private return codes
FAIL=2
NO=1
YES=0
YES_STR="yes"
 
. $(dirname $0)/ocf-shellfuncs
 
meta_data()
{
    cat <<EOT
 
    1.0
 
    This script will import and export ZFS storage pools
    It will make sure to mount and umount all child filesystems
 
        This is a ZFS pool
 
                Symbolic name for this zfs pool
 
                File System Name
 
        ZFS Pool name or ID
 
                ZFS pool name
 
        ZFS Pool alternate mount
 
                ZFS pool alternate mount
 
                If set, the cluster will kill all processes using
                this file system when the resource group is
                stopped.  Otherwise, the unmount will fail, and
                the resource group will be restarted.
 
                Force Unmount
 
                If set and unmounting the file system fails, the node will
                immediately reboot.  Generally, this is used in conjunction
                with force-unmount support, but it is not required.
 
                Seppuku Unmount
 
     
         
 
         
 
EOT
}
 
# Simple logger. Note that this overrides the ocf_log supplied by ocf-shellfuncs
ocf_log()
{
    echo "$*"
}

verify_driver() {
    ocf_log info "Verifying ZFS driver"
    lsmod | grep -w zfs > /dev/null 2>&1 && return 0
    ocf_log err "ZFS driver is not loaded"
    return $OCF_ERR_ARGS
}
 
verify_poolname() {
    ocf_log info "Verify pool name"
    if [ -z "$OCF_RESKEY_pool" ]
    then
        ocf_log err "Missing pool name"
        return $OCF_ERR_ARGS
    fi
    # The pool must be listed by "zpool import", i.e. visible but not yet imported
    zpool import | grep pool: | grep -w "$OCF_RESKEY_pool" > /dev/null 2>&1 && return 0
    ocf_log err "Cannot identify pool name"
    return $OCF_ERR_ARGS
}

verify_mounted_poolname() {
    ocf_log info "Verify pool name"
    if [ -z "$OCF_RESKEY_pool" ]
    then
        ocf_log err "Missing pool name"
        return $OCF_ERR_ARGS
    fi
    # The pool must already be imported on this node
    zpool list "$OCF_RESKEY_pool" > /dev/null 2>&1 && return 0
    ocf_log err "Cannot identify pool name"
    return $OCF_ERR_ARGS
}
 
verify_mountpath() {
    ocf_log info "Verifying alternate root mount path"
    [ -z "$OCF_RESKEY_mount" ] && return 0
    declare mp="${OCF_RESKEY_mount}"
    case "$mp" in
        /*)     # found it
            ;;
        *)      # invalid format
            ocf_log err "verify_mountpath: Invalid mount point format (must begin with a '/'): '$mp'"
            return $OCF_ERR_ARGS
            ;;
    esac
}
 
pool_import() {
    ocf_log info "Importing pool"
    OPTS=""
    # -R sets an alternate root for the pool's mount points
    [ -n "$OCF_RESKEY_mount" ] && OPTS="-R $OCF_RESKEY_mount"
    zpool import $OPTS "$OCF_RESKEY_pool"
    RET="$?"
    if [ "$RET" -ne "0" ]
    then
        # Retry with -f, e.g. when the pool was not cleanly exported by the other node
        ocf_log info "Cannot import without applying force"
        zpool import -f $OPTS "$OCF_RESKEY_pool"
        RET="$?"
    fi
    if [ "$RET" -ne "0" ]
    then
        ocf_log err "Pool import failed for $OCF_RESKEY_pool. error=$RET"
        return 1
    fi
    ocf_log info "Imported ZFS pool"
    return $RET
}
 
check_and_release_fs() {
    ocf_log info "Checking and releasing FS"
    case ${OCF_RESKEY_force_unmount} in
        $YES_STR|on|true|1) force_umount=$YES ;;
        *)                  force_umount="" ;;
    esac

    RET=0
    # Walk the mount points of every filesystem that belongs to this pool
    for i in $(zfs list -H -t filesystem -o mountpoint -r "$OCF_RESKEY_pool")
    do
        # To be on the safe side. Why not?
        sleep 1
        # Is it mounted?
        if ! df -l | grep -w "$i" > /dev/null 2>&1
        then
            ocf_log info "Filesystem $i is not mounted"
            continue
        fi
        if [ "$(lsof "$i" 2> /dev/null | wc -l)" -gt "0" ]
        then
            ocf_log info "Filesystem $i is in use"
            if [ "$force_umount" ]
            then
                ocf_log info "Attempting to kill processes on $i filesystem"
                # -m: act on every process using the mounted filesystem
                fuser -km "$i"
                sleep 2
                if [ "$(lsof "$i" 2> /dev/null | wc -l)" -gt "0" ]
                then
                    ocf_log err "Cannot umount filesystem $i - filesystem in use"
                    return 1
                fi
            else
                ocf_log err "Cannot umount filesystem $i - filesystem in use"
                return 1
            fi
        fi
    done
    return $RET
}
 
self_fence() {
    ocf_log info "Should we validate and call self-fence?"
    case ${OCF_RESKEY_self_fence} in
        $YES_STR|on|true|1) self_fence=$YES ;;
        *)                  self_fence="" ;;
    esac

    if [ "$self_fence" ]; then
        ocf_log alert "umount failed - REBOOTING"
        sync
        reboot -fn
    fi
    return $OCF_ERR_GENERIC
}
 
pool_export() {
    ocf_log info "Exporting zfs pool"
    check_and_release_fs || self_fence
    zpool export $OCF_RESKEY_pool
    RET="$?"
    if [ "$RET" -ne "0" ]
    then
        ocf_log err "Pool export failed for $OCF_RESKEY_pool. error=$RET"
        return 1
    fi
    return $RET
}
 
start() {
    ocf_log info "Starting ZFS"
    verify_driver || return $OCF_ERR_ARGS
    verify_poolname || return $OCF_ERR_ARGS
    verify_mountpath || return $OCF_ERR_ARGS
    pool_import
    # Handle filesystem?
}
 
stop() {
    ocf_log info "Stopping ZFS"
    verify_driver || return $OCF_ERR_ARGS
    verify_mounted_poolname || return $OCF_ERR_ARGS
    verify_mountpath || return $OCF_ERR_ARGS
    # Handle filesystem?
    pool_export
}
 
is_imported() {
    ocf_log debug "Checking if $OCF_RESKEY_pool is imported"
    zpool list "${OCF_RESKEY_pool}" > /dev/null 2>&1
    return $?
}
 
is_alive() {
    ocf_log debug "Checking ZFS pool read/write"
    declare file=".writable_test.$(hostname)"
    declare TIMEOUT="10s"
    [ -z "$OCF_CHECK_LEVEL" ] && export OCF_CHECK_LEVEL=0
    mount_point=$(zfs list -H -o mountpoint "${OCF_RESKEY_pool}")
    test -d "$mount_point"
        if [ $? -ne 0 ]; then
                ocf_log err "${OCF_RESOURCE_INSTANCE}: is_alive: $mount_point is not a directory"
                return $FAIL
        fi
    [ $OCF_CHECK_LEVEL -lt 10 ] && return $YES
 
        # depth 10 test (read test)
        timeout -s 9 $TIMEOUT ls "$mount_point" > /dev/null 2> /dev/null
        errcode=$?
        if [ $errcode -ne 0 ]; then
                ocf_log err "${OCF_RESOURCE_INSTANCE}: is_alive: failed read test on [$mount_point]. Return code: $errcode"
                return $NO
        fi
 
    [ $OCF_CHECK_LEVEL -lt 20 ] && return $YES
 
        # depth 20 check (write test)
        rw=$YES
        for o in `echo $OCF_RESKEY_options | sed -e 's/,/ /g'`; do
                if [ "$o" = "ro" ]; then
                        rw=$NO
                fi
        done
    if [ $rw -eq $YES ]; then
                file="$mount_point"/$file
                while true; do
                        if [ -e "$file" ]; then
                                file=${file}_tmp
                                continue
                        else
                                break
                        fi
                done
                timeout -s 9 $TIMEOUT touch $file > /dev/null 2> /dev/null
                errcode=$?
                if [ $errcode -ne 0 ]; then
                        ocf_log err "${OCF_RESOURCE_INSTANCE}: is_alive: failed write test on [$mount_point]. Return code: $errcode"
                        return $NO
                fi
                rm -f $file > /dev/null 2> /dev/null
        fi
 
    return $YES
}
 
monitor() {
    ocf_log debug "Checking ZFS pool $OCF_RESKEY_pool, Level $OCF_CHECK_LEVEL"
    verify_driver || return $OCF_ERR_ARGS
    is_imported
    RET=$?
    if [ "$RET" -ne $YES ]; then
        ocf_log err "${OCF_RESOURCE_INSTANCE}: pool ${OCF_RESKEY_pool} is not imported"
        return $OCF_NOT_RUNNING
    fi
    # Report the result of the read/write checks, not of the import check
    is_alive || return $OCF_ERR_GENERIC
    return 0
}
 
if [ -z "$OCF_CHECK_LEVEL" ]; then
    OCF_CHECK_LEVEL=0
fi
 
case $1 in
start)
    ocf_log info "zfs start $OCF_RESKEY_pool"
    OCF_CHECK_LEVEL=0
    # Import only if the pool is not here already, and report the real result
    if is_imported; then
        ocf_log info "$OCF_RESKEY_pool is already mounted"
        exit 0
    fi
    start
    exit $?
    ;;
stop)
    ocf_log info "zfs stop $OCF_RESKEY_pool"
    OCF_CHECK_LEVEL=0
    # Export only if the pool is imported, and report the real result
    if ! is_imported; then
        ocf_log info "$OCF_RESKEY_pool is not mounted"
        exit 0
    fi
    stop
    exit $?
    ;;
status|monitor)
    ocf_log debug "ZFS monitor $OCF_RESKEY_pool"
    monitor
    exit $?
    ;;
meta-data)
    echo -e "zfs metadata $OCF_RESKEY_address\n" >>/tmp/out
    meta_data
    exit 0
    ;;
validate-all)
    exit 0
    ;;
*)
    echo "usage: $0 {start|stop|status|monitor|restart|meta-data|validate-all}"
    exit $OCF_ERR_UNIMPLEMENTED
    ;;
esac
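
Before wiring the agent into the cluster, it can be exercised by hand, because the script reads its parameters from OCF_RESKEY_* environment variables and takes the action as its first argument. The wrapper below is only a sketch with a hypothetical name; the pool name 'share' and the agent path are the ones used in this article.

```shell
#!/bin/sh
# Sketch: call a resource agent by hand the way the script above expects,
# with parameters in OCF_RESKEY_* variables and the action as $1.
# The wrapper name is hypothetical.
run_agent() {
    # $1 = agent path, $2 = pool name, $3 = action, $4 = optional check level
    OCF_RESKEY_pool="$2" OCF_CHECK_LEVEL="${4:-0}" "$1" "$3"
}

# Example (run on the node that should own the pool):
#   run_agent /usr/share/cluster/zfs.sh share start
#   run_agent /usr/share/cluster/zfs.sh share status 20
#   run_agent /usr/share/cluster/zfs.sh share stop
```

A check level of 20 makes the status action go all the way to the write test, which is a quick way to see whether the pool is really usable on that node.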

All I had to do now was to build the cluster.conf file.

The reason I placed the IP address as the last to start and the first to stop was that the other way around, the NFS client would receive an ordered disconnection command, and would not bother to establish a connection with the remaining server. Abruptly taking away the clustered IP address causes the NFS clients to initiate a reconnection process, from which the systems are supposed to recover.
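
The ordering can be illustrated with plain shell, outside of any cluster manager. This is a sketch of the idea only, not the cluster.conf itself: the pool name, the export path, the address 192.0.2.10/24 and the interface eth0 are placeholders, the function names are hypothetical, and the commands are only printed unless RUN=1 is set.

```shell
#!/bin/sh
# Sketch of the service ordering described above: the floating IP comes up
# last and goes down first. All names and addresses are placeholders.
step() {
    # Print the command by default; execute it only when RUN=1
    if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

service_start() {
    step zpool import share &&
    step exportfs -o rw '*:/share' &&
    step ip addr add 192.0.2.10/24 dev eth0
}

service_stop() {
    # The address disappears before NFS is taken down, so clients see a
    # dead peer and keep retrying instead of an orderly shutdown
    step ip addr del 192.0.2.10/24 dev eth0 &&
    step exportfs -u '*:/share' &&
    step zpool export share
}

service_start
service_stop
```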

I have left this article incomplete for a while now. It has some stuff I would like to share, so I am sharing it as-is. I will (some day) complete it.
