Archive for the ‘Clusters’ Category

Oracle Clusterware as a 3rd party HA framework

Friday, June 12th, 2009

Oracle begin to push their Clusterware as a 3rd party HA framework. In this article we will review a quick example of how to do it. I will refer to this post as a quick-guide, as this is by no means any full-scale guide.

This article assumes you have installed Oracle Clusterware following one of the few links and guides available on the net. This quick-guide applies to both Clusterware 10 and Clusterware 11.

We will discuss the method of adding an additional NFS service on Linux.

In order to do so, you will need a shared storage – assuming the goal of the exercise is to supply the clients with a consistent storage services based on NFS. I, for myself, prefer to use OCFS2 as the choice file system for shared disks. This goes well with Oracle Clusterware, as this cluster framework does not handle disk mounts very well, and unless you are to write/search an agent which will make sure that every mount and umount behave correctly (you wouldn’t want to get a file system corruption, would you?), you will probably prefer to do the same. The lack of need to manage the disk mount actions will both save time on planned failover, and will guarantee storage safety. If you have not placed your CRS and Vote on OCFS2, you will need to install OCFS2 from here and here, and then to configure it. We will not discuss OCFS2 configuration in this post.

We will need to assume the following prerequisites:

  • Service-related IP address: 1.2.3.4. Netmask 255.255.255.248. You need this IP to be member of the same class as your public network card is.
  • Shared Storage: Formatted to OCFS2, and mounted on both nodes on /shared
  • Oracle Clusterware installed and working
  • Cluster nodes names are “node1″ and “node2″
  • Have $CRS_HOME point to your CRS installation
  • Have $CRS_HOME/bin in your $PATH

We need to create the service-related IP resource first. I would recommend to have an entry in /etc/hosts for this IP address on both nodes. Assuming the public NIC is eth0, The command would be

crs_profile -create nfs_ip -t application -a $CRS_HOME/bin/usrvip -o oi=eth0,ov=1.2.3.4,on=255.255.255.248

Now you will need to set running permissions for the oracle user. In my case, the user name is actually “oracle”:

crs_setperm nfs_ip -o root
crs_serperm nfs_ip -u user:oracle:r-x

Test that you can start the service as the oracle user:

crs_start nfs_ip

Now we need to setup NFS. For this to work, we need to setup the NFS daemon first. Edit /etc/exports and add a line such as this:

/shared *(rw,no_root_sqush,sync)

Make sure that nfs service is disabled during startup:

chkconfig nfs off
chkconfig nfslock off

Now is the time to setup Oracle Clusterware for the task:

crs_profile -create share_nfs -t application -B /etc/init.d/nfs -d “Shared NFS” -r nfs_ip -a sharenfs.scr -p favored -h “node1 node2″ -o ci=30,ft=3,fi=12,ra=5
crs_register share_nfs

Deal with permissions:

crs_setperms share_nfs -o root
crs_setperms share_nfs -u user:oracle:r-x

Fix the “sharenfs.scr” script. First, find it. It should reside in $CRS_HOME/crs/scripts if everything is OK. If not, you will be able to find it in $CRS_HOME using find.

Edit the “sharenfs.scr” script and modify the following variables which are defined relatively in the beginning of the script:

PROBE_PROCS=”nfsd”
START_APPCMD=”/etc/init.d/nfs start
START_APPCMD2=”/etc/init.d/nfslock start”
STOP_APPCMD=”/etc/init.d/nfs stop”
STOP_APPCMD2=”/etc/init.d/nfslock stop”

Copy the modified script file to the other node. Verify this script has execution permissions on both nodes.

Start the service as the oracle user:

crs_start sharenfs

Test the service. The following command should return the export path:

showmount -e 1.2.3.4

Relocate the service and test again:

crs_relocate -f sharenfs
showmount -e 1.2.3.4

Done. You now have HA NFS service above Oracle Clusterware framework.

I used this web page as a reference. I thank him for his great work!

RedHat Cluster custom Oracle “Agent”/script V1.0

Friday, April 24th, 2009

Working with RH Cluster quite a lot, I have decided to create an online store of customer agents/scripts.

I have not, so far, invested the effort of making these agents accept settings from the cluster.conf file, but this might happen.

Let the library be!

Oracle DB script/agent:

Although I discovered (a bit late) that RH Cluster for Oracle Ent. Linux 5.2 does include oracle DB agent, this script should be good enough for RHEL4 RH Cluster versions as well.

This script only checks that the ‘smon’ process is up. Nothing fancy. This script can include, in the future, the ability to check that Oracle responses to SQL queries (meaning – actually working).

?Download oracle.sh
#!/bin/bash
#Service script for Oracle DB under RH Cluster
#Written by Ez-Aton
#http://run.tournament.org.il
 
# Global variables
ORACLE_USER=oracle
HOMEDIR=/home/$ORACLE_USER
OVERRIDE_FILE=/var/tmp/oracle_override
REC_LIST="user@domain.com"
 
function override () {
	if [ -f $OVERRIDE_FILE ]
	then
		exit 0
	fi
}
 
function start () {
	su - $ORACLE_USER -c ". $HOMEDIR/.bash_profile ; sqlplus / as sysdba << EOF
startup
EOF
"
	status
}
 
function stop () {
	su - $ORACLE_USER -c ". $HOMEDIR/.bash_profile ; sqlplus / as sysdba << EOF
shutdown immediate
EOF
"
	status && return 1 || return 0
}
 
function status () {
	ps -afu $ORACLE_USER | grep -v grep | grep smon
	return $?
}
 
function notify () {
	mail -s "$1 oracle on `hostname`" $REC_LIST < /dev/null
}
 
override
case "$1" in
start)	start
	notify $1
	;;
stop)	stop
#	notify $1
	;;
status)	status
	;;
*)	echo "Usage: $0 start|stop|status"
	;;
esac

I usually place this script (with execution permissions, of course) in /usr/local/sbin and call it as a “script” from the cluster configuration. You will probably be required to alter the first few variable lines to match to your environment.

Listener Agent/script:

The tnslsnr should be started/stopped as well, if we want the $ORACLE_HOME to migrate as well. This is its agent/script:

?Download lsnr.sh
#!/bin/bash
#Service script for Oracle DB under RH Cluster
#Written by Ez-Aton
#http://run.tournament.org.il
 
ORACLE_USER=oracle
HOMEDIR=/home/$ORACLE_USER
OVERRIDE_FILE=/var/tmp/oracle_override
 
function override () {
if [ -f $OVERRIDE_FILE ]
then
exit 0
fi
}
 
function start () {
su - $ORACLE_USER -c ". $HOMEDIR/.bash_profile ; lsnrctl start"
status
}
 
function stop () {
su - $ORACLE_USER -c ". $HOMEDIR/.bash_profile ; lsnrctl stop"
status && return 1 || return 0
}
 
function status () {
su - $ORACLE_USER -c ". $HOMEDIR/.bash_profile ; lsnrctl status"
}
 
override
case "$1" in
start)    start
;;
stop)    stop
;;
status)    status
;;
*)    echo "Usage: $0 start|stop|status"
;;
esac

Again – place it in /usr/local/sbin and call it from the cluster configuration file as type “script”.

I will add more agents and more resources for RedHat Cluster in the future.

Tips and tricks for Redhat Cluster

Saturday, January 31st, 2009

Redhat Cluster is a nice HA product. I have been implementing it for a while now, lecturing about it, and yes – I like it. But like any other software product, it has few flaws and issues which you should take under consideration – especially when you create custom “agents” – plugins to control (start/stop/status) your 3rd party application.

I want to list several tips and good practices which will help you create your own agent or custom script, and will help you sleep better at night.

Nighty Night: Sleeping is easier when your cluster is quiet. It usually means that  you don’t want the cluster to suddenly failover during night time, or – for that matter, during any hour, unexpectedly.
Below are some tips to help you sleep better, or to perform an easier postmortem of any cluster failure.

Chop the Logs: Since RedHat Cluster logging might be hidden and filled with lots of irrelevant information, you want your agents to be nice about it. Let them log out somewhere the result of running “status” or “stop” or even “start”. Of course – either recycle the output logs, or rotate them away. You could use

exec &>/tmp/my_script_name.out

much like HACMP does (or at least – behaves as if it does). You can also use specific logging facility for different subsystems of the cluster (cman, rg, qdiskd)

Mind the Gap: Don’t trust unknown scripts or applications’ return codes. Your cluster will fail miserably if a script or a file you expect to run will not be there. Do not automatically assume that the vmware script, for example, will return normal values. Check the return codes and decide how to respond accordingly.

Speak the Language: A service in RedHat Cluster is a combination of one or more resources. This can be somewhat confusing as we tend to refer to resources as a (system) service. Use the correct lingo.  I will try to do just that in this document, so heed the difference between the terms “service” and “system service”, which can be a cluster resource.

Divide and Conquer: Split your services to the minimal set of resources possible. If your service consists of hundreds of resources failure to one of them could cause the entire service to restart, taking down all other working resources. If you keep it to the minimum, you actually protect yourself.

Trust No One: To stress out the “Mind the Gap” point above – don’t trust 3rd party scripts or applications to return a correct error code. Don’t trust their configuration files, and don’t trust the users to “do it right”. They will not. Try to create your own services as fault-protected as possible. Don’t crash because some stupid user (or a stupid administrator – for a cluster implementer both are the same, right?) used incorrect input parameters, or because he has kept  an important configuration file in a different name than was required.

I have some special things I want to do with regard to RedHat Cluster Suite. Stay tuned :-)

Redhat Cluster NFS client service – things to notice

Friday, January 16th, 2009

I encountered an interesting bug/feature of RHCS on RHEL4.

A snip of my configuration looks like this:

<resources>
    <fs device="/dev/mapper/mpath6p1" force_fsck="1" force_umount="1" fstype="ext3" name="share_prd" mountpint="/share_prd" options="" self_fence="0" fsid="02001"/>
    <nfsexport name="nfs export4"/>
    <nfsclient name="all ro" target="192.168.0.0/255.255.255.0" options="ro,no_root_sqush,sync"/>
    <nfsclient name="app1" target="app1" options="rw,no_root_squash,sync"/>
</resources>

<service autostart="1" domain="prd" name="prd" nfslock="1">
    <fs ref="share_prd">
       <nfsexport ref="nfs export 4">
          <nfsclient ref="all ro"/>
          <nfsclient ref="app1"/>
       </nfsexport>
    </fs>
</service>

This setup was working just fine, until a glitch in the DNS occurred.This glitch resulted in inability to resolve names (which were not present inside /etc/hosts at this time), and lead to a failover with the following error:

clurgmgrd: [7941]: <err> nfsclient:app1 is missing!

All range-based nfsclient agents seemed to function correctly. I could manage to look into it only a while later (after setting simple range-based allow-all access), and through some googling, I found out this explanation – it was a change of how the agent responds to “status” command.

I should have looked inside /var/lib/nfs/etab and see that app1 server appeared with its full name. I changed the resource settings to reflect it:

<nfsclient name="app1" target="app1.mydomain.org" options="rw,no_root_squash,sync"/>

and it seems to work just fine now.

Persistent raw devices for Oracle RAC with iSCSI

Saturday, December 6th, 2008

If you’re into Oracle RAC over iSCSI, you should be rather content – this configuration is a simple and supported. However, working with some iSCSI target devices, order and naming is not consistent between both Oracle nodes.

The simple solutions are by using OCFS2 labels, or by using ASM, however, if you decide to place your voting disks and cluster registry disks on raw devices, you are to face a problem.

iSCSI on RHEL5:

There are few guides, but the simple method is this:

  1. Configure mapping in your iSCSI target device
  2. Start the iscsid and iscsi services on your Linux
    • service iscsi start
    • service iscsid start
    • chkconfig iscsi on
    • chkconfig iscsid on
  3. Run “iscsiadm -m discovery -t st -p target-IP
  4. Run “iscsiadm -m node -L all
  5. Edit /etc/iscsi/send_targets and add to it the IP address of the target for automatic login on restart

You need to configure partitioning according to the requirements.

If you are to setup OCFS2 volumes for the voting and for the cluster registry, there should not be a problem as long as you use labels, however, if you require raw volumes, you need to change udev to create your raw devices for you.

On a system with persistent disk naming, follow this process, however, on a system with changing disk names (every reboot names are different), the process can become a bit more complex.

First, detect your scsi_id for each device. While names might change upon reboots, scsi_ids do not.

scsi_id -g -u -s /block/sdc

Replace sda with the device name you are looking for. Notice that /block/sda is a reference to /sys/block/sdc

Use the scsi_id generated by that to create the raw devices. Edit /etc/udev/rules.d/50-udev.rules and find line 298. Add a line below with the following contents:

KERNEL==”sd*[0-9]“, ENV{ID_SERIAL}==”14f70656e66696c000000000004000000010f00000e000000″, SYMLINK+=”disk/by-id/$env{ID_BUS}-$env{ID_SERIAL}-part%n” ACTION==”add” RUN+=”/bin/raw /dev/raw/raw%n %N”

Things to notice:

  1. The ENV{ID_SERIAL} is the same scsi_id obtained earlier
  2. This line will create a raw device in the name of raw and number in /dev/raw for each partition
  3. If you want to differtiate between two (or more) disks, change the name from raw to an aduqate name, like “crsa”, “crsb”, etc, for example:

KERNEL==”sd*[0-9]“, ENV{ID_SERIAL}==”14f70656e66696c000000000005000000010f00000e000000″, SYMLINK+=”disk/by-id/$env{ID_BUS}-$env{ID_SERIAL}-part%n” ACTION==”add” RUN+=”/bin/raw /dev/raw/crs%n %N”

Following these changes, run “udevtrigger” to reload the rules. Be advised that “udevtrigger” might reset network connection.

Protect Vmware guest under RedHat Cluster

Monday, November 17th, 2008

Most documentation on the net is about how to run a cluster-in-a-box under Vmware. Very few seem to care about protecting Vmware guests under real RedHat cluster with a shared storage.

This article is just about it. While I would not recommend using Vmware in such a setup, it has been the case, and that Vmware guest actually resides on the shared storage. To relocate it is out of the question, so migrating it together with other resources is the only valid option.

To do so, I have created a simple script which will accept start/stop/status arguments. The Vmware guest VMX is hard-coded into the script, but in an easy-to-change format. This script will attempt to freeze the Vmware guest, and only if it fails, to shut it down. Mind you that the blog’s HTML formatting might alter quotation marks into UTF-8 marks which will not be understood well by shell.

#!/bin/bash
# This script will start/stop/status VMware machine
# Written by Ez-Aton
# http://www.tournament.org.il/run
 
# Hardcoded. Change to match your own settings!
VMWARE="/export/vmware/hosts/Windows_XP_Professional/Windows XP Professional.vmx"
VMRUN="/usr/bin/vmrun"
TIMEOUT=60
 
function status () {
  # This function will return success if the VM is up
  $VMRUN list | grep "$VMWARE" &amp;&gt;/dev/null
  if [[ "$?" -eq "0" ]]
  then
    echo "VM is up"
    return 0
  else
    echo "VM is down"
    return 1
  fi
}
 
function start () {
  # This function will start the VM
  $VMRUN start "$VMWARE"
  if [[ "$?" -eq "0" ]]
  then
    echo "VM is starting"
    return 0
  else
    echo "VM failed"
    return 1
  fi
}
 
function stop () {
  # This function will stop the VM
  $VMRUN suspend "$VMWARE"
  for i in `seq 1 $TIMEOUT`
  do
    if status
    then
      echo
    else
      echo "VM Stopped"
      return 0
    fi
    sleep 1
  done
  $VMRUN stop "$VMWARE" soft
}
 
case "$1" in
start)     start
        ;;
stop)      stop
        ;;
status)   status
        ;;
esac
RET=$?
 
exit $RET

Since the formatting is killed by the blog, you can find the script here: vmware1

I intend on building a “real” RedHat Cluster agent script, but this should do for the time being.

Enjoy!

Raw devices for Oracle on RedHat (RHEL) 5

Tuesday, October 21st, 2008

There is a major confusion among DBAs regarding how to setup raw devices for Oracle RAC or Oracle Clusterware. This confusion is caused by the turn RedHat took in how to define raw devices.

Raw devices are actually a manifestation of character devices pointing to block devices. Character devices are non-buffered, so they act as FIFO, and have no OS cache, which is why Oracle likes them so much for Clusterware CRS and voting.

On other Unix types, commonly there are two invocations for each disk device – a block device (i.e /dev/dsk/c0d0t0s1) and a character device (i.e. /dev/rdsk/c0d0t0s1). This is not the case for Linux, and thus, a special “raw”, aka character, device is to be defined for each partition we want to participate in the cluster, either as CRS or voting disk.

On RHEL4, raw devices were setup easily using the simple and coherent file /etc/sysconfig/rawdevices, which included an internal example. On RHEL5 this is not the case, and customizing in a rather less documented method the udev subsystem is required.

Check out the source of this information, at this entry about raw devices. I will add it here, anyhow, with a slight explanation:

1. Add to /etc/udev/rules.d/60-raw.rules:

ACTION==”add”, KERNEL==”sdb1″, RUN+=”/bin/raw /dev/raw/raw1 %N”

2. To set permission (optional, but required for Oracle RAC!), create a new /etc/udev/rules.d/99-raw-perms.rules containing lines such as:

KERNEL==”raw[1-2]“, MODE=”0640″, GROUP=”oinstall”, OWNER=”oracle”

Notice this:

  1. The raw-perms.rules file name has to begin with the number 99, which defines its order during rules apply, so that it will be used after all other rules take place. Using lower numbers might cause permissions to be incorrect.
  2. The following permissions have to apply:
  • OCR Device(s): root:oinstall , mode 0640
  • Voting device(s): oracle:oinstall, mode 0666
  • You don’t have to use raw devices for ASM volumes on Linux, as the ASMLib library is very effective and easier to manage.

    Oracle RAC with EMC iSCSI Storage Panics

    Tuesday, October 14th, 2008

    I have had a system panicking when running the mentioned below configuration:

    • RedHat RHEL 4 Update 6 (4.6) 64bit (x86_64)
    • Dell PowerEdge servers
    • Oracle RAC 11g with Clusterware 11g
    • EMC iSCSI storage
    • EMC PowerPate
    • Vote and Registry LUNs are accessible as raw devices
    • Data files are accessible through ASM with libASM

    During reboots or shutdowns, the system used to panic almost before the actual power cycle. Unfortunately, I do not have a screen capture of the panic…

    Tracing the problem, it seems that iSCSI, PowerIscsi (EMC PowerPath for iSCSI) and networking services are being brought down before “killall” service stops the CRS.

    The service file init.crs was never to be executed with a “stop” flag by the start-stop of services, as it never left a lock file (for example, in /var/lock/subsys), and thus, its existence in /etc/rc.d/rc6.d and /etc/rc.d/rc0.d is merely a fake.

    I have solved it by changing /etc/init.d/init.crs script a bit:

    • On “Start” action, touch a file called /var/lock/subsys/init.crs
    • On “Stop” action, remove a file called /var/lock/subsys/init.crs

    Also, although I’m not sure about its necessity, I have changed init.crs script SYSV execution order in /etc/rc.d/rc0.d and /etc/rc.d/rc6.d from wherever it was (K96 in one case and K76 on another) to K01, so it would be executed with the “stop” parameter early during shutdown or reboot cycle.

    It solved the problem, although future upgrades to Oracle ClusterWare will require being aware of this change.

    Oracle ASM and EMC PowerPath

    Wednesday, May 28th, 2008

    Setting up an Oracle ASM disks is rather simple, and the procedure can be easily obtained from here, for example. This is nice and pretty, and works well for most environments.

    EMC PowerPath creates meta devices which utilize the underlying paths, as mod_scsi sees them in Linux, without hiding them (unlike IBM’s RDAC, for example). This results in the ability to view and access each LUN either through the PowerPath meta device (/dev/emcpower*) or through the underlying SCSI disk device (/dev/sd*). You can obtain the existing paths of a single meta devices through running the command

    powermt display dev=emcpowera

    where ‘emcpowera’ is an example. It can be any of your power meta devices. You will see the underlying SCSI devices.

    During startup, Oracle ASM (startup script: /etc/init.d/oracleasm) scans all block devices for ASM headers. On a system with many LUNs, this can take a while (half an hour, and sometimes much more). Not only that, but since ASM scans the available block devices in a semi-random order, the chances are very high that the /dev/sd* will be used instead of the /dev/emcpower* block device. This results in degraded performance, where active-active configuration has been set for PowerPath (because it will not be used), and moreover – a failure of that specific link will result in failure to access the specific LUN through that path, with disregard to any other existing paths to the LUN.

    To "set things right", you need to edit /etc/sysconfig/oracleasm, and exclude all ‘sd’ devices from ASM scan.

    To verify that you’re actually using the right block device:

    /etc/init.d/oracleasm listdisks

    Select any one of the DG disks, and then

    /etc/init.d/oracleasm querydisk DATA1
    Disk “DATA1″ is a valid ASM disk on device [120, 6]

    The numbers are the major and minor of the block device. You can easily find the device through this command:

    ls -la /dev/ | grep MAJOR | grep MINOR

    In our example, the MAJOR will be 120, and the MINOR will be 6. The result would look like a single block device.

    If you’re using EMC PowerPath, your block device major would be 120 and around that number. If you’re (mistakenly) using one of the underlying paths, your major would be 8 and nearby numbers. If you’re using Linux LVM, your major would be around the number 253. The expected result, when using EMC PowerPath is always with major of 120 – always using the /dev/emcpower* devices.

    This also decreases the boot time rather dramatically.

    RedHat 4 working cluster (on VMware) config

    Sunday, November 11th, 2007

    I have been struggling with RH Cluster 4 with VMware fencing device. This was also a good experiance with qdiskd, the Disk Quorum directive and utilization. I have several conclusions out of this experience. First, the configuration, as is:

    <?xml version=”1.0″?>
    <cluster alias=”alpha_cluster” config_version=”17″ name=”alpha_cluster”>
    <quorumd interval=”1″ label=”Qdisk1″ min_score=”3″ tko=”10″ votes=”3″>
    <heuristic interval=”2″ program=”ping vm-server -c1 -t1″ score=”10″/>
    </quorumd>
    <fence_daemon post_fail_delay=”0″ post_join_delay=”3″/>
    <clusternodes>
    <clusternode name=”clusnode1″ nodeid=”1″ votes=”1″>
    <multicast addr=”224.0.0.10″ interface=”eth0″/>
    <fence>
    <method name=”1″>
    <device name=”vmware”
    port=”/vmware/CLUSTER/Node1/Node1.vmx”/>
    </method>
    </fence>
    </clusternode>
    <clusternode name=”clusnode2″ nodeid=”2″ votes=”1″>
    <multicast addr=”224.0.0.10″ interface=”eth0″/>
    <fence>
    <method name=”1″>
    <device name=”vmware”
    port=”/vmware/CLUSTER/Node2/Node2.vmx”/>
    </method>
    </fence>
    </clusternode>
    </clusternodes>
    <cman>
    <multicast addr=”224.0.0.10″/>
    </cman>
    <fencedevices>
    <fencedevice agent=”fence_vmware” ipaddr=”vm-server” login=”cluster”
    name=”vmware” passwd=”clusterpwd”/>
    </fencedevices>
    <rm>
    <failoverdomains>
    <failoverdomain name=”cluster_domain” ordered=”1″ restricted=”1″>
    <failoverdomainnode name=”clusnode1″ priority=”1″/>
    <failoverdomainnode name=”clusnode2″ priority=”1″/>
    </failoverdomain>
    </failoverdomains>
    <resources>
    <fs device=”/dev/sdb2″ force_fsck=”1″ force_unmount=”1″ fsid=”62307″
    fstype=”ext3″ mountpoint=”/mnt/sdb1″ name=”data”
    options=”" self_fence=”1″/>
    <ip address=”10.100.1.8″ monitor_link=”1″/>
    <script file=”/usr/local/script.sh” name=”My_Script”/>
    </resources>
    <service autostart=”1″ domain=”cluster_domain” name=”Test_srv”>
    <fs ref=”data”>
    <ip ref=”10.100.1.8″>
    <script ref=”My_Script”/>
    </ip>
    </fs>
    </service>
    </rm>
    </cluster>

    Several notes:

    1. You should run mkqdisk -c /dev/sdb1 -l Qdisk1 (or whatever device is for your quorum disk)
    2. qdiskd should be added to the chkconfig db (chkconfig –add qdiskd)
    3. qdiskd order should be changed from 22 to 20, so it precedes cman
    4. Changes to fence_vmware according to the past directives, including Yoni’s comment for RH4
    5. Changes in structure. Instead of using two fence devices, I use only one fence device but with different “ports”. A port is translated to “-n” in fence_vmware, just as it is being translated to “-n” in fence_brocade – fenced translates it
    6. lock_gulmd should be turned off using chkconfig

    A little about command-line version change:

    When you update the cluster.conf file, it is not enough to update the ccsd using “ccs_tool update /etc/cluster/cluster.conf“, but you also need to understand that cman is still on the older version. Using “cman_tool version -r <new version>“, you can force it to allow other nodes to join after a reboot, when they’re using the latest config version. If you fail to do it, other nodes might be rejected.

    I will add additional information as I move along.