Archive for January, 2009

Tips and tricks for Redhat Cluster

Saturday, January 31st, 2009

Redhat Cluster is a nice HA product. I have been implementing it for a while now, lecturing about it, and yes – I like it. But like any other software product, it has a few flaws and issues which you should take into consideration – especially when you create custom “agents” – plugins that control (start/stop/status) your 3rd party application.

I want to list several tips and good practices that will help you create your own agent or custom script, and will help you sleep better at night.

Nighty Night: Sleeping is easier when your cluster is quiet. That usually means you don’t want the cluster to suddenly fail over during the night – or, for that matter, unexpectedly at any hour.
Below are some tips to help you sleep better, or at least to perform an easier postmortem of any cluster failure.

Chop the Logs: Since RedHat Cluster logging might be hidden and filled with lots of irrelevant information, you want your agents to be nice about it. Have them log the result of running “status”, “stop” or even “start” somewhere of their own, and, of course, either recycle the output logs or rotate them away. You could use

exec &>/tmp/my_script_name.out

much like HACMP does (or at least behaves as if it does). You can also use a specific logging facility for the different subsystems of the cluster (cman, rg, qdiskd).
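For example, a minimal sketch of self-contained agent logging with naive rotation – the log path and the size limit here are arbitrary choices for illustration, not RHCS requirements:

LOG=/tmp/my_script_name.out
# rotate the previous log away once it grows past ~1MB
if [ -f "$LOG" ] && [ "$(stat -c %s "$LOG")" -gt 1048576 ]; then
    mv -f "$LOG" "$LOG.1"
fi
exec >>"$LOG" 2>&1
echo "$(date '+%F %T') $0 called with: $*"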

Mind the Gap: Don’t trust unknown scripts or applications’ return codes. Your cluster will fail miserably if a script or a file you expect to run is not there. Do not automatically assume that the vmware script, for example, will return normal values. Check the return codes and decide how to respond accordingly.
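A hedged sketch of what this means in practice – the helper path and the exit-code mapping below are illustrative, not taken from any real agent:

#!/bin/bash
# "status" handler sketch – verify the helper exists before trusting it
HELPER=/usr/local/bin/vmware-helper    # illustrative path
if [ ! -x "$HELPER" ]; then
    echo "helper script missing: $HELPER" >&2
    exit 1                             # report a clean failure instead of crashing
fi
"$HELPER" status
case $? in
    0) exit 0 ;;    # helper reports the application as running
    *) exit 1 ;;    # anything else – treat as a hard failure
esac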

Speak the Language: A service in RedHat Cluster is a combination of one or more resources. This can be somewhat confusing, as we tend to refer to resources as a (system) service. Use the correct lingo, and I will try to do just that in this document: a “system service” (an init script, for instance) can be a single cluster resource, while a cluster “service” is the whole group of resources that starts, stops and fails over together.

Divide and Conquer: Split your services into the minimal sets of resources possible. If your service consists of hundreds of resources, a failure in any one of them could cause the entire service to restart, taking down all the other, working, resources. If you keep each service to the minimum, you actually protect yourself.

Trust No One: To stress the “Mind the Gap” point above – don’t trust 3rd party scripts or applications to return a correct error code. Don’t trust their configuration files, and don’t trust the users to “do it right”. They will not. Try to make your own services as fault-tolerant as possible. Don’t crash because some stupid user (or a stupid administrator – for a cluster implementer both are the same, right?) used incorrect input parameters, or because he kept an important configuration file under a different name than the one required.
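As an illustration (the parameter and exit codes here are made up for the example), validating input up front beats crashing halfway through:

# "start" handler sketch – refuse bad input with a clear message
CONF=${1:?"usage: $0 /path/to/app.conf"}    # illustrative parameter
if [ ! -r "$CONF" ]; then
    echo "expected configuration file '$CONF' is missing" >&2
    exit 2    # fail early and loudly, before touching the service
fi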

I have some special things I want to do with regard to RedHat Cluster Suite. Stay tuned 🙂

Adding modules to Damn Small Linux (DSL)

Friday, January 30th, 2009

This is not a simple task. Adding or compiling modules is a tricky feat when it comes to 2.4 kernels, as it requires that you compile the entire kernel yourself first.

I have used a diskless machine, with an NFS mount as the place where I kept all persistent data, mounted at the /tmp/mnt directory.

Preparations

  • Boot into DSL
  • Mount the remote NFS share, exported with the no_root_squash server option (see the mount example after this list)
  • If this is the first run, use mydslPanel.lua to download the packages described below and save them to the NFS share, or
  • Use mydsl-load to install the packages
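The mount itself could look something like this – the server name and export path are illustrative:

# add "-o nolock" if no NFS lock daemon is running locally
mkdir -p /tmp/mnt
mount -t nfs myserver:/export/dsl /tmp/mnt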

Compiling Kernel

Installing required packages

  • mydsl-load gcc-2.95.unc (could be .dsl, too)
  • mydsl-load gcc1-with-libs.dsl
  • mydsl-load gnu-utils.dsl
  • mydsl-load kernelheaders
  • Download kernel 2.4.26

Extracting the kernel

  • mkdir /usr/src (if it doesn’t exist already)
  • cd /usr/src
  • tar xzf /full/path/to/kernel/linux-2.4.26.tar.gz
  • patch -p1 -d linux-2.4.26 < knoppix-kernel.patch
  • chown -R dsl linux-2.4.26

Configure the kernel – as the user “dsl”

  • cd /usr/src/linux-2.4.26
  • make mrproper
  • Get the config file from a DSL mirror and save it as /usr/src/config
  • make menuconfig
  • Go to the second-to-last option in the menu – “Load an Alternate Configuration File”
  • Enter the path and name of your config file. In our example – change the value to /usr/src/config
  • Save and exit the kernel configuration
  • Run make dep clean bzImage

Compiling the module

I used make in the rtl8168 driver source directory. This should cause no problems if you have followed the previous notes.

Loading the modules

You can run “insmod src/r8168.o” to load the module; however, some modules require other modules to be loaded first. For the r8168 module, you must insmod the crc32 module beforehand.
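In full, from the driver source directory:

insmod crc32          # dependency of r8168
insmod src/r8168.o    # the freshly compiled driver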

Keep the new module for later use

You don’t need to follow this entire process to load the module on another DSL system running the same kernel – just keep the compiled r8168.o file somewhere persistent (the NFS share, for example) and load it from there.
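In practice, something like this is enough – the paths assume the NFS share from the preparations above:

# copy the freshly built module to the NFS share
cp src/r8168.o /tmp/mnt/
# then, on the target DSL system (same 2.4.26 kernel):
insmod crc32
insmod /tmp/mnt/r8168.o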

I will follow this up with an explanation on how to close the ISO correctly under Ubuntu, and make it work later on. The source for this specific post can be found here.

Redhat Cluster NFS client service – things to notice

Friday, January 16th, 2009

I encountered an interesting bug/feature of RHCS on RHEL4.

A snip of my configuration looks like this:

<resources>
    <fs device="/dev/mapper/mpath6p1" force_fsck="1" force_umount="1" fstype="ext3" name="share_prd" mountpoint="/share_prd" options="" self_fence="0" fsid="02001"/>
    <nfsexport name="nfs export4"/>
    <nfsclient name="all ro" target="192.168.0.0/255.255.255.0" options="ro,no_root_sqush,sync"/>
    <nfsclient name="app1" target="app1" options="rw,no_root_squash,sync"/>
</resources>

<service autostart="1" domain="prd" name="prd" nfslock="1">
    <fs ref="share_prd">
       <nfsexport ref="nfs export4">
          <nfsclient ref="all ro"/>
          <nfsclient ref="app1"/>
       </nfsexport>
    </fs>
</service>

This setup was working just fine, until a glitch in the DNS occurred. This glitch resulted in an inability to resolve names (which were not present inside /etc/hosts at that time), and led to a failover with the following error:

clurgmgrd: [7941]: <err> nfsclient:app1 is missing!

All range-based nfsclient agents seemed to function correctly. I only managed to look into it a while later (after setting a simple range-based allow-all access as a workaround), and through some googling I found this explanation – the cause was a change in how the agent responds to the “status” command.

I should have looked inside /var/lib/nfs/etab and seen that the app1 server appeared there with its full name. I changed the resource settings to reflect it:

<nfsclient name="app1" target="app1.mydomain.org" options="rw,no_root_squash,sync"/>

and it seems to work just fine now.
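If you hit the same error, checking what name actually appears in the kernel export table is a quick diagnostic:

grep app1 /var/lib/nfs/etab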

Blog Migration

Monday, January 12th, 2009

I have been quiet during the last few days as I was playing with WordPress Mu as a solution for hosting several WordPress sites under one management interface. This is an amazing product, and following my implementation, about 18 blogs are already there.
My site required some special handling. As you know, my URL was http://www.tournament.org.il/run up until now; however, if you look up, you will see the address http://run.tournament.org.il in the address bar. I hope this change will do no evil to my page ranks – the previous change did horrible things to them…
So – using Apache redirection methods, as described in this link, was the easiest solution for maintaining the old URLs with only a minor shift.
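For reference, a minimal sketch of the kind of redirect involved – the exact rules in the linked article may well differ:

# inside the VirtualHost serving www.tournament.org.il
RedirectPermanent /run http://run.tournament.org.il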

Comments, please

Friday, January 9th, 2009

As of today, I have switched back to using Akismet instead of the “Yawasp” plugin, which was theoretically amazing, but as it turned out blocked too many legitimate comments while still letting some spam through (only a little, but still). So feel free to comment whenever and however you want. I was disconnected for too long…