NFS problems in failover – MC/ServiceGuard. Applicable to other Linux HA clusters
Problem: Two Linux servers (RHEL4) run an NFS server in High-Availability (failover) mode. When the resources fail over to the second node, an NFS client can continue to work. When they fail back to the first node, the NFS client times out for 5+ minutes.
Further problem information: On RHEL3, the exact same configuration worked flawlessly.
Solution: mount the NFS shares on the client over UDP instead of TCP (the ‘udp’ mount option).
Explanation: RHEL3 used NFSv3 over UDP by default. RHEL4 uses NFSv4 over TCP by default, which is a significant difference between the two.
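To see which transport a client is really using, you can look at the mount options the kernel reports in /proc/mounts. Here is a minimal sketch of that check, assuming Python 3 is available on the client. The function name, host name and paths are made up for illustration, and I am assuming the transport shows up either as a bare ‘tcp’/‘udp’ option or as ‘proto=…’, depending on the kernel version.

```python
def nfs_transports(mounts_text):
    """Map each NFS mount point to its transport, given /proc/mounts-style text."""
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, fs type, options, dump, pass
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue
        transport = "unknown"
        for opt in fields[3].split(","):
            if opt in ("tcp", "udp"):        # bare option style (assumed for older kernels)
                transport = opt
            elif opt.startswith("proto="):   # key=value style
                transport = opt.split("=", 1)[1]
        result[fields[1]] = transport
    return result


# Hypothetical sample; on a real client, pass the contents of /proc/mounts instead.
SAMPLE_MOUNTS = """\
/dev/sda1 / ext3 rw 0 0
nfs-vip:/export/data /mnt/data nfs rw,vers=3,proto=tcp,hard 0 0
nfs-vip:/export/home /mnt/home nfs rw,v3,udp,hard 0 0
"""

for mount_point, transport in sorted(nfs_transports(SAMPLE_MOUNTS).items()):
    print(mount_point, transport)
```

Any mount reported as tcp here is a candidate for the failback problem described above.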
After searching the web for a while to better understand the cause of the problem, I found an article on linux-ha (which looks like a very good place to visit if you’re into HA in Linux environments) that recommends using UDP instead of TCP. Quote:
"If your kernel defaults to using TCP for NFS (as is the case in 2.6
kernels), switch to UDP instead by using the ‘udp’ mount option. If you
don’t do this, you won’t be able to quickly switch from server "A" to
"B" and back to "A" because "A" will hold the TCP connection in
TIME_WAIT state for 15-20 minutes and refuse to reconnect." (quoted from the "Hints" section).
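If you want to check for the lingering connection yourself, Linux lists TCP sockets and their states in /proc/net/tcp, with addresses and ports in hexadecimal and state code 06 meaning TIME_WAIT. The sketch below counts TIME_WAIT entries that involve the standard NFS port (2049). It assumes Python 3. The function name and the sample lines are hypothetical, and I have not verified on every kernel that the NFS server's connections show up this way.

```python
TIME_WAIT = 0x06   # state code used for TIME_WAIT in /proc/net/tcp
NFS_PORT = 2049    # standard NFS port


def nfs_time_wait_count(proc_net_tcp_text, port=NFS_PORT):
    """Count TIME_WAIT sockets whose local or remote port equals `port`."""
    count = 0
    for line in proc_net_tcp_text.splitlines():
        fields = line.split()
        # Data rows start with a slot number such as "0:"; this skips the header.
        if len(fields) < 4 or not fields[0].endswith(":"):
            continue
        local_port = int(fields[1].rsplit(":", 1)[1], 16)
        remote_port = int(fields[2].rsplit(":", 1)[1], 16)
        state = int(fields[3], 16)
        if state == TIME_WAIT and port in (local_port, remote_port):
            count += 1
    return count


# Hypothetical sample; on a real node, pass the contents of /proc/net/tcp instead.
SAMPLE_TCP = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 0A00000A:0801 0A000014:03FF 06 00000000:00000000 03:00000A28 00000000     0        0 0
   1: 0A00000A:0016 0A000014:C350 01 00000000:00000000 00:00000000 00000000     0        0 12345
"""

print(nfs_time_wait_count(SAMPLE_TCP))
```

Running this on server "A" right after failing over to "B" should show whether a TIME_WAIT entry for the NFS port is still around when you try to fail back.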
So, although I did not expect this cause (I had a hunch about Portmapper), the suggested solution worked fine, and only later did we come to understand the cause. Good.