Archive for August, 2006

Update – Netboot on RHEL x86 (32bit) with Broadcom (tg3) – no network

Sunday, August 27th, 2006

In my post just below, I defined a set of tests to verify the possible cause of the tg3 problem. The problem had nothing to do with auto-negotiation, and it was fixed in RHEL 4 Update 4. That 32bit installer works correctly.

One last thing to test – rebuild the installer initrd, and replace the tg3 module with one built from source for this kernel (for example, HP’s tg3 drivers from the Proliant Support Pack). I wonder if it will work.

Netboot on RHEL x86 (32bit) with Broadcom (tg3) – no network

Thursday, August 24th, 2006

I have a PXE installation server set up, and it usually works quite well. I tried to install a Tyan based system using this setup, but this time with RHEL4 U3 X86 and not the usual X86_64 system.

The RH installer starts by asking a few questions (language, keyboard, method of installation) and then fails to obtain an IP address from DHCP. Even setting a manual IP results in no communication.

I got an idea from a friend and will try it today – since the 1Gb/s Broadcom is connected to a 100Mb/s switch, I should try to disable the auto-negotiation and set a predefined speed for the card. We’ll see how/if it works, and if it allows the 32bit installer to work. The 64bit installer works fine, by the way.
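A minimal sketch of what forcing the link speed could look like on an already running Linux system, using ethtool's "-s" option. The helper name force_link, the DRY_RUN variable and the interface name eth0 are assumptions made for illustration; how to apply the same setting inside the installer environment is not covered here.

```shell
# Hypothetical helper: build the ethtool command that disables
# auto-negotiation and pins speed and duplex on one interface.
# With DRY_RUN=1 (the default here) the command is only printed.
force_link() {
    dev=$1       # interface name, e.g. eth0 (assumption)
    speed=$2     # speed in Mb/s, e.g. 100
    duplex=$3    # "full" or "half"
    cmd="ethtool -s $dev speed $speed duplex $duplex autoneg off"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$cmd"
    else
        $cmd     # needs root and a real interface
    fi
}

# Print the command for a 100Mb/s full-duplex link on eth0.
force_link eth0 100 full
```

Setting DRY_RUN=0 runs the command instead of printing it.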

Tyan Thunder K8QE and Linux RHEL 4 Update 3

Sunday, August 13th, 2006

This board is a tricky board. 4GB RAM and above behave in a weird manner in Linux. It appears that PCI 32bit mapping doesn’t work correctly under Linux.

To allow Linux to work on this hardware without failure (such as a kernel crash during startup), you must follow these three simple guidelines:

1. Spread the memory equally near all CPUs. For example, if you have 4GB RAM for the four CPU version (8 cores, in my case), spread the memory 1GB near each CPU.

2. Make sure you set the type of OS to Linux in the BIOS. PCI mapping won’t work otherwise.

3. Do not put PCI 32bit cards in the PCI-X slots. It will render the onboard network cards unusable.

HP-UX – allowed shells, and connecting FC Multipath to NetApp

Thursday, August 10th, 2006

When adding a shell to an HP-UX system, for example /usr/bin/tcsh, users set to use this shell will not be able to FTP to the machine until there is an entry for it in /etc/shells. The trick is that if the file doesn’t exist, you have to create it. By default, HP-UX allows only the /sbin/sh and /bin/sh shells, but as soon as you set up this file, you can allow more shells. Mind that you have to include /sbin/sh and /bin/sh in /etc/shells, or other things might not work correctly. Taken from here.
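The /etc/shells handling can be sketched as a small idempotent helper. This is only an illustration: the function name ensure_shell_listed and the SHELLS_FILE variable are made up here, and the default points at a scratch file so the sketch can be tried without touching the real /etc/shells.

```shell
# Scratch file by default (assumption); set SHELLS_FILE=/etc/shells to
# apply the same logic to the real file as root.
SHELLS_FILE=${SHELLS_FILE:-/tmp/shells.sketch}

# Append a shell path to the shells file unless it is already listed.
ensure_shell_listed() {
    shell_path=$1
    touch "$SHELLS_FILE"    # create the file if it does not exist yet
    if ! grep -F -x -q -- "$shell_path" "$SHELLS_FILE"; then
        echo "$shell_path" >> "$SHELLS_FILE"
    fi
}

# Keep the two default shells listed, then add the new one.
for s in /sbin/sh /bin/sh /usr/bin/tcsh; do
    ensure_shell_listed "$s"
done
```

Running it twice leaves the file unchanged, since each path is matched as a whole line before anything is appended.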

Connecting HP-UX to SAN storage is never too simple. The actual list of actions is:

1. Install HP-UX drivers for the FC adapter

2. Map the machine’s PWWN (obtained by reading the sticker at the back of the machine, or by querying the storage/SAN switch) to the relevant LUNs.

3. Run “/usr/sbin/ioscan -fnC disk” and see that the new disk devices are detected.

4. Run “/usr/sbin/ioinit -i” to create the relevant device files.

A note – HP-UX might require a reboot after the initial connection. On several occasions I’ve noticed that if the server had been running for a while with the fiber disconnected, only connecting it before startup would result in a link and in SAN registration. Of course, the driver must be installed by then.

If you are to connect your HP-UX to a NetApp device, as we did, plan a day (or more) ahead and open a “now” account at http://now.netapp.com. There you can find documentation about HP-UX (including step-by-step guides), the “SAN Attach Kit for HP-UX”, which will make your life easier, and a set of best-practice guides. Just follow these guides, and you will find it an easy and simple task.

Troubleshooting weird networking problem

Wednesday, August 9th, 2006

Problem as follows: A Linux server is connected to a 1Gb/s LAN using a 1Gb/s interface.

I was told that SSH to the machine fails with “socket error” when done from Windows/Putty. One of the tests was done using Linux/ssh client, and it went fine. A switch was replaced, and other methods of detection showed weird results.

When I arrived on site I started with the usual procedure: ifconfig (to see that there are no TX or RX errors), dmesg, checking /var/log/messages, and ethtool. All produced the results expected when everything is working fine. I even switched network interfaces (using the 2nd Ethernet port on the server), but to no avail.

The actual results looked a bit different – clients were unable to connect to the server using SSH for the first time (in general), but were able to connect the next time. You can’t run your Oracle server on such a setup…

I have escalated my tests into tcpdump, which showed only part of the information expected, but gave too much junk to be readable enough to fetch anything out of it.

Using remote desktop from another server to a client’s desktop, we encountered that same problem – first-time failure, and then success, and then it hit me! On another desktop (it was the third or fourth) I looked at the output of “arp -a” (Windows Desktop) right after the first failure, and saw that the MAC address assigned to the server’s IP was the wrong one. Some other machine on the network had this same IP address. Changing the Linux server’s IP address to a free one solved everything, as it seems, and resulted in a fine working server, and some free time devoted to hunting down the renegade spoofing machine.
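The check that finally exposed the conflict can be sketched as a small filter. It assumes you have already collected “IP MAC” pairs, one per line, from the ARP tables of a few machines; that input format and the function name find_ip_conflicts are assumptions for illustration, not the output of any particular tool.

```shell
# Read "IP MAC" pairs on stdin and print every IP that was seen with
# more than one distinct MAC address.
find_ip_conflicts() {
    awk '
        {
            ip = $1
            mac = tolower($2)
            gsub(/-/, ":", mac)    # Windows prints aa-bb-..., Linux aa:bb:...
            if (!((ip, mac) in seen)) {
                seen[ip, mac] = 1
                count[ip]++
                macs[ip] = macs[ip] " " mac
            }
        }
        END {
            for (ip in count)
                if (count[ip] > 1)
                    print ip ":" macs[ip]
        }
    '
}

# Example with made-up addresses: 192.168.0.10 is claimed by two MACs.
printf '%s\n' \
    '192.168.0.10 00-11-22-33-44-55' \
    '192.168.0.10 00:aa:bb:cc:dd:ee' \
    '192.168.0.20 00:11:22:33:44:66' | find_ip_conflicts
```

An IP that appears with the same MAC in both Windows and Linux notation is counted once, so only genuine conflicts are reported.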

HP-UX and Software Raid1

Tuesday, August 8th, 2006

Today I installed HP-UX 11i V2 on a PA-RISC server, using the “Technical Environment” DVDs, and it went quite fine. I was unable, however, to find the Raid1 (Mirror) tools for the LVM.

Symptoms: “lvextend” has no “-m” parameter. According to the documentation (or even better, HP Forum1 and HP Forum2), mirroring is plain simple using lvextend. Only now did I figure out that the option is part of the LVM package for Enterprise Servers.

I finally found it on the disc called “Mission Critical Operating Environment DVD 1”, inside a bundle called “HPUX11i-OE-Ent”. I selected “LVM” from the list there, installed it, let the system recompile the kernel, and rebooted. After that, lvextend started accepting the “-m” flag.

Per the posts described above, I ran:

for LVOL in /dev/vg00/l* ; do
    lvextend -m 1 "$LVOL"
done

Took a while, but at least it worked.

NFS problems in failover – MC Service Guard. Applicable to other Linux HA clusters

Monday, August 7th, 2006

Problem: Two Linux servers (RHEL4) run an NFS server in High-Availability (failover) mode. When failing over the resources, an NFS client can continue to work. When failing back, the NFS client times out for 5+ minutes.

Further problem information: While using RHEL3, that same (exact) configuration worked flawlessly.

Solution: set NFS options to UDP instead of TCP.

Explanation: RHEL3 used NFS3 with UDP by default. RHEL4 uses NFS4 with TCP by default, which is a significant difference between the two.

After searching the web for a while to better understand the cause of the problem, I discovered an article on linux-ha (which looks like a very good place to visit if you’re into HA in Linux environments) that recommended using UDP instead of TCP. Quote:

"If your kernel defaults to using TCP for NFS (as is the case in 2.6
kernels), switch to UDP instead by using the ‘udp’ mount option. If you
don’t do this, you won’t be able to quickly switch from server "A" to
"B" and back to "A" because "A" will hold the TCP connection in
TIME_WAIT state for 15-20 minutes and refuse to reconnect.
" (quoted from the "Hints" section).

So, although I did not expect this cause (I had a hunch about Portmapper), the solution suggested worked fine (and only later we got to understand the cause). Good.
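To see which clients still need the change, here is a rough sketch that lists NFS mount points using TCP, reading a file in /proc/mounts format. The function name nfs_mounts_over_tcp is made up for illustration, and the option spellings it looks for (“tcp” and “proto=tcp”) are an assumption about how the kernel reports the transport; check what your own /proc/mounts shows.

```shell
# Print the mount point of every NFS mount whose option list names TCP.
# Takes a file in /proc/mounts format:
#   device mountpoint fstype options dump pass
nfs_mounts_over_tcp() {
    mounts_file=${1:-/proc/mounts}
    awk '
        $3 == "nfs" || $3 == "nfs4" {
            n = split($4, opt, ",")
            for (i = 1; i <= n; i++)
                if (opt[i] == "tcp" || opt[i] == "proto=tcp") {
                    print $2
                    break
                }
        }
    ' "$mounts_file"
}

# A client would then be remounted with the udp option from the quote
# above, for example (server name and paths are hypothetical):
#   mount -t nfs -o udp nfsserver:/export /mnt/export
```

Called without an argument it reads /proc/mounts on the local machine; passing a saved copy from another client works the same way.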