Troubleshooting weird networking problem

Problem as follows: A Linux server is connected to a 1Gb/s LAN using 1Gb/s interface.

I was told that SSH to the machine fails with “socket error” when done from Windows/Putty. One of the tests was done using Linux/ssh client, and it went fine. A switch was replaced, and other methods of detection showed weird results.

When I came to the place I have started with the usual procedure – ifconfig, and to see there are no TX or RX errors, dmesg, checking /var/log/messages, ethtool. All produced the results expected when everything is working fine. I even switched network interfaces (using the 2nd Ethernet port on the server), but for no avail.

The actual results looked a bit different – clients were unable to connect to the server using SSH for the first time (in general), but were able to connect the next time. You can’t run your Oracle server on such a setup…

I have escalated my tests into tcpdump, which showed only part of the information expected, but gave too much junk to be readable enough to fetch anything out of it.

Using remote desktop from another server to client’s desktop we’ve encountered that same problem – first time failure, and then success, and then it hit me! On another (it was third or fourth desktop) I have looked in the output of “arp -a” (Windows Desktop) right after the first failure, and saw that the MAC address assigned to the server’s IP is a wrong one. Some other machine on the network had this same IP address. Replacing the Linux Server’s IP address to a free one solved everything, as it seems, and resulted in a fine working server, and some free time devouted to hunting down the renegade spoofing machine.

Tags: , , , , ,

Leave a Reply