Multihomed routing (split access load balancing) and OpenVPN
We have one connection via ATM like interface and we have one PPP connection via xDSL (described here), and we want load balancing for this whole party.
Following this specific part of lartc.org guide, we’ve managed to get this to work. The idea goes like this (Centos 4.3):
1. Do not state default route for the machine. Not in /etc/sysconfig/network and not in /etc/sysconfig/network-scripts/ifcfg-ethX
2. Using adsl-setup, we’ve defined our ADSL connection. Verify you have an entry DEFROUTE=no in your /etc/sysconfig/network-scripts/ifcfg-ppp0
3. find a way to start the following script after your network interfaces are up. I assume, in this script, that your ATM interface is eth1. multiroute.txt
The reason for specifically stating SERVER is that our DNS server requires recursive DNS for its settings, and I can use my ISP’s DNS Server only when using the corresponding link. Since both links are for different ISPs, I need to “bind” SERVER to a specific route.
Note that this solution is only temporary. At the moment, it is far from being complete, and many tests should be done yet, before I can call it a working solution. I might combine it with /etc/ppp/ip-up.local script, or I might add it as a seperated service in /etc/init.d, which would start after all interfaces are up and running. Not final yet.
With all this working like charm, we’ve had a huge issue – our OpenVPN server, which worked correctly just until then failed to work smoothly. Sometimes clients were able to connect, and sometimes they were unable to do so…
I got the following error message in my logs: “x.y.z.m:2839 TLS Error: TLS key negotiation failed to occur within 60 seconds (check your network connectivity)”
The cause, as it seemed to me, was that OpenVPN’s UDP packets were routed via alternate route for each target client. Being UDP, they were not part of an active session, but were stateless, which resulted in a different routing descision each time they were directed at the OpenVPN client. I’ve searched for it, although I was not optimistic, because multihomed routing, with multiple ways out wasn’t very common. I was suprised to find this post, with it’s follow-up, which dealt exactly with my case.
Since I cannot bind it to an internal IP address (although I’ve tried – it didn’t work), I will test TCP based configuration tomorrow morning.
===============================================================================
Update
===============================================================================
I don’t usually update posts but add new posts with links. However, in this case it was important enough for me to update this hot topic so I’ve decided to just add the new stuff.
First – I’ve failed. Since I do not have too much time here, I did not feel confident to leave a system yet untested. Especially when such a router is an essential link in this company.
I’ve tried using TCP based connection, but, still again, one client was able to connect, while the 2nd one did so for only a short while, and failed maintaining a working connection. I went back to UDP…
I came up with the following idea – if I can use some sort of tagging to differentiate the UDP packets sourced at the router, at the OpenVPN application, I could try and set a routing rule which will force them into a specific routing chain, and force them through my interface.
It didn’t work quite well. I was able to do the followin trick, but for no avail:
iptables -t mangle -A OUTPUT -p udp –sport 5001 -j MARK –set-mark 1
and then, using “ip” command:
ip rule add fwmark 1 table T1
which should have redirected all outbound UDP with source port 5001 (this is the one I use for my OpenVPN, due to legacy considerations), to the T1 routing table – a table directed outside with default route via eth1.
I don’t know why it failed. Almost seemed to work, but no…
I returned the system to a single-path setup, with PPP0 only acting as a manual alternate path in case where the primary path is down. Would work for now.
For me using TCP fixed it as well. May be your problem originates somewhere else and cause TCP not to work as well.
It could be, although I was testing it.
It might have been caused by what I’ve had to do with IP translation (iptables with DNAT), so it can be an issue.
After all, I’ve had to capture and translate for another IP address, both directions.
I migth try and test it again, although this whole setup got too complicated for my replacement to handle on-site.
Ez