Process monitoring, Keepalive, etc.
My new Linux server-to-be will need some remote monitoring and process keepalive. I’ve noticed that nscd (which is required when dealing with hundreds of LDAP-based accounts) tends to die once in a while. I also made a mistake once and managed to kill all the SSH daemons, including the ones handling live sessions. I’m happy to say it was solved by going down one floor, connecting a screen to the machine, and restarting the service. However, it would have been nasty had it happened in the colocation room inside our ISP’s server farm…
Since I’m trying to solve problems *before* they appear, I decided to search for a process keepalive daemon, or something else that would ease my life and make sure I don’t get any phone calls.
At first, searching for "process keepalive" led me to some pages about HA servers, a.k.a. High Availability clusters. I don’t need multi-node keepalive, so I didn’t bother with them. Installing CentOS’ or Dag’s keepalived proved to be exactly the thing I was not looking for, so I removed it and kept on searching.
Along the way I found this link, with a script that is meant to be run from cron. It’s fine for one or two processes, but maintaining a full load of about 10 processes, which I must keep alive at all times, is a bit too much for it. Since I can’t code Perl, I needed something else that scales better.
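For illustration, here is a minimal sketch of what such a cron-driven check might look like. It is not the linked script: it is written in Python, and the daemon list, pidfile paths and init script commands are all assumptions that will differ between systems.

```python
import os
import subprocess

# Hypothetical watch list: pidfile and start command per daemon.
# These paths are assumptions; adjust them to the actual system.
WATCHED = {
    "sshd": ("/var/run/sshd.pid", ["/etc/init.d/sshd", "start"]),
    "nscd": ("/var/run/nscd/nscd.pid", ["/etc/init.d/nscd", "start"]),
}


def pid_is_alive(pid):
    """Return True if a process with this PID exists."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 checks for existence, sends nothing
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def is_running(pidfile):
    """Return True if the pidfile names a live process."""
    try:
        with open(pidfile) as f:
            pid = int(f.read().strip())
    except (OSError, ValueError):
        return False  # missing, unreadable, or garbled pidfile
    return pid_is_alive(pid)


def start_daemon(command):
    """Run the start command; return True if it exited with status 0."""
    try:
        return subprocess.call(command) == 0
    except OSError:
        return False  # the command itself could not be executed


def check_all(watched, restart=start_daemon):
    """Call restart for every daemon found down; return their names."""
    down = []
    for name, (pidfile, command) in watched.items():
        if not is_running(pidfile):
            restart(command)
            down.append(name)
    return down


if __name__ == "__main__":
    for name in check_all(WATCHED):
        print("found down, tried to start:", name)
```

Run from cron every few minutes, something like this covers the simple case. It only knows what the pidfile says, though, so a daemon that is alive but hung goes unnoticed, and every new daemon means editing the script by hand.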
I saw lots of tools, and some of them looked interesting, but I wanted whatever I picked to be part of my package tree. I wanted it to be an RPM, so that I could upgrade it whenever updates come out, all without tracking each package by hand (which is a good enough reason to have a package management system in the first place).
I found just the thing in Dag Wieers’ RPM repository. It’s called "monit". It took me about 10 minutes to set it up and get it working, tested, for most of my more important daemons.
An example configuration file is here: monit.conf
It works, and it has made my life a lot easier. I can now easily recover from both human mistakes and machine errors. I might add some mail notification later, but for now I will settle for logs only.