rc.d/3.testbed.sh.in · c47cefa19e55d53a5c32a0817a6d7231470ac102 · emulab / emulab-devel

Leigh B. Stoller authored Apr 26, 2005

A watchdog daemon to try and catch (and recover from) the periodic
mysqld hangs that cause the entire system to grind to a halt. The
basic theory of operation is like this:

* Once a minute fork a child (protected by a 60 second timeout) to
  connect to the DB and issue a simple query. If the child can access
  the DB okay, it exits with a zero status.

* If the alarm fires, the child is killed. This indicates that mysqld
  is no longer responding in a reasonable amount of time (60 seconds).
  We shift into trying to restart mysqld:

     * Send mysqld a TERM. Wait for 30 seconds.

     * Try query again; typically, the situation will not have changed one
       bit, but I do it anyway.

     * If mysqld was running, send it a kill -9. Wait for 15 seconds.

     * Start mysqld. Wait for 5 seconds.

     * Try query again. If query succeeds, we are done, and no one
       will have to deal with it Sunday morning at 6am (thanks Tim).

     * If query still fails, send email and give up trying to fix
       anything. The daemon continues to query the DB once a minute;
       once the query succeeds (because a human fixed things up), the
       daemon goes back into its normal mode (attempt to fix things
       next time it fails).
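
The steps above can be sketched roughly as follows. This is only an
illustration in Python, not the daemon's actual code: probe(), recover()
and step() are made-up names, and the check, is_running, send_signal,
start and notify hooks are hypothetical callables the caller would
supply. Two points the description leaves open are filled in by
assumption and marked as such in the comments.

```python
import subprocess
import time


def probe(cmd, timeout=60):
    """Run the query command as a child; True only if it exits 0 in time."""
    try:
        return subprocess.run(cmd, timeout=timeout).returncode == 0
    except subprocess.TimeoutExpired:
        # subprocess.run() kills and reaps the child before raising.
        return False


def recover(check, is_running, send_signal, start, notify, sleep=time.sleep):
    """One restart attempt; True if the DB answers a query afterwards.

    All five callables are hypothetical hooks supplied by the caller.
    """
    send_signal("TERM")
    sleep(30)
    if check():
        # Assumption: the description does not say what a success here
        # means; this sketch treats it as recovered.
        return True
    if is_running():
        send_signal("KILL")
        sleep(15)  # Assumption: only wait if a KILL was actually sent.
    start()
    sleep(5)
    if check():
        return True
    notify("mysqld watchdog: restart failed, giving up until the DB answers")
    return False


def step(gave_up, check, attempt_recovery):
    """One once-a-minute iteration; returns the new value of gave_up."""
    if check():
        return False   # healthy: back to normal mode
    if gave_up:
        return True    # still broken: keep polling, do not retry the fix
    return not attempt_recovery()
```

A driver would call step() once a minute, passing a check built on
probe() with whatever client command issues the simple query, and
carrying gave_up from one iteration to the next.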

So, the problem is what happens when someone kills off mysqld for some
other reason. It may be that this daemon should try to restart mysqld
only if it actually killed a running mysqld. Comments?
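
One possible reading of that proposal, again as a hedged Python sketch
rather than anything the daemon does today: restart_if_we_killed() is a
made-up name, the is_running, send_signal and start hooks are
hypothetical, and "actually killed" is interpreted here as "mysqld was
running when the watchdog started signalling it".

```python
import time


def restart_if_we_killed(is_running, send_signal, start, sleep=time.sleep):
    """Restart mysqld only if this call found it running and killed it.

    Returns True if a restart was issued, False if mysqld was left alone.
    """
    if not is_running():
        # Already down, presumably stopped by someone on purpose.
        return False
    send_signal("TERM")
    sleep(30)
    if is_running():
        send_signal("KILL")
        sleep(15)
    start()
    sleep(5)
    return True
```

With this guard, a mysqld that was shut down deliberately stays down; the
once-a-minute query would still fail, so the email path would need to
cover that case separately.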
