  • Leigh B. Stoller authored c47cefa1
    A watchdog daemon to try and catch (and recover from) the periodic
    mysqld hangs that cause the entire system to grind to a halt. The
    basic theory of operation is like this:
    
    * Once a minute fork a child (protected by a 60 second timeout) to
      connect to the DB and issue a simple query. If the child can access
      the DB okay, it exits with a zero status.
    
    * If the alarm fires, the child is killed. This indicates that mysqld
      is no longer responding in a reasonable amount of time (60 seconds).
      We shift into trying to restart mysqld:
    
         * Send mysqld a TERM. Wait for 30 seconds.
    
         * Try query again; typically, the situation will not have changed one
           bit, but I do it anyway.
    
         * If mysqld was running, send it a kill -9. Wait for 15 seconds.
    
         * Start mysqld. Wait for 5 seconds.
    
         * Try query again. If query succeeds, we are done, and no one
           will have to deal with it Sunday morning at 6am (thanks Tim).
    
         * If query still fails, send email and give up trying to fix
           anything. The daemon continues to query the DB once a minute;
           once the query succeeds (because a human fixed things up), the
           daemon goes back into its normal mode (attempt to fix things
           next time it fails).
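
The steps above can be sketched roughly as follows. This is a minimal Python illustration, not the daemon's actual code: the names `probe`, `recover`, and `watchdog` are hypothetical, the database query and the mysqld control actions are passed in as callables rather than implemented, and the child timeout is enforced by polling against a deadline instead of an alarm signal. Returning early when the query succeeds after the TERM is also an assumption; the description above does not say what happens in that case.

```python
import os
import signal
import time


def probe(check, timeout=60.0):
    """Run check() in a forked child; True only if it exits 0 before the timeout."""
    pid = os.fork()
    if pid == 0:
        # Child: exit status 0 means the query worked, anything else is a failure.
        status = 1
        try:
            status = 0 if check() else 1
        finally:
            os._exit(status)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        done, status = os.waitpid(pid, os.WNOHANG)
        if done == pid:
            return os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0
        time.sleep(0.05)
    # Timed out: the DB is not answering, so kill the hung child and reap it.
    os.kill(pid, signal.SIGKILL)
    os.waitpid(pid, 0)
    return False


def recover(query, is_running, send_signal, start, notify, sleep=time.sleep):
    """One pass of the restart sequence; True if the DB answers afterwards."""
    send_signal(signal.SIGTERM)
    sleep(30)
    if query():  # assumption: stop here if the DB came back on its own
        return True
    if is_running():
        send_signal(signal.SIGKILL)
        sleep(15)
    start()
    sleep(5)
    if query():
        return True
    notify("mysqld watchdog: restart failed, giving up until the DB answers")
    return False


def watchdog(query, recover_once, cycles, sleep=time.sleep):
    """Check once a minute; after a failed recovery, only watch until a query succeeds."""
    gave_up = False
    for _ in range(cycles):
        if query():
            gave_up = False  # back to normal mode
        elif not gave_up:
            gave_up = not recover_once()
        sleep(60)
```

In a real daemon, `query` would be something like `lambda: probe(run_simple_query)`, where `run_simple_query` (hypothetical) connects to the DB and issues the test query. Passing `sleep` in makes the sequence easy to exercise without waiting out the real delays.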
    
    So, the problem is what happens when someone kills off mysqld for some
    other reason. It may be that this daemon should try to restart
    mysqld if and only if it actually killed a running mysqld. Comments?