    · 5b52831c
    Kirk Webb authored
    Well, here it is:  The checkin implementing robust recovery/retry and
    asynchronous safe termination in plab allocation/deallocation/setup.
    
    Here are some of the more prominent changes/additions:
    
    * Bounded plab agent communication
      Scripts should never hang waiting for plab xmlrpc commands to complete;
      they have their own internal timeouts.  Node.create() in libplab is an
      exception, but is always run under a timeout constraint in vnode_setup
      and can be changed easily if the need arises.
    
    * Wrote functions in libplab to handle the retry/recovery/timeout of remote
      command execution.
    
    * Wrapped critical sections with a signal watcher.
    
    * Added code to handle various error conditions properly.
    
    * Added a libtestbed function, TBForkCmd, which runs a given program in
      a child process, and can optionally catch incoming SIGTERMs and terminate
      the child (then exit itself).
    
    * Fixed up vnode_setup to batch the 'plabnode free' operation, along with
      a few other cleanups.  This should alleviate Jay's concern about how
      long it used to take to tear down a plab expt.
    
    * Whacked plabmonitord into better shape; fixed a couple of bugs, taught it
      how to daemonize, and implemented a priority list for testing broken plab
      nodes.  This list causes new (as yet unseen) nodes to be tried before ones
      that have already been tested.
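
For readers unfamiliar with the retry/recovery/timeout pattern mentioned above, here is a minimal sketch in Python of a bounded, retried command. It is not the libplab code: the function name `run_with_retry` and its parameters are hypothetical, and it runs a local subprocess rather than a remote plab command.

```python
import subprocess
import time


def run_with_retry(cmd, timeout=30.0, retries=3, delay=1.0):
    """Run cmd (a list of argv strings), retrying on failure or timeout.

    Hypothetical helper for illustration only; not the libplab API.
    Returns the CompletedProcess of the first successful attempt, or
    raises RuntimeError once every attempt has failed.
    """
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            # subprocess.run kills the child and raises TimeoutExpired
            # if it does not finish within `timeout` seconds.
            result = subprocess.run(
                cmd, capture_output=True, text=True, timeout=timeout
            )
            if result.returncode == 0:
                return result
            last_error = f"exit status {result.returncode}"
        except subprocess.TimeoutExpired:
            last_error = f"timed out after {timeout}s"
        # Pause before the next attempt, but not after the final one.
        if attempt < retries:
            time.sleep(delay)
    raise RuntimeError(
        f"{cmd[0]} failed after {retries} attempts: {last_error}"
    )
```

Because every attempt is bounded by `timeout` and the number of attempts by `retries`, the caller cannot hang indefinitely; the worst case is roughly `retries * timeout` plus the delays between attempts.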
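
The priority list for testing broken nodes could be sketched roughly as follows. This is an illustration in Python, not the plabmonitord code: `order_for_testing` and the `last_tested` map are hypothetical names, and ordering already-tested nodes by oldest test first is an assumption the commit does not state.

```python
def order_for_testing(broken_nodes, last_tested):
    """Return broken_nodes ordered so never-tested nodes come first.

    Hypothetical helper for illustration only.  last_tested maps a node
    name to the timestamp of its most recent test; nodes missing from
    the map have not been seen yet.
    """
    # Sort key: (False, 0) for unseen nodes sorts ahead of (True, ts).
    # Among tested nodes, the least recently tested comes first
    # (an assumption, not something the commit specifies).
    return sorted(
        broken_nodes,
        key=lambda node: (node in last_tested, last_tested.get(node, 0)),
    )
```

For example, with nodes `a` and `c` last tested at times 200 and 100 and nodes `b` and `d` never tested, the order is `b`, `d`, `c`, `a`.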