    · 5b52831c
    Kirk Webb authored
    Well, here it is:  The checkin implementing robust recovery/retry and
    asynchronous safe termination in plab allocation/deallocation/setup.
    Here are some of the more prominent changes/additions:
    * Bounded plab agent communication
      Scripts should never hang waiting for plab xmlrpc commands to complete;
      they have their own internal timeouts.  Node.create() in libplab is an
      exception, but is always run under a timeout constraint in vnode_setup
      and can be changed easily if the need arises.
    * Wrote functions in libplab to do the retry/recovery/timeout of remote
      command execution.
    * Wrapped critical sections with a signal watcher.
    * Added code to handle various error conditions properly
    * Added a libtestbed function, TBForkCmd, which runs a given program in
      a child process, and can optionally catch incoming SIGTERMs and terminate
      the child (then exit itself).
    * Fixed up vnode_setup to batch the 'plabnode free' operation along with
      a few other cleanups.  This should alleviate Jay's concern about how
      long it used to take to tear down a plab expt.
    * Whacked plabmonitord into better shape; fixed a couple of bugs, taught it
      how to daemonize, and implemented a priority list for testing broken plab
      nodes.  This list causes new (as yet unseen) nodes to be tried before ones
      that have been tested already.
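
The bounded retry/timeout behaviour described in the list above could look roughly like the following Python sketch. This is not the libplab code: `run_with_retry` and its parameters (`tries`, `timeout`, `pause`) are hypothetical names chosen for illustration, and the sketch assumes the remote command is started as a local subprocess (for example an ssh invocation).

```python
import subprocess
import sys
import time


def run_with_retry(cmd, tries=3, timeout=5.0, pause=0.1):
    """Run cmd, retrying on timeout or non-zero exit (hypothetical helper).

    Each attempt is bounded by `timeout` seconds, so the caller can never
    hang for longer than roughly tries * (timeout + pause).
    Returns the command's stdout on success, raises RuntimeError otherwise.
    """
    last_error = None
    for attempt in range(1, tries + 1):
        try:
            # subprocess.run kills the child if the timeout expires.
            result = subprocess.run(
                cmd, capture_output=True, text=True, timeout=timeout
            )
        except subprocess.TimeoutExpired:
            last_error = f"attempt {attempt}: timed out after {timeout}s"
        else:
            if result.returncode == 0:
                return result.stdout
            last_error = f"attempt {attempt}: exit status {result.returncode}"
        if attempt < tries:
            time.sleep(pause)  # brief back-off before the next try
    raise RuntimeError(f"command failed after {tries} tries ({last_error})")


if __name__ == "__main__":
    # Stand-in for a remote command: a short local Python one-liner.
    print(run_with_retry([sys.executable, "-c", "print('node ready')"]), end="")
```

Raising an exception once the attempts are exhausted leaves the recovery decision (free the node, mark it broken, try another) to the caller instead of hiding it inside the helper.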
etc
libdslice
GNUmakefile.in
libplab.py.in
plabdaemon.in
plabmetrics.in
plabmonitord.in
plabnode.in
plabslice.in
plabstats.in
webplabstats.in