Well, here it is: the checkin implementing robust recovery/retry and asynchronous safe termination in plab allocation/deallocation/setup.

Here are some of the more prominent changes/additions:

* Bounded plab agent communication. Scripts should never hang waiting for plab xmlrpc commands to complete; they have their own internal timeouts. Node.create() in libplab is an exception, but it is always run under a timeout constraint in vnode_setup and can be changed easily if the need arises.

* Wrote functions in libplab to do the retry/recovery/timeout of remote command execution.

* Wrapped critical sections with a signal watcher.

* Added code to handle various error conditions properly.

* Added a libtestbed function, TBForkCmd, which runs a given program in a child process, and can optionally catch incoming SIGTERMs and terminate the child (then exit itself).

* Fixed up vnode_setup to batch the 'plabnode free' operation, along with a few other cleanups. This should alleviate Jay's concern about how long it used to take to tear down a plab expt.

* Whacked plabmonitord into better shape: fixed a couple bugs, taught it how to daemonize, and implemented a priority list for testing broken plab nodes. This list causes new (as yet unseen) nodes to be tried first over ones that have been tested already.
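For readers who want a feel for the retry/timeout pattern described above, here is a minimal sketch. It is not the libplab code: run_with_retry, its parameters, and its defaults are all hypothetical, and it runs a local command through Python's subprocess module where the real code talks to a plab agent.

```python
import subprocess
import sys
import time


def run_with_retry(argv, timeout_s=5.0, max_tries=3, backoff_s=0.1):
    """Run argv, retrying on timeout or nonzero exit.

    Returns (succeeded, attempts_used). Every name and default here is
    hypothetical; this is not the libplab interface.
    """
    for attempt in range(1, max_tries + 1):
        try:
            # subprocess.run kills the child and raises TimeoutExpired
            # if the command outlives timeout_s, so we never hang here.
            result = subprocess.run(argv, timeout=timeout_s)
            if result.returncode == 0:
                return True, attempt
        except subprocess.TimeoutExpired:
            pass  # treat a hung command like any other failed attempt
        if attempt < max_tries:
            time.sleep(backoff_s * attempt)  # linear backoff between tries
    return False, max_tries


if __name__ == "__main__":
    # A command that succeeds on the first attempt.
    print(run_with_retry([sys.executable, "-c", "pass"]))
```

The point of the sketch is that both the per-attempt timeout and the number of attempts are bounded, so the caller always gets an answer in finite time and can decide how to recover.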
Showing with 810 additions and 271 deletions