Kirk Webb authored
Well, here it is: The checkin implementing robust recovery/retry and asynchronous safe termination in plab allocation/deallocation/setup.

Here are some of the more prominent changes/additions:

* Bounded plab agent communication. Scripts should never hang waiting for plab xmlrpc commands to complete; the commands have their own internal timeouts. Node.create() in libplab is an exception, but it is always run under a timeout constraint in vnode_setup and can be changed easily if the need arises.

* Wrote functions in libplab to do the retry/recovery/timeout of remote command execution.

* Wrapped critical sections with a signal watcher.

* Added code to handle various error conditions properly.

* Added a libtestbed function, TBForkCmd, which runs a given program in a child process, and can optionally catch incoming SIGTERMs and terminate the child (then exit itself).

* Fixed up vnode_setup to batch the 'plabnode free' operation, along with a few other cleanups. This should alleviate Jay's concern about how long it used to take to tear down a plab expt.

* Whacked plabmonitord into better shape: fixed a couple of bugs, taught it how to daemonize, and implemented a priority list for testing broken plab nodes. This list causes new (as yet unseen) nodes to be tried before ones that have already been tested.
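For readers who want a feel for the retry/timeout idea, here is a minimal sketch in Python. It is not the libplab code: the name run_with_retry and its parameters are hypothetical, and it runs a local command with the standard subprocess module rather than a remote plab command. It only shows the general shape of giving each attempt a deadline and giving up after a fixed number of tries, so the caller never hangs.

```python
import subprocess
import sys
import time


def run_with_retry(argv, timeout_s=5.0, max_tries=3, delay_s=0.0):
    """Run argv, abandoning each attempt after timeout_s seconds.

    Returns (succeeded, attempts_made).
    Hypothetical helper for illustration; not part of libplab.
    """
    for attempt in range(1, max_tries + 1):
        try:
            # subprocess.run kills the child and raises TimeoutExpired
            # if the command does not finish within timeout_s.
            result = subprocess.run(argv, timeout=timeout_s)
            if result.returncode == 0:
                return True, attempt
        except subprocess.TimeoutExpired:
            pass  # a hung command counts as a failed attempt
        if attempt < max_tries:
            time.sleep(delay_s)  # optional pause before retrying
    return False, max_tries


# Example: a command that exits cleanly succeeds on the first try.
ok, tries = run_with_retry([sys.executable, "-c", "pass"])
```

Because both the per-attempt timeout and the number of attempts are bounded, the worst-case wait is known up front, which is the property the first bullet above is after.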
5b52831c