New plab vnode monitor framework, now with proactive node checking action!
The old monitor has been completely replaced. The new one uses modular pools to test and track plab nodes. There are currently two pool modules: good and bad. The good pool tests nodes that are not known to have issues, to proactively find problems and push nodes into the "bad" pool when necessary. The bad pool acts similarly to the old plabmonitor; it does an end-to-end test on nodes, and if and when they finally come up, moves them to the good pool.

Both pools have a testing backoff mechanism that works as follows:

* The node is tested right away upon entering either pool
* Node fails to set up:
  * goodpool: node is sent to bad pool (hwdown)
  * badpool: node is scheduled to be retested according to an additive backoff function, maxing out at 1 hour.
* Node setup succeeds:
  * goodpool: node is scheduled to be retested according to an additive backoff function, maxing out at 1 hour.
  * badpool: node is moved to good pool.

The backoff thing may be bogus, we'll see. It seems like a reasonable thing to do though: there is no need to hammer a node with tests if it consistently succeeds or fails. Nodes that flop back and forth will get the most testing punishment. A future enhancement will be to watch for flopping and force nodes that exhibit this behavior to pass several consecutive tests before being eligible to return to the good pool.

The monitor only allows a configurable window's worth of outstanding tests to go on at once. When tests finish, more node tests are allowed to start up right away.

Some refactoring needs to be done. Currently the good and bad pools share quite a bit of duplicated code. I don't know if I dare venture into inheritance with perl, but that would be a good way to approach this.

Some other pool module ideas:

* dynamic setup pools
  When experiments with plab vnodes are swapped in, use the plab monitor to manage setting up the vnodes by dynamically creating pools on a per-experiment basis. This has the advantage that the monitor can keep a global cap on the number of outstanding setup operations. These pools might also try to bring up vnodes that failed to set up during swapin later on, along with other vnode monitoring tasks.

* "all nodes" pools
  Similar to the dynamic pools just mentioned, but with the mission to extend experiments to all plab nodes possible (as nodes come and go). Useful for services.
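
For readers who want the pool transitions and backoff rules spelled out, here is a minimal sketch in Python. It is illustrative only and is not the monitor's actual code. The names `Node`, `record_result`, and `STEP` are hypothetical, and the 5 minute increment is an assumption: the message only states that the backoff is additive and capped at 1 hour.

```python
from dataclasses import dataclass

MAX_INTERVAL = 3600  # seconds; the 1 hour cap stated in the message
STEP = 300           # seconds; hypothetical additive increment (not given in the message)


@dataclass
class Node:
    name: str
    pool: str = "bad"   # "good" or "bad"
    interval: int = 0   # seconds until the next test; 0 means test right away


def record_result(node: Node, setup_ok: bool) -> int:
    """Apply one test result and return the delay before the next test."""
    # A node stays put when the result matches its pool:
    # success in the good pool, or failure in the bad pool.
    stays = (node.pool == "good") == setup_ok
    if stays:
        # Additive backoff, capped at one hour.
        node.interval = min(node.interval + STEP, MAX_INTERVAL)
    else:
        # The node changes pools and is tested right away on entry.
        node.pool = "bad" if node.pool == "good" else "good"
        node.interval = 0
    return node.interval
```

Under these assumptions, a node that flops between pools keeps resetting its interval to zero, which matches the observation that flopping nodes get tested the most.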
Showing with 1141 additions and 205 deletions