tbsetup/plab/plabmonitord.in · f1fa5a5147992cc80f26fb1d54a417bc063b68c8 · emulab / emulab-devel

Kirk Webb authored Aug 18, 2006
New plab vnode monitor framework, now with proactive node checking action!

The old monitor has been completely replaced.  The new one uses modular pools
to test and track plab nodes.  There are currently two pool modules:
good and bad.  The good pool tests nodes that are not known to have
issues, to proactively find problems and push nodes into the bad pool
when necessary.  The bad pool acts similarly to the old plabmonitor; it
does an end-to-end test on nodes, and if and when they finally come up,
moves them to the good pool.  Both pools have a testing backoff mechanism
that works as follows:

  * The node is tested right away upon entering either pool
  * Node setup fails:
    * goodpool: node is sent to bad pool (hwdown)
    * badpool:  node is scheduled to be retested according to
                an additive backoff function, maxing out at 1 hour.
  * Node setup succeeds:
    * goodpool: node is scheduled to be retested according to
                an additive backoff function, maxing out at 1 hour.
    * badpool:  node is moved to good pool.
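
The transitions above can be sketched roughly as follows.  This is an
illustration in Python, not the monitor's actual Perl code.  The 1 hour cap
comes from the description above; the STEP increment, the NodeState and
record_result names, and resetting the interval to zero when a node changes
pools (so that it is tested right away) are assumptions.

```python
from dataclasses import dataclass

MAX_INTERVAL = 3600  # seconds; the 1 hour cap described above
STEP = 300           # seconds; hypothetical additive increment


@dataclass
class NodeState:
    pool: str = "good"  # "good" or "bad"
    interval: int = 0   # seconds until the next test; 0 means test right away


def record_result(node: NodeState, setup_ok: bool) -> NodeState:
    """Apply one test result and return the node's updated state."""
    # A result that contradicts the current pool moves the node:
    # a failure in the good pool, or a success in the bad pool.
    moves = (node.pool == "good") != setup_ok
    if moves:
        node.pool = "bad" if node.pool == "good" else "good"
        node.interval = 0  # assumed: a node entering a pool is tested right away
    else:
        # Same outcome as expected for this pool: back off additively, capped.
        node.interval = min(node.interval + STEP, MAX_INTERVAL)
    return node
```

With these assumed values, a node that keeps succeeding in the good pool is
retested after 5, 10, 15, ... minutes until the interval reaches one hour,
while a node that alternates between passing and failing never gets past the
first step.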

The backoff thing may be bogus, we'll see.  It seems like a reasonable thing
to do though - no need to hammer a node with tests if it consistently
succeeds or fails.  Nodes that flop back and forth will get the most
testing punishment.  A future enhancement will be to watch for flopping
and force nodes that exhibit this behavior to pass several consecutive
tests before being eligible to return to the good pool.

The monitor only allows a configurable window's worth of outstanding
tests to go on at once.  When tests finish, more node tests are allowed
to start up right away.
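
A minimal sketch of that windowing idea, again in Python rather than the
monitor's Perl.  The OutstandingWindow class and its method names are
hypothetical, and the real monitor launches and reaps test processes rather
than just tracking names.

```python
from collections import deque


class OutstandingWindow:
    """Allow at most `size` tests to be outstanding at once."""

    def __init__(self, size: int):
        self.size = size
        self.pending = deque()  # nodes waiting for a free slot
        self.running = set()    # nodes with a test in progress

    def submit(self, node: str) -> list:
        """Queue a node for testing; return the nodes started as a result."""
        self.pending.append(node)
        return self._fill()

    def finish(self, node: str) -> list:
        """Mark a test as done; return the nodes started in the freed slot."""
        self.running.discard(node)
        return self._fill()

    def _fill(self) -> list:
        # Start pending tests, oldest first, until the window is full.
        started = []
        while self.pending and len(self.running) < self.size:
            node = self.pending.popleft()
            self.running.add(node)
            started.append(node)
        return started
```

Refilling the window inside finish() is what lets new tests start as soon as
a slot frees up, instead of waiting for a whole batch to complete.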

Some refactoring needs to be done.  Currently the good and bad pools share
quite a bit of duplicated code.  I don't know if I dare venture into
inheritance with perl, but that would be a good way to approach this.

Some other pool module ideas:

* dynamic setup pools

When experiments w/ plab vnodes are swapped in, use the plab monitor to
manage setting up the vnodes by dynamically creating pools on a per-experiment
basis.  This has the advantage that the monitor can keep a global cap on
the number of outstanding setup operations.  These pools might also retry,
later on, vnodes that failed to set up during swapin, and handle other
vnode monitoring tasks.

* "all nodes" pools

Similar to the dynamic pools just mentioned, but with the mission to extend
experiments to all plab nodes possible (as nodes come and go).  Useful for
services.