Commit f1fa5a51 authored by Kirk Webb

New plab vnode monitor framework, now with proactive node checking action!

The old monitor has been completely replaced.  The new one uses modular pools
to test and track plab nodes.  There are currently two pool modules:
good and bad.  The good pool tests nodes that are not known to have
issues, to proactively find problems and push nodes into the "bad" pool
when necessary.  The bad pool acts similarly to the old plabmonitor; it
does an end-to-end test on nodes, and if and when they finally come up,
moves them to the good pool.  Both pools have a testing backoff mechanism
that works as follows:

  * The node is tested right away upon entering either pool
  * Node fails to set up:
    * goodpool: node is sent to bad pool (hwdown)
    * badpool:  node is scheduled to be retested according to
                an additive backoff function, maxing out at 1 hour.
  * Node setup succeeds:
    * goodpool: node is scheduled to be retested according to
                an additive backoff function, maxing out at 1 hour.
    * badpool:  node is moved to good pool.
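
The transitions above can be sketched as a small state function.  This is
an illustrative Python sketch, not the Perl pool modules from this commit.
NodeState, next_state, and the STEP increment are hypothetical names and
values; only the 1 hour cap and the "test right away on entry" rule come
from the description above.  Resetting the interval to zero on a pool change
is likewise an assumption.

```python
from dataclasses import dataclass

MAX_INTERVAL = 3600  # seconds; the 1 hour cap is from the commit message
STEP = 300           # seconds; hypothetical additive increment (not stated in the source)


@dataclass
class NodeState:
    pool: str          # "good" or "bad"
    interval: int = 0  # seconds until the next test; 0 means "test right away"


def next_state(state: NodeState, setup_ok: bool) -> NodeState:
    """Apply one test result and return the node's new pool and retest interval."""
    # A node stays put when a good node succeeds or a bad node fails.
    stays = (state.pool == "good") == setup_ok
    if stays:
        # Same pool: back off additively, capped at one hour.
        return NodeState(state.pool, min(state.interval + STEP, MAX_INTERVAL))
    # Pool change: the node is tested right away upon entering the other pool.
    return NodeState("bad" if state.pool == "good" else "good", 0)
```

With these assumed values, a node that keeps passing in the good pool is
retested after 300s, 600s, and so on up to 3600s, while a node that flops
between pools keeps returning to an interval of zero.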

The backoff thing may be bogus, we'll see.  It seems like a reasonable thing
to do though - no need to hammer a node with tests if it consistently
succeeds or fails.  Nodes that flop back and forth will get the most
testing punishment.  A future enhancement will be to watch for flopping
and force nodes that exhibit this behavior to pass several consecutive
tests before being eligible to return to the good pool.

The monitor only allows a configurable window's worth of outstanding
tests to go on at once.  When tests finish, more node tests are allowed
to start up right away.
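
A minimal sketch of such a window, assuming a simple FIFO of nodes waiting
to be tested.  OutstandingWindow and its method names are hypothetical and
written in Python for illustration; the actual monitor is Perl and may
order or launch tests differently.

```python
from collections import deque


class OutstandingWindow:
    """Allow at most `size` tests to be outstanding at once (hypothetical helper)."""

    def __init__(self, size, start_test):
        self.size = size              # the configurable window
        self.start_test = start_test  # callback that launches a test for a node
        self.waiting = deque()        # nodes due for a test but not yet started
        self.running = set()          # nodes whose tests are outstanding

    def submit(self, node):
        """Queue a node for testing; it starts immediately if a slot is free."""
        self.waiting.append(node)
        self._fill()

    def finished(self, node):
        """Record a completed test, which frees a slot for a waiting node."""
        self.running.discard(node)
        self._fill()

    def _fill(self):
        # Start waiting tests until the window is full or nothing is waiting.
        while self.waiting and len(self.running) < self.size:
            node = self.waiting.popleft()
            self.running.add(node)
            self.start_test(node)
```

With a window of 2, submitting three nodes starts the first two; the third
starts as soon as finished() is called for either of them.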

Some refactoring needs to be done.  Currently the good and bad pools share
quite a bit of duplicated code.  I don't know if I dare venture into
inheritance with Perl, but that would be a good way to approach this.

Some other pool module ideas:

* dynamic setup pools

When experiments w/ plab vnodes are swapped in, use the plab monitor to
manage setting up the vnodes by dynamically creating pools on a per-experiment
basis.  This has the advantage that the monitor can keep a global cap on
the number of outstanding setup operations.  These pools might also retry,
later on, vnodes that failed to set up during swapin, along with other
vnode monitoring tasks.

* "all nodes" pools

Similar to the dynamic pools just mentioned, but with the mission to extend
experiments to all plab nodes possible (as nodes come and go).  Useful for
services.
parent 70cbdf5e
@@ -2329,8 +2329,11 @@ outfiles="$outfiles Makeconf GNUmakefile \
 tbsetup/plab/mod_dslice.py tbsetup/plab/mod_PLC.py \
 tbsetup/plab/mod_PLCNM.py \
 tbsetup/plab/plabslice tbsetup/plab/plabnode tbsetup/plab/plabrenewd \
+tbsetup/plab/plabrenewonce \
 tbsetup/plab/plabmetrics tbsetup/plab/plabstats \
-tbsetup/plab/plabmonitord tbsetup/plab/plablinkdata \
+tbsetup/plab/plabmonitord \
+tbsetup/plab/plabmon_badpool.pm tbsetup/plab/plabmon_goodpool.pm \
+tbsetup/plab/plablinkdata \
 tbsetup/plab/libdslice/GNUmakefile tbsetup/plab/etc/GNUmakefile \
 tbsetup/plab/plabdist tbsetup/plab/plabhttpd \
 tbsetup/plab/plabdiscover tbsetup/plab/etc/netbed_files/GNUmakefile \
@@ -764,8 +764,11 @@ outfiles="$outfiles Makeconf GNUmakefile \
 tbsetup/plab/mod_dslice.py tbsetup/plab/mod_PLC.py \
 tbsetup/plab/mod_PLCNM.py \
 tbsetup/plab/plabslice tbsetup/plab/plabnode tbsetup/plab/plabrenewd \
+tbsetup/plab/plabrenewonce \
 tbsetup/plab/plabmetrics tbsetup/plab/plabstats \
-tbsetup/plab/plabmonitord tbsetup/plab/plablinkdata \
+tbsetup/plab/plabmonitord \
+tbsetup/plab/plabmon_badpool.pm tbsetup/plab/plabmon_goodpool.pm \
+tbsetup/plab/plablinkdata \
 tbsetup/plab/libdslice/GNUmakefile tbsetup/plab/etc/GNUmakefile \
 tbsetup/plab/plabdist tbsetup/plab/plabhttpd \
 tbsetup/plab/plabdiscover tbsetup/plab/etc/netbed_files/GNUmakefile \
@@ -30,6 +30,7 @@ use vars qw(@ISA @EXPORT);
 PROJROOT GROUPROOT USERROOT TBOPSPID EXPTLOGNAME
 PLABMOND_PID PLABMOND_EID PLABHOLDING_PID PLABHOLDING_EID
+PLABTESTING_PID PLABTESTING_EID PLABDOWN_PID PLABDOWN_EID
 TBTrustConvert TBMinTrust TBGrpTrust TBProjTrust MapNumericUID
@@ -402,8 +403,12 @@ sub NODEDEAD_PID() { $TBOPSPID; }
 sub NODEDEAD_EID() { "hwdown"; }
 sub PLABMOND_PID() { $TBOPSPID; }
 sub PLABMOND_EID() { "plab-monitor"; }
+sub PLABTESTING_PID() { $TBOPSPID; }
+sub PLABTESTING_EID() { "plab-testing"; }
 sub PLABHOLDING_PID() { $TBOPSPID; }
-sub PLABHOLDING_EID() { "plabnodes"; }
+sub PLABHOLDING_EID() { "plabup"; }
+sub PLABDOWN_PID() { $TBOPSPID; }
+sub PLABDOWN_EID() { "plabdown"; }
 sub OLDRESERVED_PID() { $TBOPSPID; }
 sub OLDRESERVED_EID() { "oldreserved"; }
 sub NFREELOCKED_PID() { $TBOPSPID; }
@@ -18,7 +18,8 @@ SBIN_STUFF = plabslice plabnode plabrenewd plabmetrics plabstats \
 plabmonitord plablinkdata plabdist plabhttpd plabdiscover \
 plabrenewonce
-LIB_STUFF = libplab.py mod_dslice.py mod_PLC.py mod_PLCNM.py
+LIB_STUFF = libplab.py mod_dslice.py mod_PLC.py mod_PLCNM.py \
+plabmon_badpool.pm plabmon_goodpool.pm
 LIBEXEC_STUFF = webplabstats