    · f1fa5a51
    Kirk Webb authored
    New plab vnode monitor framework, now with proactive node checking action!
    
    The old monitor has been completely replaced.  The new one uses modular pools
    to test and track plab nodes.  There are currently two pool modules:
    good and bad.  The good pool tests nodes that are not known to have
    issues to proactively find problems and push nodes into the "bad" pool
    when necessary.  The bad pool acts similarly to the old plabmonitor; it
    does an end-to-end test on nodes, and if and when they finally come up,
    moves them to the good pool.  Both pools have a testing backoff mechanism
    that works as follows:
    
      * The node is tested right away upon entering either pool
      * Node fails to set up:
        * goodpool: node is sent to bad pool (hwdown)
        * badpool:  node is scheduled to be retested according to
                    an additive backoff function, maxing out at 1 hour.
      * Node setup succeeds:
        * goodpool: node is scheduled to be retested according to
                    an additive backoff function, maxing out at 1 hour.
        * badpool:  node is moved to good pool.
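
The transition rules above can be sketched as a small state machine.  This is
an illustrative Python sketch, not the Perl code in this commit: the names
`Node`, `record_result`, and `BACKOFF_STEP` are hypothetical, and the
5-minute step is an assumed value (the message only fixes the 1-hour cap).

```python
from dataclasses import dataclass

BACKOFF_STEP = 300   # assumed value: seconds added per repeated result
BACKOFF_MAX = 3600   # from the commit message: backoff maxes out at 1 hour


@dataclass
class Node:
    name: str
    pool: str = "good"   # "good" or "bad"
    interval: int = 0    # 0 means "test right away"


def record_result(node, setup_ok):
    """Apply one test result and return the delay before the next test."""
    # A node stays put when the result matches its pool:
    # success in the good pool, or failure in the bad pool.
    stays = (node.pool == "good") == setup_ok
    if stays:
        # Additive backoff, capped at one hour.
        node.interval = min(node.interval + BACKOFF_STEP, BACKOFF_MAX)
    else:
        # Switch pools; a node is tested right away on entering a pool.
        node.pool = "bad" if node.pool == "good" else "good"
        node.interval = 0
    return node.interval
```

Under these assumptions, a node that keeps giving the same answer drifts
toward one test per hour, while a node that flops between pools keeps
resetting its interval to zero and so gets tested most often.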
    
    The backoff thing may be bogus, we'll see.  It seems like a reasonable thing
    to do though - no need to hammer a node with tests if it consistently
    succeeds or fails.  Nodes that flop back and forth will get the most
    testing punishment.  A future enhancement will be to watch for flopping
    and force nodes that exhibit this behavior to pass several consecutive
    tests before being eligible to return to the good pool.
    
    The monitor only allows a configurable window's worth of outstanding
    tests to go on at once.  When tests finish, more node tests are allowed
    to start up right away.
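
One way to picture the outstanding-test window is a queue of waiting nodes
plus a bounded set of running tests.  The sketch below is illustrative Python
only; `TestWindow` and its methods are hypothetical names and do not reflect
how the Perl monitor is actually structured.

```python
from collections import deque


class TestWindow:
    """Caps the number of tests that may be outstanding at once."""

    def __init__(self, size):
        self.size = size          # configurable window size
        self.pending = deque()    # nodes waiting for a free slot
        self.running = set()      # nodes with a test in flight

    def submit(self, node):
        """Queue a node for testing; return the nodes started as a result."""
        self.pending.append(node)
        return self._fill()

    def finish(self, node):
        """Mark a test as done; return the nodes started in the freed slot."""
        self.running.discard(node)
        return self._fill()

    def _fill(self):
        # Start waiting tests until the window is full again.
        started = []
        while self.pending and len(self.running) < self.size:
            node = self.pending.popleft()
            self.running.add(node)
            started.append(node)
        return started
```

With a window of 2, a third submitted node waits in `pending` and is started
by the first call to `finish`, which matches the "start up right away"
behavior described above.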
    
    Some refactoring needs to be done.  Currently the good and bad pools share
    quite a bit of duplicated code.  I don't know if I dare venture into
    inheritance with perl, but that would be a good way to approach this.
    
    Some other pool module ideas:
    
    * dynamic setup pools
    
    When experiments w/ plab vnodes are swapped in, use the plab monitor to
    manage setting up the vnodes by dynamically creating pools on a per-experiment
    basis.  This has the advantage that the monitor can keep a global cap on
    the number of outstanding setup operations.  These pools might also retry,
    later on, vnodes that failed to set up during swapin, and take on other
    vnode monitoring tasks.
    
    * "all nodes" pools
    
    Similar to the dynamic pools just mentioned, but with the mission to extend
    experiments to all plab nodes possible (as nodes come and go).  Useful for
    services.