1. 17 Aug, 2006 5 commits
    • · f1fa5a51
      Kirk Webb authored
      New plab vnode monitor framework, now with proactive node checking action!
      The old monitor has been completely replaced.  The new one uses modular pools
      to test and track plab nodes.  There are currently two pool modules:
      good and bad.  The good pool tests nodes that are not known to have
      issues to proactively find problems and push nodes into the "bad" pool
      when necessary.  The bad pool acts similarly to the old plabmonitor; it
      does an end-to-end test on nodes, and if and when they finally come up,
      moves them to the good pool.  Both pools have a testing backoff mechanism
      that works as follows:
        * The node is tested right away upon entering either pool
        * Node fails to set up:
          * goodpool: node is sent to bad pool (hwdown)
          * badpool:  node is scheduled to be retested according to
                      an additive backoff function, maxing out at 1 hour.
        * Node setup succeeds:
          * goodpool: node is scheduled to be retested according to
                      an additive backoff function, maxing out at 1 hour.
          * badpool:  node is moved to good pool.
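A minimal sketch of the pool transitions listed above, written in Python rather than the monitor's Perl. Only the one-hour cap comes from the description; `BACKOFF_STEP`, `NodeState`, and `next_state` are made-up names and values for illustration, not the monitor's actual code.

```python
from dataclasses import dataclass

MAX_INTERVAL = 3600   # one-hour cap, taken from the commit message
BACKOFF_STEP = 300    # hypothetical increment; the real step is not stated


@dataclass
class NodeState:
    pool: str = "good"   # "good" or "bad"
    interval: int = 0    # seconds until the next test; 0 means test right away


def next_state(state: NodeState, setup_ok: bool) -> NodeState:
    """Apply one test result and return the node's new pool and retest interval."""
    # A good-pool node that passes, or a bad-pool node that fails, stays put.
    stays = (state.pool == "good") == setup_ok
    if stays:
        # Consistent result: back off additively, up to the cap.
        return NodeState(state.pool, min(state.interval + BACKOFF_STEP, MAX_INTERVAL))
    # Result contradicts the pool: switch pools and test immediately on entry.
    new_pool = "bad" if state.pool == "good" else "good"
    return NodeState(new_pool, 0)
```

Under these assumptions a node that keeps passing in the good pool is retested less and less often, while a node that flips between pools resets to an immediate test each time it moves.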
      The backoff thing may be bogus, we'll see.  It seems like a reasonable thing
      to do though - no need to hammer a node with tests if it consistently
      succeeds or fails.  Nodes that flop back and forth will get the most
      testing punishment.  A future enhancement will be to watch for flopping
      and force nodes that exhibit this behavior to pass several consecutive
      tests before being eligible to return to the good pool.
      The monitor only allows a configurable window's worth of outstanding
      tests to go on at once.  When tests finish, more node tests are allowed
      to start up right away.
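The windowing behavior can be sketched as a small simulation. This is an illustration under assumptions, not the monitor's scheduler: `schedule_tests` and its `durations` input are hypothetical, and the test lengths are invented so the start times can be computed.

```python
import heapq


def schedule_tests(durations, window):
    """Return {node: start_time} when at most `window` tests run at once.

    `durations` maps node name -> test length in seconds (illustrative input).
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    pending = list(durations.items())   # nodes still waiting, in order
    running = []                        # min-heap of (finish_time, node)
    starts = {}
    now = 0
    while pending or running:
        # Fill any free slots in the window.
        while pending and len(running) < window:
            node, length = pending.pop(0)
            starts[node] = now
            heapq.heappush(running, (now + length, node))
        # Advance to the next completion, which frees a slot.
        now, _ = heapq.heappop(running)
    return starts
```

With a window of 2 and three nodes, the third test starts at the moment the shorter of the first two finishes.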
      Some refactoring needs to be done.  Currently the good and bad pools share
      quite a bit of duplicated code.  I don't know if I dare venture into
      inheritance with perl, but that would be a good way to approach this.
      Some other pool module ideas:
      * dynamic setup pools
      When experiments w/ plab vnodes are swapped in, use the plab monitor to
      manage setting up the vnodes by dynamically creating pools on a per-experiment
      basis.  This has the advantage that the monitor can keep a global cap on
      the number of outstanding setup operations.  These pools might also try to
      bring up vnodes that failed to set up during swapin later on, along with other
      vnode monitoring tasks.
      * "all nodes" pools
      Similar to the dynamic pools just mentioned, but with the mission to extend
      experiments to all plab nodes possible (as nodes come and go).  Useful for
    • Rationalized Rob's previous checkin with mine to remove the additional... · 70cbdf5e
      Jonathon Duerig authored
      Rationalized Rob's previous checkin with mine to remove the additional dependencies that I had made to the now defunct IpHeader
    • Comment much of PacketSensor.cc while looking for bugs. · d0f8197a
      Robert Ricci authored
      Add a lot of additional debugging output.
      Fix incorrect TCP payload size calculation - assumed Ethernet,
      ignored IP and TCP option headers.
      Replace weird nonstandard IpHeader structure with 'struct ip' from
      netinet/ip.h.
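For reference, a payload-size calculation that accounts for IP and TCP options can be sketched as follows. This is a Python illustration of the general technique, not the code in PacketSensor.cc; `tcp_payload_size` is a hypothetical name, and the input is assumed to be an IPv4 packet with any link-layer header already stripped.

```python
import struct


def tcp_payload_size(ip_packet: bytes) -> int:
    """Return the TCP payload length of an IPv4 packet (no link-layer header)."""
    version_ihl = ip_packet[0]
    if version_ihl >> 4 != 4:
        raise ValueError("not an IPv4 packet")
    ip_header_len = (version_ihl & 0x0F) * 4             # IHL counts 32-bit words
    total_len = struct.unpack("!H", ip_packet[2:4])[0]   # IP total length field
    if ip_packet[9] != 6:                                # protocol 6 is TCP
        raise ValueError("not a TCP segment")
    # Data offset is the high nibble of byte 12 of the TCP header, in 32-bit words.
    tcp_header_len = (ip_packet[ip_header_len + 12] >> 4) * 4
    return total_len - ip_header_len - tcp_header_len
```

Because both header lengths are read from the packet itself, options in either header are subtracted correctly instead of assuming fixed 20-byte headers.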
    • Added sensor replay. It seems to be working perfectly. A replay is... · afa661e8
      Jonathon Duerig authored
      Added sensor replay. It seems to be working perfectly. A replay is automatically saved after every run in plab-n/local/logs/stub.replay. You can get a replay by running a command similar to: sudo ./magent --replay-load=/proj/tbres/exp/pelab-generated/logs/plab-1/local/logs/stub.replay
    • If no agents are specified (and thus no UID given), call "tmcc creator" · a62bd59b
      Mike Hibler authored
      to get a UID to use.
  2. 16 Aug, 2006 2 commits
    • Mike Hibler's avatar
    • - Added tbreport database schema (added three tables), storage for · 9c5d3308
      Kevin Atkinson authored
        tbreport errors & context.
      - Modified fatal() in swapexp, batchexp, and tbprerun, and die_noretry()
        in os_setup to pass hash parameter to tblog functions.
      - Added tbreport error & context information for select errors in
        swapexp, tbswap, assign_wrapper2, snmpit_lib, snmpit, batchexp,
        assign_wrapper, os_setup, parse-ns, & tbprerun.
      - Added assign error parser in assign_wrapper2.
      - Added parse.tcl error parser in parse-ns.
      - Added severity constants for tbreport in libtblog_simple.
      - Added tbreport() function & context table mapping for reporting
        discrete error types to libtblog.
  3. 15 Aug, 2006 7 commits
  4. 14 Aug, 2006 7 commits
    • Mike reported a problem with duplicating experiments. I've seen it too. · 02042929
      Russ Fish authored
      It only happens with old experiments with no archive yet.
      There's a missing code path getting the ns file in CopyInArchive().
      Get the old experiment nsfile from the db in that case.
    • Add new report context tables. · 63c0bdc7
      Leigh B. Stoller authored
    • · 07dda0d8
      Kevin Atkinson authored
      Prep for Mike Kasick report code.  Updated database schema and
      installed hooks for his code.
      Cleaned up how errors were handled in tblog(...).
      Allow SENDMAIL to be called before the path is untainted in '-T' scripts.
      Other small changes.
    • commit.log · 09d78ac6
      Kevin Atkinson authored
    • Checkpoint my dynamic event stuff, crude as it is. The idea for this first · 9d021a07
      Leigh B. Stoller authored
      draft is that the user will, at the end of an experiment run, log into one
      of his nodes and perform some analysis which is intended to be repeated at
      the end of the next run, and in future instantiations of the template.
      A new table called experiment_template_events holds the dynamic events for
      the template. Right now I am supporting just program events, but it will be
      easy to support arbitrary events later. As an absurd example:
      	node6> /usr/local/bin/template_analyze ~/data_analyze arg arg ...
      The user is currently responsible for making sure the output goes into a
      file in the archive. I plan to make the template_analyze wrapper handle
      that automatically later, but for now what you really want is to invoke a
      script that encapsulates that, redirecting output to $ARCHIVE (this
      variable is installed in the environment by template_analyze).
      The wrapper script will save the current time, and then run the program.
      If the program terminates with a zero exit status, it will ssh over to ops
      and invoke an xmlrpc routine to tell boss to add a program event to both
      the eventlist for the current instance, and to the template_eventlist for
      future instances. The time of the event is the relative start time that was
      saved above (remember, each experiment run replays the event stream from
      time zero).
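The wrapper flow described above can be sketched roughly like this. It is a Python illustration only: `run_and_record` is a hypothetical name, and `record_event` is a stand-in callback for the ssh-to-ops plus xmlrpc step, whose real interface is not given here. The `run` and `clock` parameters exist only so the sketch can be exercised without spawning a process.

```python
import subprocess
import time


def run_and_record(argv, run_start, record_event,
                   run=subprocess.call, clock=time.time):
    """Run `argv`; on exit status 0, report it as a program event.

    `run_start` is when the current experiment run began (epoch seconds).
    `record_event` is a hypothetical callback standing in for the
    ssh-to-ops plus XML-RPC step; its signature is an assumption.
    """
    relative_start = clock() - run_start   # saved before the program runs
    status = run(argv)                     # blocks until the program exits
    if status == 0:
        # Only successful runs become events for this and future instances.
        record_event(command=argv, when=relative_start)
    return status
```

The relative start time is captured before the program runs, so a replay in a later run fires the event at the same offset from time zero.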
      For the future, we want to allow this to be done on ops as well, but
      that will take more infrastructure, to run "program agents" on ops.
      It would be nice to install the ssl xmlrpc client side on our images so
      that we do not have to ssh to ops to invoke the client.
    • Mike Hibler's avatar
    • Change for templates. A new experiment run will cause the program · 0607b3b4
      Leigh B. Stoller authored
      agent to exit. rc.progagent now loops, restarting the program agent,
      but first getting new copies of the agent list and the environment
      from tmcd.
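The restart loop might look something like the sketch below. It is Python rather than the actual rc.progagent script, and `supervise`, `fetch_config`, `run_agent`, and `should_stop` are all hypothetical names; `should_stop` in particular is an assumption added so the loop can terminate in a test.

```python
def supervise(fetch_config, run_agent, should_stop):
    """Restart the agent until `should_stop()` is true; return the number of starts."""
    starts = 0
    while not should_stop():
        # Fetch a fresh agent list and environment before every start.
        agents, env = fetch_config()
        # Blocks until the agent exits, e.g. because a new run began.
        run_agent(agents, env)
        starts += 1
    return starts
```

Fetching the configuration inside the loop is the point: each restart picks up whatever the new run changed.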
      Note that this conflicts slightly with the pa-wrapper used on plab
      nodes, which also loops. I think we can just get rid of pa-wrapper
      now, along with a slight change to rc.progagent. I'm gonna let Kirk
      comment on this.
      Need new images ...
  5. 11 Aug, 2006 19 commits