1. 02 Feb, 2011 2 commits
  2. 17 Nov, 2010 2 commits
  3. 12 Nov, 2010 1 commit
  4. 10 Nov, 2010 1 commit
  5. 09 Nov, 2010 1 commit
  6. 29 Sep, 2010 2 commits
    • Mike Hibler's avatar
      5ae92284
    • Handle a common failure on the node reload path. · 4dc57d48
      Mike Hibler authored
      Under load, nodes that have just entered reloading and have just rebooted
      might fail to get bootinfo.  The default behavior in this case is for the
      node to boot from disk (dubious, but that is the topic for another day).
      This causes the node to fall off the RELOAD path, winding up in either
      TBFAILED or ISUP.  Worse, if the node makes it to ISUP, its reload state
      is cleared and even if the reload_daemon reboots the node, it will still
      not go through the reloading process.
      
      The result is a bunch of nodes left in reloading.  Now if a node makes an
      invalid transition to TBFAILED or ISUP while in the RELOAD state machine,
      it fires the new REBOOT trigger which does...well, you figure it out.
      Note that in the ISUP case, this trigger overrides the default that would
      otherwise clear the reload state--so reboot is sufficient to get the machine
      back on the RELOAD track.
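A minimal sketch of the recovery rule described above, written in Python for illustration only (stated itself is not written in Python). The function name, the `NORMAL` mode in the usage note, and the idea of passing in the default trigger list are assumptions, not the actual stated code:

```python
# Hypothetical sketch: choose triggers when a node enters a new state.
RELOAD_MODE = "RELOAD"
# States that are invalid to enter while on the RELOAD path (per the commit message).
INVALID_RELOAD_STATES = {"TBFAILED", "ISUP"}


def triggers_for_transition(op_mode, new_state, default_triggers):
    """Return the triggers to fire for a node entering new_state in op_mode.

    If the node falls off the RELOAD path, the REBOOT trigger replaces the
    defaults, so a default that would clear the reload state never runs.
    """
    if op_mode == RELOAD_MODE and new_state in INVALID_RELOAD_STATES:
        return ["REBOOT"]
    return list(default_triggers)
```

For example, `triggers_for_transition("RELOAD", "ISUP", ["RESET"])` yields only `REBOOT`, while the same state in a hypothetical `NORMAL` mode keeps its default triggers.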
  7. 28 May, 2010 1 commit
  8. 26 May, 2010 1 commit
  9. 25 May, 2010 1 commit
  10. 20 May, 2010 2 commits
    • Add a new timeout action: STATE · 159552a3
      Robert P Ricci authored
      Allows the state_timeouts table to contain a new type of action
      to take on timeout: STATE:newstate.  This will force stated to
      transition the node to newstate, and take any trigger actions
      associated with that state.
      
      We will use this to make timeouts in the secure reload path
      force the node into the SECVIOLATION state.
      
      Not yet tested.
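As an illustration of the `STATE:newstate` action format, here is a small Python sketch of how such an action string could be split into a kind and an argument. The function name and the treatment of other action kinds (such as the `REBOOT` used in the usage note) are assumptions; the real parsing lives in stated:

```python
def parse_timeout_action(action):
    """Split a timeout action string into (kind, argument).

    "STATE:SECVIOLATION" -> ("STATE", "SECVIOLATION")
    Actions without an argument return None as the argument.
    """
    kind, _, arg = action.partition(":")
    kind = kind.strip().upper()
    arg = arg.strip()
    if kind == "STATE" and not arg:
        # A forced transition is meaningless without a target state.
        raise ValueError("STATE action requires a target state")
    return (kind, arg or None)
```

A caller would then force the node into the returned state and run whatever triggers are attached to it, for example `SECVIOLATION` on the secure reload path.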
    • Add two new triggers for secure boot · 9f5a312f
      Robert P Ricci authored
      Add 'POWEROFF' and 'EMAILNOTIFY' state triggers - the idea is that
      these will be used as triggers when a node enters the 'SECVIOLATION'
      state in the secure reload path, to turn off the node and send
      testbed-ops mail about it.
      
      Not yet tested.
  11. 05 Jan, 2010 2 commits
  12. 04 Aug, 2009 1 commit
    • Implement frontend and middleend support for loading multiple images · e7871305
      Kevin Atkinson authored
      at once with Frisbee (excludes the actual MFS changes).
      
      os_load now takes a list of comma-separated image names for the
      "-i" and "-m" options.  The default OS is the OS for the last image
      specified in the list.  I also changed the "-p" option of os_load to
      search both the specified project and emulab-ops for the image, rather
      than just the specified project, in order to simplify specifying
      multiple images (and because I personally found that behavior annoying
      when using os_load).
      
      I modified the current_reloads table to be able to specify more than one
      image for a node by adding an "idx" column which controls the order of
      the reloads.  I also added a "prepare" column to the table (explained
      below).
      
      I modified tmcd to basically loop over the entries in the table and
      create a multiline LOADINFO response, and modified rc.frisbee to
      handle the multiline response and load each image in turn.
      
      I modified os_load to take a new option "-P" which will tell rc.frisbee
      to zap the superblocks even if a whole disk image is not specified.
      To do this I set the prepare entry for the first image in the
      current_reloads table to true.  tmcd then passes this in to
      rc.frisbee in the LOADINFO line.  When rc.frisbee sees this it will
      make sure to zap the superblock before loading that image.
      
      To support having multiple images as the default, "default_imageid"
      can now be a comma separated list.  I implemented a hack to be able to
      set multiple imageids via editnodetype.php3.  Basically the form
      splits default_imageid into default_imageid_0, default_imageid_1, etc
      and then adds an empty default_imageid_# slot to allow adding an
      imageid.  Multiple images can be added by adding one image, then
      submitting the form, and then adding another into the empty slot.  Not
      the best, but I don't think this will be a very common operation.
      When the form is submitted it will then combine all default_imageid_#
      into a comma separated list ignoring any that are deleted or set to
      "No ImageID" (i.e., 0).
      
      Everything will work fine with old MFSs as long as only one image is
      loaded.  If multiple images are loaded with an old MFS, an email will
      be sent to testbed-ops.  This works by having tmcd detect old MFSs by
      using the version number and setting the state to RELOADOLDMFS.  stated
      will pick up on that and send the email to testbed-ops via a trigger.
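The default_imageid form handling described in this commit (split into default_imageid_0, default_imageid_1, ... and recombined on submit) can be sketched roughly as follows. This is a Python illustration only; the actual code is in editnodetype.php3, and the function name and the dict-shaped form are assumptions:

```python
import re


def combine_default_imageids(form):
    """Combine default_imageid_N form fields into a comma separated list.

    Slots that are empty (deleted) or set to "0" ("No ImageID") are skipped.
    Remaining values are joined in order of their numeric slot index.
    """
    slots = []
    for key, value in form.items():
        match = re.fullmatch(r"default_imageid_(\d+)", key)
        if match is None:
            continue  # not an imageid slot
        value = str(value).strip()
        if value in ("", "0"):
            continue  # deleted slot or "No ImageID"
        slots.append((int(match.group(1)), value))
    return ",".join(value for _, value in sorted(slots))
```

Sorting on the numeric index rather than on the key string keeps slot 10 after slot 9, which matters because the order of the list determines the order of the reloads and which image supplies the default OS.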
  13. 06 Mar, 2007 1 commit
  14. 30 Aug, 2006 1 commit
  15. 26 Apr, 2006 1 commit
  16. 25 Apr, 2006 1 commit
  17. 07 Feb, 2006 1 commit
  18. 01 Dec, 2005 1 commit
  19. 19 Aug, 2005 1 commit
  20. 18 Aug, 2005 1 commit
  21. 12 Jan, 2005 1 commit
  22. 19 Aug, 2004 1 commit
  23. 18 Aug, 2004 1 commit
    • Minor extension to stated. Add a trigger mechanism for invoking an · 6cf3e936
      Leigh B. Stoller authored
      "arbitrary" script as defined in the stated_triggers table. Currently
      using this to invoke the new opsreboot script whenever ISUP comes in
      from ops.
      
      The opsreboot script is currently a skeleton. All it does is send
      email.  I'll add the rest later (which really won't be much at first;
      just getting the event schedulers started).
  24. 22 Jul, 2004 1 commit
  25. 11 Feb, 2004 1 commit
  26. 15 Jan, 2004 1 commit
    • libdb changes: · 78ad260c
      Mac Newbold authored
      - add functions to recursively dump hashes and arrays into a string
        suitable for printing as debugging output (great for data structures)
      
      - add three new trigger strings
      
      - add 'use strict', do corresponding cleanup
      
      stated changes:
      
      - move special-cased stuff in handleEvent for PXEBOOTING and BOOTING into
        triggers (PXEBOOTING, BOOTING, and CHECKGENISUP)
      
      - clarify (via comments) the existing kinds of triggers and which ones run
        when, and add a new kind (global "any-mode" triggers). We already had
        per-node mode-specific, per-node any-mode, and global mode-specific
        triggers. Now you can have a trigger that is good for any mode in a
        given state, that can be overridden on a mode-specific basis. This is
        great for PXEBOOTING, BOOTING, and ISUP, since they each have a trigger
        list that should be run regardless of what mode you're in. Now they only
        require 3 entries instead of 3*N that have to be maintained per mode.
      
           # A note about triggers:
           #
           # "per-node" triggers only affect their specific node in a
           # particular mode/state, and are run first of all. "global"
           # triggers are triggers for a given mode/state that affect all
           # nodes, and are run after any per-node triggers. "Any-mode"
           # triggers are tied to a state, and occur in that state in any
           # mode. The any-mode triggers are over-ridden by global triggers,
           # and if an "Any-mode" trigger for state XYZ exists as well as a
           # global trigger for mode FOOBAR state XYZ, then when I arrive in
           # XYZ any per-node triggers will be run. Then, if I'm in mode
           # FOOBAR, only the global trigger will run. If I'm in any other
           # mode, only the any-mode trigger will run.
      
           # (our "*" is stored as $TBANYMODE)
           # Per-node triggers have a specific node_id
           # Global triggers have "*" as the node_id
           # Any-mode triggers have "*" as the mode, and can be global or per-node
      
        The updated table looks like this in the accompanying change to
        database-fill.sql:
      
      +---------+----------+------------+-----------------------+
      | node_id | op_mode  | state      | trigger               |
      +---------+----------+------------+-----------------------+
      | *       | *        | BOOTING    | BOOTING, CHECKGENISUP |
      | *       | *        | ISUP       | RESET                 |
      | *       | *        | PXEBOOTING | PXEBOOT               |
      | *       | RELOAD   | RELOADDONE | RESET, RELOADDONE     |
      | *       | ALWAYSUP | SHUTDOWN   | ISUP                  |
      +---------+----------+------------+-----------------------+
      
      - I also cleaned up the functions that add, get, and delete triggers.
        Before, the get function didn't include global triggers. Now it does,
        and has an option to just get the per-node triggers. Add and delete are
        still just per-node, of course.
      
      - Also found and fixed some little bugs while I was in there. (global
        triggers not taking a list,
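The trigger precedence described in the note above can be sketched as a lookup over the table. This is a Python illustration under assumptions: the dict layout, the function names, and the choice to apply the same "mode-specific overrides any-mode" rule to per-node entries are mine, not taken from stated:

```python
ANY = "*"  # stands in for both the global node_id and $TBANYMODE

# The global rows from the table in this commit message.
TABLE = {
    (ANY, ANY, "BOOTING"): ["BOOTING", "CHECKGENISUP"],
    (ANY, ANY, "ISUP"): ["RESET"],
    (ANY, ANY, "PXEBOOTING"): ["PXEBOOT"],
    (ANY, "RELOAD", "RELOADDONE"): ["RESET", "RELOADDONE"],
    (ANY, "ALWAYSUP", "SHUTDOWN"): ["ISUP"],
}


def lookup(table, node_id, op_mode, state):
    """Return the mode-specific entry if present, else the any-mode entry."""
    specific = table.get((node_id, op_mode, state))
    if specific is not None:
        return list(specific)
    return list(table.get((node_id, ANY, state), []))


def triggers_to_run(table, node_id, op_mode, state):
    """Per-node triggers run first, followed by the global triggers."""
    return lookup(table, node_id, op_mode, state) + lookup(table, ANY, op_mode, state)
```

With this table, a node arriving in ISUP in any mode gets `RESET`; adding a global row for a specific mode and ISUP would replace `RESET` for that mode only.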
      
      These changes are me getting ready to re-add all the changes I made months
      ago in order to do a before-and-after experiment for my thesis. Between
      now and the end of next week I'll be working on taking before numbers,
      patching stated with the changes, and getting after numbers.
      
      The problems I'm trying to replicate are the problems and slowdowns we
      used to get when os_{load,setup} would reboot a node, thinking it had
      timed out, when it really didn't know whether it was making progress or
      not. The fix includes making os_{load,setup} depend on stated to watch for
      progress and timeouts, and do any appropriate retries. Part of that is the
      StateWait stuff, that lets programs watch for events easily, and the
      node_reboot-with-events stuff that puts stated in control of nodes as they
      reboot.
  27. 12 Jan, 2004 1 commit
    • Hmm, this file dropped from previous commit. Added support for · 5378d87c
      Leigh B. Stoller authored
      handling PXEWAKEUP timeouts, retrying 3 times and then forcing a power
      cycle.  Changed BOOTING event action to auto switch in and out of the
      special PXEKERNEL state machine that all local nodes use since all
      local nodes boot the same pxeboot kernel and talk to bootinfo (as
      directed to by dhcp).
  28. 07 Jan, 2004 1 commit
    • A set of debugging changes to allow running multiple stateds. This is · cf61f6f3
      Leigh B. Stoller authored
      probably imperfect, but better than nothing. New option, "-t tag"
      allows you to specify an arbitrary tag to match against the stated_tag
      of the nodes table. The stated invocation will only operate on nodes
      that match the tag, ignoring all events for other nodes. If
      unspecified, stated will operate on all nodes with a NULL tag. This is
      set up at the beginning of time (or during a reload) by saving the
      per-node tag in the $nodes hash. Each time an event arrives, stated
      checks the tag in the table and ignores the event if it does not match.
      
      On a signaled reload(), stated must also be careful to throw away timeouts
      from the queue (and be careful not to set up new timeouts for ignored
      nodes).  So, this allows you to set the tag for a node in the DB, and
      then HUP stated so that it reloads its tables. That node will now be
      ignored by that stated.
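The tag matching described above amounts to a simple filter applied to incoming events and to the timeout queue after a reload. A Python sketch, assuming a dict from node_id to stated_tag (with None standing in for NULL) and a queue of (deadline, node_id) pairs; both shapes and both function names are hypothetical:

```python
def should_handle(node_id, node_tags, my_tag=None):
    """True if this stated instance is responsible for node_id.

    node_tags maps node_id -> stated_tag (None models a NULL tag).
    A stated started without -t has my_tag None and handles untagged nodes.
    """
    if node_id not in node_tags:
        return False
    return node_tags[node_id] == my_tag


def prune_timeouts(queue, node_tags, my_tag=None):
    """After a reload, drop queued timeouts for nodes we now ignore."""
    return [(deadline, node) for deadline, node in queue
            if should_handle(node, node_tags, my_tag)]
```

After tagging pc5 with `lbs` and sending HUP, the main stated would drop pc5's pending timeouts, while a stated run with `-t lbs` would handle pc5 alone.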
      
      Also made some changes to debug mode. In debug mode, don't worry about
      the pidfile or the lockfile or checking for other running stated
      (which causes my debug version to exit right away!). Also, added a new
      -l option to turn off syslog output and just send it all to stdout with
      the debug output. -l can only be used with -d of course.
      
      So what can I do with all this:
      
      	update nodes set stated_tag='lbs' where node_id='pc5';
      	sudo kill -HUP `cat /var/run/stated.pid`
      	sudo stated -d -l -t lbs
      
      Which tells the main stated to ignore pc5. Then I run a debugging
      stated that operates only on pc5. Later when done:
      
      	update nodes set stated_tag=NULL where node_id='pc5';
      	sudo kill -HUP `cat /var/run/stated.pid`
      
      Which tells the main stated to operate on pc5 again.
  29. 15 Oct, 2003 1 commit
    • Uniform syslog'ing. Change everything I could find to use a syslog facility · cc6d6fa7
      Mike Hibler authored
      as defined in the defs-* file (e.g. "TBLOGFACIL=local2").  The default is
      "local5" which is what we are set up to use so you shouldn't need to mess
      with your defs- file!
      
      Perl scripts just get this value configured in when configure is run.
      C programs get the value in two ways.  For programs that are intimate with
      the testbed infrastructure, and include "config.h", they just get it from
      that file.  For programs that we sometimes use outside the Emulab build
      environment (e.g., frisbee, capture) and that don't include config.h,
      the value is set via a "-DLOG_TESTBED=..." in the GNUmakefile build line.
      If the value isn't set, it defaults to what it used to be (usually LOG_USER).
      
      Still to do: healthd, hmcd (whose build doesn't seem to be completely
      integrated) and plabdaemon.in (since it's icky Python :-)
  30. 13 Oct, 2003 1 commit
  31. 10 Oct, 2003 1 commit
    • New StateWait changes - the main point of all this is to move to our new · 2b2a306d
      Mac Newbold authored
      model of waiting for state changes. Before we were watching the database
      (which means we can only watch for terminal/stable/long-lived states, and
      have to poll the db). Now things that are waiting for states to change
      become event listeners, and watch the stream of events flow by, and don't
      have to do any polling. They can now watch for any state, and even
      sequences of states (i.e., a SHUTDOWN followed by an ISUP).
      
      To do this, there is now a cool StateWait.pm library that encapsulates the
      functionality needed. To use it, you call initStateWait before you start
      the chain of events (i.e., before you call node_reboot). Then do your stuff,
      and call waitForState() when you're ready to wait. It can be told to
      return periodically with the results so far, and you can cancel waiting
      for things. An example program called waitForState is in
      testbed/event/stated/ , and can also be used nicely as a command line tool
      that wraps up the library functionality.
      
      This also required the introduction of a TBFAILED event that can be sent
      when a node isn't going to make it to the state that someone may be
      waiting for. I.e., if it gets wedged coming up, and stated retries, but
      eventually gives up on it, it sends this to let things know that the node
      is hosed and won't ever come up.
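The idea of watching the event stream for a sequence of states, with TBFAILED ending the wait early, can be sketched as below. This is a Python illustration and not the StateWait.pm interface; the class and method names are hypothetical:

```python
class SequenceWaiter:
    """Track per-node progress through an expected sequence of states."""

    def __init__(self, expected):
        # expected maps node_id -> list of states to see, in order.
        self.remaining = {n: list(s) for n, s in expected.items() if s}
        self.done = [n for n, s in expected.items() if not s]
        self.failed = []

    def on_event(self, node, state):
        """Feed one (node, state) event from the stream."""
        seq = self.remaining.get(node)
        if seq is None:
            return  # not waiting on this node (or already finished)
        if state == "TBFAILED":
            # The node will never reach the awaited state; stop waiting.
            del self.remaining[node]
            self.failed.append(node)
        elif state == seq[0]:
            seq.pop(0)
            if not seq:
                del self.remaining[node]
                self.done.append(node)

    def finished(self):
        return not self.remaining
```

Because the waiter is created before the reboot is requested, it cannot miss a short-lived state such as SHUTDOWN, which is what polling the database could not guarantee.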
      
      Another thing that is part of this is that node_reboot moves (back) to the
      fully-event-driven model, where users call node_reboot, and it does some
      checks and sends some events. Then stated calls node_reboot in "real mode"
      to actually do the work, and handles doing the appropriate retries until
      the node either comes up or is deemed "failed" and stated gives up on it.
      This means stated is also the gatekeeper of when you can and cannot reboot
      a node. (See mail archives for extensive discussions of the details.)
      
      A big part of the motivation for this was to get uninformed timeouts and
      retries out of os_load/os_setup and put them in stated where we can make a
      wiser choice. So os_load and os_setup now use this new stuff and don't
      have to worry about timing out on nodes and rebooting. Stated makes sure
      that they either come up, get retried, or fail to boot. tbrestart also
      underwent a similar change.
  32. 21 Aug, 2003 1 commit
  33. 19 Jun, 2003 1 commit
    • The new and fully functional rebooting-via-events stuff and the · 1daaa992
      Mac Newbold authored
      really-reboot-nodes-that-timeout stuff.
      
      NOTE: Until the timeout/retry stuff is gone from os_load/os_setup, it is
      disabled in stated. It will still only send email. But all the stuff is
      there and has been tested.
      
      NOTE: Until other things don't depend on the old behavior of node_reboot
      (when it returns, all nodes are in SHUTDOWN), the event stuff is disabled.
      Real mode is the default, and can be run by anyone.
      
      In short, this commit is new versions of stated and node_reboot that act
      almost exactly like the old ones. But I wanted to commit them before I go
      on making a bunch more changes, to have a checkpoint that I know works.
  34. 09 Jun, 2003 1 commit
  35. 06 Jun, 2003 1 commit
    • First batch of changes for adding TBCOMMAND events. Currently, here's what · 71b82cc4
      Mac Newbold authored
      is supported:
      
      - stated listens for TBCOMMAND events, and currently handles REBOOT,
        POWEROFF, POWERON, and POWERCYCLE events. It does everything except make
        the actual calls to node_reboot and power. And it accepts batches of
        nodes instead of just single ones.
      
      - Timeouts were added to the db for these commands, with no timeout for
        the power ones (since the node can't hang during those), and a 15 second
        timeout from reboot until the SHUTDOWN state.
      
      - If a reboot times out, it tries it again, up to 3 times. If it gets to
        three times without working, it sends mail to tbops and turns the
        machine off instead of continuing to reboot it. Right now I haven't
        made it do node_reboot -f or power cycle on retries, but it easily
        could.
      
      - Stuff to be done before they work: make node_reboot send an event
        instead of doing the work, and make a new script that has node_reboot's
        old guts. Note that this requires authentication in our events for these
        commands, and a way to make sure that the command that came in as an
        event was properly authenticated.
      
      - For future growth and expansion, it is set up so it should be relatively
        easy to add other commands that do different things, even if they take
        arbitrary params that aren't nodes or lists of nodes.
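The reboot retry policy in the list above can be sketched as a small decision function. This is a Python illustration under assumptions: the function and action names are hypothetical, and counting the initial reboot as the first of the three tries is my reading of the message:

```python
MAX_TRIES = 3  # "up to 3 times", per the commit message


def on_reboot_timeout(tries_so_far):
    """Decide what to do when a reboot has not reached SHUTDOWN in time.

    tries_so_far counts reboot attempts already made for this node.
    Returns (action, new_try_count).
    """
    if tries_so_far < MAX_TRIES:
        return ("REBOOT", tries_so_far + 1)
    # Out of tries: stop rebooting, power the node off, and send mail.
    return ("POWEROFF_AND_MAIL", tries_so_far)
```

Keeping the count with the node rather than in the caller is what lets the retries live in stated instead of in each script that requests a reboot.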