1. 10 May, 2011 1 commit
    • Leigh B Stoller authored · 84a6e9fe
      Gack, must call "select STDOUT" after the reopen operation, since we
      used "select STDERR" to change the line buffering. The result was that
      after the log roll, the child was printing to STDERR instead of
      STDOUT, and so the parent never saw any new events.
      
      Note that USR1 (re-exec binary) does not work since exec bypasses the
      END block, and things get messed up. Not fixed yet.
  2. 13 Mar, 2011 1 commit
  3. 25 Feb, 2011 1 commit
    • Mike Hibler authored · 85d8986c
      Fix some nagging bugs.
      We were not processing the timeout queue because we got stuck forever in
      the loop that processed events. Now before looping back to sysread, make
      sure there is something to read so we don't block.
      
      When we start up or re-read the DB state, ignore really old state timeout
      values; e.g., for nodes that have been dead for ages but happen to be in
      a state such as SHUTDOWN that has a timeout.
      
      In the main loop, handle any re-read of the DB state before testing the
      queue length to see if we can do a blocking poll. Re-reading the state may
      add timeouts to the queue.
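A rough Python sketch of the two ideas in that commit: only read events while data is actually waiting, and skip ancient timeouts when rebuilding the queue. This is not the stated code (which is Perl); the function names, the row layout, and the one-day staleness cutoff are all assumptions made for illustration.

```python
import heapq
import os
import select

MAX_STALE_SECONDS = 24 * 3600  # hypothetical cutoff for "really old" timeouts


def readable(fd):
    """True if a read on fd would not block (zero-timeout select)."""
    ready, _, _ = select.select([fd], [], [], 0)
    return bool(ready)


def drain_events(fd, handle, chunk=4096):
    """Read and handle events, but only while data is actually waiting."""
    while readable(fd):
        data = os.read(fd, chunk)
        if not data:  # EOF: the writer went away
            break
        handle(data)


def build_timeout_queue(rows, now, max_stale=MAX_STALE_SECONDS):
    """rows: (node_id, state, entered_at, timeout_seconds) tuples (assumed layout)."""
    queue = []
    for node_id, state, entered_at, timeout in rows:
        deadline = entered_at + timeout
        if now - deadline > max_stale:
            continue  # expired ages ago; do not resurrect it
        heapq.heappush(queue, (deadline, node_id, state))
    return queue


def poll_timeout(queue, now):
    """None means a blocking poll is fine; else seconds to the next deadline."""
    if not queue:
        return None
    return max(0, queue[0][0] - now)
```

The main loop would rebuild the queue first and only then call `poll_timeout`, since a re-read may add deadlines that rule out a blocking poll.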
  4. 24 Feb, 2011 1 commit
  5. 04 Feb, 2011 1 commit
  6. 02 Feb, 2011 2 commits
  7. 01 Feb, 2011 1 commit
  8. 25 Jan, 2011 1 commit
  9. 24 Jan, 2011 1 commit
  10. 17 Nov, 2010 2 commits
  11. 12 Nov, 2010 1 commit
  12. 10 Nov, 2010 1 commit
  13. 09 Nov, 2010 1 commit
  14. 29 Sep, 2010 2 commits
    • Mike Hibler authored · 5ae92284
    • Mike Hibler authored · 4dc57d48
      Handle a common failure on the node reload path.
      Under load, nodes that have just entered reloading and have just rebooted
      might fail to get bootinfo.  The default behavior in this case is for the
      node to boot from disk (dubious, but that is the topic for another day).
      This causes the node to fall off the RELOAD path, winding up in either
      TBFAILED or ISUP.  Worse, if the node makes it to ISUP, its reload state
      is cleared and even if the reload_daemon reboots the node, it will still
      not go through the reloading process.
      
      The result is a bunch of nodes left in reloading.  Now if a node makes an
      invalid transition to TBFAILED or ISUP while in the RELOAD state machine,
      it fires the new REBOOT trigger which does...well, you figure it out.
      Note that in the ISUP case, this trigger overrides the default that would
      otherwise clear the reload state--so reboot is sufficient to get the machine
      back on the RELOAD track.
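The rule in that commit can be sketched roughly as follows. This is an illustrative Python sketch, not the stated implementation; the function name and the `is_valid` flag are hypothetical, and the caller is assumed to have already decided whether the transition is valid in the RELOAD state machine.

```python
RELOAD_MODE = "RELOAD"
OFF_PATH_STATES = {"TBFAILED", "ISUP"}  # states named in the commit message


def triggers_on_transition(op_mode, new_state, is_valid, default_triggers):
    """Pick the triggers to fire when a node enters new_state.

    An invalid move to TBFAILED or ISUP while in RELOAD fires REBOOT
    *instead of* the defaults, so the default that would clear the
    reload state never runs.
    """
    if op_mode == RELOAD_MODE and not is_valid and new_state in OFF_PATH_STATES:
        return ["REBOOT"]
    return list(default_triggers)
```

Replacing the defaults rather than appending to them is the point: the reload state survives, so a plain reboot puts the node back on the RELOAD track.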
  15. 28 May, 2010 1 commit
  16. 26 May, 2010 1 commit
  17. 25 May, 2010 1 commit
  18. 20 May, 2010 2 commits
    • Robert P Ricci authored · 159552a3
      Add a new timeout action: STATE
      Allows the state_timeouts table to contain a new type of action
      to take on timeout: STATE:newstate. This will force stated to
      transition the node to newstate, and take any trigger actions
      associated with that state.
      
      We will use this to make timeouts in the secure reload path
      force the node into the SECVIOLATION state.
      
      Not yet tested.
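A minimal sketch of how a STATE:newstate action might be parsed and applied. This is illustrative Python, not the stated code; the function names, the dict-based node, and the `run_triggers` callback are assumptions.

```python
def parse_timeout_action(action):
    """Split an action string such as 'STATE:SECVIOLATION' into (kind, arg)."""
    kind, _, arg = action.partition(":")
    kind = kind.strip().upper()
    arg = arg.strip()
    if kind == "STATE" and not arg:
        raise ValueError("STATE action needs a target state")
    return (kind, arg or None)


def apply_timeout(node, action, run_triggers):
    """Handle a timeout; returns True if the STATE action was applied.

    node is a dict with a 'state' key (hypothetical representation);
    run_triggers(node, state) fires whatever triggers belong to that state.
    """
    kind, arg = parse_timeout_action(action)
    if kind != "STATE":
        return False  # other action types are handled elsewhere
    node["state"] = arg
    run_triggers(node, arg)
    return True
```

For the secure reload path, the state_timeouts entry would carry `STATE:SECVIOLATION`, and the triggers attached to SECVIOLATION run as part of the forced transition.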
    • Robert P Ricci authored · 9f5a312f
      Add two new triggers for secure boot
      Add 'POWEROFF' and 'EMAILNOTIFY' state triggers - the idea is that
      these will be used as triggers when a node enters the 'SECVIOLATION'
      state in the secure reload path, to turn off the node and send
      testbed-ops mail about it.
      
      Not yet tested.
  19. 05 Jan, 2010 2 commits
  20. 04 Aug, 2009 1 commit
    • Kevin Atkinson authored · e7871305
      Implement frontend and middleend support for loading multiple images
      at once with Frisbee (excludes the actual MFS changes).
      
      os_load now takes a list of comma-separated image names for the
      "-i" and "-m" options.  The default OS is the OS for the last image
      specified in the list.  I also changed the "-p" option of os_load to
      search both the specified project and emulab-ops for the image, rather
      than just the specified project, in order to simplify specifying
      multiple images (and because I personally found that behavior annoying
      when using os_load).
      
      I modified the current_reloads table to be able to specify more than one
      image for a node by adding an "idx" column which controls the order of
      the reloads.  I also added a "prepare" column to the table (explained
      below).
      
      I modified tmcd to basically loop over the entries in the table and
      create a multiline LOADINFO response, and modified rc.frisbee to
      handle the multiline response and load each image in turn.
      
      I modified os_load to take a new option "-P" whi...
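The image-list handling described in that commit might look roughly like this. It is a Python sketch under stated assumptions: the function names, the `(project, image name)` lookup dict, and the row layout are hypothetical, and only the "idx" column, the comma-separated list, the last-image default OS, and the emulab-ops fallback come from the commit message.

```python
FALLBACK_PROJECT = "emulab-ops"


def parse_image_list(arg):
    """Split a comma-separated image list, keeping the order given."""
    names = [name.strip() for name in arg.split(",") if name.strip()]
    if not names:
        raise ValueError("no image names given")
    return names


def find_image(name, project, images):
    """Look in the given project first, then in emulab-ops."""
    for pid in (project, FALLBACK_PROJECT):
        if (pid, name) in images:
            return images[(pid, name)]
    raise LookupError(f"image {name!r} not found")


def plan_reload(arg, project, images):
    """Return (rows, default_os); each row's idx gives the load order."""
    resolved = [find_image(n, project, images) for n in parse_image_list(arg)]
    rows = [{"idx": i, "image": img["name"]} for i, img in enumerate(resolved)]
    return rows, resolved[-1]["os"]  # default OS comes from the last image
```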
  21. 06 Mar, 2007 1 commit
  22. 30 Aug, 2006 1 commit
  23. 26 Apr, 2006 1 commit
  24. 25 Apr, 2006 1 commit
  25. 07 Feb, 2006 1 commit
  26. 01 Dec, 2005 1 commit
  27. 19 Aug, 2005 1 commit
  28. 18 Aug, 2005 1 commit
  29. 12 Jan, 2005 1 commit
  30. 19 Aug, 2004 1 commit
  31. 18 Aug, 2004 1 commit
    • Leigh B. Stoller authored · 6cf3e936
      Minor extension to stated. Add a trigger mechanism for invoking an
      "arbitrary" script as defined in the stated_triggers table. Currently
      using this to invoke the new opsreboot script whenever ISUP comes in
      from ops.
      
      The opsreboot script is currently a skeleton. All it does is send
      email.  I'll add the rest later (which really won't be much at first;
      just getting the event schedulers started).
  32. 22 Jul, 2004 1 commit
  33. 11 Feb, 2004 1 commit
  34. 15 Jan, 2004 1 commit
    • Mac Newbold authored · 78ad260c
      libdb changes:
      - add functions to recursively dump hashes and arrays into a string
        suitable for printing as debugging output (great for data structures)
      
      - add three new trigger strings
      
      - add 'use strict', do corresponding cleanup
      
      stated changes:
      
      - move special-cased stuff in handleEvent for PXEBOOTING and BOOTING into
        triggers (PXEBOOTING, BOOTING, and CHECKGENISUP)
      
      - clarify (via comments) the existing kinds of triggers and which ones run
        when, and add a new kind (global "any-mode" triggers). We already had
        per-node mode-specific, per-node any-mode, and global mode-specific
        triggers. Now you can have a trigger that is good for any mode in a
        given state, that can be overridden on a mode-specific basis. This is
        great for PXEBOOTING, BOOTING, and ISUP, since they each have a trigger
        list that should be run regardless of what mode you're in. Now they only
        require 3 entries instead of 3*N that have to be maintained per mode.
      
           # A note about triggers:
           #
           # "per-node" triggers only affect their specific node in a
           # particular mode/state, and are run first of all. "global"
           # triggers are triggers for a given mode/state that affect all
           # nodes, and are run after any per-node triggers. "Any-mode"
           # triggers are tied to a state, and occur in that state in any
           # mode. The any-mode triggers are over-ridden by global triggers,
           # and if an "Any-mode" trigger for state XYZ exists as well as a
           # global trigger for mode FOOBAR state XYZ, then when I arrive in
           # XYZ any per-node triggers will be run. Then, if I'm in mode
           # FOOBAR, only the global trigger will run. If I'm in any other
           # mode, only the any-mode trigger will run.
      
           # (our "*" is stored as $TBANYMODE)
           # Per-node triggers have a specific node_id
           # Global triggers have "*" as the node_id
           # Any-mode triggers have "*" as the mode, and can be global or per-node
      
        The updated table looks like this in the accompanying change to
        database-fill.sql:
      
      +---------+----------+------------+-----------------------+
      | node_id | op_mode  | state      | trigger               |
      +---------+----------+------------+-----------------------+
      | *       | *        | BOOTING    | BOOTING, CHECKGENISUP |
      | *       | *        | ISUP       | RESET                 |
      | *       | *        | PXEBOOTING | PXEBOOT               |
      | *       | RELOAD   | RELOADDONE | RESET, RELOADDONE     |
      | *       | ALWAYSUP | SHUTDOWN   | ISUP                  |
      +---------+----------+------------+-----------------------+
      
      - I also cleaned up the functions that add, get, and delete triggers.
        Before, the get function didn't include global triggers. Now it does,
        and has an option to just get the per-node triggers. Add and delete are
        still just per-node, of course.
      
      - Also found and fixed some little bugs while I was in there (for
        example, global triggers not taking a list).
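The trigger precedence from the note and table above can be sketched as a lookup. This is an illustrative Python sketch, not the stated code: the dict layout and function name are hypothetical, and running both kinds of per-node entries (mode-specific, then any-mode) is an assumption about what "any per-node triggers" means.

```python
ANY = "*"  # stands in for $TBANYMODE and for the wildcard node_id

# (node_id, op_mode, state) -> trigger list, mirroring the table above
TRIGGERS = {
    (ANY, ANY, "BOOTING"): ["BOOTING", "CHECKGENISUP"],
    (ANY, ANY, "ISUP"): ["RESET"],
    (ANY, ANY, "PXEBOOTING"): ["PXEBOOT"],
    (ANY, "RELOAD", "RELOADDONE"): ["RESET", "RELOADDONE"],
    (ANY, "ALWAYSUP", "SHUTDOWN"): ["ISUP"],
}


def triggers_for(table, node_id, op_mode, state):
    """Triggers to run when node_id (a real node) arrives in op_mode/state."""
    result = []
    # Per-node triggers run first (assumption: mode-specific, then any-mode).
    for mode in dict.fromkeys((op_mode, ANY)):
        result.extend(table.get((node_id, mode, state), []))
    # Global: a mode-specific entry overrides the any-mode entry.
    global_specific = table.get((ANY, op_mode, state))
    if global_specific is not None:
        result.extend(global_specific)
    else:
        result.extend(table.get((ANY, ANY, state), []))
    return result
```

With this shape, a state such as ISUP needs one any-mode row, and a mode only gets its own row when it wants different behavior.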
      
      These changes are me getting ready to re-add all the changes I made months
      ago in order to do a before-and-after experiment for my thesis. Between
      now and the end of next week I'll be working on taking before numbers,
      patching stated with the changes, and getting after numbers.
      
      The problems I'm trying to replicate are the problems and slowdowns we
      used to get when os_{load,setup} would reboot a node, thinking it had
      timed out, when it really didn't know whether it was making progress or
      not. The fix includes making os_{load,setup} depend on stated to watch for
      progress and timeouts, and do any appropriate retries. Part of that is the
      StateWait stuff, that lets programs watch for events easily, and the
      node_reboot-with-events stuff that puts stated in control of nodes as they
      reboot.
  35. 12 Jan, 2004 1 commit
    • Leigh B. Stoller authored · 5378d87c
      Hmm, this file dropped from previous commit. Added support for
      handling PXEWAKUP timeouts, retrying 3 times and then forcing a power
      cycle.  Changed BOOTING event action to auto switch in and out of the
      special PXEKERNEL state machine that all local nodes use since all
      local nodes boot the same pxeboot kernel and talk to bootinfo (as
      directed to by dhcp).