1. 13 Oct, 2003 1 commit
  2. 10 Oct, 2003 1 commit
    • New StateWait changes - the main point of all this is to move to our new · 2b2a306d
      Mac Newbold authored
      model of waiting for state changes. Before we were watching the database
      (which means we can only watch for terminal/stable/long-lived states, and
      have to poll the db). Now things that are waiting for states to change
      become event listeners, and watch the stream of events flow by, and don't
      have to do any polling. They can now watch for any state, and even
      sequences of states (e.g. a Shutdown followed by an Isup).
      
      To do this, there is now a cool StateWait.pm library that encapsulates the
      functionality needed. To use it, you call initStateWait before you start
      the chain of events (ie before you call node reboot). Then do your stuff,
      and call waitForState() when you're ready to wait. It can be told to
      return periodically with the results so far, and you can cancel waiting
      for things. An example program called waitForState is in
      testbed/event/stated/ , and can also be used nicely as a command line tool
      that wraps up the library functionality.
      
      This also required the introduction of a TBFAILED event that can be sent
      when a node isn't going to make it to the state that someone may be
      waiting for. E.g. if it gets wedged coming up, and stated retries, but
      eventually gives up on it, it sends this to let things know that the node
      is hosed and won't ever come up.
      
      Another thing that is part of this is that node_reboot moves (back) to the
      fully-event-driven model, where users call node reboot, and it does some
      checks and sends some events. Then stated calls node_reboot in "real mode"
      to actually do the work, and handles doing the appropriate retries until
      the node either comes up or is deemed "failed" and stated gives up on it.
      This means stated is also the gatekeeper of when you can and cannot reboot
      a node. (See mail archives for extensive discussions of the details.)
      
      A big part of the motivation for this was to get uninformed timeouts and
      retries out of os_load/os_setup and put them in stated where we can make a
      wiser choice. So os_load and os_setup now use this new stuff and don't
      have to worry about timing out on nodes and rebooting. Stated makes sure
      that they either come up, get retried, or fail to boot. tbrestart also
      underwent a similar change.
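
The event-listener style of waiting described in this commit can be sketched roughly as follows. This is an illustrative Python sketch, not the Perl code in StateWait.pm: the class name, method names, node names, and the (node, state) event shape are all assumptions. Only the state names SHUTDOWN, ISUP, and TBFAILED come from the message.

```python
from collections import deque


class SequenceWaiter:
    """Track, per node, progress through an expected sequence of states.

    Hypothetical helper: it is fed events one at a time instead of
    polling a database.
    """

    def __init__(self, nodes, sequence):
        if not sequence:
            raise ValueError("sequence must contain at least one state")
        # States each watched node still has to pass through, in order.
        self.pending = {node: deque(sequence) for node in nodes}
        self.done = []
        self.failed = []

    def handle_event(self, node, state):
        """Consume one (node, state) event from the stream."""
        remaining = self.pending.get(node)
        if remaining is None:
            return  # not a node we are waiting on
        if state == "TBFAILED":
            # The node will never reach the awaited state; stop waiting.
            del self.pending[node]
            self.failed.append(node)
        elif state == remaining[0]:
            remaining.popleft()
            if not remaining:
                del self.pending[node]
                self.done.append(node)

    def finished(self):
        return not self.pending


waiter = SequenceWaiter(["pc1", "pc2"], ["SHUTDOWN", "ISUP"])
for node, state in [("pc1", "SHUTDOWN"), ("pc2", "ISUP"),
                    ("pc1", "ISUP"), ("pc2", "TBFAILED")]:
    waiter.handle_event(node, state)
```

In this sketch the stray ISUP from pc2 is ignored because pc2 has not yet been seen in SHUTDOWN, which is the kind of ordering a database poll for a single stable state cannot express.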
  3. 09 Oct, 2003 1 commit
  4. 29 Aug, 2003 1 commit
  5. 25 Aug, 2003 1 commit
  6. 22 Aug, 2003 1 commit
  7. 19 Jun, 2003 1 commit
    • The new and fully functional rebooting-via-events stuff and the · 1daaa992
      Mac Newbold authored
      really-reboot-nodes-that-timeout stuff.
      
      NOTE: Until the timeout/retry stuff is gone from os_load/os_setup, it is
      disabled in stated. It will still only send email. But all the stuff is
      there and has been tested.
      
      NOTE: Until other things don't depend on the old behavior of node_reboot
      (when it returns, all nodes are in SHUTDOWN), the event stuff is disabled.
      Real mode is the default, and can be run by anyone.
      
      In short, this commit is new versions of stated and node_reboot that act
      almost exactly like the old ones. But I wanted to commit them before I go
      on making a bunch more changes, to have a checkpoint that I know works.
  8. 06 Jun, 2003 1 commit
  9. 13 May, 2003 1 commit
  10. 04 Apr, 2003 1 commit
  11. 20 Mar, 2003 3 commits
  12. 19 Mar, 2003 1 commit
    • New slothd change: · b73aee17
      Mac Newbold authored
      node_reboot reports node activity into the "last_ext_act" column of
      node_activity. (Ie activity that is external to the node.)
      
      This means that swapin, swapout, reload, etc etc, anything that reboots
      the node from boss/ops, will count as activity.
  13. 07 Jan, 2003 1 commit
  14. 31 Dec, 2002 1 commit
    • Add support for rebooting jailed (virtual) nodes, either remote or · ab8b901f
      Leigh B. Stoller authored
      local. For local nodes, need to cull out jailed nodes if the phys node
      is also going to reboot. Jailed nodes are rebooted serially since they
      go down much faster.
      
      Fix up recently added wait mode for jailed nodes. Also, I noticed that
      I was having problems with events not filtering through stated before
      going into the ISUP wait loop; I was catching the nodes still in ISUP
      instead of SHUTDOWN. I added a sleep(2) before going into wait mode,
      but this might be something to watch out for elsewhere too.
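
The culling step for local jailed nodes might look roughly like this. It is a hedged Python sketch, not the node_reboot code: the function name, the `phys_of` mapping, and the node names are hypothetical.

```python
def cull_jailed(nodes, phys_of):
    """Drop jailed (virtual) nodes whose physical node is also rebooting.

    phys_of maps a jailed node to its physical node (an assumed layout);
    nodes missing from the map are treated as physical nodes.
    """
    rebooting_phys = {n for n in nodes if n not in phys_of}
    return [n for n in nodes
            if n not in phys_of or phys_of[n] not in rebooting_phys]


nodes = ["pc1", "pcvm1-1", "pcvm2-1"]
phys_of = {"pcvm1-1": "pc1", "pcvm2-1": "pc2"}
remaining = cull_jailed(nodes, phys_of)
```

Here pcvm1-1 is dropped because rebooting pc1 takes it down anyway, while pcvm2-1 stays because pc2 is not in the list.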
  15. 18 Oct, 2002 1 commit
    • Merge the newstated branch with the main tree. · 5c961517
      Mac Newbold authored
      Changes to watch out for:
      
      - db calls that change boot info in nodes table are now calls to os_select
      
      - whenever you want to change a node's pxe boot info, or def or next boot
      osids or paths, use os_select.
      
      - when you need to wait for a node to reach some point in the boot process
      (like ISUP), check the state in the database using the lib calls
      
      - Proxydhcp now sends a BOOTING state for each node that it talks to.
      
      - OSs that don't send ISUP will have one generated for them by stated
      either when they ping (if they support ping) or immediately after they get
      to BOOTING.
      
      - States now have timeouts. Actions aren't currently carried out, but they
      will be soon. If you notice problems here, let me know... we're still
      tuning it. (Before, all timeouts were set to "none" in the db.)
      
      One temporary change:
      
      - While I make our new free node manager daemon (freed), all nodes are
      forced into reloading when they're nfreed and the calls to reset the os
      are disabled (that will move into freed).
  16. 17 Oct, 2002 1 commit
  17. 07 Oct, 2002 1 commit
  18. 20 Sep, 2002 1 commit
    • Remove -e flag from calls to power. node_reboot sends an event only when ssh... · 8b23d335
      Mac Newbold authored
      Remove -e flag from calls to power. node_reboot sends an event only when ssh reboot or ipod are successful in rebooting the node, and only calls power when they are not successful. So an event should be sent by power every time node_reboot calls it. This explains some of the problems we were having with tons of email from stated about invalid transitions: since the state changes weren't always happening, it appeared to skip over states.
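
The division of responsibility described above (node_reboot sends the event after a successful ssh or ipod reboot, and power sends it otherwise) can be sketched as follows. This is an assumption-laden Python illustration: every function and parameter name is hypothetical, and the callbacks stand in for the real tools.

```python
def reboot_node(node, try_ssh, try_ipod, power_cycle, send_event):
    """Try the gentle methods first; fall back to power only if both fail."""
    if try_ssh(node) or try_ipod(node):
        # The soft reboot worked, so node_reboot reports the state change.
        send_event(node, "node_reboot")
        return "soft"
    # power is expected to send its own event, so none is sent here.
    power_cycle(node)
    return "power"


events = []


def record_event(node, source):
    events.append((node, source))


def fake_power(node):
    # Stand-in for the power script, which sends the event itself.
    record_event(node, "power")


soft = reboot_node("pc1", lambda n: False, lambda n: True,
                   fake_power, record_event)
hard = reboot_node("pc2", lambda n: False, lambda n: False,
                   fake_power, record_event)
```

Either path yields exactly one event per reboot, which is the property the commit is after.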
  19. 07 Jul, 2002 1 commit
  20. 19 Jun, 2002 1 commit
  21. 16 Jun, 2002 1 commit
  22. 07 Jun, 2002 1 commit
  23. 05 Jun, 2002 1 commit
    • Changes to sshtb. Remove sshremote, and convert sshtb into a perl · 231fc2b1
      Leigh B. Stoller authored
      script that checks the database to see if local or remote. The problem
      with this is that the ssh syntax makes it hard to determine the host
      name by inspection. Would need to parse all the ssh args (bad idea),
      or work backwards and try to figure out the difference between the
      command (which is not a string but a sequence of args) and the host
      and the preceding ssh args. Hell with that! Changed sshtb to require
      a specific -host argument. Read the args and look for it. Error out if
      not found, to catch improper usage.
      
      The moral of this update: "sshtb [ssh args] -host <host> [more args ...]"
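
Requiring an explicit -host marker makes the split trivial. A minimal Python sketch of that argument handling is below; the real sshtb is a Perl script, and the function name and error messages here are assumptions.

```python
def split_sshtb_args(argv):
    """Split 'sshtb [ssh args] -host <host> [more args ...]' into parts."""
    if "-host" not in argv:
        # Error out to catch improper usage, as the commit describes.
        raise ValueError("usage: sshtb [ssh args] -host <host> [more args ...]")
    i = argv.index("-host")
    if i + 1 >= len(argv):
        raise ValueError("-host requires a host name")
    # Everything before -host goes to ssh; everything after the host
    # is the remote command and its arguments.
    return argv[:i], argv[i + 1], argv[i + 2:]


ssh_args, host, command = split_sshtb_args(["-n", "-host", "pc1", "ls", "-l"])
```

No knowledge of which ssh options take a value is needed, which is the parsing problem the message calls a bad idea to take on.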
  24. 22 Apr, 2002 1 commit
  25. 17 Apr, 2002 1 commit
    • Moved EventSend calls to the TBSetNodeEventState() function. This has · 15c13c32
      Robert Ricci authored
      two benefits: (1) More general (2) Regains ability to run without the
      event system. Previously, since programs that wanted to set node state
      had to 'use event', this broke our ability to run without the event
      system. Now, we can do a check in libdb for the event system, and not
      use it if EVENTSYS is not set. In that case, we update state in the
      database directly rather than sending an event.
      
      Also added equivalent calls for node operational mode, as well as new
      constants for both state and mode.
      
      Converted power and node_reboot to use this new scheme.
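
The fallback in TBSetNodeEventState() can be pictured with a short Python sketch. The real function lives in the Perl libdb; the signature, the dict standing in for the database, and the callback below are assumptions made for illustration.

```python
def set_node_event_state(node, state, db, send_event, eventsys_enabled):
    """Report a node state via events if available, else write the DB."""
    if eventsys_enabled:
        # With the event system present, only an event is sent here.
        send_event(node, state)
    else:
        # Without it, update the stored state directly.
        db[node] = state


db = {}
sent = []
set_node_event_state("pc1", "REBOOTING", db,
                     lambda n, s: sent.append((n, s)), eventsys_enabled=True)
set_node_event_state("pc2", "REBOOTING", db,
                     lambda n, s: sent.append((n, s)), eventsys_enabled=False)
```

Callers use one function either way, so they no longer need to load the event library themselves.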
  26. 03 Apr, 2002 1 commit
  27. 01 Apr, 2002 1 commit
    • Transition to tmcd and event-based node state reporting. · 44311142
      Robert Ricci authored
      Changed scripts that used the 'eventstatus' column to use the more
      descriptively-named 'eventstate' column.
      
      The FreeBSD and Linux startup scripts report a 'REBOOTED' state to tmcd
      when they start, and 'ISUP' when the startup script is done.
      
      node_reboot and power now send TBNODESTATE/REBOOTING events.
  28. 05 Mar, 2002 2 commits
    • A wide ranging set of event system changes: · 0318cc22
      Leigh B. Stoller authored
      assign_wrapper.in: Hack in a change that ensures a delay node is
      created for any link on which an event is posted (up,down,modify),
      no matter what its initial parameters are. ie: If a link is created
      with no delay, but there is an event that adds a delay later, then we
      must drop in a delay node. Same for up/down on a link. We do this in
      the delay node. I am reasonably confident that this change is fine for
      duplex links, but I am less sure of the effect on lans!
      
      eventsys_control.in: Checkpoint latest changes. Add "replay" option,
      which right now just stops and starts the event scheduler so that it
      reloads the entire event list. Add check for existing experiment, and
      that the experiment is either active or swapping (do not want to start
      a scheduler for a swapped out experiment!). Add check to see if there
      are any events, and skip startup if there are no events in the DB.
      Lastly, get very serious about preventing more than one scheduler from
      being started, either by accident or intentionally. My protocol is to
      lock the table, grab and set the pid to -pid, test the pid for a
      positive value, and if positive, send the scheduler a kill(TERM) so
      that it can cleanup, clear the pid to zero in the DB, and exit. This
      approach ensures that we do not try to send a kill to a pid that is no
      longer active or owned by the user (this last part is not really
      necessary cause of how pids are reused, but it was easy to add so why
      not).
      
      exports_setup.in: Trivial change to make it easier to turn this on
      temporarily in devel trees.
      named_setup.in: Ditto.
      
      node_reboot.in: Add call to TBdbfork() in child cause of apparent DB
      connection problems across forks. In the child, set the eventstatus
      for the node to REBOOT if successful (note this event status stuff is
      temporary, will be recast in next set of revisions).
      
      GNUmakefile:  Add new controlling program, eventsys_control.
      power.in:     Ditto previous comment about REBOOT.
      os_setup.in:  Non event system cleanups.
      tbend.in:     Add DB cleanup of the new virt_trafgens and eventlist tables.
      tbprerun.in:  Ditto.
      tbreport.in:  Print out the event list in a pretty print format.
      tbswapin.in:  Add call to start the event system. Also a big fix; move
                    the named script up above the os_setup so that the named
                    tables have been updated by the time the first node
                    reboots. I noticed that nodes were failing on gethostbyname().
      tbswapout.in: Add call to stop the event system.
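
The scheduler-pid protocol described for eventsys_control.in above can be sketched in Python as follows. This is one reading of the message, not the actual script: a `threading.Lock` stands in for the DB table lock, a dict stands in for the table row, the scheduler is assumed to clear its own pid when it handles TERM, and the pid is only negated when positive so that a second caller cannot flip it back.

```python
import threading

table_lock = threading.Lock()  # stand-in for locking the DB table


def stop_scheduler(row, send_term):
    """Signal the scheduler at most once, however many callers race."""
    with table_lock:
        pid = row["pid"]
        if pid > 0:
            row["pid"] = -pid  # mark "stop in progress" for other callers
    if pid <= 0:
        return False  # nothing running, or someone else is stopping it
    send_term(pid)
    return True


row = {"pid": 1234}  # hypothetical pid
signalled = []


def fake_scheduler_term(pid):
    # Stand-in for the scheduler's TERM handler: clean up, then clear
    # its pid to zero in the "DB" before exiting.
    signalled.append(pid)
    with table_lock:
        row["pid"] = 0


first = stop_scheduler(row, fake_scheduler_term)
second = stop_scheduler(row, fake_scheduler_term)
```

Because the pid is read and negated under the lock, only the caller that saw a positive value sends the signal.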
  29. 27 Nov, 2001 1 commit
  30. 16 Oct, 2001 1 commit
  31. 25 Jul, 2001 2 commits
    • Fix small syntax error. · 27de539d
      Mac Newbold authored
    • Another Shark hack. Well, maybe not. Batch node_reboots in groups of 8 · 73437a5c
      Leigh B. Stoller authored
      to avoid a blizzard of reboots all at once. This might solve the
      problem of sharks rebooting okay, but failing to become proper members
      of the testbed. A good thing to do in any event, especially with
      people trying to run 50 node experiments. The reason for 8 of course
      is that I want to isolate each shelf (after sorting the list). I pause
      15 seconds between each shelf, and 10 seconds between each batch of 8
      pcs.
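
The batching described here amounts to something like the following Python sketch. The function and parameter names are assumptions; the batch size of 8 and the 10 second pause come from the message, and the sleep function is injectable so the example does not actually wait.

```python
import time


def reboot_in_batches(nodes, reboot, batch_size=8, pause=10, sleep=time.sleep):
    """Reboot sorted nodes in fixed-size batches, pausing between batches."""
    ordered = sorted(nodes)
    batches = [ordered[i:i + batch_size]
               for i in range(0, len(ordered), batch_size)]
    for index, batch in enumerate(batches):
        for node in batch:
            reboot(node)
        if index < len(batches) - 1:
            sleep(pause)  # no pause needed after the final batch
    return batches


rebooted = []
pauses = []
nodes = [f"pc{n:02d}" for n in range(1, 21)]  # 20 hypothetical nodes
batches = reboot_in_batches(nodes, rebooted.append, sleep=pauses.append)
```

Sorting first is what lets a batch of 8 line up with one shelf, as the message notes.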
  32. 12 Jul, 2001 1 commit
  33. 11 Jul, 2001 2 commits
  34. 26 Jun, 2001 1 commit
    • New script: sshtb · 9de266c3
      Robert Ricci authored
      sshtb is a _very_ simple shell script that runs ssh with a few commandline
      parameters, which make it play nicer in a script environment. These
      parameters can be changed with the '--with-ssh-args' argument, but default to:
      '-q -o "BatchMode yes" -o "StrictHostKeyChecking no"'
      All ssh calls now use this script.
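
What the wrapper does can be shown as a small Python sketch that builds the ssh argument list. The real sshtb is a shell script; the function name and the example host are hypothetical, and only the default options are taken from the message.

```python
# Defaults quoted in the commit message, one argv element per shell word.
DEFAULT_SSH_ARGS = ("-q", "-o", "BatchMode yes",
                    "-o", "StrictHostKeyChecking no")


def sshtb_command(args, ssh_args=DEFAULT_SSH_ARGS):
    """Return the argv that a wrapper like sshtb would hand to ssh."""
    return ["ssh", *ssh_args, *args]


command = sshtb_command(["pc1", "uptime"])
```

With BatchMode on, ssh fails instead of prompting for a password, which is what makes it safe to call from scripts.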
  35. 05 Jun, 2001 1 commit