1. 09 Jun, 2003 1 commit
  2. 06 Jun, 2003 1 commit
    • Mac Newbold authored · 71b82cc4
      First batch of changes for adding TBCOMMAND events. Currently, here's what
      is supported:
      
      - stated listens for TBCOMMAND events, and currently handles REBOOT,
        POWEROFF, POWERON, and POWERCYCLE events. It does everything except make
        the actual calls to node_reboot and power. And it accepts batches of
        nodes instead of just single ones.
      
      - Timeouts were added to the db for these commands, with no timeout for
        the power ones (since the node can't hang during those), and a 15 second
        timeout from reboot until the SHUTDOWN state.
      
      - If a reboot times out, it tries it again, up to 3 times. If it gets to
        three times without working, it sends mail to tbops and turns the
        machine off instead of continuing to reboot it. Right now I haven't
        made it do node_reboot -f or power cycle on retries, but it easily
        could.
      
      - Stuff to be done before they work: make node_reboot send an event
        instead of doing the work, and make a new script that has node_reboot's
        old guts. Note that this requires authentication in our events for these
        commands, and a way to make sure that the command that came in as an
        event was properly authenticated.
      
      - For future growth and expansion, it is set up so it should be relatively
        easy to add other commands that do different things, even if they take
        arbitrary params that aren't nodes or lists of nodes.
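The retry policy in the third bullet above can be sketched roughly as follows. This is an illustration in Python, not stated's actual code: `RebootTracker`, its method names, and the returned action strings are all hypothetical, and counting the initial reboot as the first of the three tries is an assumption.

```python
from dataclasses import dataclass, field

MAX_REBOOT_TRIES = 3  # from the commit message: give up after three tries


@dataclass
class RebootTracker:
    """Counts reboot attempts per node and picks an action on timeout."""
    tries: dict = field(default_factory=dict)

    def start(self, node):
        # First reboot attempt for this node (assumed to count as try 1).
        self.tries[node] = 1
        return "reboot"

    def on_timeout(self, node):
        # Called when the node did not reach SHUTDOWN in time.
        count = self.tries.get(node, 1)
        if count >= MAX_REBOOT_TRIES:
            # Out of tries: mail ops and power the node off instead.
            self.tries.pop(node, None)
            return "mail_and_poweroff"
        self.tries[node] = count + 1
        return "reboot"

    def on_shutdown(self, node):
        # The node reached SHUTDOWN, so the reboot worked; forget the count.
        self.tries.pop(node, None)
```

A caller would map the returned strings onto whatever really reboots, powers off, or sends mail; those hooks are left out here on purpose.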
  3. 23 May, 2003 1 commit
    • Mac Newbold authored · c8848155
      Fix two problems:
      1. Timeouts for nodes weren't getting reset when they had a mode
      transition, so they were timing out in shutdown after changing modes.
      2. It was still going back into a blocking wait, even though a signal had
      been received, and not quitting back up to the main loop to handle it.
  4. 22 May, 2003 2 commits
  5. 20 May, 2003 2 commits
    • Mac Newbold authored · 1934cc59
      Back out last set of changes until we figure out some odd behaviors that I
      didn't see in testing. Specifically, why it pegs at 100% CPU after a
      while, and why it gets timeouts after it has removed the timeout from the
      queue.
    • Mac Newbold authored · b438d5f5
      Bunch of pretty good-sized changes to stated:
      1. Change from inefficient timeout search algo that ran once per second to
      a highly efficient priority queue method of managing timeouts. Now
      instead of checking every node's timestamps, we just look at the head of
      the queue, and it is often much less frequent than once a second, since we
      know how long we have until the next timeout.
      
      2. Start using a blocking poll for events, so I can sleep for long periods
      of time instead of having to wake up at least once a second to check for
      timeouts and events. Will set the block timeout for the shortest of: the
      time to send out the next batch of queued emails, the next time a timeout
      may occur, or when there are no mails waiting and no timeouts possible, 10
      minutes. Comes back as soon as an event comes in.
      
      3. Given the above two items, we no longer need a sleep(1) in our main
      loop.
      
      One small glitch is in the process of being fixed. When using blocking
      polls, things hang when trying to unregister from the event system. Not a
      big deal, just ^C twice to kill it. (May cause it to need two SIGUSR1's
      to get it to restart, too.)
      
      In the next update, look for:
       - Really take action on timeouts.
         - keep track of how many times we've retried, and notify if something
           may be wrong with the node.
         - Find out policy on taking action with timeouts.
           - Do it if the expt is in transition or the node is free
           - Probably don't touch if the expt is established.
           - Maybe? in active expt, send (good) email to expt owner on timeouts
      
      Related "coming soon" items:
      os_load/os_setup etc.:
       - Add the waitforstate stuff we've talked about
       - make os_load/os_setup use it
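Items 1 and 2 of this commit can be illustrated with a small Python sketch. It is not stated's code: `TimeoutQueue`, `block_time`, and `IDLE_WAIT` are hypothetical names, and skipping stale heap entries lazily is just one common way to handle a timeout that is reset or cleared, not necessarily the approach taken in stated.

```python
import heapq

IDLE_WAIT = 600  # ten minutes, the idle block time named in the commit message


class TimeoutQueue:
    """Priority queue of per-node deadlines; only the head is inspected."""

    def __init__(self):
        self._heap = []      # (deadline, node) pairs, earliest first
        self._current = {}   # node -> its live deadline

    def set(self, node, deadline):
        # Setting a new deadline supersedes any older one for the node.
        self._current[node] = deadline
        heapq.heappush(self._heap, (deadline, node))

    def clear(self, node):
        # The heap entry stays behind and is discarded when it surfaces.
        self._current.pop(node, None)

    def _drop_stale(self):
        # Pop head entries that no longer match the node's live deadline.
        while self._heap:
            deadline, node = self._heap[0]
            if self._current.get(node) == deadline:
                return
            heapq.heappop(self._heap)

    def next_deadline(self):
        self._drop_stale()
        return self._heap[0][0] if self._heap else None

    def pop_expired(self, now):
        # Return every node whose live deadline is at or before `now`.
        expired = []
        while True:
            self._drop_stale()
            if not self._heap or self._heap[0][0] > now:
                return expired
            _, node = heapq.heappop(self._heap)
            del self._current[node]
            expired.append(node)


def block_time(now, next_timeout, next_mail):
    """Seconds to block: the sooner of the two deadlines, else IDLE_WAIT."""
    pending = [t - now for t in (next_timeout, next_mail) if t is not None]
    return max(0, min(pending)) if pending else IDLE_WAIT
```

The main loop would then block on the event poll for `block_time(...)` seconds and call `pop_expired` when it wakes, which is why a fixed sleep(1) is no longer needed.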
  6. 30 Apr, 2003 1 commit
  7. 28 Apr, 2003 1 commit
  8. 17 Apr, 2003 1 commit
    • Mac Newbold authored · 074149f5
      Add generic per-node state triggers to stated. You can put a trigger
      on any node on any state, in any specific mode, or without any mode
      restriction.
      
      The immediate use of this is the FREENODE trigger. Now RELOADDONE adds
      a FREENODE trigger on the ISUP state, if the node is in the reloading
      expt. Then next time the node hits ISUP, it gets freed from the
      reloading expt.
      
      This fix solves the race where recently freed (and still rebooting)
      nodes get grabbed by an expt and get rebooted in a way that may hose
      their filesystems.
      
      Also fixed a problem that was making it load the db twice on startup.
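A per-node trigger table along these lines might look like the Python sketch below. The names `TriggerTable` and `ANY_MODE` are hypothetical, and treating triggers as one-shot (removed once they fire) is an assumption drawn from the "next time the node hits ISUP" wording, not something the commit states outright.

```python
from collections import defaultdict

ANY_MODE = None  # hypothetical wildcard: the trigger fires in any mode


class TriggerTable:
    """Maps (node, state) to pending triggers, optionally limited to a mode."""

    def __init__(self):
        self._triggers = defaultdict(list)  # (node, state) -> [(mode, name)]

    def add(self, node, state, name, mode=ANY_MODE):
        self._triggers[(node, state)].append((mode, name))

    def fire(self, node, mode, state):
        """Return and remove the triggers matching this transition."""
        pending = self._triggers.get((node, state), [])
        hit = [name for m, name in pending if m in (ANY_MODE, mode)]
        keep = [(m, name) for m, name in pending if m not in (ANY_MODE, mode)]
        if keep:
            self._triggers[(node, state)] = keep
        else:
            self._triggers.pop((node, state), None)
        return hit
```

In the FREENODE case, handling RELOADDONE would call something like `add(node, "ISUP", "FREENODE")`, and the node would be freed only when a later ISUP makes `fire` return that name.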
  9. 19 Mar, 2003 1 commit
  10. 12 Mar, 2003 1 commit
  11. 08 Mar, 2003 1 commit
  12. 07 Mar, 2003 1 commit
    • Mac Newbold authored · 92fa4ae2
      A few changes to stated:
       - fix bad indenting to a uniform 4 spaces (before was 2, 4 and 8 mixed)
       - Move ping-for-isup functionality into a separate script
       - Make sure every transition triggered by stated (directly or indirectly)
         sends an event, instead of taking shortcuts.
      
      This called for a new script, eventping, which just pings until the node
      is pingable, then sends an ISUP event. Stated runs this in the background
      where necessary, and nothing else should run it.
      
      Adding eventping meant modifying configure and the utils makefile, too.
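The behavior described for eventping (ping until the node answers, then send ISUP) amounts to a loop like the one below. This is a Python sketch only: `wait_then_isup` and its parameters are hypothetical, and the actual ping and event-sending calls are passed in as functions because the commit does not say how eventping does either.

```python
import time


def wait_then_isup(node, is_pingable, send_event, interval=1.0, sleep=time.sleep):
    """Poll is_pingable(node) until it is true, then send one ISUP event.

    Returns the number of failed ping attempts before the node answered.
    """
    failures = 0
    while not is_pingable(node):
        failures += 1
        sleep(interval)  # wait before the next ping attempt
    send_event(node, "ISUP")
    return failures
```

Keeping the loop in its own process, as the commit does with a separate script, means the daemon's main loop never blocks while a slow node comes up.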
  13. 26 Feb, 2003 1 commit
  14. 05 Feb, 2003 2 commits
  15. 29 Jan, 2003 3 commits
  16. 07 Jan, 2003 2 commits
  17. 20 Dec, 2002 1 commit
  18. 16 Dec, 2002 1 commit
    • Mac Newbold authored · a77a1559
      Fix the 1-event-per-second limitations. Poll until I don't get more
      events. This may delay handling of other stuff that happens in my main
      loop, but not by too much. To prevent skew, everything (including reload
      frequency) is done strictly by seconds elapsed, not by iterations or
      anything.
      
      I found that even polling for multiple events without sleeping, I could
      only handle a little over 1 per second when I was calling inuse/statetime
      for additional info on every event. Even though this only happens in the
      worst case (every event is wrong), it won't do. So I took that out. I'll
      probably end up adding a faster lookup of the info I need (mostly
      reservation, and what osid it thinks it is running). That change took it
      up to at least 4 per second (as fast as I could send them manually), more
      than 4x our previous performance. So we should be able to keep up now.
      
      Also, add the support for "announcements" to testbed ops when I die and
      such. (Been in a few days, but this is the first commit of it)
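The two ideas in the first paragraph of this commit (drain all waiting events per pass, and schedule periodic work by elapsed seconds) can be sketched in Python as below. `drain_events` and `Every` are hypothetical names, and `poll` stands in for a non-blocking event poll that returns None when nothing is waiting.

```python
def drain_events(poll, handle):
    """Handle events until poll() reports that none are waiting."""
    handled = 0
    while True:
        event = poll()
        if event is None:
            return handled
        handle(event)
        handled += 1


class Every:
    """Gates a periodic job by seconds elapsed, not by loop iterations."""

    def __init__(self, period, start):
        self.period = period
        self.last = start

    def due(self, now):
        # True at most once per `period` seconds, however often it is asked.
        if now - self.last >= self.period:
            self.last = now
            return True
        return False
```

Because `due` compares timestamps, a pass through the main loop that spends a long time draining events delays periodic work such as the reload without changing how often it runs overall.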
  19. 09 Dec, 2002 1 commit
  20. 03 Dec, 2002 1 commit
  21. 22 Nov, 2002 1 commit
  22. 14 Nov, 2002 1 commit
    • Mac Newbold authored · 349db7bf
      Lots of changes.
      First, fix up the isup generation code. When a node/OS doesn't send its
      own isups, but is pingable, we need to fork and ping it, and send ISUP
      when it pings. The code was there, but was broken. This fixes it. The one
      time that it may cause errant messages is in modes other than MINIMAL.
      When we get BOOTING, we check if it needs isup generated. If we have to
      ping it, when it pings we send ISUP. This means that if we are really in
      NORMAL mode, we might send ISUP before the node sends REBOOTED (or TBSETUP
      in NORMALv1), and it would look funny. But that case will be really rare,
      since everything that sends REBOOTED or TBSETUP has no reason not to send
      ISUP itself.
      
      Second, after mailbombing myself a couple of times, Kirk and I decided I'd
      better put some throttling in the notification code that stated uses. So
      now it throttles itself and digests the messages if they're sent too close
      together. The first message it gets will get sent immediately. If the next
      one is long enough after that, it sends it immediately too. If a message
      comes too soon after sending one, we queue it up, and send it later
      after enough time has passed. Currently it is set to wait 5 seconds
      between messages, so it will send up to 12 per minute, and wait no more
      than 5 seconds before sending a message that is queued up.
      
      (Something similar to this may be a nice thing in the rest of our stuff,
      but it was made a lot easier by the fact that stated already had a polling
      loop in it. Without that, you'd have to use alarms or some other weird
      thing, which would be painful.)
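The throttling and digesting described in the second paragraph can be sketched in Python as follows. `MailThrottle`, `notify`, and `tick` are hypothetical names; the 5 second gap comes from the commit message, and joining queued messages with newlines is only a stand-in for whatever digest format stated really uses.

```python
MIN_GAP = 5  # seconds between outgoing messages, per the commit message


class MailThrottle:
    """Sends a message at once if allowed, otherwise queues it for a digest."""

    def __init__(self, send, min_gap=MIN_GAP):
        self._send = send          # callable that actually delivers the mail
        self._min_gap = min_gap
        self._last_sent = None     # time of the most recent delivery
        self._queued = []

    def notify(self, msg, now):
        gap_ok = self._last_sent is None or now - self._last_sent >= self._min_gap
        if not self._queued and gap_ok:
            self._deliver([msg], now)
        else:
            self._queued.append(msg)  # too soon: hold it for the next digest

    def tick(self, now):
        # Called from the polling loop; flushes the digest once the gap passes.
        if self._queued and now - self._last_sent >= self._min_gap:
            self._deliver(self._queued, now)
            self._queued = []

    def _deliver(self, msgs, now):
        self._send("\n".join(msgs))
        self._last_sent = now
```

`tick` relies on something calling it regularly, which matches the remark that this was easy only because stated already had a polling loop.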
  23. 05 Nov, 2002 1 commit
  24. 04 Nov, 2002 1 commit
    • Mac Newbold authored · e9dcf743
      Bunch o' changes.
       - Better pidfile handling, do proper locking, etc.
       - Change die() to fatal(), so it sends mail and goes to syslog instead of
         to /dev/null
       - Fix RESET to not reset pxe_boot_path for Mike.
       - Fix sendmail call to have proper to and from addrs
  25. 01 Nov, 2002 1 commit
  26. 31 Oct, 2002 1 commit
  27. 22 Oct, 2002 2 commits
  28. 18 Oct, 2002 1 commit
    • Mac Newbold authored · 5c961517
      Merge the newstated branch with the main tree.
      Changes to watch out for:
      
      - db calls that change boot info in nodes table are now calls to os_select
      
      - whenever you want to change a node's pxe boot info, or def or next boot
      osids or paths, use os_select.
      
      - when you need to wait for a node to reach some point in the boot process
      (like ISUP), check the state in the database using the lib calls
      
      - Proxydhcp now sends a BOOTING state for each node that it talks to.
      
      - OSs that don't send ISUP will have one generated for them by stated
      either when they ping (if they support ping) or immediately after they get
      to BOOTING.
      
      - States now have timeouts. Actions aren't currently carried out, but they
      will be soon. If you notice problems here, let me know... we're still
      tuning it. (Before, all timeouts were set to "none" in the db.)
      
      One temporary change:
      
      - While I make our new free node manager daemon (freed), all nodes are
      forced into reloading when they're nfreed and the calls to reset the os
      are disabled (that will move into freed).
  29. 20 Sep, 2002 2 commits
  30. 19 Sep, 2002 1 commit
    • Robert Ricci authored · 509c7b38
      A few changes for use with the testsuite's 'full' mode:
      1) Checks database redirects for nodes, and ignores events that aren't
         directed to its database.
      2) Doesn't insist on being run as root (doesn't need to be right now,
         anyway.)
      3) '-f' option that prevents it from forking into the background, for
         easier killing.
  31. 10 Jul, 2002 1 commit
  32. 12 Jun, 2002 1 commit