1. 19 Nov, 2003 1 commit
  2. 06 Nov, 2003 1 commit
    • Prevent reload_daemon from exiting. · 33e45640
      Leigh B. Stoller authored
      * If the reboot of a stuck node fails, move the node to hwdown, send
        email, and log an entry in the nodelog. Then continue on.
      
      * If os_load fails, record the nodes that failed, and try again if the
        nodes fail to reload at the retry interval. Do not exit. I was going
        to call os_load again immediately, but decided not to since these
        changes were quite easy.
      
        The above change is not really tested ... waiting for os_load to fail!
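
The retry behaviour described above can be sketched roughly as follows. This is a Python illustration only, not the daemon's actual Perl code; `reload_pass`, `RETRY_INTERVAL`, and the `os_load` callable are hypothetical stand-ins, and the interval value is an assumption.

```python
import time

RETRY_INTERVAL = 60.0  # seconds; hypothetical value, not taken from the commit


def reload_pass(pending, failed, os_load, now=None):
    """Run one pass of the loop without ever raising or exiting.

    pending: node ids waiting for a first reload attempt
    failed:  dict mapping node id -> time of its last failed attempt
    os_load: callable taking a list of node ids and returning the ids
             that failed to load (a stand-in for the real script)
    Returns the list of node ids that loaded successfully in this pass.
    """
    now = time.time() if now is None else now
    # Failed nodes are retried only once the retry interval has elapsed.
    due = [n for n, t in failed.items() if now - t >= RETRY_INTERVAL]
    batch = list(pending) + due
    if not batch:
        return []
    try:
        bad = set(os_load(batch))
    except Exception:
        # Treat a crash of the loader as a failure of the whole batch.
        bad = set(batch)
    for node in batch:
        if node in bad:
            failed[node] = now  # remember it and try again later
        else:
            failed.pop(node, None)
    return [n for n in batch if n not in bad]
```

The point of the sketch is that a failure only updates bookkeeping: the caller's loop keeps running and the failed nodes come back into the batch once the interval has passed.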
  3. 15 Sep, 2003 1 commit
  4. 25 Mar, 2003 1 commit
  5. 22 Mar, 2003 1 commit
    • Grab a batch at a time instead of a single node per loop iteration. · 4a34327a
      Mac Newbold authored
      Scaling and speed now depend primarily on os_load (and indirectly,
      node_reboot). The time a batch spends in the reload_daemon code appears to
      be <1s per node now, instead of taking 30s per node to grab, setup, and
      reboot.
      
      Also, finally remove the "obsolete section" that's been sitting in there
      for a long time. This was the part that did netdisk reloads, and has
      already been neutered out of the code path for several months at least.
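
To illustrate the difference between the two loop shapes, here is a rough Python sketch contrasting one os_load call per node with one call per batch. The function names and the stubbed `os_load` callable are hypothetical; the real daemon is a Perl script whose code is not shown here.

```python
def reload_one_at_a_time(nodes, os_load):
    """Old shape: one os_load call per node, so loads are serialized."""
    for node in nodes:
        os_load([node])


def reload_in_batch(nodes, os_load):
    """New shape: hand the whole batch to os_load in a single call."""
    if nodes:
        os_load(list(nodes))


# Count how many times the (stubbed) loader is invoked in each shape.
calls = []
reload_one_at_a_time(["pc1", "pc2", "pc3"], calls.append)
serial_calls = len(calls)

calls.clear()
reload_in_batch(["pc1", "pc2", "pc3"], calls.append)
batched_calls = len(calls)
```

With a single call per batch, the time spent per node is governed by whatever os_load does internally, which matches the commit's remark that scaling now depends primarily on os_load.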
  6. 31 Jan, 2003 1 commit
  7. 30 Jan, 2003 1 commit
  8. 29 Jan, 2003 1 commit
  9. 18 Dec, 2002 1 commit
  10. 16 Dec, 2002 1 commit
    • Decrease the sleep between loops from 2 to 1, and fix a typo. This should · 6bdba92c
      Mac Newbold authored
      help nodes in reload_pending get sucked into reloading faster. If it
      doesn't do enough, we'll need to do more batching of stuff, so we get some
      parallelism in os_load instead of forcing it to serialize by calling
      os_load one node at a time.
      
      I was tempted to nuke all the stuff that was in there from the netdisk
      reload type, but decided not to. It won't be too long (relatively
      speaking) before we have freed, the new "free node manager" that will
      replace/supersede our current reload_daemon anyway.
  11. 11 Dec, 2002 1 commit
  12. 04 Nov, 2002 1 commit
  13. 01 Nov, 2002 1 commit
  14. 18 Oct, 2002 1 commit
    • Merge the newstated branch with the main tree. · 5c961517
      Mac Newbold authored
      Changes to watch out for:
      
      - db calls that change boot info in nodes table are now calls to os_select
      
      - whenever you want to change a node's pxe boot info, or def or next boot
      osids or paths, use os_select.
      
      - when you need to wait for a node to reach some point in the boot process
      (like ISUP), check the state in the database using the lib calls
      
      - Proxydhcp now sends a BOOTING state for each node that it talks to.
      
      - OSs that don't send ISUP will have one generated for them by stated
      either when they ping (if they support ping) or immediately after they get
      to BOOTING.
      
      - States now have timeouts. Actions aren't currently carried out, but they
      will be soon. If you notice problems here, let me know... we're still
      tuning it. (Before, all timeouts were set to "none" in the db.)
      
      One temporary change:
      
      - While I make our new free node manager daemon (freed), all nodes are
      forced into reloading when they're nfreed and the calls to reset the os
      are disabled (that will move into freed).
  15. 07 Jul, 2002 1 commit
  16. 13 May, 2002 1 commit
  17. 12 Feb, 2002 1 commit
  18. 08 Feb, 2002 1 commit
    • Big round of image/osid changes. This is the first cut (final cut?) at · a73e627e
      Leigh B. Stoller authored
      supporting autocreating and autoloading images. The imageid form now
      sports a field to specify a nodeid to create the image from; if set,
      the backend create_image script is invoked. That's the easy part.
      Slightly harder is autoloading images based on the osid specified in
      the NS file. To support this, I have added a new DB table called
      osidtoimageid, which holds the mapping from osid/pctype to imageid.
      When users create images, they must specify what node types that image
      is good for. Obviously, the mappings have to be unique or it would be
      impossible to figure it out! Anyway, once that image mapping is
      in place and the image created, the user can specify that ID in the NS
      file. I've changed os_setup to look for IDs that are not loaded,
      and to try and find one in the osidtoimageid. If found, it invokes
      os_load. To keep things running in parallel as much as possible,
      os_setup issues all the loads/reboots (could be more than a single set
      of loads if multiple IDs are in the NS file) at once, and waits for
      all the children to exit. I've hacked up os_load a bit to try and be
      more robust in the face of PXE failures, which still happen and are
      rather troublesome. Need an event system!
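
The osid/pctype to imageid lookup described in this paragraph might look something like the sketch below. It is a hedged Python illustration using an in-memory dictionary in place of the real osidtoimageid table; the class, function names, and error handling are assumptions, not the code in os_setup.

```python
class OsidToImageid:
    """In-memory stand-in for the osidtoimageid table (hypothetical API)."""

    def __init__(self):
        self._map = {}  # (osid, pctype) -> imageid

    def add(self, osid, pctype, imageid):
        key = (osid, pctype)
        # The mapping must be unique, otherwise the lookup is ambiguous.
        if key in self._map and self._map[key] != imageid:
            raise ValueError(f"{key} already maps to {self._map[key]}")
        self._map[key] = imageid

    def lookup(self, osid, pctype):
        """Return the imageid to load, or None if no image is registered."""
        return self._map.get((osid, pctype))


def images_to_load(table, requested, loaded):
    """Group nodes by the image they need.

    requested: dict node -> (osid, pctype) asked for in the NS file
    loaded:    dict node -> osid currently on the node's disk
    Returns a dict imageid -> list of nodes, i.e. one load per entry.
    """
    loads = {}
    for node, (osid, pctype) in requested.items():
        if loaded.get(node) == osid:
            continue  # already has the right OS, nothing to load
        imageid = table.lookup(osid, pctype)
        if imageid is None:
            raise LookupError(f"no image for {osid} on {pctype}")
        loads.setdefault(imageid, []).append(node)
    return loads
```

Each key of the returned dictionary corresponds to one set of loads, which is why an NS file naming several IDs can lead to more than one load being issued at once.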
      
      Contained in this revision are unrelated changes to make the OS and
      Image IDs per-project unique instead of globally unique, since that's a
      pain for the users. This turns out to be very messy, since underneath
      we do not want to pass around pid/ID in all the various places it's
      used. Rather, I create a globally unique name and extended the OS and
      Image tables to include pid/name/ID. The user selects pid/name, and I
      create the globally unique ID. For the most part this is invisible
      throughout the system, except where we interface with the user, say in
      the web pages; the user should see his chosen name where possible, and
      should invoke scripts (os_load, create_image, etc) using his/her
      name not the internal ID. Also, in the front end the NS file should
      use the user name not the ID. All in all, this accounted for a number
      of annoying changes and some special cases that are unavoidable.
  19. 07 Feb, 2002 1 commit
  20. 14 Jan, 2002 1 commit
    • Make Frisbee.Redux live: · d08b5e41
      Leigh B. Stoller authored
      * Add appropriate goo to os/GNUMakefile so that Frisbee daemon is
        built and installed.
      
      * Rework the frisbee launcher slightly. Aside from little changes
        (send email to tbops when frisbeed dies, new cmdline syntax to
        frisbeed), allow for frisbeed to exit gracefully after a period of
        inactivity (no client requests for 30 minutes, at present). In order
        to prevent a race condition with a new client being added (and
        rebooted) and frisbeed terminating before the client gets started,
        add a load_busy indicator to the images table (next to load_address
        slot) and set that to one each time frisbeelauncher is invoked.
        When frisbeed exits, test and clear that bit atomically (lock
        tables) and go around another time (restart frisbeed for another 30
        minute period).
      
      * Rework waitmode in os_load. Wait for all of the nodes to finish at
        once, and track which nodes never finish. Retry those nodes again by
        rebooting. The number of retries is configurable in the script, and
        is currently set to one. This should take care of some PXE boot
        related problems, although obviously not all.
      
      * Got rid of -w option to os_load and made waitmode the default. The
        -s option can be used to start a reload, but not to wait for it to
        complete.
      
      * Minor changes to sched_reload and reload_daemon; pass in -s option
        to os_load.
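
The load_busy handshake from the launcher item above can be sketched as an atomic test-and-clear. This is a Python illustration under stated assumptions: a `threading.Lock` stands in for the database table lock, and `ImageRow`, `serve_until_idle`, and `run_daemon_once` are hypothetical names, not part of frisbeelauncher.

```python
import threading


class ImageRow:
    """Stand-in for one row of the images table (hypothetical)."""

    def __init__(self):
        self._lock = threading.Lock()  # plays the role of "lock tables"
        self.load_busy = 0

    def mark_busy(self):
        """Called each time the launcher is invoked for this image."""
        with self._lock:
            self.load_busy = 1

    def test_and_clear(self):
        """Atomically read the flag and reset it to zero."""
        with self._lock:
            was_busy = self.load_busy
            self.load_busy = 0
            return bool(was_busy)


def serve_until_idle(row, run_daemon_once):
    """Restart the daemon while clients arrived during the last run."""
    runs = 0
    while True:
        run_daemon_once()  # returns after the inactivity timeout
        runs += 1
        if not row.test_and_clear():
            return runs  # nobody asked for the image; safe to stop
```

Because the read and the reset happen under one lock, a client that sets the flag just as the daemon exits is either seen by this check or by the next one, which is the race the commit message is guarding against.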
  21. 04 Dec, 2001 1 commit
  22. 27 Nov, 2001 1 commit
  23. 07 Nov, 2001 1 commit
  24. 06 Nov, 2001 1 commit
  25. 05 Nov, 2001 1 commit
  26. 23 Oct, 2001 1 commit
  27. 17 Oct, 2001 3 commits
  28. 30 Sep, 2001 1 commit
  29. 28 Sep, 2001 1 commit
  30. 26 Sep, 2001 1 commit
  31. 18 Sep, 2001 1 commit
  32. 17 Sep, 2001 1 commit
  33. 06 Sep, 2001 1 commit
    • Changes to nfree in how scheduled reloads are handled. Instead of · 2007505d
      Leigh B. Stoller authored
      firing off an os_load, just move the node from its current reservation
      to emulab-ops/reloadpending. This moves the operation out of band from
      the user's perspective (he gets a more immediate response when an
      experiment ends). Besides, we cannot handle mass reloads anyway, so
      this approach is unusable until Frisbee.
      
      Change the reload_daemon to look for free nodes that need a reload (as
      before) *and* nodes in emulab-ops/reloadpending (as put there by nfree).
      In this case, the imageid comes from the reloads table instead of the
      node-types table. I also updated the reload_daemon to use libdb routines.
      
      Also change testbed/reloading to emulab-ops/reloading. Maybe someday
      I'll remove these hardwired strings.
  34. 23 Aug, 2001 1 commit
    • Lots of small changes for turning our 'require lib*' lines into 'use lib*'... · e2ed8a1c
      Mac Newbold authored
      Lots of small changes for turning our 'require lib*' lines into 'use lib*' lines. Proper modules declare themselves as a package, and use Exporter to export the names of the subroutines that should be visible from the outside world. Many of ours didn't do that; each was just a file with a bunch of subs in it. So now I've fixed many of them to be proper, removed the requires and 'push(@INC,...)' hacks, and changed them to the proper 'use lib @prefix@/lib/;' and use lib*.
  35. 01 Aug, 2001 1 commit
  36. 21 Jul, 2001 1 commit
    • Many changes and updates for handling new types. The db now has types like... · 78b4e4f5
      Mac Newbold authored
      Many changes and updates for handling new types. The db now has types like 'pc600', 'pc850', and 'dnard', and each type has a class like 'pc' or 'shark'. This updates scripts that use types to use classes where appropriate, and to handle the new types where there were hardcoded things that couldn't be eliminated right now.
  37. 29 Jun, 2001 1 commit
  38. 10 May, 2001 1 commit
    • Lots of little changes for sending email to the right places, with · 3285bc3e
      Leigh B. Stoller authored
      proper headers. Split out some of the mail into testbed-logs,
      testbed-ops, and testbed-approval. Added a library for including from
      our perl scripts. Contains a couple of mail helper functions, but will
      hopefully contain more as time goes by.
      
      Fixed a bug in the web interface that was causing breakage for people
      with multiple accounts. Mac and Jay have noticed this, when logging
      out and trying to join or create a project under a new or different
      name.