1. 09 Feb, 2010 1 commit
  2. 08 Feb, 2010 1 commit
  3. 04 Feb, 2010 1 commit
  4. 03 Feb, 2010 3 commits
  5. 29 Jan, 2010 1 commit
  6. 28 Jan, 2010 1 commit
  7. 25 Jan, 2010 1 commit
  8. 22 Jan, 2010 1 commit
  9. 21 Jan, 2010 1 commit
  10. 15 Jan, 2010 1 commit
    • Fix waitmode==2 logic. · ee7974ab
      Mike Hibler authored
      We were getting stuck in an infinite loop if a node failed to come back
      up for longer than its time limit.
  11. 14 Jan, 2010 1 commit
  12. 13 Jan, 2010 1 commit
  13. 12 Jan, 2010 2 commits
  14. 11 Jan, 2010 1 commit
  15. 08 Jan, 2010 2 commits
  16. 07 Jan, 2010 1 commit
  17. 06 Jan, 2010 1 commit
  18. 05 Jan, 2010 2 commits
  19. 28 Dec, 2009 1 commit
  20. 23 Dec, 2009 1 commit
    • A couple of changes that attempt to cut short the waiting when · 28ac96a5
      Leigh Stoller authored
      a node has failed.
      * In the main wait loop, I check the eventstate for the node, for
        TBFAILED or PXEFAILED. Neither of these should happen after the
        reboot, so it makes sense to quit waiting if they do.
      * I added an event handler to libosload, specifically to watch for
        nodes entering RELOADSETUP or RELOADING, after the reboot. Because
        of the race with reboot, this was best done with a handler instead
        of polling the DB state like case #1 above. The idea is that a node
        should hit one of these two states within a fairly short time (I
        currently have it set to 5 minutes). If not, something is wrong and
        the loop bails on that node. What happens after is subject to the
        normal waiting times.
      I believe that these two tests will catch a lot of cases where osload
      is waiting on something that will never finish.
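      The two checks described above can be sketched roughly as follows. This
      is an illustrative Python sketch, not the libosload code (which is
      Perl); NodeWait, on_state_event and should_give_up are hypothetical
      names, and only the state names and the 5 minute limit come from the
      commit message.

```python
from dataclasses import dataclass

# State names and the 5 minute limit are taken from the commit message.
FAILED_STATES = {"TBFAILED", "PXEFAILED"}
RELOAD_STATES = {"RELOADSETUP", "RELOADING"}
RELOAD_ENTRY_TIMEOUT = 5 * 60  # seconds


@dataclass
class NodeWait:
    """Hypothetical per-node bookkeeping kept by the wait loop."""
    node_id: str
    rebooted_at: float              # time the reboot was issued
    saw_reload_state: bool = False  # set by the event handler


def on_state_event(waits, node_id, new_state):
    """Event handler: note when a waited-on node enters a reload state.

    A handler is used instead of polling so that a short-lived state
    is not missed because of the race with the reboot.
    """
    wait = waits.get(node_id)
    if wait is not None and new_state in RELOAD_STATES:
        wait.saw_reload_state = True


def should_give_up(wait, eventstate, now):
    """Return a reason string if the loop should bail on this node, else None."""
    # Check 1: neither failure state should appear after the reboot.
    if eventstate in FAILED_STATES:
        return "failed state " + eventstate
    # Check 2: the node never reached a reload state within the limit.
    if not wait.saw_reload_state and now - wait.rebooted_at > RELOAD_ENTRY_TIMEOUT:
        return "no reload state within timeout"
    return None
```

      A node that passes both checks stays in the loop and remains subject
      to the normal waiting times.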
  21. 22 Dec, 2009 4 commits
  22. 21 Dec, 2009 1 commit
    • New approach to dealing with nodes that fail to boot in os_setup, and · 5cf6aad2
      Leigh Stoller authored
      land in hwdown.
      Currently, if a node fails to boot in os_setup and the node is running
      a system image, it is moved into hwdown. 99% of the time this is
      wasted work; the node did not fail for hardware reasons, but for some
      other reason that is transient.
      The new approach is to move the node into another holding experiment,
      emulab-ops/hwcheckup. The daemon watches that experiment, and nodes
      that land in it are freshly reloaded with the default image and
      rebooted. If the node reboots okay after reload, it is released back
      into the free pool. If it fails any part of the reload/reboot, it is
      officially moved into hwdown.
      Another possible use; if you have a suspect node, you go wiggle some
      hardware, and instead of releasing it into the free pool, you move it
      into hwcheckup, to see if it reloads/reboots. If not, it lands in
      hwdown again. Then you break out the hammer.
      Most of the changes in Node.pm, libdb.pm, and os_setup are
      organizational changes to make the code cleaner.
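      The hwcheckup decision described above amounts to a small two-step
      check. The Python sketch below is illustrative only and is not the
      daemon's actual code; check_node and its two callables are
      hypothetical stand-ins for the real reload and reboot operations.

```python
from typing import Callable


def check_node(
    node_id: str,
    reload_default_image: Callable[[str], bool],
    reboot: Callable[[str], bool],
) -> str:
    """Decide where a node sitting in hwcheckup should go next.

    Both callables are assumptions made for this sketch; each returns
    True on success. The result is "free" (release to the free pool)
    or "hwdown" (officially move the node into hwdown).
    """
    # Freshly reload the node with the default image.
    if not reload_default_image(node_id):
        return "hwdown"
    # Reboot it; failing any part of reload/reboot sends it to hwdown.
    if not reboot(node_id):
        return "hwdown"
    return "free"
```

      The same path serves the manual case: a suspect node moved into
      hwcheckup by hand goes through the identical reload and reboot test.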
  23. 18 Dec, 2009 1 commit
    • Changes to support the SPP nodes. My approach was a little odd. · fd015646
      Leigh Stoller authored
      What I did was create node table entries for the three SPP nodes.
      These are designated as local, shared nodes, reserved to a holding
      experiment. This allowed me to use all of the existing shared node
      pool support, albeit with a couple of tweaks in libvtop that I will
      not bother to mention since they are hideous (another thing I need to
      The virtual nodes that are created on the spp nodes are figments; they
      will never be set up, booted or torn down. They exist simply as place
      holders in the DB, in order to hold the reserved bandwidth on the network
      interfaces. In other words, you can create as many of these imaginary
      spp nodes (in different slices if you like) as there are interfaces on
      the spp node. Or you can create a single spp imaginary node with all
      of the interfaces. You get the idea; it's the reserved bandwidth that
      drives the allocation.
      There are also some minor spp specific changes in vnode_setup.in to
      avoid trying to generalize things. I will return to this later as
      See this wiki page for info and sample rspecs:
  24. 17 Dec, 2009 1 commit
  25. 15 Dec, 2009 2 commits
  26. 14 Dec, 2009 1 commit
  27. 11 Dec, 2009 1 commit
  28. 03 Dec, 2009 1 commit
  29. 02 Dec, 2009 1 commit
  30. 13 Nov, 2009 1 commit
  31. 11 Nov, 2009 1 commit