    New approach to dealing with nodes that fail to boot in os_setup and land in hwdown. · 5cf6aad2
    Leigh B. Stoller authored
    
    Currently, if a node fails to boot in os_setup and the node is running
    a system image, it is moved into hwdown. 99% of the time this is
    wasted work; the node did not fail for hardware reasons, but for some
    other reason that is transient.
    
    The new approach is to move the node into another holding experiment,
    emulab-ops/hwcheckup. The daemon watches that experiment, and nodes
    that land in it are freshly reloaded with the default image and
    rebooted. If the node reboots okay after reload, it is released back
    into the free pool. If it fails any part of the reload/reboot, it is
    officially moved into hwdown.
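
The hwcheckup flow described above can be sketched as a small decision function. This is only an illustrative sketch in Python, not the actual daemon code: the names `check_node`, `sweep_hwcheckup`, and `Outcome`, and the `reload_default_image` and `reboot` callbacks, are hypothetical stand-ins for whatever the real implementation uses.

```python
from enum import Enum
from typing import Callable, Dict, Iterable


class Outcome(Enum):
    """Where a node ends up after its checkup (labels are illustrative)."""
    FREED = "free pool"
    HWDOWN = "hwdown"


def check_node(
    node_id: str,
    reload_default_image: Callable[[str], bool],
    reboot: Callable[[str], bool],
) -> Outcome:
    """Reload the default image, then reboot.

    Both callbacks are hypothetical and return True on success.
    A failure at either step sends the node to hwdown.
    """
    if not reload_default_image(node_id):
        return Outcome.HWDOWN
    if not reboot(node_id):
        return Outcome.HWDOWN
    return Outcome.FREED


def sweep_hwcheckup(
    nodes: Iterable[str],
    reload_default_image: Callable[[str], bool],
    reboot: Callable[[str], bool],
) -> Dict[str, Outcome]:
    """Run the checkup on every node currently held in hwcheckup."""
    return {n: check_node(n, reload_default_image, reboot) for n in nodes}
```

The point of the design is that hwdown becomes the destination only after a clean reload has also failed, so transient boot failures no longer take a node out of service.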
    
    Another possible use: if you have a suspect node and you go wiggle some
    hardware, then instead of releasing it into the free pool you can move it
    into hwcheckup to see if it reloads/reboots. If not, it lands in
    hwdown again. Then you break out the hammer.
    
    Most of the changes in Node.pm, libdb.pm, and os_setup are
    organizational changes to make the code cleaner.