New approach to dealing with nodes that fail to boot in os_setup and land in hwdown.

Currently, if a node fails to boot in os_setup and the node is running a system image, it is moved into hwdown. 99% of the time this is wasted work; the node did not fail because of hardware, but for some other, transient reason.

The new approach is to move the node into another holding experiment, emulab-ops/hwcheckup. The daemon watches that experiment; nodes that land in it are freshly reloaded with the default image and rebooted. If the node reboots okay after the reload, it is released back into the free pool. If it fails any part of the reload/reboot, it is officially moved into hwdown.

Another possible use: if you have a suspect node, you go wiggle some hardware, and instead of releasing it into the free pool, you move it into hwcheckup to see if it reloads/reboots. If not, it lands in hwdown again. Then you break out the hammer.

Most of the changes in Node.pm, libdb.pm, and os_setup are organizational changes to make the code cleaner.
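
The node flow described above can be summarized as a small decision sketch. This is illustrative Python, not the Perl code touched by this commit; the names `Pool`, `on_boot_failure`, and `checkup` are hypothetical, and the handling of nodes that are not running a system image is left open because the message does not describe it.

```python
from enum import Enum
from typing import Optional


class Pool(Enum):
    """Places a node can end up (labels taken from the commit message)."""
    FREE = "free pool"
    HWCHECKUP = "emulab-ops/hwcheckup"
    HWDOWN = "hwdown"


def on_boot_failure(running_system_image: bool,
                    new_approach: bool = True) -> Optional[Pool]:
    """Where a node goes after failing to boot in os_setup.

    Returns None for nodes not running a system image, since the
    commit message does not say what happens to them.
    """
    if not running_system_image:
        return None
    # Old behavior: straight to hwdown. New behavior: hold for a checkup.
    return Pool.HWCHECKUP if new_approach else Pool.HWDOWN


def checkup(reload_ok: bool, reboot_ok: bool) -> Pool:
    """Outcome for a node sitting in hwcheckup.

    The node is reloaded with the default image and rebooted; any
    failure along the way sends it to hwdown.
    """
    return Pool.FREE if (reload_ok and reboot_ok) else Pool.HWDOWN
```

Under these assumptions, a transient boot failure ends with the node back in the free pool, while a node that cannot reload or reboot still reaches hwdown, just one step later.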
Showing with 375 additions and 73 deletions