Rejiggered reload_daemon to enforce a max time.
There are now some sitevars to control its behavior; the one of interest here is reload/failtime.

The way the reload daemon is supposed to work now is that nodes are started on their reloading adventure with an os_load. If they are still in reloading after reload/retrytime minutes, they are either rebooted (if the os_load was successful) or os_load'ed again (if the first os_load failed outright). The logic in both cases is that some transient condition might have caused the failure. If we do have to perform this "retry", we send email to testbed-ops if reload/warnonretry is set. If a node is still in reloading after another reload/retrytime minutes, it is sent to hwdown, possibly powering it off or booting it into the admin MFS depending on the setting of reload/hwdownaction.

So really, reload/failtime should not be needed: all nodes should exit reloading within 2 * reload/retrytime minutes. But it is there as a backstop (and because I didn't understand the logic of the reload daemon at first!)

Well, it also comes into play if the reload daemon is restarted after being down for a long period of time. In that case, all nodes in reloading will get moved to hwdown. May need to reconsider this...
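
For reviewers, the intended per-node decision can be sketched roughly as below. This is only an illustrative Python model of the description above, not the reload_daemon code itself. The function name, the argument names, the action strings, the "poweroff" value for hwdownaction, and the use of >= at the time boundaries are all assumptions made for the example.

```python
def next_action(minutes_in_reloading, first_load_ok, minutes_since_retry,
                retrytime, failtime, warnonretry, hwdownaction):
    """Return the list of actions for one node (hypothetical model).

    minutes_since_retry is None if the node has not been retried yet.
    retrytime, failtime, warnonretry and hwdownaction stand in for the
    reload/* sitevars; the action strings are placeholders.
    """
    # Backstop: a node that has been in reloading for failtime minutes
    # goes to hwdown regardless of where it is in the retry cycle.
    if minutes_in_reloading >= failtime:
        return ["hwdown:" + hwdownaction]

    if minutes_since_retry is None:
        # First pass: give the initial os_load retrytime minutes.
        if minutes_in_reloading < retrytime:
            return ["wait"]
        # Retry: reboot if the os_load itself succeeded, else os_load again.
        actions = ["reboot" if first_load_ok else "os_load"]
        if warnonretry:
            actions.append("email testbed-ops")
        return actions

    # Already retried once: allow another retrytime minutes, then give up.
    if minutes_since_retry >= retrytime:
        return ["hwdown:" + hwdownaction]
    return ["wait"]
```

With retrytime=20 and failtime=60 in this model, a node is retried at 20 minutes and sent to hwdown 20 minutes after the retry, so the failtime branch is only reached when the daemon has not looked at the node for a long time, such as after being down.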
Showing with 355 additions and 113 deletions