    Rejiggered reload_daemon to enforce a max time. · b6d272a2
    Mike Hibler authored
    There are now some sitevars to control its behavior; the one of interest here
    is reload/failtime.
    
    The way the reload daemon is supposed to work now is that nodes will be
    started on their reloading adventure with an os_load. If they are still there
    after reload/retrytime minutes, then they will either be rebooted (if the
    os_load was successful) or os_load'ed again (if the first os_load failed
    outright). The logic for either of these is that there might have been some
    transient condition that caused the failure. If we do have to perform this
    "retry" then we will send email to testbed-ops if reload/warnonretry is set.
    If, after another reload/retrytime minutes, a node is still there, then the
    node will be sent to hwdown, possibly powering it off or booting it into the
    admin MFS depending on the setting of reload/hwdownaction.
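The retry logic described above can be sketched roughly as follows. This is an illustrative Python model only, not the daemon's actual code: the names `NodeState`, `Action`, and `next_action` are hypothetical, and it assumes the elapsed time is measured from the start of the most recent attempt.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class Action(Enum):
    WAIT = "wait"        # still within reload/retrytime, leave the node alone
    REBOOT = "reboot"    # os_load succeeded but the node never left reloading
    OS_LOAD = "os_load"  # first os_load failed outright, try it again
    HWDOWN = "hwdown"    # retry did not help, give up on the node


@dataclass
class NodeState:
    minutes_since_attempt: float  # assumed: time since the latest attempt began
    retried: bool                 # True once the single retry has been issued
    os_load_ok: bool              # result of the first os_load


def next_action(node: NodeState, retrytime: float,
                warnonretry: bool = False) -> Tuple[Action, bool]:
    """Return (action, send_email) for one node still in reloading."""
    if node.minutes_since_attempt < retrytime:
        return (Action.WAIT, False)
    if not node.retried:
        # One retry, on the theory that the failure was transient.
        action = Action.REBOOT if node.os_load_ok else Action.OS_LOAD
        return (action, warnonretry)
    # Still stuck a further retrytime minutes after the retry.
    return (Action.HWDOWN, False)
```

What happens at hwdown (power off versus admin MFS) is governed by reload/hwdownaction and is left out of the sketch.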
    
    So really, reload/failtime should not be needed. All nodes should exit
    reloading in 2 * reload/retrytime minutes. But it is there as a backstop
    (and because I didn't understand the logic of the reload daemon at first!)
    Well, it also comes into play if the reload daemon is restarted after being
    down for a long period of time. In this case, all nodes in reloading will
    get moved to hwdown. May need to reconsider this...
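To illustrate the restart case, here is a minimal sketch of a failtime backstop check. The function name `nodes_past_failtime` and the data layout are hypothetical, and the comparison is only an assumption about how such a check might work, with all times in minutes.

```python
from typing import Dict, List


def nodes_past_failtime(entered_at: Dict[str, float], now: float,
                        failtime: float) -> List[str]:
    """Return the nodes whose total time in reloading has reached failtime.

    entered_at maps a node name to the time (in minutes) at which it
    entered reloading; now is the current time on the same clock.
    """
    return sorted(
        node for node, start in entered_at.items()
        if now - start >= failtime
    )


# Two nodes entered reloading at t=0 and t=30, with failtime set to 60.
entered = {"pc1": 0.0, "pc2": 30.0}

# Daemon running normally: at t=45 neither node has hit the backstop.
running = nodes_past_failtime(entered, now=45.0, failtime=60.0)

# Daemon was down for hours: at t=700 every node is past failtime,
# so all of them would be moved to hwdown at once.
after_outage = nodes_past_failtime(entered, now=700.0, failtime=60.0)
```

In the second call both nodes are returned, which is the "all nodes in reloading get moved to hwdown" behavior noted above.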