Commit b6d272a2 authored by Mike Hibler

Rejiggered reload_daemon to enforce a max time.

There are now some sitevars to control its behavior; the one of interest here
is reload/failtime.

The way the reload daemon is supposed to work now is that nodes will be
started on their reloading adventure with an os_load. If they are still there
after reload/retrytime minutes, then they will either be rebooted (if the
os_load was successful) or os_load'ed again (if the first os_load failed
outright). The logic for either of these is that there might have been some
transient condition that caused the failure. If we do have to perform this
"retry" then we will send email to testbed-ops if reload/warnonretry is set.
If, after another reload/retrytime minutes, a node is still there, then the
node will be sent to hwdown, possibly powering it off or booting it into the
admin MFS depending on the setting of reload/hwdownaction.
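The retry sequence described above can be sketched roughly as follows. This is only an illustration in Python, not the daemon's actual code: NodeState, next_action, and the returned action strings are hypothetical names, and treating "exactly retrytime minutes" as expired is an assumption.

```python
from dataclasses import dataclass


@dataclass
class NodeState:
    """Hypothetical per-node bookkeeping for a node in reloading."""
    minutes_waiting: float  # minutes since the last os_load or retry
    retried: bool           # True once the single retry has been used
    os_load_ok: bool        # True if the initial os_load succeeded


def next_action(node: NodeState, retrytime: float) -> str:
    """Pick the next step for a node, given reload/retrytime in minutes."""
    if node.minutes_waiting < retrytime:
        return "wait"       # still within the allowed window
    if node.retried:
        return "hwdown"     # second window expired: give up on the node
    # First window expired: retry once, on the theory that the failure
    # was caused by a transient condition.
    return "reboot" if node.os_load_ok else "os_load"
```

In this sketch a node gets at most two windows of reload/retrytime minutes each. Whether email goes to testbed-ops on the retry (reload/warnonretry) and what hwdown does to the node (reload/hwdownaction) are left out.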

So really, reload/failtime should not be needed. All nodes should exit
reloading within 2 * reload/retrytime minutes. But it is there as a backstop
(and because I didn't understand the logic of the reload daemon at first!).
Well, it also comes into play if the reload daemon is restarted after being
down for a long period of time. In this case, all nodes in reloading will
get moved to hwdown. May need to reconsider this...
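
To see why a long daemon outage trips the backstop, here is a minimal sketch in Python. It assumes the check compares the time a node entered reloading against reload/failtime in minutes. The function name and the use of epoch seconds are hypothetical, not taken from the daemon.

```python
def past_failtime(entered_reloading: float, now: float,
                  failtime_minutes: float) -> bool:
    """Return True if a node has been in reloading longer than failtime.

    Both timestamps are in seconds (e.g. epoch time); failtime_minutes
    stands in for the reload/failtime sitevar.
    """
    return (now - entered_reloading) > failtime_minutes * 60


# A node entered reloading at t=0 and the daemon was then down for 3 hours.
# With a failtime of 60 minutes, the first check after restart already
# exceeds the limit, so the node would go straight to hwdown.
restart_check = past_failtime(0, 3 * 3600, 60)
```

Because the elapsed time keeps growing while the daemon is down, every node that was in reloading at the time of the outage would fail this check on restart, without ever getting its retry.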

parent 58db6edd