tbsetup/reload_daemon.in · lbs-greatness · emulab / emulab-devel

Rejiggered reload_daemon to enforce a max time. · b6d272a2

Mike Hibler authored Aug 10, 2016

There are now some sitevars to control its behavior. The one of interest here
is reload/failtime:

The way the reload daemon is supposed to work now is that nodes will be
started on their reloading adventure with an os_load. If they are still there
after reload/retrytime minutes, then they will either be rebooted (if the
os_load was successful) or os_load'ed again (if the first os_load failed
outright). The logic for either of these is that there might have been some
transient condition that caused the failure. If we do have to perform this
"retry" then we will send email to testbed-ops if reload/warnonretry is set.
If, after another reload/retrytime minutes, a node is still there, then the
node will be sent to hwdown, possibly powering it off or booting it into the
admin MFS depending on the setting of reload/hwdownaction.
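
As a rough illustration of the retry logic described above, here is a small
Python sketch. It is not the daemon's actual code: the function name, the
argument names, and the returned strings are all hypothetical, and it assumes
that each reload/retrytime window is measured from the most recent attempt.

```python
def next_action(minutes_since_last_attempt, retried, os_load_succeeded,
                retrytime, warnonretry):
    """Return (action, notify_retry) for one node that is still in reloading.

    Hypothetical helper for illustration only; none of these names come
    from reload_daemon itself.
    """
    if minutes_since_last_attempt < retrytime:
        # Still inside the current reload/retrytime window: leave it alone.
        return ("wait", False)
    if not retried:
        # First timeout: assume a transient failure and try once more.
        # Reboot if the os_load itself succeeded, otherwise os_load again.
        action = "reboot" if os_load_succeeded else "os_load"
        # notify_retry covers only the retry email to testbed-ops.
        return (action, bool(warnonretry))
    # Second timeout: give up. What happens to the node afterwards
    # (power off, admin MFS, ...) is governed by reload/hwdownaction.
    return ("hwdown", False)
```

For example, with a made-up retrytime of 20, next_action(25, False, False, 20, 1)
asks for another os_load and flags the retry email, while a node that has
already been retried and times out again is sent to hwdown.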

So really, reload/failtime should not be needed. All nodes should exit
reloading within 2 * reload/retrytime minutes. But it is there as a backstop
(and because I didn't understand the logic of the reload daemon at first!).
Well, it also comes into play if the reload daemon is restarted after being
down for a long period of time. In this case, all nodes in reloading will
get moved to hwdown. May need to reconsider this...
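
To make the backstop arithmetic concrete, here is a Python sketch under stated
assumptions: the function and node names are hypothetical, the retrytime and
failtime values are made up (they are not the real defaults), and it assumes
reload/failtime is compared against the minutes a node has spent in reloading.

```python
def backstop_victims(minutes_in_reloading, failtime):
    """Return the nodes that have been in reloading for failtime minutes or more.

    Hypothetical helper; minutes_in_reloading maps node name -> minutes.
    """
    return sorted(node for node, minutes in minutes_in_reloading.items()
                  if minutes >= failtime)


# Made-up values for illustration, not Emulab defaults.
retrytime, failtime = 20, 60

# Normal operation: a node is out of reloading within 2 * retrytime minutes,
# so with failtime above that bound the backstop never fires.
worst_case = 2 * retrytime
normal = backstop_victims({"pc1": 10, "pc2": worst_case}, failtime)

# Daemon restarted after being down for about three hours: every node that
# sat in reloading during the outage is already past failtime.
after_outage = backstop_victims({"pc1": 190, "pc2": 220}, failtime)
```

Here normal comes out empty, while after_outage contains both nodes, which is
the restart behavior flagged above as possibly needing another look.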
