Hook into DHCP to add intermediate stated states, to catch hung nodes.
This is in reference to this message from @hibler (see below). It turns out that Mike discovered a hook in dhcpd that lets us call an external program when a node DHCPs and we answer. This is currently used only on the moonshot cluster, but we can use it more generally to catch nodes that get stuck after we send them a pxewakeup or a reboot, using stated to notice the failed transition.
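A rough sketch of the idea, written in Python for brevity rather than in the Perl that stated uses. Everything named here is an assumption made for illustration: the `WAKEUP_SENT` and `DHCP_ANSWERED` intermediate states, the `NodeTracker` class, and the 60-second window are hypothetical and are not existing Emulab names or values. It only shows the shape of the check: the dhcpd hook records that we answered a node, and a periodic sweep flags nodes that were sent a pxewakeup or reboot but never showed up in DHCP.

```python
import time


class NodeTracker:
    """Tracks nodes we have poked (pxewakeup/reboot) until they DHCP."""

    def __init__(self, dhcp_window=60.0, clock=time.monotonic):
        self.dhcp_window = dhcp_window  # hypothetical: seconds allowed before first DHCP
        self.clock = clock              # injectable so the logic can be exercised in tests
        self.waiting = {}               # node -> time the wakeup/reboot was sent
        self.state = {}                 # node -> last recorded (hypothetical) state

    def sent_wakeup(self, node):
        """Call when a pxewakeup or reboot is sent to a node."""
        self.waiting[node] = self.clock()
        self.state[node] = "WAKEUP_SENT"

    def dhcp_answered(self, node):
        """Call from the dhcpd hook when we answer this node's request."""
        self.waiting.pop(node, None)
        self.state[node] = "DHCP_ANSWERED"

    def hung_nodes(self):
        """Return nodes that were poked but have not DHCP'd within the window."""
        now = self.clock()
        return sorted(node for node, sent in self.waiting.items()
                      if now - sent > self.dhcp_window)
```

In the real system the sweep would presumably be a stated timeout on the new intermediate state rather than a separate table, so that the existing timeout actions (notify or reboot) apply unchanged.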
This is something we have seen before, nodes going unresponsive in pxewait.
I don't recall if it has always been at Clemson. We have a stated timeout
for state BOOTING, that is supposed to respond with a REBOOT. But long ago,
Mac changed it to just notify rather than doing a reboot so it would not
interact badly with higher-level (i.e., tbswap) timeouts. I recently relaxed
that a bit:
    #
    # XXX Temporary! For now notify instead of
    # really rebooting, until the timeout/retry
    # stuff is gone from os_setup and os_load
    #
    # XXX "temporary" is going on 13 years now and the
    # lack of reboot does cause us grief. In particular,
    # the case of NORMALv2/BOOTING, we can wind up here
    # if a PXEWAKEUP at swap in is unsuccessful. We have
    # seen this when IPMI SOL issues have caused the
    # console and OS to hang up in the post-wakeup boot
    # process or if the PXEWAKEUP is lost. Since there is
    # only the overarching swapin timeout at this point,
    # and that is typically quite large, we'll risk a
    # bad timeout interaction.
    #
    if ("$mode/$state" eq "NORMALv2/BOOTING") {
        handleCommand($node,$TBREBOOT,$timedout,1);
    } else {
        notify("Node $node has timed out in state ".
               "$mode/$state - REBOOT requested\n");
    }
but in this case we were in "RELOAD/BOOTING" so it fell back to the old
"just notify" behavior.
I might relax this further, either for all cases or when $state eq BOOTING.
The timeout period is either 3 minutes (from the BOOTING state) or 3 minutes
(from the SHUTDOWN state) for all cases that trigger a REBOOT. For BOOTING,
this is essentially the time it takes the kernel to boot far enough to run
Emulab scripts (i.e., from first DHCP til we do a "tmcc state" of some sort).
For SHUTDOWN it is the time it takes to get through the BIOS (i.e., reboot
til first DHCP). May need to up these timeout periods as these seem pretty
marginal given today's BIOSes and leisurely kernel boots. Hmm...looking at
/usr/testbed/log/stated-mail.log where the notifications go, these do seem
to be pretty common. At least in the SHUTDOWN case. Must contemplate.
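The two relaxations floated in the message (reboot in all cases, or whenever $state eq BOOTING) can be compared side by side with the current behavior. This is a minimal Python sketch, not the Perl in stated; the `timeout_action` function and the policy names are hypothetical, and it only covers timeouts whose configured action is REBOOT.

```python
def timeout_action(mode, state, policy="current"):
    """Return "reboot" or "notify" for a node that timed out in mode/state.

    policy (hypothetical names):
      "current" - reboot only for NORMALv2/BOOTING, as in the quoted code
      "booting" - reboot whenever state is BOOTING, in any mode
      "all"     - reboot in every case
    """
    if policy == "all":
        return "reboot"
    if policy == "booting":
        return "reboot" if state == "BOOTING" else "notify"
    if policy == "current":
        return "reboot" if (mode, state) == ("NORMALv2", "BOOTING") else "notify"
    raise ValueError(f"unknown policy: {policy}")
```

Under the "current" policy a RELOAD/BOOTING timeout only notifies, which is the case described in the message; under "booting" it would reboot, while a RELOAD/SHUTDOWN timeout would still only notify.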